# CSCI 5525 Programming HW1

**Authors**: Meredith Bain
**consulted**: Anna Ton Nu

**Emails**: bain0074@umn.edu
**consulted**: tonnu016@umn.edu

**Submission.** Please insert your names and emails above, save your code in this notebook, and explain what you are doing along with your findings in text cells. You can think of it as a technical report with code. Before submission, please use `Kernel -> Restart & Run All` in the Jupyter menu to verify your code is runnable and save all outputs. Afterwards, you can either upload your raw notebook (`hw1.ipynb`) or an exported PDF version to the `Homework 1` assignment in Canvas.


In this homework, we will begin looking at diet data and formulate a diet prediction problem. Diet prediction is a historically important problem that helped shape optimization theory and machine learning. You can read about it here: [Dantzig Diet Optim Backstory](), and [Stigler Diet](https://en.m.wikipedia.org/wiki/Stigler_diet).

 The purpose of this homework is to:
  - Practice data manipulation techniques that you can use to preprocess raw data.
  - Prepare data for an ML objective.
  - Create a neural net model to solve several problems.

**Note**: You can use either local runtimes to complete this assignment, or a hosted runtime (with GPU) on Colab. The second option generally runs faster.  If using a runtime hosted on Colab, you can use the File Explorer pane on the left to upload the `epi_full.csv` file. Make sure to wait until the file finishes uploading before running the next code block.

## Prepare Data
The data is taken from Kaggle: [Epicurious Dataset](https://www.kaggle.com/datasets/hugodarwood/epirecipes), and there are several example uses, such as [Epicurious Dataset ML example](https://www.kaggle.com/code/mathchi/2-lr-nb-knn-rf-predict-data-epicurious).  Feel free to examine and import ideas from Kaggle uses. For convenience, the dataset is included in Canvas.  Download it from there, upload it to Colab, load the data, and understand the columns.  Each row is a meal, and the columns give: how good it tastes (Rating), basic nutrition info (Macros = Calories, Carb, Fat, Protein), whether the meal is part of a named list (the X.type columns), region information (e.g. Alabama or other place name), ingredient names (Artichoke, beef, etc.), meal timing (Breakfast, Brunch, Lunch, Dinner), and several more.

Your job is to:

- group all columns that are food ingredients into ingredient vectors
- create a list of all possible daily menus by forming a new list which takes all combinations of Breakfast, Lunch and Dinner (BLD) options.
- For the $i^{th}$ daily menu BLD combination, add the three calorie values, add the Macros, average the three ratings, and add the three ingredient vectors.  The summed ingredient vector $x_i$ is our predictor.
- The objective is to predict acceptable daily menus.  To be acceptable, we need to keep within the Daily Recommended Intake values for Cals and Macros, and serve meals in the top 25% of meal ratings. Look up your DRI values here: [DRI calculator](https://www.omnicalculator.com/health/dri). Note that the values have Upper and Lower values to the range.  Acceptable would be to stay in that range for all values.
- Use the criteria above (ratings in upper 25% and meal plan within upper and lower DRI ranges) to create a target score vector $y$ by assigning to the $i^{th}$ meal plan $y_i=0$ if unacceptable and $y_i=1$ if acceptable.  
- Learn to predict $y_i$ from $x_i$ using an MLP in Pytorch.  A good tutorial is here: [Dive into Deep Learning: MLP](https://d2l.ai/chapter_multilayer-perceptrons/mlp.html)
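
The steps above can be sketched on toy data as follows. This is a minimal sketch, not the assignment solution: the meal names, three-element ingredient vectors, calorie values, calorie range, and rating cutoff are all made-up placeholders. It also assumes that the rating criterion is applied to the averaged rating, and it checks only calories for brevity (the other Macros would be handled the same way).

```python
from itertools import product

# Hypothetical toy meals: (name, calories, rating, ingredient vector).
breakfasts = [("oatmeal", 400, 4.4, [1, 0, 0]), ("omelet", 700, 3.1, [0, 1, 0])]
lunches = [("salad", 600, 4.4, [0, 0, 1]), ("burger", 1200, 5.0, [0, 1, 0])]
dinners = [("stew", 1000, 4.4, [0, 1, 1])]

# Placeholder thresholds; real values come from the DRI calculator and the data.
CAL_MIN, CAL_MAX = 1800, 2400
RATING_CUTOFF = 4.375  # assumed 75th percentile of meal ratings

X, y = [], []
for combo in product(breakfasts, lunches, dinners):
    calories = sum(meal[1] for meal in combo)
    avg_rating = sum(meal[2] for meal in combo) / 3
    # Element-wise sum of the three ingredient vectors gives the predictor x_i.
    x_i = [sum(vals) for vals in zip(*(meal[3] for meal in combo))]
    acceptable = CAL_MIN <= calories <= CAL_MAX and avg_rating >= RATING_CUTOFF
    X.append(x_i)
    y.append(int(acceptable))

print(X)
print(y)
```

Only the first combination (oatmeal, salad, stew) lands inside the calorie range with a high enough average rating, so it is the only plan labeled 1.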

```python
import pandas as pd
data = pd.read_csv('epi_full.csv')

```


Notes from class

1. table join is needed to join in what nutrients these foods have, on average
2. you can look up missing nutritional info externally if you want
    a. looks like carbs are missing for all entries
3. you may want to use a sparse tensor in pytorch to make stuff run faster
4. 2 layers: test the output, then map
5. Bayes optimal classification
    $a = \log \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)}$ (log-odds)
    class posterior can be written as $P(C_1 \mid x) = \frac{1}{1 + \exp(-a)} = \sigma(a)$

    softmax is the $K$-class generalization of the sigmoid function
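
The log-odds and sigmoid relationship in note 5 can be checked numerically. The sketch below uses made-up likelihoods and priors (they are not taken from the diet data) and compares the posterior from Bayes' rule with the sigmoid of the log-odds and with a two-class softmax.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up class-conditional likelihoods and priors for a single observation x.
lik_c1, prior_c1 = 0.30, 0.20
lik_c2, prior_c2 = 0.05, 0.80

log_odds = math.log((lik_c1 * prior_c1) / (lik_c2 * prior_c2))
posterior_bayes = (lik_c1 * prior_c1) / (lik_c1 * prior_c1 + lik_c2 * prior_c2)
posterior_sigmoid = sigmoid(log_odds)

# A two-class softmax over the scores [a, 0] reduces to sigmoid(a).
posterior_softmax = softmax([log_odds, 0.0])[0]
print(posterior_bayes, posterior_sigmoid, posterior_softmax)
```

All three values agree up to floating-point error, which is why a single-logit output with a sigmoid is enough for the two-class acceptable/unacceptable problem.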

In [None]:
import pandas as pd
import numpy as np

"""
Below is data exploration scratch work to determine
what needed to be done to clean the data. This is where 
column names were exported to determine which were 
ingredients in need of carb lookups & macro aggregations 
later.
"""

data = pd.read_csv('epi_full.csv')
print(len(data))

# print(data["X3.ingredient.recipes"])

# data["alaska"].to_csv('alaska.csv')
# data["almond"].to_csv('almond.csv')
# data["X3.ingredient.recipes"].to_csv('X3.csv')

# data.iloc[2:14].to_csv('sample_rows.csv')


# with open('columns.csv', 'w') as f:
#     for col in data.columns:
#         print(col)
#         f.write(col + '\n')

# print(data[data["drink"] == 1]["title"])
# print(data[data["drink"] == 1]["drinks"])


# print(data[data["drinks"] == 1]["title"])
# print(data[data["drinks"] == 1]["drink"])
# print(data[data["drinks"] == 1]["title","drinks"])


### Explanation of work

To begin with, the column names from the epi_full dataset were exported to a separate CSV, where each row represented a header name from the original file. Ingredient columns were manually tagged with a 1, so that a filter could be made for the columns that need carb information looked up externally.

The Kaggle dataset https://www.kaggle.com/datasets/utsavdey1410/food-nutrition-dataset?resource=download was used for carb information after being combined into a single CSV nutrition-facts-lookup.csv. Google Sheets was used to fuzzy match with a non-exact VLOOKUP, which was manually corrected where necessary. Where there was no match in the external table, the closest substitute available was used (for example, ground chicken for poultry sausage). 
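
The matching itself was done in Google Sheets, but a similar first pass could be sketched in Python with the standard library's `difflib`. This is an illustration under assumptions, not the method used here: the ingredient and food names are hypothetical, `suggest_match` is a helper made up for this example, and the 0.6 cutoff is arbitrary. Names with no close match would still need a manually chosen substitute.

```python
from difflib import get_close_matches

# Hypothetical names; the real lists would come from the two CSV files.
ingredient_columns = ["almond", "artichoke", "poultry sausage", "bok choy"]
nutrition_foods = ["almonds", "artichokes", "ground chicken", "beef"]

def suggest_match(name, candidates, cutoff=0.6):
    """Return the closest candidate by string similarity, or None if none is close."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

crosswalk = {name: suggest_match(name, nutrition_foods) for name in ingredient_columns}

# Ingredients without a close string match need a manual substitute,
# for example ground chicken for poultry sausage.
unmatched = [name for name, match in crosswalk.items() if match is None]
print(crosswalk)
print(unmatched)
```

String similarity only finds near-identical spellings, so semantic substitutes such as ground chicken for poultry sausage still have to be picked by hand.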

Once the lookup information was complete, ingredients for each dish were collected in the epi_full dataset and macros and calories for each dish were calculated. 

Then, itertools was used to create all breakfast, lunch, and dinner combos and to calculate the daily macros, calories, rating, and label for each one, stopping at ~2.1M combinations for the sake of time. This was the data that was fed into PyTorch (BLD.tsv).
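
Most of the running time comes from looking up each meal in the DataFrame once per combination. One possible alternative, sketched below under assumptions, is to pull each nutrient into a NumPy array per meal slot and let broadcasting build the whole grid of daily totals at once. The arrays and thresholds are made-up placeholders, only calories and rating are shown, and the rating cutoff is applied to the averaged rating for simplicity.

```python
import numpy as np

# Hypothetical per-meal values for each meal slot.
b_cal = np.array([400.0, 700.0])
l_cal = np.array([600.0, 1200.0])
d_cal = np.array([1000.0])
b_rat = np.array([4.4, 3.1])
l_rat = np.array([4.4, 5.0])
d_rat = np.array([4.4])

# Broadcasting builds an (n_breakfast, n_lunch, n_dinner) grid in one step;
# flattening it in C order matches itertools.product(breakfasts, lunches, dinners).
total_cal = b_cal[:, None, None] + l_cal[None, :, None] + d_cal[None, None, :]
avg_rating = (b_rat[:, None, None] + l_rat[None, :, None] + d_rat[None, None, :]) / 3

# Placeholder thresholds standing in for the DRI range and rating cutoff.
CAL_MIN, CAL_MAX, RATING_CUTOFF = 1800, 2400, 4.375
y = ((total_cal >= CAL_MIN) & (total_cal <= CAL_MAX)
     & (avg_rating >= RATING_CUTOFF)).astype(int)

print(total_cal.ravel())
print(y.ravel())
```

For the full set of combinations each grid holds one value per combination, so memory use grows with the number of nutrients tracked; processing one breakfast slice at a time is a possible way to bound it.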

The model's predictions do not seem to work well. Using all of the 122M+ combinations might have improved them: only a small proportion of the combinations meet the criteria, so the ~2.1M-row sample may not contain enough positive examples (1's). A different loss function might also have yielded more accurate results.
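
One common way to handle a small share of positive labels is to weight them more heavily in the loss. The sketch below writes out a weighted binary cross-entropy on logits in plain Python to show the effect; the labels and logits are made up, and `weighted_bce_with_logits` is a helper written for this illustration. PyTorch's `nn.BCEWithLogitsLoss` accepts a `pos_weight` argument that applies the same kind of weighting, but this approach was not tried in this notebook.

```python
import math

def weighted_bce_with_logits(logits, labels, pos_weight=1.0):
    """Mean binary cross-entropy on raw logits, with positives scaled by pos_weight."""
    total = 0.0
    for z, t in zip(logits, labels):
        # log(sigmoid(z)) and log(1 - sigmoid(z)) in a numerically stable form.
        if z >= 0:
            log_p = -math.log1p(math.exp(-z))
            log_not_p = -z - math.log1p(math.exp(-z))
        else:
            log_p = z - math.log1p(math.exp(z))
            log_not_p = -math.log1p(math.exp(z))
        total += -(pos_weight * t * log_p + (1 - t) * log_not_p)
    return total / len(labels)

# Made-up imbalanced labels: one acceptable plan out of ten.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
logits = [-2.0] * 10  # a model that always leans toward "unacceptable"

# A common heuristic: weight positives by the negative-to-positive ratio.
pos_weight = labels.count(0) / labels.count(1)
plain = weighted_bce_with_logits(logits, labels)
weighted = weighted_bce_with_logits(logits, labels, pos_weight)
print(pos_weight, plain, weighted)
```

With the weighting, a model that always predicts "unacceptable" is penalized much more for the missed positive, which discourages the trivial all-zeros solution.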

In [None]:
# YOUR SOLUTION HERE
import pandas as pd
import numpy as np

#filter dataframe to ingredients
cols = pd.read_csv('columns.csv')
# print(cols)
ingredient_filter = ['title']
ingredient_filter += cols[cols['ingredient']==1]["Column Name"].to_list()

print(ingredient_filter)
all_ingredients = data.loc[:, data.columns.isin(ingredient_filter)]
all_ingredients.to_csv('ingredients.csv')
# print(ingredients)

#make df of nutrition facts to look up carbs below
nutrition_facts = pd.read_csv("combined-nutrition.csv", index_col=False)
nutrition_facts.index.name = None
print(nutrition_facts)

# print(data[data["title"] == "Beef Tenderloin with Garlic and Brandy "])






In [None]:
"""
this cell handles looking up the missing carb information
and appending it to the original dataset
"""

#get the ingredients for a dish
def get_ingredients(row):
    ingredients = []
    for ingredient in ingredient_filter:
        # print(ingredient)
        # print(row[ingredient].iloc[-1])
        if row[ingredient].iloc[-1] == 1:
            ingredients.append(ingredient)
    return ingredients

def create_ingredients_lists(data):
    ingredients_lookup = {}
    for title in data["title"]:
        print('title:',title)
        row = data[data["title"] == title]
        print(row)
        ingredients = get_ingredients(row)
        ingredients_lookup[title] = ingredients
    return ingredients_lookup

#lookup carb values of foods
def get_carbs(title_of_food, ingredients_lookup):
    carb_crosswalk = pd.read_csv('nutrition-facts-lookup.csv')
    carbs = 0
    for ingredient in ingredients_lookup[title_of_food]:
        # print(ingredient, carb_crosswalk[carb_crosswalk["title"] == ingredient]["manual match"].iloc[-1])
        lookup_value = carb_crosswalk[carb_crosswalk["title"] == ingredient]["manual match"].iloc[-1]
        carb = nutrition_facts[nutrition_facts["food"] == lookup_value]["Carbohydrates"].iloc[-1]
        # print(carb)
        # print(type(carb))
        carbs += float(carb)
    # print(carbs)   
    return carbs

# create a new column in the original dataset with the carb values. 
# all_ingredients = all_ingredients[:1]
all_ingredients.index.name = None
print(all_ingredients)

ingredients_lookup = create_ingredients_lists(all_ingredients)
print(ingredients_lookup)
carb_list = []
for title in all_ingredients["title"]:
    print(title)
    try:
        carbs = get_carbs(title, ingredients_lookup)
        carb_list.append(carbs)
    except Exception:
        # no usable carb lookup for this dish; fall back to 0 carbs
        carb_list.append(0)

# print(carb_list)

# add in the carbs
data["carbs"] = carb_list
data.to_csv('carbs_added.csv')

data = pd.read_csv('carbs_added.csv')


In [None]:
"""
This cell sets the thresholds for determining whether a record
fits the criteria (1) or not (0)
"""

#calculate the cutoff for the top 25% of ratings (the 75th percentile);
#the name q1 is kept because later cells reference it
q1 = data["rating"].quantile(0.75)
print(q1)

carbs_q2 = data["carbs"].quantile(0.5)
carbs_q3 = data["carbs"].quantile(0.75)
carbs_q4 = data["carbs"].quantile(0.99)


protein_q2 = data["protein"].quantile(0.5)
protein_q3 = data["protein"].quantile(0.75)
protein_q4 = data["protein"].quantile(0.99)


print(f"middle carb range: {carbs_q2*3}-{carbs_q3*3}; {carbs_q4*3}")
print(f"middle protein range: {protein_q2*3}-{protein_q3*3}; {protein_q4*3}")


#set macro boundaries
calory_goal = 2285.2
calory_min = 0.9 * calory_goal
calory_max = 1.1 * calory_goal
carb_min = 215
carb_max = 371
fat_min = 51
fat_max = 89
protein_min = 40
protein_max = 141



In [None]:
"""
This cell creates the meal combinations, aggregates the nutrition information,
and evaluates whether each record meets the criteria
"""

#make the BLD combos
import itertools

breakfasts = data[data["breakfast"] == 1]["title"]
lunches = data[data["lunch"] == 1]["title"]
dinners = data[data["dinner"] == 1]["title"]

print("Creating triplets")
print(len(breakfasts), len(lunches), len(dinners))
triplets = list(itertools.product(breakfasts, lunches, dinners))
print(f"{len(triplets)} triplets created")

#look up and aggregate calories, protein, fats, carbs, & ratings for a triplet of recipes
def get_macros(triplet):
    calories = carbs = fat = protein = rating = 0
    top_25 = True
    for title in triplet:
        # print(title)
        row = data[data["title"] == title]
        # print("row", row)
        calories += row["calories"].iloc[0]
        # print("calories:", calories)
        carbs += row["carbs"].iloc[-1]
        # print("carbs:", carbs)
        fat += row["fat"].iloc[-1]
        # print("fat:", fat)
        protein += row["protein"].iloc[-1]
        # print("protein:", protein)
        rating += row["rating"].iloc[-1]

        if row["rating"].iloc[-1] < q1:
            top_25 = False
        # print("rating:", rating)
        # print()
    
    rating = rating / 3

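    #acceptable only if every meal is top-rated and calories and all macros stay within their min/max ranges
    if (not top_25) or (calories < calory_min) or (calories > calory_max) or (protein < protein_min) or (protein > protein_max) or (carbs < carb_min) or (carbs > carb_max) or (fat < fat_min) or (fat > fat_max):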
        y = 0
    else: 
        y = 1

    return calories, carbs, fat, protein, rating, y

i = 0


#I stopped this around 2.1M rows, since running the whole 122M+ rows would have taken 43 hours.
with open("BLD.tsv", "w") as f:
    f.write("combo\tcalories\tcarbs\tfat\tprotein\trating\ty\n")
    for combo in triplets:
        # print('combo', combo)
        if i % 100000 == 0:
            print(i)
            if i == 2_100_000: #stop at ~2.1M combos; running the full 122M+ combos will take too long
                break
        i += 1
        calories, carbs, fat, protein, rating, y = get_macros(combo)
        # if y == 1:
        #     print("MEETS CRITERIA")
        f.write(f"{combo}\t{calories}\t{carbs}\t{fat}\t{protein}\t{rating}\t{y}\n")


In [None]:
""" 
this cell creates and trains the model using Pytorch
"""


import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd

class SimpleNet(nn.Module):
    def __init__(self, input_size):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(input_size, 10)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(10, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

BLD = pd.read_csv('BLD.tsv', sep='\t')

# turn data into X matrices and y labels
X = BLD.iloc[:, 1:-1].values
# print(X[1])
y = BLD.iloc[:, -1].values   
# print(y[1])

X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)

input_size = X.shape[1]
print("input size:", input_size)
model = SimpleNet(input_size)
optimizer = optim.Adam(model.parameters(), lr=0.001)


dataset = TensorDataset(X_tensor, y_tensor)
dataloader = DataLoader(dataset, batch_size=512, shuffle=True)

print("training model...")
epochs = 100
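criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy on raw logits
for epoch in range(epochs):
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs).squeeze(1)  # (batch, 1) -> (batch,) to match labels
        loss = criterion(outputs, labels)
        loss.backward()  # without this step the weights never change
        optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

print('Training finished!')

#predict: inputs are [calories, carbs, fat, protein, rating]
test_input = torch.tensor([[1000, 300, 60, 60, 5]], dtype = torch.float32)
with torch.no_grad():
    output = model(test_input)
# the model has a single logit, so threshold the sigmoid instead of taking an argmax
predicted = int(torch.sigmoid(output).item() >= 0.5)
print(f"prediction: {predicted}")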

