# Building a Multi-Output Model to Predict Macronutrient Profile

In this notebook we improve on the current model, which predicts only calories, by predicting all three macronutrients instead. Here is how it will work:

- The user will input a recipe name.
- Then a list of suggested ingredients will appear, which the user can optionally edit by adding or removing ingredients or by changing quantities.
    - For now, the suggestions come from computing the cosine similarity between the input and the recipe names in the data, then taking the ingredients from the top 5 or so results.
    - Quantities may be ignored if the model can't handle numeric inputs, though an LLM could use them if we go that route.
    - Being able to edit this could be a paid feature, i.e. changing quantities or adding/removing ingredients.
- The model will then take as input the recipe name concatenated with the ingredients list, perhaps with quantities as well if using an LLM.
    - Consider adding dietary type/preference here as well.
- Finally, a multi-output regression model will be used to predict the macronutrients, and we can derive ratios by dividing each macro by the total calorie count. The aim is that, during training, the model learns parameters that minimize the error between the predicted and ground-truth ratios derived from the normalized macronutrient values.
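
The ingredient suggestion step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation: it assumes a plain bag-of-words representation of recipe names (a real pipeline might use TF-IDF or embeddings), and the names `bag_of_words`, `suggest_ingredients`, and `toy_recipes` are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercased, whitespace-tokenized word counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def suggest_ingredients(query, recipes, top_k=5):
    """Collect ingredients from the top_k recipes whose names best match the query.

    recipes is a list of (recipe_name, ingredient_list) pairs.
    """
    query_vec = bag_of_words(query)
    ranked = sorted(
        recipes,
        key=lambda pair: cosine_similarity(query_vec, bag_of_words(pair[0])),
        reverse=True,
    )
    suggestions = []
    for _, ingredients in ranked[:top_k]:
        for ingredient in ingredients:
            if ingredient not in suggestions:  # keep first-seen order, no duplicates
                suggestions.append(ingredient)
    return suggestions

# Hypothetical toy data standing in for the recipe dataset
toy_recipes = [
    ("Garlic Green Beans", ["green beans", "garlic", "olive oil"]),
    ("Green Beans with Butter", ["green beans", "butter", "salt"]),
    ("Tuna Salad", ["tuna", "mayonnaise", "celery"]),
]
suggested = suggest_ingredients("green beans", toy_recipes, top_k=2)
print(suggested)
```

Merging the ingredients of several close matches, rather than copying the single best match, gives the user a broader list to prune.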

In [1]:
#keeping all imports at the top
import pandas as pd
import ast

for module_name in ['pandas',]:
    module = __import__(module_name)
    print(f"{module_name}: {module.__version__}")

pandas: 2.2.0


## EDA and Preprocessing

In [2]:
df = pd.read_csv('../recipes.csv')
df.columns

Index(['uri', 'label', 'image', 'source', 'url', 'shareAs', 'yield',
       'dietLabels', 'healthLabels', 'cautions', 'ingredientLines',
       'ingredients', 'calories', 'totalWeight', 'totalTime', 'cuisineType',
       'mealType', 'dishType', 'totalNutrients', 'totalDaily', 'digest',
       'tags'],
      dtype='object')

In [17]:
df['ingredientLines']

0        ['1 pound green beans, trimmed', '1 tablespoon...
1        ['1 1/2 lb green beans, stem ends trimmed', '1...
2        ['1 stick (8 tbsp.) unsalted, cultured butter'...
3        ['2 teaspoons walnut oil', '1 pound green bean...
4        ['1 pound green beans, trimmed', '2 teaspoons ...
                               ...                        
13267    ['* 2tablespoons olive oil', '* 1 large red be...
13268    ['Two 6-ounce cans white meat tuna packed in w...
13269    ['16 ounces low-sodium chunk light tuna, drain...
13270    ['1 can (3 ounces) tuna, drained', '1 tablespo...
13271    ['1 can (3 ounces) chunk light tuna in water, ...
Name: ingredientLines, Length: 13272, dtype: object

Note that the `ingredientLines` column would be all we need if we simply concatenated the ingredients with the recipe name. Since we want to be able to adjust the ingredients and quantities for the input, we instead need to extract this information from the `ingredients` column manually.

In [3]:
ingredients_col = df['ingredients'].apply(ast.literal_eval)

In [4]:
ingredients_col[0]

[{'text': '1 pound green beans, trimmed',
  'quantity': 1.0,
  'measure': 'pound',
  'food': 'green beans',
  'weight': 453.59237,
  'foodCategory': 'vegetables',
  'foodId': 'food_aceucvpau4a8v6atkx5eabxyoqdn',
  'image': 'https://www.edamam.com/food-img/891/89135f10639878a2360e6a33c9af3d91.jpg'},
 {'text': '1 tablespoon butter, (optional)',
  'quantity': 1.0,
  'measure': 'tablespoon',
  'food': 'butter',
  'weight': 14.2,
  'foodCategory': 'Dairy',
  'foodId': 'food_awz3iefajbk1fwahq9logahmgltj',
  'image': 'https://www.edamam.com/food-img/713/71397239b670d88c04faa8d05035cab4.jpg'},
 {'text': 'Coarse salt and ground pepper',
  'quantity': 0.0,
  'measure': None,
  'food': 'Coarse salt',
  'weight': 2.80675422,
  'foodCategory': 'Condiments and sauces',
  'foodId': 'food_a1vgrj1bs8rd1majvmd9ubz8ttkg',
  'image': 'https://www.edamam.com/food-img/694/6943ea510918c6025795e8dc6e6eaaeb.jpg'},
 {'text': 'Coarse salt and ground pepper',
  'quantity': 0.0,
  'measure': None,
  'food': 'groun

In [5]:
def get_ingredient_aspect(row, aspect):
    # row is a list of ingredient dicts; collect one field (e.g. 'food') from each
    return [ingredient[aspect] for ingredient in row]

food_ingredients = ingredients_col.apply(lambda row: get_ingredient_aspect(row, 'food'))
quantity_ingredients = ingredients_col.apply(lambda row: get_ingredient_aspect(row, 'quantity'))
measure_ingredients = ingredients_col.apply(lambda row: get_ingredient_aspect(row, 'measure')) # will need this to understand quantity

The `healthLabels` column holds the dietary type/preference, such as vegan, pescatarian, etc. It is a multilabel column, but the user can only select one option for now, so we keep just a few: ['Mediterranean', 'Vegetarian', 'Vegan', 'Red-Meat-Free', 'Paleo', 'Pescatarian']. To reduce `healthLabels` from multilabel to categorical we need to define a priority order. If none of these labels is present, we treat the dish as balanced, so we add 'Balanced' as an option. In the future, some analysis on this column should be done to improve the priority order, rather than relying on domain knowledge. Alternatively, an option to select multiple labels could be implemented instead.

In [6]:
health_labels = df['healthLabels'].apply(ast.literal_eval)

Let's take a look at all unique values of health labels. If we had more data it might be worth it to just keep all of these health labels. 

In [7]:
unique_health_labels = []
for lst in health_labels:
    for health in lst:
        unique_health_labels.append(health)

print(set(unique_health_labels))


{'Mustard-Free', 'Low Potassium', 'DASH', 'Red-Meat-Free', 'Mollusk-Free', 'Kosher', 'Immuno-Supportive', 'Vegetarian', 'Shellfish-Free', 'Gluten-Free', 'Soy-Free', 'Sulfite-Free', 'Sugar-Conscious', 'Pork-Free', 'Kidney-Friendly', 'Low Sugar', 'Tree-Nut-Free', 'Wheat-Free', 'Celery-Free', 'Lupine-Free', 'Alcohol-Cocktail', 'Paleo', 'Mediterranean', 'Crustacean-Free', 'Dairy-Free', 'Alcohol-Free', 'Vegan', 'No oil added', 'Fish-Free', 'Peanut-Free', 'Keto-Friendly', 'Egg-Free', 'Sesame-Free', 'FODMAP-Free', 'Pescatarian'}


In [8]:
priority_order = ['Vegan', 'Vegetarian', 'Pescatarian', 'Paleo', 'Red-Meat-Free', 'Mediterranean']

In [9]:
def replace_with_priority(labels):
    for label in priority_order:
        if label in labels:
            return label
    return 'Balanced'  # Handle case where no label matches priority_order, in which case the diet is balanced

# Apply function to the multilabels series
priority_health_labels = health_labels.apply(replace_with_priority)
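
To make the reduction concrete, here is a small self-contained sketch of how the priority order resolves a few label lists, along with a simple frequency count that could be a starting point for the analysis mentioned above. The `sample_health_labels` data is hypothetical; only `priority_order` and `replace_with_priority` come from this notebook.

```python
from collections import Counter

priority_order = ['Vegan', 'Vegetarian', 'Pescatarian', 'Paleo', 'Red-Meat-Free', 'Mediterranean']

def replace_with_priority(labels):
    # Return the first label in priority_order that the recipe carries
    for label in priority_order:
        if label in labels:
            return label
    return 'Balanced'

# Hypothetical label lists, in the same shape as the parsed healthLabels column
sample_health_labels = [
    ['Vegetarian', 'Vegan', 'Gluten-Free'],
    ['Pescatarian', 'Mediterranean', 'Egg-Free'],
    ['Peanut-Free', 'Soy-Free'],
]

resolved = [replace_with_priority(labels) for labels in sample_health_labels]
print(resolved)

# How often each candidate label occurs before the reduction
label_counts = Counter(
    label
    for labels in sample_health_labels
    for label in labels
    if label in priority_order
)
print(label_counts)
```

Comparing these raw counts with the counts after the reduction would show how many recipes each lower-priority label loses to a higher-priority one.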

### Target Variable

Now let's get the macros; calories can then be calculated from them.

In [10]:
df.columns

Index(['uri', 'label', 'image', 'source', 'url', 'shareAs', 'yield',
       'dietLabels', 'healthLabels', 'cautions', 'ingredientLines',
       'ingredients', 'calories', 'totalWeight', 'totalTime', 'cuisineType',
       'mealType', 'dishType', 'totalNutrients', 'totalDaily', 'digest',
       'tags'],
      dtype='object')

In [11]:
nutrients = df['totalNutrients'].apply(ast.literal_eval)

In [12]:
nutrients[0].keys()

dict_keys(['ENERC_KCAL', 'FAT', 'FASAT', 'FATRN', 'FAMS', 'FAPU', 'CHOCDF', 'CHOCDF.net', 'FIBTG', 'SUGAR', 'PROCNT', 'CHOLE', 'NA', 'CA', 'MG', 'K', 'FE', 'ZN', 'P', 'VITA_RAE', 'VITC', 'THIA', 'RIBF', 'NIA', 'VITB6A', 'FOLDFE', 'FOLFD', 'FOLAC', 'VITB12', 'VITD', 'TOCPHA', 'VITK1', 'WATER'])

In [13]:
for nutrient in nutrients[1].keys():
    print(nutrients[1][nutrient])

    # we only want a closer look at the macros, not the micronutrients
    if nutrients[1][nutrient]['label'] == 'Cholesterol':
        break

{'label': 'Energy', 'quantity': 331.96545205, 'unit': 'kcal'}
{'label': 'Fat', 'quantity': 15.008954821, 'unit': 'g'}
{'label': 'Saturated', 'quantity': 2.2059442775, 'unit': 'g'}
{'label': 'Trans', 'quantity': 0.0, 'unit': 'g'}
{'label': 'Monounsaturated', 'quantity': 9.9235888555, 'unit': 'g'}
{'label': 'Polyunsaturated', 'quantity': 2.19255406715, 'unit': 'g'}
{'label': 'Carbs', 'quantity': 47.8064322835, 'unit': 'g'}
{'label': 'Carbohydrates (net)', 'quantity': 29.2874412985, 'unit': 'g'}
{'label': 'Fiber', 'quantity': 18.518990985000002, 'unit': 'g'}
{'label': 'Sugars', 'quantity': 22.359966893000003, 'unit': 'g'}
{'label': 'Protein', 'quantity': 12.551760556500001, 'unit': 'g'}
{'label': 'Cholesterol', 'quantity': 0.0, 'unit': 'mg'}


Calories = fat x 9 + protein x 4 + net carbs x 4 + fiber x 2 (all quantities in grams)

In [14]:
# the fat subtypes (saturated + mono + poly) don't add up to the total fat for some reason
fat = 2.2059442775 + 9.9235888555 + 2.19255406715
# total fat x 9 + net carbs x 4 + protein x 4 + fiber x 2
(15.008954821*9 + 29.2874412985*4 + 12.551760556500001*4) + 18.518990985000002*2

339.47538277900003

For some reason, the calories computed this way are not an exact match with the recorded calories in the calories column, which is the same as the Energy label here (339.48 vs. 331.97 kcal for this recipe). We will just use the calculation for now instead.
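
The calorie calculation and the macro ratios can be sketched as a pair of small helpers. This is an illustration under assumptions: the factors are the ones from the formula used in this notebook, and the "ratio" is interpreted here as each macro's share of the computed calories (so the shares sum to 1), which is one possible reading of dividing each macro by the total calorie count. The names `KCAL_PER_GRAM`, `estimate_calories`, and `macro_calorie_shares` are hypothetical.

```python
# Calorie factors per gram, taken from the formula used in this notebook
KCAL_PER_GRAM = {'fat': 9, 'protein': 4, 'carbs': 4, 'fiber': 2}

def estimate_calories(macros):
    """Total kcal from a dict of macro grams keyed like KCAL_PER_GRAM."""
    return sum(KCAL_PER_GRAM[name] * grams for name, grams in macros.items())

def macro_calorie_shares(macros):
    """Fraction of the computed calories contributed by each macro."""
    total = estimate_calories(macros)
    if total == 0:
        return {name: 0.0 for name in macros}
    return {name: KCAL_PER_GRAM[name] * grams / total for name, grams in macros.items()}

# Values printed above for the second recipe; 'carbs' means net carbs
example_macros = {
    'fat': 15.008954821,
    'protein': 12.551760556500001,
    'carbs': 29.2874412985,
    'fiber': 18.518990985000002,
}
calories = estimate_calories(example_macros)
shares = macro_calorie_shares(example_macros)
print(round(calories, 2))
print({name: round(share, 3) for name, share in shares.items()})
```

Because the shares are computed from the same formula as the total, they always sum to 1, which would not hold if the recorded Energy value were used as the denominator.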

In [32]:
def get_macros(nutrients_row):
    macros_dct = {}

    for nutrient in nutrients_row.keys():
        if nutrients_row[nutrient]['label'] == 'Fat':
            macros_dct['fat'] = nutrients_row[nutrient]['quantity']
        elif nutrients_row[nutrient]['label'] == 'Protein':
            macros_dct['protein'] = nutrients_row[nutrient]['quantity']
        elif nutrients_row[nutrient]['label'] == 'Carbohydrates (net)':
            macros_dct['carbs'] = nutrients_row[nutrient]['quantity']
        elif nutrients_row[nutrient]['label'] == 'Fiber':
            macros_dct['fiber'] = nutrients_row[nutrient]['quantity']

    return macros_dct

In [33]:
nutrients.apply(lambda row: get_macros(row))

0        {'fat': 12.559853307785998, 'carbs': 19.920021...
1        {'fat': 15.008954821, 'carbs': 29.2874412985, ...
2        {'fat': 93.62645482099998, 'carbs': 29.1207512...
3        {'fat': 9.997903214, 'carbs': 19.368394199, 'f...
4        {'fat': 11.359097481766614, 'carbs': 20.608816...
                               ...                        
13267    {'fat': 34.50517705361625, 'carbs': 110.728102...
13268    {'fat': 63.17689642683119, 'carbs': 2.60068561...
13269    {'fat': 30.10379341867, 'carbs': 164.929173333...
13270    {'fat': 31.519833673398182, 'carbs': 24.483712...
13271    {'fat': 18.53965, 'carbs': 25.43745, 'fiber': ...
Name: totalNutrients, Length: 13272, dtype: object
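
For a multi-output regression target it is usually more convenient to have one column per macro than a series of dictionaries. The following is a hedged sketch of one way to do that with pandas; the `macro_dicts` values are made up for illustration, and `targets_df` is a hypothetical name. Since `get_macros` only sets the keys it finds, a recipe missing one of the labels would yield a shorter dictionary, which shows up here as `NaN`.

```python
import pandas as pd

# Hypothetical per-recipe dictionaries in the shape returned by get_macros
macro_dicts = pd.Series([
    {'fat': 12.5, 'carbs': 19.9, 'fiber': 15.4, 'protein': 8.3},
    {'fat': 15.0, 'carbs': 29.3, 'fiber': 18.5},  # protein missing on purpose
])

# One column per macro; a missing key becomes NaN rather than raising
targets_df = pd.DataFrame(macro_dicts.tolist(), index=macro_dicts.index)
print(targets_df)

# Count missing values per target before training
print(targets_df.isna().sum())
```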

In [34]:
recipe_name = df['label']
#calories = df[]

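# the three ingredient series all share the name 'ingredients', so rename them to get distinct columns
relevant_cols_df = pd.concat([
    priority_health_labels, recipe_name,
    food_ingredients.rename('food'), quantity_ingredients.rename('quantity'), measure_ingredients.rename('measure'),
], axis=1)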
relevant_cols_df.head()
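
As a rough sketch of the model input described at the top of the notebook (recipe name concatenated with the ingredients list, optionally with quantities and a dietary label), the columns gathered here could be combined into a single string per recipe. The exact text format, and the names `format_ingredient` and `build_model_input`, are assumptions for illustration. The sketch drops a quantity of `0.0` and a measure of `None`, which is how the parsed data above represents ingredients without a stated amount.

```python
def format_ingredient(food, quantity, measure):
    """Render one ingredient, skipping quantity/measure when they carry no information."""
    parts = []
    if quantity:  # 0.0 means no quantity was parsed
        parts.append(f"{quantity:g}")
    if measure:  # None means no unit was parsed
        parts.append(measure)
    parts.append(food)
    return " ".join(parts)

def build_model_input(recipe_name, foods, quantities, measures, health_label=None):
    """Concatenate the recipe name with its formatted ingredient list."""
    ingredient_text = ", ".join(
        format_ingredient(f, q, m) for f, q, m in zip(foods, quantities, measures)
    )
    text = f"{recipe_name}: {ingredient_text}"
    if health_label:
        text = f"[{health_label}] {text}"
    return text

example_input = build_model_input(
    "Green Beans",  # hypothetical recipe name
    ['green beans', 'butter', 'Coarse salt'],
    [1.0, 1.0, 0.0],
    ['pound', 'tablespoon', None],
    health_label='Vegetarian',
)
print(example_input)
```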