# Building a Vegan Likelihood Model

The goal is to build a model that either predicts whether a dish is vegan from the recipe name alone, or scores how likely a recipe is to be vegan (or how easily it could be made vegan).

Alternatively, we can compute the cosine similarity between a user-supplied recipe name and the recipe names in our database, then use the top-scoring match to retrieve its list of ingredients. Our model can then predict how likely the recipe is to be vegan. It could also list the ingredients in that recipe that are likely to be non-vegan, so the user knows what to watch out for.
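The lookup step could be sketched as follows. This is a minimal illustration, assuming a plain bag-of-words representation of recipe names; the notebook does not specify a vectorizer, and `tokenize`, `cosine_similarity`, and `best_match` are hypothetical helper names.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()

def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query, recipe_names):
    """Return (index, score) of the recipe name most similar to the query."""
    query_vec = Counter(tokenize(query))
    scores = [cosine_similarity(query_vec, Counter(tokenize(name)))
              for name in recipe_names]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]

# Usage with a few recipe names from the dataset's `label` column
names = ["Cheese Omelette", "Cheese straws", "Pimento Cheese"]
index, score = best_match("omelette with cheese", names)
print(names[index], round(score, 3))
```

The returned index can then be used to pull the matching row's ingredients. A TF-IDF or embedding-based representation would likely handle near-synonyms better than raw token counts.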

## EDA and Pre-processing

In [1]:
import pandas as pd
import ast


In [2]:
df = pd.read_csv('example_recipes.csv')

In [3]:
df.columns

Index(['uri', 'label', 'image', 'source', 'url', 'shareAs', 'yield',
       'dietLabels', 'healthLabels', 'cautions', 'ingredientLines',
       'ingredients', 'calories', 'totalWeight', 'totalTime', 'cuisineType',
       'mealType', 'dishType', 'totalNutrients', 'totalDaily', 'digest',
       'tags'],
      dtype='object')

In [4]:
ingredients_col = df['ingredients'].apply(ast.literal_eval)

In [5]:
ingredients_col[1][0]

{'text': '375g/13oz plain flour',
 'quantity': 375.0,
 'measure': 'gram',
 'food': 'flour',
 'weight': 375.0,
 'foodCategory': 'grains',
 'foodId': 'food_ahebfs0a985an4aubqaebbipra58',
 'image': 'https://www.edamam.com/food-img/b4c/b4c739e76a6f2172b7ad49d0aa41d5aa.jpg'}

In [6]:
dct = dict()
dct.keys()

dict_keys([])

In [7]:
#ingredients_col[1][0]['foodCategory']
dct[ingredients_col[1][0]['foodCategory']] = ingredients_col[1][0]['quantity']

In [8]:
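# Sum ingredient quantities per food category across all recipes.
# Quantities use mixed measures (grams, teaspoons, ...), so totals are only a rough indicator.
dct = dict()
for row in ingredients_col:
    for ingredient in row:
        key = ingredient['foodCategory']
        value = ingredient['quantity']
        dct[key] = dct.get(key, 0) + value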

In [9]:
'text' in ingredients_col[0][0]

True

In [10]:
dct.keys()

dict_keys(['Eggs', 'Milk', 'Cheese', 'Dairy', 'grains', 'Condiments and sauces', 'condiments and sauces', 'Oils', 'vegetables', 'canned vegetables', 'milk', 'bread, rolls and tortillas', 'cured meats', 'fruit', 'canned grains', 'canned soup', 'bov', 'plant-based protein', 'sugars', 'quick breads and pastries', 'wines', '100% juice', 'water', 'yogurt', 'beer', 'ready-to-eat cereals', 'sugar syrups', 'Cured meats', 'crackers', 'savory snacks', 'liquors and cocktails', 'meats', 'sugar jam', 'Vegan products', 'candy', 'chocolate', None, 'canned fruit', 'non-dairy beverages', 'flavored water', 'cocktails and liquors', 'canned seafood', 'seafood', 'Poultry', 'sweetened beverages', 'pastries', 'frozen treats', 'coffee and tea', 'eggs', 'cooked grains', 'Plant-based protein', 'frozen grained based', 'mixed grains', 'sandwhiches', 'protein and nutritional powders', 'salads'])
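The keys above contain case duplicates (for example 'Eggs' and 'eggs') and a `None` entry, so the categories need normalizing before they can be used for scoring. The sketch below shows one possible way to normalize them and flag likely non-vegan ingredients. The `NON_VEGAN_CATEGORIES` set is an assumption chosen by hand from the keys above, not something defined in the dataset, and both function names are hypothetical.

```python
# Assumed set of animal-derived categories, picked by hand from the keys above.
NON_VEGAN_CATEGORIES = {
    'eggs', 'milk', 'cheese', 'dairy', 'yogurt',
    'meats', 'cured meats', 'poultry', 'seafood', 'canned seafood',
}

def normalize_category(category):
    """Lowercase and trim a category; map missing values to 'unknown'."""
    if category is None:
        return 'unknown'
    return category.strip().lower()

def non_vegan_share(ingredients):
    """Return (share, flagged): the fraction of ingredients in an assumed
    non-vegan category and the names of those ingredients."""
    if not ingredients:
        return 0.0, []
    flagged = [
        ing['food'] for ing in ingredients
        if normalize_category(ing.get('foodCategory')) in NON_VEGAN_CATEGORIES
    ]
    return len(flagged) / len(ingredients), flagged

# Hypothetical ingredient dicts shaped like the entries in `ingredients_col`
example = [
    {'food': 'flour', 'foodCategory': 'grains'},
    {'food': 'egg', 'foodCategory': 'Eggs'},
    {'food': 'salt', 'foodCategory': None},
    {'food': 'butter', 'foodCategory': 'Dairy'},
]
share, flagged = non_vegan_share(example)
print(share, flagged)
```

Ambiguous categories such as 'chocolate', 'candy', or 'bov' are left out of the set here and would need a closer look at the underlying ingredients.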

In [47]:
ex_s = ''
ex_s_lst = ast.literal_eval(df.iloc[0]['ingredientLines'])
for s in ex_s_lst:
    ex_s += s + ', '

ex_s

'1 organic large egg, 1 teaspoon whole milk or water, 1 tablespoon cheddar cheese, shredded (you can use other types of cheese), 1 teaspoon butter or oil, '

In [55]:
', '.join(ex_s_lst)

'1 organic large egg, 1 teaspoon whole milk or water, 1 tablespoon cheddar cheese, shredded (you can use other types of cheese), 1 teaspoon butter or oil'

## Preprocessing for Transformer

The plan is to preprocess the data so that the dish/recipe name (`label` column) and a diet type derived from `healthLabels` form the input, and the ingredients list, as one long string, is the output. A transformer can then generate a recipe's ingredients in long text form.

The diet type is limited to a few options, and the user can select only one for now: ['Mediterranean', 'Vegetarian', 'Vegan', 'Red-Meat-Free', 'Paleo', 'Pescatarian']. Reducing the `healthLabels` column from multilabel to a single category requires a priority order. If none of these labels is present, the dish is treated as balanced, so we add 'Balanced' as an option. In the future, this column should be analyzed to improve the priority order rather than relying on domain knowledge. Alternatively, an option to select multiple labels could be implemented instead.
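One possible starting point for that future analysis is to count how often the candidate labels occur and co-occur, since the priority order only matters for recipes carrying more than one of them. The sketch below is illustrative only: `label_overlap` is a hypothetical helper, and the example rows are made up rather than taken from the dataset.

```python
from collections import Counter
from itertools import combinations

CANDIDATES = ['Mediterranean', 'Vegetarian', 'Vegan',
              'Red-Meat-Free', 'Paleo', 'Pescatarian']

def label_overlap(label_lists, candidates=CANDIDATES):
    """Count single occurrences and pairwise co-occurrences of candidate labels."""
    singles = Counter()
    pairs = Counter()
    for labels in label_lists:
        # Keep only candidate labels; sort so each pair has one canonical order.
        present = sorted(set(labels) & set(candidates))
        singles.update(present)
        pairs.update(combinations(present, 2))
    return singles, pairs

# Made-up rows shaped like the parsed `healthLabels` lists
example_rows = [
    ['Vegetarian', 'Pescatarian', 'Egg-Free'],
    ['Vegan', 'Vegetarian', 'Pescatarian'],
    ['Sugar-Conscious'],
]
singles, pairs = label_overlap(example_rows)
print(singles.most_common())
print(pairs.most_common())
```

Pairs with high counts are the ones where the chosen priority order actually changes the resulting diet type.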

In [74]:
priority_order = ['Vegan', 'Vegetarian', 'Pescatarian', 'Paleo', 'Red-Meat-Free', 'Mediterranean']

In [87]:
health_labels = df['healthLabels'].apply(ast.literal_eval)
health_labels

0       [Sugar-Conscious, Low Potassium, Kidney-Friend...
1       [Sugar-Conscious, Low Potassium, Kidney-Friend...
2       [Sugar-Conscious, Low Potassium, Kidney-Friend...
3       [Sugar-Conscious, Low Potassium, Kidney-Friend...
4       [Vegetarian, Pescatarian, Egg-Free, Peanut-Fre...
                              ...                        
1195    [Keto-Friendly, Pescatarian, Mediterranean, Gl...
1196    [Pescatarian, Gluten-Free, Wheat-Free, Egg-Fre...
1197    [Sugar-Conscious, Keto-Friendly, Pescatarian, ...
1198    [Sugar-Conscious, Keto-Friendly, Pescatarian, ...
1199    [Sugar-Conscious, Pescatarian, Mediterranean, ...
Name: healthLabels, Length: 1200, dtype: object

In [88]:
def replace_with_priority(labels):
    for label in priority_order:
        if label in labels:
            return label
    return 'Balanced'  # Handle case where no label matches priority_order, in which case the diet is balanced

# Apply function to the multilabels series
diet_type = health_labels.apply(replace_with_priority)

In [89]:
diet_type.value_counts()

healthLabels
Vegetarian       400
Vegan            279
Pescatarian      209
Balanced         181
Red-Meat-Free     57
Paleo             42
Mediterranean     32
Name: count, dtype: int64

Now we have our dietary preference column. The recipe name is fine as is, so next is the ingredients list, which is our target variable.

In [90]:
recipe_name = df['label']

We just need to join these lists of strings with commas so they are user-friendly to read.

In [91]:
ingredients_lst = df['ingredientLines'].apply(ast.literal_eval)
ingredients_lst = ingredients_lst.apply(lambda x: ', '.join(x))
ingredients_lst

0       1 organic large egg, 1 teaspoon whole milk or ...
1       375g/13oz plain flour, Pinch salt, 225g/8oz bu...
2       1 cup asiago cheese, grated, 1 cup fontina che...
3       4 ounces, weight Cream Cheese, Softened, 1/2 c...
4       8 ounces elbow pasta, 1/4 cup unsalted butter,...
                              ...                        
1195    4 (6 ounce) tilapia fillets, salt, pepper, 1/2...
1196    * 1 Vegetable oil cooking spray, * 4 U.S.-farm...
1197    2 tbsp chopped red onion, 1 tbsp olive oil, 1 ...
1198    3 tablespoons unsalted butter, 2 tablespoons e...
1199    * 2 tilapia fillets (skinless, about 4 ounces ...
Name: ingredientLines, Length: 1200, dtype: object

Now we can make our dataframe for the modeling.

In [95]:
df2 = pd.concat([diet_type, recipe_name, ingredients_lst], axis = 1)
column_names = {'healthLabels': 'dietType', 'label': 'recipeName', 'ingredientLines': 'ingredientsList'}
df2 = df2.rename(columns=column_names)
df2.head()

|   | dietType | recipeName | ingredientsList |
|---|----------|------------|-----------------|
| 0 | Vegetarian | Cheese Omelette | 1 organic large egg, 1 teaspoon whole milk or ... |
| 1 | Vegetarian | Cheese straws | 375g/13oz plain flour, Pinch salt, 225g/8oz bu... |
| 2 | Vegetarian | CHEESE GOOP | 1 cup asiago cheese, grated, 1 cup fontina che... |
| 3 | Vegetarian | Pimento Cheese | 4 ounces, weight Cream Cheese, Softened, 1/2 c... |
| 4 | Vegetarian | Five Cheese Skillet Mac and Cheese recipes | 8 ounces elbow pasta, 1/4 cup unsalted butter,... |


## Modeling

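Reference material on conditional text generation by fine-tuning a GPT model (Gretel notebook and blog post):

https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/conditional_text_generation_with_gpt.ipynb#scrollTo=YMg9nX6SczHe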

https://gretel.ai/blog/conditional-text-generation-by-fine-tuning-gretel-gpt

In [113]:
import transformers
import torch
print("Transformers version:", transformers.__version__)
print("Torch version:", torch.__version__)

from transformers import GPT2Tokenizer, GPT2LMHeadModel
from sklearn.model_selection import train_test_split

Transformers version: 4.37.2
Torch version: 2.2.0+cu118


In [102]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [122]:
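# GPT-2 is a causal language model, so each training example is one string:
# the prompt ("dietType recipeName"), a separator, the target, and an end-of-text token.
input_feature = (df2['dietType'] + " " + df2['recipeName']).tolist()
output_feature = df2['ingredientsList'].tolist()
texts = [f"{inp}: {out}{tokenizer.eos_token}" for inp, out in zip(input_feature, output_feature)]

tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
encoded_inputs = tokenizer(texts, truncation=True, max_length=512)

In [None]:
# Load the pretrained GPT-2 model to fine-tune
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Split the dataset into train, validation, and test sets.
# For causal language modeling the labels are the input ids themselves.
input_ids = encoded_inputs['input_ids']
train_ids, test_ids = train_test_split(input_ids, test_size=0.2, random_state=42)
train_ids, val_ids = train_test_split(train_ids, test_size=0.2, random_state=42)

def make_batches(ids, batch_size=8):
    """Yield padded batches; padding positions get label -100 so the loss ignores them."""
    for i in range(0, len(ids), batch_size):
        batch = tokenizer.pad({'input_ids': ids[i:i + batch_size]}, return_tensors='pt')
        labels = batch['input_ids'].masked_fill(batch['attention_mask'] == 0, -100)
        yield batch['input_ids'].to(device), batch['attention_mask'].to(device), labels.to(device)

def evaluate(ids):
    """Return the mean per-batch language-modeling loss."""
    model.eval()
    losses = []
    with torch.no_grad():
        for batch_ids, mask, labels in make_batches(ids):
            losses.append(model(input_ids=batch_ids, attention_mask=mask, labels=labels).loss.item())
    return sum(losses) / len(losses)

# Fine-tune the model
num_epochs = 3  # placeholder value; tune as needed
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    model.train()
    for batch_ids, mask, labels in make_batches(train_ids):
        optimizer.zero_grad()
        loss = model(input_ids=batch_ids, attention_mask=mask, labels=labels).loss
        loss.backward()
        optimizer.step()
    # Evaluation on validation set
    print(f"Epoch {epoch + 1}: validation loss {evaluate(val_ids):.4f}")

# Evaluation on test set
print(f"Test loss: {evaluate(test_ids):.4f}")

# Inference on new data, using the same "dietType recipeName:" prompt format as training
new_diet_type = "Vegetarian"
new_recipe_name = "Your new recipe name"
new_input = tokenizer(f"{new_diet_type} {new_recipe_name}:", return_tensors='pt').to(device)
model.eval()
predicted_output = model.generate(**new_input, max_length=100, num_return_sequences=1,
                                  pad_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = new_input['input_ids'].shape[1]
decoded_output = tokenizer.decode(predicted_output[0][prompt_length:], skip_special_tokens=True)
print("Predicted ingredients:", decoded_output)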