## This notebook was written in Google Colab. It builds a recipe recommendation system driven by a vector of detected food ingredients.
## The recipe data come from Kaggle. Food Ingredients and Recipes Dataset with Images: https://www.kaggle.com/datasets/pes12017000148/food-ingredients-and-recipe-dataset-with-images



# **The Baseline Model: cosine similarity**





**1. Dataset download**





In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("pes12017000148/food-ingredients-and-recipe-dataset-with-images")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/pes12017000148/food-ingredients-and-recipe-dataset-with-images?dataset_version_number=1...


100%|██████████| 206M/206M [00:01<00:00, 125MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/pes12017000148/food-ingredients-and-recipe-dataset-with-images/versions/1


**2. Dataset loading**

In [3]:
import pandas as pd
import os
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity

### data loading
data_path = "/root/.cache/kagglehub/datasets/pes12017000148/food-ingredients-and-recipe-dataset-with-images/versions/1"
csv_file = os.path.join(data_path, "Food Ingredients and Recipe Dataset with Image Name Mapping.csv")
df = pd.read_csv(csv_file)

In [None]:
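import ast

# Show one raw Cleaned_Ingredients entry, a few items per line
sample_idx = 5
print(f"Original Cleaned_Ingredients[{sample_idx}]:")
# each entry is a stringified list; literal_eval parses it without executing code
cleaned_ingredients = ast.literal_eval(df['Cleaned_Ingredients'].iloc[sample_idx])
Length_limit = 4
for i in range(0, len(cleaned_ingredients), Length_limit):
    print(cleaned_ingredients[i:i + Length_limit])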
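
The `Cleaned_Ingredients` column stores each recipe's ingredient list as a string, so it has to be parsed before it can be sliced or iterated. As a minimal sketch, `ast.literal_eval` from the standard library turns such a string into a list while rejecting anything that is not a plain literal. The helper name `parse_ingredients` is an illustrative assumption; the sample string is the peanut butter cookie entry printed later in this notebook.

```python
import ast

def parse_ingredients(raw):
    """Parse a stringified Python list of ingredient phrases (hypothetical helper)."""
    parsed = ast.literal_eval(raw)  # accepts literals only; never executes code
    if not isinstance(parsed, list):
        raise ValueError("expected a list of ingredient strings")
    return parsed

raw = "['1 large egg', '1 cup creamy peanut butter', '1 cup sugar', 'Flaky sea salt (optional)']"
ingredients = parse_ingredients(raw)
print(len(ingredients), ingredients[0])

# A string containing a function call is rejected instead of being executed
try:
    parse_ingredients("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print("rejected:", rejected)
```

The same call works row by row on the DataFrame column, which avoids running arbitrary code that a tampered CSV could contain.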

**3. Data cleaning**

In [None]:
### data cleaning
df = df.dropna(subset=['Cleaned_Ingredients', 'Title', 'Instructions'])
df = df.drop_duplicates(subset=['Title'])

**4. Main body**

In [None]:
# Keep only the last word of an ingredient phrase (e.g. "1 cup sugar" -> "sugar");
# an empty or whitespace-only string returns ""
def ingredient_standarization(ingredient):
    words = ingredient.split()
    return words[-1] if words else ""

import ast  # used to parse the stringified ingredient lists safely

### Standardize the Cleaned_Ingredients column
ingre_std_list = []
for ingredients in df['Cleaned_Ingredients']:
    # each entry is a stringified list; parse it into a Python list without executing code
    parsed = ast.literal_eval(ingredients)
    # reduce every ingredient phrase to its last word
    ingre_std_list.append([ingredient_standarization(ingredient) for ingredient in parsed])

df['Ingredients_list'] = ingre_std_list

### drop empty or whitespace-only strings
df['Ingredients_list'] = [
    [ingredient for ingredient in ingredients if ingredient.strip()]
    for ingredients in df['Ingredients_list']
]

### collect all ingredients from the ingredients list
ingredient_list_total = set()
for ingredients in df['Ingredients_list']:
    for ingredient in ingredients:
        if ingredient.strip():
            ingredient_list_total.add(ingredient)


### use the collected ingredients as the MultiLabelBinarizer classes (sorted so the column order is reproducible)
mlb = MultiLabelBinarizer(classes=sorted(ingredient_list_total))
ingredient_vectors = mlb.fit_transform(df['Ingredients_list'])

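# ingredients detected by YOLO (sample input; the repeated 'cucumber' still sets a single 1 in the binary vector)
YOLO_detected_ingredients = ['beef', 'corn', 'egg','cucumber','spinach','sugar','cabbage','garlic','Salt','cucumber','carrot']

# convert the list of detected labels into a multi-hot vector over the recipe vocabulary
# (labels missing from the vocabulary are ignored with a warning; matching is case-sensitive)
YOLO_vector_one_hot = mlb.transform([YOLO_detected_ingredients])[0]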

# calculate the cosine similarity
cos_sim = cosine_similarity([YOLO_vector_one_hot], ingredient_vectors).flatten()

## exclude the recipe with cosine similarity 0
non_zero_indices = cos_sim > 0
filtered_cos_sim = cos_sim[non_zero_indices]
filtered_indices = non_zero_indices.nonzero()[0]

# sort recipes without 0 cosine similarity and get the top 5 recipes
top_5_recipe_idx = filtered_cos_sim.argsort()[-5:][::-1]
top_5_recipe_idx_filtered = filtered_indices[top_5_recipe_idx]
top_5_recipes = df.iloc[top_5_recipe_idx_filtered]

### wrap long text onto multiple lines so the printed output stays readable
def output_adjust(text, line_length=160):
    words = text.split()
    text_adjust = ""
    line = ""
    for word in words:
        if len(line) + len(word) + 1 > line_length:
            text_adjust += line + "\n"
            line = word
        else:
            if line:
                line += " " + word
            else:
                line = word
    text_adjust += line
    return text_adjust

## print the output
print("Top 5 Recommended Recipes based on the food detection in fridge:")
for i, index in enumerate(top_5_recipe_idx_filtered, start=1):
    title = df.iloc[index]['Title']
    instructions = df.iloc[index]['Instructions']
    ingredients = df.iloc[index]['Cleaned_Ingredients']
    similarity = cos_sim[index]
    formatted_instructions = output_adjust(instructions)
    ingredients = output_adjust(ingredients)
    print(f"Top{i}. Title: {title}, Cosine Similarity: {similarity:.4f}\n")
    print(f"Ingredients:\n{ingredients}\n")
    # print(f"Instructions:\n{formatted_instructions}\n")


Top 5 Recommended Recipes based on the food detection in fridge:
Top1. Title: Spicy Shrimp and Vegetable Stir-Fry, Cosine Similarity: 0.3508

Ingredients:
['1/4 cup low-sodium soy sauce', '1/4 cup sake', '2 tablespoons sugar', '1 tablespoon dark (toasted) sesame oil', '1 tablespoon chopped garlic', '1 tablespoon
finely chopped or grated ginger', '1 cup large-diced red bell pepper', '1 cup large-diced green bell pepper', '1 cup large-diced onion', '1 cup cubed cabbage',
'1 cup sliced carrot', '1/2 teaspoon red pepper flakes', '24 large shrimp', 'shelled and deveined']

Top2. Title: 3-Ingredient Peanut Butter Cookies, Cosine Similarity: 0.3162

Ingredients:
['1 large egg', '1 cup creamy peanut butter', '1 cup sugar', 'Flaky sea salt (optional)']

Top3. Title: Stuffed Meatloaf, Cosine Similarity: 0.3162

Ingredients:
['1 large egg', '1/4 cup finely chopped onion', '1/4 cup seasoned breadcrumbs', '1 tablespoon ketchup', '1/2 teaspoon Worcestershire sauce', 'Salt', 'Freshly
ground pepper', 
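
Because the recipes and the detection list are both encoded as binary vectors, cosine similarity reduces to the number of shared ingredients divided by the square root of the product of the two set sizes. The sketch below recomputes two of the printed scores in plain Python from the printed ingredient lists, applying the same last-word rule as the baseline. It assumes that all ten distinct detected labels exist in the recipe vocabulary; the function names are illustrative and not part of the notebook.

```python
import math

def last_word(phrase):
    """Baseline rule: keep only the last whitespace-separated word."""
    words = phrase.split()
    return words[-1] if words else ""

def binary_cosine(query, recipe):
    """Cosine similarity of two binary vectors, expressed on sets."""
    if not query or not recipe:
        return 0.0
    return len(query & recipe) / math.sqrt(len(query) * len(recipe))

# Distinct detected labels (the duplicate 'cucumber' counts once)
detected = {'beef', 'corn', 'egg', 'cucumber', 'spinach', 'sugar',
            'cabbage', 'garlic', 'Salt', 'carrot'}

cookies = ['1 large egg', '1 cup creamy peanut butter', '1 cup sugar',
           'Flaky sea salt (optional)']
stir_fry = ['1/4 cup low-sodium soy sauce', '1/4 cup sake', '2 tablespoons sugar',
            '1 tablespoon dark (toasted) sesame oil', '1 tablespoon chopped garlic',
            '1 tablespoon finely chopped or grated ginger',
            '1 cup large-diced red bell pepper', '1 cup large-diced green bell pepper',
            '1 cup large-diced onion', '1 cup cubed cabbage', '1 cup sliced carrot',
            '1/2 teaspoon red pepper flakes', '24 large shrimp', 'shelled and deveined']

cookie_score = binary_cosine(detected, {last_word(p) for p in cookies})
stir_fry_score = binary_cosine(detected, {last_word(p) for p in stir_fry})
print(f"{cookie_score:.4f} {stir_fry_score:.4f}")
```

Both values agree with the scores printed above (0.3162 and 0.3508). The calculation also shows that leftover tokens such as `(optional)` and `deveined` lengthen the recipe vector without ever matching a detection, which lowers the score.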

In [None]:
print(" Cleaned ingredient list[5] using baseline model:")
ingredients = df['Ingredients_list'].iloc[5]
line_length = 8
for i in range(0, len(ingredients), line_length):
    print(ingredients[i:i + line_length])

 Cleaned ingredient list[5] using baseline model:
['bags', 'tequila', 'juice', 'nectar']


# **Primary model: NER-enhanced recommendation system**

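The primary model uses spaCy's `en_core_web_sm` pipeline to extract food words from the ingredient text. Although the model is described as NER (Named Entity Recognition) based, the code below selects tokens by part-of-speech tag (nouns and proper nouns) rather than by entity label.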


In [6]:
import pandas as pd
import os
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity
import spacy
import time

### use the GPU if one is available
try:
    spacy.require_gpu()
    print("GPU is available in the training.")
except Exception:  # require_gpu() raises when no GPU can be used
    print("GPU is not available, using CPU instead.")

# load the Spacy model
NER_SPACY = spacy.load("en_core_web_sm")

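# nouns that are not food ingredients (units, tools, containers, ...)
# NOTE: every entry is currently commented out, so this literal is an empty dict and no token is filtered
words_not_food_in_noun = {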
              # "cup", "teaspoon", "tablespoons", "cups", "ounce", "ounces", "pound", "pounds", "tsp",
              # "tbsp", "medium", "stick", "slices", "extract", "oz", "temperature", "room", "pieces",
              # "purpose", "total", "g", "ml", "lengthwise", "crosswise", "pinch", "½", "¼", "¾",
              # "package", "sprigs", "sticks", "halves", "inch", "bunch", "bunches", "parts", "quality",
              # "batch", "thermometer", "slice", "grind", "attachment", "wheel", "superfine", "quarts",
              # "baking", "ribbons", "dash", "layers", "cube", "skewers", "sheets", "moons", "round",
              # "rind", "top", "bottom", "height", "width", "diameter", "bias", "flameproof",
              # "rolling", "pin", "center", "frying", "mill", "maker", "heaping", "¼", "sticks",
              # "tops", "ends", "weights", "kitchen", "tool", "processor", "ramekins", "thickness",
              # "molds", "sheet", "paddle", "strip", "segments", "blocks", "form", "stand",
              # "lengths", "handle", "springform", "sheets", "platter", "knives", "blender", "mandoline",
              # "skillet", "strainer", "cast", "iron", "jar", "cans", "quart", "pint", "ruler",
              # "bottle", "mortar", "pestle", "sticks", "notes", "step", "layer", "loaf",
              # "pan", "batch", "inch",  "foil", "container", "tins", "shells",
              # "mold", "serving", "blender", "processor", "attachment", "form", "box",
              # "bowl", "spatula", "mixer", "wire", "frame", "plate", "sheet",
              # "nut", "oven", "tray", "cutter", "grater", "pot", "wok", "dough", "log", "weight",
              # "temperature", "reading", "minutes", "hours", "tables", "range", "measuring", "bag",
              # "envelope", "mat", "sprinkling", "tin", "edges", "ball", "circle", "seal", "quart",
              # "syrup", "slab", "bars", "kitchen", "molds", "ribs", "base", "lump", "skewer",
              # "thickness", "degree", "ladle", "grains", "milliliters", "gallon", "basting", "stem",
              # "tray", "handful", "frames", "loops", "bag", "label"
              }

# extract food-ingredient words from each merged ingredient string
def food_word_extracting(ingredients_list, batch_size=32):
    food_extracted_output = []
    total_batch = (len(ingredients_list) + batch_size - 1) // batch_size  # number of batches (ceiling division)
    start_time = time.time()

    for batch_idx in range(0, len(ingredients_list), batch_size):
        ### split the input into slices [0:32], [32:64], ...
        batch = ingredients_list[batch_idx:batch_idx + batch_size]
        batch_docs = list(NER_SPACY.pipe(batch))
        batch_output = []

        ### key step: use spaCy's part-of-speech tags to keep only nouns and proper nouns.
        ### In "one slice of the cake", both "slice" and "cake" are nouns; "slice" is dropped
        ### only if it is listed in words_not_food_in_noun.
        for doc in batch_docs:
            food_ing_words = []
            for token in doc:
                ### keep NOUN/PROPN tokens that are not in the words_not_food_in_noun list
                if token.pos_ in {"NOUN", "PROPN"} and token.text.lower() not in words_not_food_in_noun:
                    food_ing_words.append(token.text)
            batch_output.append(food_ing_words)
        food_extracted_output.extend(batch_output)

        # log progress every 100 batches
        time_cost = time.time() - start_time
        batch_number = batch_idx // batch_size + 1
        if batch_number % 100 == 0:
            print(f"Processed batch {batch_number}/{total_batch}. Time cost is : {time_cost:.2f} seconds")
    total_time = time.time() - start_time
    print(f"Word extracting process using NER completed in {total_time:.2f} seconds.")
    return food_extracted_output

# data loading
data_path = "/root/.cache/kagglehub/datasets/pes12017000148/food-ingredients-and-recipe-dataset-with-images/versions/1"
csv_file = os.path.join(data_path, "Food Ingredients and Recipe Dataset with Image Name Mapping.csv")
df = pd.read_csv(csv_file)

# data cleaning
df = df.dropna(subset=['Cleaned_Ingredients', 'Title', 'Instructions'])
df = df.drop_duplicates(subset=['Title'])

import ast  # used to parse the stringified ingredient lists safely

### Standardize the Cleaned_Ingredients column (the improvement: the spaCy pipeline does the word extraction)
ingre_cleaned_merge = []
for cleaned_ingredients in df['Cleaned_Ingredients']:
    # parse the stringified list, then join its phrases into one space-separated sentence
    combined_text = " ".join(ast.literal_eval(cleaned_ingredients))
    ingre_cleaned_merge.append(combined_text)

# use food_extracting function to extract the food words in the merged sentences
ingredients_list = food_word_extracting(ingre_cleaned_merge)
df['Ingredients_list'] = ingredients_list

# input vector from YOLO output
YOLO_detected_ingredients = ['beef', 'corn', 'egg', 'cucumber', 'spinach', 'sugar', 'cabbage', 'garlic', 'Salt', 'cucumber', 'carrot']

# collect all ingredients from the ingredients list
ingredient_list_total = set()
for ingredients in df['Ingredients_list']:
    for ingredient in ingredients:
        if ingredient.strip():
            ingredient_list_total.add(ingredient)

# merge the YOLO food ingredient into the ingredient_list_total
ingredient_list_total.update(YOLO_detected_ingredients)

### use the collected ingredients as the MultiLabelBinarizer classes (sorted so the column order is reproducible)
mlb = MultiLabelBinarizer(classes=sorted(ingredient_list_total))
ingredient_vectors = mlb.fit_transform(df['Ingredients_list'])

# convert YOLO vector into the one hot encoding
YOLO_vector_one_hot = mlb.transform([YOLO_detected_ingredients])[0]

# calculate the cosine similarity
cos_sim = cosine_similarity([YOLO_vector_one_hot], ingredient_vectors).flatten()

## exclude the recipe with cosine similarity 0
non_zero_indices = cos_sim > 0
filtered_cos_sim = cos_sim[non_zero_indices]
filtered_indices = non_zero_indices.nonzero()[0]

# sort recipes without 0 cosine similarity and get the top 5 recipes
top_5_recipe_idx = filtered_cos_sim.argsort()[-5:][::-1]
top_5_recipe_idx_filtered = filtered_indices[top_5_recipe_idx]
top_5_recipes = df.iloc[top_5_recipe_idx_filtered]

### wrap long text onto multiple lines so the printed output stays readable
def output_adjust(text, line_length=160):
    words = text.split()
    text_adjust = ""
    line = ""
    for word in words:
        if len(line) + len(word) + 1 > line_length:
            text_adjust += line + "\n"
            line = word
        else:
            if line:
                line += " " + word
            else:
                line = word
    text_adjust += line
    return text_adjust

# # print the output
# print("Top 5 Recommended Recipes based on the food detection in fridge:")
# for i, index in enumerate(top_5_recipe_idx_filtered, start=1):
#     title = df.iloc[index]['Title']
#     instructions = df.iloc[index]['Instructions']
#     ingredients = df.iloc[index]['Cleaned_Ingredients']
#     similarity = cos_sim[index]
#     formatted_instructions = output_adjust(instructions)
#     ingredients = output_adjust(ingredients)
#     print(f"Top{i}. Title: {title}, Cosine Similarity: {similarity:.4f}\n")
#     print(f"Ingredients:\n{ingredients}\n")
#     print(f"Instructions:\n{formatted_instructions}\n")

# Print the output
print("Top 5 Recommended Recipes based on the food detection in fridge:")
for i, index in enumerate(top_5_recipe_idx_filtered, start=1):
    title = df.iloc[index]['Title']
    # use the processed ingredient list (extracted nouns and proper nouns)
    ingredients = df.iloc[index]['Ingredients_list']
    # find the ingredients shared with the YOLO detections
    yolo_set = set(YOLO_detected_ingredients)
    merged_ingre_set = set(ingredients)
    same_ingre = yolo_set.intersection(merged_ingre_set)
    same_ingre_string = ", ".join(same_ingre)

    print(f"Top {i}. Title: {title}, Cosine Similarity: {cos_sim[index]:.4f}\n")
    print(f"There are {len(same_ingre)} ingredients detected, which are the [{same_ingre_string}].\n")



GPU is available in the training.
Processed batch 100/416. Time cost is : 15.01 seconds
Processed batch 200/416. Time cost is : 27.59 seconds
Processed batch 300/416. Time cost is : 42.51 seconds
Processed batch 400/416. Time cost is : 55.45 seconds
Word extracting process using NER completed in 57.07 seconds.
Top 5 Recommended Recipes based on the food detection in fridge:
Top 1. Title: Bibimbap, Cosine Similarity: 0.3101

There are 5 ingredients detected, which are the [beef, corn, carrot, sugar, spinach].

Top 2. Title: Slow-Cooked Venison, Cosine Similarity: 0.2936

There are 5 ingredients detected, which are the [beef, carrot, garlic, sugar, Salt].

Top 3. Title: Spicy Shrimp and Vegetable Stir-Fry, Cosine Similarity: 0.2902

There are 4 ingredients detected, which are the [carrot, garlic, sugar, cabbage].

Top 4. Title: Vegetable Latkes, Cosine Similarity: 0.2828

There are 4 ingredients detected, which are the [carrot, corn, sugar, spinach].

Top 5. Title: Avocado Egg-in-a-Hole,
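
The ranking step keeps only recipes with a nonzero score and then takes the five highest. A plain-Python sketch of that selection follows. `top_k_nonzero` is a hypothetical helper, and the scores are made-up values for illustration, not results from this notebook. The order of tied scores may differ from the `argsort`-based version used in the cell.

```python
def top_k_nonzero(scores, k=5):
    """Return (index, score) pairs for the k highest scores, skipping zeros."""
    candidates = [(i, s) for i, s in enumerate(scores) if s > 0]
    # sorted() is stable, so equal scores keep their original index order
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up similarity scores for seven recipes
scores = [0.0, 0.31, 0.12, 0.0, 0.29, 0.31, 0.05]
top3 = top_k_nonzero(scores, k=3)
print(top3)
```

Returning the original indices alongside the scores means the result can be passed straight to `df.iloc` without a second lookup table.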

**Test spaCy's effectiveness at extracting food terms**

In [None]:
print(" Cleaned ingredient list[5] Using SPACY:")
ingredients = df['Ingredients_list'].iloc[5]
line_length = 9
for i in range(0, len(ingredients), line_length):
    print(ingredients[i:i + line_length])

 Cleaned ingredient list[5] Using SPACY:
['chamomile', 'tea', 'bags', 'reposado', 'tequila', 'lemon', 'juice', 'agave', 'nectar']
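
The two printed lists for recipe 5 make the difference between the approaches concrete. The sketch below copies both outputs from this notebook and lists the tokens that only the spaCy-based extraction kept; the variable names are illustrative.

```python
# Tokens printed for recipe 5 by the baseline (last-word rule)
baseline_tokens = ['bags', 'tequila', 'juice', 'nectar']
# Tokens printed for the same recipe by the spaCy noun/proper-noun filter
spacy_tokens = ['chamomile', 'tea', 'bags', 'reposado', 'tequila',
                'lemon', 'juice', 'agave', 'nectar']

baseline_set = set(baseline_tokens)
# Preserve the original order while keeping only tokens the baseline missed
recovered = [t for t in spacy_tokens if t not in baseline_set]
print(recovered)
```

The recovered words are the modifiers that identify the actual ingredient (for example `lemon` before `juice`), while a generic container word such as `bags` survives in both lists because the exclusion list is currently empty.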
