# Embeddings and FAISS Index for CookMate
In this notebook, we:
1. Load the cleaned Food.com dataset.
2. Build text representations for each recipe.
3. Compute embeddings using SentenceTransformers.
4. Build and save a FAISS index.
5. Test a simple recipe search function.

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
df = pd.read_json("../data/cleaned/cleaned_recipes.json")
len(df), df.columns

(522513,
 Index(['recipe_id', 'title', 'ingredients_list', 'quantities_list',
        'steps_list', 'Calories', 'FatContent', 'CarbohydrateContent',
        'ProteinContent', 'RecipeCategory', 'Keywords'],
       dtype='object'))

In [3]:
# Embed the full dataset; lower N to work on a smaller sample while prototyping.
N = len(df)
df_emb = df.iloc[:N].copy()
len(df_emb)

522513

In [4]:
def build_recipe_text(row):
    """Combine a recipe's title, ingredients, and first steps into one string."""
    title = row["title"]
    ingredients = ", ".join(row["ingredients_list"])
    # Keep only the first 5 steps so the text stays short
    steps = " ".join(row["steps_list"][:5])
    return f"Title: {title}. Ingredients: {ingredients}. Steps: {steps}"

texts = df_emb.apply(build_recipe_text, axis=1).tolist()
len(texts), texts[0][:500]

(522513,
 "Title: Low-Fat Berry Blue Frozen Dessert. Ingredients: blueberries, granulated sugar, vanilla yogurt, lemon juice. Steps: Toss 2 cups berries with sugar. Let stand for 45 minutes, stirring occasionally. Transfer berry-sugar mixture to food processor. Add yogurt and process until smooth. Strain through fine sieve. Pour into baking pan (or transfer to ice cream maker and process according to manufacturers' directions). Freeze uncovered until edges are solid but centre is soft.  Transfer to process")
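
`build_recipe_text` assumes every row has a string title and list-valued `ingredients_list` and `steps_list`. The notebook does not say whether the cleaned data can still contain missing values, so the following is only a sketch of a more defensive variant under that assumption. The name `build_recipe_text_safe` and the `max_steps` parameter are hypothetical and not part of the project.

```python
def build_recipe_text_safe(row, max_steps=5):
    """Hypothetical variant of build_recipe_text that tolerates missing fields.

    `row` may be a dict or a pandas Series (anything that supports row.get(key)).
    """
    def as_list(value):
        # Treat None, NaN, and any other non-list value as an empty list.
        return list(value) if isinstance(value, (list, tuple)) else []

    title = row.get("title")
    title = title if isinstance(title, str) else ""
    ingredients = ", ".join(str(i) for i in as_list(row.get("ingredients_list")))
    steps = " ".join(str(s) for s in as_list(row.get("steps_list"))[:max_steps])
    return f"Title: {title}. Ingredients: {ingredients}. Steps: {steps}"


# Example: a row whose steps are missing still produces a usable text.
example = {
    "title": "Low-Fat Berry Blue Frozen Dessert",
    "ingredients_list": ["blueberries", "granulated sugar"],
    "steps_list": None,
}
print(build_recipe_text_safe(example))
```

It could be swapped in with `df_emb.apply(build_recipe_text_safe, axis=1)` if missing fields turn out to be a problem.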

### Embedding Model Choice

We selected **all-MiniLM-L6-v2** because:

- Fast inference, even on CPU
- 384-dimensional embeddings, which keep the index small
- Strong semantic performance on short texts
- Works well with normalized embeddings and inner-product similarity

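We L2-normalize the embeddings and index them with `faiss.IndexFlatIP`. For unit-length vectors the inner product equals the cosine similarity, so the index performs an exact cosine-similarity search.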
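
The link between inner product and cosine similarity can be checked numerically. The sketch below uses two random 384-dimensional vectors as stand-ins for real recipe embeddings (an assumption made only for illustration) and compares the cosine similarity of the raw vectors with the inner product of their normalized versions.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384).astype("float32")  # stand-in for one embedding
b = rng.normal(size=384).astype("float32")  # stand-in for another embedding

# Cosine similarity computed directly from the raw vectors.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product after L2 normalization, which is what IndexFlatIP computes
# when the stored vectors and the query are unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = float(a_unit @ b_unit)

print(abs(cosine - inner) < 1e-5)
```

The two values agree up to float32 rounding, which is why normalizing once at encoding time is enough.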

In [5]:
!pip3 install sentence-transformers



In [6]:
from sentence_transformers import SentenceTransformer

model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)


In [7]:
embeddings = model.encode(
    texts,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
)
embeddings.shape

Batches:   0%|          | 0/8165 [00:00<?, ?it/s]

(522513, 384)
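
The batch count and the size of the resulting array follow directly from the numbers above. This back-of-the-envelope check assumes float32 storage (4 bytes per value), which matches the cast applied before saving.

```python
import math

n_recipes = 522513       # rows in df_emb
dim = 384                # embedding size of all-MiniLM-L6-v2
batch_size = 64          # value passed to model.encode
bytes_per_float32 = 4

n_batches = math.ceil(n_recipes / batch_size)
size_bytes = n_recipes * dim * bytes_per_float32

print(n_batches)                     # 8165
print(round(size_bytes / 2**20, 1))  # 765.4 (MiB)
```

A flat FAISS index stores the vectors uncompressed, so the index should need roughly the same amount of memory again.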

In [8]:
os.makedirs("../embeddings", exist_ok=True)
embeddings = embeddings.astype("float32")
np.save("../embeddings/recipe_embeddings.npy", embeddings)

id_mapping = df_emb[["recipe_id"]].reset_index(drop=True)
id_mapping.to_csv("../embeddings/id_mapping.csv", index=False)
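
Search results are mapped back to recipes by row position, so the saved embeddings and `id_mapping.csv` must stay aligned. Below is a minimal sketch of a consistency check that could be run after reloading both files. `check_alignment` is a hypothetical helper, not part of the project, and it uses only NumPy and the standard library.

```python
import csv

import numpy as np


def check_alignment(embeddings_path, id_mapping_path):
    """Hypothetical helper: verify that embeddings and recipe IDs line up row by row.

    Returns (number of rows, embedding dimension) or raises ValueError.
    """
    embeddings = np.load(embeddings_path)
    with open(id_mapping_path, newline="") as f:
        recipe_ids = [row["recipe_id"] for row in csv.DictReader(f)]

    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {embeddings.shape}")
    if embeddings.shape[0] != len(recipe_ids):
        raise ValueError(
            f"{embeddings.shape[0]} embeddings but {len(recipe_ids)} recipe IDs"
        )
    return embeddings.shape[0], embeddings.shape[1]


# Intended use with the files written above (not executed here):
# check_alignment("../embeddings/recipe_embeddings.npy", "../embeddings/id_mapping.csv")
```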


In [9]:
!pip3 install faiss-cpu

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [10]:
import faiss

dim = embeddings.shape[1]

# Embeddings are L2-normalized (normalize_embeddings=True), so inner product equals cosine similarity
index = faiss.IndexFlatIP(dim)

# Add all recipe vectors; row i of the index corresponds to row i of id_mapping
index.add(embeddings)

index.ntotal

522513
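
`IndexFlatIP` performs an exhaustive, exact inner-product search. The NumPy sketch below reproduces that behaviour on toy data, which can be handy for sanity-checking the results of `index.search` on a small subset. The function name `top_k_inner_product` and the toy vectors are illustrative assumptions.

```python
import numpy as np


def top_k_inner_product(vectors, query, k=3):
    """Exhaustive inner-product search; returns (scores, row indices), best first."""
    scores = vectors @ query        # one inner product per stored vector
    top = np.argsort(-scores)[:k]   # row indices of the k largest scores
    return scores[top], top


# Toy unit-length vectors standing in for `embeddings`.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
query = np.array([1.0, 0.0], dtype="float32")

scores, rows = top_k_inner_product(toy, query, k=2)
print(rows.tolist())  # [0, 2]
```

With the real index, the returned row indices would be looked up in `id_mapping` to recover the corresponding `recipe_id` values.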

In [14]:
faiss.write_index(index, "../embeddings/faiss_index.bin")

In [17]:
import os
import sys

PROJECT_ROOT = os.path.abspath("..")
if PROJECT_ROOT not in sys.path:
    sys.path.append(PROJECT_ROOT)

print("PROJECT_ROOT: ", PROJECT_ROOT)

PROJECT_ROOT:  /Users/biancaleoveanu/CookMate-Recipe-Generator


In [18]:
from rag_pipeline.search import search_recipes

results = search_recipes(
    ingredients=["tomato", "pasta", "garlic"],
    diet="vegetarian",
    cuisine="Italian",
    k=3
)

results

[{'recipe_id': 115170,
  'title': 'Italian Zucchini',
  'ingredients_list': ['diced tomato',
   'dried oregano',
   'dried basil',
   'onion',
   'garlic clove',
   'zucchini',
   'salt'],
  'steps_list': ['Place all ingredients except for zucchini and salt in a saucepan.',
   'Cook over medium-high heat, stirring occasionally, until onion is tender.',
   'Add zucchini and cook until zucchini is just barely tender.  Do not overcook zucchini. Salt to taste.  Serve.'],
  'calories': 46.7,
  'fat': 0.4,
  'carbs': 10.6,
  'protein': 2.3},
 {'recipe_id': 266167,
  'title': 'Vegetables Italiana',
  'ingredients_list': ['zucchini', 'carrot', 'olive oil'],
  'steps_list': ['Saute vegetables in olive oil over medium-high heat until tender.',
   'Sprinkle Italian seasoning over vegetables.',
   'Enjoy!'],
  'calories': 52.4,
  'fat': 2.5,
  'carbs': 7.3,
  'protein': 1.3},
 {'recipe_id': 497610,
  'title': 'Italian Dinner Salad (Main Dish)',
  'ingredients_list': ['plum tomato',
   'mushrooms',

In [19]:
for r in results:
    print(r["title"])
    print("  ingredients:", ", ".join(r["ingredients_list"][:8]))
    print()

Italian Zucchini
  ingredients: diced tomato, dried oregano, dried basil, onion, garlic clove, zucchini, salt

Vegetables Italiana
  ingredients: zucchini, carrot, olive oil

Italian Dinner Salad (Main Dish)
  ingredients: plum tomato, mushrooms, zucchini, red bell pepper, mozzarella cheese, cider vinegar, olive oil, chicken broth
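
Semantic similarity does not guarantee that a result actually contains the requested ingredients. A quick way to inspect this is to count which query ingredients appear in each result. The sketch below uses plain substring matching, and `ingredient_overlap` is a hypothetical helper that is not part of `rag_pipeline`.

```python
def ingredient_overlap(query_ingredients, recipe_ingredients):
    """Return the query ingredients found (as substrings) in the recipe's ingredients."""
    recipe_lower = [item.lower() for item in recipe_ingredients]
    return [
        q for q in query_ingredients
        if any(q.lower() in item for item in recipe_lower)
    ]


query = ["tomato", "pasta", "garlic"]

# Ingredient lists copied from the first two results above.
italian_zucchini = ["diced tomato", "dried oregano", "dried basil", "onion",
                    "garlic clove", "zucchini", "salt"]
vegetables_italiana = ["zucchini", "carrot", "olive oil"]

print(ingredient_overlap(query, italian_zucchini))     # ['tomato', 'garlic']
print(ingredient_overlap(query, vegetables_italiana))  # []
```

Applied to `results`, the same check would be `ingredient_overlap(["tomato", "pasta", "garlic"], r["ingredients_list"])` for each `r`.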

