## 04 — Pantry-Based Recipe Recommender (with Categories)

This notebook builds the recommendation engine for AppetIte.

Goals:

- Input: user pantry ingredients (text)
- Optional: requested category (e.g., "healthy", "quick", "dinner")
- Output: top-k recommended recipes from our dataset

We will:

1. Load the categorized recipe dataset:
   - `data/final/appetite_with_categories.csv`
2. Normalize text and tokenize ingredients
3. Build recipe embeddings using a Sentence-Transformers model
4. Save embeddings for reuse
5. Implement a hybrid scoring function:
   - ingredient overlap score
   - embedding similarity score
   - category filtering
6. Expose a `recommend_recipes()` function that we can later reuse in FastAPI.

In [1]:
!pip install sentence-transformers scikit-learn --quiet

In [2]:
import os
import re
import numpy as np
import pandas as pd

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
FINAL_DIR = "data/final"
INPUT_FILE = os.path.join(FINAL_DIR, "appetite_with_categories.csv")

EMB_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
EMB_FILE = os.path.join(FINAL_DIR, "recipe_embeddings.npy")
META_FILE = os.path.join(FINAL_DIR, "recipe_metadata.csv")

INPUT_FILE, EMB_FILE, META_FILE

('data/final/appetite_with_categories.csv',
 'data/final/recipe_embeddings.npy',
 'data/final/recipe_metadata.csv')

### Load Categorized Recipe Dataset

We load:

- `Title`
- `ingredients_text`
- `target_text`
- `categories`

Categories are in a `|`-separated string, e.g.:

- `"healthy|vegetarian|quick"`

In [4]:
df = pd.read_csv(INPUT_FILE)
print("Shape:", df.shape)
df.head()

Shape: (13495, 5)


Unnamed: 0,Title,ingredients_text,target_text,Image_Name,categories
0,Miso-Butter Roast Chicken With Acorn Squash Pa...,"1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sal...",Title: Miso-Butter Roast Chicken With Acorn Sq...,miso-butter-roast-chicken-acorn-squash-panzanella,budget_friendly|dinner|high_protein|lunch|quick
1,Crispy Salt and Pepper Potatoes,"2 large egg whites, 1 pound new potatoes (abou...",Title: Crispy Salt and Pepper Potatoes\nInstru...,crispy-salt-and-pepper-potatoes-dan-kluger,budget_friendly|dinner|high_protein|lunch|quic...
2,Thanksgiving Mac and Cheese,"1 cup evaporated milk, 1 cup whole milk, 1 tsp...",Title: Thanksgiving Mac and Cheese\nInstructio...,thanksgiving-mac-and-cheese-erick-williams,indulgent|vegetarian
3,Italian Sausage and Bread Stuffing,"1 (¾- to 1-pound) round Italian loaf, cut into...",Title: Italian Sausage and Bread Stuffing\nIns...,italian-sausage-and-bread-stuffing-240559,high_protein|indulgent|lunch|quick
4,Newton's Law,"1 teaspoon dark brown sugar, 1 teaspoon hot wa...",Title: Newton's Law\nInstructions: Stir togeth...,newtons-law-apple-bourbon-cocktail,indulgent|vegetarian


### Normalize Text & Tokenize Ingredients

We will:

- lowercase ingredients
- split into simple word tokens for overlap scoring
- keep `categories` as a list of strings

In [5]:
def normalize_text(x):
    if pd.isna(x):
        return ""
    return str(x).strip().lower()

df["ingredients_norm"] = df["ingredients_text"].apply(normalize_text)
df["target_norm"] = df["target_text"].apply(normalize_text)
df["title_norm"] = df["Title"].apply(normalize_text)

WORD_SPLIT_RE = re.compile(r"[,\s;:\(\)\[\]\.\-]+")

def to_word_set(text: str):
    if not isinstance(text, str):
        return set()
    words = [w.strip() for w in WORD_SPLIT_RE.split(text.lower()) if w.strip()]
    return set(words)

df["ingredients_words"] = df["ingredients_norm"].apply(to_word_set)
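
To see what this tokenizer actually produces, here is a small standalone check. It redefines the same regex and function so it runs on its own; the sample ingredient line is made up for illustration and is not taken from the dataset.

```python
import re

# Same pattern as above: split on commas, whitespace, and common punctuation.
WORD_SPLIT_RE = re.compile(r"[,\s;:\(\)\[\]\.\-]+")

def to_word_set(text):
    """Return the set of lowercase tokens in `text` (empty set for non-strings)."""
    if not isinstance(text, str):
        return set()
    return {w for w in WORD_SPLIT_RE.split(text.lower()) if w}

# Hypothetical ingredient line, used only to illustrate the splitting rules.
sample = "2 Tbsp. extra-virgin olive oil, 1 medium onion (chopped)"
tokens = to_word_set(sample)
print(sorted(tokens))
```

Quantities and units (`"2"`, `"tbsp"`) become tokens too, and multi-word ingredients such as "olive oil" are split into separate words, so the overlap score later compares words rather than whole ingredients.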

In [6]:
def parse_categories(cat_str):
    if not isinstance(cat_str, str) or not cat_str.strip():
        return []
    return [c.strip() for c in cat_str.split("|") if c.strip()]

df["categories_list"] = df["categories"].apply(parse_categories)

df[["Title", "ingredients_text", "categories", "categories_list"]].head()

Unnamed: 0,Title,ingredients_text,categories,categories_list
0,Miso-Butter Roast Chicken With Acorn Squash Pa...,"1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sal...",budget_friendly|dinner|high_protein|lunch|quick,"[budget_friendly, dinner, high_protein, lunch,..."
1,Crispy Salt and Pepper Potatoes,"2 large egg whites, 1 pound new potatoes (abou...",budget_friendly|dinner|high_protein|lunch|quic...,"[budget_friendly, dinner, high_protein, lunch,..."
2,Thanksgiving Mac and Cheese,"1 cup evaporated milk, 1 cup whole milk, 1 tsp...",indulgent|vegetarian,"[indulgent, vegetarian]"
3,Italian Sausage and Bread Stuffing,"1 (¾- to 1-pound) round Italian loaf, cut into...",high_protein|indulgent|lunch|quick,"[high_protein, indulgent, lunch, quick]"
4,Newton's Law,"1 teaspoon dark brown sugar, 1 teaspoon hot wa...",indulgent|vegetarian,"[indulgent, vegetarian]"


### Build Recipe Embeddings

We use a lightweight Sentence-Transformers model:

- `all-MiniLM-L6-v2` (fast, good quality, and light enough to run on an Apple M2 laptop)

For each recipe, we build a single embedding from:

> `"Title: {title} Ingredients: {ingredients_text}"`

(We skip full instructions for now to keep embeddings smaller and focused on ingredients.)

In [7]:
embed_model = SentenceTransformer(EMB_MODEL_NAME)

In [8]:
def build_embedding_text(row):
    title = row["title_norm"]
    ing = row["ingredients_norm"]
    return f"Title: {title} Ingredients: {ing}"

df["embed_text"] = df.apply(build_embedding_text, axis=1)

df[["embed_text"]].head(3)

Unnamed: 0,embed_text
0,Title: miso-butter roast chicken with acorn sq...
1,Title: crispy salt and pepper potatoes Ingredi...
2,Title: thanksgiving mac and cheese Ingredients...


In [9]:
if os.path.exists(EMB_FILE) and os.path.exists(META_FILE):
    print("Embeddings already exist, loading from disk...")
    recipe_embeddings = np.load(EMB_FILE)
    meta_df = pd.read_csv(META_FILE)
else:
    print("Computing recipe embeddings...")
    texts = df["embed_text"].tolist()
    recipe_embeddings = embed_model.encode(texts, show_progress_bar=True)
    recipe_embeddings = np.array(recipe_embeddings)

    np.save(EMB_FILE, recipe_embeddings)
    meta_df = df[["Title", "ingredients_text", "target_text", "categories"]].copy()
    meta_df.to_csv(META_FILE, index=False)

    print("Saved embeddings to:", EMB_FILE)
    print("Saved metadata to:", META_FILE)

recipe_embeddings.shape

Computing recipe embeddings...


Saved embeddings to: data/final/recipe_embeddings.npy
Saved metadata to: data/final/recipe_metadata.csv


(13495, 384)
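
Because the cell above loads cached embeddings whenever the files exist, a cache built from an older CSV could silently fall out of sync with `df`. The following is a minimal sketch of a guard, assuming that a row-count mismatch is a good enough staleness signal. `load_embeddings_if_fresh` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile
import numpy as np

def load_embeddings_if_fresh(path, expected_rows):
    """Load cached embeddings, or return None if the file is missing or stale."""
    if not os.path.exists(path):
        return None
    emb = np.load(path)
    if emb.shape[0] != expected_rows:
        # The cache was built from a different version of the dataset.
        return None
    return emb

# Demo with a throwaway file standing in for recipe_embeddings.npy.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_embeddings.npy")
    np.save(demo_path, np.zeros((3, 4), dtype=np.float32))
    fresh = load_embeddings_if_fresh(demo_path, expected_rows=3)
    stale = load_embeddings_if_fresh(demo_path, expected_rows=5)
    missing = load_embeddings_if_fresh(os.path.join(tmp, "nope.npy"), expected_rows=3)
```

In the notebook this would be called with `EMB_FILE` and `len(df)`, falling back to recomputing the embeddings when it returns `None`.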

### Define Recommendation Scoring

We combine two signals:

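1. **Ingredient overlap score**: the fraction of pantry words that also appear in the recipe's ingredient words (pantry coverage rather than a true Jaccard index, because the denominator is the pantry size only)
2. **Embedding similarity**: cosine similarity between the pantry embedding and each recipe embedding

Both signals are min-max normalized over the candidate recipes before they are combined:

Final score = α * IngredientOverlap + β * EmbeddingSimilarity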

Where:
- α = 0.6 (more weight on actual overlapping ingredients)
- β = 0.4
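
As a worked example of the weighted sum, the sketch below combines made-up scores for three candidate recipes. It applies the same min-max rescaling that `recommend_recipes` uses later in this notebook; the numbers are purely illustrative.

```python
import numpy as np

def min_max_norm(x):
    """Rescale values to [0, 1]; return zeros when all values are equal."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if np.isclose(span, 0.0):
        return np.zeros_like(x)
    return (x - x.min()) / span

# Made-up raw scores for three candidate recipes.
overlap = np.array([0.25, 0.75, 0.50])
cosine = np.array([0.40, 0.50, 0.60])

alpha, beta = 0.6, 0.4
final = alpha * min_max_norm(overlap) + beta * min_max_norm(cosine)
print(final)  # approximately [0.0, 0.8, 0.7]
```

The second recipe wins because its overlap advantage carries more weight than the third recipe's higher cosine similarity. Since both signals are rescaled within the candidate set, final scores are relative to that set and are not comparable across different queries.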

In [10]:
def ingredient_overlap_score(pantry_words, recipe_words):
    if not pantry_words:
        return 0.0
    inter = pantry_words & recipe_words
    return len(inter) / float(len(pantry_words))
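
The function above divides by the pantry size only, so it measures how much of the pantry a recipe uses rather than a symmetric Jaccard index. The sketch below contrasts the two on made-up word sets; `pantry_coverage` mirrors the function above and `jaccard` is added only for comparison.

```python
def pantry_coverage(pantry_words, recipe_words):
    """Fraction of pantry words found in the recipe."""
    if not pantry_words:
        return 0.0
    return len(pantry_words & recipe_words) / len(pantry_words)

def jaccard(pantry_words, recipe_words):
    """Standard Jaccard index: intersection size divided by union size."""
    union = pantry_words | recipe_words
    if not union:
        return 0.0
    return len(pantry_words & recipe_words) / len(union)

# Made-up word sets: both recipes contain every pantry word.
pantry = {"rice", "onion", "tomato"}
short_recipe = {"rice", "onion", "tomato", "salt"}
long_recipe = short_recipe | {"saffron", "stock", "pepper", "egg", "parsley", "garlic"}

print(pantry_coverage(pantry, short_recipe), pantry_coverage(pantry, long_recipe))  # 1.0 1.0
print(jaccard(pantry, short_recipe), jaccard(pantry, long_recipe))                  # 0.75 0.3
```

Coverage does not penalize recipes that need many extra ingredients, while Jaccard does. Which behavior is preferable depends on whether missing ingredients should count against a recommendation.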

In [11]:
def build_pantry_embedding(pantry_ingredients: str):
    text = normalize_text(pantry_ingredients)
    embed_text = f"Ingredients: {text}"
    emb = embed_model.encode([embed_text])
    return emb[0]
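
The pantry vector is later compared against every recipe vector with cosine similarity. As a rough illustration of what that comparison computes, here is a plain NumPy sketch on toy 2-D vectors; `cosine_to_all` is a hypothetical helper, and the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_to_all(query, matrix):
    """Cosine similarity between a 1-D query vector and every row of a 2-D matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    denom = np.linalg.norm(query) * np.linalg.norm(matrix, axis=1)
    # Guard against zero vectors: divide by 1 there and report similarity 0.
    safe = np.where(denom == 0.0, 1.0, denom)
    sims = (matrix @ query) / safe
    return np.where(denom == 0.0, 0.0, sims)

# Toy 2-D vectors standing in for the 384-dimensional embeddings.
pantry_vec = np.array([1.0, 0.0])
recipe_vecs = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
sims = cosine_to_all(pantry_vec, recipe_vecs)
print(sims)  # approximately [1.0, 0.0, 0.707]
```

Vector length does not matter, only direction: the first recipe vector is twice as long as the pantry vector but still scores 1.0.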

### Category Filtering

If the user specifies a category (e.g., `"healthy"`, `"dessert"`),
we only consider recipes whose `categories_list` contains that label.

If `category=None`, we consider all recipes.

In [12]:
def filter_by_category(df_local, category: str | None):
    if category is None or not str(category).strip():
        return df_local.index.to_list()
    category = str(category).strip().lower()
    mask = df_local["categories_list"].apply(lambda cats: category in [c.lower() for c in cats])
    return df_local[mask].index.to_list()
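
The function above returns index labels, which the recommender then uses as row positions. That works while `df` keeps its default index. The sketch below shows the same matching rule on plain Python lists and returns positions directly; `positions_with_category` is a hypothetical helper and the category lists are toy data.

```python
def positions_with_category(categories_lists, category):
    """Return positions of entries whose category list contains `category`.

    A missing or blank category selects every position.
    """
    if category is None or not str(category).strip():
        return list(range(len(categories_lists)))
    wanted = str(category).strip().lower()
    return [
        i for i, cats in enumerate(categories_lists)
        if wanted in (c.lower() for c in cats)
    ]

# Toy category lists in the same shape as the categories_list column.
demo = [
    ["budget_friendly", "dinner", "quick"],
    ["indulgent", "vegetarian"],
    ["high_protein", "indulgent", "lunch", "quick"],
]
print(positions_with_category(demo, "Quick "))     # [0, 2]
print(positions_with_category(demo, "indulgent"))  # [1, 2]
print(positions_with_category(demo, "dessert"))    # []
print(positions_with_category(demo, None))         # [0, 1, 2]
```

Matching is case-insensitive and ignores surrounding whitespace, and an unknown category yields an empty list, which is the case `recommend_recipes` reports as "No recipes found".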

### Main Recommender Function

`recommend_recipes(pantry_ingredients, top_k=5, category=None)`

Steps:

1. Normalize pantry ingredients
2. Compute pantry word set
3. Filter recipes by category (if given)
4. Compute ingredient overlap scores
5. Compute pantry embedding
6. Compute cosine similarity between pantry emb & recipe embeddings
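7. Min-max normalize both scores across the candidates, then combine them: `final_score = 0.6 * overlap_norm + 0.4 * cosine_norm`
8. Return top-k recipes with:

   - title
   - ingredients
   - categories
   - final_score
   - overlap_score and cosine_score (raw, un-normalized values)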

In [13]:
def recommend_recipes(pantry_ingredients: str, top_k: int = 5, category: str | None = None):
    pantry_norm = normalize_text(pantry_ingredients)
    pantry_words = to_word_set(pantry_norm)

    if not pantry_words:
        print("Warning: pantry ingredients empty or invalid, using only embeddings.")

    candidate_idx = filter_by_category(df, category)

    if not candidate_idx:
        print(f"No recipes found for category='{category}'. Returning empty list.")
        return []

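    # candidate_idx holds index labels; they double as row positions here
    # only because df still has its default RangeIndex.
    cand_embeddings = recipe_embeddings[candidate_idx]
    cand_df = df.iloc[candidate_idx].reset_index(drop=True)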

    # Overlap score for every candidate recipe.
    overlap_scores = np.array([
        ingredient_overlap_score(pantry_words, words)
        for words in cand_df["ingredients_words"]
    ])

    pantry_emb = build_pantry_embedding(pantry_ingredients)
    pantry_emb = pantry_emb.reshape(1, -1)
    cos_sims = cosine_similarity(pantry_emb, cand_embeddings)[0]

    def min_max_norm(x):
        if np.allclose(x.max(), x.min()):
            return np.zeros_like(x)
        return (x - x.min()) / (x.max() - x.min())

    overlap_norm = min_max_norm(overlap_scores)
    cos_norm = min_max_norm(cos_sims)

    alpha = 0.6  # weight on ingredient overlap
    beta = 0.4   # weight on embedding similarity
    final_scores = alpha * overlap_norm + beta * cos_norm

    cand_df = cand_df.copy()
    cand_df["overlap_score"] = overlap_scores
    cand_df["cosine_score"] = cos_sims
    cand_df["final_score"] = final_scores

    cand_df = cand_df.sort_values("final_score", ascending=False)

    top = cand_df.head(top_k)

    results = []
    for _, row in top.iterrows():
        results.append({
            "title": row["Title"],
            "ingredients_text": row["ingredients_text"],
            "categories": row["categories"],
            "final_score": float(row["final_score"]),
            "overlap_score": float(row["overlap_score"]),
            "cosine_score": float(row["cosine_score"])
        })

    return results
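
`recommend_recipes` sorts the whole candidate frame and then takes the head, which is cheap at 13,495 recipes. For a larger catalog, one option is to select the top-k indices without a full sort. The following is a sketch of that idea using `np.argpartition`; `top_k_indices` is a hypothetical helper and the scores are made up.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    scores = np.asarray(scores, dtype=float)
    k = min(k, scores.size)
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition places the k largest values in the last k slots (unordered).
    part = np.argpartition(scores, scores.size - k)[-k:]
    # Sort only those k candidates by descending score.
    return part[np.argsort(scores[part])[::-1]]

# Made-up final scores for five candidates.
demo_scores = [0.2, 0.9, 0.1, 0.7, 0.5]
print(top_k_indices(demo_scores, 3).tolist())  # [1, 3, 4]
```

The returned positions could index both the candidate frame and the score arrays, so only the k selected rows would need to be turned into result dictionaries.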

### Trying the Recommender

We test with a few example pantries:

- `"olive oil, onion, tomato, rice"`
- `"chicken, yogurt, garlic, lemon"`
- `"chocolate, butter, sugar"`

We can also specify categories like:

- `"healthy"`
- `"indulgent"`
- `"vegetarian"`
- `"dessert"`
- `"dinner"`

In [15]:
pantry = "olive oil, onion, tomato, rice"
recs = recommend_recipes(pantry_ingredients=pantry, top_k=5, category=None)

print("PANTRY:", pantry)
for i, r in enumerate(recs, 1):
    print(f"\n#{i}: {r['title']}")
    print("Categories:", r["categories"])
    print("Score:", round(r["final_score"], 3))
    print("Ingredients:", r["ingredients_text"][:200], "...")

PANTRY: olive oil, onion, tomato, rice

#1: Tomato and Parmesan Risotto
Categories: budget_friendly|high_protein|quick
Score: 0.931
Ingredients: 5 cups low-sodium chicken broth, 2 Tbsp. extra-virgin olive oil, plus more for drizzling, 1 medium onion, finely chopped, 3 garlic cloves, thinly sliced, 1 Tbsp. tomato paste, 2 cups cherry tomatoes,  ...

#2: Paella with Tomatoes and Eggs
Categories: dinner|healthy|high_protein|lunch|quick|vegetarian
Score: 0.93
Ingredients: 3 1/2 cups vegetable stock or water, plus more if needed, Large pinch saffron threads (optional), 1 pound fresh tomatoes, cored, cut into thick wedges, and seeded, Salt and pepper, 4 tablespoons olive ...

#3: Pumpkin Shrimp Curry
Categories: budget_friendly|dinner|high_protein|quick
Score: 0.924
Ingredients: 2 tablespoons olive oil, 1 cup sliced onion, 1 tablespoon minced ginger, 1 tablespoon minced garlic, 1 chopped plum tomato, 1 15-ounce can pumpkin purée, 2 cups vegetable broth, 1 cup unsweetened coco ...

#4: Stuffe

In [16]:
pantry = "chicken, yogurt, garlic, lemon"
recs = recommend_recipes(pantry_ingredients=pantry, top_k=5, category="healthy")

print("PANTRY:", pantry, "| Category: healthy")
for i, r in enumerate(recs, 1):
    print(f"\n#{i}: {r['title']}")
    print("Categories:", r["categories"])
    print("Score:", round(r["final_score"], 3))
    print("Ingredients:", r["ingredients_text"][:200], "...")

PANTRY: chicken, yogurt, garlic, lemon | Category: healthy

#1: Shawarma-Spiced Chicken Pita with Tahini-Yogurt Sauce
Categories: dinner|healthy|high_protein|lunch|quick
Score: 0.938
Ingredients: 1/2 teaspoon ground cumin, 1/4 teaspoon ground coriander, 1/4 teaspoon paprika, 1/8 teaspoon cayenne pepper, 1/8 teaspoon ground cinnamon, 4 tablespoons olive oil, divided, 1 1/2 teaspoons kosher salt ...

#2: Chicken-Lentil Soup With Jammy Onions
Categories: budget_friendly|healthy|high_protein|quick
Score: 0.917
Ingredients: 4 skin-on, bone-in chicken thighs, patted dry, Kosher salt, ¼ cup extra-virgin olive oil, 1 large onion, thinly sliced, 6 garlic cloves, thinly sliced, 1 cup red lentils, rinsed, 1 tsp. ground turmeri ...

#3: Chicken Zucchini Burgers
Categories: budget_friendly|healthy|high_protein|lunch
Score: 0.906
Ingredients: 1 cup plain Greek yogurt, 2 teaspoons lemon zest, 2 tablespoons fresh lemon juice, 2 garlic cloves, minced, 1 tablespoon extra-virgin olive oil, ½ teaspoon sea

In [20]:
pantry = "chocolate, butter, sugar"
recs = recommend_recipes(pantry_ingredients=pantry, top_k=5, category="dessert")

print("PANTRY:", pantry, "| Category: dessert")
for i, r in enumerate(recs, 1):
    print(f"\n#{i}: {r['title']}")
    print("Categories:", r["categories"])
    print("Score:", round(r["final_score"], 3))
    print("Ingredients:", r["ingredients_text"][:200], "...")

PANTRY: chocolate, butter, sugar | Category: dessert

#1: Chocolate-Cherry Tart
Categories: budget_friendly|dessert|high_protein|indulgent|lunch|vegetarian
Score: 0.985
Ingredients: 1/2 cup water, 1/2 cup sugar, 1 cup (packed) dried Bing (sweet) cherries, 1/3 cup kirsch (clear cherry brandy), 1 cup (2 sticks) unsalted butter, room temperature, 1/2 cup sugar, 1 large egg, 1 teaspo ...

#2: Chocolate Chunk Share Cookie
Categories: budget_friendly|dessert|high_protein|indulgent|quick|vegetarian
Score: 0.981
Ingredients: 6 tablespoons unsalted butter, softened, 3/4 cup (135g) brown sugar, 1/4 cup (55g) superfine sugar, 1 egg, 2 teaspoons vanilla extract, 1 cup (150g) plain (all-purpose) flour, 1/8 teaspoon baking soda ...

#3: Double-Chocolate Sandwich Cookies
Categories: budget_friendly|dessert|high_protein|indulgent|lunch|quick|vegetarian
Score: 0.971
Ingredients: 2 cups all-purpose flour, 1/2 cup unsweetened Dutch-process cocoa powder, 1/2 teaspoon baking powder, 1/2 teaspoon salt, 2 st

In [22]:
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = df["ingredients_text"].fillna("")

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# Make sure the output directory exists before saving.
os.makedirs("model", exist_ok=True)
joblib.dump(vectorizer, "model/recommender_vectorizer.pkl")

print("Saved recommender_vectorizer.pkl")

Saved recommender_vectorizer.pkl
