## 03 — Category Tagging for AppetIte Recipes

This notebook assigns high-level categories to each recipe in the AppetIte dataset.

We will:

- Load the cleaned recipe dataset
- Define simple rule-based category functions, using:
  - ingredients
  - instructions text
- Assign categories such as:
  - quick
  - healthy
  - indulgent
  - high_protein
  - budget_friendly
  - vegetarian
  - vegan
  - breakfast / lunch / dinner / dessert
- Save the final dataset with a `categories` column to:

`data/final/appetite_with_categories.csv`

These categories are used later by:

- the recommender system
- the FastAPI backend (filtering by category)
- the Streamlit UI (category dropdown)
- category-aware generation prompts

In [1]:
import os
import re
import pandas as pd
from collections import Counter

PROCESSED_DIR = "data/processed"
FINAL_DIR = "data/final"
os.makedirs(FINAL_DIR, exist_ok=True)

CLEAN_FILE = os.path.join(PROCESSED_DIR, "appetite_clean_full.csv")

CLEAN_FILE

'data/processed/appetite_clean_full.csv'

### Load Cleaned AppetIte Dataset

We work from the cleaned file created in the preprocessing notebook:

- `data/processed/appetite_clean_full.csv`

In [2]:
df = pd.read_csv(CLEAN_FILE)
print("Shape:", df.shape)
df.head()

Shape: (13495, 4)


Unnamed: 0,Title,ingredients_text,target_text,Image_Name
0,Miso-Butter Roast Chicken With Acorn Squash Pa...,"1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sal...",Title: Miso-Butter Roast Chicken With Acorn Sq...,miso-butter-roast-chicken-acorn-squash-panzanella
1,Crispy Salt and Pepper Potatoes,"2 large egg whites, 1 pound new potatoes (abou...",Title: Crispy Salt and Pepper Potatoes\nInstru...,crispy-salt-and-pepper-potatoes-dan-kluger
2,Thanksgiving Mac and Cheese,"1 cup evaporated milk, 1 cup whole milk, 1 tsp...",Title: Thanksgiving Mac and Cheese\nInstructio...,thanksgiving-mac-and-cheese-erick-williams
3,Italian Sausage and Bread Stuffing,"1 (¾- to 1-pound) round Italian loaf, cut into...",Title: Italian Sausage and Bread Stuffing\nIns...,italian-sausage-and-bread-stuffing-240559
4,Newton's Law,"1 teaspoon dark brown sugar, 1 teaspoon hot wa...",Title: Newton's Law\nInstructions: Stir togeth...,newtons-law-apple-bourbon-cocktail


### Basic Text Normalization

To make rule-based tagging easier, we:

- lowercase the ingredient and target texts
- strip leading and trailing whitespace
- replace missing values with an empty string

In [3]:
def normalize_text(x):
    if pd.isna(x):
        return ""
    return str(x).strip().lower()

df["ingredients_norm"] = df["ingredients_text"].apply(normalize_text)
df["target_norm"] = df["target_text"].apply(normalize_text)

df[["Title", "ingredients_norm", "target_norm"]].head()

Unnamed: 0,Title,ingredients_norm,target_norm
0,Miso-Butter Roast Chicken With Acorn Squash Pa...,"1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sal...",title: miso-butter roast chicken with acorn sq...
1,Crispy Salt and Pepper Potatoes,"2 large egg whites, 1 pound new potatoes (abou...",title: crispy salt and pepper potatoes\ninstru...
2,Thanksgiving Mac and Cheese,"1 cup evaporated milk, 1 cup whole milk, 1 tsp...",title: thanksgiving mac and cheese\ninstructio...
3,Italian Sausage and Bread Stuffing,"1 (¾- to 1-pound) round italian loaf, cut into...",title: italian sausage and bread stuffing\nins...
4,Newton's Law,"1 teaspoon dark brown sugar, 1 teaspoon hot wa...",title: newton's law\ninstructions: stir togeth...


### Define Category Rules

We use **simple heuristics** and ingredient-based rules to assign categories.

Categories:

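- `quick`: the title or recipe text mentions "quick", "easy", or a 15/20-minute time
- `healthy`: at least one healthy keyword (salad, quinoa, kale, ...), no frying keyword, and at most one indulgent keyword
- `indulgent`: at least two indulgent keywords (butter, cream, cheese, chocolate, sugar, ...)
- `high_protein`: meat, fish, eggs, tofu, beans, lentils, chickpeas, paneer, tempeh
- `budget_friendly`: at least two cheap staples (rice, potatoes, beans, lentils, eggs, ...) and no expensive ingredient (truffle, saffron, lobster, steak, prosciutto)
- `vegetarian`: no meat or fish (eggs/dairy allowed)
- `vegan`: no meat, fish, eggs, or dairy
- `breakfast`, `lunch`, `dinner`, `dessert`: keyword-based meal-type tagging (a recipe can receive several of these, or none)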

In [4]:
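# Keyword sets for different concepts.
# Note: matching is done against single-word tokens (see to_word_set below), so
# multi-word or hyphenated entries such as "ice cream" or "deep-fry" never match
# as written, and plural forms such as "cookies" do not match "cookie".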

MEAT_WORDS = {
    "chicken", "beef", "pork", "bacon", "ham", "lamb", "turkey",
    "sausage", "prosciutto", "salami"
}

FISH_WORDS = {
    "fish", "salmon", "tuna", "shrimp", "prawn", "crab", "lobster", "cod", "trout"
}

DAIRY_WORDS = {
    "milk", "butter", "cheese", "yogurt", "cream", "whipped cream", "parmesan", "mozzarella"
}

EGG_WORDS = {
    "egg", "eggs", "egg yolk", "egg white"
}

PROTEIN_WORDS = MEAT_WORDS | FISH_WORDS | EGG_WORDS | {
    "tofu", "lentil", "lentils", "beans", "black beans", "kidney beans",
    "chickpeas", "garbanzo", "paneer", "tempeh", "edamame", "protein powder"
}

INDULGENT_WORDS = {
    "chocolate", "brownie", "fudge", "caramel",
    "butter", "cream", "cheese", "bacon",
    "sugar", "syrup", "ice cream", "frosting"
}

HEALTHY_WORDS = {
    "salad", "quinoa", "oats", "oatmeal", "kale", "broccoli",
    "spinach", "avocado", "brown rice", "lentils", "beans", "chickpeas",
    "olive oil", "greek yogurt"
}

FRY_WORDS = {
    "deep-fry", "deep fry", "fried", "frying"
}

BUDGET_STAPLES = {
    "rice", "potato", "potatoes", "pasta", "noodles", "lentils", "beans",
    "cabbage", "carrot", "onion", "egg", "eggs", "flour", "bread"
}

BREAKFAST_WORDS = {
    "pancake", "toast", "omelette", "omelet", "cereal", "oatmeal",
    "breakfast", "granola", "smoothie", "waffle"
}

LUNCH_WORDS = {
    "sandwich", "wrap", "burrito", "salad", "lunch", "bowl"
}

DINNER_WORDS = {
    "stew", "roast", "casserole", "dinner", "lasagna", "curry"
}

DESSERT_WORDS = {
    "cake", "cookie", "brownie", "pudding", "mousse", "ice cream",
    "tart", "pie", "dessert"
}

### Helper Functions

We split the normalized ingredients and instruction text into sets of
single-word tokens, so that each keyword check reduces to a set intersection.

In [5]:
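# Split on commas, whitespace, semicolons, colons, brackets, periods, and hyphens
WORD_SPLIT_RE = re.compile(r"[,\s;:\(\)\[\]\.\-]+")

def to_word_set(text: str) -> set:
    """Return the set of lowercase single-word tokens found in `text`."""
    if not isinstance(text, str):
        return set()
    # re.split can yield empty strings at the edges; drop them
    return {w for w in WORD_SPLIT_RE.split(text.lower()) if w}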

df["ingredients_words"] = df["ingredients_norm"].apply(to_word_set)
df["target_words"] = df["target_norm"].apply(to_word_set)

df[["ingredients_words", "target_words"]].head(3)

Unnamed: 0,ingredients_words,target_words
0,"{acorn, flakes, 1"", cup, medium, extra, onion,...","{rest, acorn, skin, arrange, flakes, instructi..."
1,"{1, kosher, black, in, about, pound, new, pota...","{foamy, of, knife, title, well, instructions, ..."
2,"{sharp, 1, paprika, ½, kosher, smoked, black, ...","{paprika, instructions, remaining, little, med..."
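
Because `WORD_SPLIT_RE` splits on whitespace and hyphens, every token is a single word, so multi-word entries in the keyword sets (for example `"olive oil"` or `"ice cream"`) cannot match through set intersection. The sketch below illustrates the effect and one possible phrase-aware alternative. `contains_keyword` is a hypothetical helper, not part of the notebook, and swapping it in would change the category counts reported later.

```python
import re

WORD_SPLIT_RE = re.compile(r"[,\s;:\(\)\[\]\.\-]+")

def to_word_set(text):
    """Same tokenization as the notebook: lowercase single-word tokens."""
    return {w for w in WORD_SPLIT_RE.split(text.lower()) if w}

def contains_keyword(text, keywords):
    """Hypothetical helper: True if any keyword occurs as a whole word or phrase."""
    text = text.lower()
    for kw in keywords:
        # \b anchors keep "ham" from matching inside "graham"
        if re.search(r"\b" + re.escape(kw) + r"\b", text):
            return True
    return False

sample = "2 tbsp. olive oil, 1 cup graham cracker crumbs"
print({"olive oil"} & to_word_set(sample))      # set(): the phrase is never a token
print(contains_keyword(sample, {"olive oil"}))  # True
print(contains_keyword(sample, {"ham"}))        # False
```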


### Category Detection Rules

We now define functions that take a row (ingredients + text)
and output booleans for each category.

In [6]:
def is_vegetarian(row) -> bool:
    words = row["ingredients_words"]
    if words & MEAT_WORDS:
        return False
    if words & FISH_WORDS:
        return False
    return True

def is_vegan(row) -> bool:
    words = row["ingredients_words"]
    if words & MEAT_WORDS:
        return False
    if words & FISH_WORDS:
        return False
    if words & DAIRY_WORDS:
        return False
    if words & EGG_WORDS:
        return False
    return True

In [7]:
def is_high_protein(row) -> bool:
    words = row["ingredients_words"]
    return len(words & PROTEIN_WORDS) > 0

def is_indulgent(row) -> bool:
    words = row["ingredients_words"] | row["target_words"]
    # If several indulgent keywords present, mark as indulgent
    return len(words & INDULGENT_WORDS) >= 2

def is_healthy(row) -> bool:
    words = row["ingredients_words"] | row["target_words"]
    has_healthy = len(words & HEALTHY_WORDS) > 0
    has_fry = len(words & FRY_WORDS) > 0
    indulgent_hits = len(words & INDULGENT_WORDS)
    if has_healthy and not has_fry and indulgent_hits <= 1:
        return True
    return False

def is_budget_friendly(row) -> bool:
    words = row["ingredients_words"]
    staples_hits = len(words & BUDGET_STAPLES)
    expensive = {"truffle", "saffron", "lobster", "steak", "prosciutto"}
    if staples_hits >= 2 and len(words & expensive) == 0:
        return True
    return False

In [8]:
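def is_quick(row) -> bool:
    # Search the title plus the full recipe text, lowercased
    txt = (row["Title"] if isinstance(row["Title"], str) else "") + " " + row["target_norm"]
    txt = txt.lower()
    # Substring checks: "quick" also matches "quickly", and "easy" matches "greasy"
    if "quick" in txt or "easy" in txt:
        return True
    # Explicit short cooking times
    if "15 min" in txt or "15-minute" in txt or "20 min" in txt or "20-minute" in txt:
        return True
    return False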
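
The substring checks in `is_quick` also fire on words such as "quickly" or "greasy", which may inflate the `quick` count. A minimal sketch of a stricter whole-word variant follows. `QUICK_RE` and `is_quick_strict` are illustrative names, and the pattern is an assumption about what should count as quick, not the notebook's rule.

```python
import re

# Whole-word "quick"/"easy", or a 15/20-minute time such as "15 min" or "20-minute"
QUICK_RE = re.compile(r"\b(?:quick|easy|(?:15|20)[ -]min(?:ute)?s?)\b")

def is_quick_strict(text: str) -> bool:
    """Hypothetical stricter variant of is_quick using word boundaries."""
    return QUICK_RE.search(text.lower()) is not None

print(is_quick_strict("Quick weeknight pasta"))      # True
print(is_quick_strict("Stir quickly until greasy"))  # False
print(is_quick_strict("Ready in 20 minutes"))        # True
```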

In [9]:
def meal_type_tags(row):
    words = row["ingredients_words"] | row["target_words"]
    tags = set()

    if words & BREAKFAST_WORDS:
        tags.add("breakfast")
    if words & LUNCH_WORDS:
        tags.add("lunch")
    if words & DINNER_WORDS:
        tags.add("dinner")
    if words & DESSERT_WORDS:
        tags.add("dessert")

    return tags

### Apply Category Rules

We compute a set of categories per row, then store them as a `|`-separated string:

`"healthy|vegetarian|quick"`

In [10]:
def assign_categories(row):
    cats = set()

    if is_quick(row):
        cats.add("quick")
    if is_healthy(row):
        cats.add("healthy")
    if is_indulgent(row):
        cats.add("indulgent")
    if is_high_protein(row):
        cats.add("high_protein")
    if is_budget_friendly(row):
        cats.add("budget_friendly")
    if is_vegetarian(row):
        cats.add("vegetarian")
    if is_vegan(row):
        cats.add("vegan")

    cats |= meal_type_tags(row)

    # Sorted for a stable tag order; an empty set yields ""
    return "|".join(sorted(cats))

df["categories"] = df.apply(assign_categories, axis=1)
df[["Title", "ingredients_text", "categories"]].head(10)

Unnamed: 0,Title,ingredients_text,categories
0,Miso-Butter Roast Chicken With Acorn Squash Pa...,"1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sal...",budget_friendly|dinner|high_protein|lunch|quick
1,Crispy Salt and Pepper Potatoes,"2 large egg whites, 1 pound new potatoes (abou...",budget_friendly|dinner|high_protein|lunch|quic...
2,Thanksgiving Mac and Cheese,"1 cup evaporated milk, 1 cup whole milk, 1 tsp...",indulgent|vegetarian
3,Italian Sausage and Bread Stuffing,"1 (¾- to 1-pound) round Italian loaf, cut into...",high_protein|indulgent|lunch|quick
4,Newton's Law,"1 teaspoon dark brown sugar, 1 teaspoon hot wa...",indulgent|vegetarian
5,Warm Comfort,"2 chamomile tea bags, 1½ oz. reposado tequila,...",vegan|vegetarian
6,Apples and Oranges,"3 oz. Grand Marnier, 1 oz. Amaro Averna, Small...",vegetarian
7,Turmeric Hot Toddy,"¼ cup granulated sugar, ¾ tsp. ground turmeric...",indulgent|vegan|vegetarian
8,Instant Pot Lamb Haleem,"¾ cup assorted dals (such as chana dal, moong ...",budget_friendly|dinner|high_protein|lunch|quick
9,Spiced Lentil and Caramelized Onion Baked Eggs,"1 (14.5-ounce) can basic lentil soup, like Amy...",budget_friendly|high_protein|vegetarian


### Category Distribution

Let's inspect how many recipes fall into each category.

In [11]:
all_cats = []

for c in df["categories"]:
    if isinstance(c, str) and c.strip():
        all_cats.extend(c.split("|"))

cat_counts = Counter(all_cats)
cat_counts

Counter({'lunch': 9654,
         'vegetarian': 9257,
         'high_protein': 7505,
         'indulgent': 4727,
         'quick': 4462,
         'budget_friendly': 3980,
         'vegan': 3466,
         'healthy': 2187,
         'dessert': 2058,
         'dinner': 1956,
         'breakfast': 1306})

In [12]:
for cat, count in cat_counts.most_common():
    print(f"{cat:15s} : {count}")

lunch           : 9654
vegetarian      : 9257
high_protein    : 7505
indulgent       : 4727
quick           : 4462
budget_friendly : 3980
vegan           : 3466
healthy         : 2187
dessert         : 2058
dinner          : 1956
breakfast       : 1306
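
`lunch` is attached to 9654 of the 13495 recipes, far more than any other meal type. A plausible cause is that `LUNCH_WORDS` contains generic cooking words such as "bowl" and "wrap", which can appear in the instructions of almost any dish. One way to check is to count how many recipes each keyword hits. The sketch below uses made-up word sets; `keyword_hits` is a hypothetical helper, and on the real data it would be fed the per-row union of `ingredients_words` and `target_words`.

```python
from collections import Counter

LUNCH_WORDS = {"sandwich", "wrap", "burrito", "salad", "lunch", "bowl"}

def keyword_hits(word_sets, keywords):
    """Count, for each keyword, how many recipes contain it."""
    hits = Counter()
    for words in word_sets:
        hits.update(words & keywords)
    return hits

# Toy token sets standing in for real recipes (illustrative only)
toy_recipes = [
    {"whisk", "eggs", "in", "a", "bowl"},
    {"wrap", "dough", "in", "plastic"},
    {"toss", "salad", "in", "a", "bowl"},
]

hits = keyword_hits(toy_recipes, LUNCH_WORDS)
for word, count in hits.most_common():
    print(f"{word:10s} : {count}")
```

If one or two generic words account for most hits, restricting meal-type matching to the title or dropping those words from the set may be worth testing.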


### Save Final Dataset With Categories

We save the updated dataset (including categories) to:

`data/final/appetite_with_categories.csv`

In [13]:
OUTPUT_FILE = os.path.join(FINAL_DIR, "appetite_with_categories.csv")

cols_to_save = [
    "Title",
    "ingredients_text",
    "target_text",
    "Image_Name",
    "categories"
]

df[cols_to_save].to_csv(OUTPUT_FILE, index=False)
print("Saved categorized dataset to:", OUTPUT_FILE)

Saved categorized dataset to: data/final/appetite_with_categories.csv
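
Downstream consumers such as the category filter need to turn the `|`-separated string back into individual tags. A minimal sketch follows, assuming a hypothetical helper `has_category` and a small stand-in DataFrame. Note that pandas reads empty CSV fields back as NaN by default, hence the `fillna("")`.

```python
import pandas as pd

# Stand-in for the saved file (illustrative rows only)
recipes = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "categories": ["healthy|vegan|vegetarian", "dinner|high_protein", None],
})

def has_category(frame, category):
    """Hypothetical helper: rows whose categories contain `category` as a whole tag."""
    tags = frame["categories"].fillna("").str.split("|")
    mask = tags.apply(lambda t: category in t)
    return frame[mask]

print(has_category(recipes, "vegan")["Title"].tolist())    # ['A']
print(has_category(recipes, "protein")["Title"].tolist())  # []
```

Matching whole tags rather than substrings keeps a query like `protein` from accidentally selecting `high_protein`.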


### Quick Sanity Check

View a few random recipes and their assigned categories.

In [14]:
df.sample(10)[["Title", "ingredients_text", "categories"]]

Unnamed: 0,Title,ingredients_text,categories
6813,Orange-Blossom-Honey Madeleines,"3/4 cup all-purpose flour, 1/2 teaspoon baking...",budget_friendly|dessert|high_protein|indulgent...
9966,Grilled Citrus Chicken Under a Brick,"1 cup fresh orange juice, 1/3 cup fresh lime j...",high_protein|lunch|quick
3543,Homemade Yellow Cake,"1/3 cup vegetable oil, plus more for pan, All-...",budget_friendly|dessert|high_protein|lunch|veg...
396,Apple Pie Smoothie,"1 Gala, Fuji, or other sweet apple, cored, see...",breakfast|dessert|healthy|vegetarian
11409,Poached Oysters and Artichokes with Champagne ...,2 tablespoons Champagne vinegar or white-wine ...,indulgent|quick|vegetarian
7868,Roulade au Chocolat Pour Julia,1 1/2 cups plus 2 tablespoons granulated sugar...,budget_friendly|dessert|high_protein|indulgent...
9209,Ricotta Cheesecake,3 tablespoons finely crushed amaretti (crisp I...,budget_friendly|dessert|high_protein|indulgent...
12535,Horseradish-Crusted Beef Tenderloin,1 (3 1/2-lb) trimmed center-cut beef tenderloi...,dinner|high_protein
2248,Sweet Potato Fritters with Poached Eggs and Av...,"2 small sweet potatoes, peeled and grated, 3 f...",budget_friendly|high_protein|lunch|vegetarian
1920,Cucumber and Tomato Tzatziki,3 cups plain yogurt (do not use low-fat or non...,lunch|vegetarian
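
The sample includes a poached-oysters recipe tagged `vegetarian`, which is consistent with `FISH_WORDS` not listing oysters. A targeted check can surface such cases more reliably than random sampling. The sketch below assumes a hypothetical `SHELLFISH_WORDS` list and works on plain `(title, categories)` pairs; neither name exists in the notebook.

```python
# Assumed list of seafood terms missing from FISH_WORDS (illustrative, not exhaustive)
SHELLFISH_WORDS = {"oyster", "oysters", "clam", "clams", "scallop", "scallops",
                   "anchovy", "anchovies", "mussel", "mussels"}

def find_suspect_vegetarian(rows):
    """Return titles tagged vegetarian whose title mentions a seafood term."""
    suspects = []
    for title, categories in rows:
        tags = set(categories.split("|"))
        title_words = set(title.lower().replace("-", " ").split())
        if "vegetarian" in tags and title_words & SHELLFISH_WORDS:
            suspects.append(title)
    return suspects

rows = [
    ("Poached Oysters and Artichokes", "indulgent|quick|vegetarian"),
    ("Cucumber and Tomato Tzatziki", "lunch|vegetarian"),
    ("Horseradish-Crusted Beef Tenderloin", "dinner|high_protein"),
]
print(find_suspect_vegetarian(rows))  # ['Poached Oysters and Artichokes']
```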


In [18]:
import pandas as pd

df = pd.read_csv(CLEAN_FILE)
print(df.columns)
df.head()

Index(['Title', 'ingredients_text', 'target_text', 'Image_Name'], dtype='object')


Unnamed: 0,Title,ingredients_text,target_text,Image_Name
0,Miso-Butter Roast Chicken With Acorn Squash Pa...,"1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sal...",Title: Miso-Butter Roast Chicken With Acorn Sq...,miso-butter-roast-chicken-acorn-squash-panzanella
1,Crispy Salt and Pepper Potatoes,"2 large egg whites, 1 pound new potatoes (abou...",Title: Crispy Salt and Pepper Potatoes\nInstru...,crispy-salt-and-pepper-potatoes-dan-kluger
2,Thanksgiving Mac and Cheese,"1 cup evaporated milk, 1 cup whole milk, 1 tsp...",Title: Thanksgiving Mac and Cheese\nInstructio...,thanksgiving-mac-and-cheese-erick-williams
3,Italian Sausage and Bread Stuffing,"1 (¾- to 1-pound) round Italian loaf, cut into...",Title: Italian Sausage and Bread Stuffing\nIns...,italian-sausage-and-bread-stuffing-240559
4,Newton's Law,"1 teaspoon dark brown sugar, 1 teaspoon hot wa...",Title: Newton's Law\nInstructions: Stir togeth...,newtons-law-apple-bourbon-cocktail


In [21]:
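def get_category(ingredients):
    """Return a single coarse category for an ingredient string.

    Rules are checked in order and the first match wins. Matching is by
    substring, so "pepper" in "black pepper" is enough to return "spicy".
    """
    ing = ingredients.lower()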

    if any(x in ing for x in ["chicken", "turkey"]):
        return "poultry"

    if any(x in ing for x in ["beef", "steak", "lamb"]):
        return "red_meat"

    if any(x in ing for x in ["salmon", "fish", "tuna", "shrimp"]):
        return "seafood"

    if any(x in ing for x in ["rice", "pasta", "noodle", "spaghetti"]):
        return "carbs"

    if any(x in ing for x in ["broccoli", "carrot", "spinach", "zucchini"]):
        return "vegetarian"

    if any(x in ing for x in ["milk", "cheese", "cream"]):
        return "dairy"

    if any(x in ing for x in ["chili", "pepper", "cumin", "garam"]):
        return "spicy"

    return "other"


# FIX: make safe for NaN / floats
df["ingredients_text"] = df["ingredients_text"].fillna("").astype(str)

# Apply category tagging
df["categories"] = df["ingredients_text"].apply(get_category)

# Save updated file (note: this overwrites the cleaned input CSV in place)
df.to_csv(CLEAN_FILE, index=False)

df.head()

Unnamed: 0,Title,ingredients_text,target_text,Image_Name,categories
0,Miso-Butter Roast Chicken With Acorn Squash Pa...,"1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sal...",Title: Miso-Butter Roast Chicken With Acorn Sq...,miso-butter-roast-chicken-acorn-squash-panzanella,poultry
1,Crispy Salt and Pepper Potatoes,"2 large egg whites, 1 pound new potatoes (abou...",Title: Crispy Salt and Pepper Potatoes\nInstru...,crispy-salt-and-pepper-potatoes-dan-kluger,spicy
2,Thanksgiving Mac and Cheese,"1 cup evaporated milk, 1 cup whole milk, 1 tsp...",Title: Thanksgiving Mac and Cheese\nInstructio...,thanksgiving-mac-and-cheese-erick-williams,dairy
3,Italian Sausage and Bread Stuffing,"1 (¾- to 1-pound) round Italian loaf, cut into...",Title: Italian Sausage and Bread Stuffing\nIns...,italian-sausage-and-bread-stuffing-240559,poultry
4,Newton's Law,"1 teaspoon dark brown sugar, 1 teaspoon hot wa...",Title: Newton's Law\nInstructions: Stir togeth...,newtons-law-apple-bourbon-cocktail,other
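
`get_category` returns the first matching rule, so the order of the checks acts as a priority list. The same logic can be written as an ordered rule table, which makes that priority explicit and easier to edit. This is a sketch that keeps the same keywords and order; `CATEGORY_RULES` and `get_category_from_rules` are illustrative names.

```python
# Ordered (category, keywords) pairs; earlier entries take priority
CATEGORY_RULES = [
    ("poultry", ["chicken", "turkey"]),
    ("red_meat", ["beef", "steak", "lamb"]),
    ("seafood", ["salmon", "fish", "tuna", "shrimp"]),
    ("carbs", ["rice", "pasta", "noodle", "spaghetti"]),
    ("vegetarian", ["broccoli", "carrot", "spinach", "zucchini"]),
    ("dairy", ["milk", "cheese", "cream"]),
    ("spicy", ["chili", "pepper", "cumin", "garam"]),
]

def get_category_from_rules(ingredients: str) -> str:
    """Return the first category whose keywords appear as substrings."""
    ing = ingredients.lower()
    for category, keywords in CATEGORY_RULES:
        if any(k in ing for k in keywords):
            return category
    return "other"

print(get_category_from_rules("1 cup chicken broth, 2 cups rice"))  # poultry
print(get_category_from_rules("salt and black pepper"))             # spicy
print(get_category_from_rules("sugar, water"))                      # other
```

The first example shows the priority effect: a rice dish made with chicken broth is labeled `poultry` because that rule is checked first.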


In [23]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

df = pd.read_csv(CLEAN_FILE)

df["ingredients_text"] = df["ingredients_text"].fillna("").astype(str)

df["categories"] = df["ingredients_text"].apply(get_category)

df["categories"] = df["categories"].fillna("other").astype(str)

df = df[df["ingredients_text"].str.strip() != ""]

X = df["ingredients_text"]
y = df["categories"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=300))
])

model.fit(X_train, y_train)
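
The cell imports `classification_report` and creates a held-out split, but never scores the model on it. A minimal evaluation sketch on toy data is shown below. The toy ingredient strings and labels are made up for illustration, and `stratify=y` is an added assumption that keeps class proportions equal across the split. Because the real labels come from `get_category`, a high score would mainly show that the classifier reproduces the keyword rules.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy data (illustrative only): 10 "poultry" and 10 "dairy" ingredient strings
X = [f"{i} chicken thighs, salt" for i in range(1, 11)] + \
    [f"{i} cups milk, sugar" for i in range(1, 11)]
y = ["poultry"] * 10 + ["dairy"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=300)),
])
model.fit(X_train, y_train)

# Score on the held-out split; zero_division=0 avoids warnings for empty classes
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
```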

In [25]:
import joblib

# Make sure the output directory exists before saving
os.makedirs("model", exist_ok=True)
joblib.dump(model.named_steps["tfidf"], "model/category_vectorizer.pkl")
joblib.dump(model.named_steps["clf"], "model/category_classifier.pkl")

print("Saved classifier and vectorizer into /model/")

Saved classifier and vectorizer into /model/
