## 01 — Data Preprocessing for AppetIte

This notebook prepares the raw recipe dataset for use in the AppetIte ML pipeline.

It performs the following tasks:

- Load the raw dataset  
- Parse ingredients lists  
- Build `ingredients_text` for BART input  
- Build `target_text` (Title + Instructions) for BART output  
- Clean invalid rows  
- Create train/val/test splits  
- Save processed CSV files into `data/processed/`

Each output CSV contains the columns `Title`, `ingredients_text`, `target_text`, and `Image_Name`, so the files are compatible with the training notebook.

In [36]:
import os
import ast
import pandas as pd
from sklearn.model_selection import train_test_split

### Load Raw Dataset

The dataset should be located at:

`data/raw/Food Ingredients and Recipe Dataset with Image Name Mapping.csv`



In [37]:
RAW_DATA_PATH = "data/raw/Food Ingredients and Recipe Dataset with Image Name Mapping.csv"

df = pd.read_csv(RAW_DATA_PATH)
print("Dataset shape:", df.shape)
df.head()

Dataset shape: (13501, 6)


Unnamed: 0.1,Unnamed: 0,Title,Ingredients,Instructions,Image_Name,Cleaned_Ingredients
0,0,Miso-Butter Roast Chicken With Acorn Squash Pa...,"['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher...","Pat chicken dry with paper towels, season all ...",miso-butter-roast-chicken-acorn-squash-panzanella,"['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher..."
1,1,Crispy Salt and Pepper Potatoes,"['2 large egg whites', '1 pound new potatoes (...",Preheat oven to 400°F and line a rimmed baking...,crispy-salt-and-pepper-potatoes-dan-kluger,"['2 large egg whites', '1 pound new potatoes (..."
2,2,Thanksgiving Mac and Cheese,"['1 cup evaporated milk', '1 cup whole milk', ...",Place a rack in middle of oven; preheat to 400...,thanksgiving-mac-and-cheese-erick-williams,"['1 cup evaporated milk', '1 cup whole milk', ..."
3,3,Italian Sausage and Bread Stuffing,"['1 (¾- to 1-pound) round Italian loaf, cut in...",Preheat oven to 350°F with rack in middle. Gen...,italian-sausage-and-bread-stuffing-240559,"['1 (¾- to 1-pound) round Italian loaf, cut in..."
4,4,Newton's Law,"['1 teaspoon dark brown sugar', '1 teaspoon ho...",Stir together brown sugar and hot water in a c...,newtons-law-apple-bourbon-cocktail,"['1 teaspoon dark brown sugar', '1 teaspoon ho..."


### Parse Ingredient Columns

`Ingredients` and `Cleaned_Ingredients` are stored as stringified Python lists.
We convert them back to lists using `ast.literal_eval`.

Missing values and strings that fail to parse become `None`. A row is dropped later only if neither column yields a usable list.

In [38]:
def safe_parse_list(x):
    if isinstance(x, list):
        return x
    if pd.isna(x):
        return None
    try:
        parsed = ast.literal_eval(x)
        return parsed if isinstance(parsed, list) else [parsed]
    except (ValueError, SyntaxError, TypeError):  # malformed or non-literal input
        return None

if "Unnamed: 0" in df.columns:
    df.drop(columns=["Unnamed: 0"], inplace=True)

df["parsed_ingredients"] = df["Ingredients"].apply(safe_parse_list)
df["parsed_cleaned_ingredients"] = df["Cleaned_Ingredients"].apply(safe_parse_list)
df[["Title", "parsed_ingredients", "parsed_cleaned_ingredients"]].head()

Unnamed: 0,Title,parsed_ingredients,parsed_cleaned_ingredients
0,Miso-Butter Roast Chicken With Acorn Squash Pa...,"[1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sa...","[1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sa..."
1,Crispy Salt and Pepper Potatoes,"[2 large egg whites, 1 pound new potatoes (abo...","[2 large egg whites, 1 pound new potatoes (abo..."
2,Thanksgiving Mac and Cheese,"[1 cup evaporated milk, 1 cup whole milk, 1 ts...","[1 cup evaporated milk, 1 cup whole milk, 1 ts..."
3,Italian Sausage and Bread Stuffing,"[1 (¾- to 1-pound) round Italian loaf, cut int...","[1 (¾- to 1-pound) round Italian loaf, cut int..."
4,Newton's Law,"[1 teaspoon dark brown sugar, 1 teaspoon hot w...","[1 teaspoon dark brown sugar, 1 teaspoon hot w..."
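
To see what the parsing step does to a single cell, here is a small standalone sketch that uses only the standard library. The helper `classify_cell` and the sample strings are illustrative assumptions and are not part of the notebook.

```python
import ast

def classify_cell(cell):
    """Return 'list', 'other', or the name of the exception raised while parsing."""
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError) as exc:
        return type(exc).__name__
    return "list" if isinstance(parsed, list) else "other"

samples = [
    "['2 large egg whites', '1 pound new potatoes']",  # well-formed stringified list
    "'just one string'",                               # valid literal, but not a list
    "salt",                                            # bare name, not a literal
    "['unterminated",                                  # broken quoting
]

for cell in samples:
    print(f"{classify_cell(cell):12} <- {cell}")
```

Running a check like this over the raw column before parsing can show whether failures come from broken quoting or from values that were never lists.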


### Build BART Input: `ingredients_text`

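We join the ingredient strings into one comma-separated string, preferring `Cleaned_Ingredients` and falling back to `Ingredients` when the cleaned list is missing or empty.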

In [39]:
def build_ingredients_text(row):
    cleaned = row["parsed_cleaned_ingredients"]
    raw = row["parsed_ingredients"]

    ing_list = cleaned if cleaned else raw
    if not ing_list:
        return None

    return ", ".join(str(i).strip() for i in ing_list if str(i).strip())

df["ingredients_text"] = df.apply(build_ingredients_text, axis=1)
df["ingredients_text"].head()

0    1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sal...
1    2 large egg whites, 1 pound new potatoes (abou...
2    1 cup evaporated milk, 1 cup whole milk, 1 tsp...
3    1 (¾- to 1-pound) round Italian loaf, cut into...
4    1 teaspoon dark brown sugar, 1 teaspoon hot wa...
Name: ingredients_text, dtype: object
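
The fallback and blank-skipping behaviour can be checked on plain Python lists. The sketch below assumes a hypothetical helper, `join_ingredients`, that mirrors the logic of `build_ingredients_text` without needing a DataFrame; the sample inputs are made up.

```python
def join_ingredients(cleaned, raw):
    """Mirror the fallback logic on plain lists (hypothetical helper)."""
    ing_list = cleaned if cleaned else raw
    if not ing_list:
        return None
    return ", ".join(str(i).strip() for i in ing_list if str(i).strip())

# The cleaned list wins when it is non-empty.
print(join_ingredients(["1 cup milk ", " 2 eggs"], ["ignored"]))
# A missing cleaned list falls back to raw, and blank entries are skipped.
print(join_ingredients(None, ["salt", "", "pepper"]))
# Nothing usable in either list gives None.
print(join_ingredients([], None))
# A list holding only blanks gives an empty string, not None.
print(repr(join_ingredients([" ", ""], None)))
```

The last case is worth noting: an empty string is not a missing value, so a later `dropna` call would not remove such a row on its own.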

### Build Model Output: `target_text`

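We combine the recipe Title and Instructions into one block of text: a `Title: ...` line followed by an `Instructions: ...` line. Rows without instructions get no target.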

In [40]:
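def build_target_text(row):
    title, instr = row.get("Title"), row.get("Instructions")
    # str(NaN) is the literal "nan", so map missing values to "" first
    title = "" if pd.isna(title) else str(title).strip()
    instr = "" if pd.isna(instr) else str(instr).strip()

    if not instr:
        return None

    return f"Title: {title}\nInstructions: {instr}" if title else f"Instructions: {instr}"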

df["target_text"] = df.apply(build_target_text, axis=1)
df["target_text"].head()

0    Title: Miso-Butter Roast Chicken With Acorn Sq...
1    Title: Crispy Salt and Pepper Potatoes\nInstru...
2    Title: Thanksgiving Mac and Cheese\nInstructio...
3    Title: Italian Sausage and Bread Stuffing\nIns...
4    Title: Newton's Law\nInstructions: Stir togeth...
Name: target_text, dtype: object
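
Since the model is trained to emit this `Title: ...` / `Instructions: ...` layout, downstream code will likely need to split generated text back into its two parts. The following is a minimal sketch of such an inverse step; `split_target_text` is a hypothetical helper, not something defined in this notebook, and it assumes the title itself contains no newline.

```python
TITLE_PREFIX = "Title: "
INSTR_PREFIX = "Instructions: "

def split_target_text(text):
    """Split a target_text string into (title, instructions)."""
    title = ""
    body = text
    if text.startswith(TITLE_PREFIX):
        # The title occupies the first line; everything after it is the body.
        first_line, _, body = text.partition("\n")
        title = first_line[len(TITLE_PREFIX):]
    if body.startswith(INSTR_PREFIX):
        body = body[len(INSTR_PREFIX):]
    return title, body

example = "Title: Newton's Law\nInstructions: Stir together brown sugar and hot water."
print(split_target_text(example))
print(split_target_text("Instructions: Mix well."))
```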

### Clean Dataset

We remove rows where:

- `ingredients_text` is missing  
- `target_text` is missing  
- `target_text` is shorter than 50 characters (the count includes the `Title:` and `Instructions:` labels)

In [41]:
df_clean = df.dropna(subset=["ingredients_text", "target_text"]).copy()

MIN_TARGET_LEN = 50
df_clean = df_clean[df_clean["target_text"].str.len() >= MIN_TARGET_LEN]

print("Cleaned dataset shape:", df_clean.shape)
df_clean.head()

Cleaned dataset shape: (13495, 9)


Unnamed: 0,Title,Ingredients,Instructions,Image_Name,Cleaned_Ingredients,parsed_ingredients,parsed_cleaned_ingredients,ingredients_text,target_text
0,Miso-Butter Roast Chicken With Acorn Squash Pa...,"['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher...","Pat chicken dry with paper towels, season all ...",miso-butter-roast-chicken-acorn-squash-panzanella,"['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher...","[1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sa...","[1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sa...","1 (3½–4-lb.) whole chicken, 2¾ tsp. kosher sal...",Title: Miso-Butter Roast Chicken With Acorn Sq...
1,Crispy Salt and Pepper Potatoes,"['2 large egg whites', '1 pound new potatoes (...",Preheat oven to 400°F and line a rimmed baking...,crispy-salt-and-pepper-potatoes-dan-kluger,"['2 large egg whites', '1 pound new potatoes (...","[2 large egg whites, 1 pound new potatoes (abo...","[2 large egg whites, 1 pound new potatoes (abo...","2 large egg whites, 1 pound new potatoes (abou...",Title: Crispy Salt and Pepper Potatoes\nInstru...
2,Thanksgiving Mac and Cheese,"['1 cup evaporated milk', '1 cup whole milk', ...",Place a rack in middle of oven; preheat to 400...,thanksgiving-mac-and-cheese-erick-williams,"['1 cup evaporated milk', '1 cup whole milk', ...","[1 cup evaporated milk, 1 cup whole milk, 1 ts...","[1 cup evaporated milk, 1 cup whole milk, 1 ts...","1 cup evaporated milk, 1 cup whole milk, 1 tsp...",Title: Thanksgiving Mac and Cheese\nInstructio...
3,Italian Sausage and Bread Stuffing,"['1 (¾- to 1-pound) round Italian loaf, cut in...",Preheat oven to 350°F with rack in middle. Gen...,italian-sausage-and-bread-stuffing-240559,"['1 (¾- to 1-pound) round Italian loaf, cut in...","[1 (¾- to 1-pound) round Italian loaf, cut int...","[1 (¾- to 1-pound) round Italian loaf, cut int...","1 (¾- to 1-pound) round Italian loaf, cut into...",Title: Italian Sausage and Bread Stuffing\nIns...
4,Newton's Law,"['1 teaspoon dark brown sugar', '1 teaspoon ho...",Stir together brown sugar and hot water in a c...,newtons-law-apple-bourbon-cocktail,"['1 teaspoon dark brown sugar', '1 teaspoon ho...","[1 teaspoon dark brown sugar, 1 teaspoon hot w...","[1 teaspoon dark brown sugar, 1 teaspoon hot w...","1 teaspoon dark brown sugar, 1 teaspoon hot wa...",Title: Newton's Law\nInstructions: Stir togeth...
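
The cleaned shape tells us how many rows were removed, but not which rule removed them. A per-rule tally can be sketched in plain Python as below. The `drop_reason` helper and the toy rows are assumptions made for illustration; in the notebook the same rules would be applied to the DataFrame.

```python
from collections import Counter

MIN_TARGET_LEN = 50

def drop_reason(row, min_len=MIN_TARGET_LEN):
    """Return why a row would be removed, or None if it is kept."""
    if row.get("ingredients_text") is None:
        return "missing ingredients_text"
    if row.get("target_text") is None:
        return "missing target_text"
    if len(row["target_text"]) < min_len:
        return "target_text too short"
    return None

rows = [
    {"ingredients_text": "potatoes, oil, salt",
     "target_text": "Title: Roast Potatoes\nInstructions: Toss potatoes with oil and roast until crisp."},
    {"ingredients_text": None, "target_text": "Instructions: Serve immediately over ice with a twist of lemon."},
    {"ingredients_text": "ice", "target_text": None},
    {"ingredients_text": "ice", "target_text": "Instructions: Chill."},
]

# Label kept rows explicitly so the tally covers every input row.
reasons = Counter(drop_reason(r) or "kept" for r in rows)
for reason, count in reasons.items():
    print(f"{reason}: {count}")
```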


### Train / Validation / Test Split

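We create an 80/10/10 split in two steps: first hold out 20% of the rows, then divide that held-out portion evenly into validation and test sets. A fixed `RANDOM_STATE` keeps the split reproducible.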

In [42]:
RANDOM_STATE = 42

train_df, temp_df = train_test_split(
    df_clean,
    test_size=0.2,
    random_state=RANDOM_STATE,
    shuffle=True
)

val_df, test_df = train_test_split(
    temp_df,
    test_size=0.5,
    random_state=RANDOM_STATE,
    shuffle=True
)

print("Train:", train_df.shape)
print("Val:  ", val_df.shape)
print("Test: ", test_df.shape)

Train: (10796, 9)
Val:   (1349, 9)
Test:  (1350, 9)
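
The printed sizes can be reproduced with a little arithmetic. This sketch assumes that the held-out portion of each split is rounded up and the remainder goes to the other side, which is consistent with the shapes shown above; `split_sizes` is an illustrative helper, not a scikit-learn function.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (kept, held_out) row counts, rounding the held-out part up."""
    held_out = math.ceil(n_rows * test_size)
    return n_rows - held_out, held_out

# Step 1: hold out 20% of the 13495 cleaned rows.
n_train, n_temp = split_sizes(13495, 0.2)
# Step 2: split the held-out rows in half; the odd row lands in the test set.
n_val, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_val, n_test)  # 10796 1349 1350
```

Because 2699 held-out rows cannot be halved evenly, the validation and test sets differ by one row.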


### Final Text Sanitization

Hugging Face tokenizers expect plain string inputs, so NaN values or non-string objects in a text column will cause errors.

We convert every ingredient and target field to a clean string here.

In [43]:
def clean_field(x):
    # Treat None and NaN (the pandas missing-value marker) as empty text
    if x is None or (isinstance(x, float) and pd.isna(x)):
        return ""
    if isinstance(x, list):
        return ", ".join(map(str, x))
    return str(x)

for df_temp in [train_df, val_df, test_df]:
    df_temp["ingredients_text"] = df_temp["ingredients_text"].apply(clean_field)
    df_temp["target_text"] = df_temp["target_text"].apply(clean_field)

print("Sanitization complete.")

Sanitization complete.
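
The reason missing values need explicit handling is that NaN is an ordinary float whose string form is the literal text `nan`. The standalone sketch below illustrates this; `to_text` is a hypothetical stand-in for `clean_field` so that the example runs on its own, and it uses `math.isnan` in place of any pandas call.

```python
import math

missing = float("nan")
print(str(missing))         # nan   (would end up as literal text in the data)
print(missing == missing)   # False (so equality checks cannot detect it)
print(math.isnan(missing))  # True

def to_text(value):
    """Convert a field to a string, mapping None and NaN to an empty string."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    if isinstance(value, list):
        return ", ".join(map(str, value))
    return str(value)

print([to_text(v) for v in [None, missing, ["salt", "pepper"], "Stir.", 2.5]])
```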


### Save Processed CSV Files

We save the following files into `data/processed/`:

- `appetite_train.csv`  
- `appetite_val.csv`  
- `appetite_test.csv`  
- `appetite_clean_full.csv` (all cleaned rows, before splitting)

In [44]:
PROCESSED_DIR = "data/processed"
os.makedirs(PROCESSED_DIR, exist_ok=True)

cols = ["Title", "ingredients_text", "target_text", "Image_Name"]

train_df[cols].to_csv(f"{PROCESSED_DIR}/appetite_train.csv", index=False)
val_df[cols].to_csv(f"{PROCESSED_DIR}/appetite_val.csv", index=False)
test_df[cols].to_csv(f"{PROCESSED_DIR}/appetite_test.csv", index=False)
df_clean[cols].to_csv(f"{PROCESSED_DIR}/appetite_clean_full.csv", index=False)

print("Processed files saved to:", PROCESSED_DIR)

Processed files saved to: data/processed
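
Because `target_text` contains a newline between the title and the instructions, a saved record can span more than one physical line in the CSV. The sketch below uses the standard-library `csv` module and an in-memory buffer, rather than pandas and real files, to illustrate that a quoted multi-line field survives a write and read round trip. The sample row is made up for the example.

```python
import csv
import io

rows = [{
    "Title": "Newton's Law",
    "ingredients_text": "1 teaspoon dark brown sugar, 1 teaspoon hot water",
    "target_text": "Title: Newton's Law\nInstructions: Stir together brown sugar and hot water.",
}]

# newline="" leaves line endings untouched, as the csv module recommends.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Read the text back and confirm the embedded newline and commas are intact.
buffer.seek(0)
restored = [dict(r) for r in csv.DictReader(buffer)]
print(restored == rows)
print(repr(restored[0]["target_text"]))
```

A line-by-line reader would miscount such records, so it is safer to load the processed files with a CSV-aware parser.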
