# Recipe Data Cleaning & Preparation Pipeline

- **Authors:** Riyaadh Gani and Damilola Ogunleye
- **Project:** Food Recognition & Recipe LLM  
- **Purpose:** Clean and prepare recipe data for embedding generation and model training

---

## Overview

This notebook processes two major recipe datasets:
- **Food.com**: 231K recipes with reviews and nutrition data
- **RecipeNLG**: 2.2M recipes (500K sample used)

**Output:** 3 datasets ready for training:

1. `nutrition_lookup.csv`
- **Purpose:** Reference table for nutritional information
- **Usage:** Lookup table (not for training)
- **Rows:** ~231K nutrition entries

2. `clean_recipes.csv`
- **Purpose:** Single-turn recipe training data
- **Usage:** Baseline LSTM training
- **Rows:** ~4-5M prompt-response pairs
- **Format:** `prompt, response`

3. `conversational_training_data.csv`
- **Purpose:** Multi-turn conversational training data
- **Usage:** Conversational LSTM training
- **Rows:** ~10K-50K conversation pairs
- **Format:** `input, output` (with conversation history)

---
## Section 1: Setup & Configuration

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import re
import ast
import os
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Enable progress bars for pandas operations
tqdm.pandas()

print("‚úì Libraries imported successfully!")

‚úì Libraries imported successfully!


In [2]:
# Configuration
FOOD_COM_PATH = "datasets/kaggleFood"
RECIPENLG_PATH = "datasets/recipeNLG/RecipeNLG_dataset.csv"
OUTPUT_PATH = "datasets/Cleaned/clean_recipes.csv"

print(f"Configuration set:")
print(f"  - Output path: {OUTPUT_PATH}")

Configuration set:
  - Output path: datasets/Cleaned/clean_recipes.csv


---
## Section 2: Load Datasets

In [3]:
print("Loading datasets...\n")

# Load Food.com recipes
print("[1/3] Loading Food.com recipes...")
food_com_recipes = pd.read_csv(f"{FOOD_COM_PATH}/RAW_recipes.csv")
print(f"      ‚úì Loaded {len(food_com_recipes):,} recipes")

# Load Food.com interactions (reviews)
print("[2/3] Loading Food.com interactions...")
food_com_interactions = pd.read_csv(f"{FOOD_COM_PATH}/RAW_interactions.csv")
print(f"      ‚úì Loaded {len(food_com_interactions):,} interactions")

# Load RecipeNLG
print(f"[3/3] Loading RecipeNLG dataset...")
recipenlg_df = pd.read_csv(RECIPENLG_PATH)
print(f"      ‚úì Loaded {len(recipenlg_df):,} recipes")

print("\n‚úì All datasets loaded successfully!")

Loading datasets...

[1/3] Loading Food.com recipes...
      ‚úì Loaded 231,637 recipes
[2/3] Loading Food.com interactions...
      ‚úì Loaded 1,132,367 interactions
[3/3] Loading RecipeNLG dataset...
      ‚úì Loaded 2,231,142 recipes

‚úì All datasets loaded successfully!


### 2.1 Explore Data Structure

In [4]:
print("Dataset Schemas:\n")
print("Food.com Recipes:")
print(f"  Columns: {food_com_recipes.columns.tolist()}")
print(f"  Shape: {food_com_recipes.shape}\n")

print("RecipeNLG:")
print(f"  Columns: {recipenlg_df.columns.tolist()}")
print(f"  Shape: {recipenlg_df.shape}\n")

print("Food.com Interactions:")
print(f"  Columns: {food_com_interactions.columns.tolist()}")
print(f"  Shape: {food_com_interactions.shape}")

Dataset Schemas:

Food.com Recipes:
  Columns: ['name', 'id', 'minutes', 'contributor_id', 'submitted', 'tags', 'nutrition', 'n_steps', 'steps', 'description', 'ingredients', 'n_ingredients']
  Shape: (231637, 12)

RecipeNLG:
  Columns: ['Unnamed: 0', 'title', 'ingredients', 'directions', 'link', 'source', 'NER']
  Shape: (2231142, 7)

Food.com Interactions:
  Columns: ['user_id', 'recipe_id', 'date', 'rating', 'review']
  Shape: (1132367, 5)


In [5]:
# Display sample recipes
print("Sample Food.com Recipe:\n")
display(food_com_recipes.head(2))

print("\nSample RecipeNLG Recipe:\n")
display(recipenlg_df.head(2))

Sample Food.com Recipe:



Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6



Sample RecipeNLG Recipe:



Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."


## Nutrition Lookup Extraction 

In [7]:
tqdm.pandas()

# === Load raw Food.com data (from Kaggle dataset) ===
RAW_FOOD_PATH = "datasets/kaggleFood/RAW_recipes.csv"
raw_food = pd.read_csv(RAW_FOOD_PATH)

print(f"Loaded {len(raw_food):,} raw Food.com recipes")

# === Parse the nutrition column ===
# Each nutrition entry looks like: [138.4, 4.0, 5.0, 20.0, 10.0, 15.0, 3.0]
# Format: [calories, total_fat(g), sugar(g), sodium(mg), protein(g), sat_fat(g), carbs(g)]

def parse_nutrition(entry):
    """Convert Food.com nutrition string into a dict of numeric fields."""
    try:
        vals = ast.literal_eval(entry)
        return {
            "calories": float(vals[0]),
            "fat_g": float(vals[1]),
            "sugar_g": float(vals[2]),
            "sodium_mg": float(vals[3]),
            "protein_g": float(vals[4]),
            "sat_fat_g": float(vals[5]),
            "carbs_g": float(vals[6])
        }
    except Exception:
        return None

# Apply parsing
nutrition_parsed = raw_food["nutrition"].progress_apply(parse_nutrition)
nutrition_df = pd.DataFrame(list(nutrition_parsed))

# Attach IDs and titles for lookup
nutrition_df["id"] = raw_food["id"]
nutrition_df["title"] = raw_food["name"].str.lower().str.strip()

# === Clean and validate ===
nutrition_df = nutrition_df.dropna(subset=["calories", "protein_g", "carbs_g"])
nutrition_df = nutrition_df[nutrition_df["calories"] > 0]
nutrition_df = nutrition_df.drop_duplicates(subset=["title"])

# Reorder columns neatly
nutrition_df = nutrition_df[
    ["id", "title", "calories", "protein_g", "carbs_g", "fat_g", "sat_fat_g", "sugar_g", "sodium_mg"]
]

print(f"‚úÖ Parsed nutrition info for {len(nutrition_df):,} recipes")

# === Save to CSV for lookup ===
NUTRITION_OUTPUT_PATH = "datasets/Cleaned/nutrition_lookup.csv"
nutrition_df.to_csv(NUTRITION_OUTPUT_PATH, index=False)

print(f"üíæ Nutrition lookup saved to: {NUTRITION_OUTPUT_PATH}")
print("Sample rows:\n", nutrition_df.sample(5))


Loaded 231,637 raw Food.com recipes


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 231637/231637 [00:01<00:00, 123623.94it/s]


‚úÖ Parsed nutrition info for 230,133 recipes
üíæ Nutrition lookup saved to: datasets/Cleaned/nutrition_lookup.csv
Sample rows:
             id                                          title  calories  \
195584  444694       spicy chicken with carrot and herb salad     424.3   
91146    40313                           georgia peach cooler     310.7   
101861  148628          ham with pineapple orange dijon glaze     150.3   
154664  209173         paula deen s easy squeeze honey butter    2703.8   
142423  471871  na me inspired roasted squash  vegan friendly     818.9   

        protein_g  carbs_g  fat_g  sat_fat_g  sugar_g  sodium_mg  
195584       57.0     22.0    8.0        7.0     30.0        7.0  
91146        19.0     17.0   14.0       28.0    130.0        5.0  
101861        0.0     12.0    0.0        0.0    150.0        1.0  
154664        6.0     24.0  421.0      238.0    278.0      133.0  
142423       23.0     50.0   43.0      119.0    169.0       26.0  


---
## Section 3: Define Cleaning Functions

These functions handle:
- Text normalization (lowercase, whitespace)
- HTML/URL removal
- List parsing and formatting
- Special character handling

In [8]:
def clean_text(text):
    """
    Clean text by removing HTML, URLs, and normalizing whitespace.
    
    Args:
        text: Input string to clean
    
    Returns:
        Cleaned lowercase string
    """
    if not isinstance(text, str):
        return ""
    
    text = text.lower()
    text = re.sub(r"<[^>]+>", "", text)  # Remove HTML tags
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs
    text = re.sub(r"[^a-z0-9\s.,!?\[\]\(\)\-'\"]", "", text)  # Keep essential punctuation
    text = re.sub(r"\s+", " ", text).strip()  # Normalize whitespace
    
    return text


def parse_list_string(s):
    """
    Safely parse string representation of Python lists.
    
    Args:
        s: String representation of a list (e.g., "['item1', 'item2']")
    
    Returns:
        Parsed list or empty list if parsing fails
    """
    if not isinstance(s, str):
        return []
    try:
        return ast.literal_eval(s)
    except:
        return []


def format_ingredients(ingredients):
    """
    Format ingredients list into a comma-separated string.
    
    Args:
        ingredients: List or string representation of ingredients
    
    Returns:
        Formatted string (e.g., "flour, sugar, eggs")
    """
    if isinstance(ingredients, str):
        ingredients = parse_list_string(ingredients)
    
    if isinstance(ingredients, list):
        return ", ".join([str(i).strip() for i in ingredients if i])
    
    return str(ingredients)


def format_directions(directions):
    """
    Format cooking directions with numbered steps.
    
    Args:
        directions: List or string representation of cooking steps
    
    Returns:
        Formatted string with numbered steps (e.g., "1. preheat oven 2. mix ingredients")
    """
    if isinstance(directions, str):
        directions = parse_list_string(directions)
    
    if isinstance(directions, list):
        return " ".join([f"{i+1}. {str(step).strip()}" for i, step in enumerate(directions) if step])
    
    return str(directions)


print("‚úì Cleaning functions defined!")

‚úì Cleaning functions defined!


### üí° Why We Clean and Normalize Text

Before any deep learning model can learn meaningful patterns, the input data must be **consistent, noise-free, and normalized**.  
This section applies the same principles we learned in COMP0220 about *input normalization for stable training*.

#### 1. Lowercasing & Whitespace Normalization
- Ensures that "Milk" and "milk" are treated as the same token, reducing vocabulary size and data sparsity.  
- Makes gradient updates more stable because the model doesn‚Äôt waste capacity learning redundant word forms.

#### 2. HTML / URL Removal
- Removes irrelevant tokens (e.g., `<div>`, `http://...`) that carry no semantic meaning for recipes or ingredients.  
- Prevents the tokenizer and embedding layers from assigning random weights to non-informative symbols.

#### 3. List Parsing & Formatting
- Converts ingredient lists or numbered steps into clean, uniform strings so the model can learn true relationships  
  between **ingredients ‚Üí actions ‚Üí outcomes** instead of formatting artifacts.

#### 4. Special Character Handling
- Normalizes punctuation, removes emojis or stray symbols that increase token noise.  
- Keeps the text distribution consistent, improving embedding quality and convergence speed.

In short: **clean, normalized text ‚Üí cleaner embeddings ‚Üí faster convergence ‚Üí better generalization.**  
This mirrors the data-normalization step used in CNNs (e.g., scaling pixel values) but applied here to language data.


---
## Section 4: Process Food.com Dataset

In [9]:
print("Processing Food.com recipes...\n")

# Select and rename columns
food_com_clean = food_com_recipes[["id", "name", "ingredients", "steps", "description"]].copy()
food_com_clean.rename(columns={"name": "title", "steps": "directions"}, inplace=True)

# Clean text fields
print("[1/4] Cleaning titles...")
food_com_clean["title"] = food_com_clean["title"].progress_apply(clean_text)

print("[2/4] Cleaning descriptions...")
food_com_clean["description"] = food_com_clean["description"].progress_apply(clean_text)

print("[3/4] Formatting ingredients...")
food_com_clean["ingredients"] = food_com_clean["ingredients"].progress_apply(format_ingredients)
food_com_clean["ingredients"] = food_com_clean["ingredients"].apply(clean_text)

print("[4/4] Formatting directions...")
food_com_clean["directions"] = food_com_clean["directions"].progress_apply(format_directions)
food_com_clean["directions"] = food_com_clean["directions"].apply(clean_text)

# Remove invalid entries
initial_count = len(food_com_clean)
food_com_clean = food_com_clean.dropna(subset=["title", "ingredients", "directions"])
food_com_clean = food_com_clean[
    (food_com_clean["title"].str.len() > 0) &
    (food_com_clean["ingredients"].str.len() > 0) &
    (food_com_clean["directions"].str.len() > 0)
]

print(f"\n‚úì Food.com processing complete!")
print(f"  - Recipes retained: {len(food_com_clean):,} / {initial_count:,}")
print(f"  - Removal rate: {(1 - len(food_com_clean)/initial_count)*100:.1f}%")

Processing Food.com recipes...

[1/4] Cleaning titles...


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 231637/231637 [00:00<00:00, 358934.63it/s]


[2/4] Cleaning descriptions...


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 231637/231637 [00:02<00:00, 110119.74it/s]


[3/4] Formatting ingredients...


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 231637/231637 [00:02<00:00, 84505.55it/s]


[4/4] Formatting directions...


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 231637/231637 [00:03<00:00, 67341.59it/s]



‚úì Food.com processing complete!
  - Recipes retained: 231,635 / 231,637
  - Removal rate: 0.0%


### 4.1 Process Reviews

In [10]:
print("Processing Food.com reviews...\n")

# Clean review text
print("[1/2] Cleaning review text...")
food_com_interactions["review"] = food_com_interactions["review"].progress_apply(clean_text)

# Filter out very short reviews
food_com_interactions = food_com_interactions[food_com_interactions["review"].str.len() > 10]

# Get first review for each recipe
print("[2/2] Merging reviews with recipes...")
first_reviews = food_com_interactions.groupby("recipe_id")["review"].first().reset_index()

food_com_with_reviews = pd.merge(
    food_com_clean,
    first_reviews,
    left_on="id",
    right_on="recipe_id",
    how="left"
)

review_count = food_com_with_reviews["review"].notna().sum()
print(f"\n‚úì Reviews processed!")
print(f"  - Recipes with reviews: {review_count:,} / {len(food_com_with_reviews):,}")
print(f"  - Coverage: {(review_count/len(food_com_with_reviews))*100:.1f}%")

Processing Food.com reviews...

[1/2] Cleaning review text...


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1132367/1132367 [00:15<00:00, 74661.84it/s]


[2/2] Merging reviews with recipes...

‚úì Reviews processed!
  - Recipes with reviews: 231,236 / 231,635
  - Coverage: 99.8%


---
## Section 5: Process RecipeNLG Dataset

In [11]:
print("Processing RecipeNLG recipes...\n")

# Select relevant columns
recipenlg_clean = recipenlg_df[["title", "ingredients", "directions"]].copy()

# Clean text fields
print("[1/3] Cleaning titles...")
recipenlg_clean["title"] = recipenlg_clean["title"].progress_apply(clean_text)

print("[2/3] Formatting ingredients...")
recipenlg_clean["ingredients"] = recipenlg_clean["ingredients"].progress_apply(format_ingredients)
recipenlg_clean["ingredients"] = recipenlg_clean["ingredients"].apply(clean_text)

print("[3/3] Formatting directions...")
recipenlg_clean["directions"] = recipenlg_clean["directions"].progress_apply(format_directions)
recipenlg_clean["directions"] = recipenlg_clean["directions"].apply(clean_text)

# Remove invalid entries
initial_count = len(recipenlg_clean)
recipenlg_clean = recipenlg_clean.dropna(subset=["title", "ingredients", "directions"])
recipenlg_clean = recipenlg_clean[
    (recipenlg_clean["title"].str.len() > 0) &
    (recipenlg_clean["ingredients"].str.len() > 0) &
    (recipenlg_clean["directions"].str.len() > 0)
]

print(f"\n‚úì RecipeNLG processing complete!")
print(f"  - Recipes retained: {len(recipenlg_clean):,} / {initial_count:,}")
print(f"  - Removal rate: {(1 - len(recipenlg_clean)/initial_count)*100:.1f}%")

Processing RecipeNLG recipes...

[1/3] Cleaning titles...


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2231142/2231142 [00:05<00:00, 387785.66it/s]


[2/3] Formatting ingredients...


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2231142/2231142 [00:27<00:00, 82453.48it/s]


[3/3] Formatting directions...


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2231142/2231142 [00:30<00:00, 73136.48it/s]



‚úì RecipeNLG processing complete!
  - Recipes retained: 2,231,129 / 2,231,142
  - Removal rate: 0.0%


---
## Section 6: Create Prompt-Response Pairs

We create three types of conversational pairs:
1. **Ingredient-based**: "I have X ingredients, what can I make?"
2. **Recipe request**: "How do I make X?"
3. **Review inquiry**: "What do people think about X?" (Food.com only)

In [12]:
def create_recipe_pairs(df, include_reviews=False):
    """
    Generate prompt-response pairs from recipe data.
    
    Args:
        df: DataFrame with columns [title, ingredients, directions, review (optional)]
        include_reviews: Whether to create review-based pairs
    
    Returns:
        DataFrame with columns [prompt, response]
    """
    prompts = []
    responses = []
    
    for idx, row in tqdm(df.iterrows(), total=len(df), desc="Creating pairs"):
        # Type 1: Ingredient-based query
        prompt1 = f"i have these ingredients: {row['ingredients']}. what can i make?"
        response1 = f"you could make {row['title']}. here are the instructions: {row['directions']}"
        prompts.append(prompt1)
        responses.append(response1)
        
        # Type 2: Recipe name query
        prompt2 = f"how do i make {row['title']}?"
        response2 = f"to make {row['title']}, you'll need: {row['ingredients']}. then follow these steps: {row['directions']}"
        prompts.append(prompt2)
        responses.append(response2)
        
        # Type 3: Review query (if available)
        if include_reviews and "review" in df.columns and pd.notna(row.get("review")):
            prompt3 = f"what do people think about {row['title']}?"
            response3 = row["review"]
            prompts.append(prompt3)
            responses.append(response3)
    
    return pd.DataFrame({"prompt": prompts, "response": responses})


print("Creating prompt-response pairs...\n")

print("[1/2] Processing Food.com recipes (with reviews)...")
food_com_pairs = create_recipe_pairs(food_com_with_reviews, include_reviews=True)
print(f"      ‚úì Created {len(food_com_pairs):,} pairs")

print("[2/2] Processing RecipeNLG recipes...")
recipenlg_pairs = create_recipe_pairs(recipenlg_clean, include_reviews=False)
print(f"      ‚úì Created {len(recipenlg_pairs):,} pairs")

print(f"\n‚úì Total pairs created: {len(food_com_pairs) + len(recipenlg_pairs):,}")

Creating prompt-response pairs...

[1/2] Processing Food.com recipes (with reviews)...


Creating pairs: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 231635/231635 [00:08<00:00, 27858.20it/s]


      ‚úì Created 694,506 pairs
[2/2] Processing RecipeNLG recipes...


Creating pairs: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2231129/2231129 [01:03<00:00, 35220.75it/s]


      ‚úì Created 4,462,258 pairs

‚úì Total pairs created: 5,156,764


---
## Section 7: Combine & Finalize Dataset

In [13]:
print("Combining and finalizing dataset...\n")

# Combine all pairs
final_df = pd.concat([food_com_pairs, recipenlg_pairs], ignore_index=True)
print(f"[1/5] Combined datasets: {len(final_df):,} total pairs")

# Remove missing values
final_df = final_df.dropna()
print(f"[2/5] After dropping NaN: {len(final_df):,} pairs")

# Remove duplicates
final_df = final_df.drop_duplicates()
print(f"[3/5] After dropping duplicates: {len(final_df):,} pairs")

# Filter out very short responses (likely errors)
final_df = final_df[final_df["response"].str.len() > 20]
print(f"[4/5] After filtering short responses: {len(final_df):,} pairs")

# Shuffle dataset
final_df = final_df.sample(frac=1, random_state=42).reset_index(drop=True)
print(f"[5/5] Dataset shuffled")

print(f"\n‚úì Final dataset ready with {len(final_df):,} prompt-response pairs!")

Combining and finalizing dataset...

[1/5] Combined datasets: 5,156,764 total pairs
[2/5] After dropping NaN: 5,156,764 pairs
[3/5] After dropping duplicates: 5,156,758 pairs
[4/5] After filtering short responses: 5,155,414 pairs
[5/5] Dataset shuffled

‚úì Final dataset ready with 5,155,414 prompt-response pairs!


---
## Section 8: Data Validation & Quality Checks

In [14]:
print("Running quality checks...\n")

# Calculate statistics
prompt_lengths = final_df["prompt"].str.len()
response_lengths = final_df["response"].str.len()

print("Dataset Statistics:")
print(f"  Total pairs: {len(final_df):,}")
print(f"\nPrompt Statistics:")
print(f"  Mean length: {prompt_lengths.mean():.1f} characters")
print(f"  Median length: {prompt_lengths.median():.1f} characters")
print(f"  Min length: {prompt_lengths.min()} characters")
print(f"  Max length: {prompt_lengths.max()} characters")
print(f"\nResponse Statistics:")
print(f"  Mean length: {response_lengths.mean():.1f} characters")
print(f"  Median length: {response_lengths.median():.1f} characters")
print(f"  Min length: {response_lengths.min()} characters")
print(f"  Max length: {response_lengths.max()} characters")

# Check for issues
print(f"\nQuality Checks:")
print(f"  ‚úì No missing values: {final_df.isnull().sum().sum() == 0}")
print(f"  ‚úì No duplicates: {final_df.duplicated().sum() == 0}")
print(f"  ‚úì All prompts non-empty: {(final_df['prompt'].str.len() > 0).all()}")
print(f"  ‚úì All responses non-empty: {(final_df['response'].str.len() > 0).all()}")

Running quality checks...

Dataset Statistics:
  Total pairs: 5,155,414

Prompt Statistics:
  Mean length: 150.0 characters
  Median length: 66.0 characters
  Min length: 16 characters
  Max length: 10132 characters

Response Statistics:
  Mean length: 673.5 characters
  Median length: 533.0 characters
  Min length: 21 characters
  Max length: 16332 characters

Quality Checks:
  ‚úì No missing values: True
  ‚úì No duplicates: True
  ‚úì All prompts non-empty: True
  ‚úì All responses non-empty: True


---
## Section 9: Save Output

In [15]:
print(f"Saving to {OUTPUT_PATH}...\n")

final_df.to_csv(OUTPUT_PATH, index=False)

file_size_mb = os.path.getsize(OUTPUT_PATH) / (1024**2)

print("=" * 80)
print("‚úì SUCCESS! Dataset saved successfully.")
print("=" * 80)
print(f"\nOutput Details:")
print(f"  File: {OUTPUT_PATH}")
print(f"  Size: {file_size_mb:.2f} MB")
print(f"  Rows: {len(final_df):,}")
print(f"  Columns: {len(final_df.columns)}")
print(f"\nReady for:")
print(f"  ‚úÖ Embedding generation (Word2Vec, BERT, etc.)")
print(f"  ‚úÖ LSTM/RNN training")
print(f"  ‚úÖ Transformer fine-tuning (GPT-2, T5)")
print(f"  ‚úÖ RAG system integration")

Saving to datasets/Cleaned/clean_recipes.csv...

‚úì SUCCESS! Dataset saved successfully.

Output Details:
  File: datasets/Cleaned/clean_recipes.csv
  Size: 4073.08 MB
  Rows: 5,155,414
  Columns: 2

Ready for:
  ‚úÖ Embedding generation (Word2Vec, BERT, etc.)
  ‚úÖ LSTM/RNN training
  ‚úÖ Transformer fine-tuning (GPT-2, T5)
  ‚úÖ RAG system integration


---
## Section 10: Preview Final Output

In [16]:
print("Sample Prompt-Response Pairs:\n")
print("=" * 80)

for i in range(5):
    print(f"\n[Pair {i+1}]")
    print(f"PROMPT:\n  {final_df.iloc[i]['prompt']}")
    print(f"\nRESPONSE:\n  {final_df.iloc[i]['response'][:300]}...")
    print("-" * 80)

Sample Prompt-Response Pairs:


[Pair 1]
PROMPT:
  i have these ingredients: 1 c. salad dressing, 1 c. sour cream, 1 (10 oz.) pkg. frozen spinach, thawed and well drained, 12 c. onion, chopped, 12 c. parsley, chopped, 1 tsp. salt, 12 tsp. pepper. what can i make?

RESPONSE:
  you could make stadium spinach dip. here are the instructions: 1. combine all ingredients mix well. 2. chill. 3. serve with vegetables carrot sticks, celery sticks, broccoli, cauliflower, etc. 4. makes 3 cups....
--------------------------------------------------------------------------------

[Pair 2]
PROMPT:
  how do i make asparagus puff ring?

RESPONSE:
  to make asparagus puff ring, you'll need: 34 cup water, 6 tablespoons butter, 34 cup all-purpose flour, 12 teaspoon salt, 3 large eggs, 14 cup grated parmesan cheese, divided, 1 pound fresh asparagus, cut into 1-inch pieces, 14 cup diced onion, 2 tablespoons butter, 2 tablespoons all-purpose flour, ...
---------------------------------------------------------

---
## Section 11: Generate Conversational Training Data (NEW)

This section creates a **3rd dataset** for multi-turn conversational LSTM training.

**What it does:**
- Takes `clean_recipes.csv` + `nutrition_lookup.csv`
- Generates multi-turn dialogue pairs
- Adds nutritional reasoning
- Outputs: `conversational_training_data.csv`

**Objectives:**
- Enables your LSTM to have natural conversations
- Teaches follow-up questions ("would you like to know...")
- Adds nutritional judgments ("this is healthy because...")

In [17]:
tqdm.pandas()

print("=" * 80)
print("GENERATING CONVERSATIONAL TRAINING DATA")
print("=" * 80)
print("\n‚öôÔ∏è  Configuration: Using 100,000 recipe sample")
print("   Estimated processing time: 20-30 minutes")
print("   Expected output: ~600,000 conversational pairs")

# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

def get_nutritional_judgment(calories, protein, fat, carbs, sugar):
    """Generate nutritional reasoning based on macronutrient values."""
    if pd.isna(calories) or calories == 0:
        return "this is a moderate option."
    
    if protein > 20 and fat < 15 and sugar < 10:
        return "this is a healthy choice as seen by the high protein content and low fat. its great for muscle building and satiety."
    elif protein > 15 and fat < 25 and carbs < 50:
        return "this is a balanced meal with good protein and moderate macros. its suitable for most diets."
    elif carbs > 50 and fat > 20:
        return "this is a nice treat to enjoy occasionally. its indulgent but can fit into a balanced diet in moderation."
    elif calories < 200:
        return "this is a light and healthy option, perfect for a snack or side dish."
    elif calories > 500:
        return "this is a hearty and filling meal. enjoy it when you need substantial energy."
    elif carbs < 20 and fat > 30:
        return "this is a keto-friendly option with low carbs and high fat. great for low-carb diets."
    elif sugar > 30:
        return "this is a sweet treat. enjoy it as a dessert or special occasion food."
    else:
        return "this is a moderate option that fits most dietary needs."


def create_conversational_pairs(row):
    """Create multi-turn conversation pairs from a single recipe."""
    ingredients = row.get('ingredients', '')
    title = row.get('title', '')
    directions = row.get('directions', '')
    calories = row.get('calories', 0)
    protein = row.get('protein_g', 0)
    fat = row.get('fat_g', 0)
    carbs = row.get('carbs_g', 0)
    sugar = row.get('sugar_g', 0)
    
    if not title or not ingredients or not directions:
        return []
    
    judgment = get_nutritional_judgment(calories, protein, fat, carbs, sugar)
    pairs = []
    
    # Multi-turn conversation
    turn1_input = f"[INGREDIENTS] {ingredients}"
    turn1_output = f"i see youve got {ingredients}. i would suggest making {title}. would you like to know how to make it?"
    pairs.append({"input": turn1_input, "output": turn1_output})
    
    turn2_input = f"{turn1_input} [HISTORY] system: {turn1_output} user: yes"
    turn2_output = f"okay heres how you can make it: {directions}. would you like to know the macronutrient information?"
    pairs.append({"input": turn2_input, "output": turn2_output})
    
    if not pd.isna(calories):
        turn3_input = f"{turn2_input} system: {turn2_output} user: yes"
        turn3_output = f"this has {calories:.1f} calories, {protein:.1f}g protein, {fat:.1f}g fat, {carbs:.1f}g carbs, and {sugar:.1f}g sugar. {judgment}"
        pairs.append({"input": turn3_input, "output": turn3_output})
    
    # Single-turn variants
    pairs.append({"input": f"how do i make {title}?", "output": f"to make {title}, youll need: {ingredients}. then follow these steps: {directions}"})
    
    if not pd.isna(calories):
        pairs.append({"input": f"what are the macronutrients in {title}?", "output": f"{title} has {calories:.1f} calories, {protein:.1f}g protein, {fat:.1f}g fat, {carbs:.1f}g carbs, and {sugar:.1f}g sugar. {judgment}"})
    
    pairs.append({"input": f"i have {ingredients}. what can i make?", "output": f"you could make {title}. here are the instructions: {directions}"})
    
    return pairs


# ============================================================================
# STEP 1: LOAD DATA WITH SAMPLING
# ============================================================================

print("\n[1/5] Loading existing datasets...")
clean_recipes = pd.read_csv("datasets/Cleaned/clean_recipes.csv")
print(f"      ‚úì Loaded {len(clean_recipes):,} total recipe pairs")

# Apply 50K sampling
SAMPLE_SIZE = 50000
if len(clean_recipes) > SAMPLE_SIZE:
    clean_recipes = clean_recipes.sample(n=SAMPLE_SIZE, random_state=42).reset_index(drop=True)
    print(f"      ‚úì Sampled {len(clean_recipes):,} recipes for processing")
    print(f"      ‚ÑπÔ∏è  This reduces processing time to 20-30 minutes")
else:
    print(f"      ‚ÑπÔ∏è  Using all {len(clean_recipes):,} recipes (less than {SAMPLE_SIZE:,})")

try:
    nutrition_lookup = pd.read_csv("datasets/Cleaned/nutrition_lookup.csv")
    print(f"      ‚úì Loaded {len(nutrition_lookup):,} nutrition entries")
except FileNotFoundError:
    print("      ‚ö† nutrition_lookup.csv not found, continuing without nutrition data")
    nutrition_lookup = None


# ============================================================================
# STEP 2: EXTRACT RECIPE INFORMATION
# ============================================================================

print("\n[2/5] Extracting recipe information...")
recipe_data = []

for _, row in tqdm(clean_recipes.iterrows(), total=len(clean_recipes), desc="Processing"):
    prompt = row['prompt']
    response = row['response']
    
    if "i have these ingredients:" in prompt:
        ingredients = prompt.split("i have these ingredients:")[1].split(".")[0].strip()
    elif "i have" in prompt and "what can i make" in prompt:
        ingredients = prompt.split("i have")[1].split("what can i make")[0].strip().rstrip('.')
    else:
        continue
    
    if "you could make" in response and "here are the instructions:" in response:
        parts = response.split("you could make")[1].split("here are the instructions:")
        if len(parts) == 2:
            title = parts[0].strip().rstrip('.')
            directions = parts[1].strip()
            recipe_data.append({"title": title, "ingredients": ingredients, "directions": directions})
    elif "to make" in response and "you'll need:" in response:
        parts = response.split("to make")[1].split("you'll need:")
        if len(parts) == 2:
            title = parts[0].strip().rstrip(',')
            rest = parts[1]
            if "then follow these steps:" in rest:
                ing_dir = rest.split("then follow these steps:")
                ingredients_alt = ing_dir[0].strip().rstrip('.')
                directions = ing_dir[1].strip()
                recipe_data.append({"title": title, "ingredients": ingredients_alt, "directions": directions})

recipes_df = pd.DataFrame(recipe_data)
print(f"      ‚úì Extracted {len(recipes_df):,} unique recipes")


# ============================================================================
# STEP 3: MERGE WITH NUTRITION DATA
# ============================================================================

print("\n[3/5] Merging with nutrition data...")
if nutrition_lookup is not None:
    merged = recipes_df.merge(nutrition_lookup, on='title', how='left')
    with_nutrition = merged['calories'].notna().sum()
    without_nutrition = merged['calories'].isna().sum()
    print(f"      ‚úì Recipes with nutrition: {with_nutrition:,}")
    print(f"      ‚ö† Recipes without nutrition: {without_nutrition:,}")
    print(f"      (Recipes without nutrition will still be included)")
else:
    merged = recipes_df
    merged['calories'] = np.nan
    merged['protein_g'] = 0
    merged['fat_g'] = 0
    merged['carbs_g'] = 0
    merged['sugar_g'] = 0
    print(f"      ‚ö† No nutrition data available")


# ============================================================================
# STEP 4: GENERATE CONVERSATIONAL PAIRS
# ============================================================================

print("\n[4/5] Generating conversational pairs...")
all_pairs = []
for _, row in tqdm(merged.iterrows(), total=len(merged), desc="Creating conversations"):
    pairs = create_conversational_pairs(row)
    all_pairs.extend(pairs)

conversational_df = pd.DataFrame(all_pairs)
initial_count = len(conversational_df)
conversational_df = conversational_df.drop_duplicates()
duplicates_removed = initial_count - len(conversational_df)
conversational_df = conversational_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"      ‚úì Generated {initial_count:,} total pairs")
print(f"      ‚úì Removed {duplicates_removed:,} duplicates")
print(f"      ‚úì Final count: {len(conversational_df):,} unique pairs")


# ============================================================================
# STEP 5: SAVE OUTPUT
# ============================================================================

print("\n[5/5] Saving conversational training data...")
output_path = "datasets/Cleaned/conversational_training_data.csv"
conversational_df.to_csv(output_path, index=False)

file_size_mb = os.path.getsize(output_path) / (1024**2)

print("\n" + "=" * 80)
print("‚úì SUCCESS! CONVERSATIONAL TRAINING DATA GENERATED")
print("=" * 80)
print(f"\nDataset Statistics:")
print(f"  Input recipes: {SAMPLE_SIZE:,}")
print(f"  Total training pairs: {len(conversational_df):,}")
print(f"  File size: {file_size_mb:.2f} MB")
print(f"  Output location: {output_path}")

print(f"\nPair Types:")
multi_turn = conversational_df['input'].str.contains('[HISTORY]').sum()
single_turn = len(conversational_df) - multi_turn
print(f"  Multi-turn conversations: {multi_turn:,}")
print(f"  Single-turn queries: {single_turn:,}")

print(f"\nThis dataset is ready for:")
print(f"  ‚úÖ LSTM training on multi-turn conversations")
print(f"  ‚úÖ Learning conversational flow patterns")
print(f"  ‚úÖ Nutritional reasoning and judgments")

# ============================================================================
# PREVIEW SAMPLES
# ============================================================================

print("\n" + "=" * 80)
print("SAMPLE CONVERSATIONAL PAIRS:")
print("=" * 80)
for i in range(min(3, len(conversational_df))):
    print(f"\n[Sample {i+1}]")
    print(f"INPUT:")
    print(f"  {conversational_df.iloc[i]['input'][:150]}...")
    print(f"\nOUTPUT:")
    print(f"  {conversational_df.iloc[i]['output'][:150]}...")
    print("-" * 80)

print("\n‚úì You now have 3 datasets ready for LSTM training!")
print("  1. nutrition_lookup.csv (reference table)")
print("  2. clean_recipes.csv (single-turn training)")
print("  3. conversational_training_data.csv (multi-turn training)")

GENERATING CONVERSATIONAL TRAINING DATA

‚öôÔ∏è  Configuration: Using 100,000 recipe sample
   Estimated processing time: 20-30 minutes
   Expected output: ~600,000 conversational pairs

[1/5] Loading existing datasets...
      ‚úì Loaded 5,155,414 total recipe pairs
      ‚úì Sampled 50,000 recipes for processing
      ‚ÑπÔ∏è  This reduces processing time to 20-30 minutes
      ‚úì Loaded 230,133 nutrition entries

[2/5] Extracting recipe information...


Processing: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50000/50000 [00:00<00:00, 58360.42it/s]


      ‚úì Extracted 23,751 unique recipes

[3/5] Merging with nutrition data...
      ‚úì Recipes with nutrition: 9,921
      ‚ö† Recipes without nutrition: 13,830
      (Recipes without nutrition will still be included)

[4/5] Generating conversational pairs...


Creating conversations: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 23751/23751 [00:00<00:00, 37255.90it/s]


      ‚úì Generated 114,834 total pairs
      ‚úì Removed 3,387 duplicates
      ‚úì Final count: 111,447 unique pairs

[5/5] Saving conversational training data...

‚úì SUCCESS! CONVERSATIONAL TRAINING DATA GENERATED

Dataset Statistics:
  Input recipes: 50,000
  Total training pairs: 111,447
  File size: 79.25 MB
  Output location: datasets/Cleaned/conversational_training_data.csv

Pair Types:
  Multi-turn conversations: 56,897
  Single-turn queries: 54,550

This dataset is ready for:
  ‚úÖ LSTM training on multi-turn conversations
  ‚úÖ Learning conversational flow patterns
  ‚úÖ Nutritional reasoning and judgments

SAMPLE CONVERSATIONAL PAIRS:

[Sample 1]
INPUT:
  [INGREDIENTS] 1 kg chicken wings, 2 tablespoons sesame oil, 3 garlic cloves, 2 cm ginger, 14 cup soy sauce, 12 cup honey, 14 cup ketjap manis (sweet s...

OUTPUT:
  okay heres how you can make it: 1. cut chicken wings into 3 segments at joints, we use everything. 2. heat a deep wok for 1 minute on high. 3. pour oi...
----

---
## Section 12: Next Steps

### Immediate Actions:
1. **Verify Outputs:** Inspect `clean_recipes.csv` and `conversational_training_data.csv`.
2. **Create splits**: Split into train/validation/test sets
3. **Generate embeddings**: Use Word2Vec, GloVe, or BERT

### Model Training:
1. **Dummy Model**: Build LSTM/RNN from scratch
2. **Core Model**: Fine-tune GPT-2 or T5
3. **Benchmark**: Compare against GPT-4/Claude

### Integration:
- Connect object detection outputs to LLM prompts
- Implement RAG for context-aware responses
- Add nutritional data extraction

---

Personal Notes:

- Can we make the model more conversational by using scraping socials (e.g. reddit subreddits, twitter etc.)
- Can we make the model more informed rather than hardcoding macronutrient sentiments?