# Dataset Comparison: Food.com vs RecipeNLG

This notebook compares two candidate datasets for the CookMate RAG pipeline:
1. **Food.com Recipes & Reviews**
2. **RecipeNLG Dataset**

We evaluate both datasets in terms of:
* Ingredient quality
* Instructions/steps quality
* Data structure and ease of cleaning
* Dataset size
* Missing values
* Suitability for RAG indexing (FAISS/Chroma)
* Nutrition availability

## Food.com dataset

In [1]:
import pandas as pd

df_food = pd.read_csv("../data/raw/recipes.csv")
df_food.head()

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,...,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions
0,38,Low-Fat Berry Blue Frozen Dessert,1533,Dancer,PT24H,PT45M,PT24H45M,1999-08-09T21:46:00Z,Make and share this Low-Fat Berry Blue Frozen ...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,1.3,8.0,29.8,37.1,3.6,30.2,3.2,4.0,,"c(""Toss 2 cups berries with sugar."", ""Let stan..."
1,39,Biryani,1567,elly9812,PT25M,PT4H,PT4H25M,1999-08-29T13:12:00Z,Make and share this Biryani recipe from Food.com.,"c(""https://img.sndimg.com/food/image/upload/w_...",...,16.6,372.8,368.4,84.4,9.0,20.4,63.4,6.0,,"c(""Soak saffron in warm milk for 5 minutes and..."
2,40,Best Lemonade,1566,Stephen Little,PT5M,PT30M,PT35M,1999-09-05T19:52:00Z,This is from one of my first Good House Keepi...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,0.0,0.0,1.8,81.5,0.4,77.2,0.3,4.0,,"c(""Into a 1 quart Jar with tight fitting lid, ..."
3,41,Carina's Tofu-Vegetable Kebabs,1586,Cyclopz,PT20M,PT24H,PT24H20M,1999-09-03T14:54:00Z,This dish is best prepared a day in advance to...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,3.8,0.0,1558.6,64.2,17.3,32.1,29.3,2.0,4 kebabs,"c(""Drain the tofu, carefully squeezing out exc..."
4,42,Cabbage Soup,1538,Duckie067,PT30M,PT20M,PT50M,1999-09-19T06:19:00Z,Make and share this Cabbage Soup recipe from F...,"""https://img.sndimg.com/food/image/upload/w_55...",...,0.1,0.0,959.3,25.1,4.8,17.7,4.3,4.0,,"c(""Mix everything together and bring to a boil..."


In [17]:
df_food['RecipeIngredientParts'].iloc[0]

'c("blueberries", "granulated sugar", "vanilla yogurt", "lemon juice")'

In [18]:
df_food['RecipeIngredientQuantities'].iloc[0]

'c("4", "1/4", "1", "1")'

In [19]:
df_food['RecipeInstructions'].iloc[0]

'c("Toss 2 cups berries with sugar.", "Let stand for 45 minutes, stirring occasionally.", "Transfer berry-sugar mixture to food processor.", "Add yogurt and process until smooth.", "Strain through fine sieve. Pour into baking pan (or transfer to ice cream maker and process according to manufacturers\' directions). Freeze uncovered until edges are solid but centre is soft.  Transfer to processor and blend until smooth again.", "Return to pan and freeze until edges are solid.", "Transfer to processor and blend until smooth again.", \n"Fold in remaining 2 cups of blueberries.", "Pour into plastic mold and freeze overnight. Let soften slightly to serve.")'

In [20]:
len(df_food)

522517

In [21]:
df_food.isnull().sum()

RecipeId                           0
Name                               0
AuthorId                           0
AuthorName                         0
CookTime                       82545
PrepTime                           0
TotalTime                          0
DatePublished                      0
Description                        5
Images                             1
RecipeCategory                   751
Keywords                       17237
RecipeIngredientQuantities         3
RecipeIngredientParts              0
AggregatedRating              253223
ReviewCount                   247489
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182911
RecipeYield                   348071
R

### Observations
After loading the Food.com dataset, it shows the following characteristics:
* **Structure**
    - Columns include: Name, RecipeIngredientParts, RecipeIngredientQuantities, RecipeInstructions, nutrition columns (Calories, FatContent, ProteinContent, etc.) and metadata.
    - The dataset has **522,517 recipes**, which is large but still manageable for embeddings and FAISS indexing.
* **Ingredients**
    - RecipeIngredientParts contains ingredient names in R-style vector format. These can be cleaned by removing `c(` and `)` and parsing into a Python list.
    - RecipeIngredientQuantities contains corresponding amounts in the same format.
    - This split into **names** and **quantities** lets us handle ingredient names separately from amounts.
* **Instructions**
    - RecipeInstructions also comes in R-style list syntax.
    - Steps are real, coherent, multi-step cooking instructions with excellent structure for RAG.
* **Nutrition**
    - Contains complete nutrition per recipe: Calories, FatContent, ProteinContent, etc.
    - Nutrition values can therefore be read directly from the dataset instead of being estimated.
* **Missing Values**
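    - Missing values are concentrated in RecipeYield (about two thirds of rows), AggregatedRating and ReviewCount (nearly half of rows each), RecipeServings (about a third) and CookTime (about 16%).
    - The ingredient, instruction and nutrition columns we rely on are essentially complete, so these gaps **do not affect our project**.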

Therefore, this first dataset has clean, high-quality text for ingredients and instructions, includes nutrition data, is well suited to RAG and requires only **moderate cleaning**.
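
The R-style `c(...)` strings described in the observations can be parsed in a few lines. The following is a minimal sketch, not the project's actual cleaning code: `parse_r_vector` is a hypothetical helper name, and it assumes that items are double-quoted and that a missing item, if one appears, is written as an unquoted `NA`.

```python
import re

# One item of an R-style vector: either a double-quoted string (backslash
# escapes allowed) or an unquoted NA. Assumption: no other item forms occur.
_R_ITEM = re.compile(r'"((?:[^"\\]|\\.)*)"|\bNA\b')


def parse_r_vector(value):
    """Turn an R-style 'c("a", "b")' string into a Python list."""
    if not isinstance(value, str):
        return []  # NaN or otherwise missing cell
    items = []
    for match in _R_ITEM.finditer(value):
        text = match.group(1)
        if text is None:
            items.append(None)  # unquoted NA kept as None to preserve positions
        else:
            items.append(re.sub(r"\\(.)", r"\1", text))  # undo backslash escapes
    return items


# Values copied from the first Food.com row shown above.
parts = parse_r_vector('c("blueberries", "granulated sugar", "vanilla yogurt", "lemon juice")')
quantities = parse_r_vector('c("4", "1/4", "1", "1")')

# Pair each quantity with its ingredient name.
pairs = list(zip(quantities, parts))
print(pairs)
```

Applied to the whole column, this would look like `df_food['RecipeIngredientParts'].apply(parse_r_vector)`. Keeping `NA` as `None` rather than dropping it is a design choice that keeps quantities aligned with ingredient names.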

## RecipeNLG dataset

In [2]:
df_nlg = pd.read_csv("../data/raw/RecipeNLG_dataset.csv")
df_nlg.head()

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."


In [9]:
df_nlg['ingredients'].iloc[0]

'["1 c. firmly packed brown sugar", "1/2 c. evaporated milk", "1/2 tsp. vanilla", "1/2 c. broken nuts (pecans)", "2 Tbsp. butter or margarine", "3 1/2 c. bite size shredded rice biscuits"]'

In [11]:
df_nlg['directions'].iloc[0]

'["In a heavy 2-quart saucepan, mix brown sugar, nuts, evaporated milk and butter or margarine.", "Stir over medium heat until mixture bubbles all over top.", "Boil and stir 5 minutes more. Take off heat.", "Stir in vanilla and cereal; mix well.", "Using 2 teaspoons, drop and shape into 30 clusters on wax paper.", "Let stand until firm, about 30 minutes."]'

In [13]:
len(df_nlg)

2231142

In [15]:
df_nlg.isnull().sum()

Unnamed: 0     0
title          1
ingredients    0
directions     0
link           0
source         0
NER            0
dtype: int64

In [16]:
df_nlg['directions'].apply(len).describe()

count    2.231142e+06
mean     5.051099e+02
std      4.524093e+02
min      5.000000e+00
25%      2.210000e+02
50%      3.710000e+02
75%      6.410000e+02
max      1.497900e+04
Name: directions, dtype: float64

### Observations
After loading the RecipeNLG dataset, it shows the following characteristics:
* **Structure**
    - Columns include: title, ingredients, directions, link, source, NER.
    - Almost **no missing values** (only one missing title).
    - The dataset has **2,231,142 recipes**, which is very large.
* **Ingredients**
    - ingredients column is a stringified Python list.
    - These are usable, but include quantities mixed with names.
* **Instructions**
    - directions column is also a stringified list.
    - Instruction length varies widely: the stringified directions have a mean length of 505 characters, a median of 371 and a maximum of 14,979, with a minimum of only 5.
    - This wide spread suggests inconsistent quality.
* **NER Column**
    - The NER column provides normalized ingredient names.
    - This is useful for diet checking or constraint checking.
* **Downsides**
    - No nutrition information.
    - Very inconsistent instruction quality.
    - Text is less structured, so it needs more preprocessing before it can be indexed for RAG.
    - Embedding 2.2 million recipes is extremely slow and memory-heavy.
    - Retrieval on such a huge dataset would require complex optimization.


Therefore, this is a very large dataset with mixed-quality instructions and no nutrition data. It is harder to clean and heavy for FAISS indexing.
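
The length statistics above count characters of the stringified list, brackets and quotes included. A step count is often easier to interpret. The sketch below assumes the stringified lists are valid JSON, as the double-quoted samples above suggest; `parse_json_list` is a hypothetical helper name, and cells that fail to parse are treated as empty lists.

```python
import json


def parse_json_list(value):
    """Parse a stringified list such as '["a", "b"]'; return [] if parsing fails."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):  # non-string cell or malformed JSON
        return []
    return parsed if isinstance(parsed, list) else []


# First two steps of the sample recipe shown above.
sample = (
    '["In a heavy 2-quart saucepan, mix brown sugar, nuts, evaporated milk '
    'and butter or margarine.", '
    '"Stir over medium heat until mixture bubbles all over top."]'
)

steps = parse_json_list(sample)
n_steps = len(steps)   # number of instruction steps
n_chars = len(sample)  # raw string length, as measured by .apply(len) above

print(n_steps, n_chars)
```

On the full column, `df_nlg['directions'].apply(parse_json_list).apply(len).describe()` would summarize steps per recipe instead of characters per cell.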

## Final Dataset Selection
After comparing both datasets, we will use the **Food.com** dataset because it has a better ingredient representation, high-quality instructions, nutrition data, a reasonable size for embeddings and lower cleaning requirements.
We rule out RecipeNLG because it has no nutrition fields, its ingredient lists mix quantities with names, its instructions vary in quality, and its size makes indexing slow and cleaning costly.
This makes Food.com the better choice for CookMate's RAG pipeline.
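
To put the size difference in numbers, the raw storage for one float32 embedding per recipe can be estimated as rows × dimensions × 4 bytes. The embedding dimension of 384 below is an assumption chosen for illustration, not a model selected in this notebook, and the estimate covers only an uncompressed flat vector store, not index overhead or embedding time.

```python
BYTES_PER_FLOAT32 = 4
EMBEDDING_DIM = 384  # assumption: illustrative dimension, not a chosen model


def flat_store_bytes(n_vectors, dim=EMBEDDING_DIM):
    """Bytes needed to hold n_vectors float32 embeddings of size dim."""
    return n_vectors * dim * BYTES_PER_FLOAT32


# Row counts taken from the len() outputs above.
row_counts = {"Food.com": 522_517, "RecipeNLG": 2_231_142}
sizes = {name: flat_store_bytes(n) for name, n in row_counts.items()}

for name, size in sizes.items():
    print(f"{name}: {size / 1e9:.2f} GB")
# Food.com: 0.80 GB
# RecipeNLG: 3.43 GB
```

Under this assumption RecipeNLG needs a bit more than four times the storage of Food.com, in line with the ratio of row counts.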