# Food.com Dataset Overview
This notebook explores the structure and quality of the Food.com dataset.  
The goal is to understand the columns, inspect sample values, identify cleaning requirements, and prepare for the cleaning step.

## 1. Load the Food.com Dataset
We begin by loading the `recipes.csv` file from the `data/raw/` directory.  
This file contains over 520,000 recipes with metadata, ingredients, quantities, instructions, and nutrition information.

In [4]:
import pandas as pd 

df = pd.read_csv("../data/raw/recipes.csv")

## 2. Dataset Structure

The output of `df.info()` helps us understand:
- total number of rows  
- number of columns  
- which columns contain missing values  
- datatypes (object, float, etc.)  
- memory usage  

This is important to determine whether:
- the dataset is clean or messy
- certain columns need parsing or conversion
- columns are ready for embedding or require preprocessing

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522517 entries, 0 to 522516
Data columns (total 28 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   RecipeId                    522517 non-null  int64  
 1   Name                        522517 non-null  object 
 2   AuthorId                    522517 non-null  int64  
 3   AuthorName                  522517 non-null  object 
 4   CookTime                    439972 non-null  object 
 5   PrepTime                    522517 non-null  object 
 6   TotalTime                   522517 non-null  object 
 7   DatePublished               522517 non-null  object 
 8   Description                 522512 non-null  object 
 9   Images                      522516 non-null  object 
 10  RecipeCategory              521766 non-null  object 
 11  Keywords                    505280 non-null  object 
 12  RecipeIngredientQuantities  522514 non-null  object 
 13  RecipeIngredie

- The dataset contains **522,517 recipes**, a large corpus that should give RAG retrieval good diversity.
- There are **28 columns** in total.
- Of the key columns for our project, `RecipeIngredientParts` and `RecipeInstructions` have **zero missing values**, and `RecipeIngredientQuantities` is missing in only **3 rows** (522,514 non-null).
- Several columns have many missing values (e.g., `AggregatedRating`, `ReviewCount`, `RecipeServings`, `RecipeYield`), but these are **not important** for recipe generation.
- Nutrition columns (`Calories`, `FatContent`, `CarbohydrateContent`, etc.) all contain **522,517 non-null values** → perfect for nutrition calculation.
- Many columns use `object` type because they contain stringified lists; these will require parsing.
- Memory usage is ~112 MB, manageable for loading and cleaning.

The dataset is very complete for the features we need and requires cleaning only in text or list-formatted fields.

## 3. Column Names Overview
The goal is to identify exactly which fields we will keep for RAG and recipe generation.

In [6]:
df.columns

Index(['RecipeId', 'Name', 'AuthorId', 'AuthorName', 'CookTime', 'PrepTime',
       'TotalTime', 'DatePublished', 'Description', 'Images', 'RecipeCategory',
       'Keywords', 'RecipeIngredientQuantities', 'RecipeIngredientParts',
       'AggregatedRating', 'ReviewCount', 'Calories', 'FatContent',
       'SaturatedFatContent', 'CholesterolContent', 'SodiumContent',
       'CarbohydrateContent', 'FiberContent', 'SugarContent', 'ProteinContent',
       'RecipeServings', 'RecipeYield', 'RecipeInstructions'],
      dtype='object')

This output lists all available columns so we can identify which ones matter for CookMate:
- recipe title  
- ingredient names  
- ingredient quantities  
- instructions  
- nutrition  
- category/tags  
- metadata we can ignore (e.g., AuthorId, ReviewCount)

## 4. Inspect Ingredient Names

In [7]:
df['RecipeIngredientParts'].iloc[0]

'c("blueberries", "granulated sugar", "vanilla yogurt", "lemon juice")'

We will need to clean this by:
- removing the `c(` and trailing `)`
- splitting into a Python list of strings
- lowercasing and stripping extra characters
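
The cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's final cleaning code: `parse_r_vector` is a hypothetical helper name, and it assumes every item is wrapped in double quotes as in the sample shown.

```python
import re

# Matches one double-quoted item, allowing backslash-escaped characters inside it.
_QUOTED_ITEM = re.compile(r'"((?:[^"\\]|\\.)*)"', re.DOTALL)


def parse_r_vector(raw):
    """Parse an R-style string such as 'c("a", "b")' into a list of strings.

    Hypothetical helper (not part of pandas). Non-string input,
    for example NaN, yields an empty list.
    """
    if not isinstance(raw, str):
        return []
    text = raw.strip()
    # Drop the leading c( and the trailing ) when both are present.
    if text.startswith("c(") and text.endswith(")"):
        text = text[2:-1]
    items = _QUOTED_ITEM.findall(text)
    # Lowercase and strip surrounding whitespace; skip items that end up empty.
    return [item.strip().lower() for item in items if item.strip()]


sample = 'c("blueberries", "granulated sugar", "vanilla yogurt", "lemon juice")'
print(parse_r_vector(sample))
# → ['blueberries', 'granulated sugar', 'vanilla yogurt', 'lemon juice']
```

In the notebook this could be applied with `df['RecipeIngredientParts'].apply(parse_r_vector)`. Because the helper extracts quoted items rather than splitting on commas, a single quoted value without the `c(...)` wrapper would also parse.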

## 5. Inspect Ingredient Quantities 

In [8]:
df['RecipeIngredientQuantities'].iloc[0]

'c("4", "1/4", "1", "1")'

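This is useful because:
- the quantities line up by position with the ingredient names (four of each in this recipe)
- we can combine both lists into a structured ingredient list for the recipe card

The quantities are stored as strings, including fractions such as `1/4`, and in this sample carry no units. We will parse this column into a Python list.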
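
As a rough sketch of that pairing, assuming both columns have already been parsed into Python lists of strings: `to_number` and `pair_ingredients` are hypothetical helper names, and padding with `None` is just one possible way to handle lists of unequal length.

```python
from fractions import Fraction
from itertools import zip_longest


def to_number(quantity):
    """Convert '4' or '1/4' to a float; return None when it cannot be parsed."""
    if quantity is None:
        return None
    try:
        return float(Fraction(quantity.strip()))
    except (ValueError, ZeroDivisionError):
        # Covers values that are not plain numbers or simple fractions.
        return None


def pair_ingredients(names, quantities):
    """Pair names with quantities by position, padding the shorter list with None."""
    return [
        {"name": name, "quantity": qty, "amount": to_number(qty)}
        for name, qty in zip_longest(names, quantities)
    ]


names = ["blueberries", "granulated sugar", "vanilla yogurt", "lemon juice"]
quantities = ["4", "1/4", "1", "1"]
for row in pair_ingredients(names, quantities):
    print(row)
```

Keeping the original string in `quantity` alongside the numeric `amount` means nothing is lost when a value cannot be converted.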

## 6. Inspect Instructions

In [9]:
df['RecipeInstructions'].iloc[0]

'c("Toss 2 cups berries with sugar.", "Let stand for 45 minutes, stirring occasionally.", "Transfer berry-sugar mixture to food processor.", "Add yogurt and process until smooth.", "Strain through fine sieve. Pour into baking pan (or transfer to ice cream maker and process according to manufacturers\' directions). Freeze uncovered until edges are solid but centre is soft.  Transfer to processor and blend until smooth again.", "Return to pan and freeze until edges are solid.", "Transfer to processor and blend until smooth again.", \n"Fold in remaining 2 cups of blueberries.", "Pour into plastic mold and freeze overnight. Let soften slightly to serve.")'

These instructions look realistic and multi-step.  
This is good for RAG because the retrieved text will give the LLM real-world cooking context.

We will convert these into a clean Python list of steps.
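
One possible way to do this is sketched below. Unlike ingredient names, steps keep their original casing; the sketch only collapses stray whitespace such as the double spaces and the line break visible in the raw output. `parse_steps` is a hypothetical helper name, and the sample string is a shortened version of the output above.

```python
import re


def parse_steps(raw):
    """Split a raw 'c("...", "...")' instruction string into clean steps."""
    if not isinstance(raw, str):
        return []
    # Extract each double-quoted item; DOTALL lets an item span a line break.
    steps = re.findall(r'"((?:[^"\\]|\\.)*)"', raw, flags=re.DOTALL)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return [" ".join(step.split()) for step in steps if step.strip()]


raw = (
    'c("Toss 2 cups berries with sugar.", '
    '"Freeze uncovered until edges are solid but centre is soft.  '
    'Transfer to processor and blend until smooth again.", \n'
    '"Fold in remaining 2 cups of blueberries.")'
)
for number, step in enumerate(parse_steps(raw), start=1):
    print(number, step)
```

The newline between two items in the raw string sits outside the quotes, so it is discarded by the extraction rather than ending up inside a step.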

## 7. Missing Values Analysis
We check how many values are missing in each column. 

In [10]:
df.isnull().sum()

RecipeId                           0
Name                               0
AuthorId                           0
AuthorName                         0
CookTime                       82545
PrepTime                           0
TotalTime                          0
DatePublished                      0
Description                        5
Images                             1
RecipeCategory                   751
Keywords                       17237
RecipeIngredientQuantities         3
RecipeIngredientParts              0
AggregatedRating              253223
ReviewCount                   247489
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182911
RecipeYield                   348071
R

- All nutrition columns (`Calories`, `ProteinContent`, etc.) are **complete** (0 missing)
- Ingredient parts have **0 missing**; ingredient quantities are missing in only **3** rows
- Instructions have **0 missing**
- Missing values appear mainly in:
  - ratings (253,223)
  - review count (247,489)
  - servings (182,911) and yield (348,071)
  - cook time (82,545)  
  These are **not critical** for our project.

This confirms the dataset is reliable for recipe generation and RAG.
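
To put the raw counts in proportion, the sketch below converts some of them into shares of the 522,517 rows. The numbers are copied from the output above; inside the notebook, `df.isnull().mean()` would give the same fractions directly.

```python
TOTAL_ROWS = 522_517

# Missing-value counts copied from the df.isnull().sum() output above.
missing_counts = {
    "CookTime": 82_545,
    "Keywords": 17_237,
    "AggregatedRating": 253_223,
    "ReviewCount": 247_489,
    "RecipeServings": 182_911,
    "RecipeYield": 348_071,
}

# Fraction of rows with a missing value, per column.
missing_share = {col: n / TOTAL_ROWS for col, n in missing_counts.items()}

# Print the columns from most to least incomplete.
for column, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{column:<18}{share:6.1%}")
```

`RecipeYield` is missing in roughly two thirds of the rows and the rating columns in close to half, which supports treating them as optional metadata rather than required fields.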

## 8. Instructions Length Summary
We compute basic statistics of instruction length to understand recipe complexity.

In [11]:
df['RecipeInstructions'].apply(len).describe()

count    522517.000000
mean        594.262506
std         432.607030
min           2.000000
25%         314.000000
50%         495.000000
75%         754.000000
max       12709.000000
Name: RecipeInstructions, dtype: float64

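The values show:
- Minimum length: 2 characters  
- Median length: 495 characters  
- Maximum length: 12,709 characters  

These lengths are measured on the raw strings, so they include the `c(...)` wrapper and the quote characters.

This tells us:
- Most recipes have detailed, multi-step instructions (the middle 50% fall between 314 and 754 characters).
- A few recipes have extremely long text.
- A minimum of 2 characters is too short to hold a real step, so at least one recipe has effectively empty instructions even though the column has no nulls; such rows should be filtered out during cleaning.
- Overall, instructions are detailed enough to support high-quality RAG retrieval.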

This confirms the dataset is suitable for our use case.