# Exploratory Data Analysis (EDA)

This notebook performs an exploratory data analysis (EDA) of the RecipeNLG dataset (`RecipeNLG_dataset.csv`). The analysis covers loading the data, inspecting its structure, and visualizing key insights.

In [3]:
# Import necessary libraries
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
import os
import ast

In [4]:
print(os.getcwd())

d:\Programming\ai_ds_bootcamp\nutrition-ai-assistent\notebooks


## Load and Inspect the Dataset

In this section, we load the dataset and inspect its structure, including the total number of recipes, column names, and a preview of the first few rows.

In [5]:
# Load the dataset
df = pd.read_csv("../data/raw/RecipeNLG_dataset.csv")

# Display basic information about the dataset
print(f"Total recipes: {len(df)}")
print(f"Columns: {df.columns.tolist()}")
print(df.head(3))

Total recipes: 2231142
Columns: ['Unnamed: 0', 'title', 'ingredients', 'directions', 'link', 'source', 'NER']
   Unnamed: 0                  title  \
0           0    No-Bake Nut Cookies   
1           1  Jewell Ball'S Chicken   
2           2            Creamy Corn   

                                         ingredients  \
0  ["1 c. firmly packed brown sugar", "1/2 c. eva...   
1  ["1 small jar chipped beef, cut up", "4 boned ...   
2  ["2 (16 oz.) pkg. frozen corn", "1 (8 oz.) pkg...   

                                          directions  \
0  ["In a heavy 2-quart saucepan, mix brown sugar...   
1  ["Place chipped beef on bottom of baking dish....   
2  ["In a slow cooker, combine all ingredients. C...   

                                              link    source  \
0   www.cookbooks.com/Recipe-Details.aspx?id=44874  Gathered   
1  www.cookbooks.com/Recipe-Details.aspx?id=699419  Gathered   
2   www.cookbooks.com/Recipe-Details.aspx?id=10570  Gathered   
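
The preview shows that `ingredients`, `directions`, and `NER` hold strings that look like JSON arrays rather than Python lists. The sketch below shows one way to parse such a cell. The helper name `parse_list_cell` and the sample value are illustrative assumptions. It falls back from `json.loads` to `ast.literal_eval` in case some rows use Python-style quoting, which has not been verified for this dataset.

```python
import ast
import json

def parse_list_cell(cell):
    """Parse a stringified list such as '["a", "b"]' into a Python list.

    Hypothetical helper: tries JSON first, then Python literal syntax,
    and returns an empty list if neither parser accepts the input.
    """
    try:
        parsed = json.loads(cell)
    except (json.JSONDecodeError, TypeError):
        try:
            parsed = ast.literal_eval(cell)
        except (ValueError, SyntaxError, TypeError):
            return []
    return parsed if isinstance(parsed, list) else []

# Illustrative cell value in the same shape as the preview output.
sample = '["1 c. firmly packed brown sugar", "2 tsp. vanilla"]'
items = parse_list_cell(sample)
print(len(items), items[0])
```

Applied to the full column (for example `df['ingredients'].apply(parse_list_cell)`), this could be slow on 2.2M rows, so trying it on a sample first is probably sensible.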

                      

## Check for Missing Values

Here, we check for missing values in the dataset to understand data quality and identify columns that may require cleaning.

In [6]:
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())


Missing values:
Unnamed: 0     0
title          1
ingredients    0
directions     0
link           0
source         0
NER            0
dtype: int64


## Duplicates

In [7]:
print("Duplicate rows:", df.duplicated(subset=['title', 'ingredients', 'directions']).sum())


Duplicate rows: 0


In [8]:
# Check for rows that share the same title and NER (extracted ingredient entities)
duplicates = df.duplicated(subset=['title', 'NER']).sum()
print(f"Exact duplicates: {duplicates}")

Exact duplicates: 27992
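
The two checks disagree: no rows repeat on `title`, `ingredients`, and `directions` together, yet 27,992 rows repeat on `title` and `NER`. One possible reading is that these recipes share a name and entity list but differ in wording or quantities. The sketch below builds a normalized comparison key that could be used to look for further near-duplicates. The helper `recipe_key` and the sample rows are assumptions for illustration, not part of the original analysis.

```python
import json

def recipe_key(title, ner_cell):
    """Build a normalized (title, entities) key for near-duplicate checks.

    Hypothetical helper: lowercases the title and collapses its whitespace,
    and treats the NER entities as an order-insensitive set.
    """
    entities = json.loads(ner_cell)
    norm_title = " ".join(str(title).lower().split())
    norm_entities = tuple(sorted({e.strip().lower() for e in entities}))
    return norm_title, norm_entities

rows = [  # illustrative rows, not taken from the dataset
    ("Creamy Corn", '["corn", "butter", "cream cheese"]'),
    ("creamy  corn ", '["butter", "cream cheese", "corn"]'),
    ("Chicken Funny", '["chicken", "rice"]'),
]

seen = set()
near_duplicates = 0
for title, ner in rows:
    key = recipe_key(title, ner)
    if key in seen:
        near_duplicates += 1
    seen.add(key)

print(f"Near duplicates: {near_duplicates}")
```

Whether such rows should be dropped depends on the downstream use. Variants of the same dish may still be worth keeping.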


## Metric and Imperial 

In [9]:
import re

# Imperial keywords
imperial_units = r'\b(cup|cups|oz|ounce|ounces|lb|lbs|pound|pounds|teaspoon|tsp|tablespoon|tbsp|inch|inches)\b'

# Metric keywords
metric_units = r'\b(gram|grams|g|ml|milliliter|milliliters|liter|liters|l|kg|kilogram|kilograms|cm|centimeter)\b'

def detect_unit_system(text):
    text = text.lower()
    imp_count = len(re.findall(imperial_units, text))
    met_count = len(re.findall(metric_units, text))
    
    if imp_count > 0 and met_count == 0:
        return 'Imperial'
    elif met_count > 0 and imp_count == 0:
        return 'Metric'
    elif imp_count > 0 and met_count > 0:
        return 'Mixed'
    else:
        return 'Unknown/Unitless'

# Apply to the full column (2.2M rows, so the regex pass may take a while)
df['unit_system'] = df['ingredients'].astype(str).apply(detect_unit_system)

# Let's see the distribution
print(df['unit_system'].value_counts())



unit_system
Imperial            1978716
Unknown/Unitless     154649
Mixed                 83321
Metric                14456
Name: count, dtype: int64
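
The imperial pattern does not cover dotted abbreviations such as "c." for cup, which appear in the preview ("1 c. firmly packed brown sugar"). Some of the 154,649 `Unknown/Unitless` recipes may therefore contain units the pattern misses. This is a hypothesis, not something measured here. The sketch below illustrates the gap on a single line; `cup_abbreviation` and `count_imperial` are assumed names introduced only for this example.

```python
import re

# Pattern copied from the unit-detection cell.
imperial_units = r'\b(cup|cups|oz|ounce|ounces|lb|lbs|pound|pounds|teaspoon|tsp|tablespoon|tbsp|inch|inches)\b'

# Assumed extension (not part of the original analysis): the dotted
# abbreviation "c." for cup. No trailing \b, because "." is not a word character.
cup_abbreviation = r'\bc\.'

def count_imperial(text, include_abbreviations=False):
    """Count imperial unit tokens, optionally including the "c." abbreviation."""
    text = text.lower()
    count = len(re.findall(imperial_units, text))
    if include_abbreviations:
        count += len(re.findall(cup_abbreviation, text))
    return count

line = "1 c. firmly packed brown sugar"  # taken from the preview output
print(count_imperial(line))
print(count_imperial(line, include_abbreviations=True))
```

The first call finds no unit and the second finds one. Other abbreviations ("pkg.", "qt.", "pt.") would need similar handling if they matter for the classification.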


In [10]:
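def check_temperature_system(text):
    # Matches e.g. "350 F", "350°F", or "350 degrees F" vs. "180 C".
    # The trailing \b keeps quantities such as "2 cups" or "4 fillets" from matching;
    # a bare "350 degrees" with no unit letter is not classified.
    if re.search(r'\d+\s*(?:°|degrees?)?\s*(?:f|fahrenheit)\b', text, re.I):
        return 'Fahrenheit'
    if re.search(r'\d+\s*(?:°|degrees?)?\s*(?:c|celsius)\b', text, re.I):
        return 'Celsius'
    return 'None'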

df['temp_system'] = df['directions'].astype(str).apply(check_temperature_system)
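
Temperature patterns are sensitive to word boundaries. A pattern such as `\d+\s*(c|celsius)` with no boundary after the unit letter also matches quantities like "2 cups". The sketch below compares such a loose pattern with a bounded variant on made-up direction snippets. Both pattern names and all sample strings are illustrative assumptions.

```python
import re

# Illustrative patterns: the loose one has no word boundary after the unit
# letter; the bounded one requires it and allows an optional degree marker.
LOOSE_CELSIUS = re.compile(r'\d+\s*(c|celsius)', re.I)
BOUNDED_CELSIUS = re.compile(r'\d+\s*(?:°|degrees?)?\s*(?:c|celsius)\b', re.I)

samples = [  # made-up direction snippets
    "Add 2 cups flour and stir.",
    "Bake at 180 C for 20 minutes.",
    "Heat oven to 180 degrees Celsius.",
]

results = []
for text in samples:
    loose = bool(LOOSE_CELSIUS.search(text))
    bounded = bool(BOUNDED_CELSIUS.search(text))
    results.append((loose, bounded))
    print(f"{text!r}: loose={loose}, bounded={bounded}")
```

On these samples the loose pattern flags "2 cups" as a Celsius temperature and misses "180 degrees Celsius", while the bounded variant handles both. Inspecting `df['temp_system'].value_counts()` alongside a few sampled rows would show whether the classification is plausible on the real data.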

In [11]:
# drop source and link
df = df.drop(columns=['source', 'link'])

In [12]:
print(df.head(4))

   Unnamed: 0                  title  \
0           0    No-Bake Nut Cookies   
1           1  Jewell Ball'S Chicken   
2           2            Creamy Corn   
3           3          Chicken Funny   

                                         ingredients  \
0  ["1 c. firmly packed brown sugar", "1/2 c. eva...   
1  ["1 small jar chipped beef, cut up", "4 boned ...   
2  ["2 (16 oz.) pkg. frozen corn", "1 (8 oz.) pkg...   
3  ["1 large whole chicken", "2 (10 1/2 oz.) cans...   

                                          directions  \
0  ["In a heavy 2-quart saucepan, mix brown sugar...   
1  ["Place chipped beef on bottom of baking dish....   
2  ["In a slow cooker, combine all ingredients. C...   
3  ["Boil and debone chicken.", "Put bite size pi...   

                                                 NER       unit_system  \
0  ["brown sugar", "milk", "vanilla", "nuts", "bu...          Imperial   
1  ["beef", "chicken breasts", "cream of mushroom...  Unknown/Unitless   
2  ["frozen cor

In [17]:
# Load the healthy eating dataset
df_health = pd.read_csv("../data/raw/healthy_eating_dataset.csv")

In [18]:
# drop image_url and is_healthy
df_health = df_health.drop(columns=['image_url', 'is_healthy'])

In [19]:
# Display basic information about the dataset
print(f"Total recipes: {len(df_health)}")
print(f"Columns: {df_health.columns.tolist()}")
print(df_health.head(3))

Total recipes: 2000
Columns: ['meal_id', 'meal_name', 'cuisine', 'meal_type', 'diet_type', 'calories', 'protein_g', 'carbs_g', 'fat_g', 'fiber_g', 'sugar_g', 'sodium_mg', 'cholesterol_mg', 'serving_size_g', 'cooking_method', 'prep_time_min', 'cook_time_min', 'rating']
   meal_id      meal_name  cuisine meal_type diet_type  calories  protein_g  \
0        1      Kid Pasta   Indian     Lunch      Keto       737       52.4   
1        2   Husband Rice  Mexican     Lunch     Paleo       182       74.7   
2        3  Activity Rice   Indian     Snack     Paleo       881       52.9   

   carbs_g  fat_g  fiber_g  sugar_g  sodium_mg  cholesterol_mg  \
0     43.9   34.3     16.8     42.9       2079              91   
1    144.4    0.1     22.3     38.6        423               7   
2     97.3   18.8     20.0     37.5       2383             209   

   serving_size_g cooking_method  prep_time_min  cook_time_min  rating  
0             206        Grilled             47             56     4.4  
1  
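
A quick plausibility check on the nutrition columns is to compare the listed `calories` with an estimate from the macronutrients using the general Atwater factors (4 kcal/g for protein and carbohydrate, 9 kcal/g for fat). The sketch below applies this to the three rows visible in the preview. The helper name `atwater_calories` is an assumption. Three rows say nothing about the full dataset, so this is only a starting point.

```python
# Rows copied from the preview output:
# (meal_name, calories, protein_g, carbs_g, fat_g)
preview_rows = [
    ("Kid Pasta", 737, 52.4, 43.9, 34.3),
    ("Husband Rice", 182, 74.7, 144.4, 0.1),
    ("Activity Rice", 881, 52.9, 97.3, 18.8),
]

def atwater_calories(protein_g, carbs_g, fat_g):
    """Estimate energy in kcal with the general Atwater factors (4/4/9 kcal per gram)."""
    return 4 * protein_g + 4 * carbs_g + 9 * fat_g

for name, calories, protein, carbs, fat in preview_rows:
    estimate = atwater_calories(protein, carbs, fat)
    ratio = calories / estimate
    print(f"{name}: listed={calories}, estimated={estimate:.1f}, ratio={ratio:.2f}")
```

A large gap between listed and estimated values, as for "Husband Rice" here, may indicate that the nutrition columns are not mutually consistent. That would need to be checked across all 2,000 rows before drawing conclusions.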