# Preprocessing data with ECV and nutritional metrics

### Objective

This notebook preprocesses the recipe data in five steps:

1. Handling Missing Data
2. Exploratory Data Analysis (EDA)
3. Data Visualization and PCA
4. Outlier Detection
5. Bias Detection


In [1]:
import json
import sys
from pathlib import Path

import pandas as pd
import requests

# Add the parent of the working directory to the import path so `src` can be imported
project_root = Path.cwd().parent
sys.path.append(str(project_root))

from src.utils import *

In [2]:
DATA_DIR = project_root / "data"
RECIPES_FILE = DATA_DIR / "all_recipes_clean.json"

In [3]:
with open(RECIPES_FILE, "r", encoding="utf-8") as f:
    recipes = json.load(f)

df = pd.DataFrame(recipes)

## 1. Handling Missing Data

In [4]:
col_missing = pd.DataFrame({
    "contain missing data": df.isna().sum() != 0,
    "missing_frequency": df.isna().mean() * 100
})
display(col_missing)

| column | contain missing data | missing_frequency |
|---|---|---|
| title | False | 0.0 |
| url | False | 0.0 |
| rating | False | 0.0 |
| ingredients | False | 0.0 |
| total_ecv | False | 0.0 |
| total_kcal | False | 0.0 |
| total_protein | False | 0.0 |
| total_fat | False | 0.0 |
| is_vege | False | 0.0 |
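
Note that `isna()` only flags `NaN` and `None`, so an empty string or an empty ingredient list would still count as present in the table above. The following is a minimal sketch of a stricter check, not part of the original pipeline: the helper name `hidden_missing_rate` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def hidden_missing_rate(frame: pd.DataFrame) -> pd.Series:
    """Percentage of rows per column that hold NaN/None, a blank string, or an empty container."""

    def is_blank(value) -> bool:
        if isinstance(value, str):
            return value.strip() == ""
        if isinstance(value, (list, tuple, dict, set)):
            return len(value) == 0
        return bool(pd.isna(value))

    # Apply the check cell by cell, then average the boolean flags per column
    return frame.apply(lambda column: column.map(is_blank).mean() * 100)


# Toy data made up for illustration (not taken from the recipes file)
toy = pd.DataFrame({
    "title": ["Soup", "", "Salad", "Stew"],
    "ingredients": [["water"], ["flour"], [], ["beef"]],
    "rating": [4.5, None, 3.0, 5.0],
})

print(toy.isna().mean() * 100)    # only `rating` is reported as missing
print(hidden_missing_rate(toy))   # the blank title and empty ingredient list are flagged too
```

Running the same helper on `df` would show whether the all-zero result above also holds once blank titles and empty `ingredients` lists are treated as missing.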


## 2. Exploratory Data Analysis (EDA)

In [5]:
print("Shape (rows, cols):", df.shape)
print("Number of elements:", df.size)
print(df.dtypes)
print(df.describe())
print(df["ingredients"])

Shape (rows, cols): (2049, 9)
Number of elements: 18441
title             object
url               object
rating           float64
ingredients       object
total_ecv        float64
total_kcal       float64
total_protein    float64
total_fat        float64
is_vege            int64
dtype: object
            rating    total_ecv    total_kcal  total_protein    total_fat  \
count  2049.000000  2049.000000   2049.000000    2049.000000  2049.000000   
mean      3.566227     4.365994   1265.643879      79.710559    85.482191   
std       1.823196     8.337738   1824.189617     107.385338   161.573932   
min       0.000000    -0.079133    -59.333333      -0.721111    -6.170000   
25%       3.300000     0.280811    284.836364      20.468586    12.927273   
50%       4.400000     1.341192    584.200000      40.771010    25.334545   
75%       4.800000     4.250361   1373.316667      94.759394    75.062500   
max       5.000000    70.944551  13396.533333     872.433766  1330.613333   

           