# Explore here

## 1) Import required libraries

In [46]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2) Load dataset

In [47]:
recipes_df = pd.read_csv('../data/csv/epi_r.csv')

## 3) Visualize dataset information

In [48]:
print(f'🔹 Dataset shape: {recipes_df.shape}')
print('🔹 Dataset information: ')
recipes_df.describe()

🔹 Dataset shape: (20052, 680)
🔹 Dataset information: 


Unnamed: 0,rating,calories,protein,fat,sodium,#cakeweek,#wasteless,22-minute meals,3-ingredient recipes,30 days of groceries,...,yellow squash,yogurt,yonkers,yuca,zucchini,cookbooks,leftovers,snack,snack week,turkey
count,20052.0,15935.0,15890.0,15869.0,15933.0,20052.0,20052.0,20052.0,20052.0,20052.0,...,20052.0,20052.0,20052.0,20052.0,20052.0,20052.0,20052.0,20052.0,20052.0,20052.0
mean,3.714467,6322.958,100.160793,346.8775,6225.975,0.000299,5e-05,0.000848,0.001346,0.000349,...,0.001247,0.026332,5e-05,0.000299,0.014861,0.00015,0.000349,0.001396,0.000948,0.022741
std,1.340829,359046.0,3840.318527,20456.11,333318.2,0.017296,0.007062,0.029105,0.036671,0.018681,...,0.035288,0.160123,0.007062,0.017296,0.121001,0.012231,0.018681,0.037343,0.030768,0.14908
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.75,198.0,3.0,7.0,80.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.375,331.0,8.0,17.0,294.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.375,586.0,27.0,33.0,711.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,5.0,30111220.0,236489.0,1722763.0,27675110.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Feature selection required before proceeding
As the dataset contains many dimensions (680 columns), selecting the most relevant features is essential for visualization and preprocessing.

The project proceeds by splitting the dataset into 2 separate dataframes, which are later rejoined:
- **Dataframe with nutritional values and diet labels**
- **Dataframe with the 150 most frequent ingredients**
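One way the second dataframe could be built is sketched below. It assumes that ingredient columns are the 0/1 indicator columns of `recipes_df` and ranks them by how many recipes use them. The helper name `top_binary_columns` and the toy data are hypothetical. Note that non-ingredient tags (for example `#cakeweek` or `22-minute meals`) are also 0/1 columns, so they would need to be listed in `exclude` by hand.

```python
import pandas as pd

def top_binary_columns(df: pd.DataFrame, exclude: list, n: int = 150) -> list:
    """Return the n most frequently set 0/1 columns of df (hypothetical helper)."""
    # Keep numeric columns only, after dropping columns known not to be ingredients
    candidates = df.drop(columns=exclude, errors='ignore').select_dtypes('number')
    # A column counts as an indicator if every value is exactly 0 or 1
    is_binary = candidates.isin([0, 1]).all()
    # The column sum equals the number of recipes that carry the indicator
    counts = candidates.loc[:, is_binary].sum().sort_values(ascending=False)
    return counts.head(n).index.tolist()

# Toy data standing in for recipes_df
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'calories': [100.0, 200.0, 300.0, 50.0],
    'onion': [1.0, 1.0, 1.0, 0.0],
    'garlic': [1.0, 1.0, 0.0, 0.0],
    'yuca': [0.0, 0.0, 0.0, 1.0],
})
top_two = top_binary_columns(toy, exclude=['title', 'calories'], n=2)
print(top_two)
```

On the real data, the call would look like `top_binary_columns(recipes_df, exclude=[...], n=150)`, with the exclusion list chosen after inspecting the column names.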

#### Creating Nutrition Dataframe

In [49]:
nutrition_df = recipes_df[['title', 'calories', 'protein', 'fat', 'sodium', 'vegan', 'vegetarian', 'wheat/gluten-free', 'no sugar added']]
nutrition_df.head()


Unnamed: 0,title,calories,protein,fat,sodium,vegan,vegetarian,wheat/gluten-free,no sugar added
0,"Lentil, Apple, and Turkey Wrap",426.0,30.0,7.0,559.0,0.0,0.0,0.0,0.0
1,Boudin Blanc Terrine with Red Onion Confit,403.0,18.0,23.0,1439.0,0.0,0.0,0.0,0.0
2,Potato and Fennel Soup Hodge,165.0,6.0,7.0,165.0,0.0,0.0,0.0,0.0
3,Mahi-Mahi in Tomato Olive Sauce,,,,,0.0,0.0,0.0,0.0
4,Spinach Noodle Casserole,547.0,20.0,32.0,452.0,0.0,1.0,0.0,0.0


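To complete the macronutrient profile of each recipe, the **carbohydrates feature** (not present in the dataframe) is derived from the energy balance. Carbohydrates (C) and protein (P) each provide 4 kcal/g and fat (F) provides 9 kcal/g, so kcal = 4·P + 4·C + 9·F with P, C and F in grams, which gives C = (kcal - 4·P - 9·F) / 4.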
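As a quick arithmetic check of this energy balance, the sketch below applies it to single recipes. The function name `carbs_from_energy` is hypothetical. The input values are taken from the first two rows of the nutrition dataframe.

```python
def carbs_from_energy(calories: float, protein_g: float, fat_g: float) -> float:
    """Estimate carbohydrate grams from total kcal, protein grams and fat grams."""
    # Energy left after protein (4 kcal/g) and fat (9 kcal/g), converted at 4 kcal/g
    carbs = (calories - 4 * protein_g - 9 * fat_g) / 4
    # Reported values may be inconsistent, so negative estimates are floored at zero
    return max(carbs, 0.0)

lentil_wrap = carbs_from_energy(426.0, 30.0, 7.0)    # (426 - 120 - 63) / 4
boudin_blanc = carbs_from_energy(403.0, 18.0, 23.0)  # (403 - 72 - 207) / 4
inconsistent = carbs_from_energy(100.0, 10.0, 10.0)  # negative estimate, floored
print(lentil_wrap, boudin_blanc, inconsistent)
```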

In [50]:
# Work on an explicit copy so the column assignments below act on an independent dataframe
nutrition_df = nutrition_df.dropna().copy()
# Carbohydrates (g) = (calories - 4 * protein - 9 * fat) / 4
nutrition_df['carbohydrates'] = (nutrition_df['calories'] - nutrition_df['protein'] * 4 - nutrition_df['fat'] * 9) / 4
# Apply .clip(lower=0) to avoid negative values
nutrition_df['carbohydrates'] = nutrition_df['carbohydrates'].clip(lower=0).astype('float')
# Reorder columns ('vegan' is not included in this selection)
nutrition_df = nutrition_df[['title', 'calories', 'protein', 'carbohydrates', 'fat', 'sodium', 'vegetarian', 'wheat/gluten-free', 'no sugar added']]
nutrition_df.head()

Unnamed: 0,title,calories,protein,carbohydrates,fat,sodium,vegetarian,wheat/gluten-free,no sugar added
0,"Lentil, Apple, and Turkey Wrap",426.0,30.0,60.75,7.0,559.0,0.0,0.0,0.0
1,Boudin Blanc Terrine with Red Onion Confit,403.0,18.0,31.0,23.0,1439.0,0.0,0.0,0.0
2,Potato and Fennel Soup Hodge,165.0,6.0,19.5,7.0,165.0,0.0,0.0,0.0
4,Spinach Noodle Casserole,547.0,20.0,44.75,32.0,452.0,1.0,0.0,0.0
5,The Best Blts,948.0,19.0,40.25,79.0,1042.0,0.0,0.0,0.0


To derive the nutrition tags (High Protein, Low Carb, Is Balanced), the macronutrient dimensions (calories, protein, carbohydrates, fat) are used:
- **High Protein**: 40% or more of calories come from protein.
- **Low Carb**: 10% or less of calories come from carbohydrates.
- **Is Balanced**: about 40% of calories come from protein, about 30% from carbohydrates, and about 30% from fat.
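The tag definitions above could be computed along the following lines. This is a minimal sketch, not necessarily how the project implements it. The output column names (`high_protein`, `low_carb`, `is_balanced`) are hypothetical, and "about" is interpreted as within ±5 percentage points (`tol=0.05`), which is an assumption. Calorie shares are taken relative to the reported `calories` column.

```python
import numpy as np
import pandas as pd

def add_nutrition_tags(df: pd.DataFrame, tol: float = 0.05) -> pd.DataFrame:
    """Add 0/1 tag columns based on each macronutrient's share of calories."""
    out = df.copy()
    # Zero-calorie rows get NaN shares, so every comparison below is False for them
    kcal = out['calories'].replace(0, np.nan)
    protein_share = out['protein'] * 4 / kcal
    carb_share = out['carbohydrates'] * 4 / kcal
    fat_share = out['fat'] * 9 / kcal
    out['high_protein'] = (protein_share >= 0.40).astype(int)
    out['low_carb'] = (carb_share <= 0.10).astype(int)
    # "About" is assumed to mean within +/- tol of the 40/30/30 target split
    out['is_balanced'] = (
        protein_share.sub(0.40).abs().le(tol)
        & carb_share.sub(0.30).abs().le(tol)
        & fat_share.sub(0.30).abs().le(tol)
    ).astype(int)
    return out

# Toy rows: near 40/30/30, the lentil wrap from the data, a low-carb case, zero calories
sample = pd.DataFrame({
    'calories': [400.0, 426.0, 200.0, 0.0],
    'protein': [42.0, 30.0, 25.0, 0.0],
    'carbohydrates': [30.0, 60.75, 2.5, 0.0],
    'fat': [12.0, 7.0, 10.0, 0.0],
})
tagged = add_nutrition_tags(sample)
print(tagged[['high_protein', 'low_carb', 'is_balanced']])
```

Because carbohydrates were clipped at zero earlier, recipes with inconsistent reported values end up with a carbohydrate share of 0 and are tagged Low Carb under this sketch; whether that is desirable depends on how such rows should be treated.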