# Recipes from food.com - comparison of cuisines
## Preprocessing

Kaggle data set: https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions/data

For the current purpose, we only use the file 'RAW_recipes.csv'.

In [39]:
import pandas as pd
import numpy as np
import ast  # parses list literals that are stored as text

In [55]:
food = pd.read_csv('data/RAW_recipes.csv')

In [56]:
food.columns

Index(['name', 'id', 'minutes', 'contributor_id', 'submitted', 'tags',
       'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
       'n_ingredients'],
      dtype='object')

We use only
- 'tags' (to filter by cuisine) and
- 'nutrition' (to compute the shares of fat, carbohydrates and protein).
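
Filtering by cuisine via 'tags' could look roughly like the sketch below. It assumes that 'tags' holds a list stored as text, as 'nutrition' does; the toy rows and the tag names (`'italian'`, `'mexican'`, ...) are made up for illustration and not taken from the data set.

```python
import ast
import pandas as pd

# Toy frame mimicking the 'tags' column (assumption: lists stored as text).
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'tags': [
        "['italian', 'main-dish']",
        "['mexican', 'easy']",
        "['italian', 'desserts']",
    ],
})

# Parse the text into real Python lists.
toy['tags_list'] = toy['tags'].apply(ast.literal_eval)

# Keep only the rows whose tag list contains the (hypothetical) cuisine tag.
is_italian = toy['tags_list'].apply(lambda tags: 'italian' in tags)
italian_ids = toy.loc[is_italian, 'id'].tolist()
```

Testing membership on the parsed list avoids the false positives that a plain substring search on the raw text could produce.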

### Data cleaning

In [57]:
food.drop_duplicates(subset = 'id', inplace = True)

The column 'nutrition' stores the nutrition info as text, i.e. each entry is a list written out as a string.

In [58]:
# converting the string lists into real lists
food['nutrition_list'] = food['nutrition'].apply(ast.literal_eval)

In [59]:
# extracting calories (0), fat (PDV) (1), protein (PDV) (4) and carbohydrates (PDV) (6)
# PDV stands for "percentage of daily value"
food['calories']     = food['nutrition_list'].str.get(0)
food['fats_pdv']     = food['nutrition_list'].str.get(1)
food['carbs_pdv']    = food['nutrition_list'].str.get(6)
food['proteins_pdv'] = food['nutrition_list'].str.get(4)
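
To make the parsing step concrete, here is a minimal sketch on a single made-up row. The numbers are invented for illustration; only the positions 0, 1, 4 and 6 follow the layout used above.

```python
import ast
import pandas as pd

# One made-up 'nutrition' entry, stored as text as in the CSV (values are invented).
toy = pd.DataFrame({'nutrition': ["[250.0, 10.0, 5.0, 2.0, 20.0, 3.0, 4.0]"]})

# literal_eval turns the text into a Python list without executing arbitrary code.
toy['nutrition_list'] = toy['nutrition'].apply(ast.literal_eval)

# .str.get(i) picks the i-th element of each list.
toy['calories'] = toy['nutrition_list'].str.get(0)
toy['fats_pdv'] = toy['nutrition_list'].str.get(1)
toy['proteins_pdv'] = toy['nutrition_list'].str.get(4)
toy['carbs_pdv'] = toy['nutrition_list'].str.get(6)

first = toy.loc[0, ['calories', 'fats_pdv', 'proteins_pdv', 'carbs_pdv']].tolist()
```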

Since the values for fat, protein and carbs are given as percentages of the daily value (PDV), we need to recover the actual amounts in grams.
The FDA specifies reference values for the daily intake (DV = 100%): 78 g fat, 275 g carbs, and 50 g protein for adults and children over the age of four.

$$
  g_{f} = \frac{\text{fat PDV}}{100}\cdot 78\,\mathrm{g}, \qquad
  g_{c} = \frac{\text{carbs PDV}}{100}\cdot 275\,\mathrm{g}, \qquad
  g_{p} = \frac{\text{protein PDV}}{100}\cdot 50\,\mathrm{g}
$$


In [60]:
food['carbs_g']    = food['carbs_pdv']*275.0/100.0
food['fats_g']     = food['fats_pdv']*78.0/100.0
food['proteins_g'] = food['proteins_pdv']*50.0/100.0

# to normalize the masses we need the total grams of the three macronutrients
food['total_g']= food['fats_g']+food['carbs_g']+food['proteins_g']

# later we want to classify low carb and low fat; energy shares are the standard basis for that
# (4 kcal/g for carbs and protein, 9 kcal/g for fat)
food['E_carbs']    = 4 * food['carbs_g']
food['E_fats']     = 9 * food['fats_g']
food['E_proteins'] = 4 * food['proteins_g']

# to normalize we need the total energy
food['E_total']    = food['E_fats'] + food['E_carbs'] + food['E_proteins']

# normalize to mass fractions (values between 0 and 1, despite the '_perc' suffix)
food['carbs_perc']    = food['carbs_g'] / food['total_g']
food['fats_perc']     = food['fats_g'] / food['total_g']
food['proteins_perc'] = food['proteins_g'] / food['total_g']

# normalize to energy fractions (values between 0 and 1, despite the '_perc' suffix)
food['E_carbs_perc']      = food['E_carbs'] / food['E_total']
food['E_fats_perc']       = food['E_fats'] / food['E_total']
food['E_proteins_perc']   = food['E_proteins'] / food['E_total']
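
As a worked example of the conversion above, the following sketch computes grams, mass shares and energy shares for one made-up recipe (10 % fat PDV, 4 % carbs PDV, 20 % protein PDV). The helper `macro_shares` and the dictionary names are hypothetical and only serve this illustration; the daily values and energy factors are the ones used above.

```python
# Daily values in grams and energy factors in kcal per gram, as used above.
DAILY_VALUE_G = {'fat': 78.0, 'carbs': 275.0, 'protein': 50.0}
KCAL_PER_G = {'fat': 9.0, 'carbs': 4.0, 'protein': 4.0}


def macro_shares(pdv):
    """Return grams, mass shares and energy shares for a dict of PDV values."""
    grams = {k: pdv[k] / 100.0 * DAILY_VALUE_G[k] for k in pdv}
    energy = {k: KCAL_PER_G[k] * grams[k] for k in grams}
    total_g = sum(grams.values())
    total_e = sum(energy.values())
    mass_share = {k: grams[k] / total_g for k in grams}
    energy_share = {k: energy[k] / total_e for k in energy}
    return grams, mass_share, energy_share


# Made-up recipe: 7.8 g fat, 11 g carbs, 10 g protein.
grams, mass_share, energy_share = macro_shares(
    {'fat': 10.0, 'carbs': 4.0, 'protein': 20.0}
)
```

For this recipe, fat makes up about 27 % of the macronutrient mass but about 46 % of the energy, which is why the two normalizations are kept separately.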

## ILR Coordinates

To use Euclidean methods on ternary (three-part compositional) data, we use isometric log-ratio (ILR) coordinates. The transform is

$$ z_1 = \sqrt{\frac{1}{2}}\ln\frac{x_1}{x_2}, \qquad z_2=\sqrt{\frac{1}{6}}\ln\frac{x_1x_2}{x_3^2} $$

where $x_1,x_2,x_3>0$ are the three parts of the composition (here carbohydrates, fat and protein, in that order).

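## Back transform

The inverse first maps $(z_1, z_2)$ to the centred log-ratio coordinates $y_i$ and then normalizes their exponentials: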
$$ 
x_i = \frac{\exp(y_i)}{\sum_{k=1}^3\exp(y_k)},\qquad
y_1 =   \frac{z_1}{\sqrt2} +\frac{z_2}{\sqrt6},\qquad
y_2 = - \frac{z_1}{\sqrt2} +\frac{z_2}{\sqrt6},\qquad
y_3 = - \frac{2z_2}{\sqrt6}
$$
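
The two transforms can be checked against each other with a short round trip. This is a minimal NumPy sketch of the formulas above; the function names `ilr` and `ilr_inverse` and the sample compositions are assumptions made for this illustration.

```python
import numpy as np


def ilr(x):
    """Map rows of positive parts (x1, x2, x3) to ILR coordinates (z1, z2)."""
    x = np.asarray(x, dtype=float)
    x1, x2, x3 = x[..., 0], x[..., 1], x[..., 2]
    z1 = np.log(x1 / x2) / np.sqrt(2.0)
    z2 = np.log(x1 * x2 / x3**2) / np.sqrt(6.0)
    return np.stack([z1, z2], axis=-1)


def ilr_inverse(z):
    """Map ILR coordinates (z1, z2) back to a composition that sums to 1."""
    z = np.asarray(z, dtype=float)
    z1, z2 = z[..., 0], z[..., 1]
    # centred log-ratio coordinates y1, y2, y3
    y = np.stack([
        z1 / np.sqrt(2.0) + z2 / np.sqrt(6.0),
        -z1 / np.sqrt(2.0) + z2 / np.sqrt(6.0),
        -2.0 * z2 / np.sqrt(6.0),
    ], axis=-1)
    exp_y = np.exp(y)
    return exp_y / exp_y.sum(axis=-1, keepdims=True)


# Made-up compositions (each row sums to 1).
x = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.1, 0.8]])
z = ilr(x)
x_back = ilr_inverse(z)  # should reproduce x up to rounding
```

Because the back transform renormalizes, it returns the same result for any positive rescaling of the input parts, so unnormalized grams would map to the same coordinates as the fractions.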

In [61]:
# x_1 = carb, x_2 = fat, x_3 = protein

# the ILR transform cannot handle parts equal to zero (the logarithm diverges)
# keep only recipes in which every part is larger than eps
eps = 0.001

mask = (food[['fats_perc', 'carbs_perc', 'proteins_perc']] > eps).all(axis=1)
food_nz = food.loc[mask].copy()

# transforming ternary coordinates to isometric log-ratio coordinates
# z1 carries the factor 1/sqrt(2), z2 the factor 1/sqrt(6), as in the formula above
food_nz['z1'] = (1.0/np.sqrt(2.0))*np.log(food_nz['carbs_perc']/food_nz['fats_perc'])
food_nz['z2'] = (1.0/np.sqrt(6.0))*np.log((food_nz['fats_perc']*food_nz['carbs_perc'])/food_nz['proteins_perc']**2)

### Saving preprocessed data

In [62]:
food_nz.to_csv('data/processed/food.csv')
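
When the file is loaded again, two details of the CSV round trip matter: `to_csv` writes the index as the first column by default, and list-valued columns such as 'nutrition_list' come back as text. The sketch below shows both on a made-up frame, using an in-memory buffer instead of a file on disk.

```python
import ast
import io

import pandas as pd

# Made-up frame with a non-default index and a list-valued column.
toy = pd.DataFrame(
    {'id': [10, 20],
     'nutrition_list': [[1.0, 2.0], [3.0, 4.0]],
     'z1': [0.5, -0.25]},
    index=[3, 7],
)

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as the first, unnamed column
buffer.seek(0)

# index_col=0 restores the original index instead of adding an 'Unnamed: 0' column.
restored = pd.read_csv(buffer, index_col=0)

# List columns are read back as text and have to be parsed again.
restored['nutrition_list'] = restored['nutrition_list'].apply(ast.literal_eval)
```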