# Captain Cook: the fabulous recipes explorator



Objectives:

- Create our own JSON map to plot informations about the recipes by region more specifically
- Make the map more interactive and correct the colormap issue
- Finish the ingredients list cleaning
- Use statistical properties of the English language or Levenshtein distance
- Create a user friendly recipe finder 


Bonus:

- Try to compute missing nutritional informations
- Find meaningful substitutions for ingredients

In [1]:
# Basic imports
import re
import os.path
import numpy as np
import scipy as sp
import pandas as pd

# Map-related imports
import json
import branca
import folium
from pandas.io.json import json_normalize
from IPython.core.display import display, HTML

# Plot-related imports
import seaborn as sns
import ipywidgets as widgets
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed, interact_manual

# NLP-related imports
import nltk
nltk.download('punkt');
nltk.download('averaged_perceptron_tagger');

[nltk_data] Downloading package punkt to /home/tim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/tim/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
# General parameters
%matplotlib inline
plt.style.use('seaborn')#switch to seaborn style
plt.rcParams["figure.figsize"] = [16,10]

DATA_FOLDER = './data/'

# 1. Data Loading
  
The Data has been fetched and cleaned with `BASH`scripts, please look in the *dataCleaning* section to understand how this was achieved.  

**Home made fetched dataset:**

In [3]:
# Importing ingredients to Pandas DF
allrecipes_df = pd.read_csv(DATA_FOLDER + 'allrecipes.csv', sep='\t',  header=None, encoding = "utf-8")
allrecipes_df.columns = ['ID', 'Region', 'Title', 'Ingredients', 'kcal', 'carb', 'fat', 'protein', 'sodium', 'cholesterol']

# Bug?? need to convert into numeric somes, TODO EFFICIENT WAY TO DO THIS???
allrecipes_df['kcal'] = pd.to_numeric(allrecipes_df['kcal'], errors='coerce')
allrecipes_df['carb'] = pd.to_numeric(allrecipes_df['carb'], errors='coerce') / 1000.0 # convert to g
allrecipes_df['fat'] = pd.to_numeric(allrecipes_df['fat'], errors='coerce') / 1000.0 # convert to g
allrecipes_df['protein'] = pd.to_numeric(allrecipes_df['protein'], errors='coerce')
allrecipes_df['sodium'] = pd.to_numeric(allrecipes_df['sodium'], errors='coerce') / 1000.0
allrecipes_df['cholesterol'] = pd.to_numeric(allrecipes_df['cholesterol'], errors='coerce')

# Remove any rows which isn't properly formatted
allrecipes_df = allrecipes_df.dropna()

# Remove any duplicated lines
allrecipes_df = allrecipes_df.drop_duplicates().set_index('ID')

# Printing
allrecipes_df.head(5)

Unnamed: 0_level_0,Region,Title,Ingredients,kcal,carb,fat,protein,sodium,cholesterol
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
b9705d990df6857f20756fc996a54b63,us,Traditional Indiana Persimmon Pudding,2 cups persimmon pulp |2 eggs |1 cup white sug...,278.0,53.9,3.8,7.8,0.224,35.0
4658708d644d7b446d843fed5ddf60c4,us,Fish Tacos,1 cup all-purpose flour |2 tablespoons cornsta...,409.0,43.0,18.8,17.3,0.407,54.0
beed004e2a1772ba0db9da913f54122e,us,Wisconsin Slow Cooker Brats,8 bratwurst |2 (12 fluid ounce) cans or bottle...,377.0,12.8,27.4,13.8,1.046,69.0
96353c72421bd74096277c6cf8b17097,us,Buffalo Chicken Wing Sauce,2/3 cup hot pepper sauce (such as Frank&#39;s ...,104.0,0.4,11.6,0.2,0.576,31.0
ee659a6a5e69834b60744cc3e103729e,us,Minnesota's Favorite Cookie,"1 cup butter, softened |1 1/2 cups brown sugar...",140.0,14.9,8.7,1.5,0.076,22.0


In [4]:
# Importing descriptions to Pandas DF
allrecipes_desc_df = pd.read_csv(DATA_FOLDER + 'allrecipes_desc.csv', sep='£',  header=None, encoding = "utf-8",  engine='python')
allrecipes_desc_df.columns = ['ID', 'Description']

# Remove any duplicated lines
allrecipes_desc_df = allrecipes_desc_df.drop_duplicates().set_index('ID')

allrecipes_desc_df.head(5)

Unnamed: 0_level_0,Description
ID,Unnamed: 1_level_1
b9705d990df6857f20756fc996a54b63,Preheat the oven to 350 degrees F (175 degree...
4658708d644d7b446d843fed5ddf60c4,"To make beer batter: In a large bowl, combine..."
beed004e2a1772ba0db9da913f54122e,"Place bratwurst, beer, onion, and ketchup in ..."
96353c72421bd74096277c6cf8b17097,"Combine the hot sauce, butter, vinegar, Worce..."
ee659a6a5e69834b60744cc3e103729e,Preheat oven to 350 degrees F (175 degrees C)...


In [5]:
print("Number of recipes:", len(allrecipes_df.index.unique()))

Number of recipes: 15894


**Provided Dataset**

This dataset was provided with the assignment and cleaned with the provided `Perl` scripts. 

Thanks to the scripts, we obtain two datasets:

1. `cleaned_ing.csv` contains the list of ingredients for each recipe,
2. `cleaned_nutri.csv` contains the corresponding nutritional values.

Our objective is to merge these two sets to obtain a unique set with all useful informations.

In [6]:
# Importing ingredients to Pandas DF
ing_df = pd.read_csv(DATA_FOLDER + 'cleaned_ing.csv', sep='\t',  header=None, encoding = "utf-8")
ing_df.columns = ['ID', 'Title', 'Ingredients']

# Importing nutritional values to Pandas DF
nutri_df = pd.read_csv(DATA_FOLDER + 'cleaned_nutri.csv', sep='\t',  header=None, encoding = "utf-8")
nutri_df.columns = ['ID', 'kcal', 'carb', 'fat', 'protein', 'sodium', 'cholesterol']

# Merging
ing_df = ing_df.set_index('ID')
nutri_df = nutri_df.set_index('ID')
provided_df = ing_df.merge(nutri_df, on='ID', how='inner')

# Drop NaNs and duplicate lines
provided_df = provided_df.dropna().drop_duplicates()

provided_df.head()

Unnamed: 0_level_0,Title,Ingredients,kcal,carb,fat,protein,sodium,cholesterol
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
38e1b80017526d6e59ed3f986c35a43a,T.G.I. Friday's Jack Daniels Sauce Recipe #10265,1 teaspoon onion powder|1 tablespoon Tabasco s...,?,?,?,?,?,?
a3636a4dab434fe21fbcdceba7d6fcf2,Simple Peanut Squash Recipe,1 butternut squash|2 tablespoons brown sugar|1...,536,86.4,23.6,6.5,483,61
117f3c214e9de550a157ce5ee1f1cceb,Hash Brown Breakfast Casserole Recipe,"1 lb ground sausage (""hot"" or ""sage"" flavored)...",660.4,24.7,47.3,32.5,1248.0,251.6
fde8f280a690fb8bc77c10a7193db08b,Basic Homemade Country Sausage Recipe,2 pounds lean pork|1/2 pound pork fatback|3 te...,?,?,?,?,?,?
714df642f50b9ae489d285e16b59bf7b,Spinach Frittata Recipe,1 cup fresh spinach|2 egg whites|1 egg yolk|1/...,?,?,?,?,?,?


We can observe that some nutritional values are missing, which can be solved either by removing the lines or by trying to calculate these values from the given ingredients.

As trying to calculate the values from ingredients with different units (i.e. grams, cups, tbsp, etc) requires a set of informations that we do not have, we decided to leave these lines as they are for now. 

In [7]:
# Bug?? need to convert into numeric somes, TODO EFFICIENT WAY TO DO THIS???
provided_df['kcal'] = pd.to_numeric(provided_df['kcal'], errors='coerce')
provided_df['carb'] = pd.to_numeric(provided_df['carb'], errors='coerce')
provided_df['fat'] = pd.to_numeric(provided_df['fat'], errors='coerce')
provided_df['protein'] = pd.to_numeric(provided_df['protein'], errors='coerce')
provided_df['sodium'] = pd.to_numeric(provided_df['sodium'], errors='coerce')
provided_df['cholesterol'] = pd.to_numeric(provided_df['cholesterol'], errors='coerce')

# Insert Region column to match the other DF
provided_df.insert(loc=1, column='Region', value=np.nan)
provided_df.head(5)

Unnamed: 0_level_0,Title,Region,Ingredients,kcal,carb,fat,protein,sodium,cholesterol
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
38e1b80017526d6e59ed3f986c35a43a,T.G.I. Friday's Jack Daniels Sauce Recipe #10265,,1 teaspoon onion powder|1 tablespoon Tabasco s...,,,,,,
a3636a4dab434fe21fbcdceba7d6fcf2,Simple Peanut Squash Recipe,,1 butternut squash|2 tablespoons brown sugar|1...,536.0,86.4,23.6,6.5,483.0,61.0
117f3c214e9de550a157ce5ee1f1cceb,Hash Brown Breakfast Casserole Recipe,,"1 lb ground sausage (""hot"" or ""sage"" flavored)...",660.4,24.7,47.3,32.5,1248.0,251.6
fde8f280a690fb8bc77c10a7193db08b,Basic Homemade Country Sausage Recipe,,2 pounds lean pork|1/2 pound pork fatback|3 te...,,,,,,
714df642f50b9ae489d285e16b59bf7b,Spinach Frittata Recipe,,1 cup fresh spinach|2 egg whites|1 egg yolk|1/...,,,,,,


In [8]:
print("Number of recipes:", len(provided_df.index.unique()))

Number of recipes: 31376


In [9]:
# Concatenate the 2 DF and drop any duplicated lines, it is possible since some data come from the same website!
recipes_df = allrecipes_df.append(provided_df, sort=False).drop_duplicates()
recipes_df['Region'] = recipes_df['Region'].astype('category')

recipes_df.head()

Unnamed: 0_level_0,Region,Title,Ingredients,kcal,carb,fat,protein,sodium,cholesterol
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
b9705d990df6857f20756fc996a54b63,us,Traditional Indiana Persimmon Pudding,2 cups persimmon pulp |2 eggs |1 cup white sug...,278.0,53.9,3.8,7.8,0.224,35.0
4658708d644d7b446d843fed5ddf60c4,us,Fish Tacos,1 cup all-purpose flour |2 tablespoons cornsta...,409.0,43.0,18.8,17.3,0.407,54.0
beed004e2a1772ba0db9da913f54122e,us,Wisconsin Slow Cooker Brats,8 bratwurst |2 (12 fluid ounce) cans or bottle...,377.0,12.8,27.4,13.8,1.046,69.0
96353c72421bd74096277c6cf8b17097,us,Buffalo Chicken Wing Sauce,2/3 cup hot pepper sauce (such as Frank&#39;s ...,104.0,0.4,11.6,0.2,0.576,31.0
ee659a6a5e69834b60744cc3e103729e,us,Minnesota's Favorite Cookie,"1 cup butter, softened |1 1/2 cups brown sugar...",140.0,14.9,8.7,1.5,0.076,22.0


In [10]:
print("Number of total recipes:", len(recipes_df.index.unique()))

Number of total recipes: 46999


In [11]:
len(recipes_df[recipes_df['Region']=='italian'])/365

6.841095890410959

We see that the total number of recipes is enough to eat italian recipes everyday for almost 7 years!!

# 2. Ingredient Parsing

In this part we are trying to get a list of ingredients for each recipe. This list should be clean, which means it should contain only the names of the ingredients and no other informations, like quantities.

To do this, first we cleaned the list of ingredients by applying a low-case and by removing a set of words chosen manually (contained in `black_list`), then we used the natural language processing library `nltk` to remove words different from nouns.

In [12]:
# Copy for test
recipes_copy = recipes_df.copy()

# lowercase to be insensitive
recipes_copy['Ingredients'] = recipes_copy['Ingredients'].str.lower()

# Coerce filtering, removing any occurence of these words as a first filter
black_list = ['inches','inch','medium','pounds','pound','ounces','ounces','fluid','ground','tablespoons','tablespoon','cups','cup','teaspoons','teaspoon', 'all-purpose', '\(.*\)']
recipes_copy['Ingredients'] = recipes_copy['Ingredients'].replace(black_list, '', regex=True)

# Remove non alphabetic values expect of '|' which is the seperating char
recipes_copy['Ingredients'] = recipes_copy['Ingredients'].str.replace('[^a-zA-Z ]+', ' ')

# Retrieve list of ingredients in overall
keywords_list = recipes_copy['Ingredients'].str.split(" ", expand=True).stack().unique()

In [13]:
### Retrieve bad ingredients

# NLP to identify only verbs
tokens = nltk.word_tokenize(' '.join(keywords_list))
tagged = nltk.pos_tag(tokens)

# Fetching the list of non correct word
gray_list = [word for word,pos in tagged if not(pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS')]

# Further filtering by removing gray_listed word with regex
ingredient_serie = recipes_copy['Ingredients'].replace(gray_list, '')

# Retrieve list of ingredients in overall
keywords_list = ingredient_serie.str.split(" ", expand=True).stack().unique()

In [14]:
# NLP to identify only nouns
tokens = nltk.word_tokenize(' '.join(keywords_list))
tagged = nltk.pos_tag(tokens)
nouns = [word for word,pos in tagged if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS')]

# We need to remove word that a smaller than 3 letters, as we suppose they are not ingredients
ing_list = [item for item in nouns if len(item) > 3]

At this point, we have a list of ingredients contained in `ing_list`, which can be used to filter our dataset. Unfortunately, as we are going to see below, some ingredients are not spelled correctly while others are not ingredients at all.

In [15]:
# Take original ingredients list and split each word to count recurrencies
ing_ds = recipes_copy['Ingredients'].str.split(" ", expand=True) \
                                        .stack().value_counts()  \
                                        .to_frame(name='count')  \
                                        .reset_index()

# Keeping only the ingredient in the previous list
ing_ds = ing_ds[ing_ds['index'].isin(ing_list)].reset_index(drop=True)

#ing_ds.sort_values(by='index') # if you want to see similar words
ing_ds.head(21)
#ing_ds['index'].to_csv('ing_list')

Unnamed: 0,index,count
0,salt,22841
1,pepper,20391
2,butter,13037
3,onion,11935
4,flour,11890
5,taste,10342
6,water,10221
7,powder,9551
8,milk,7421
9,sauce,7414


In [42]:
ing_ds[ing_ds['index'] == 'pioneer']

Unnamed: 0,index,count
2208,pioneer,1


We can see that the words `powder`, `taste` and `sauce` are contained in the ten most recurring words, although they are not ingredients. These words should then be parsed by hand and removed to obtain a list that is ingredients-only.

We can also notice that some ingredients are duplicated due to different spellings (i.e. `onion` and `onions`). 

We tried to implement a way to merge similar words by finding a metric that calculates the distance between words. As such the similar words should have close values given by the metric.

In [17]:
### Retrieve similar names

# We can create a space with N dimensions
# Each letter of a word is mapped to its corresponding integer in this space
# Similar words will lie closely in this space

# Convert ingredient's distribution to list  
ing_ds_list = ing_ds['index'].values.tolist()
# print("\033[1mbar before sort:\033[0m", ing_ds_list)

# Looking for the longest word/ingredient
N = len(sorted(ing_ds_list, key=len)[-1])
# print("\033[1m\nLongest word is:\033[0m", N, " long")

# For each word in the list, we append the NULL element ASCII to have the same number of elements
converted_ing_list = [item + chr(0) * (N - len(item)) for item in ing_ds_list]
# print("\033[1m\n converted_ing_list after padding:\033[0m\n", converted_ing_list)

# Convert into each spatiales ASCII -> Numpy matrix
word_matrix = np.array([[ord(char) for char in string] for string in converted_ing_list])
# print("\033[1m\n converted_ing_list after ASCII int conversion:\033[0m\n", word_matrix)

In [18]:
# Compute the distance between each row 
# Idea: use backwards propagation to calculate the optimal weights
w = [10, 4, 3, 2, 1, 1, 1, 1, 1 ,1 ,1 ,1 ,1 , 1, 1]
#distance_matrix = sp.spatial.distance.cdist(word_matrix, word_matrix, 'wminkowski', p=2, w=w)
distance_matrix = sp.spatial.distance.cdist(word_matrix, word_matrix, 'euclidean')

# print("\033[1m\nDistance of the matrix define by converted_ing_list:\033[0m\n", distance_matrix)

In [19]:
# Thresholding <-> if the distance is small enough words are the same!
normed_dist = (distance_matrix < 60).astype(int)
# print("\033[1m\nDistance of the matrix thresholded:\033[0m\n", normed_dist)

# The list has been sorted
# if we take the first non-zero value for each row we get the matching word
vec = normed_dist.argmax(axis=0)
# print("\033[1m\nIndex of corresponding words in sorted [converted_ing_list]:\033[0m\n", vec)

# Foo after TODO name
deconverted_ing_list = [converted_ing_list[i].replace(chr(0), '') for i in vec]
# print("\033[1m deconverted_ing_list operation:\033[0m", deconverted_ing_list)

In [20]:
# Result
ing_dict = dict(zip(ing_ds_list, deconverted_ing_list))

ing_dict

{'salt': 'salt',
 'pepper': 'pepper',
 'butter': 'pepper',
 'onion': 'onion',
 'flour': 'onion',
 'taste': 'onion',
 'water': 'onion',
 'powder': 'pepper',
 'milk': 'salt',
 'sauce': 'onion',
 'chicken': 'chicken',
 'cloves': 'pepper',
 'cream': 'onion',
 'eggs': 'salt',
 'juice': 'onion',
 'vanilla': 'chicken',
 'lemon': 'onion',
 'vinegar': 'chicken',
 'beef': 'salt',
 'cinnamon': 'cinnamon',
 'onions': 'pepper',
 'tomatoes': 'cinnamon',
 'bell': 'salt',
 'parsley': 'chicken',
 'slices': 'pepper',
 'pieces': 'pepper',
 'boneless': 'cinnamon',
 'rice': 'salt',
 'tomato': 'pepper',
 'bread': 'onion',
 'potatoes': 'cinnamon',
 'celery': 'pepper',
 'basil': 'onion',
 'cilantro': 'cinnamon',
 'cumin': 'onion',
 'frozen': 'pepper',
 'chocolate': 'chocolate',
 'peppers': 'chicken',
 'skinless': 'cinnamon',
 'clove': 'onion',
 'margarine': 'chocolate',
 'lime': 'salt',
 'cheddar': 'chicken',
 'corn': 'salt',
 'breast': 'pepper',
 'orange': 'pepper',
 'sweet': 'onion',
 'pork': 'salt',
 'carr

As we can see, this method is not accurate for now. We would need more time to optimize the weights to use and to filter non-ingredient words.

As an alternative option, we envision to clean the ingredient list by hand.

This algorithm does not take into account the statistical relevance of letters in the english language, but only alphabetical closeness. 

**Laste minute update:** We actually can use two different strategies here:
- Check if the word exists in the English dictionnary, if there is a word which exists also but only one letter is changing then we combine them (i.e. `onion` and `onions`).
- We can implement the [Levenshtein distance](https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance) and apply it for each word in the list sorted alphabetically with a of moving window (thus we avoid useless computing)

In [21]:
def levenshtein(s1, s2):
    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    # len(s1) >= len(s2)
    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer
            deletions = current_row[j] + 1       # than s2
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    
    return previous_row[-1]

# 3. Cooking time study-case

In this part we would like to analyze the cooking time of the recipes to be able to classify which regions have the highest and lowest cooking time.

In [22]:
# Extract all timing from recipes
timing_df = allrecipes_desc_df['Description'].str.extractall(r'(\d+) minutes|(\d+) hour|hours')
timing_df.columns = ['minutes', 'hours']

#Replace Nan by 0 and switch to int type
timing_df = timing_df.fillna(0).astype(int)

#Sum the number of minutes to get the recipe time
timing_df['Time (min)'] = timing_df['minutes']+timing_df['hours']*60

timing_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,minutes,hours,Time (min)
ID,match,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
b9705d990df6857f20756fc996a54b63,0,0,2,120
b9705d990df6857f20756fc996a54b63,1,15,0,15
beed004e2a1772ba0db9da913f54122e,0,0,4,240
beed004e2a1772ba0db9da913f54122e,1,5,0,5
ee659a6a5e69834b60744cc3e103729e,0,12,0,12


In [23]:
# Sum the total amount of time for each recipe
time_recipe = timing_df.groupby('ID').agg('sum')
time_recipe = time_recipe.drop(['minutes','hours'], axis=1)

time_recipe.head()

Unnamed: 0_level_0,Time (min)
ID,Unnamed: 1_level_1
00035a69b44a9dd1f88f2bb5faced261,90
000d31e632cab9e6902f05196354a007,660
0015417d2473d92a56da28883a27aff3,120
001cf1a5a0d1914f958cb2c823df6121,8
001f5efe07f4c72b4aaf846ec7616aba,13


# 4. Merging
Finally, we can merge everything to a single DataFrame to use it for Visualization

In [24]:
# Merging Cooking Time
cleaned_df = recipes_df.merge(time_recipe, on='ID', how='left')

# Cleaning ingredient and ingredient substition
# This is not yet implemented but we are close to achieve this

cleaned_df.sample(5)

Unnamed: 0_level_0,Region,Title,Ingredients,kcal,carb,fat,protein,sodium,cholesterol,Time (min)
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
aa74448cebb7fbe2d8f33c0dc21e4be5,,Baked Steak Fingers Recipe,"Tenderized Beef Round Steak, 2 lb|Flour, white...",154.5,4.8,4.9,21.3,62.8,51.7,
701340efcebf118926ef72e87574928f,canadian,Canadian Cedar Planked Salmon,24x8x1 inch untreated cedar plank |6 (4 ounce)...,388.0,4.6,30.9,22.8,0.068,66.0,735.0
9c0a78daec7734d6d7b25ad3b436b25d,,Homemade Candy Bars Recipe,"1/2 cup butter, softened|1 cup white sugar|3 e...",266.0,37.2,12.4,3.8,109.0,42.0,
bf5020e1771d839310864f35e8f8e5ae,mexican,Spicy Turkey Tacos,1 pound shredded cooked turkey meat |1/3 cup c...,251.0,20.2,6.4,25.3,0.922,59.0,180.0
127a1b0aee619182ec6094b505688fd9,,Ricotta Gnocchi Recipe,Gnocchi:|1 (8 ounce) container ricotta cheese|...,442.0,27.1,26.0,22.4,952.0,141.0,


# 5. Analysis

This part presents some basic statistical analysis of the data.

First we analyse the data by region and observe *mean*, *median*, *min* and *max* for each nutritional value.

In [25]:
# Some classic analysis
stats_regions = cleaned_df.groupby('Region')
stats_regions = stats_regions.agg({'kcal' : ['mean', 'median', 'min', 'max'],
                                       'carb' : ['mean', 'median', 'min', 'max'],
                                       'fat' : ['mean', 'median', 'min', 'max'],
                                       'protein' : ['mean', 'median', 'min', 'max'],
                                       'sodium' : ['mean', 'median', 'min', 'max'],
                                       'cholesterol' : ['mean', 'median', 'min', 'max'],
                                       'Time (min)' : ['mean', 'median', 'min', 'max']})
stats_regions.sort_values([('kcal', 'mean')], ascending=False).head()

Unnamed: 0_level_0,kcal,kcal,kcal,kcal,carb,carb,carb,carb,fat,fat,...,sodium,sodium,cholesterol,cholesterol,cholesterol,cholesterol,Time (min),Time (min),Time (min),Time (min)
Unnamed: 0_level_1,mean,median,min,max,mean,median,min,max,mean,median,...,min,max,mean,median,min,max,mean,median,min,max
Region,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
malaysian,435.6,427.0,33.0,1238.0,30.422857,26.7,1.7,113.2,24.091429,19.8,...,0.011,3.031,90.914286,66.0,0.0,340.0,88.424242,30.0,2.0,1460.0
portuguese,402.173333,378.0,27.0,2266.0,33.832,28.5,4.0,108.6,20.009333,17.1,...,0.003,10.693,84.986667,67.0,0.0,389.0,188.071429,50.0,2.0,1533.0
italian,391.215459,360.0,4.0,1641.0,34.242171,29.9,0.0,174.8,19.998438,17.2,...,0.001,7.648,71.113736,50.0,0.0,734.0,89.86286,39.0,0.0,2160.0
indonesian,387.648649,401.0,65.0,716.0,29.056757,19.6,6.5,94.2,21.545946,19.0,...,0.004,2.459,97.918919,68.0,0.0,500.0,68.972222,28.5,3.0,380.0
french,377.932609,319.5,9.0,3274.0,27.991739,22.05,0.4,240.9,22.572391,18.15,...,0.002,8.623,100.473913,73.5,0.0,780.0,120.860294,45.0,2.0,2880.0


# 6. Visualization

In this part we present the overall visualization of informations we retrieve in the dataset.

###  Plots

In [36]:
# Interactive plot of correlation between nutritive values 
def f(nutritive1, nutritive2):
    
    sns.set_context("notebook", font_scale=1.5)
    sns.scatterplot(cleaned_df[nutritive1], cleaned_df[nutritive2])
    plt.show()
    
# Interact
interact(f, nutritive1=['kcal', 'carb', 'fat', 'protein', 'sodium', 'cholesterol'],
            nutritive2=['kcal', 'carb', 'fat', 'protein', 'sodium', 'cholesterol']);

interactive(children=(Dropdown(description='nutritive1', options=('kcal', 'carb', 'fat', 'protein', 'sodium', …

In the plot above we can see the correlation between the different nutritional values. 

For example, there are many recipes where high carbs and fats correspond to high caloric plates, but less so for high proteins. Also it would seem that fats and cholesterol are not as correlated as we would think.

Below is a plot that shows the correlation coefficient for pairs of nutritional values by region. 

In [27]:
# Correlation between nutritional values shown per region
def f(region):
    sns.set_context("notebook", font_scale=1.5)
    
    # .iloc[:,:-1] is to avoid the Time column
    # It can be interesting to see if there is a correlation
    corr = cleaned_df.iloc[:,:-1][cleaned_df['Region'] == region].corr()
    sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()
    
# Interact
interact(f, region=cleaned_df.Region.unique().dropna());

interactive(children=(Dropdown(description='region', options=('us', 'korean', 'japanese', 'indonesian', 'thai'…

Below is a plot that shows the statistics by nutritional or time value of recipes classified by region. The plot is automatically ordered by median, so we obtain the region with the highest median for that item value.

In [28]:
# Item value statistics by regions
def f(item):
    recipe_sorted = stats_regions.sort_values([(item, 'median')], ascending=False)

    sns.set_context("notebook", font_scale=1.5)
    sns.boxplot(cleaned_df[item], cleaned_df['Region'], order=recipe_sorted.index)
    
    ## There is a big outlier for Sodium & Time, we will handle it later
    if(item == 'sodium'):
        plt.xlim(-0.5, 10)
        
    if(item == 'Time (min)'):
        plt.xlim(-50, 1500)
    ##
    plt.show()
    
# Interact
interact(f, item=['kcal', 'carb', 'fat', 'protein', 'sodium', 'cholesterol', 'Time (min)']);

interactive(children=(Dropdown(description='item', options=('kcal', 'carb', 'fat', 'protein', 'sodium', 'chole…

We can see that the most calorical, fat and protein rich recipes belong to Malaysia, while the sodium intake is won by the korean recipes. The ones that have to be most careful about the cholesterol intake seem to be the French.

By comparing the median we also see that the longest cooking time and preparation in total is for Persian recipes, whereas Japanese's recipes are the shortest

### Maps

In [29]:
# Loading JSON of world map
world_json = json.load(open(DATA_FOLDER + 'world-countries.json'))
cont_json = json.load(open(DATA_FOLDER + 'continents.json'))

At the end we will have our own JSON map done with the help of the following website:https://geojson-maps.ash.ms/  
Until then, we show relevant informations for each continent by using a dictionnary that will map `Region` name to its corresponding continent.

In [30]:
#Dic we used to set the continent depending on the region
dic_continent = {'korean': 'Asia','japanese': 'Asia','indonesian': 'Asia', 'thai': 'Asia',
 'indian': 'Asia', 'chinese': 'Asia',
 'bangladeshi': 'Asia','filipino': 'Asia', 'malaysian': 'Asia','pakistani': 'Asia','vietnamese': 'Asia', 'israeli': 'Asia',
 'persian': 'Asia','lebanese': 'Asia','us': 'North America', 'canadian': 'North America',
 'mediteranean': 'Europe', 'turkish': 'Europe', 'dutch': 'Europe', 'italian': 'Europe', 'french': 'Europe',
 'swiss': 'Europe', 'scandinavian': 'Europe','austrian': 'Europe', 'eastern_europe': 'Europe',
 'spanish': 'Europe', 'belgian': 'Europe', 'uk_and_ireland': 'Europe', 'greek': 'Europe',
 'german': 'Europe', 'portuguese': 'Europe', 'african': 'Africa', 'south_american': 'South America',
 'mexican': 'South America', 'australian': 'Australia', 'caribbean':'South America'}

In [31]:
#We create a new column in the df to set the continent depending on the Region
recipes_continent = cleaned_df.copy()
recipes_continent['Continent'] = recipes_continent['Region'].map(dic_continent)
recipes_continent.head(5)

Unnamed: 0_level_0,Region,Title,Ingredients,kcal,carb,fat,protein,sodium,cholesterol,Time (min),Continent
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
b9705d990df6857f20756fc996a54b63,us,Traditional Indiana Persimmon Pudding,2 cups persimmon pulp |2 eggs |1 cup white sug...,278.0,53.9,3.8,7.8,0.224,35.0,135.0,North America
4658708d644d7b446d843fed5ddf60c4,us,Fish Tacos,1 cup all-purpose flour |2 tablespoons cornsta...,409.0,43.0,18.8,17.3,0.407,54.0,,North America
beed004e2a1772ba0db9da913f54122e,us,Wisconsin Slow Cooker Brats,8 bratwurst |2 (12 fluid ounce) cans or bottle...,377.0,12.8,27.4,13.8,1.046,69.0,245.0,North America
96353c72421bd74096277c6cf8b17097,us,Buffalo Chicken Wing Sauce,2/3 cup hot pepper sauce (such as Frank&#39;s ...,104.0,0.4,11.6,0.2,0.576,31.0,,North America
ee659a6a5e69834b60744cc3e103729e,us,Minnesota's Favorite Cookie,"1 cup butter, softened |1 1/2 cups brown sugar...",140.0,14.9,8.7,1.5,0.076,22.0,12.0,North America


In [32]:
def layer_colormap(topojson, df, column, colorscale):
    
    # Create a layer
    feature_map = folium.FeatureGroup(name=column, overlay=False)  
    
    def style_function(feature):
    # Fetching values for the mean of the category for the given asked continent
        value = recipes_continent[recipes_continent['Continent'] == feature['properties']['CONTINENT']][column].mean()
        return {
            'color': 'black',
            'weight': 1,
            'fillOpacity': 0.5,
            'fillColor': '#black' if np.isnan(value) else colorscale(value)
                }
    # Fetch values from the DataFrame and apply the colormap to the values
    # If the value is NaN, the corresponding color is dark-grey
    folium.GeoJson(cont_json, style_function=style_function).add_to(feature_map)

    return feature_map;

In [33]:
# Create a new empty map
map_info  = folium.Map([30,0], tiles='cartodbpositron', zoom_start=2)

# Add for each nutritive information the map
for category in ['kcal','carb','fat','protein','sodium','cholesterol', 'Time (min)']:
    colorscale = branca.colormap.linear.YlOrRd_09.scale((min(stats_regions[category]['mean'])), max(stats_regions[category]['mean']))
    layer_colormap(cont_json, recipes_continent, category, colorscale).add_to(map_info)
    
# Add a legend to the colormap and append it to the base layer
colorscale.caption = 'Mean of the nutritive value selected'
map_info.add_child(colorscale) 

# Adding the tile Layer thus it is prettier
folium.TileLayer(tiles='cartodbpositron', overlay=True).add_to(map_info)

# Layer Control to select the different layer created before
folium.LayerControl(collapsed=False, position='bottomleft').add_to(map_info);

# Save/Display
map_info.save('map_info.html')
#map_info

In [34]:
%%HTML
<iframe src="map_info.html" width=100% height=700></iframe>

On the previous map, we can see how the different nutritive properties of the recipes vary through the different continents. We can thus see some correlations like the kcal of the recipe and the fat which are both high in the same continents.  

**Note:** we actually have a small issue with the colormap and we will be fixing it by using a different kind of interactive map to show more interesting information (Ingredients distribution, min/max or median for nutrition)