# Project - The Food symphony 🥙
 <p><div class="lev1"><a href="#1/-Introduction-of-the-structure"><span class="toc-item-num">1.&nbsp;&nbsp;</span>Introduction of the structure</a></div>
  <div class="lev3"><a href="#2/-Preprocessing-the-ingredients-data-(Missing-values,-upper-cases,-natural-language-processing)"><span class="toc-item-num">2.&nbsp;&nbsp;</span>Preprocessing the ingredients data (Missing values, upper cases, natural language processing)</a></div>
  <div class="lev3"><a href="#3/-Co-occurences-and-Covariance"><span class="toc-item-num">3.&nbsp;&nbsp;</span>Co-occurences and Covariance</a></div>
  <div class="lev3"><a href="#4/-Linear-Regression"><span class="toc-item-num">4.&nbsp;&nbsp;</span>Linear Regression</a></div>
  <div class="lev3"><a href="#5/-Next-step-and-final-goal"><span class="toc-item-num">5.&nbsp;&nbsp;</span>Next step and final goal</a></div>

## 1/ Introduction of the structure
In this section, we recall and describe in detail the overall structure of our project: its steps and its goals.

The diagram below gives an illustrated overview:

<img src="Structure.jpeg" width=900>

The diagram illustrates the work we have done so far for the Food Symphony project. The first part consists of filtering the raw data to obtain a clean data set on which we can run analyses and machine learning algorithms.

The most important and most time-consuming step in the first part of the project was designing filters that give us a proper data set to analyse. We had to deal with many different typos in entries that still contained valid information. Instead of skipping such data, we did our best to extract all the available information.

The first filtering step was to handle special characters and remove useless parentheses. After this step, we created 3 lists in order to select the ingredient entries that carry information about the quantity, the unit, the ingredient and the techniques to apply to each ingredient. Entries that did not have all this information were not taken into account.
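
The first filtering step can be illustrated with a minimal sketch. The helper `clean_ingredient_line` is hypothetical and is not one of the functions in `str_functions.py` or `pre_process.py`; it simply drops every parenthesised remark, which is cruder than what the project functions do.

```python
import re

def clean_ingredient_line(line):
    """Hypothetical helper: lower-case a line, drop parenthesised text and special characters."""
    line = line.lower()
    # Remove parenthesised remarks such as "(sifted)"
    line = re.sub(r"\([^)]*\)", " ", line)
    # Keep letters, digits, spaces and the characters used in quantities (/ . -)
    line = re.sub(r"[^a-z0-9/.\- ]", " ", line)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", line).strip()

print(clean_ingredient_line("2 cups Flour (sifted), *organic*"))
```

Quantities written as fractions or decimals, such as `1/2` or `0.5`, survive this cleaning because `/` and `.` are kept.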

The unit list was created by hand, by checking information on Wikipedia. The techniques list was built from the raw data by taking all the verbs in the past tense. The ingredients list was created by web scraping.
Once the filtering is done, we create a data frame with the ingredients as columns and the recipes as rows.
We create the co-occurrence matrix and the covariance matrix for a first analysis, and we will use them to create new recipes.
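
As a rough illustration of this layout, the sketch below builds a small recipe-by-ingredient data frame and its co-occurrence matrix. The recipes and quantities are made up for the example and do not come from the dataset.

```python
import pandas as pd

# Toy per-recipe dictionaries (made-up values): ingredient -> quantity in grams
toy_recipes = [
    {"flour": 200, "sugar": 50, "egg": 100},
    {"flour": 150, "butter": 80},
    {"sugar": 30, "egg": 50, "butter": 20},
]

# One row per recipe, one column per ingredient; absent ingredients become NaN
toy_frame = pd.DataFrame(toy_recipes)

# Presence/absence matrix P, then P^T P counts the recipes shared by each pair of ingredients
presence = toy_frame.notnull().astype(int)
toy_coocc = presence.T.dot(presence)
print(toy_coocc)
```

The diagonal of `toy_coocc` holds the number of recipes that contain each ingredient, and the off-diagonal entries count the recipes that contain both ingredients of a pair.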
 
Additionally, we generate a data frame containing diverse nutrient information by ingredient. This data set will be very helpful for improving our current predictions of the recipes' calorie content with linear regression models.



In [None]:
# ----- Libraries ----- #
import pandas as pd
import re
import sys
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
from nltk import pos_tag
from nltk.tag import UnigramTagger
from nltk.corpus import brown
import webcolors
from IPython.display import display
sys.path.append("..")
from ADA_JEX2017.Project.Functions.str_functions import *
from ADA_JEX2017.Project.Functions.pre_process import *
import matplotlib.pyplot as plt

# ----- Loading the dataset 'recipeInfo_WestWhiteHorvitz_WWW2013.csv' ----- #
path='../ADA_JEX2017/Project/Functions/'

data_file='./recipeInfo/recipeInfo_WestWhiteHorvitz_WWW2013.csv'
raw_data = pd.read_csv(data_file ,sep=';')
raw_data.head(5)

### Visualizing nutrients data
Extract the nutrients from the dataset, which will be used in the regression model.

In [None]:
nutri_data=raw_data[['title','kcal_total','kcal_carb','kcal_fat','kcal_protein','mg_sodium','mg_cholesterol']].copy()
nutri_data.head(5)

### Creating useful lists

In the next cell, we load the lists (ingredient_list.txt, units_list.txt, technique_list.txt) that were created previously using the functions in 'list_creation.py'.

In [None]:
# ----- Initializing and loading the list of techniques, units and ingredients created previously ----- #
with open(path+'units_list.txt', 'r') as f:
    units_list = [line.rstrip('\n') for line in f]
    
with open(path+'technique_list.txt', 'r') as f:
    techniques_list = [line.rstrip('\n') for line in f]

with open(path+'ingredient_list.txt', 'r') as f:
    ingredient_list = [line.rstrip('\n') for line in f]

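# ----- Initialize lemmatizer and apply on the data ----- #
# The lemmatizer maps each word to its dictionary form (lemma) in order to get more homogeneous data
from nltk.stem import WordNetLemmatizer  # imported explicitly in case the star imports above do not provide it
lemmatizer = WordNetLemmatizer()
# Lower-case before lemmatizing: the WordNet lookup expects lower-case words
ingredient_list=[lemmatizer.lemmatize(token.lower()) for token in ingredient_list]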

## 2/ Preprocessing the ingredients data (Missing values, upper cases, natural language processing)

In [None]:
# ----- Make a dataframe with our data while dropping the NaN values ----- #
ingr_dataframe=raw_data[['title','ingredients_list','ingredients_bag-of-words']].copy().dropna()
ingr_dataframe = ingr_dataframe.reset_index(drop=True)
display(ingr_dataframe.head())

# Ignore upper case in the ingredients list string
ingr_dataframe['ingredients_list']=ingr_dataframe['ingredients_list'].str.lower()

#ingr_data_reduced=ingr_dataframe.head(100) # create a reduced data as draft to test when creating new functions

In [None]:
# ----- Function to process the text in the ingredient list ----- #
# For some entries of the ingredients list, the quantity is given twice, once of them in volume or mass within parentheses
# Therefore, we apply the next function to keep only the wanted quantity
fun_add_preprocess(ingr_dataframe,units_list)

### Extraction of ingredients for each recipe

In [None]:
# !!!!! ----- Test cell : to inspect a specific recipe ----- !!!!!! #
receipe=ingr_dataframe.loc[0]['Recipe_preporcess']
print(receipe)
dic_ingr,dictec,wasted,wasted_numb=fun_extract_ingredients\
    (receipe,ingredient_list,techniques_list,units_list,to_gram=True)

In [None]:
#----- Use whole data frame to extract each ingredient with its quantity and unit by using the lists  ------ #

ingr_data_reduced=ingr_dataframe.head(100)
all_dic=[]
not_used_ingr=[]
wastes=0
for index, row in ingr_dataframe.iterrows():
    recipe=row['Recipe_preporcess']
    # Function in str_functions.py to extract the ingredients for each recipe
    dic_ingre,dictec,wasted_ingr,wasted_number=fun_extract_ingredients\
            (recipe,ingredient_list,techniques_list,units_list,to_gram=True)
    # Also convert each quantity in the same unit (grams) if to_gram is set to True
    all_dic.append(dic_ingre)

    # Optionally keep track of the ingredients that did not fit the criteria.
    # We plotted these discarded ingredients in order to manually add important missing ones to our ingredient list
    #not_used_ingr.append(wasted_ingr) 
    #wastes=wastes+wasted_number
    
# ----- Create the dataframe of all the ingredient and their quantities ----- #
ingredients_frame=pd.DataFrame(data=all_dic)
display(ingredients_frame.head(5))

# ----- Print the number of ingredients ----- #
print('There are : ',len(list(ingredients_frame)), 'ingredients')
# Number of recipes in which each ingredient appears (non-null entries per column)
ingred_used={}
for i in list(ingredients_frame):
    ingred_used[i]=ingredients_frame[i].count()

In [None]:
# ----- Sort the ingredients by occurrence ----- #
occu=sorted((value,key) for (key,value) in ingred_used.items())

# Keep the 10 most frequent ingredients and plot them as a horizontal bar chart
ing_occ = pd.DataFrame(occu[::-1]).head(10)
names = ing_occ[1].values
counts = ing_occ[0].values
y_pos = np.arange(len(names))
plt.figure(figsize=(20,10))
plt.barh(y_pos, counts,align='center',color='#31a354')
plt.yticks(y_pos, names) 
plt.xlabel('Number of recipes')
plt.show()

## 3/ Co-occurences and Covariance

### Co-occurences


In [None]:
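# ----- Build the co-occurrence matrix between ingredients ----- #
from sklearn.neighbors import NearestNeighbors  # used in the following cells

# 1 if the ingredient appears in the recipe, 0 otherwise
newdf = ingredients_frame.notnull().astype('int')
# Entry (i, j) counts the recipes that contain both ingredient i and ingredient j
coocc = newdf.T.dot(newdf)
display(coocc)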

#### Similarity and best association of ingredients
Using the co-occurrence matrix, we compute the proximity between ingredients with a k-nearest-neighbours search and end up with the ingredients that are most likely to be associated with a specific ingredient.

For instance, for apple chutney we find the ingredients that combine well with it.
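
To make the idea concrete, here is a minimal sketch on a made-up co-occurrence matrix. The ingredient names and counts are invented for illustration; the search is scikit-learn's unsupervised `NearestNeighbors` with its default Euclidean distance, applied to the rows of the matrix.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Made-up symmetric co-occurrence counts between five ingredients
names = ["apple", "cinnamon", "sugar", "chicken", "garlic"]
toy_coocc = pd.DataFrame(
    [[5, 4, 4, 0, 0],
     [4, 5, 3, 0, 0],
     [4, 3, 6, 1, 0],
     [0, 0, 1, 5, 4],
     [0, 0, 0, 4, 4]],
    index=names, columns=names)

# Each ingredient is described by its row of co-occurrence counts
knn = NearestNeighbors(n_neighbors=3)
knn.fit(toy_coocc.values)
idx = knn.kneighbors(toy_coocc.values, return_distance=False)

# Translate row positions back to ingredient names
neighbours = pd.DataFrame(np.array(names)[idx], index=names)
print(neighbours)
```

In this toy matrix each ingredient is its own closest neighbour, which is why the first column is skipped when reading the suggestions.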

In [None]:
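r=0.4
number_neighbors_toshow=5
# n_neighbors and radius are passed by keyword, as required by recent scikit-learn versions
neigh = NearestNeighbors(n_neighbors=number_neighbors_toshow, radius=r)
neigh.fit(coocc.values)

# For each ingredient, positions of its nearest neighbours (usually the ingredient itself comes first)
b = neigh.kneighbors(coocc.values, return_distance=False)
a = pd.DataFrame(b)
# Map positions back to ingredient names
d = dict(zip(range(len(ingredients_frame.columns)),list(ingredients_frame.columns)))
a.replace(d, inplace=True)
a.rename(index=d,inplace=True)
a.loc['apple chutney'][1:]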

#### Check similarities between recipes:

In [None]:
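r=0.4
number_neighbors_toshow=5
neigh = NearestNeighbors(n_neighbors=number_neighbors_toshow, radius=r)
# Fit on the recipe x ingredient presence matrix, so the neighbours returned are recipes
neigh.fit(newdf.values)
# Query with each ingredient's co-occurrence profile; the result holds row positions in newdf
b = neigh.kneighbors(coocc.values, return_distance=False)
a = pd.DataFrame(b, index=coocc.index)
# Row i of ingredients_frame was built from row i of ingr_dataframe, so positions map to recipe titles
a = a.apply(lambda col: col.map(ingr_dataframe['title']))
a.loc['chicken']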

### Covariance

C_ij is the covariance between the i-th and j-th ingredients.

Covariance is just unscaled correlation.  
If the entry at a certain position in the covariance matrix is large and positive, then the ingredient that corresponds to that row and the ingredient that corresponds to that column vary together across recipes: when one goes up, the other goes up, and when one goes down, the other goes down. A large negative entry means that they vary in opposite directions.
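
A small numerical sketch of this interpretation, with made-up quantities for three ingredients over four recipes (the numbers are not taken from the dataset):

```python
import numpy as np

# Rows are ingredients, columns are recipes (made-up quantities in grams)
quantities = np.array([
    [100, 200, 300, 400],   # flour
    [ 50, 100, 150, 200],   # sugar: rises together with flour
    [ 40,  30,  20,  10],   # salt: falls when flour rises
])

# np.cov treats each row as one variable and each column as one observation
cov = np.cov(quantities)
# Correlation is the covariance rescaled by the standard deviations
corr = np.corrcoef(quantities)

print(cov)
print(corr)
```

The flour/sugar entry of `cov` is positive and the flour/salt entry is negative, while `corr` rescales both to the range from -1 to 1, which makes pairs of ingredients with very different typical quantities comparable.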

In [None]:
import math
ratio_u_ingred={}
test_ingredients_frame=ingredients_frame.copy()
for ingred in list(ingredients_frame):
    column = ingredients_frame[ingred]
    # Mean of the entries that can be parsed as numbers (the others are ignored)
    mean = pd.to_numeric(column, errors='coerce').mean()
    if math.isnan(mean):
        mean=1
    # Entries whose text contains 'u' are replaced by the column mean
    has_u = column.astype(str).str.contains('u')
    test_ingredients_frame.loc[has_u, ingred] = mean
    # Share of the non-missing entries that contain 'u'
    ratio_u_ingred[ingred] = has_u[column.notnull()].mean()

In [None]:
ingr_matrix=test_ingredients_frame.fillna(0).values.T
covar=np.cov(ingr_matrix)

In [None]:
display(covar)

In [None]:
d = dict(zip(range(len(ingredients_frame.columns)),list(ingredients_frame.columns)))
i=15
print('With: ',d[i])
arr = np.array(covar[i])
topingr=arr.argsort()[-6:][::-1]
print('you should try:')
for x in topingr[:]:
    print(d[x])


## 4/ Linear Regression
With a linear regression on the nutrient information in the data, we can estimate a nutrient value for each ingredient, and from these estimates approximate the nutrient values of the recipes for which they are missing in the dataset.
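
The idea can be sketched on made-up numbers: each recipe is a row of ingredient quantities, the target is its total calories, and the fitted coefficients play the role of calories per gram of each ingredient. The quantities and the `true_kcal_per_gram` values below are assumptions used only to generate toy targets, and `fit_intercept=False` is a choice made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: grams of flour, sugar and butter in each recipe (made-up values)
X = np.array([
    [100,  0,  0],
    [  0, 50,  0],
    [  0,  0, 20],
    [100, 50, 20],
    [200,  0, 10],
], dtype=float)

# Assumed kcal per gram, used only to generate the toy targets
true_kcal_per_gram = np.array([3.6, 4.0, 7.2])
y = X @ true_kcal_per_gram

# No intercept: a recipe without ingredients should have 0 kcal
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
print(model.coef_)

# Estimate the calories of a recipe whose value is missing
new_recipe = np.array([[150, 30, 40]], dtype=float)
print(model.predict(new_recipe))
```

On real data the coefficients are only estimates and some may come out negative. Besides clamping them after fitting, scikit-learn's `LinearRegression` accepts `positive=True` (version 0.24 and later), which constrains the coefficients to be non-negative during the fit.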

In [None]:
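from sklearn.linear_model import LinearRegression
nutri_data=raw_data[['kcal_total','kcal_carb','kcal_fat','kcal_protein','mg_sodium','mg_cholesterol']].copy()
# Keep the same recipes as ingr_dataframe and renumber them, so that rows line up with ingredients_frame
kept_rows = raw_data[['title','ingredients_list','ingredients_bag-of-words']].dropna().index
nutri_data = nutri_data.loc[kept_rows].reset_index(drop=True)
# '?' marks a missing value; recipes with any missing nutrient are dropped
a = nutri_data.replace('?',np.nan)
a.dropna(axis=0,how='any',inplace=True)

ingredients_frame=test_ingredients_frame
ingredients_frame['Total'] = a['kcal_total']
ingredients_frame['Carbohyd'] = a['kcal_carb']
ingredients_frame['fat'] = a['kcal_fat']
ingredients_frame['protein'] = a['kcal_protein']
ingredients_frame['mg_sodium'] = a['mg_sodium']
ingredients_frame['mg_cholestrol'] = a['mg_cholesterol']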

display(ingredients_frame)
Train_set = ingredients_frame.dropna(subset=['Total'])

In [None]:
from sklearn.metrics import mean_absolute_error

X = Train_set.drop(['Total','Carbohyd','fat','protein','mg_sodium','mg_cholestrol'],axis=1)
X = X.replace(np.nan,0)
Y = Train_set['Total'] 

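# The first 20000 recipes are used for training, the remaining ones for testing
X_train = X[:20000]
Y_train = Y[:20000]

X_test = X[20000:]
Y_test = Y[20000:]

line_reg = LinearRegression()
line_reg.fit(X_train ,Y_train)
# An ingredient cannot contribute negative calories, so negative coefficients are set to 0 after fitting
line_reg.coef_[line_reg.coef_ < 0] = 0
Y_predict = line_reg.predict(X_test)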

print(mean_absolute_error(Y_test,Y_predict))

## 5/ Next step and final goal

- Check for outliers in the data in order to improve our results.
By outliers in the recipes, we mean ingredient quantities that are aberrant compared with the quantities of the other ingredients in the recipe.

- Improve and optimize the linear regression by applying a better conversion between volume and grams.

- Use the covariance, the co-occurrence and the k-NN results to make new special recipes based on the association score of ingredients.

- Try to form some clusters based on the nutrients in order to group ingredients into categories.

- Implement constraints like vegan, gluten-free or low-calorie meals with the result from the linear regression
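
For the outlier check in the first item, one possible starting point is sketched below. The rule (flag a quantity that exceeds `factor` times the median quantity of its recipe), the function `flag_outliers` and the toy values are all assumptions for illustration, not something already implemented in the project.

```python
import pandas as pd

# Toy recipe x ingredient frame (made-up quantities in grams)
toy_frame = pd.DataFrame(
    {"flour": [200.0, 150.0], "sugar": [50.0, 40000.0], "egg": [100.0, 120.0]},
    index=["recipe_a", "recipe_b"])

def flag_outliers(frame, factor=10.0):
    """Hypothetical rule: flag quantities above `factor` times the recipe's median quantity."""
    medians = frame.median(axis=1)
    # Compare every quantity with the threshold of its own recipe (row-wise alignment)
    return frame.gt(medians * factor, axis=0)

flags = flag_outliers(toy_frame)
print(flags)
```

Here only the 40000 g of sugar in `recipe_b` is flagged. The threshold would need tuning on the real data, since some recipes legitimately mix very small and very large quantities.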


Our goal is to create new recipes and suggest special recipes based on the work done so far, which we believe is possible. Special recipes will be created with the help of the covariance matrix and the co-occurrence matrix. Furthermore, the user will be able to add constraints regarding the nutrients present in the recipe, such as gluten-free, vegan, etc.


