# Data acquisition and cleaning
**This part covers only how the data was acquired and cleaned.**

We have two sources of data:
- Data provided by ICC
- Data crawled from the web

## Processing the provided data
The provided data was parsed with the `Perl` script given by the TA. No further cleaning was necessary at this stage: NaN values are removed and duplicates are dropped when the data is loaded.

## Processing the crawled data
We used several home-made `BASH` scripts to fetch and retrieve data from a given website.

The first step was done by hand (the folder structure was created with `wget -x`): we retrieved each regional cuisine category and created a separate folder for it. Each folder contains the **index.html** page of the corresponding regional cuisine.

Then, using the following scripts, we retrieved the links for each category found previously:

In [1]:
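#!/usr/bin/env bash
# fetcher.sh
# For every category folder, read the canonical URL from the saved index page
# and pass it to crawling.sh, which fetches the listing pages.
STARTING=$PWD
for directory in $(find "$STARTING" -type d)
do
    cd "$directory"
    # Extract the canonical link and turn its closing '" />' into '?page='
    url=$(cat *.html* | grep "canonical" | sed "s/.*href=\"//" | sed "s/\" \/>/?page=/")

    # Second argument: number of listing pages to fetch
    "$STARTING/crawling.sh" "$url" 2
    sleep 5
    cd "$STARTING"
done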


In [None]:
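#!/usr/bin/env bash
# crawling.sh
# Usage: crawling.sh <base-url> <number-of-pages>
for i in $(seq 1 "$2")
do
    # Strip a possible carriage return from the URL, then append the page number
    TARGET="${1/$'\r'/}$i"
    wget \
    --wait=10 \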
    --random-wait \
    --reject '*.js,*.css,*.ico,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso' \
    --execute robots=off \
    --user-agent=AGENT \
    --convert-links \
    --no-cache \
    --no-clobber \
    --no-http-keep-alive \
    --follow-tags=a/href \
    --accept=html \
    --header="Accept: text/html" \
    --ignore-tags=img,link,script \
    $TARGET
done

After this first step, each subfolder contains a *urls.txt* file listing all the recipe links for that category.  
The last step was to run the following script, which processes each line of these files.  
It downloads the page into a temporary `HTML` file, extracts the required data and sleeps for 5 seconds between requests to reduce the risk of being detected as a bot by the website.

In [None]:
#!/usr/bin/env bash

STARTING=$PWD
TMP_FILE="tmp.html"
DATA_FILE="data.csv"
DESC_FILE="desc.csv"
URL_LISTS="urls.txt"

for directory in $(find $STARTING -type d); 
do
    cd "$directory"
    for url in $(cat $URL_LISTS)
    do
        #################################### Downloading
        wget \
            --wait=10 \
            --random-wait \
            --reject '*.js,*.css,*.ico,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso' \
            --execute robots=off \
            --user-agent=AGENT \
            --convert-links \
            --no-cache \
            --no-clobber \
            --no-http-keep-alive \
            --output-document="$TMP_FILE" \
            "$url"

        #################################### Parsing
        # Main info: hash, title and ingredients
        hash=$(md5sum $TMP_FILE | sed "s/  $TMP_FILE.*//")
        title=$(cat $TMP_FILE | grep "<title>" | sed "s/.*<title>//" | sed "s/Recipe - Allrecipes.*//")
        ing=$(cat $TMP_FILE | grep "checkList__item'\}\[true\]" | sed "s/.*title=\"//" | sed "s/\">//" | tr "\r" " " | tr "\n" "|")

        # Nutritive
        nutritive=$(cat $TMP_FILE | grep -A 20 "<div class=\"nutrition-summary-facts\">" | grep "itemprop")

        # Calories values
        cal=$(echo "$nutritive" | grep "calorie*" | sed 's/<span itemprop=\"calories\">//' | sed "s/ calories;<\/span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')

        # Fat values (stored in mg: a value that is not already in mg is multiplied by 1000)
        fat=$(echo "$nutritive" |grep "fat*")
        val=$(echo "$fat" | sed 's/<span itemprop=\"fatContent\">//' | sed "s/<span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')
        if [[ $val ]]
        then
            if [[ $(echo "$fat" | sed "s/.*hidden=\"true\">//" | grep "mg") ]]
            then
                fat=$val
            else
                fat=$(echo $val*1000 | bc)
            fi
        fi

        # Carbohydrate values
        carb=$(echo "$nutritive" |grep "carbon*")
        val=$(echo "$carb" | sed 's/<span itemprop=\"carbohydrateContent\">//' | sed "s/<span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')
        if [[ $val ]]
        then
            if [[ $(echo "$carb" | sed "s/.*hidden=\"true\">//" | grep "mg") ]]
            then
                carb=$val
            else
                carb=$(echo $val*1000 | bc)
            fi
        fi

        # Protein values
        prot=$(echo "$nutritive" |grep "prot*")
        val=$(echo "$prot" | sed 's/<span itemprop=\"proteinContent\">//' | sed "s/<span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')
        if [[ $val ]]
        then
            if [[ $(echo "$prot" | sed "s/.*hidden=\"true\">//" | grep "mg") ]]
            then
                prot=$val
            else
                prot=$(echo $val*1000 | bc)
            fi
        fi

        # Cholesterol values
        chol=$(echo "$nutritive" |grep "chol*")
        val=$(echo "$chol" | sed 's/<span itemprop=\"cholesterolContent\">//' | sed "s/<span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')
        if [[ $val ]]
        then
            if [[ $(echo "$chol" | sed "s/.*hidden=\"true\">//" | grep "mg") ]]
            then
                chol=$val
            else
                chol=$(echo $val*1000 | bc)
            fi
        fi

        # Sodium values
        sod=$(echo "$nutritive" |grep "sodium*")
        val=$(echo "$sod" | sed 's/<span itemprop=\"sodiumContent\">//' | sed "s/<span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')
        if [[ $val ]]
        then
            if [[ $(echo "$sod" | sed "s/.*hidden=\"true\">//" | grep "mg") ]]
            then
                sod=$val
            else
                sod=$(echo $val*1000 | bc)
            fi
        fi
        ######################################### Get directions
        reg="<span class=\"recipe-directions__list--item\">"
        desc=$(cat "$TMP_FILE" | grep "$reg" | sed "s/$reg//" | tr "\n" " " | tr -s " ")
        ######################################### Printout
        echo -e "$hash\t${PWD##*/}\t$title\t$ing\t$cal\t$carb\t$fat\t$prot\t$sod\t$chol" >> "$DATA_FILE"
        echo -e "$hash£$desc" >> $DESC_FILE
        #################################### napping
        sleep 5
    ######################################### end for URLS
    done 
    ######################################### 
    cd $STARTING
done

**Note**: we also retrieved the textual description of each recipe in order to run text analysis on it (e.g. cooking time).
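
The unit handling in the parsing script above (each nutrient is stored in milligrams, and a value whose unit is not `mg` is multiplied by 1000) can be sketched in Python as follows. This is only an illustration: the function name `to_milligrams` and its arguments are hypothetical and do not appear in the original scripts.

```python
from typing import Optional


def to_milligrams(value: str, unit: str) -> Optional[float]:
    """Return the quantity in milligrams, or None when nothing was scraped.

    Hypothetical helper mirroring the rule of the Bash script:
    keep the value when the unit contains "mg", otherwise assume grams.
    """
    if not value.strip():
        return None
    amount = float(value)
    return amount if "mg" in unit else amount * 1000


print(to_milligrams("12.5", "g"))   # grams are converted to milligrams
print(to_milligrams("300", "mg"))   # milligrams are kept as they are
```

Like the script, this sketch treats every unit other than `mg` as grams, which is an assumption about the pages being scraped.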

# 1. Data Loading

In [None]:
# Basic imports
import re
import os.path
import numpy as np
import scipy as sp
import pandas as pd

DATA_FOLDER = './data/'
recipes_df = pd.read_csv(DATA_FOLDER + 'recipes_df.csv', sep='\t', encoding = "utf-8")

# 2. Ingredient Parsing
**This is only a presentation of how we parsed and cleaned the ingredient list. The following code works, but it needs some tweaking to run in this notebook. To see the result, please read** `DataAnalysis.ipynb` **section 2**.

In this part we try to build a list of ingredients for each recipe. This list should be clean, meaning it should contain only the names of the ingredients and no other information, such as quantities.

To do this, we first cleaned the list of ingredients by lowercasing it and removing a manually chosen set of words (contained in `black_list`). We then used the natural language processing library `nltk` to remove every word that is not a noun.
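
As a small illustration of the first filter, here is a sketch that uses only the standard `re` module on a single made-up ingredient string. The shortened `BLACK_LIST` and the helper `clean_ingredients` are assumptions for this example; the notebook itself applies the full list with pandas.

```python
import re
from typing import List

# Shortened, illustrative black list (the real one is longer)
BLACK_LIST = ["tablespoons", "tablespoon", "cups", "cup", "ounces", "ounce", r"\(.*\)"]


def clean_ingredients(raw: str) -> List[str]:
    """Lowercase, drop black-listed patterns, then keep only letters, spaces and hyphens."""
    text = raw.lower()
    for pattern in BLACK_LIST:
        text = re.sub(pattern, "", text)
    # Every run of other characters (digits, '|', punctuation) becomes a space
    text = re.sub(r"[^a-z -]+", " ", text)
    return text.split()


words = clean_ingredients("2 cups All-Purpose Flour|1 (14 ounce) can Tomatoes")
print(words)  # ['all-purpose', 'flour', 'can', 'tomatoes']
```

Words such as `can` survive this first pass, which is why further filtering is applied afterwards.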

In [None]:
# Copy for test
recipes_copy = recipes_df.copy()

# Lowercase so that matching is case-insensitive
recipes_copy['Ingredients'] = recipes_copy['Ingredients'].str.lower()

# First coarse filter: remove every occurrence of these words (and any parenthesised text)
black_list = ['inches','inch','medium','pounds','pound','ounces','ounce','fluid','ground','tablespoons','tablespoon','cups','cup','teaspoons','teaspoon', 'all-purpose', r'\(.*\)']
recipes_copy['Ingredients'] = recipes_copy['Ingredients'].replace(black_list, '', regex=True)

# Remove non-alphabetic characters. Note: as written, the pattern keeps only letters,
# spaces and hyphens, so the '|' separator is replaced by a space as well
recipes_copy['Ingredients'] = recipes_copy['Ingredients'].str.replace('[^a-zA-Z  -]+', ' ', regex=True)

# Retrieve the overall list of distinct words
keywords_list = recipes_copy['Ingredients'].str.split(" ", expand=True).stack().unique()
len(keywords_list)

In [None]:
### Retrieve bad ingredients

# NLP-related imports
import nltk
nltk.download('punkt');
nltk.download('averaged_perceptron_tagger');

# NLP to tag each word with its part of speech
tokens = nltk.word_tokenize(' '.join(keywords_list))
tagged = nltk.pos_tag(tokens)

# Collect every word that is not tagged as a noun
gray_list = [word for word, pos in tagged if pos not in ('NN', 'NNP', 'NNS', 'NNPS')]

# Further filtering: remove gray-listed words as whole words with a regex
gray_pattern = r'\b(?:' + '|'.join(re.escape(word) for word in set(gray_list)) + r')\b'
ingredient_serie = recipes_copy['Ingredients'].str.replace(gray_pattern, '', regex=True)

# Retrieve list of ingredients in overall
keywords_list = ingredient_serie.str.split(" ", expand=True).stack().unique()

# NLP to identify only nouns
tokens = nltk.word_tokenize(' '.join(keywords_list))
tagged = nltk.pos_tag(tokens)
nouns = [word for word,pos in tagged if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS')]

# Drop words of three letters or fewer, as we assume they are not ingredients
ing_list = [item for item in nouns if len(item) > 3]

At this point, we have a list of ingredients contained in `ing_list`, which can be used to filter our dataset. Unfortunately, as we are going to see below, some ingredients are not spelled correctly while others are not ingredients at all.

In [None]:
# Take the original ingredient list and split it into words to count occurrences
ing_ds = recipes_copy['Ingredients'].str.split(" ", expand=True) \
                                        .stack().value_counts()  \
                                        .to_frame(name='count')  \
                                        .reset_index()

# Keeping only the ingredient in the previous list
ing_ds = ing_ds[ing_ds['index'].isin(ing_list)].reset_index(drop=True)

#ing_ds.sort_values(by='index') # if you want to see similar words
ing_ds.head(21)
#ing_ds['index'].to_csv('ing_list') # used to dataclean by hand

ing_ds[ing_ds['index'] == 'pioneer']

We can see that the words `powder`, `taste` and `sauce` are among the ten most frequent words, although they are not ingredients. Such words have to be identified by hand and removed to obtain an ingredients-only list.

We can also notice that some ingredients are duplicated because of different spellings (e.g. `onion` and `onions`).

We tried to merge similar words by defining a metric that measures the distance between words, so that similar words end up close to each other under this metric.

In [None]:
### Retrieve similar names

# We can create a space with N dimensions
# Each letter of a word is mapped to its corresponding integer in this space
# Similar words will lie closely in this space

# Convert ingredient's distribution to list  
ing_ds_list = ing_ds['index'].values.tolist()
# print("\033[1mbar before sort:\033[0m", ing_ds_list)

# Looking for the longest word/ingredient
N = len(sorted(ing_ds_list, key=len)[-1])
# print("\033[1m\nLongest word is:\033[0m", N, " long")

# For each word in the list, we append the NULL element ASCII to have the same number of elements
converted_ing_list = [item + chr(0) * (N - len(item)) for item in ing_ds_list]
# print("\033[1m\n converted_ing_list after padding:\033[0m\n", converted_ing_list)

# Convert each padded word into a row of ASCII codes -> NumPy matrix
word_matrix = np.array([[ord(char) for char in string] for string in converted_ing_list])
# print("\033[1m\n converted_ing_list after ASCII int conversion:\033[0m\n", word_matrix)

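# Compute the distance between each pair of rows
# Idea: learn the optimal per-position weights (e.g. by backpropagation)
# Note: w is only used by the commented-out weighted variant and must have N entries
import scipy.spatial.distance  # make sure the submodule is loaded before using sp.spatial
w = [10, 4, 3, 2, 1, 1, 1, 1, 1 ,1 ,1 ,1 ,1 , 1, 1]
#distance_matrix = sp.spatial.distance.cdist(word_matrix, word_matrix, 'wminkowski', p=2, w=w)
distance_matrix = sp.spatial.distance.cdist(word_matrix, word_matrix, 'euclidean')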

# print("\033[1m\nDistance of the matrix define by converted_ing_list:\033[0m\n", distance_matrix)

# Thresholding <-> if the distance is small enough words are the same!
normed_dist = (distance_matrix < 60).astype(int)
# print("\033[1m\nDistance of the matrix thresholded:\033[0m\n", normed_dist)

# The list is sorted by decreasing count and normed_dist is symmetric, so for each word
# argmax returns the index of the most frequent word that lies within the threshold
vec = normed_dist.argmax(axis=0)
# print("\033[1m\nIndex of corresponding words in sorted [converted_ing_list]:\033[0m\n", vec)

# Map each word to its matching word, stripping the NUL padding
deconverted_ing_list = [converted_ing_list[i].replace(chr(0), '') for i in vec]
# print("\033[1m deconverted_ing_list operation:\033[0m", deconverted_ing_list)

# Result
ing_dict = dict(zip(ing_ds_list, deconverted_ing_list))

ing_dict

As we can see, this method is not accurate yet. We would need more time to tune the weights and to filter out non-ingredient words.

As an alternative, we plan to clean the ingredient list by hand.

This algorithm does not take into account the statistical relevance of letters in the English language, but only alphabetical closeness.
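
To make this limitation concrete, the sketch below recomputes the same kind of distance (character codes, zero padding, Euclidean norm) for two pairs of words, using only the standard library. The helper names `char_vector` and `char_distance` are hypothetical, and the words are made-up examples.

```python
import math
from typing import List


def char_vector(word: str, length: int) -> List[int]:
    """Map each character to its code point and pad with zeros (NUL)."""
    return [ord(c) for c in word] + [0] * (length - len(word))


def char_distance(a: str, b: str) -> float:
    """Euclidean distance between the padded character-code vectors."""
    n = max(len(a), len(b))
    return math.dist(char_vector(a, n), char_vector(b, n))


print(f"{char_distance('onion', 'onions'):.1f}")  # 115.0
print(f"{char_distance('rice', 'sage'):.1f}")     # 9.0
```

With the threshold of 60 used above, `rice` and `sage` would be treated as the same word, while `onion` and `onions` would not, because the trailing `s` is compared against the padding value 0.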

**Last-minute update:** We can actually use two different strategies here:
- Check whether the word exists in the English dictionary; if another existing word differs from it by only one letter, we merge them (e.g. `onion` and `onions`).
- Implement the [Levenshtein distance](https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance) and apply it to the alphabetically sorted word list with a moving window (thus avoiding useless computation).

In [None]:
def levenshtein(s1, s2):
    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    # len(s1) >= len(s2)
    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer
            deletions = current_row[j] + 1       # than s2
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    
    return previous_row[-1]

**At this point we clean the data by hand**: the only remaining issue is similar words, which can be handled with the methods suggested above.
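
The moving-window idea mentioned above could look like the following sketch. It is not the code used in this notebook: `merge_similar`, `window` and `max_dist` are hypothetical names and defaults, and a compact edit-distance function is repeated here so that the example is self-contained.

```python
from typing import Dict, List


def levenshtein(s1: str, s2: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (c1 != c2)))  # substitution
        previous = current
    return previous[-1]


def merge_similar(words: List[str], window: int = 3, max_dist: int = 1) -> Dict[str, str]:
    """Map each word to an earlier word of the sorted list found inside the window."""
    ordered = sorted(set(words))
    canonical: Dict[str, str] = {}
    for idx, word in enumerate(ordered):
        canonical[word] = word
        # Only compare with the few words that precede it alphabetically
        for prev in ordered[max(0, idx - window):idx]:
            if levenshtein(prev, word) <= max_dist:
                canonical[word] = canonical[prev]
                break
    return canonical


mapping = merge_similar(["onions", "onion", "tomato", "tomatoes", "garlic"])
print(mapping["onions"], mapping["tomatoes"])  # onion tomatoes
```

With a maximum distance of 1, `onions` is merged into `onion`, but `tomatoes` stays separate from `tomato` because two edits are needed. A larger threshold would catch it at the cost of more false merges.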

In [None]:
with open(DATA_FOLDER + 'cleaned_list') as f:
    hand_ing_list = f.read().splitlines()
    
levenshtein('glaa', 'glaaaaaaaaa')

In [None]:
gla = [[levenshtein(item1, item2) for item1 in keywords_list] for item2 in keywords_list]

The following cells repeat the procedure above, using the Levenshtein distance instead of the Euclidean distance.

In [None]:
matrix = np.array(gla)
normed_mat = (matrix <= 1).astype(int)
normed_mat

In [None]:
vec = normed_mat.argmax(axis=0)
vec

In [None]:
# Map each word to the first word of keywords_list that is within distance 1
transformed_list = [keywords_list[i] for i in vec]

# Result
ing_dict = dict(zip(keywords_list, transformed_list))
ing_dict

In [None]:
# Saving
np.save(DATA_FOLDER + 'full_ing_dict.npy', ing_dict) 
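
Because `np.save` stores a dictionary as a pickled object, reading it back requires `allow_pickle=True` and a call to `.item()`. The sketch below shows the round trip on a toy dictionary written to a temporary folder. The toy contents and the temporary path are assumptions made for this example.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the real mapping
toy_dict = {"onions": "onion", "onion": "onion"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "full_ing_dict.npy")
    np.save(path, toy_dict)  # the dict is wrapped in a 0-d object array and pickled
    # allow_pickle=True is required to read object arrays; .item() unwraps the dict
    loaded = np.load(path, allow_pickle=True).item()

print(loaded == toy_dict)  # True
```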