# Data acquisition and cleaning
**This part only documents how the data was acquired and cleaned.**
This notebook isn't supposed to be executed.

We have two sources of data:
- Data provided by ICC
- Data crawled from the web

## Processing of the provided data
The provided data was easily parsed with the `Perl` script given by the TA. No cleaning was necessary at this stage: NaN values are removed and duplicates are dropped when the data is loaded.

## Processing of the crawled data
We used several home-made `BASH` scripts to fetch and retrieve data from a given website.

The first step was done by hand (the folder structure was created with `wget -x`): we retrieved each category of regional cuisine and created a separate folder for it. Each folder contained the **index.html** page of the corresponding regional cuisine.

Then, using the following script, we retrieved the links for each category we had previously found:

In [None]:
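#!/usr/bin/env bash
# fetcher.sh
# For every category folder, read the canonical URL from the saved index page
# and pass it to crawling.sh together with the number of pages to fetch.
STARTING=$PWD
for directory in $(find "$STARTING" -type d)
do
    cd "$directory" || continue
    # Turn <link rel="canonical" href="URL" /> into "URL?page="
    url=$(cat *.html* | grep "canonical" | sed "s/.*href=\"//" | sed "s/\" \/>/?page=/")

    "$STARTING/crawling.sh" "$url" 2
    sleep 5
    cd "$STARTING"
done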
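
As a hedged alternative to the `grep | sed | sed` pipeline in `fetcher.sh`, the canonical link can be isolated first and its `href` extracted afterwards, which avoids depending on the exact `" />` ending of the tag. The sketch below assumes the page contains a `<link rel="canonical" href="..." >` tag on a single line; the function name `canonical_page_base` and the `example.com` URL are placeholders and not part of the original scripts.

```bash
#!/usr/bin/env bash
# Sketch: extract the canonical URL from a saved HTML page and append "?page=".
canonical_page_base() {
    # $1: path to an HTML file
    grep -o '<link[^>]*rel="canonical"[^>]*>' "$1" \
        | head -n 1 \
        | sed -E 's/.*href="([^"]*)".*/\1/' \
        | tr -d '\r' \
        | sed 's/$/?page=/'
}

# Demo on a hypothetical page (example.com is a placeholder).
sample=$(mktemp)
printf '%s\n' '<head>' \
    '<link rel="canonical" href="https://example.com/recipes/asian/" />' \
    '</head>' > "$sample"
canonical_page_base "$sample"
rm -f "$sample"
```

If the page has no canonical link, the function prints nothing, so the caller can skip that folder instead of crawling an empty URL.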

In [None]:
#!/usr/bin/env bash
# crawling.sh
# Usage: crawling.sh BASE_URL LAST_PAGE
# Downloads BASE_URL1 .. BASE_URL<LAST_PAGE>, after stripping a stray
# carriage return from the base URL.
for i in $(eval echo {1..$2})
do
    TARGET="${1/$'\r'/}$i"
    wget \
        --wait=10 \
        --random-wait \
        --reject '*.js,*.css,*.ico,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso' \
        --execute robots=off \
        --user-agent=AGENT \
        --convert-links \
        --no-cache \
        --no-clobber \
        --no-http-keep-alive \
        --follow-tags=a/href \
        --accept=html \
        --header="Accept: text/html" \
        --ignore-tags=img,link,script \
        "$TARGET"
done
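
`crawling.sh` needs `eval echo {1..$2}` because Bash performs brace expansion before variable expansion, so `{1..$2}` alone would not produce a number sequence. A minimal sketch of the same page numbering without `eval`, using an arithmetic `for` loop, is shown below. The function name `page_urls` and the `example.com` URL are hypothetical, and the function only prints the URLs instead of downloading them.

```bash
#!/usr/bin/env bash
# Sketch: build the listing URLs for pages 1..N from a base ending in "?page=".
page_urls() {
    local base=${1//$'\r'/}   # drop any carriage return carried over from the HTML
    local last=$2
    local i
    for ((i = 1; i <= last; i++)); do
        printf '%s%d\n' "$base" "$i"
    done
}

# Demo with a placeholder URL that still carries a trailing carriage return.
page_urls $'https://example.com/recipes/asian/?page=\r' 2
```

Each printed line could then be passed to `wget` exactly as `$TARGET` is in the script above.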

After this first step, each subfolder contained a *urls.txt* file listing all the recipe links for the given category.  
The last step was to run the following script on each of these links.  
It downloads the page into a temporary `HTML` file, extracts the required data, and sleeps for 5 seconds to avoid being detected by the website's anti-bot protection.

In [None]:
#!/usr/bin/env bash

STARTING=$PWD
TMP_FILE="tmp.html"
DATA_FILE="data.csv"
DESC_FILE="desc.csv"
URL_LISTS="urls.txt"

for directory in $(find $STARTING -type d); 
do
    cd "$directory"
    for url in $(cat $URL_LISTS)
    do
        #################################### Downloading
        wget \
            --wait=10 \
            --random-wait \
            --reject '*.js,*.css,*.ico,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso' \
            --execute robots=off \
            --user-agent=AGENT \
            --convert-links \
            --no-cache \
            --no-clobber \
            --no-http-keep-alive \
            --output-document="$TMP_FILE" \
            "$url"

        #################################### Parsing
        # Main info -> ingredients
        hash=$(md5sum $TMP_FILE | sed "s/  $TMP_FILE.*//")
        title=$(cat $TMP_FILE | grep "<title>" | sed "s/.*<title>//" | sed "s/Recipe - Allrecipes.*//")
        ing=$(cat $TMP_FILE | grep "checkList__item'\}\[true\]" | sed "s/.*title=\"//" | sed "s/\">//" | tr "\r" " " | tr "\n" "|")

        # Nutritive
        nutritive=$(cat $TMP_FILE | grep -A 20 "<div class=\"nutrition-summary-facts\">" | grep "itemprop")

        # Calories values
        cal=$(echo "$nutritive" | grep "calorie*" | sed 's/<span itemprop=\"calories\">//' | sed "s/ calories;<\/span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')

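        # Fat values (kept as-is when already in mg, otherwise assumed to be
        # in g and converted to mg; the same rule applies to the nutrients below)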
        fat=$(echo "$nutritive" |grep "fat*")
        val=$(echo "$fat" | sed 's/<span itemprop=\"fatContent\">//' | sed "s/<span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')
        if [[ $val ]]
        then
            if [[ $(echo "$fat" | sed "s/.*hidden=\"true\">//" | grep "mg") ]]
            then
                fat=$val
            else
                fat=$(echo $val*1000 | bc)
            fi
        fi

        # Carbohydrate values
        carb=$(echo "$nutritive" |grep "carbon*")
        val=$(echo "$carb" | sed 's/<span itemprop=\"carbohydrateContent\">//' | sed "s/<span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')
        if [[ $val ]]
        then
            if [[ $(echo "$carb" | sed "s/.*hidden=\"true\">//" | grep "mg") ]]
            then
                carb=$val
            else
                carb=$(echo $val*1000 | bc)
            fi
        fi

        # Protein values
        prot=$(echo "$nutritive" |grep "prot*")
        val=$(echo "$prot" | sed 's/<span itemprop=\"proteinContent\">//' | sed "s/<span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')
        if [[ $val ]]
        then
            if [[ $(echo "$prot" | sed "s/.*hidden=\"true\">//" | grep "mg") ]]
            then
                prot=$val
            else
                prot=$(echo $val*1000 | bc)
            fi
        fi

        # Cholesterol values
        chol=$(echo "$nutritive" |grep "chol*")
        val=$(echo "$chol" | sed 's/<span itemprop=\"cholesterolContent\">//' | sed "s/<span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')
        if [[ $val ]]
        then
            if [[ $(echo "$chol" | sed "s/.*hidden=\"true\">//" | grep "mg") ]]
            then
                chol=$val
            else
                chol=$(echo $val*1000 | bc)
            fi
        fi

        # Sodium values
        sod=$(echo "$nutritive" |grep "sodium*")
        val=$(echo "$sod" | sed 's/<span itemprop=\"sodiumContent\">//' | sed "s/<span.*//" | sed 's/[[:blank:]]//g' | sed ':a;N;$!ba;s/\n//g')
        if [[ $val ]]
        then
            if [[ $(echo "$sod" | sed "s/.*hidden=\"true\">//" | grep "mg") ]]
            then
                sod=$val
            else
                sod=$(echo $val*1000 | bc)
            fi
        fi
        ######################################### Get directions
        reg="<span class=\"recipe-directions__list--item\">"
        desc=$(cat "$TMP_FILE" | grep "$reg" | sed "s/$reg//" | tr "\n" " " | tr -s " ")
        ######################################### Printout
        echo -e "$hash\t${PWD##*/}\t$title\t$ing\t$cal\t$carb\t$fat\t$prot\t$sod\t$chol" >> "$DATA_FILE"
        echo -e "$hash£$desc" >> $DESC_FILE
        #################################### Napping
        # Delete the temporary page so the next iteration cannot reuse a stale copy
        rm -f "$TMP_FILE"
        sleep 5
    ######################################### end for URLS
    done 
    ######################################### 
    cd $STARTING
done
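
The five nutrient blocks in the script above (fat, carbohydrates, protein, cholesterol, sodium) all apply the same rule: keep the value if the unit is `mg`, otherwise multiply it by 1000. As a sketch, that rule could be factored into a single helper. The name `to_mg` is hypothetical, and `awk` is used here instead of `bc` for the multiplication, which is a substitution and not what the original script does.

```bash
#!/usr/bin/env bash
# Sketch: normalise a nutrient value to milligrams.
# $1: numeric value (may be empty), $2: unit as found on the page ("mg" or "g").
to_mg() {
    local value=$1 unit=$2
    # Nothing was parsed from the page: print nothing, like the original script.
    [[ -z $value ]] && return 0
    if [[ $unit == mg ]]; then
        printf '%s\n' "$value"
    else
        # Any other unit is treated as grams, mirroring the original else branch.
        awk -v v="$value" 'BEGIN { printf "%.10g\n", v * 1000 }'
    fi
}

to_mg 12.5 g    # grams are converted
to_mg 40 mg     # milligrams are passed through
```

With such a helper, each nutrient block would reduce to extracting the value and the unit and calling `to_mg` once.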

**Note**: we also retrieved the textual description of each recipe so that we could run text analysis on it (e.g. cooking time).
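
As an illustration of the kind of text analysis mentioned in the note, the sketch below lists the duration mentions found in a description string. It is only an assumption about how cooking times might be located: the function name `durations`, the pattern, and the sample sentence are all hypothetical and do not come from the crawled data.

```bash
#!/usr/bin/env bash
# Sketch: print every duration mention (e.g. "25 minutes", "1 hour") found in a text.
durations() {
    # $1: recipe description as a single string
    grep -oiE '[0-9]+ *(to +[0-9]+ *)?(minutes?|mins?|hours?|hrs?)' <<< "$1"
}

# Demo on a made-up description.
durations "Bake for 25 minutes, then let rest 1 hour before serving."
```

Summing or classifying these mentions (preparation versus cooking time) would require more context than this pattern captures.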