# **Dataset in the kitchen**
------------------------------------------------------------------------


### Creation of an ingredient data set by *Olivier Burgaud* (EURECOM 2019), supervised by *Pr. M. Filipone*.
----

I began my project by studying past projects, and I saw that there was no consistent, widely used dataset for training recipe algorithms. 

Here, I wanted to create a dataset from one of the largest open websites: ***Wikipedia***. I thought it was a good place to find lists of many different ingredients. 

The task was not easy because there is no standard format for a ***Wikipedia page***: a page can be laid out as a table, an alphabetical list, or a randomly ordered list. Besides, the content of these tables is not always consistent; for instance, a cell may contain a hyperlink, plain text, or a whole sentence instead of the ingredient name.
I chose to scrape each ***Wikipedia page*** with the BeautifulSoup library, and then I implemented a few cleaning functions.

I create a dictionary from the scraped data in order to have a consistent database: the types of ingredient are the keys, and the lists of ingredients are the values. The main advantage of a dictionary is that it is very easy to use, especially with Pandas. Besides, adding a value or a key is straightforward. Lastly, this format is a built-in Python type and it is widely used.
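To make the key/value layout concrete, here is a minimal sketch of the structure described above. The sample entries and the helper name `to_rows` are illustrative assumptions, not part of the scraped data or of the notebook's functions.

```python
# Illustrative sample only: a few names stand in for the scraped lists.
sample_food_dict = {
    "CITRUS": ["Clementine", "Lemon", "Lime"],
    "HERBS": ["Basil", "Mint"],
}


def to_rows(food_dict):
    """Flatten {category: [ingredients]} into (category, ingredient) pairs."""
    return [
        (category, ingredient)
        for category, ingredients in food_dict.items()
        for ingredient in ingredients
    ]


rows = to_rows(sample_food_dict)
print(rows)
```

Such a list of pairs is one possible way to hand the data to Pandas, for example with `pandas.DataFrame(rows, columns=["category", "ingredient"])`.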

In [211]:
import requests
from bs4 import BeautifulSoup
import bs4
import time
import numpy as np
import re
import copy

## 1. The scraping part

This is the first step of the database creation. The goal of this step is to gather data directly from *Wikipedia*. It also performs a first pass of data cleaning and sketches the database architecture.
After a close analysis of several *Wikipedia* pages, I found that these pages follow 3 main schemas: *table*, *alphabetical list* and *randomly ordered list* (not truly random, but ordered by various criteria).
I implemented two functions: *make_soup* and *make_link_list*.
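The choice between the three schemas can be sketched on its own. The example below is only an illustration: it uses Python's built-in `html.parser` instead of BeautifulSoup so that it runs without extra dependencies, and the names `SchemaDetector` and `detect_schema` are hypothetical. The class names it looks for (`wikitable`, `div-col`, `mw-parser-output`) are the ones queried by *make_link_list* in this notebook, checked in the same order.

```python
from html.parser import HTMLParser


class SchemaDetector(HTMLParser):
    """Record which of the three layout markers appear in a page."""

    def __init__(self):
        super().__init__()
        self.has_wikitable = False
        self.has_div_col = False
        self.has_parser_output = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "wikitable" in classes:
            self.has_wikitable = True
        elif tag == "div" and "div-col" in classes:
            self.has_div_col = True
        elif tag == "div" and "mw-parser-output" in classes:
            self.has_parser_output = True


def detect_schema(html):
    """Return the schema name, testing the markers in priority order."""
    detector = SchemaDetector()
    detector.feed(html)
    if detector.has_wikitable:
        return "table"
    if detector.has_div_col:
        return "alphabetical list"
    if detector.has_parser_output:
        return "randomly ordered list"
    return "unknown"


page = '<div class="mw-parser-output"><div class="div-col"><ul><li>Basil</li></ul></div></div>'
print(detect_schema(page))
```

A page that matches none of the markers is reported as `unknown`, which is a case the functions in this notebook do not handle explicitly.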

### *make_soup*

This function fetches the HTML page and parses it into a *BeautifulSoup* object, so that we can search it by HTML tag.
It returns a *BeautifulSoup* object.

### *make_link_list*

Here we extract the information from the BeautifulSoup object. 
I wrote 3 different branches, one for each *type* of Wikipedia page, i.e. a *table*, an *alphabetical list*, or a *randomly ordered list*. 
The difficult part is to identify the tags between which the data should be extracted, and I did not manage to automate it. I analyzed the source code of several "typical" pages to find out which tags contain the useful data.

After that, I create a list with the extracted data but, obviously, this data is not clean yet.
That is why I apply a first cleaning pass inside each branch.

It returns a Python *list* and the duration of the scraping, *cpu_time* (wall-clock seconds measured with *time.time*).
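The first cleaning pass can be sketched in isolation. The regular expression and the comma split below are the ones used in *make_link_list*; the helper name `keep_common_name`, the final `strip()` and the sample string are assumptions added for illustration.

```python
import re


def keep_common_name(raw_text):
    """Drop text in parentheses or brackets, then keep what precedes the first comma."""
    without_brackets = re.sub(r"[\(\[].*?[\)\]]", "", raw_text)
    common_name, _, _ = without_brackets.partition(",")
    # Assumption: surrounding whitespace is not wanted in the final name.
    return common_name.strip()


print(keep_common_name("Key lime (scientific name)[1], extra note"))
```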

In [148]:
#A function that gets the URL of the page to be scraped,gets the html content 
#and uses BeautifulSoup to parse html content.

def make_soup(link):
    get_page = requests.get(link)
    html = get_page.content
    soup = BeautifulSoup(html, 'html.parser')
    return  soup


##### This function creates a list of all the food names found in a Wikipedia page and begins the data cleaning.
def make_link_list(wiki_page_to_scrap):
    start_time = time.time()
    link_table = []
    soup = make_soup(wiki_page_to_scrap)
    table = soup.find('table',{'class':'wikitable'})
    
            ### This first loop is used to scrap Wiki table Data.
    if isinstance(table , bs4.element.Tag):        
        temp = []
        table_cells = []
        table = soup.find('table',{'class':'wikitable'})
        for row in table.find_all("tr"):
            cells = row.find_all(['th' , 'td'])
            table_cells.append(cells)
        
        ### This loop is used to locate the "Common name" column index in our table cells
        indices = []
        for j in table_cells:
            for i, elem in enumerate(j):
                elem = str(elem)
                if ('name' in elem) and (not 'Scientific' in elem) :
                    indices.append(i)
        indice = indices[0]
        
        ### Here we implement a loop to keep only the string of the Common name column.        
        for cell in table_cells[2:]:            
            if len(cell) <= indice: ## Skip rows that are too short to contain a common-name cell.
                pass
             
            else:    
                temp.append(cell[indice].text)
            
        ### We strip the '\n' characters at both ends of each entry and we keep only the common name.
        for link in range(len(temp)):
            temp[link] = temp[link].strip('\n')            
            links = re.sub(r"[\(\[].*?[\)\]]", "", temp[link]) 
            links , sep , tail = links.partition(',')
            link_table.append(links)
        
            
        #print('cpu time for the table schema = {:.4f} sec.'.format(time.time() - start_time))
        cpu_time = time.time()-start_time
            
            ### Here is when the Wiki page is just an Alphabetical List.    
    elif (len(soup.find_all('div' , {'class':'div-col'}))>0) == True : 
        for row in soup.find_all('div' , {'class':'div-col'}):
            
            for col in row.find_all('li'):
                species = col.text
                ### We keep only the common name of the species, because only the common name is used in recipes.
                only_common_species = re.sub(r"[\(\[].*?[\)\]]", "", species) 
                only_common_species , sep , tail = only_common_species.partition(',')
                link_table.append(only_common_species)
                
        ### Cleaning of the list: we remove every entry that contains 'List' or 'Healthline'.
        for word in link_table[:]:
            if (word.find('List') != -1) or (word.find('Healthline') != -1) :
                link_table.remove(word)
            
            
        #print('cpu time for the  Alphabetical list schema = {:.4f} sec.'.format(time.time() - start_time))  
        cpu_time = time.time()-start_time
        ### For the list pattern without alphabetical list.
    elif (len(soup.find_all('div' , {'class' : 'mw-parser-output'}))> 0 ) == True:
        for row in soup.find_all('div' , {'class' : 'mw-parser-output'}):
             for col in row.find_all('li'):
                    species = col.text
                    ### We keep only the common name of the species, because only the common name is used in recipes.
                    only_common_species = re.sub(r"[\(\[].*?[\)\]]", "", species) 
                    only_common_species , sep , tail = only_common_species.partition(',')
                    link_table.append(only_common_species)
                    
        ### Cleaning of the list: we remove every entry that contains 'List' or 'Healthline'.
        for word in link_table[:]:
            if (word.find('List') != -1) or (word.find('Healthline') != -1):
                link_table.remove(word)
           
        
        #print('cpu time for the list schema = {:.4f} sec.'.format(time.time() - start_time))   
        cpu_time = time.time()-start_time
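    cpu_time = time.time()-start_time ### Also defined when the page matches none of the three schemas.
    return (link_table , cpu_time)  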





## 2. Cleaning functions

### *tagger_cleaner*

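This function finds the "\n" characters that remain inside the strings and splits each affected string on "\n". It then capitalizes the names, removes duplicates and empty strings, and sorts the result. It returns the cleaned list of ingredients together with the processing time.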
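As a compact illustration of these steps, here is a sketch that follows the same logic without NumPy. The function name `split_and_normalise` and the sample input are assumptions, and the timing is left out for brevity.

```python
def split_and_normalise(names):
    """Split entries on newlines, capitalise, drop blanks and duplicates, sort."""
    pieces = []
    for name in names:
        pieces.extend(name.split("\n"))
    # A set comprehension removes duplicates; empty strings are filtered out.
    cleaned = {piece.capitalize() for piece in pieces if piece}
    return sorted(cleaned)


print(split_and_normalise(["lemon\nlime", "Lime", "", "kumquat"]))
```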

### *remove_meta*

This function discards every meta element, i.e. a class or type name that contains a digit (typically at the start, as in "*1 fish, 2 Roe ...*"), because no ingredient has a number in its name. It returns a copy of the input list without the meta elements, together with the processing time.
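The same filter can be written as a single list comprehension. This is only a sketch of the idea: the name `drop_entries_with_digits` and the sample list are hypothetical.

```python
def drop_entries_with_digits(names):
    """Keep only the names that contain no digit at all."""
    return [name for name in names if not any(ch.isdigit() for ch in name)]


print(drop_entries_with_digits(["1 Fish", "Salmon", "2 Roe", "Cod"]))
```

Unlike *remove_meta*, this version does not copy the list first, because the comprehension already builds a new one.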

In [214]:
### Cleaning functions for the dictionary
### A few common errors in the categories:
    # HTML residue such as "\n"
    
def tagger_cleaner(list_of_ingre):
    start_time = time.time()    
    spliter_list = []
    cleaned_list = []
    for i in list_of_ingre:
        if '\n' in i : 
            spliter_list.append(i.split('\n'))        
        else:
            cleaned_list.append(i.capitalize())
    if spliter_list != []:
        clean_ingre_list = list(np.hstack(spliter_list))
    
        for ingre in clean_ingre_list:
            ingre = ingre.capitalize()
            cleaned_list.append(ingre)
    cleaned_list = list(set(cleaned_list))
    cleaned_list = list(filter(None , cleaned_list))
    cleaned_list.sort()
    cpu_time = time.time()-start_time
    return(cleaned_list , cpu_time)

### This function removes every element that has a digit in its string.
def remove_meta(list_of_ingre): 
    start_time = time.time()
    temp_list = []
    temp_ingre_list = copy.deepcopy(list_of_ingre)
    for i in (range(len(list_of_ingre))):        
        for c in list_of_ingre[i]:
            if c.isdigit():
                temp_list.append(i)
    
    temp_list = list(set(temp_list))
    for j in temp_list:
        temp_ingre_list.remove(list_of_ingre[j])
    
    cpu_time = time.time()-start_time
    return(temp_ingre_list, cpu_time)

## 3. Database creation

Here I create two functions that build the database as a dictionary. Both are meant to be user-friendly, so that anyone can add a whole category or just one element to an existing category.


### *add_cat_to_dict*

This function adds a full category to a dictionary after checking that the category does not already exist. The user gives a *list_of_ingr*, a *category* name and the dictionary as input; if the *category* name does not exist yet, the function adds the new category to *food_dict* in place. Otherwise it prints a message redirecting the user to the second function.
I wrote a function for this, instead of hardcoding a dictionary, in order to give the user maximum freedom. Indeed, users can create dictionaries as they wish, for instance one dictionary for fruit, another for vegetables...

### *add_ingre_to_dict*

Here the user can add a single ingredient to one category. The user just needs to give the *ingredient*, the *category* to which the *ingredient* should belong, and the dictionary. 
Besides, this function checks whether the *category* and the *ingredient* are already in the dictionary. If the *category* is missing, it prints a message redirecting the user to *add_cat_to_dict*; if the *ingredient* is already present, it prints a message and leaves the dictionary unchanged.
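The behaviour of the two functions can be summarised with a small variant that reports the outcome through a return value instead of a printed message, which makes it easier to check. The names `add_category` and `add_ingredient` and the sample data are assumptions, not the notebook's own functions.

```python
def add_category(food_dict, category, ingredients):
    """Add a new category in place; return False if it already exists."""
    key = category.upper()
    if key in food_dict:
        return False
    food_dict[key] = list(ingredients)
    return True


def add_ingredient(food_dict, category, ingredient):
    """Add one ingredient to an existing category; return False otherwise."""
    key = category.upper()
    name = ingredient.capitalize()
    if key not in food_dict or name in food_dict[key]:
        return False
    food_dict[key].append(name)
    food_dict[key].sort()
    return True


demo = {}
add_category(demo, "citrus", ["Lemon"])
add_ingredient(demo, "Citrus", "clementine")
print(demo)
```

Category names are upper-cased and ingredient names capitalized, as in the notebook, so lookups do not depend on how the user types them.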

In [220]:
# Let's create a dictionary with the different categories of vegetables. 

### This function is the constructor of the dictionary: it adds a category and a list (which can be empty) to the dict.
def add_cat_to_dict(list_of_ingr , category , food_dict):

    CAT_NAME = category.upper()
    if CAT_NAME in food_dict:
        print("This category exists already, please use the function add_ingre_to_dict")
    else:
        food_dict.update({CAT_NAME : list_of_ingr})
        return 

### This function lets the user add an element to a category; I assumed that the user will add ingredients
### one by one.
    
def add_ingre_to_dict(ingredient , category , food_dict):
    CTG = category.upper()
    ingre = ingredient.capitalize()
    #First we check if the category exist.
    if CTG in food_dict:        
        if ingre in food_dict[CTG]:
            print("This ingredient is already in the category.")
        else:
            food_dict[CTG].append(ingre)
            food_dict[CTG].sort()
    else :
        print('This category does not exist, you can create a new one with the function add_cat_to_dict.')
    return 

## 4. Application

Now, let's see how we can create a *dictionary* of 8 different types of ingredients.
These types are:
+ Citrus
+ Leaf vegetables (*salads* is the generic term)
+ Culinary herbs and spices (*spices* is the generic term)
+ Fruit
+ Herbs
+ Vegetables
+ Edible flowers
+ Seafood

Besides, I keep a list with the execution time of each category, in order to have an idea of how long the processing takes.

### URL identification

This step is the foundation of the database creation. The user has to find *Wikipedia* pages that contain data to scrape. My project is far from perfect: the user must find a "good" page to scrape, i.e. a *Wikipedia List_of_...* page, and not every page works with my functions. 
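A quick sanity check on a candidate URL can catch obvious mistakes before scraping. The helper below is a hypothetical sketch, not part of the notebook: it only tests the shape of the URL, so a page that passes may still have a layout that *make_link_list* does not handle.

```python
from urllib.parse import urlparse


def looks_like_wikipedia_list(url):
    """Return True if the URL points at a Wikipedia 'List_of_...' page."""
    parsed = urlparse(url)
    return (
        parsed.scheme == "https"
        and parsed.netloc.endswith(".wikipedia.org")
        and parsed.path.startswith("/wiki/List_of_")
    )


print(looks_like_wikipedia_list("https://en.wikipedia.org/wiki/List_of_citrus_fruits"))
```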


In [None]:
page_citrus = 'https://en.wikipedia.org/wiki/List_of_citrus_fruits'
page_salads = 'https://en.wikipedia.org/wiki/List_of_leaf_vegetables'
page_spices = 'https://en.wikipedia.org/wiki/List_of_culinary_herbs_and_spices'
page_fruit = 'https://simple.wikipedia.org/wiki/List_of_fruits'
page_herbs = 'https://simple.wikipedia.org/wiki/List_of_herbs'
page_vegetable = 'https://simple.wikipedia.org/wiki/List_of_vegetables'
page_edible_flower = 'https://en.wikipedia.org/wiki/List_of_edible_flowers'
page_seafood = 'https://en.wikipedia.org/wiki/List_of_types_of_seafood'

### List creation

Here I create the lists of ingredients. I do not scrape meat because the name of each cut can vary a lot from one country to another, so the data would not be reliable. To fix this problem, we can think of two approaches:
+ Analyze a set of recipes with meat and find which names are relevant.
+ Create different categories in the dictionary depending on the country (a huge set would be created).

In [217]:
list_citrus , t_citrus = make_link_list(page_citrus)

list_salad , t_salad = make_link_list(page_salads)

list_spices , t_spices = make_link_list(page_spices)

list_fruit , t_fruit = make_link_list(page_fruit)

list_herbs , t_herbs = make_link_list(page_herbs)

list_vegetable , t_vegetable = make_link_list(page_vegetable)

list_edible_flower , t_edible_flower = make_link_list(page_edible_flower)

list_seafood , t_seafood = make_link_list(page_seafood)

time_list = [t_citrus , t_salad , t_spices , t_fruit , t_herbs , t_vegetable , t_edible_flower , t_seafood]

### Cleaning step

Now, each list is cleaned with the two cleaning functions.

In [218]:
### I apply the 2 cleaning functions to each list.
list_citrus , t_citrus = tagger_cleaner(list_citrus)
time_list[0] += t_citrus # Updating the time of processing
list_citrus , t_citrus = remove_meta(list_citrus)
time_list[0] += t_citrus

list_salad , t_salad = tagger_cleaner(list_salad)
time_list[1] += t_salad
list_salad , t_salad = remove_meta(list_salad)
time_list[1] += t_salad

list_spices , t_spices = tagger_cleaner(list_spices)
time_list[2] += t_spices
list_spices , t_spices = remove_meta(list_spices)
time_list[2] += t_spices

list_fruit , t_fruit = tagger_cleaner(list_fruit)
time_list[3] += t_fruit
list_fruit , t_fruit = remove_meta(list_fruit)
time_list[3] += t_fruit

list_herbs , t_herbs = tagger_cleaner(list_herbs)
time_list[4] += t_herbs
list_herbs , t_herbs = remove_meta(list_herbs)
time_list[4] += t_herbs

list_vegetable , t_vegetable = tagger_cleaner(list_vegetable)
time_list[5] += t_vegetable
list_vegetable , t_vegetable = remove_meta(list_vegetable)
time_list[5] += t_vegetable

list_edible_flower , t_edible_flower = tagger_cleaner(list_edible_flower)
time_list[6] += t_edible_flower
list_edible_flower , t_edible_flower = remove_meta(list_edible_flower)
time_list[6] += t_edible_flower

list_seafood , t_seafood = tagger_cleaner(list_seafood)
time_list[7] += t_seafood
list_seafood , t_seafood = remove_meta(list_seafood)
time_list[7] += t_seafood
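The repeated calls in the cell above could also be expressed as a loop over the lists. The sketch below is one possible way to do it, under the assumption that every cleaner returns a `(list, time)` pair like *tagger_cleaner* and *remove_meta*; the helper `run_cleaners` and the two stand-in cleaners are hypothetical and exist only to keep the example self-contained.

```python
import time


def run_cleaners(lists_by_name, cleaners):
    """Apply each cleaner in order to every list and sum the reported times."""
    cleaned = {}
    timings = {}
    for name, items in lists_by_name.items():
        total = 0.0
        for cleaner in cleaners:
            items, elapsed = cleaner(items)
            total += elapsed
        cleaned[name] = items
        timings[name] = total
    return cleaned, timings


# Stand-in cleaners with the same (list, time) return shape as the notebook's functions.
def strip_blanks(items):
    start = time.perf_counter()
    result = [item for item in items if item]
    return result, time.perf_counter() - start


def drop_digits(items):
    start = time.perf_counter()
    result = [item for item in items if not any(ch.isdigit() for ch in item)]
    return result, time.perf_counter() - start


cleaned, timings = run_cleaners(
    {"citrus": ["Lemon", "", "1 Fish"], "herbs": ["Mint"]},
    [strip_blanks, drop_digits],
)
print(cleaned)
```

With the notebook's own functions, the second argument would be `[tagger_cleaner, remove_meta]`.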

### Dictionary creation

Now we add each category and its instances to a dictionary.

In [232]:
food_dict = dict()

add_cat_to_dict(list_citrus , 'citrus' , food_dict)
add_cat_to_dict(list_salad , 'Salads', food_dict)
add_cat_to_dict(list_spices , 'spicEs' , food_dict)
add_cat_to_dict(list_fruit , 'FRUITs' , food_dict)
add_cat_to_dict(list_herbs , 'herbs' , food_dict)
add_cat_to_dict(list_vegetable , 'vegetables' , food_dict)
add_cat_to_dict(list_edible_flower , 'edible_flowers' , food_dict)
add_cat_to_dict(list_seafood , 'seafood' , food_dict)

print(len(food_dict))
print(food_dict['CITRUS'])
print(len(food_dict['SALADS']))

8
['Balady citronisrael citron', 'Bergamot orange', 'Bitter orangeseville orangesour orangebigarade orangemarmalade orange', 'Blood orange', "Buddha's handbushukanfingered citron", 'Calamondincalamansi', 'Cam sành', 'Citron', 'Clementine', 'Corsican citron', 'Desert lime', 'Etrog', 'Finger lime', 'First ladyanadomikan', 'Florentine citron', 'Grapefruit', 'Greek citron', 'Hyuganatsukonatsutosakonatsunew summer orange', 'Kabosu', 'Kaffir lime', 'Key lime', 'Kinnow', 'Kiyomi', 'Kumquat', 'Lemon', 'Lime', 'Mandarin orangemandarinmandarine', 'Mangshanyegan', 'Meyer lemon', 'Moroccan citron', 'Myrtle-leaved orange tree', 'Orangesweet orange', 'Oroblancosweetie', 'Papeda', 'Persian limetahiti limebearss lime', 'Pomelopummelopommeloshaddock', 'Ponderosa lemon', 'Rangpurlemandarin', 'Round limeaustralian limeaustralian round lime', 'Satsumacold hardy mandarinsatsuma mandarinsatsuma orangechristmas orangetangerine', 'Shangjuanichang lemon', 'Shonan gold', 'Sudachi', 'Sweet limettamediterranean s

## Bibliography

+ BeautifulSoup download      : https://pypi.org/project/beautifulsoup4/
+ BeautifulSoup documentation : https://www.crummy.com/software/BeautifulSoup/bs4/doc/



In [None]:
print('Total number' , len(list_citrus) + len(list_fruit) + len(list_herbs) + len(list_salad) + len(list_spices) + len(list_vegetable) )

In [114]:
f = ['' , 'Olivier\n']
for word in f[:]:
    
    word = word.strip('\n')
            #if (word.find('List') != -1) or (word.find('Healthline') != -1):
             #   f.remove(word)
                
print(f)
a = 'ol'
f[0] += 'a'
a= 'p'
f[0] += 'p'

print(f)


['', 'Olivier\n']
['ap', 'Olivier\n']


In [16]:
a='PastiS\nRicard'
a= a.split('\n')
print(a)
a.append('alcool')
c = a[2].capitalize()
a[2] = c
print(c)
a.sort()
print(a)
#b = a.append('Alcool')
#print(b)

['PastiS', 'Ricard']
Alcool
['Alcool', 'PastiS', 'Ricard']


In [17]:
liste_test = ['PastiS\nRicard' , 'EURECOM' , 'OlivierIng']

h = len(liste_test)

for i in range(h):
    if '\n' in liste_test[i]:
        temp = liste_test[i].split('\n')
        print('oui' , temp)
    else:
        print('propre')

oui ['PastiS', 'Ricard']
propre
propre


In [18]:
dico=dict()
print(dico)
dico.update({'Olivier' : [185 , 22]})
print (dico)
dico['Olivier'].append('Thomas')
print(dico)
dico.update({'Alcool' : 'PAstis'})
print(dico)
dico['Alcool'].append('biere')

{}
{'Olivier': [185, 22]}
{'Olivier': [185, 22, 'Thomas']}
{'Olivier': [185, 22, 'Thomas'], 'Alcool': 'PAstis'}


AttributeError: 'str' object has no attribute 'append'

In [41]:

r = 'olivier'
print(r.hexdigits)

AttributeError: 'str' object has no attribute 'hexdigits'