# 01.- CREATING FOOD DATABASE

We will start building our dataset by scraping [BEDCA](https://www.bedca.net/bdpub/index.php), the Spanish Food Composition Database.  
It is published by the Ministry of Science and Innovation, under the coordination and funding  
of the Spanish Agency for Food Safety and Nutrition of the Ministry of Health, Social Services and Equality.

The food composition values collected in this database have been obtained from different sources,  
including laboratories, the food industry and scientific publications, or have been calculated by the agency.

Our database will be stored in a `pandas.DataFrame`.

___
### PREREQUISITES

In [None]:
# !pip install selenium
# !pip install webdriver-manager

___
### IMPORTS

In [1]:
import numpy as np
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.by import By

In [2]:
from funcs_data import mklist, item_to_lists, nameof

from funcs_driver import launch_driver, goto, foodindex, get_in, get_back

from funcs_scrapping import get_general_info, get_nutritional_facts
from funcs_scrapping import refine_nutritional_facts, get_group_info

In [4]:
# This cell only needs to be executed to update funcs_driver
# and funcs_scrapping after they have been imported

# %run funcs_data
# %run funcs_driver.py
# %run funcs_scrapping.py

___
### BUILDING THE DATAFRAME ( I )

Reading up on [different ways to grow a DataFrame](https://stackoverflow.com/questions/13784192/creating-an-empty-pandas-dataframe-then-filling-it) led me to build one list per column,  
grow each list via `list.append()`, and only then build the `pd.DataFrame`.

In [5]:
# The DataFrame will be made of a group of characteristics that we will use as columns
characteristics = ['foodname_ESP', 'foodname_ENG', 'quantity',
                   'energy', 'fats', 'prot', 'water', 'fiber',
                   'carbs', 'm_unsat_fats', 'p_unsat_fats',
                   'sat_fats', 'palm_acid', 'chol', 'A', 'D', 'E',
                   'B9', 'B3', 'B2', 'B1', 'B12', 'B6', 'C', 'calcium',
                   'iron', 'potassium', 'magnesium', 'sodium', 'phosphorus',
                   'iodide', 'selenium', 'zinc']

general_info =  ['foodname_ESP', 'foodname_ENG', 'quantity',]
macros =        ['energy', 'fats', 'prot', 'carbs']
complementary = ['water', 'fiber', 'm_unsat_fats', 'p_unsat_fats',
                 'sat_fats', 'chol']
flags =         ['palm_acid']
vitamins =      ['A', 'D', 'E', 'B9', 'B3', 'B2', 'B1', 'B12', 'B6', 'C',]
minerals =      ['calcium', 'iron', 'potassium', 'magnesium', 'sodium',
                 'phosphorus', 'iodide', 'selenium', 'zinc',]

# Let's generate all the lists with the help of a short ad hoc function
foodname_ESP, foodname_ENG, quantity = mklist(len(general_info))
energy, fats, prot, carbs = mklist(len(macros))
water, fiber, m_unsat_fats, p_unsat_fats, sat_fats, chol = mklist(len(complementary))  
palm_acid = []
calcium, iron, potassium, magnesium, sodium, phosphorus, iodide, selenium, zinc = mklist(len(minerals))
A, D, E, B9, B3, B2, B1, B12, B6, C = mklist(len(vitamins))

# To make it easier to add elements to every list, a 'superlist' is created
lists = [foodname_ESP, foodname_ENG, quantity,
         energy, fats, prot, water, fiber,
         carbs, m_unsat_fats, p_unsat_fats,
         sat_fats, palm_acid, chol, A, D, E,
         B9, B3, B2, B1, B12, B6, C, calcium, iron,
         potassium, magnesium, sodium, phosphorus,
         iodide, selenium, zinc]
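
`mklist` and `item_to_lists` come from `funcs_data`, whose source is not shown in this notebook. The sketch below is only a guess at the behaviour the cell above relies on: it assumes `mklist(n)` returns `n` independent empty lists and that `item_to_lists(item, lists)` appends the k-th scraped value to the k-th list. The `_sketch` names are hypothetical and the real helpers may differ.

```python
def mklist_sketch(n):
    """Return n independent empty lists (assumed behaviour of mklist)."""
    # A comprehension creates a new list each time; [[]] * n would
    # repeat the same list object n times.
    return [[] for _ in range(n)]


def item_to_lists_sketch(item, lists):
    """Append the k-th value of item to the k-th list (assumed behaviour)."""
    if len(item) != len(lists):
        raise ValueError(f'expected {len(lists)} values, got {len(item)}')
    for value, column in zip(item, lists):
        column.append(value)


name_es, name_en, quantity = mklist_sketch(3)
columns = [name_es, name_en, quantity]

item_to_lists_sketch(['Aceite de coco', 'Coconut oil', 100], columns)
item_to_lists_sketch(['Aceite de colza', 'Rape oil', 100], columns)

print(name_en)  # ['Coconut oil', 'Rape oil']
```

Because the "superlist" holds references to the same list objects as the individual names, appending through it also grows `name_es`, `name_en` and `quantity`.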

With all the lists created, it's time to fill them with the scraped data.

___
### SCRAPING

In [6]:
# Build the driver using Firefox as the browser
driver = launch_driver()



Current firefox version is 96.0
Get LATEST geckodriver version for 96.0 firefox
Driver [/Users/mabatalla/.wdm/drivers/geckodriver/macos/v0.30.0/geckodriver] found in cache


In [7]:
# With the base URL of the database the program will go to the all-food list
url = 'https://www.bedca.net/bdpub/index.php'
goto(driver, url)

In [8]:
# Count the foods listed on the page so we can iterate over them by index
num_of_foods = len(foodindex(driver))

for i in range(num_of_foods):
    # The website refreshes every time we go in and out of a food,
    # so the list of foods needs to be regenerated on every iteration
    foods = foodindex(driver)
    item = []
    
    # Get in - Get data
    get_in(foods[i])
    get_general_info(driver, item)
    get_nutritional_facts(driver, item)
    refine_nutritional_facts(driver, item)

    # Add information found to lists
    item_to_lists(item, lists)

    # Get out
    get_back(driver)
    
    ### CHECKPOINT ###
    # In case of an exception, resume by changing the range
    # at the top to range(last_i_printed + 1, num_of_foods)
    print(i, foodname_ESP[i])

0 Aceite de algodón
1 Aceite de cacahuete
2 Aceite de coco
3 Aceite de colza
4 Aceite de germen de trigo
5 Aceite de girasol
6 Aceite de grano de uva
7 Aceite de hígado de bacalao
8 Aceite de lino
9 Aceite de nuez
10 Aceite de oliva
11 Aceite de oliva virgen extra
12 Aceite de oliva virgen extra, producción ecologica
13 Aceite de palma
14 Aceite de sésamo
15 Aceite de soja
16 Aceituna
17 Aceituna negra, con hueso
18 Acelga, cruda
19 Acelga, en conserva
20 Acelgas, hervidas
21 Achicoria, cruda
22 Agua de la red
23 Agua mineral, mineralización debil
24 Agua, con gas, embotellada
25 Aguacate
26 Aguacate congelado
27 Aguardiente
28 Ajo
29 Ajo, en polvo
30 Ajo, frito
31 Albahaca
32 Albaricoque
33 Albondigas en conserva
34 Alcachofa, cruda
35 Alcachofas en conserva
36 Alcaparra
37 Alioli
38 Almeja
39 Almejas en conserva
40 Almendra, cruda
41 Almendra, cruda, con cáscara
42 Almendra, frita, salada
43 Almendra, tostada
44 Almidón de arroz
45 Almidón de maíz
46 Almidón de trigo
47 Altramuz
48 A
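
The checkpoint in the loop above relies on editing the range by hand after a failure. A minimal sketch of the same idea in code is shown below, assuming nothing is stored for an item until it has been fetched completely. `scrape_all` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the Selenium calls and is not part of the notebook's helper modules.

```python
def scrape_all(fetch, num_items, rows, start=0):
    """Fetch items start..num_items-1 into rows; return the index to resume from."""
    for i in range(start, num_items):
        try:
            rows.append(fetch(i))
        except Exception as error:
            # Nothing was stored for item i, so i is the index to retry.
            print(f'stopped at {i}: {error}')
            return i
    return num_items


# Stand-in for the Selenium calls: fails once at index 2.
failures = {2}


def fake_fetch(i):
    if i in failures:
        failures.discard(i)
        raise RuntimeError('page did not load')
    return f'food {i}'


rows = []
resume_at = scrape_all(fake_fetch, 5, rows)             # stops at index 2
resume_at = scrape_all(fake_fetch, 5, rows, resume_at)  # finishes the run
print(rows)
```

Returning the index of the failed item means the second call picks up exactly where the first one stopped, without duplicating or skipping rows.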

The web scraping took 115 minutes. It's time to close the driver.

In [13]:
driver.close()

___
### BUILDING THE DATAFRAME ( II )

With all foods inspected and all lists filled, it's time to finish building the `pd.DataFrame`.

In [9]:
data = dict(zip(characteristics, lists))

nutritional_values = pd.DataFrame(data)
nutritional_values.set_index(['foodname_ESP', 'foodname_ENG'], inplace=True)
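
As a small self-contained illustration of the step above, the sketch below builds a two-row frame from per-column lists and indexes it by both food names. It uses only four of the columns, with values taken from the preview of the real table; the variable names are placeholders rather than the notebook's own.

```python
import pandas as pd

columns = ['foodname_ESP', 'foodname_ENG', 'quantity', 'energy']
lists = [
    ['Aceite de coco', 'Aceite de colza'],
    ['Coconut oil', 'Rape oil'],
    [100, 100],
    [888.0, 888.0],
]

# Pair each column name with its list, then build the frame in one step.
df = pd.DataFrame(dict(zip(columns, lists)))
df = df.set_index(['foodname_ESP', 'foodname_ENG'])

# Rows can be looked up by the (Spanish, English) name pair.
print(df.loc[('Aceite de coco', 'Coconut oil'), 'energy'])
```

Building the frame once from complete lists avoids repeatedly appending rows to a DataFrame, which is the approach the linked Stack Overflow discussion advises against.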

Let's take a quick look at the DataFrame.

In [10]:
nutritional_values.head()

| foodname_ESP | foodname_ENG | quantity | energy | fats | prot | water | fiber | carbs | m_unsat_fats | p_unsat_fats | sat_fats | ... | C | calcium | iron | potassium | magnesium | sodium | phosphorus | iodide | selenium | zinc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aceite de algodón | Cotton oil | 100 | 888.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.8 | 51.9 | 25.9 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Aceite de cacahuete | Peanut oil | 100 | 887.0 | 99.9 | 0.0 | 0.1 | 0.0 | 0.0 | 47.8 | 28.5 | 18.8 | ... | 0.0 | 0.0 |  | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  | 0.0 |
| Aceite de coco | Coconut oil | 100 | 888.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.96 | 0.77 | 84.31 | ... | 0.0 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Aceite de colza | Rape oil | 100 | 888.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 65.3 | 28.01 | 6.29 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  | 0.0 | 0.0 |
| Aceite de germen de trigo | Wheat germ oil | 100 | 887.0 | 99.9 | 0.0 | 0.1 | 0.0 | 0.0 | 15.1 | 61.7 | 18.8 | ... | 0.0 | 0.0 | 0.0 |  |  |  |  | 0.0 |  |  |


Finally, export the database as a `.csv` file.

In [12]:
nutritional_values.to_csv('./data/foods_raw.csv', encoding='utf-8')

___
### THE MORE YOU KNOW...

Our database also contains information on which food group each food belongs to.  
Because automating the navigation would have been very time-consuming to code, I preferred to gather this information by hand:  
I clicked through the driver's browser window to reach each group page and then let the scraper collect the information.

In [None]:
dairy_group = get_group_info(driver, dairy_group)

In [None]:
eggs_group = get_group_info(driver, eggs_group)

In [None]:
meat_group = get_group_info(driver, meat_group)

In [None]:
seafood_group = get_group_info(driver, seafood_group)

In [None]:
fats_oils_group = get_group_info(driver, fats_oils_group)

In [None]:
cereals_group = get_group_info(driver, cereals_group)

In [None]:
legumes_group = get_group_info(driver, legumes_group)

In [None]:
vegetables_group = get_group_info(driver, vegetables_group)

In [None]:
fruits_group = get_group_info(driver, fruits_group)

In [None]:
sweets_group = get_group_info(driver, sweets_group)

In [None]:
drinks_group = get_group_info(driver, drinks_group)

In [None]:
misc_group = get_group_info(driver, misc_group)

In [None]:
groups = [dairy_group, eggs_group, meat_group,
          seafood_group, fats_oils_group, cereals_group,
          legumes_group, vegetables_group, fruits_group,
          sweets_group, drinks_group, misc_group]

##### Let's take a quick look at the group information

In [None]:
num_of_items = 0
for group in groups:
    num_of_items += len(group)
    print(len(group), 'items in', nameof(group, globals()))
    
print(num_of_items, 'items in total')

Now let's build a `pd.Series` for each group and export each one as a `.csv` file so we can import them again later on.

In [None]:
for group in groups:
    name_of_Series = nameof(group, globals())
    series = pd.Series(group, name=name_of_Series)
    series.to_csv(f'./data/{name_of_Series}.csv', encoding='utf-8')

In [None]:
# There was an error with the misc_group name (see the cell above)
# This cell overwrites the corresponding Series

name_of_Series = 'misc_group'
series = pd.Series(misc_group, name=name_of_Series)
series.to_csv(f'./data/{name_of_Series}.csv', encoding='utf-8')
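
The source of `nameof` is not shown here, so the cause of the `misc_group` naming error is unknown. One way to avoid looking up variable names altogether is to keep the groups in a dictionary keyed by name. The sketch below assumes that layout; `groups_by_name` and its placeholder contents are hypothetical and not part of the original notebook.

```python
# Placeholder contents; in the notebook each list would come from
# get_group_info(driver, ...).
groups_by_name = {
    'dairy_group': ['food a', 'food b'],
    'eggs_group': ['food c'],
    'misc_group': [],
}

total = 0
for name, group in groups_by_name.items():
    # The name travels with the data, so no namespace lookup is needed.
    total += len(group)
    print(len(group), 'items in', name)

print(total, 'items in total')
```

With this layout the same `name` can be used directly for the `pd.Series` name and for the `.csv` file name in the export loop.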