# 01.- CREATING FOOD DATABASE

We will start building our dataset by scraping the information from [BEDCA](https://www.bedca.net/bdpub/index.php), the Spanish Food Composition  
Database, published by the Ministry of Science and Innovation under the coordination and funding  
of the Spanish Agency for Food Safety and Nutrition of the Ministry of Health, Social Services and Equality.

The food composition values collected in this database have been obtained from different sources,  
including laboratories, the food industry and scientific publications, or have been calculated by the agency.

Our database will be stored in a `pandas.DataFrame`.

### PREREQUISITES

In [None]:
# !pip install selenium
# !pip install webdriver-manager

### IMPORTS

In [None]:
import numpy as np
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.by import By

In [None]:
from funcs_data import mklist, item_to_lists, nameof

from funcs_driver import launch_driver, goto, foodindex, get_in, get_back

from funcs_scrapping import get_general_info, get_nutritional_facts
from funcs_scrapping import refine_nutritional_facts, get_group_info

In [None]:
# This cell only needs to be executed to reload funcs_data, funcs_driver
# and funcs_scrapping after they have been imported and then modified

# %run funcs_data.py
# %run funcs_driver.py
# %run funcs_scrapping.py

### BUILDING THE DATAFRAME ( I )

Reading up on [different ways to grow a DataFrame](https://stackoverflow.com/questions/13784192/creating-an-empty-pandas-dataframe-then-filling-it) led me to build one list per column,  
grow each list via `list.append()`, and only then build the `pd.DataFrame`.
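
As a minimal sketch of that pattern, the cell below collects values in plain lists and builds the DataFrame once at the end. The records and the short variable names are placeholders made up for illustration; they are not BEDCA values.

```python
import pandas as pd

# Placeholder records standing in for scraped items (made-up values)
scraped = [("manzana", "apple", 1.0), ("pan", "bread", 2.0)]

# One plain Python list per future column
names_esp, names_eng, energy_values = [], [], []

for esp, eng, value in scraped:
    names_esp.append(esp)
    names_eng.append(eng)
    energy_values.append(value)

# Build the DataFrame a single time, after all rows have been collected
df = pd.DataFrame({
    "foodname_ESP": names_esp,
    "foodname_ENG": names_eng,
    "energy": energy_values,
})
print(df.shape)
```

Appending to a list is cheap, whereas growing a DataFrame row by row copies the data on each step, which is the reason usually given for preferring this approach.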

In [None]:
# The DataFrame will be made of a group of characteristics that we will use as columns
general_info =  ['foodname_ESP', 'foodname_ENG',
                 'quantity']
macros =        ['energy', 'fats', 'prot', 'carbs']
complementary = ['water', 'fiber', 'm_unsat_fats',
                 'p_unsat_fats', 'sat_fats']
flags =         ['palm_acid']
minerals =      ['calcium', 'iron', 'potassium', 'magnesium',
                 'sodium', 'phosphorus', 'iodide', 'selenium',
                 'zinc']
vitamins =      ['A', 'D', 'E', 'B9', 'B3', 'B2', 'B1',
                 'B12', 'B6', 'C']

characteristics = general_info + macros + complementary + flags + minerals + vitamins

# Let's generate all the lists with the help of a short function built ad hoc
foodname_ESP, foodname_ENG, quantity = mklist(len(general_info))
energy, fats, prot, carbs = mklist(len(macros))
water, fiber, m_unsat_fats, p_unsat_fats, sat_fats = mklist(len(complementary))  
palm_acid = []
calcium, iron, potassium, magnesium, sodium, phosphorus, iodide, selenium, zinc = mklist(len(minerals))
A, D, E, B9, B3, B2, B1, B12, B6, C = mklist(len(vitamins))

# To make it easier to add elements to every list, a 'superlist' is created
lists = [foodname_ESP, foodname_ENG, quantity,
         energy, fats, prot, carbs,
         water, fiber, m_unsat_fats, p_unsat_fats,
         sat_fats, palm_acid, calcium, iron,
         potassium, magnesium, sodium, phosphorus,
         iodide, selenium, zinc, A, D, E,
         B9, B3, B2, B1, B12, B6, C]
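
The source of `mklist` lives in `funcs_data` and is not shown here. As a rough sketch of what such a helper might look like (the name `mklist_sketch` and its body are assumptions, not the actual implementation), the important property is that every returned list is a distinct object:

```python
def mklist_sketch(n):
    """Return n independent empty lists (hypothetical stand-in for mklist)."""
    return [[] for _ in range(n)]

a, b, c = mklist_sketch(3)
a.append(1)
print(a, b, c)  # b and c stay empty: each list is its own object

# Pitfall to avoid: [[]] * 3 repeats the SAME list three times
x, y, z = [[]] * 3
x.append(1)
print(x is y)   # the three names refer to one shared list
```

If the columns shared a single list, every scraped value would end up in every column, so the list comprehension form is the safer choice.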

With all the lists created, it's time to fill them with the scraped data.

### SCRAPING

In [None]:
# Build the driver using Firefox as the browser
driver = launch_driver()

In [None]:
# With the base URL of the database the program will go to the all-food list
url = 'https://www.bedca.net/bdpub/index.php'
goto(driver, url)

In [None]:
# Count the foods found in the all-food list
num_of_foods = len(foodindex(driver))

for i in range(num_of_foods):
    # The website refreshes every time we go in and out of a food,
    # so the food list needs to be regenerated on every iteration
    foods = foodindex(driver)
    item = []
    
    # Get in - Get data
    get_in(foods[i])
    get_general_info(driver, item)
    get_nutritional_facts(driver, item)
    refine_nutritional_facts(driver, item)

    # Add information found to lists
    item_to_lists(item, lists)

    # Get out
    get_back(driver)
    
    ### CHECKPOINT ###
    # In case of an exception, resume by changing the range
    # at the top to range(last_i_printed + 1, num_of_foods)
    print(i, foodname_ESP[i])
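
The checkpoint idea can also be sketched without editing the range by hand. The helper below is hypothetical (`scrape_from`, `flaky` and the simulated failure are illustrations, not part of `funcs_scrapping`); it stops at the first failing index and returns it, so the next call can pick up from there.

```python
def scrape_from(start, total, process, done):
    """Process indices start..total-1 and record each finished index in `done`.

    Returns the index to resume from: `total` on success,
    or the index whose processing raised an exception.
    """
    for i in range(start, total):
        try:
            process(i)
        except Exception as exc:
            print(f"stopped at {i}: {exc!r}")
            return i
        done.append(i)
    return total

# Simulated scraper step that fails once at index 2 (stand-in for a page timeout)
failed_once = set()

def flaky(i):
    if i == 2 and i not in failed_once:
        failed_once.add(i)
        raise RuntimeError("simulated page timeout")

done = []
resume_at = scrape_from(0, 5, flaky, done)          # stops at index 2
resume_at = scrape_from(resume_at, 5, flaky, done)  # continues from index 2
print(done, resume_at)
```

This relies on the same property as the loop above: data for a food is only added to the lists after all of its getters have succeeded, so a failed item can simply be retried.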

### BUILDING THE DATAFRAME ( II )

With all foods inspected and all lists filled, it's time to finish building the `pd.DataFrame`.

In [None]:
data = dict(zip(characteristics, lists))

nutritional_values = pd.DataFrame(data)
nutritional_values.set_index(['foodname_ESP', 'foodname_ENG'], inplace=True)

Let's take a quick look at the DataFrame

In [None]:
nutritional_values.head()

And export the database as a `.csv` file

In [None]:
nutritional_values.to_csv('./data/nutritional_values.csv', encoding='utf-8')
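
Because the DataFrame uses a two-level index, reading the file back needs both index columns to be named. The sketch below shows the round trip on a tiny placeholder frame, using an in-memory buffer instead of `./data/nutritional_values.csv` so that it runs on its own; the values are made up.

```python
import io

import pandas as pd

# Tiny placeholder frame with the same two-level index as nutritional_values
df = pd.DataFrame({
    "foodname_ESP": ["manzana", "pan"],
    "foodname_ENG": ["apple", "bread"],
    "energy": [1.0, 2.0],
}).set_index(["foodname_ESP", "foodname_ENG"])

# Write to an in-memory buffer (a real file path works the same way)
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# Restore the MultiIndex by naming both index columns
restored = pd.read_csv(buffer, index_col=["foodname_ESP", "foodname_ENG"])
print(restored.loc[("manzana", "apple"), "energy"])
```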

### THE MORE YOU KNOW...

Our database also contains information on which food group each food belongs to.  
Since automating the navigation would have been very time-consuming to code, I preferred to gather this information by hand,  
clicking in the browser window to reach the desired page and then letting the scraper collect the information.

In [None]:
dairy_group = get_group_info(driver, dairy_group)

In [None]:
eggs_group = get_group_info(driver, eggs_group)

In [None]:
meat_group = get_group_info(driver, meat_group)

In [None]:
seafood_group = get_group_info(driver, seafood_group)

In [None]:
fats_oils_group = get_group_info(driver, fats_oils_group)

In [None]:
cereals_group = get_group_info(driver, cereals_group)

In [None]:
legumes_group = get_group_info(driver, legumes_group)

In [None]:
vegetables_group = get_group_info(driver, vegetables_group)

In [None]:
fruits_group = get_group_info(driver, fruits_group)

In [None]:
sweets_group = get_group_info(driver, sweets_group)

In [None]:
drinks_group = get_group_info(driver, drinks_group)

In [None]:
misc_group = get_group_info(driver, misc_group)

In [None]:
groups = [dairy_group, eggs_group, meat_group,
          seafood_group, fats_oils_group, cereals_group,
          legumes_group, vegetables_group, fruits_group,
          sweets_group, drinks_group, misc_group]

##### Let's take a quick look at the information of the groups

In [None]:
num_of_items = 0
for group in groups:
    num_of_items += len(group)
    print(len(group), 'items in', nameof(group, globals()))
    
print(num_of_items, 'items in total')

Now let's build a `pd.Series` for each group and export them as `.csv` files so that we can import them again later on.

In [None]:
for group in groups:
    name_of_Series = nameof(group, globals())
    series = pd.Series(group, name=name_of_Series)
    series.to_csv(f'./data/{name_of_Series}.csv', encoding='utf-8')

In [None]:
# There was an error with misc_group name (see cell above)
# This cell overwrites the corresponding Series

name_of_Series = 'misc_group'
series = pd.Series(misc_group, name=name_of_Series)
series.to_csv(f'./data/{name_of_Series}.csv', encoding='utf-8')
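
One possible way to avoid looking names up in `globals()` is to keep the names next to the data in a dictionary. The sketch below is only an illustration under that assumption: the group contents are placeholders, and it writes to a temporary directory rather than `./data/`. It also shows how a saved Series could be read back.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder groups standing in for the scraped lists
named_groups = {
    "dairy_group": ["food a", "food b"],
    "eggs_group": ["food c"],
}

# Temporary output directory so the sketch does not touch ./data/
out_dir = Path(tempfile.mkdtemp())

for name, items in named_groups.items():
    pd.Series(items, name=name).to_csv(out_dir / f"{name}.csv", encoding="utf-8")

# Read one group back: the first column is the saved index,
# the remaining column holds the values under the Series name
dairy = pd.read_csv(out_dir / "dairy_group.csv", index_col=0)["dairy_group"]
print(dairy.tolist())
```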