# Clean raw data

A CSV file containing different recipes was created in the notebook [get_from_api](get_from_api.ipynb); this file contains raw data. The goal of this notebook is to clean that data and create a database (with SQL).

In [269]:
import pandas as pd
import numpy as np

In [270]:
df = pd.read_csv('../Data/raw_recipe.csv')
df.info()
df.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1393 entries, 0 to 1392
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   id                   1393 non-null   int64 
 1   image                1393 non-null   object
 2   sourceUrl            1393 non-null   object
 3   title                1393 non-null   object
 4   instructions         1369 non-null   object
 5   summary              1393 non-null   object
 6   extendedIngredients  1393 non-null   object
dtypes: int64(1), object(6)
memory usage: 76.3+ KB


|   | id | image | sourceUrl | title | instructions | summary | extendedIngredients |
|---|----|-------|-----------|-------|--------------|---------|---------------------|
| 0 | 715495 | https://spoonacular.com/recipeImages/715495-55... | http://www.pinkwhen.com/turkey-tomato-cheese-p... | Turkey Tomato Cheese Pizza | `Heat up your grill to 450 degrees.Start off wi...` | `Turkey Tomato Cheese Pizza might be just the <...` | `[{'id': 11333, 'aisle': 'Produce', 'image': 'g...` |
| 1 | 665282 | https://spoonacular.com/recipeImages/665282-55... | https://www.foodista.com/recipe/MQZZ3YMC/whole... | Whole Wheat Dinner Rolls | `In a small saucepan, bring the 1 cup water and...` | `You can never have too many bread recipes, so ...` | `[{'id': 14412, 'aisle': 'Beverages', 'image': ...` |
| 2 | 632197 | https://spoonacular.com/recipeImages/632197-55... | http://www.foodista.com/recipe/Y26QLV35/almond... | Almond Toffee Bars | `<ol><li>1. Preheat oven to 350 degrees (325 if...` | `Almond Toffee Bars is a dessert that serves 1....` | `[{'id': 1002050, 'aisle': 'Baking', 'image': '...` |
| 3 | 658536 | https://spoonacular.com/recipeImages/658536-55... | https://www.foodista.com/recipe/C3QSZ5T7/roast... | Roasted Cauliflower and Leek Soup | `Preheat oven to 425 degrees Farenheit.\nSpread...` | `Roasted Cauliflower and Leek Soup is a <b>glut...` | `[{'id': 11135, 'aisle': 'Produce', 'image': 'c...` |
| 4 | 639836 | https://spoonacular.com/recipeImages/639836-55... | https://www.foodista.com/recipe/LDHBBCLQ/cocon... | Coconut-Almond Crusted Tilapia | `Pat and dry fish fillets. Sprinkle both sides ...` | `Coconut-Almond Crusted Tilapia requires approx...` | `[{'id': 15261, 'aisle': 'Seafood', 'image': 'r...` |


## clean data

* delete duplicates
* clean the columns: id, image, sourceUrl and title are already usable as they are. instructions and summary contain HTML; we will keep this format (at least for now). The trickiest column to clean is extendedIngredients (which we will rename ingredients).

extendedIngredients is a list of dictionaries. From each dictionary we will keep the following keys:

* id: the id of the ingredient
* aisle: could give us an idea of the ecological impact of the ingredient (we will see if we can use it later)
* nameClean: the name of the ingredient
* amount: the quantity needed
* unit: unit of 'amount'


In [271]:
# delete duplicated recipes (rows sharing the same id)
df = df.drop_duplicates(subset=['id'])
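
Before dropping rows, it can be useful to count how many ids are repeated. The sketch below runs on a small hypothetical frame (the rows are made up for illustration and are not read from raw_recipe.csv):

```python
import pandas as pd

# Hypothetical sample in which the id 715495 appears twice
sample = pd.DataFrame({
    "id": [715495, 665282, 715495],
    "title": [
        "Turkey Tomato Cheese Pizza",
        "Whole Wheat Dinner Rolls",
        "Turkey Tomato Cheese Pizza",
    ],
})

# duplicated() flags every repeat after the first occurrence of an id
n_duplicates = int(sample["id"].duplicated().sum())

# drop_duplicates keeps the first occurrence by default
deduplicated = sample.drop_duplicates(subset=["id"])

print(n_duplicates, len(deduplicated))
```

Only the id column is compared, so two rows with the same id but different text would still be reduced to the first one.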

In [272]:
import ast

def clean_ingredients(ingredients: str) -> list:
    """
    `ingredients` is a list of dictionaries, but it is stored as a string.
    For each dictionary in `ingredients`, keep only the keys: id, aisle, nameClean, amount, unit.
    Return the cleaned list of dictionaries.
    """
    ingredients = ast.literal_eval(ingredients)  # the column contains strings; parse each one into a list
    cleaned_list = []

    for ingredient in ingredients:
        cleaned_dict = {key: value for key, value in ingredient.items() if key in ['id', 'aisle', 'nameClean', 'amount', 'unit']}
        cleaned_list.append(cleaned_dict)

    return cleaned_list


In [273]:
df['ingredients'] = df['extendedIngredients'].apply(clean_ingredients)
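
To make the effect of this step concrete, here is a small worked example on a single hand-written value. The function is repeated in a compact form so the snippet runs on its own, and the raw string is hypothetical: it only mimics the shape of one extendedIngredients cell (the image file name, name, amount and unit are invented).

```python
import ast

KEPT_KEYS = ("id", "aisle", "nameClean", "amount", "unit")


def clean_ingredients(ingredients: str) -> list:
    """Parse the stringified list and keep only the keys of interest."""
    parsed = ast.literal_eval(ingredients)
    return [
        {key: value for key, value in item.items() if key in KEPT_KEYS}
        for item in parsed
    ]


# Hypothetical raw cell: a Python-literal list of dictionaries stored as text
raw = (
    "[{'id': 11333, 'aisle': 'Produce', 'image': 'green-pepper.jpg', "
    "'nameClean': 'green pepper', 'amount': 1.0, 'unit': 'cup'}]"
)

cleaned = clean_ingredients(raw)
print(cleaned)
```

The image key is dropped, and the five kept keys come back in their original order.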

# Create the database

The app will interact with the database through SQL, so it is now time to create it.

__TABLE Recipes__:

| id          | image | sourceUrl | title | instructions | ingredients |
|-------------|-------|-----------|-------|--------------|-------------|
| PRIMARY KEY |       |           |       |              |             |

The ingredients column will contain, for each ingredient of the recipe, its id, the amount needed and the unit.


__TABLE Ingredients__:

| id          | aisle | name|
|-------------|-------|-----|
| PRIMARY KEY |       |     |

Note that the ingredients column of Recipes does not reference the id column of Ingredients, because each cell of that column contains multiple ingredients.
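
For comparison only, a common way to get a real reference to Ingredients is a junction table with one row per (recipe, ingredient) pair. The sketch below illustrates that alternative on an in-memory database; it is not what this notebook builds, and the table name RecipeIngredients and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Recipes (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Ingredients (id INTEGER PRIMARY KEY, aisle TEXT, name TEXT);
CREATE TABLE RecipeIngredients (
    recipe_id INTEGER REFERENCES Recipes(id),
    ingredient_id INTEGER REFERENCES Ingredients(id),
    amount REAL,
    unit TEXT
);
""")

# Hypothetical rows: one recipe using one ingredient
conn.execute("INSERT INTO Recipes VALUES (715495, 'Turkey Tomato Cheese Pizza')")
conn.execute("INSERT INTO Ingredients VALUES (11333, 'Produce', 'green pepper')")
conn.execute("INSERT INTO RecipeIngredients VALUES (715495, 11333, 1.0, 'cup')")
conn.commit()

# Join the three tables to list each recipe with its ingredients
rows = conn.execute("""
SELECT r.title, i.name, ri.amount, ri.unit
FROM RecipeIngredients AS ri
JOIN Recipes AS r ON r.id = ri.recipe_id
JOIN Ingredients AS i ON i.id = ri.ingredient_id
""").fetchall()
conn.close()
print(rows)
```

The single-column design used in this notebook is simpler to load from pandas; the junction table mainly pays off if the app needs to query recipes by ingredient in SQL.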

### split the ingredients column as wanted

In [274]:
def get_all_ingredients(ingredients: list) -> list:
    """
    For each dictionary in the list,
    keep only the keys: id, aisle, nameClean.
    Return the list of dictionaries.
    """
    cleaned_list=[]
    for ingredient in ingredients:
        clean_dict = {key: value for key, value in ingredient.items() if key in ['id', 'aisle', 'nameClean']}
        cleaned_list.append(clean_dict)

    return cleaned_list


In [275]:
def get_recipes_table(ingredients: list) -> list:
    """
    For each dictionary in the list,
    keep only the keys: id, amount, unit.
    Return the list of dictionaries.
    """
    cleaned_list=[]
    for ingredient in ingredients:
        clean_dict = {key: value for key, value in ingredient.items() if key in ['id', 'amount', 'unit']}
        cleaned_list.append(clean_dict)

    return cleaned_list

In [276]:
all_ingr_duplicates = df['ingredients'].apply(get_all_ingredients)
df['ingredients'] = df['ingredients'].apply(get_recipes_table)

### creation of a dataframe with all the ingredients

In [277]:
list_ingr = []
seen_ids = set()  # ids already added; membership tests on a set are fast

for recipe in all_ingr_duplicates:
    for ingredient in recipe:
        if ingredient['id'] in seen_ids:
            continue
        # first occurrence of this ingredient: keep it, storing nameClean under 'name'
        new_row = {'id': ingredient['id'], 'aisle': ingredient['aisle'], 'name': ingredient['nameClean']}
        list_ingr.append(new_row)
        seen_ids.add(ingredient['id'])

df_ingr = pd.DataFrame(list_ingr)
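
The same table can also be built with pandas operations instead of an explicit loop. This is a sketch of an alternative, not the method used above; the input series and the name df_ingr_alt are hypothetical stand-ins for all_ingr_duplicates and df_ingr.

```python
import pandas as pd

# Hypothetical stand-in for all_ingr_duplicates: one list of dicts per recipe
recipes = pd.Series([
    [
        {"id": 1, "aisle": "Produce", "nameClean": "tomato"},
        {"id": 2, "aisle": "Baking", "nameClean": "flour"},
    ],
    [{"id": 1, "aisle": "Produce", "nameClean": "tomato"}],
])

# explode() yields one dict per row; dropna() skips recipes with an empty list
flat = pd.DataFrame(recipes.explode().dropna().tolist())

df_ingr_alt = (
    flat.rename(columns={"nameClean": "name"})
        .drop_duplicates(subset=["id"])  # keep the first occurrence of each id
        .reset_index(drop=True)
)
print(df_ingr_alt)
```

As in the loop version, the first occurrence of each id is the one that is kept.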


### now create the database

In [278]:
# SQL does not accept lists as values, so convert them to strings
df['ingredients'] = df['ingredients'].astype('string')

In [279]:
import sqlite3

# connect to the database (it is created if it does not exist)
conn = sqlite3.connect('../Data/Supermarket.db')

df.to_sql('Recipes', conn, if_exists='replace', index=False)
df_ingr.to_sql('Ingredients', conn, if_exists='replace', index=False)

conn.commit()
conn.close()
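
Because the ingredients column is stored as text, whatever reads the database has to turn it back into a list. The sketch below shows that round trip on an in-memory database with one hypothetical row; it assumes the stored text is the Python representation of the list, so that ast.literal_eval can parse it.

```python
import ast
import sqlite3

# In-memory stand-in for ../Data/Supermarket.db, with a reduced Recipes table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Recipes (id INTEGER, title TEXT, ingredients TEXT)")

# Hypothetical recipe row; the list is stored as its string representation
ingredients = [{"id": 11333, "amount": 1.0, "unit": "cup"}]
conn.execute(
    "INSERT INTO Recipes VALUES (?, ?, ?)",
    (715495, "Turkey Tomato Cheese Pizza", str(ingredients)),
)
conn.commit()

# Read the text back and parse it into a list of dictionaries
stored = conn.execute(
    "SELECT ingredients FROM Recipes WHERE id = ?", (715495,)
).fetchone()[0]
restored = ast.literal_eval(stored)
conn.close()
print(restored)
```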

In [280]:
# No constraints are added: they would not be useful in our case, and SQLite does not support some operations.
# For example, a PRIMARY KEY constraint cannot be added to an existing table; you have to
# create the table with the constraint first and then insert the data into it.
# We won't do that here.
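
For reference, the approach described in the comment above (create the table with its constraint, then insert the data) could look like the following. It is a sketch on an in-memory database with one hypothetical ingredient row, not a step of this notebook.

```python
import sqlite3

import pandas as pd

# Hypothetical one-row stand-in for df_ingr
df_ingr = pd.DataFrame([{"id": 11333, "aisle": "Produce", "name": "green pepper"}])

conn = sqlite3.connect(":memory:")
# Create the table with its constraint first...
conn.execute("CREATE TABLE Ingredients (id INTEGER PRIMARY KEY, aisle TEXT, name TEXT)")
# ...then append the data, so that to_sql does not replace the table definition
df_ingr.to_sql("Ingredients", conn, if_exists="append", index=False)
conn.commit()

# PRAGMA table_info returns one row per column; field 5 (pk) is non-zero for key columns
pk_columns = [row[1] for row in conn.execute("PRAGMA table_info(Ingredients)") if row[5]]

# A second row with the same id is now rejected by the constraint
try:
    conn.execute("INSERT INTO Ingredients VALUES (11333, 'Produce', 'duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
conn.close()
print(pk_columns, duplicate_rejected)
```

With if_exists='replace', as used earlier, pandas drops and recreates the table, which is why the constraint would be lost.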