### NLP

- Features from the collected data that we will use to build the kNN model:
    1. Nutrition
    2. Tags
    3. Ingredients
- To process the tags and ingredients, we have to convert them into numerical representations.
- For tags, embeddings are a better fit: tags are categorical labels that repeat across recipes.
- For ingredients, we will use TF-IDF vectorization.
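
To make the TF-IDF idea concrete, here is a minimal from-scratch sketch. It assumes the ingredients have already been reduced to clean tokens (the `recipes` list is hypothetical) and uses the plain textbook weighting `tf * log(N / df)`. Library implementations usually add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Plain textbook weighting: tf * log(N / df), without smoothing.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already-cleaned ingredient tokens for three recipes.
recipes = [
    ["onion", "tomato", "salt"],
    ["onion", "capsicum", "salt"],
    ["bread", "butter", "salt"],
]
weights = tfidf(recipes)
```

In this toy data, "salt" appears in every recipe and therefore gets a weight of zero, while the rarer "tomato" outweighs "onion". Down-weighting ubiquitous ingredients in this way is the reason TF-IDF is a reasonable fit for the ingredients column.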

#### Import dependencies

In [46]:
import pandas as pd
import numpy as np
import re

#### Get the data

In [2]:
df1 = pd.read_csv("../Data Cleaning/cleaned_data1.csv")
df2 = pd.read_csv("../../data - scraping/data cleaning/cleaned_data2.csv")

In [None]:
df1["nutrition"]

0       {'Energy': '195 cal', 'Protein': '10.3 g', 'Ca...
1       {'Energy': '74 cal', 'Protein': '2.6 g', 'Carb...
2       {'Energy': '374 cal', 'Protein': '13.3 g', 'Ca...
3       {'Energy': '92 cal', 'Protein': '0.1 g', 'Carb...
4       {'Energy': '68 cal', 'Protein': '2.3 g', 'Carb...
                              ...                        
1964    {'kcal': '333 g', 'fat': '10 g', 'saturates': ...
1965    {'kcal': '424 g', 'fat': '13 g', 'saturates': ...
1966    {'kcal': '143 g', 'fat': '12 g', 'saturates': ...
1967    {'kcal': '351 g', 'fat': '13 g', 'saturates': ...
1968    {'kcal': '309 g', 'fat': '13 g', 'saturates': ...
Name: nutrition, Length: 1969, dtype: object

In [5]:
df = pd.concat([df1,df2], ignore_index=True)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3595 entries, 0 to 3594
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    3595 non-null   int64 
 1   name          3595 non-null   object
 2   ingredients   3595 non-null   object
 3   instructions  3595 non-null   object
 4   nutrition     3595 non-null   object
 5   time          3588 non-null   object
 6   serving_size  3593 non-null   object
 7   tags          3595 non-null   object
dtypes: int64(1), object(7)
memory usage: 224.8+ KB


#### Extracting Features

- First we need the nutrition values of each recipe as separate columns, so let's explode the nutrition dictionaries.
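
Each nutrition entry is stored as the string form of a dict. As a hedged aside, `ast.literal_eval` parses such strings without executing arbitrary code, which `eval` would do. The helper name `parse_nutrition` is hypothetical, and the sample string is shortened from the first row of the data.

```python
import ast

def parse_nutrition(raw):
    """Parse a stringified dict; return {} if the text is not a dict literal."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Anything that is not a plain Python literal ends up here.
        return {}
    return parsed if isinstance(parsed, dict) else {}

sample = "{'Energy': '195 cal', 'Protein': '10.3 g'}"
parsed = parse_nutrition(sample)
empty = parse_nutrition("{}")
rejected = parse_nutrition("__import__('os').getcwd()")  # not a literal, so not evaluated
```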

In [42]:
eval(df2["nutrition"][0])  #11 different values

{'Calories ': '393 cal',
 'Kilojoules ': '1644 kJ',
 'Protein ': '12 g',
 'Total fat ': '9.4 g',
 'Saturated fat ': '1.4 g',
 'Carbohydrates ': '56 g',
 'Sugar ': '4.6 g',
 'Dietary fibre ': '4.7 g',
 'Sodium ': '158 mg',
 'Calcium ': '173 mg',
 'Iron ': '3.1 mg'}

In [None]:
eval(df1["nutrition"][0]) #7 different values

{'Energy': '195 cal',
 'Protein': '10.3 g',
 'Carbohydrates': '30.5 g',
 'Fiber': '7.9 g',
 'Fat': '4.1 g',
 'Cholesterol': '0 mg',
 'Sodium': '8.8 mg'}

In [43]:
eval(df1["nutrition"][1960]) #8 different values

{'kcal': '563 g',
 'fat': '26 g',
 'saturates': '3.8 g',
 'carbs': '65 g',
 'sugars': '9 g',
 'fibre': '8 g',
 'protein': '17 g',
 'salt': '0.6 g'}

- Since our data is collected from three different sources, the nutrition fields vary between them. Let us keep the most basic nutrition values and drop the rest for now.
- Later, we could also impute additional nutrients such as calcium and iron using a food nutrition dataset.
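
One way to check which nutrients all three sources share is to normalize the key names and count how many sources contain each one. The sketch below copies the key names from the three sample outputs above. The `ALIASES` table is an illustrative assumption that follows the groupings used in the notebook's `nutrition_preprocessing` function.

```python
from collections import Counter

# Key names copied from the three sample outputs above.
source_keys = [
    ['Calories ', 'Kilojoules ', 'Protein ', 'Total fat ', 'Saturated fat ',
     'Carbohydrates ', 'Sugar ', 'Dietary fibre ', 'Sodium ', 'Calcium ', 'Iron '],
    ['Energy', 'Protein', 'Carbohydrates', 'Fiber', 'Fat', 'Cholesterol', 'Sodium'],
    ['kcal', 'fat', 'saturates', 'carbs', 'sugars', 'fibre', 'protein', 'salt'],
]

# Illustrative alias table: normalized source spelling -> canonical name.
ALIASES = {
    "calories": "calories", "energy": "calories", "kcal": "calories",
    "protein": "protein",
    "carbohydrates": "carbohydrates", "carbs": "carbohydrates",
    "fiber": "fiber", "fibre": "fiber", "dietary fibre": "fiber",
    "fat": "fat", "total fat": "fat",
    "sodium": "sodium", "salt": "sodium",
}

coverage = Counter()
for keys in source_keys:
    # Strip the trailing spaces and lowercase before looking up the alias.
    canonical = {ALIASES.get(k.strip().lower()) for k in keys}
    canonical.discard(None)  # keys without an alias (e.g. 'Iron ') are ignored
    coverage.update(canonical)

# Nutrients present in every source.
shared = sorted(name for name, n in coverage.items() if n == len(source_keys))
```

For these samples, `shared` contains the six nutrients that the notebook keeps.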

In [24]:
#feature set
features = df[["ingredients","nutrition","tags"]]

In [25]:
features.head(3)

,ingredients,nutrition,tags
0,['2 cups sprouted vaal (field beans/ butter be...,"{'Energy': '195 cal', 'Protein': '10.3 g', 'Ca...","['Non-stick Pan', 'Boiled Indian recipes', 'Sa..."
1,"['2 1/2 cups capsicum cubes', '1/2 cup low-fat...","{'Energy': '74 cal', 'Protein': '2.6 g', 'Carb...","['Non Stick Kadai Veg', 'Antioxidant Rich Indi..."
2,"['1 cup sliced onions', '3 tbsp roughly choppe...","{'Energy': '374 cal', 'Protein': '13.3 g', 'Ca...","['Non-stick Pan', 'Indian Dinner', 'Indian Lun..."


##### Preprocessing the nutrients

In [26]:
features["nutrition"]

0       {'Energy': '195 cal', 'Protein': '10.3 g', 'Ca...
1       {'Energy': '74 cal', 'Protein': '2.6 g', 'Carb...
2       {'Energy': '374 cal', 'Protein': '13.3 g', 'Ca...
3       {'Energy': '92 cal', 'Protein': '0.1 g', 'Carb...
4       {'Energy': '68 cal', 'Protein': '2.3 g', 'Carb...
                              ...                        
3590    {'Calories ': '495 cal', 'Kilojoules ': '2070 ...
3591    {'Calories ': '546 cal', 'Kilojoules ': '2280 ...
3592    {'Calories ': '380 cal', 'Kilojoules ': '1590 ...
3593                                                   {}
3594    {'Calories ': '449 cal', 'Kilojoules ': '1880 ...
Name: nutrition, Length: 3595, dtype: object

In [118]:
# Convert each nutrition string into a dict of numeric values, stripping
# units such as 'cal' and 'g'.
# Nutrients we are going to keep:
# ['calories', 'protein', 'carbohydrates', 'fiber', 'fat', 'sodium']

import ast

def nutrition_preprocessing(nutrients_series):
    """Map one stringified nutrition dict onto the six nutrients we keep."""
    mapped = {"calories": None, "protein": None, "carbohydrates": None,
              "fiber": None, "fat": None, "sodium": None}
    to_dict = ast.literal_eval(nutrients_series)
    for key, value in to_dict.items():
        prep_key = key.strip().lower()
        # Keep only digits and the decimal point, e.g. '10.3 g' -> '10.3'
        cleaned = re.sub(r'[^\d\.]', '', value)
        if cleaned == "":  # Some values in carbohydrates are like "N/A"
            continue
        # elif chain: each key matches at most one nutrient
        if prep_key in ["calories", "energy", "kcal"]:
            mapped["calories"] = float(cleaned)
        elif prep_key == "protein":
            mapped["protein"] = float(cleaned)
        elif prep_key in ["carbohydrates", "carbs"]:
            mapped["carbohydrates"] = float(cleaned)
        elif prep_key in ["fiber", "dietary fibre", "fibre"]:
            mapped["fiber"] = float(cleaned)
        elif prep_key in ["fat", "total fat"]:
            mapped["fat"] = float(cleaned)
        elif prep_key in ["sodium", "salt"]:
            # In the samples above 'salt' is given in g and 'sodium' in mg;
            # no unit conversion is applied here.
            mapped["sodium"] = float(cleaned)
        # Any other key (cholesterol, sugar, calcium, ...) is dropped for now
    return mapped

In [119]:
cleaned = features["nutrition"].apply(nutrition_preprocessing)
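
The mapping above stores both `sodium` and `salt` values in the same column, but in the sample outputs sodium is given in mg and salt in g. If the two should be comparable, one option is to convert salt to sodium with the common labelling convention salt = sodium × 2.5. The following is a sketch under that assumption, not what the notebook does. The helper `sodium_mg` is hypothetical.

```python
import re

SALT_TO_SODIUM_DIVISOR = 2.5  # labelling convention: salt = sodium * 2.5

def sodium_mg(key, value):
    """Return sodium in mg from a 'sodium' (mg) or 'salt' (g) label entry."""
    match = re.search(r"\d+(?:\.\d+)?", value)
    if match is None:  # e.g. "N/A"
        return None
    amount = float(match.group())
    if key.strip().lower() == "salt":
        # grams of salt -> milligrams of salt -> milligrams of sodium
        return amount * 1000 / SALT_TO_SODIUM_DIVISOR
    # Assumed to be sodium already expressed in mg.
    return amount

from_salt = sodium_mg("salt", "0.6 g")
from_sodium = sodium_mg("Sodium ", "158 mg")
missing = sodium_mg("salt", "N/A")
```

With this conversion, 0.6 g of salt corresponds to roughly 240 mg of sodium, which is on the same scale as the sodium values from the other two sources.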

In [120]:
exploded_df = pd.json_normalize(cleaned)

In [121]:
exploded_df

,calories,protein,carbohydrates,fiber,fat,sodium
0,195.0,10.3,30.5,7.9,4.1,8.8
1,74.0,2.6,9.5,3.2,2.9,21.8
2,374.0,13.3,16.4,4.2,28.4,98.6
3,92.0,0.1,0.4,0.0,10.0,0.0
4,68.0,2.3,3.9,3.4,5.3,114.5
...,...,...,...,...,...,...
3590,495.0,19.0,45.0,14.0,25.0,650.0
3591,546.0,23.0,50.0,12.0,28.0,770.0
3592,380.0,16.0,55.0,11.0,9.0,350.0
3593,,,,,,


In [122]:
# add the exploded columns to features
features_exploded = features.drop("nutrition", axis=1).join(exploded_df)

In [123]:
features_exploded.head(3)

,ingredients,tags,calories,protein,carbohydrates,fiber,fat,sodium
0,['2 cups sprouted vaal (field beans/ butter be...,"['Non-stick Pan', 'Boiled Indian recipes', 'Sa...",195.0,10.3,30.5,7.9,4.1,8.8
1,"['2 1/2 cups capsicum cubes', '1/2 cup low-fat...","['Non Stick Kadai Veg', 'Antioxidant Rich Indi...",74.0,2.6,9.5,3.2,2.9,21.8
2,"['1 cup sliced onions', '3 tbsp roughly choppe...","['Non-stick Pan', 'Indian Dinner', 'Indian Lun...",374.0,13.3,16.4,4.2,28.4,98.6


In [124]:
# check how many values are missing after extracting the nutrients
features_exploded.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3595 entries, 0 to 3594
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ingredients    3595 non-null   object 
 1   tags           3595 non-null   object 
 2   calories       3567 non-null   float64
 3   protein        3568 non-null   float64
 4   carbohydrates  3563 non-null   float64
 5   fiber          3567 non-null   float64
 6   fat            3568 non-null   float64
 7   sodium         3552 non-null   float64
dtypes: float64(6), object(2)
memory usage: 224.8+ KB


In [128]:
features_exploded[features_exploded.isnull().any(axis=1)].tail(3)

,ingredients,tags,calories,protein,carbohydrates,fiber,fat,sodium
3582,['1 large slice sourdough or wholegrain bread'...,"['Gluten-free option', 'Ready in 20 minutes', ...",,,,,,
3587,"['spray oil', '1 small potato, diced', '1 cup ...","['Gluten-free option', 'Nut free', 'Meals for ...",,,,,,
3593,"['1 cup mesclun or other salad leaves', '1 cup...","['Meals for one', 'No, or minimal, cooking', '...",,,,,,


In [126]:
# We have no option but to drop these rows, as their nutrient info is missing

In [129]:
features_exploded.dropna(inplace=True)

In [130]:
features_exploded.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3546 entries, 0 to 3594
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ingredients    3546 non-null   object 
 1   tags           3546 non-null   object 
 2   calories       3546 non-null   float64
 3   protein        3546 non-null   float64
 4   carbohydrates  3546 non-null   float64
 5   fiber          3546 non-null   float64
 6   fat            3546 non-null   float64
 7   sodium         3546 non-null   float64
dtypes: float64(6), object(2)
memory usage: 249.3+ KB
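
The `info()` output above reports `Index: 3546 entries, 0 to 3594`, so the row labels keep gaps where rows were dropped. If a later step builds positional arrays from these rows (for example a vectorized ingredient matrix), resetting the index keeps labels and positions aligned. This is a small sketch on a made-up toy frame. Whether the notebook resets the index later is not shown here.

```python
import pandas as pd

# Toy frame standing in for features_exploded; the values are made up.
toy = pd.DataFrame({
    "calories": [195.0, None, 374.0],
    "protein": [10.3, None, 13.3],
})

toy = toy.dropna()
labels_before = list(toy.index)   # the label of the dropped row is missing

toy = toy.reset_index(drop=True)  # relabel rows 0..n-1, discarding old labels
labels_after = list(toy.index)
```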


##### Preprocessing the ingredients