### NLP

- Features we will be using out of the collected data to build knn model:
    1. Nutrition
    2. Tags
    3. Ingredients
- In order to process the tags and ingredients we have to convert them into numerical representations
- For tags it is better to use embeddings as these words are more categorical which have repetitions throughout recipies.
- For ingredients we will be using the tf-idf vectorization 

#### Pre-requisites

In [240]:
import warnings
warnings.filterwarnings(action="ignore")

#### Import dependencies

In [241]:
import pandas as pd
import numpy as np
import re


#### Get the data

In [242]:
df = pd.read_csv("../Data Cleaning/all_combined_clean_data.csv")

In [243]:
df["nutrition"]

0       {'Energy': '195 cal', 'Protein': '10.3 g', 'Ca...
1       {'Energy': '74 cal', 'Protein': '2.6 g', 'Carb...
2       {'Energy': '374 cal', 'Protein': '13.3 g', 'Ca...
3       {'Energy': '92 cal', 'Protein': '0.1 g', 'Carb...
4       {'Energy': '68 cal', 'Protein': '2.3 g', 'Carb...
                              ...                        
3614    {'Calories ': '495 cal', 'Kilojoules ': '2070 ...
3615    {'Calories ': '546 cal', 'Kilojoules ': '2280 ...
3616    {'Calories ': '380 cal', 'Kilojoules ': '1590 ...
3617                                                   {}
3618    {'Calories ': '449 cal', 'Kilojoules ': '1880 ...
Name: nutrition, Length: 3619, dtype: object

In [244]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3619 entries, 0 to 3618
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    3619 non-null   int64 
 1   name          3619 non-null   object
 2   ingredients   3619 non-null   object
 3   instructions  3619 non-null   object
 4   nutrition     3619 non-null   object
 5   time          3607 non-null   object
 6   serving_size  3594 non-null   object
 7   tags          3619 non-null   object
dtypes: int64(1), object(7)
memory usage: 226.3+ KB


### Extracting Features

#### I. Nutrition

- First we need the nutrition values in per recipe so let's explode them

In [245]:
eval(df["nutrition"][0])  #7 different values

{'Energy': '195 cal',
 'Protein': '10.3 g',
 'Carbohydrates': '30.5 g',
 'Fiber': '7.9 g',
 'Fat': '4.1 g',
 'Cholesterol': '0 mg',
 'Sodium': '8.8 mg'}

In [246]:
eval(df["nutrition"].iloc[-1]) #11 different values

{'Calories ': '449 cal',
 'Kilojoules ': '1880 kJ',
 'Protein ': '38 g',
 'Total fat ': '17 g',
 'Saturated fat ': '6 g',
 'Carbohydrates ': '35 g',
 'Sugar ': '8 g',
 'Dietary fibre ': '5 g',
 'Sodium ': '530 mg',
 'Calcium ': '130 mg',
 'Iron ': '3.5 mg'}

In [247]:
eval(df["nutrition"][1960]) #8 different values

{'kcal': '563 g',
 'fat': '26 g',
 'saturates': '3.8 g',
 'carbs': '65 g',
 'sugars': '9 g',
 'fibre': '8 g',
 'protein': '17 g',
 'salt': '0.6 g'}

- Since our data is collected from three different sources, there seems to be varying measures of nutrition, let us keep the most basic nutrition values and drop the rest for now. 
- Later, we could also impute the additional nutrients like calcium, iron using a food nutrition dataset

In [248]:
#feature set
features = df[["ingredients","nutrition","tags"]]

In [249]:
features.head(3)

Unnamed: 0,ingredients,nutrition,tags
0,['2 cups sprouted vaal (field beans/ butter be...,"{'Energy': '195 cal', 'Protein': '10.3 g', 'Ca...","['Non-stick Pan', 'Boiled Indian recipes', 'Sa..."
1,"['2 1/2 cups capsicum cubes', '1/2 cup low-fat...","{'Energy': '74 cal', 'Protein': '2.6 g', 'Carb...","['Non Stick Kadai Veg', 'Antioxidant Rich Indi..."
2,"['1 cup sliced onions', '3 tbsp roughly choppe...","{'Energy': '374 cal', 'Protein': '13.3 g', 'Ca...","['Non-stick Pan', 'Indian Dinner', 'Indian Lun..."


In [250]:
features["nutrition"]

0       {'Energy': '195 cal', 'Protein': '10.3 g', 'Ca...
1       {'Energy': '74 cal', 'Protein': '2.6 g', 'Carb...
2       {'Energy': '374 cal', 'Protein': '13.3 g', 'Ca...
3       {'Energy': '92 cal', 'Protein': '0.1 g', 'Carb...
4       {'Energy': '68 cal', 'Protein': '2.3 g', 'Carb...
                              ...                        
3614    {'Calories ': '495 cal', 'Kilojoules ': '2070 ...
3615    {'Calories ': '546 cal', 'Kilojoules ': '2280 ...
3616    {'Calories ': '380 cal', 'Kilojoules ': '1590 ...
3617                                                   {}
3618    {'Calories ': '449 cal', 'Kilojoules ': '1880 ...
Name: nutrition, Length: 3619, dtype: object

In [251]:
# apply a function to this in order to convert values from string to numeric and 
# remove the units like 'cal' and 'g'
# nutrients we are gonna keep: 
# ['Calories', 'Protein', 'Carbohydrates', 'Fiber', 'Fat', 'Sodium']

import ast

def nutrition_preprocessing(nutrients_series):
    mapped = {"calories": None, "protein":None, "carbohydrates":None, "fiber":None, "fat":None, "sodium":None}
    to_dict = ast.literal_eval(nutrients_series)
    for key,value in to_dict.items():
        prep_key = key.strip().lower()
        cleaned = re.sub(r'[^\d\.]', '', value)
        if cleaned == "":  # Some values in carbohydrates are like "N/A"
            continue
        if prep_key in ["calories","energy","kcal"]:
            mapped["calories"]= float(cleaned)
        if prep_key == "protein":
            mapped["protein"]= float(cleaned)
        if prep_key in ["carbohydrates","carbs"]:
            mapped["carbohydrates"]= float(cleaned)
        if prep_key in ["fiber","dietary fibre","fibre"]:
            mapped["fiber"]= float(cleaned)    
        if prep_key in ["fat","total fat"]:
            mapped["fat"]= float(cleaned)
        if prep_key in ["sodium","salt"]:
            mapped["sodium"]= float(cleaned)
    return mapped

In [252]:
cleaned = features["nutrition"].apply(nutrition_preprocessing)

In [253]:
exploded_df = pd.json_normalize(cleaned)

In [254]:
exploded_df

Unnamed: 0,calories,protein,carbohydrates,fiber,fat,sodium
0,195.0,10.3,30.5,7.9,4.1,8.8
1,74.0,2.6,9.5,3.2,2.9,21.8
2,374.0,13.3,16.4,4.2,28.4,98.6
3,92.0,0.1,0.4,0.0,10.0,0.0
4,68.0,2.3,3.9,3.4,5.3,114.5
...,...,...,...,...,...,...
3614,495.0,19.0,45.0,14.0,25.0,650.0
3615,546.0,23.0,50.0,12.0,28.0,770.0
3616,380.0,16.0,55.0,11.0,9.0,350.0
3617,,,,,,


In [255]:
# add the exploded column to features 
features_exploded = features.drop("nutrition", axis=1).join(exploded_df)

In [256]:
features_exploded.head(3)

Unnamed: 0,ingredients,tags,calories,protein,carbohydrates,fiber,fat,sodium
0,['2 cups sprouted vaal (field beans/ butter be...,"['Non-stick Pan', 'Boiled Indian recipes', 'Sa...",195.0,10.3,30.5,7.9,4.1,8.8
1,"['2 1/2 cups capsicum cubes', '1/2 cup low-fat...","['Non Stick Kadai Veg', 'Antioxidant Rich Indi...",74.0,2.6,9.5,3.2,2.9,21.8
2,"['1 cup sliced onions', '3 tbsp roughly choppe...","['Non-stick Pan', 'Indian Dinner', 'Indian Lun...",374.0,13.3,16.4,4.2,28.4,98.6


In [257]:
# droping the two na values because of missing carbohydrate found while extracting 
# nutrients
features_exploded.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3619 entries, 0 to 3618
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ingredients    3619 non-null   object 
 1   tags           3619 non-null   object 
 2   calories       3581 non-null   float64
 3   protein        3582 non-null   float64
 4   carbohydrates  3577 non-null   float64
 5   fiber          3581 non-null   float64
 6   fat            3582 non-null   float64
 7   sodium         3566 non-null   float64
dtypes: float64(6), object(2)
memory usage: 226.3+ KB


In [258]:
features_exploded[features_exploded.isnull().any(axis=1)].tail(3)

Unnamed: 0,ingredients,tags,calories,protein,carbohydrates,fiber,fat,sodium
3606,['1 large slice sourdough or wholegrain bread'...,"['Gluten-free option', 'Ready in 20 minutes', ...",,,,,,
3611,"['spray oil', '1 small potato, diced', '1 cup ...","['Gluten-free option', 'Nut free', 'Meals for ...",,,,,,
3617,"['1 cup mesclun or other salad leaves', '1 cup...","['Meals for one', 'No, or minimal, cooking', '...",,,,,,


In [259]:
# We have no option but to drop these rows as nutrient info for them are also missing

In [260]:
features_exploded.dropna(inplace=True)

In [261]:
features_exploded.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3560 entries, 0 to 3618
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ingredients    3560 non-null   object 
 1   tags           3560 non-null   object 
 2   calories       3560 non-null   float64
 3   protein        3560 non-null   float64
 4   carbohydrates  3560 non-null   float64
 5   fiber          3560 non-null   float64
 6   fat            3560 non-null   float64
 7   sodium         3560 non-null   float64
dtypes: float64(6), object(2)
memory usage: 250.3+ KB


In [262]:
nutrients = features_exploded.drop(["ingredients","tags"],axis=1)

In [263]:
nutrients.describe()

Unnamed: 0,calories,protein,carbohydrates,fiber,fat,sodium
count,3560.0,3560.0,3560.0,3560.0,3560.0,3560.0
mean,243.524438,12.660281,26.046096,5.684955,1091.591,210.312846
std,179.831314,13.025845,18.678152,15.025915,64586.52,316.882171
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,94.0,2.6,10.8,1.8,2.4,10.575
50%,194.0,6.7,22.15,4.0,6.8,54.95
75%,390.0,20.825,38.8,8.0,13.8,353.0
max,2167.0,79.0,102.8,680.0,3853612.0,5829.6


In [264]:
nutrients.keys()

Index(['calories', 'protein', 'carbohydrates', 'fiber', 'fat', 'sodium'], dtype='object')

- Scaling 

In [265]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_nutrients = scaler.fit_transform(nutrients)

In [266]:
scaled_nutrients

array([[-0.26987099, -0.18122528,  0.23848881,  0.14743569, -0.0168401 ,
        -0.63601292],
       [-0.9428183 , -0.7724408 , -0.88597738, -0.16540119, -0.01685868,
        -0.59498244],
       [ 0.7256461 ,  0.04911842, -0.51650991, -0.09884015, -0.01646381,
        -0.35258702],
       ...,
       [ 1.68223235,  0.79389641,  1.28263598,  0.42033594, -0.01647   ,
         1.76647925],
       [ 0.75901538,  0.25642776,  1.55036602,  0.3537749 , -0.01676422,
         0.44087926],
       [ 1.14276219,  1.94561494,  0.47944585, -0.04559132, -0.01664034,
         1.00899354]], shape=(3560, 6))

In [267]:
scaled_nutrients_df = pd.DataFrame(scaled_nutrients,columns=nutrients.keys())

In [268]:
scaled_nutrients_df.describe()

Unnamed: 0,calories,protein,carbohydrates,fiber,fat,sodium
count,3560.0,3560.0,3560.0,3560.0,3560.0,3560.0
mean,0.0,-1.27738e-16,1.91607e-16,-6.386901000000001e-17,-2.99386e-18,-3.1934500000000005e-17
std,1.00014,1.00014,1.00014,1.00014,1.00014,1.00014
min,-1.354373,-0.972072,-1.394664,-0.3783965,-0.01690359,-0.6637874
25%,-0.831587,-0.7724408,-0.8163676,-0.2585866,-0.01686643,-0.6304107
50%,-0.275433,-0.4576377,-0.2086204,-0.1121524,-0.01679829,-0.4903547
75%,0.814631,0.6268972,0.6829207,0.1540918,-0.0166899,0.4503478
max,10.697502,5.093646,4.109865,44.88311,59.65735,17.73554


#### 2. Ingredients

In [269]:
sample= features_exploded["ingredients"].sample(10)
# let's start with this sample, then we can apply a cleaning function to the whole column

In [270]:
for s in sample:
    print(s)

['2 cups bajra (black millet) flour', 'salt to taste', 'bajra (black millet) flour for rolling', 'melted ghee for brushing', '', 'garlic chutney', 'jaggery (gur)', 'ghee']
['1 tablespoon olive oil', '1 large onion, finely chopped', '2 cloves garlic, crushed', '2 teaspoon finely grated fresh ginger', '1 small red chilli, deseeded and finely chopped', '2 teaspoon brown mustard seeds', '½ teaspoon ground turmeric', '2 cups vine-ripened tomatoes, diced', '1 1 /2 cups pack diced butternut squash and', '1 cup dried red lentils, rinsed and drained', '2 cups very low salt vegetable stock', '10 cups baby spinach', '4 small reduced-fat naan breads', '4 tablespoons low-fat natural yogurt, to serve']
['2 1/2 cups grated bottle gourd (doodhi / lauki)', '1 tsp ghee', '1 1/2 cups low fat milk , 99.7% fat-free', '1/2 tsp cardamom (elaichi) powder', '2 tsp sugar substitute']
['225g Hokkien noodles', '1 head buttercrunch lettuce leaves, rinsed and dried', '1 medium avocado, thinly sliced', '8 baby cucum

In [271]:
sample.iloc[0]

"['2 cups bajra (black millet) flour', 'salt to taste', 'bajra (black millet) flour for rolling', 'melted ghee for brushing', '', 'garlic chutney', 'jaggery (gur)', 'ghee']"

- Right now are ingredients column contains a list of splitted strings but what we need is a list of tokens
- Problems :
    * removing measures like cups and tbsp
    * removing quantities
    * map english and hindi ingredients (paneer --> cottage cheese)
    * lemon juice, flax seeds should be one token 
    * lemmatization 

1. Tokenization

In [272]:
def custom_tokenizer(text):
    # Remove quantities and units
    text = re.sub(r'\b\d+/?\d*\b', '', text.lower())  # remove quantities
    text = re.sub(r'\b(?:tsp|tbsp|cup|cups|g|gram|grams|kg|ml|l|oz|teaspoon|tablespoon|pinch)\b', '', text, flags=re.IGNORECASE)

    # Tokenize 
    tokens = re.findall(r'\b[a-zA-Z]+\b', text)
    return tokens 

In [273]:
custom_tokenizer(sample.iloc[0])

['bajra',
 'black',
 'millet',
 'flour',
 'salt',
 'to',
 'taste',
 'bajra',
 'black',
 'millet',
 'flour',
 'for',
 'rolling',
 'melted',
 'ghee',
 'for',
 'brushing',
 'garlic',
 'chutney',
 'jaggery',
 'gur',
 'ghee']

In [274]:
for s in sample:
    print(custom_tokenizer(s))

['bajra', 'black', 'millet', 'flour', 'salt', 'to', 'taste', 'bajra', 'black', 'millet', 'flour', 'for', 'rolling', 'melted', 'ghee', 'for', 'brushing', 'garlic', 'chutney', 'jaggery', 'gur', 'ghee']
['olive', 'oil', 'large', 'onion', 'finely', 'chopped', 'cloves', 'garlic', 'crushed', 'finely', 'grated', 'fresh', 'ginger', 'small', 'red', 'chilli', 'deseeded', 'and', 'finely', 'chopped', 'brown', 'mustard', 'seeds', 'ground', 'turmeric', 'vine', 'ripened', 'tomatoes', 'diced', 'pack', 'diced', 'butternut', 'squash', 'and', 'dried', 'red', 'lentils', 'rinsed', 'and', 'drained', 'very', 'low', 'salt', 'vegetable', 'stock', 'baby', 'spinach', 'small', 'reduced', 'fat', 'naan', 'breads', 'tablespoons', 'low', 'fat', 'natural', 'yogurt', 'to', 'serve']
['grated', 'bottle', 'gourd', 'doodhi', 'lauki', 'ghee', 'low', 'fat', 'milk', 'fat', 'free', 'cardamom', 'elaichi', 'powder', 'sugar', 'substitute']
['hokkien', 'noodles', 'head', 'buttercrunch', 'lettuce', 'leaves', 'rinsed', 'and', 'dried

* Observations:
    - Stop words still need to be removed 
    - Have to remove verbs like added, chopped, splash, peeled (nltk pos tag)
    - Adverbs like roughly 
    - there are some ingredients that are mentioned both in hindi and english like: 'flour' and 'gehun ka atta', "panner" and "cottage cheeze" (ignore hindi words by filtering from english dictionary)
    - might need to specify n grams as there are some tokens with more than one word so order also matters like "gehun ka atta", "baby spinach" "cottage cheeze", "chocolate chips" (however this will increase vector size)


In [275]:
# pip install spacy
# python -m spacy download en_core_web_sm


In [276]:
import spacy
# load english model 
nlp = spacy.load('en_core_web_sm')

In [277]:
doc = nlp(sample.iloc[0])

In [278]:
type(doc) #this is a spacy object containing tokens

spacy.tokens.doc.Doc

In [279]:
for token in doc:
    print(f"{token.text:<15} | POS: {token.pos_:<10} | Lemma: {token.lemma_}")

[               | POS: X          | Lemma: [
'               | POS: NUM        | Lemma: '
2               | POS: NUM        | Lemma: 2
cups            | POS: NOUN       | Lemma: cup
bajra           | POS: NOUN       | Lemma: bajra
(               | POS: PUNCT      | Lemma: (
black           | POS: ADJ        | Lemma: black
millet          | POS: NOUN       | Lemma: millet
)               | POS: PUNCT      | Lemma: )
flour           | POS: NOUN       | Lemma: flour
'               | POS: PUNCT      | Lemma: '
,               | POS: PUNCT      | Lemma: ,
'               | POS: PUNCT      | Lemma: '
salt            | POS: NOUN       | Lemma: salt
to              | POS: PART       | Lemma: to
taste           | POS: NOUN       | Lemma: taste
'               | POS: PUNCT      | Lemma: '
,               | POS: PUNCT      | Lemma: ,
'               | POS: PUNCT      | Lemma: '
bajra           | POS: NOUN       | Lemma: bajra
(               | POS: PUNCT      | Lemma: (
black           | POS: A

In [280]:
def custom_tokenizer(text):
    #text cleaning
    text = text.lower()
     # Replace unicode superscript fractions and other weird chars
    text = re.sub(r'[¼½¾⅓⅔⁄¹²³⁴⁵⁶⁷⁸⁹⁰]', ' ', text)

    # Separate numbers stuck to units or words (e.g., "100ml" → "100 ml")
    text = re.sub(r'(\d+)([a-zA-Z]+)', r'\1 \2', text)

    # Remove standalone numbers and units
    text = re.sub(r'\b\d+/?\d*\b', ' ', text)
    text = re.sub(r'\b(?:ml|l|tsp|tbsp|cup|cups|g|kg|oz|gram|grams|pinch|cm|inch|store|style)\b', '', text)

    # Remove hyphens or leading punctuation leftovers
    text = re.sub(r'^[^\w\s]+|[^\w\s]+$', '', text)

    # Remove extra symbols, keep only words
    text = re.sub(r'[^\w\s]', ' ', text)

    # Collapse multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    # spacy to remove stop words and anythings except NOUN and PROPN
    # return lemmatized tokens
    text = nlp(text)
    tokens = [token.lemma_ for token in text 
              if not token.is_stop 
              and not token.is_punct 
              and token.pos_ in {'NOUN','PROPN'}]
    
    return tokens 

In [281]:
for s in sample:
    print(custom_tokenizer(s))

['bajra', 'millet', 'flour', 'salt', 'millet', 'flour', 'ghee', 'garlic', 'chutney', 'jaggery', 'gur', 'ghee']
['tablespoon', 'olive', 'oil', 'onion', 'clove', 'garlic', 'teaspoon', 'ginger', 'chilli', 'teaspoon', 'mustard', 'teaspoon', 'ground', 'turmeric', 'vine', 'tomato', 'pack', 'butternut', 'squash', 'lentil', 'salt', 'vegetable', 'stock', 'baby', 'spinach', 'naan', 'bread', 'tablespoon', 'yogurt']
['bottle', 'gourd', 'doodhi', 'lauki', 'ghee', 'milk', 'cardamom', 'elaichi', 'powder', 'sugar', 'substitute']
['hokkien', 'head', 'buttercrunch', 'lettuce', 'leave', 'medium', 'avocado', 'baby', 'cucumber', 'lengthways', 'cherry', 'tomato', 'radish', 'edamame', 'bean', 'tablespoon', 'sesame', 'seeds', 'honey', 'soy', 'small', 'carrot', 'tablespoon', 'miso', 'tablespoon', 'rice', 'wine', 'vinegar', 'tablespoon', 'extra', 'virgin', 'oil', 'teaspoon', 'honey', 'maple', 'syrup']
['potato', 'round', 'tablespoon', 'lime', 'juice', 'teaspoon', 'salt', 'soy', 'sauce', 'teaspoon', 'honey', 'ch

- apply it to our dataset now

In [282]:
features_exploded["ingredients"].apply(custom_tokenizer)

0       [field, bean, butter, bean, oil, cumin, jeera,...
1       [capsicum, cube, paneer, cottage, cheese, coco...
2       [onion, cashew, nut, kaju, chilli, cottage, ch...
3       [oil, vinegar, mustard, sarson, powder, salt, ...
4                        [flax, seed, lemon, juice, salt]
                              ...                        
3613    [spray, oil, red, cabbage, cut, wedge, bulb, f...
3614    [onion, apple, vinegar, teaspoon, sugar, teasp...
3615    [yoghurt, tablespoon, paste, tablespoon, lime,...
3616    [carrot, desiree, potato, pea, capsicum, sprin...
3618    [spray, oil, tablespoon, olive, oil, boneless,...
Name: ingredients, Length: 3560, dtype: object

VECTORIZING USING tfidfVectorizer
- use min_df for filtering out rare words that don't generalize or add value (e.g., spelling errors, unique names, uncommon variants).
- use max_features to limit feature space to prevent sparsity, overfitting

Edit -- Without these hyperparameters the feature set or vocabulary was coming out to be 1576, it was also considering features with spelling errors, possibly missed during the cleaning

In [283]:
# Convert this into a vector
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer, min_df= 5, max_features= 800)
#Edit: ngrams is causing too much of array size (18652 features), 
#      this is more than the number of rows in our set, for now, lets take it without ngrams
tfidf_matrix = vectorizer.fit_transform(features_exploded["ingredients"])

In [284]:
tfidf_matrix.toarray()[0]

array([0.17614604, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.18363797, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.3100418 , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

In [285]:
# Get feature (token) names
feature_names = vectorizer.get_feature_names_out()

df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

df_tfidf.head()  



Unnamed: 0,adrak,agria,ajmoda,ajwain,akhrot,alfalfa,allspice,almond,alsi,amaranth,...,x,xa0,yam,yeast,yellow,yoghurt,yogurt,yolk,zest,zucchini
0,0.176146,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.173541,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.182461,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [286]:
feature_names

array(['adrak', 'agria', 'ajmoda', 'ajwain', 'akhrot', 'alfalfa',
       'allspice', 'almond', 'alsi', 'amaranth', 'amchur', 'amla', 'anar',
       'anardana', 'anchovy', 'anise', 'anjeer', 'apple', 'approx',
       'apricot', 'arborio', 'arhar', 'aril', 'asafoetida', 'asparagus',
       'atta', 'avocado', 'baby', 'bacon', 'badam', 'badi', 'bag',
       'baguette', 'baingan', 'bajra', 'baking', 'ball', 'banana',
       'barbecue', 'barley', 'basil', 'basis', 'basmati', 'baton',
       'batter', 'bay', 'bean', 'beans', 'beansprout', 'beef', 'beej',
       'beet', 'beetroot', 'bengal', 'berry', 'besan', 'bhaji', 'bhindi',
       'bhopla', 'bird', 'biscuit', 'bite', 'black', 'blackberry',
       'blanched', 'blend', 'blueberry', 'boiling', 'bok', 'bone',
       'boneless', 'bottle', 'bran', 'bread', 'breadcrumb', 'breast',
       'brine', 'brinjal', 'broccoli', 'broccolini', 'brown', 'brushing',
       'brussels', 'buckwheat', 'bulb', 'bun', 'bunch', 'bunche',
       'bunches', 'burger', 

In [287]:
#final features 
# df_tfidf = df_tfidf.reset_index(drop=True)
# scaled_nutrients_df = scaled_nutrients_df.reset_index(drop=True)
X = scaled_nutrients_df.join(df_tfidf, rsuffix="_tfidf")
X.head()

Unnamed: 0,calories,protein,carbohydrates,fiber,fat,sodium,adrak,agria,ajmoda,ajwain,...,x,xa0,yam,yeast,yellow,yoghurt,yogurt,yolk,zest,zucchini
0,-0.269871,-0.181225,0.238489,0.147436,-0.01684,-0.636013,0.176146,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.942818,-0.772441,-0.885977,-0.165401,-0.016859,-0.594982,0.173541,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.725646,0.049118,-0.51651,-0.09884,-0.016464,-0.352587,0.182461,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.84271,-0.964394,-1.373246,-0.378397,-0.016749,-0.663787,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.976188,-0.795475,-1.185835,-0.152089,-0.016822,-0.302404,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now, we have all the features names ready, let's train the model in knn

### Model building: Finding nearest neighbours through KNN

In [288]:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=5, metric='euclidean')
knn.fit(X)

In [289]:
# Let's say we have the input features like :
input_nutrients ={
    'calories': 345.8,
    'protein': 18.2,
    'carbohydrates': 63.7,
    'fiber': 5.9,
    'fat': 14.4,
    'sodium': 738.5
}

# available pantry items: 
input_ingredients = "flax seeds olive oil chopped onions garlic paste adrak boiled potatoes green chili cumin powder turmeric fresh coriander lemon juice"



In [290]:
# Now using this, let's find out the most similar recipies from our data, 
# for that we have to convert the ingredients the same way using the vectorizer
input_vectorized_ingredients = vectorizer.transform([input_ingredients])
input_vectorized_ingredients

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 14 stored elements and shape (1, 800)>

In [291]:
input_ingredients_df = pd.DataFrame(input_vectorized_ingredients.toarray(), columns=feature_names)
input_ingredients_df


Unnamed: 0,adrak,agria,ajmoda,ajwain,akhrot,alfalfa,allspice,almond,alsi,amaranth,...,x,xa0,yam,yeast,yellow,yoghurt,yogurt,yolk,zest,zucchini
0,0.291691,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [292]:
list(input_nutrients.values())

[345.8, 18.2, 63.7, 5.9, 14.4, 738.5]

In [293]:
input_nutrients = np.array(list(input_nutrients.values()),ndmin=2)

In [294]:
input_scaled_nutrients = scaler.transform(input_nutrients)

In [295]:
input_features = pd.DataFrame(input_scaled_nutrients,columns=scaled_nutrients_df.keys())
input_features

Unnamed: 0,calories,protein,carbohydrates,fiber,fat,sodium
0,0.56881,0.425346,2.016216,0.014314,-0.016681,1.667059


In [296]:
input_features = input_features.join(input_ingredients_df, rsuffix="_tfidf")
input_features

Unnamed: 0,calories,protein,carbohydrates,fiber,fat,sodium,adrak,agria,ajmoda,ajwain,...,x,xa0,yam,yeast,yellow,yoghurt,yogurt,yolk,zest,zucchini
0,0.56881,0.425346,2.016216,0.014314,-0.016681,1.667059,0.291691,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [297]:
distances, indices = knn.kneighbors(input_features)

In [298]:
distances #ordered

array([[1.40583734, 1.42650061, 1.47721682, 1.48233354, 1.49234101]])

In [299]:
indices  #indices from our df

array([[2808, 3238, 3002, 2947, 2232]])

In [300]:
# Top 5 similar recipies
df.iloc[indices.flatten()]

Unnamed: 0.1,Unnamed: 0,name,ingredients,instructions,nutrition,time,serving_size,tags
2808,2808,Oregano chicken with warm tomato and olive salsa,"['2 x 250g chicken breast fillets, each cut ho...",['1 Place the chicken on a plate. Sprinkle bot...,"{'Calories ': '503 cal', 'Kilojoules ': '2101 ...",30 mins,4,"['High fibre', 'High protein', '3 vege serves'..."
3238,3238,Asian salmon and super greens oven-steamed bake,"['2 cups small Brussels sprouts, trimmed, thin...","['1 Preheat oven to 180°C. Into a large, deep ...","{'Calories ': '584 cal', 'Kilojoules ': '2440 ...",40 mins,4,"['Dairy free', 'Diabetes-friendly', 'Gluten-fr..."
3002,3002,Miso and corn soup with ramen noodles,"['4 cups reduced-salt vegetable stock', '1 tab...","['1 Combine stock, 2 cups of water and miso pa...","{'Calories ': '266 cal', 'Kilojoules ': '1112 ...",30 mins,4,"['Dairy free', 'Vegetarian', '3 vege serves', ..."
2947,2947,Blueberry muffins made healthier,"['2 medium eggs', '250ml jar apple sauce', '5 ...",['1 Preheat the oven to 190°C/fan 170°C/gas 5 ...,"{'Calories ': '198 cal', 'Kilojoules ': '829 k...",35 mins,12,"['High fibre', 'Family favourites', 'Makeovers']"
2232,2232,Roasted garlic and pumpkin soup,"['¼ cup rolled', '¼ cup cubed grainy bread (st...","['1 To make the savoury granola, preheat the o...","{'Calories ': '417 cal', 'Kilojoules ': '1742 ...",45 mins,4,"['Diabetes-friendly', 'High fibre', 'Low sodiu..."


In [301]:
##### Create a pipeline
def top_5_recipies(cal,protein,carbs,fiber,fat,sodium, ingredients):
    input_nutrients ={
    'calories': cal,
    'protein': protein,
    'carbohydrates': carbs,
    'fiber': fiber,
    'fat': fat,
    'sodium': sodium
}
    input_vectorized_ingredients = vectorizer.transform([ingredients])
    input_ingredients_df = pd.DataFrame(input_vectorized_ingredients.toarray(), columns=feature_names)
    input_nutrients = np.array(list(input_nutrients.values()),ndmin=2)
    input_scaled_nutrients = scaler.transform(input_nutrients)
    input_nutrients_df = pd.DataFrame(input_scaled_nutrients,columns=scaled_nutrients_df.keys())
    input_features = input_nutrients_df.join(input_ingredients_df, rsuffix="_tfidf")
    distances, indices = knn.kneighbors(input_features)
    return df.iloc[indices.flatten()]


In [346]:
result = top_5_recipies(290,10,38,4.5,8,120,"rice coconut dal vegetable")

In [347]:
result['name'].apply(print)

green moong dal khichdi recipe
Slow-cooked pulled beef
cabbage pulao recipe
Mushrooms stuffed with kale, ricotta and seeds
bajra khichdi for acidity recipe


1687    None
3014    None
1513    None
2152    None
1468    None
Name: name, dtype: object

In [348]:
result['ingredients'].apply(print)


['1/2 cup rice (chawal)', '1/2 cup green moong dal (split green gram)', '1 tbsp oil', '1 tsp mustard seeds ( rai / sarson)', '1 tsp urad dal (split black lentils)', '1/4 tsp asafoetida (hing)', '2 tsp finely chopped garlic (lehsun)', '1/2 cup finely chopped onion', '1 tsp chilli powder', '1/4 tsp turmeric powder (haldi)', '1 tsp coriander-cumin seeds (dhania-jeera) powder', 'salt to taste', '1 tbsp finely chopped coriander (dhania)', '3 cups cooked long grained rice']
['1kg beef brisket, fat trimmed', '1 cup passata', '1 tablespoon', '2 tablespoon Worcestershire sauce', '2 teaspoon paprika', '2 cloves garlic, crushed', '6 wholemeal tortillas, warmed', '200g salad leaves', '4 tomatoes, diced', '170g reduced-fat Greek yogurt']
['1 1/2 cups shredded cabbage', '1 1/2 cups soaked and cooked brown rice', '2 tsp coconut oil or', '1/2 tsp mustard seeds ( rai / sarson)', '1/2 tsp urad dal (split black lentils)', '4 to 5 curry leaves (kadi patta)', '2 tbsp raw peanuts', 'a pinch of asafoetida (h

1687    None
3014    None
1513    None
2152    None
1468    None
Name: ingredients, dtype: object

In [344]:
result['nutrition'].apply(print)

{'Energy': '261 cal', 'Protein': '9.7 g', 'Carbohydrates': '32.5 g', 'Fiber': '2.2 g', 'Fat': '10.3 g', 'Cholesterol': '0 mg', 'Sodium': '7.2 mg'}
{'Energy': '266 cal', 'Protein': '9.8 g', 'Carbohydrates': '41 g', 'Fiber': '4.2 g', 'Fat': '6.9 g', 'Cholesterol': '1.7 mg', 'Sodium': '72 mg'}
{'Energy': '356 cal', 'Protein': '11.6 g', 'Carbohydrates': '41.1 g', 'Fiber': '10.7 g', 'Fat': '17.7 g', 'Cholesterol': '0 mg', 'Sodium': '44.4 mg'}
{'Calories ': '151 cal', 'Kilojoules ': '632 kJ', 'Protein ': '0.7 g', 'Total fat ': '16.2 g', 'Saturated fat ': '2.5 g', 'Carbohydrates ': '0.7 g', 'Sugar ': '0.3 g', 'Dietary fibre ': '0.9 g', 'Sodium ': '9 mg', 'Calcium ': '30 mg', 'Iron ': '1.1 mg'}
{'Energy': '249 cal', 'Protein': '11.8 g', 'Carbohydrates': '38.8 g', 'Fiber': '6.4 g', 'Fat': '5.2 g', 'Cholesterol': '0 mg', 'Sodium': '25.4 mg'}


359     None
420     None
1630    None
2699    None
1847    None
Name: nutrition, dtype: object

- the model seems to work decently, however if we have more recipies, it will give better result

In [350]:
# another try 
result = top_5_recipies(290,10,38,4.5,8,120,"yogurt lettuce cheeze spinach")

In [351]:
result['name'].apply(print)

Paneer and Suva Sandwich
Nutritious Burger
Quinoa Paneer Carrot Peppers Salad, for Lunch Or Dinner
Chimichurri
palak bajra khichdi recipe


359     None
420     None
1630    None
2699    None
1847    None
Name: name, dtype: object

In [352]:
result['ingredients'].apply(print)

['8 multigrain bread', '', '4 curly red lettuce or cos lettuce , torn into pieces', 'melted butter for brushing', '1 cup crumbled paneer (cottage cheese)', '1/4 cup finely chopped dill leaves', '2 tsp finely chopped green chillies', 'salt to taste']
['1/2 cup soy granules', '3/4 cup grated carrot', '1/2 cup crumbled low fat paneer (cottage cheese)', '1/2 cup finely chopped onion', '2 tbsp whole wheat flour (gehun ka atta)', 'hummus', '1/2 tsp garlic (lehsun) paste', '1 tsp green chilli paste', '1/4 cup finely chopped mint leaves (phudina)', 'salt and', '1 tsp oil for greasing and cooking', '3/4 cup low fat curds (dahi)', '1/4 cup chopped spring onion greens', '1/2 cup finely chopped capsicum ( chopped red capsicum ,', '1/2 tsp garlic (lehsun) paste', '1 tsp dry red chilli flakes (paprika)', 'salt to taste', '6 burger buns', '1 1/2 tsp butter for cooking', '6 lettuce', '18 sliced cucumber', '18 tomato slices', '6 onion slice']
['1/2 cup coloured capsicum cubes', '1/4 cup paneer (cottage

359     None
420     None
1630    None
2699    None
1847    None
Name: ingredients, dtype: object

In [355]:
result['nutrition'].apply(print)

{'Energy': '261 cal', 'Protein': '9.7 g', 'Carbohydrates': '32.5 g', 'Fiber': '2.2 g', 'Fat': '10.3 g', 'Cholesterol': '0 mg', 'Sodium': '7.2 mg'}
{'Energy': '266 cal', 'Protein': '9.8 g', 'Carbohydrates': '41 g', 'Fiber': '4.2 g', 'Fat': '6.9 g', 'Cholesterol': '1.7 mg', 'Sodium': '72 mg'}
{'Energy': '356 cal', 'Protein': '11.6 g', 'Carbohydrates': '41.1 g', 'Fiber': '10.7 g', 'Fat': '17.7 g', 'Cholesterol': '0 mg', 'Sodium': '44.4 mg'}
{'Calories ': '151 cal', 'Kilojoules ': '632 kJ', 'Protein ': '0.7 g', 'Total fat ': '16.2 g', 'Saturated fat ': '2.5 g', 'Carbohydrates ': '0.7 g', 'Sugar ': '0.3 g', 'Dietary fibre ': '0.9 g', 'Sodium ': '9 mg', 'Calcium ': '30 mg', 'Iron ': '1.1 mg'}
{'Energy': '249 cal', 'Protein': '11.8 g', 'Carbohydrates': '38.8 g', 'Fiber': '6.4 g', 'Fat': '5.2 g', 'Cholesterol': '0 mg', 'Sodium': '25.4 mg'}


359     None
420     None
1630    None
2699    None
1847    None
Name: nutrition, dtype: object