### NLP

- Features we will use from the collected data to build the KNN model:
    1. Nutrition
    2. Tags
    3. Ingredients
- To process the tags and ingredients, we have to convert them into numerical representations.
- For tags, embeddings are the better fit: tags are short categorical labels that repeat across recipes.
- For ingredients, we will use TF-IDF vectorization.
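
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch on three toy recipes. It uses a smoothed idf followed by L2 normalisation, which is, as far as we know, what scikit-learn's `TfidfVectorizer` does by default. The `tfidf` helper and the toy `recipes` are illustrative assumptions, not part of this notebook's pipeline.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term count times smoothed idf: ln((1 + n) / (1 + df)) + 1.
        weights = {
            tok: tf * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, tf in counts.items()
        }
        # L2-normalise so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({tok: w / norm for tok, w in weights.items()})
    return vectors

# Toy, already-tokenized recipes (hypothetical data).
recipes = [
    ["flax", "seeds", "lemon", "juice", "salt"],
    ["olive", "oil", "vinegar", "mustard", "salt"],
    ["onion", "tomato", "salt", "oil"],
]
vectors = tfidf(recipes)
print(vectors[0])
```

A token such as `salt`, which appears in every recipe, ends up with a lower weight than a rare token such as `flax`. That is the property that makes TF-IDF useful for telling recipes apart by their ingredients.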

#### Pre-requisites

In [1]:
import warnings
warnings.filterwarnings(action="ignore")

#### Import dependencies

In [2]:
import pandas as pd
import numpy as np
import re
# import nltk
# nltk.download('stopwords')

#### Get the data

In [3]:
df = pd.read_csv("../Data Cleaning/all_combined_clean_data.csv")

In [4]:
df["nutrition"]

0       {'Energy': '195 cal', 'Protein': '10.3 g', 'Ca...
1       {'Energy': '74 cal', 'Protein': '2.6 g', 'Carb...
2       {'Energy': '374 cal', 'Protein': '13.3 g', 'Ca...
3       {'Energy': '92 cal', 'Protein': '0.1 g', 'Carb...
4       {'Energy': '68 cal', 'Protein': '2.3 g', 'Carb...
                              ...                        
3614    {'Calories ': '495 cal', 'Kilojoules ': '2070 ...
3615    {'Calories ': '546 cal', 'Kilojoules ': '2280 ...
3616    {'Calories ': '380 cal', 'Kilojoules ': '1590 ...
3617                                                   {}
3618    {'Calories ': '449 cal', 'Kilojoules ': '1880 ...
Name: nutrition, Length: 3619, dtype: object

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3619 entries, 0 to 3618
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    3619 non-null   int64 
 1   name          3619 non-null   object
 2   ingredients   3619 non-null   object
 3   instructions  3619 non-null   object
 4   nutrition     3619 non-null   object
 5   time          3607 non-null   object
 6   serving_size  3594 non-null   object
 7   tags          3619 non-null   object
dtypes: int64(1), object(7)
memory usage: 226.3+ KB


### Extracting Features

#### I. Nutrition

- First, we need the nutrition values as separate columns for each recipe, so let's explode the dictionaries.

In [None]:
eval(df["nutrition"][0])  #7 different values

{'Energy': '195 cal',
 'Protein': '10.3 g',
 'Carbohydrates': '30.5 g',
 'Fiber': '7.9 g',
 'Fat': '4.1 g',
 'Cholesterol': '0 mg',
 'Sodium': '8.8 mg'}

In [None]:
eval(df["nutrition"].iloc[-1]) #11 different values

{'Calories ': '449 cal',
 'Kilojoules ': '1880 kJ',
 'Protein ': '38 g',
 'Total fat ': '17 g',
 'Saturated fat ': '6 g',
 'Carbohydrates ': '35 g',
 'Sugar ': '8 g',
 'Dietary fibre ': '5 g',
 'Sodium ': '530 mg',
 'Calcium ': '130 mg',
 'Iron ': '3.5 mg'}

In [11]:
eval(df["nutrition"][1960]) #8 different values

{'kcal': '563 g',
 'fat': '26 g',
 'saturates': '3.8 g',
 'carbs': '65 g',
 'sugars': '9 g',
 'fibre': '8 g',
 'protein': '17 g',
 'salt': '0.6 g'}

- Since our data is collected from three different sources, the nutrition fields vary in name and number. Let us keep the most basic nutrition values and drop the rest for now.
- Later, we could also impute the additional nutrients, like calcium and iron, using a food nutrition dataset.

In [12]:
#feature set
features = df[["ingredients","nutrition","tags"]]

In [13]:
features.head(3)

Unnamed: 0,ingredients,nutrition,tags
0,['2 cups sprouted vaal (field beans/ butter be...,"{'Energy': '195 cal', 'Protein': '10.3 g', 'Ca...","['Non-stick Pan', 'Boiled Indian recipes', 'Sa..."
1,"['2 1/2 cups capsicum cubes', '1/2 cup low-fat...","{'Energy': '74 cal', 'Protein': '2.6 g', 'Carb...","['Non Stick Kadai Veg', 'Antioxidant Rich Indi..."
2,"['1 cup sliced onions', '3 tbsp roughly choppe...","{'Energy': '374 cal', 'Protein': '13.3 g', 'Ca...","['Non-stick Pan', 'Indian Dinner', 'Indian Lun..."


In [27]:
features["nutrition"]

0       {'Energy': '195 cal', 'Protein': '10.3 g', 'Ca...
1       {'Energy': '74 cal', 'Protein': '2.6 g', 'Carb...
2       {'Energy': '374 cal', 'Protein': '13.3 g', 'Ca...
3       {'Energy': '92 cal', 'Protein': '0.1 g', 'Carb...
4       {'Energy': '68 cal', 'Protein': '2.3 g', 'Carb...
                              ...                        
3614    {'Calories ': '495 cal', 'Kilojoules ': '2070 ...
3615    {'Calories ': '546 cal', 'Kilojoules ': '2280 ...
3616    {'Calories ': '380 cal', 'Kilojoules ': '1590 ...
3617                                                   {}
3618    {'Calories ': '449 cal', 'Kilojoules ': '1880 ...
Name: nutrition, Length: 3619, dtype: object

In [28]:
# Apply a function to each nutrition string that converts the values from
# strings to numbers and strips units like 'cal' and 'g'.
# Nutrients we are going to keep:
# ['Calories', 'Protein', 'Carbohydrates', 'Fiber', 'Fat', 'Sodium']

import ast

def nutrition_preprocessing(nutrition_str):
    # Target schema; a nutrient stays None when the source does not report it
    mapped = {"calories": None, "protein": None, "carbohydrates": None, "fiber": None, "fat": None, "sodium": None}
    to_dict = ast.literal_eval(nutrition_str)  # the CSV stores each dict as a string
    for key, value in to_dict.items():
        prep_key = key.strip().lower()  # some keys carry trailing spaces, e.g. 'Calories '
        cleaned = re.sub(r'[^\d\.]', '', value)  # keep digits and the decimal point only
        if cleaned == "":  # Some values in carbohydrates are like "N/A"
            continue
        if prep_key in ["calories", "energy", "kcal"]:
            mapped["calories"] = float(cleaned)
        elif prep_key == "protein":
            mapped["protein"] = float(cleaned)
        elif prep_key in ["carbohydrates", "carbs"]:
            mapped["carbohydrates"] = float(cleaned)
        elif prep_key in ["fiber", "dietary fibre", "fibre"]:
            mapped["fiber"] = float(cleaned)
        elif prep_key in ["fat", "total fat"]:
            mapped["fat"] = float(cleaned)
        elif prep_key in ["sodium", "salt"]:
            # NOTE: in the samples above 'salt' is given in g and 'sodium' in mg,
            # so the two are not on the same scale
            mapped["sodium"] = float(cleaned)
    return mapped
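
One possible variant of the mapping above, sketched under assumptions: the key aliases live in a lookup table, and `salt` (reported in g in the third source) is converted to an approximate sodium value in mg. The names `ALIASES`, `TARGETS`, `SALT_G_TO_SODIUM_MG` and `harmonise_nutrition` are hypothetical, and the conversion factor assumes sodium makes up about 39.3% of table salt by mass.

```python
import ast
import re

# Hypothetical alias table: source key (lower-cased, stripped) -> target column.
ALIASES = {
    "calories": "calories", "energy": "calories", "kcal": "calories",
    "protein": "protein",
    "carbohydrates": "carbohydrates", "carbs": "carbohydrates",
    "fiber": "fiber", "fibre": "fiber", "dietary fibre": "fiber",
    "fat": "fat", "total fat": "fat",
    "sodium": "sodium", "salt": "sodium",
}
TARGETS = ("calories", "protein", "carbohydrates", "fiber", "fat", "sodium")
SALT_G_TO_SODIUM_MG = 393.0  # assumption: sodium is ~39.3% of salt by mass

def harmonise_nutrition(nutrition_str):
    """Map one stringified nutrition dict onto the six target columns."""
    mapped = dict.fromkeys(TARGETS)  # every target starts as None
    for key, value in ast.literal_eval(nutrition_str).items():
        key = key.strip().lower()
        target = ALIASES.get(key)
        match = re.search(r"\d+(?:\.\d+)?", value)  # first number in the value
        if target is None or match is None:  # unknown key or e.g. "N/A"
            continue
        number = float(match.group())
        if key == "salt":
            number *= SALT_G_TO_SODIUM_MG  # g of salt -> approx. mg of sodium
        mapped[target] = number
    return mapped

row = harmonise_nutrition("{'kcal': '563 g', 'fat': '26 g', 'salt': '0.6 g'}")
print(row)
```

Note that `re.search` takes the first number in the value, which is not identical to stripping every non-digit character, so values such as `1,200 mg` would need extra handling in either version.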

In [29]:
cleaned = features["nutrition"].apply(nutrition_preprocessing)

In [30]:
exploded_df = pd.json_normalize(cleaned)

In [31]:
exploded_df

Unnamed: 0,calories,protein,carbohydrates,fiber,fat,sodium
0,195.0,10.3,30.5,7.9,4.1,8.8
1,74.0,2.6,9.5,3.2,2.9,21.8
2,374.0,13.3,16.4,4.2,28.4,98.6
3,92.0,0.1,0.4,0.0,10.0,0.0
4,68.0,2.3,3.9,3.4,5.3,114.5
...,...,...,...,...,...,...
3614,495.0,19.0,45.0,14.0,25.0,650.0
3615,546.0,23.0,50.0,12.0,28.0,770.0
3616,380.0,16.0,55.0,11.0,9.0,350.0
3617,,,,,,


In [32]:
# add the exploded column to features 
features_exploded = features.drop("nutrition", axis=1).join(exploded_df)

In [33]:
features_exploded.head(3)

Unnamed: 0,ingredients,tags,calories,protein,carbohydrates,fiber,fat,sodium
0,['2 cups sprouted vaal (field beans/ butter be...,"['Non-stick Pan', 'Boiled Indian recipes', 'Sa...",195.0,10.3,30.5,7.9,4.1,8.8
1,"['2 1/2 cups capsicum cubes', '1/2 cup low-fat...","['Non Stick Kadai Veg', 'Antioxidant Rich Indi...",74.0,2.6,9.5,3.2,2.9,21.8
2,"['1 cup sliced onions', '3 tbsp roughly choppe...","['Non-stick Pan', 'Indian Dinner', 'Indian Lun...",374.0,13.3,16.4,4.2,28.4,98.6


In [34]:
# Rows with missing nutrients (e.g. carbohydrates recorded as "N/A") show up as NaN;
# check how many there are before dropping them
features_exploded.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3619 entries, 0 to 3618
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ingredients    3619 non-null   object 
 1   tags           3619 non-null   object 
 2   calories       3581 non-null   float64
 3   protein        3582 non-null   float64
 4   carbohydrates  3577 non-null   float64
 5   fiber          3581 non-null   float64
 6   fat            3582 non-null   float64
 7   sodium         3566 non-null   float64
dtypes: float64(6), object(2)
memory usage: 226.3+ KB


In [35]:
features_exploded[features_exploded.isnull().any(axis=1)].tail(3)

Unnamed: 0,ingredients,tags,calories,protein,carbohydrates,fiber,fat,sodium
3606,['1 large slice sourdough or wholegrain bread'...,"['Gluten-free option', 'Ready in 20 minutes', ...",,,,,,
3611,"['spray oil', '1 small potato, diced', '1 cup ...","['Gluten-free option', 'Nut free', 'Meals for ...",,,,,,
3617,"['1 cup mesclun or other salad leaves', '1 cup...","['Meals for one', 'No, or minimal, cooking', '...",,,,,,


In [36]:
# We have no option but to drop these rows, as their nutrient information is missing

In [37]:
features_exploded.dropna(inplace=True)

In [38]:
features_exploded.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3560 entries, 0 to 3618
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ingredients    3560 non-null   object 
 1   tags           3560 non-null   object 
 2   calories       3560 non-null   float64
 3   protein        3560 non-null   float64
 4   carbohydrates  3560 non-null   float64
 5   fiber          3560 non-null   float64
 6   fat            3560 non-null   float64
 7   sodium         3560 non-null   float64
dtypes: float64(6), object(2)
memory usage: 250.3+ KB


#### II. Ingredients

In [47]:
sample= features_exploded["ingredients"].sample(10)
# let's start with this sample, then we can apply a cleaning function to the whole column

In [48]:
for s in sample:
    print(s)

['1 kg thick curds (dahi)', '1/2 cup powdered sugar', 'a few saffron (kesar) strands', '1 tbsp warm milk', '1/2 tsp cardamom (elaichi) powder', '1 tbsp pistachio slivers', '1 tbsp almond (badam) slivers']
['1/2 ltr low fat milk', '2 tsp cornflour dissolved in 2 tablespoons', '3 tbsp sugar', '3 tbsp cocoa powder', 'freshly made low fat paneer (cottage cheese)', '1/2 ltr low fat milk', '2 tbsp low fat milk', '1/2 tbsp lemon juice']
['½ cup sliced button mushrooms', '1 sprig fresh thyme,', 'leaves picked', '100g roasted pumpkin (see HFG tip)', '25g reduced-fat feta, crumbled', '2 thick slices rye bread', '1½ tablespoons store-bought chunky pesto dip', '20g baby spinach leaves']
['2 red onions, sliced into rings', '2 medium fennel bulbs, prepared as above', '2 tablespoon olive oil', '2 cups bulgar wheat', 'zest and juice of 2 oranges, juice made up to 2½ cups with water', 'handful (about 1/4 cup) each of green and black olives', '½ cup toasted almonds, chopped', 'flatleaf parsley, roughly 

- Right now, each entry in the ingredients column is a list of raw ingredient strings, but what we need is a list of tokens.
- Problems:
    * removing measures like cups and tbsp
    * removing quantities
    * mapping English and Hindi ingredient names to each other (paneer --> cottage cheese)
    * keeping multi-word ingredients such as lemon juice or flax seeds as one token
    * lemmatization
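
Two of the problems listed above, multi-word ingredients and Hindi/English synonyms, could be approached with small lookup tables. The sketch below is only an illustration: `SYNONYMS`, `PHRASES` and `normalise_ingredient` are hypothetical names, the tables are tiny stand-ins, and units and quantities are deliberately left untouched here.

```python
import re

# Hypothetical lookup tables; a real project would need far larger ones.
SYNONYMS = {"paneer": "cottage_cheese", "dahi": "curds", "jeera": "cumin_seeds"}
PHRASES = ["cottage cheese", "lemon juice", "flax seeds", "cumin seeds"]

def normalise_ingredient(text):
    text = text.lower()
    # Join known multi-word ingredients with an underscore so that they
    # survive tokenization as a single token.
    for phrase in PHRASES:
        text = re.sub(rf"\b{re.escape(phrase)}\b", phrase.replace(" ", "_"), text)
    tokens = re.findall(r"[a-z_]+", text)
    # Map Hindi names onto their English counterparts.
    return [SYNONYMS.get(tok, tok) for tok in tokens]

print(normalise_ingredient("1/2 tbsp lemon juice"))
print(normalise_ingredient("freshly made low fat paneer (cottage cheese)"))
```

With this approach, `paneer` and `cottage cheese` collapse into the same token, so recipes written with either name share a TF-IDF column.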

1. Tokenization

In [50]:
def custom_tokenizer(list_of_strings):
    # Remove quantities and units
    text = " ".join(list_of_strings).lower()
    text = re.sub(r'\b\d+/?\d*\b', '', text)  # remove quantities
    text = re.sub(r'\b(?:tsp|tbsp|cup|cups|g|gram|grams|kg|ml|l|oz|teaspoon|tablespoon|pinch)\b', '', text, flags=re.IGNORECASE)

    # Tokenize 
    tokens = re.findall(r'\b[a-zA-Z]+\b', text)
    return tokens 

In [52]:
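# `sample` keeps the original row labels, so `sample[0]` raises KeyError: 0
# unless label 0 happens to be in the sample; index by position instead
custom_tokenizer(sample.iloc[0])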

In [51]:
for s in sample:
    print(custom_tokenizer(s))

['k', 't', 'h', 'i', 'c', 'k', 'c', 'u', 'r', 'd', 's', 'd', 'a', 'h', 'i', 'c', 'u', 'p', 'p', 'o', 'w', 'd', 'e', 'r', 'e', 'd', 's', 'u', 'a', 'r', 'a', 'f', 'e', 'w', 's', 'a', 'f', 'f', 'r', 'o', 'n', 'k', 'e', 's', 'a', 'r', 's', 't', 'r', 'a', 'n', 'd', 's', 't', 'b', 's', 'p', 'w', 'a', 'r', 'm', 'm', 'i', 'k', 't', 's', 'p', 'c', 'a', 'r', 'd', 'a', 'm', 'o', 'm', 'e', 'a', 'i', 'c', 'h', 'i', 'p', 'o', 'w', 'd', 'e', 'r', 't', 'b', 's', 'p', 'p', 'i', 's', 't', 'a', 'c', 'h', 'i', 'o', 's', 'i', 'v', 'e', 'r', 's', 't', 'b', 's', 'p', 'a', 'm', 'o', 'n', 'd', 'b', 'a', 'd', 'a', 'm', 's', 'i', 'v', 'e', 'r', 's']
['t', 'r', 'o', 'w', 'f', 'a', 't', 'm', 'i', 'k', 't', 's', 'p', 'c', 'o', 'r', 'n', 'f', 'o', 'u', 'r', 'd', 'i', 's', 's', 'o', 'v', 'e', 'd', 'i', 'n', 't', 'a', 'b', 'e', 's', 'p', 'o', 'o', 'n', 's', 't', 'b', 's', 'p', 's', 'u', 'a', 'r', 't', 'b', 's', 'p', 'c', 'o', 'c', 'o', 'a', 'p', 'o', 'w', 'd', 'e', 'r', 'f', 'r', 'e', 's', 'h', 'y', 'm', 'a', 'd', 'e'
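
The single-character output above has a simple cause: after the CSV round trip, each cell holds the list as one string, so `" ".join(...)` iterates over characters instead of list items. The standalone letters `g` and `l` are then removed by the unit pattern, which is why `sugar` shows up as `s, u, a, r`. A minimal sketch of one way around this, assuming the cells are valid Python list literals, is to parse them first; `tokenize_ingredients` is a hypothetical name.

```python
import ast
import re

UNITS = r"\b(?:tsp|tbsp|cup|cups|g|gram|grams|kg|ml|l|oz|teaspoon|tablespoon|pinch)\b"

def tokenize_ingredients(cell):
    # Parse the stringified list back into a real list of ingredient strings.
    items = ast.literal_eval(cell) if isinstance(cell, str) else cell
    text = " ".join(items).lower()
    text = re.sub(r"\b\d+/?\d*\b", "", text)  # remove quantities
    text = re.sub(UNITS, "", text)            # remove units
    return re.findall(r"\b[a-zA-Z]+\b", text)

cell = "['1 kg thick curds (dahi)', '1/2 cup powdered sugar', '1 tbsp warm milk']"
tokens = tokenize_ingredients(cell)
print(tokens)

# Joining the raw string instead reproduces the single-character tokens.
char_tokens = re.findall(r"\b[a-zA-Z]+\b", " ".join(cell))
print(char_tokens[:5])
```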

# ignore

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import string

# Cooking-related descriptors that say nothing about the ingredient itself
irrelevant_words = {
    'finely', 'chopped', 'grated', 'sliced', 'diced', 'crushed', 'peeled', 'boiled',
    'roasted', 'minced', 'fresh', 'ground', 'optional', 'whole', 'dry', 'soaked', 'taste', 'cubes'
}

# English stop words (requires a one-time nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))

def custom_tokenizer(text):
    # `text` is the raw ingredients cell, i.e. the whole list stored as one string
    # Remove quantities and units
    text = re.sub(r'\b\d+/?\d*\b', '', text)  # remove quantities
    text = re.sub(r'\b(?:tsp|tbsp|cup|cups|g|gram|grams|kg|ml|l|oz|teaspoon|tablespoon|pinch)\b', '', text, flags=re.IGNORECASE)

    # Tokenize: keep alphabetic words only
    tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())

    # Drop stop words and preparation descriptors
    tokens = [
        token for token in tokens
        if token not in stop_words
        and token not in irrelevant_words
        and token not in string.punctuation
    ]

    return tokens


In [None]:
custom_tokenizer(converted_string)

['sprouted',
 'vaal',
 'field',
 'beans',
 'butter',
 'beans',
 'oil',
 'cumin',
 'seeds',
 'jeera',
 'asafoetida',
 'hing',
 'curry',
 'leaves',
 'kadi',
 'patta',
 'onion',
 'ginger',
 'garlic',
 'adrak',
 'lehsun',
 'paste',
 'tomato',
 'turmeric',
 'powder',
 'haldi',
 'malvani',
 'masala',
 'kokum',
 'jaggery',
 'gur',
 'salt',
 'coriander',
 'dhania']

In [None]:
features_exploded["ingredients"].apply(custom_tokenizer)

0       [sprouted, vaal, field, beans, butter, beans, ...
1       [capsicum, cubes, low, fat, paneer, cottage, c...
2       [onions, roughly, cashew, nut, kaju, roughly, ...
3       [olive, oil, vinegar, mustard, rai, sarson, po...
4                       [flax, seeds, lemon, juice, salt]
                              ...                        
3589    [spray, oil, red, cabbage, cut, wedges, bulbs,...
3590    [large, red, onion, thinly, apple, vinegar, su...
3591    [low, fat, plain, yoghurt, tablespoons, tahini...
3592    [carrot, large, desiree, potato, washed, peas,...
3594    [spray, oil, olive, oil, boneless, skinless, c...
Name: ingredients, Length: 3546, dtype: object

In [None]:
# Convert this into a vector
vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer)
tfidf_matrix = vectorizer.fit_transform(features_exploded["ingredients"])

In [None]:
tfidf_matrix.toarray()[0]


In [None]:
# Get feature (token) names
feature_names = vectorizer.get_feature_names_out()

df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

df_tfidf.head()  



Unnamed: 0,acai,according,acid,action,activia,added,adobo,adrak,advised,aeroplane,...,yolk,yolks,young,za,zest,zested,zoodles,zucchini,zucchinis,zuchinni
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.146202,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.143245,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.145946,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
#final features 
df_tfidf = df_tfidf.reset_index(drop=True)
features_exploded = features_exploded.reset_index(drop=True)
X = features_exploded.drop(["ingredients","tags"], axis=1).join(df_tfidf, rsuffix="_tfidf")
X.head()

Unnamed: 0,calories,protein,carbohydrates,fiber,fat,sodium,acai,according,acid,action,...,yolk,yolks,young,za,zest,zested,zoodles,zucchini,zucchinis,zuchinni
0,195.0,10.3,30.5,7.9,4.1,8.8,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,74.0,2.6,9.5,3.2,2.9,21.8,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,374.0,13.3,16.4,4.2,28.4,98.6,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,92.0,0.1,0.4,0.0,10.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,68.0,2.3,3.9,3.4,5.3,114.5,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now that we have all the features ready, let's fit the KNN model.

### Model building: Finding nearest neighbours through KNN

In [None]:
X.isna().sum()

calories         0
protein          0
carbohydrates    0
fiber            0
fat              0
                ..
zested           0
zoodles          0
zucchini         0
zucchinis        0
zuchinni         0
Length: 2187, dtype: int64

In [None]:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=5, metric='euclidean')
knn.fit(X)
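
One thing to keep in mind with Euclidean distance on these features: the nutrition columns take values in the tens to hundreds, while L2-normalised TF-IDF rows can differ by at most √2, so nutrition is likely to dominate the neighbour search. The toy sketch below illustrates the effect with made-up rows and hypothetical per-column spreads in `scale`. In practice something like scikit-learn's `StandardScaler`, fitted on the nutrition columns and applied to the query as well, would be one option.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy rows: [calories, sodium, tfidf_onion, tfidf_garlic] (hypothetical values)
query    = [350.0, 700.0, 0.7, 0.7]
recipe_a = [360.0, 720.0, 0.0, 0.0]  # close nutrition, no shared ingredients
recipe_b = [400.0, 740.0, 0.7, 0.7]  # same ingredients, nutrition a bit further away

# Hypothetical per-column spreads; the TF-IDF columns are left as they are.
scale = [150.0, 300.0, 1.0, 1.0]

def rescale(row):
    return [value / s for value, s in zip(row, scale)]

raw_a, raw_b = euclidean(query, recipe_a), euclidean(query, recipe_b)
scaled_a = euclidean(rescale(query), rescale(recipe_a))
scaled_b = euclidean(rescale(query), rescale(recipe_b))
print(f"raw:    a={raw_a:.2f}  b={raw_b:.2f}")
print(f"scaled: a={scaled_a:.2f}  b={scaled_b:.2f}")
```

On the raw values, `recipe_a` is the nearer neighbour even though it shares no ingredients with the query. After rescaling, `recipe_b` becomes nearer because the ingredient overlap is no longer drowned out by the nutrition columns.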

In [None]:
# Let's say we have input features like these:
input_nutrients ={
    'calories': 345.8,
    'protein': 18.2,
    'carbohydrates': 63.7,
    'fiber': 5.9,
    'fat': 14.4,
    'sodium': 738.5
}

# available pantry items: 
input_ingredients = "flax seeds olive oil chopped onions garlic paste boiled potatoes green chili cumin powder turmeric fresh coriander lemon juice"



In [None]:
# Now, using this, let's find the most similar recipes in our data.
# For that, we have to convert the ingredients the same way, using the fitted vectorizer
input_vectorized_ingredients = vectorizer.transform([input_ingredients])
input_vectorized_ingredients

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 16 stored elements and shape (1, 2181)>

In [None]:
input_ingredients_df = pd.DataFrame(input_vectorized_ingredients.toarray(), columns=feature_names)


In [None]:
input_features = pd.DataFrame(input_nutrients, index=[0])
input_features

Unnamed: 0,calories,protein,carbohydrates,fiber,fat,sodium
0,345.8,18.2,63.7,5.9,14.4,738.5


In [None]:
input_features = pd.DataFrame(input_nutrients, index=[0]).join(input_ingredients_df, rsuffix="_tfidf")
input_features

Unnamed: 0,calories,protein,carbohydrates,fiber,fat,sodium,acai,according,acid,action,...,yolk,yolks,young,za,zest,zested,zoodles,zucchini,zucchinis,zuchinni
0,345.8,18.2,63.7,5.9,14.4,738.5,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
distances, indices = knn.kneighbors(input_features)

In [None]:
distances #ordered

array([[22.09427287, 24.3369508 , 24.71618903, 27.9448399 , 28.37469194]])

In [None]:
indices  # row positions in X (the matrix the model was fitted on)

array([[2385, 3234, 2709, 3178, 3049]])

In [None]:
# Top 5 similar recipes
df.iloc[indices.flatten()]

Unnamed: 0.1,Unnamed: 0,name,ingredients,instructions,nutrition,time,serving_size,tags
2385,425,Vegan nut roast with redcurrant port sauce,"['200g mixed nuts, such as brazils, peanuts, a...",['1 Whiz the nuts and chunks of bread in a foo...,"{'Calories ': '453 cal', 'Kilojoules ': '1895 ...",1 hr 10 mins,6,"['Gluten free', 'Vegan', '1 vege serve', 'Free..."
3234,1283,Pork and chive potstickers,"['2 tablespoons reduced-salt soy sauce', '½ te...","['1 In a jar or small bowl, combine soy sauce,...","{'Calories ': '573 cal', 'Kilojoules ': '2400 ...",40 mins,4,"['Dairy free', 'High fibre', 'High iron', 'Hig..."
2709,753,Chinese pork mince and noodles,"['1½ tablespoons reduced-salt soy sauce', '2 t...","['1 Combine soy sauce, sugar and vinegar in a ...","{'Calories ': '327 cal', 'Kilojoules ': '1375 ...",35 mins,4,"['High protein', 'Low kilojoule', 'Low sodium'..."
3178,1227,Chunky Italian-style soup with risoni,['2 x 400g cans no-added-salt chopped tomatoes...,"['1 In a large, heavy-based pan set over mediu...","{'Calories ': '431 cal', 'Kilojoules ': '1800 ...",25 mins,4,"['Gluten-free option', 'High fibre', 'Nut free..."
3049,1094,"Pasta with hot smoked salmon, ratatouille and ...",['200g gluten-free legume pasta (or wholemeal ...,['1 Cook the pasta in a large saucepan of boil...,"{'Calories ': '367 cal', 'Kilojoules ': '1570 ...",20 mins,4,"['Diabetes-friendly', 'Gluten free', 'High fib..."
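
A caveat on the lookup: `kneighbors` returns positions in `X`, and `X` was built after `dropna` and `reset_index(drop=True)`, so those positions may no longer line up with positions in `df`. Whether that affected the rows shown here cannot be told from the output alone. The sketch below, on toy data with hypothetical names, shows one way to keep the original labels so that the lookup stays aligned.

```python
import pandas as pd

# Toy stand-ins for `df` and the cleaned feature table (hypothetical data).
df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "calories": [100.0, None, 300.0, 400.0]})
features = df[["calories"]].dropna()  # row 1 is dropped; labels are now 0, 2, 3

# Remember the original labels before resetting the index.
original_labels = features.index.to_numpy()
X = features.reset_index(drop=True)

positions = [1, 2]  # e.g. what kneighbors() might return for X

# Positional lookup in df picks the wrong rows once rows have been dropped.
by_position = df.iloc[positions]["name"].tolist()
# Translating positions back to the original labels gives the intended rows.
by_label = df.loc[original_labels[positions]]["name"].tolist()
print(by_position, by_label)
```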


In [None]:
##### Create a pipeline
def top_5_recipies(cal, protein, carbs, fiber, fat, sodium, ingredients):
    # Nutrition part of the query, using the same column names as X
    input_nutrients = {
        'calories': cal,
        'protein': protein,
        'carbohydrates': carbs,
        'fiber': fiber,
        'fat': fat,
        'sodium': sodium
    }
    # Vectorize the pantry ingredients with the already-fitted TF-IDF vectorizer
    input_vectorized_ingredients = vectorizer.transform([ingredients])
    input_ingredients_df = pd.DataFrame(input_vectorized_ingredients.toarray(), columns=feature_names)
    input_nutrients_df = pd.DataFrame(input_nutrients, index=[0])
    # Single-row frame with the same columns, in the same order, as X
    input_features = input_nutrients_df.join(input_ingredients_df, rsuffix="_tfidf")
    distances, indices = knn.kneighbors(input_features)
    # NOTE: `indices` are positions in X; this lookup assumes X and df rows are aligned
    return df.iloc[indices.flatten()]


In [None]:
top_5_recipies(530,21,28,4.8,32,700,"Cauliflower cheese pasta chicken avocado")

Unnamed: 0.1,Unnamed: 0,name,ingredients,instructions,nutrition,time,serving_size,tags
3481,1538,Cauliflower mac ’n’ cheese with rye crumbs,"['250g small dried pasta', '4 cups small flore...",['1 Preheat the oven to 180°C. Lightly grease ...,"{'Calories ': '372 cal', 'Kilojoules ': '1560 ...",55 mins,6,"['Diabetes-friendly', 'High calcium', 'High fi..."
2765,809,Yellow prawn curry,"['¼ cup gluten-free Thai yellow curry paste', ...",['1 Heat a large deep-frying pan over medium h...,"{'Calories ': '367 cal', 'Kilojoules ': '1535 ...",20 mins,4,"['Dairy free', 'Gluten free', 'High fibre', 'H..."
3537,1620,Chicken avocado warm cobb salad,"['150g chicken breast', '2 eggs', '2 cups letu...","['1 Preheat the oven to 200°C.', '2 On a tray ...","{'Calories ': '437 cal', 'Kilojoules ': '1830 ...",20 mins,2,"['Gluten-free option', 'High fibre', 'High pro..."
3088,1133,Meatballs with kumara ‘spaghetti’,"['300g lean beef mince', '½ cup wholegrain bre...",['1 Preheat oven to 200ºC or set barbecue on a...,"{'Calories ': '439 cal', 'Kilojoules ': '1840 ...",40 mins,4,"['Gluten-free option', 'High calcium', 'High f..."
3500,1557,Italian-style fish with capers and polenta,"['¾ cup instant polenta', '¼ cup grated parmes...","['1 In a saucepan, bring 3 cups cold water to ...","{'Calories ': '348 cal', 'Kilojoules ': '1460 ...",30 mins,4,"['High fibre', 'High iron', 'High protein', 'L..."
