# Recipe Diet Classification from Ingredients

The goal of this project is to classify a recipe's diet from its ingredients.

The following steps are taken to solve this problem:
* 1- Load the dataset from the JSON file DietDS.json.
* 2- Clean the dataset's ingredient column using regex by removing
        - punctuation, digits, stop words, brand names
* 3- Encode the feature column using TfidfVectorizer (which converts a collection of raw documents to a matrix of TF-IDF features).
* 4- Split the data (X: TF-IDF matrix, Y: diet label) into training and test sets (80:20, matching `test_size = 0.2` in the code).
* 5- Run the KNN (k-nearest neighbors) machine learning algorithm and measure its accuracy.
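
To make step 3 concrete, here is a minimal sketch of what TfidfVectorizer produces. The three ingredient strings are made up for illustration and are not taken from DietDS.json; the point is only that a term shared by every recipe ends up with a lower weight than rarer terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy ingredient strings (made up for illustration, not from DietDS.json)
docs = [
    "salt pepper tomato",
    "salt flour egg",
    "salt oliv oil tomato",
]

vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(docs)  # sparse matrix: one row per recipe, one column per term

vocab = vectorizer.vocabulary_  # maps each term to its column index
print(X_toy.shape)

# Weights of two terms in the first recipe
salt_weight = X_toy[0, vocab["salt"]]
pepper_weight = X_toy[0, vocab["pepper"]]
print("salt:", salt_weight, "pepper:", pepper_weight)
```

Because "salt" occurs in all three toy recipes, its inverse document frequency is the lowest, so it contributes less to the row than "pepper", which occurs only once.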

In [38]:

# import needed libraries
import json
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import pandas as pd
import re
# import for KNN
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

In [91]:
# LOAD DATA
ps = PorterStemmer()  # stemmer reused in the cleaning step
dataset_file = r'DietDS.json'
with open(dataset_file) as train_file:
    dict_train = json.load(train_file)

# collect one list per column from the list of recipe records
id_ = []
diets = []
ingredients = []
for recipe in dict_train:
    id_.append(recipe['id'])
    diets.append(recipe['diet'])
    ingredients.append(recipe['ingredients'])

df = pd.DataFrame({'id': id_,
                   'diet': diets,
                   'ingredients': ingredients})
print(df.head(5))

      id  diet                                        ingredients
0  10259     0  romaine lettuce,black olives,grape tomatoes,ga...
1  25693     0  plain flour,ground pepper,salt,tomatoes,ground...
2  20130     0  eggs,pepper,salt,mayonaise,cooking oil,green c...
3  22213     1                     water,vegetable oil,wheat,salt
4  13162     0  black pepper,shallots,cornflour,cayenne pepper...


In [92]:
df['diet'].value_counts()

0    823
1    176
Name: diet, dtype: int64
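
The two classes are imbalanced, so it helps to know what accuracy a classifier would reach by always predicting the majority class. The sketch below computes that baseline from the counts printed above; it covers the whole dataset, so the figure for any particular test split may differ slightly.

```python
# Class counts copied from the value_counts() output above
counts = {0: 823, 1: 176}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.1%}")  # about 82.4%
```

Any model accuracy is best read against this number rather than against 50%.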

In [99]:
# remove punctuation (regex=True is needed on pandas >= 2.0, where str.replace matches literally by default)
df['ing'] = df['ingredients'].str.replace(r'[^\w\s]', ' ', regex=True)

# build the stop-word set once instead of once per recipe
stop_words = set(stopwords.words('english'))

l = []
for s in df['ing']:
    # Remove stop words
    word_tokens = word_tokenize(s)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]

    # Remove low-content adjectives (not implemented)

    # Porter Stemmer algorithm: reduce each remaining word to its stem
    word_ps = [ps.stem(w) for w in filtered_sentence]
    l.append(' '.join(word_ps))
df['ing_mod'] = l

print(df.head(3))

      id  diet                                        ingredients  \
0  10259     0  romaine lettuce,black olives,grape tomatoes,ga...   
1  25693     0  plain flour,ground pepper,salt,tomatoes,ground...   
2  20130     0  eggs,pepper,salt,mayonaise,cooking oil,green c...   

                                                 ing  \
0  romaine lettuce black olives grape tomatoes ga...   
1  plain flour ground pepper salt tomatoes ground...   
2  eggs pepper salt mayonaise cooking oil green c...   

                                             ing_mod  
0  romain lettuc black oliv grape tomato garlic p...  
1  plain flour ground pepper salt tomato ground b...  
2  egg pepper salt mayonais cook oil green chili ...  
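
The step list mentions removing digits, which the cleaning cell above does not do. A minimal sketch of how that could be added with `re` follows. The helper name `strip_digits_and_punctuation` and the sample string are hypothetical and not part of the original notebook.

```python
import re

def strip_digits_and_punctuation(text):
    """Hypothetical helper: replace digits and punctuation with spaces, then collapse whitespace."""
    text = re.sub(r"\d+", " ", text)      # drop digits such as quantities
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation (same pattern as the notebook)
    return re.sub(r"\s+", " ", text).strip()

# Made-up ingredient string for illustration
cleaned = strip_digits_and_punctuation("2 eggs,plain flour,1% milk")
print(cleaned)  # eggs plain flour milk
```

One way to use it would be `df['ingredients'].apply(strip_digits_and_punctuation)` before the stop-word and stemming steps.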


In [100]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['ing_mod'])

In [101]:
Y = df['diet']  # target labels

In [102]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 100)
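
Since the classes are imbalanced (823 vs 176), a plain random split can leave the test set with a somewhat different class ratio than the full data. One option, shown here as a sketch on made-up toy labels rather than the real dataset, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data with an 80/20 class imbalance (made up for illustration)
X_toy = [[i] for i in range(100)]
y_toy = [0] * 80 + [1] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=100, stratify=y_toy
)

# Both parts keep the 4:1 ratio of the full toy dataset
print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

In the notebook this would correspond to passing `stratify=Y` in the call above; that is a suggestion, not what the original run used.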

In [103]:
# try K = 1..20 and report the test accuracy for each
for K in range(1, 21):
    neigh = KNeighborsClassifier(n_neighbors = K, weights='uniform', algorithm='auto')
    neigh.fit(X_train, y_train)

    y_pred = neigh.predict(X_test)
    print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K:",K)

Accuracy is  78.5 % for K: 1
Accuracy is  82.5 % for K: 2
Accuracy is  82.5 % for K: 3
Accuracy is  82.5 % for K: 4
Accuracy is  80.0 % for K: 5
Accuracy is  81.5 % for K: 6
Accuracy is  83.0 % for K: 7
Accuracy is  82.5 % for K: 8
Accuracy is  83.0 % for K: 9
Accuracy is  81.5 % for K: 10
Accuracy is  83.0 % for K: 11
Accuracy is  81.0 % for K: 12
Accuracy is  81.5 % for K: 13
Accuracy is  80.5 % for K: 14
Accuracy is  82.0 % for K: 15
Accuracy is  81.0 % for K: 16
Accuracy is  81.0 % for K: 17
Accuracy is  80.5 % for K: 18
Accuracy is  80.5 % for K: 19
Accuracy is  80.5 % for K: 20
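
To read the results more easily, the printed accuracies can be collected in a list and searched for the best K. The values below are copied from the output above; the snippet only summarizes them and does not rerun the model.

```python
# Accuracies (in %) copied from the output above, for K = 1..20
accuracies = [78.5, 82.5, 82.5, 82.5, 80.0, 81.5, 83.0, 82.5, 83.0, 81.5,
              83.0, 81.0, 81.5, 80.5, 82.0, 81.0, 81.0, 80.5, 80.5, 80.5]

best = max(accuracies)
best_ks = [k for k, acc in enumerate(accuracies, start=1) if acc == best]

print("Best accuracy:", best, "% at K =", best_ks)
```

The best score (83.0%) is shared by K = 7, 9 and 11, and the spread across all K is small. Because these scores come from the same test split used to compare the K values, they are probably a slightly optimistic estimate; choosing K on a separate validation split or with cross-validation would be a more careful option.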
