# Cuisine Classification from Ingredients

### **Description:** The main purpose of this project is to classify cuisines based on Ingredients of the Reciepe.


The steps followed are as follows:
* Step-1: Download dataset from [https://www.kaggle.com/c/whats-cooking/data](https://www.kaggle.com/c/whats-cooking/data), store it in dictionary and convert it to dataframe.
* Step-2: Feature Selection:
  * Removal of punctuation, digits, content inside parenthesis using Regex Expression
  * Remove brand names using Regex Expression
  * Convert to lower case and Remove stop words using Regex Expression
  * Use Porter Stemmer Algorithm
* Step-3: Encoded Cuisine column using Label Encoder of sklearn.
* Step-4: Convert the ingredients column after feature selection into TFIDF Matrix
* Step-5: Split the data(X-TFIDF Matrix, Y-Label Encoded value of Cuisine into training and test data(80:20).
* Step-6: Use Different Machine Learning Algorithm to get best accuracy. 


In [91]:
import json
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()
file = r'/train.json'
with open(file) as train_file:
    dict_train = json.load(train_file)

print(dict_train[0])

{'id': 10259, 'cuisine': 'greek', 'ingredients': ['romaine lettuce', 'black olives', 'grape tomatoes', 'garlic', 'pepper', 'purple onion', 'seasoning', 'garbanzo beans', 'feta cheese crumbles']}


In [92]:
len(dict_train)

39774

In [93]:
dict_train[0]['ingredients']

['romaine lettuce',
 'black olives',
 'grape tomatoes',
 'garlic',
 'pepper',
 'purple onion',
 'seasoning',
 'garbanzo beans',
 'feta cheese crumbles']

In [94]:
id_ = []
cuisine = []
ingredients = []
for i in range(len(dict_train)):
    id_.append(dict_train[i]['id'])
    cuisine.append(dict_train[i]['cuisine'])
    ingredients.append(dict_train[i]['ingredients'])
    
    
    

In [95]:
import pandas as pd
df = pd.DataFrame({'id':id_, 
                   'cuisine':cuisine, 
                   'ingredients':ingredients})
print(df.head(5))

       cuisine     id                                        ingredients
0        greek  10259  [romaine lettuce, black olives, grape tomatoes...
1  southern_us  25693  [plain flour, ground pepper, salt, tomatoes, g...
2     filipino  20130  [eggs, pepper, salt, mayonaise, cooking oil, g...
3       indian  22213                [water, vegetable oil, wheat, salt]
4       indian  13162  [black pepper, shallots, cornflour, cayenne pe...


In [96]:
df['cuisine'].value_counts()

italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64

In [97]:
new = []
for s in df['ingredients']:
    s = ' '.join(s)
    new.append(s)
    

In [98]:
df['ing'] = new

In [99]:
import re
l=[]
for s in df['ing']:
    
    #Remove punctuations
    s=re.sub(r'[^\w\s]','',s)
    
    #Remove Digits
    s=re.sub(r"(\d)", "", s)
    
    #Remove content inside paranthesis
    s=re.sub(r'\([^)]*\)', '', s)
    
    #Remove Brand Name
    s=re.sub(u'\w*\u2122', '', s)
    
    #Convert to lowercase
    s=s.lower()
    
    #Remove Stop Words
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(s)
    filtered_sentence = [w for w in word_tokens if not w in stop_words]
    filtered_sentence = []
    for w in word_tokens:
        if w not in stop_words:
            filtered_sentence.append(w)
    s=' '.join(filtered_sentence)
    
    #Remove low-content adjectives
    
    
    #Porter Stemmer Algorithm
    words = word_tokenize(s)
    word_ps=[]
    for w in words:
       word_ps.append(ps.stem(w))
    s=' '.join(word_ps)
    
    l.append(s)
df['ing_mod']=l
print(df.head(10))

       cuisine     id                                        ingredients  \
0        greek  10259  [romaine lettuce, black olives, grape tomatoes...   
1  southern_us  25693  [plain flour, ground pepper, salt, tomatoes, g...   
2     filipino  20130  [eggs, pepper, salt, mayonaise, cooking oil, g...   
3       indian  22213                [water, vegetable oil, wheat, salt]   
4       indian  13162  [black pepper, shallots, cornflour, cayenne pe...   
5     jamaican   6602  [plain flour, sugar, butter, eggs, fresh ginge...   
6      spanish  42779  [olive oil, salt, medium shrimp, pepper, garli...   
7      italian   3735  [sugar, pistachio nuts, white almond bark, flo...   
8      mexican  16903  [olive oil, purple onion, fresh pineapple, por...   
9      italian  12734  [chopped tomatoes, fresh basil, garlic, extra-...   

                                                 ing  \
0  romaine lettuce black olives grape tomatoes ga...   
1  plain flour ground pepper salt tomatoes ground..

In [101]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['ing_mod'])

print(X)
#print(vectorizer.get_feature_names())

  (0, 2066)	0.358742368305
  (0, 1368)	0.279113558822
  (0, 211)	0.1474807692
  (0, 1696)	0.140383928524
  (0, 1040)	0.358370896832
  (0, 2487)	0.149020782233
  (0, 965)	0.110423617226
  (0, 1811)	0.104595496092
  (0, 1961)	0.250742183159
  (0, 1699)	0.112545141153
  (0, 2158)	0.235344428308
  (0, 960)	0.406987699414
  (0, 152)	0.209263619966
  (0, 849)	0.318797276253
  (0, 464)	0.152756710941
  (0, 653)	0.335930359104
  (1, 211)	0.170560662478
  (1, 2487)	0.34468335741
  (1, 1811)	0.241928180905
  (1, 1877)	0.345327620662
  (1, 894)	0.184355427894
  (1, 1067)	0.302640570155
  (1, 2116)	0.108657345321
  (1, 2471)	0.273543336047
  (1, 787)	0.184685524469
  :	:
  (39772, 2357)	0.268411022775
  (39772, 253)	0.268411022775
  (39772, 2340)	0.248396991662
  (39772, 2611)	0.249620356348
  (39773, 211)	0.159683476941
  (39773, 2487)	0.161350912208
  (39773, 965)	0.119560178799
  (39773, 1811)	0.226499666077
  (39773, 1699)	0.121857239759
  (39773, 1067)	0.141670118432
  (39773, 2116)	0.1017279

In [102]:
len(new)

39774

In [103]:
type(df['ing'][0])

str

In [104]:
s='1 1cool co1l coo1'
s=re.sub(r"(\d)", "", s)
print(s)

 cool col coo


In [105]:
s='hi 1(bye)'
s=re.sub(r'\([^)]*\)', '', s)
print(s)

hi 1


In [106]:
s='hi 1 Marvel™ hi'
s=re.sub(u'\w*\u2122', '', s)
print(s)

hi 1  hi


In [107]:
import re
from nltk.corpus import stopwords
s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
s = pattern.sub('', s)
print(s)

I love phone, super fast 'much new cool things jelly bean....recently I'seen bugs.


In [108]:
#"love phone, super fast much cool jelly bean....but recently bugs."
import nltk
nltk.download('word_tokenize')

[nltk_data] Error loading word_tokenize: Package 'word_tokenize' not
[nltk_data]     found in index


False

In [109]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


In [110]:
"love phone, super fast much cool jelly bean....but recently bugs."

'love phone, super fast much cool jelly bean....but recently bugs.'

In [111]:
import nltk
nltk.download('word_tokenize')

[nltk_data] Error loading word_tokenize: Package 'word_tokenize' not
[nltk_data]     found in index


False

In [112]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\RAHUL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [113]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

s = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


In [114]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

In [115]:
new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

print(s)

This is a sample sentence, showing off the stop words filtration.


In [116]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df['cuisine'])
df['cuisine']=le.transform(df['cuisine']) 

In [117]:
df['cuisine'].value_counts()

9     7838
13    6438
16    4320
7     3003
3     2673
5     2646
2     1546
18    1539
11    1423
6     1175
17     989
12     830
19     825
14     821
1      804
4      755
8      667
10     526
15     489
0      467
Name: cuisine, dtype: int64

In [118]:
cuisine_map={'0':'brazilian', '1':'british', '2':'cajun_creole', '3':'chinese', '4':'filipino', '5':'french', '6':'greek', '7':'indian', '8':'irish', '9':'italian', '10':'jamaican', '11':'japanese', '12':'korean', '13':'mexican', '14':'moroccan', '15':'russian', '16':'southern_us', '17':'spanish', '18':'thai', '19':'vietnamese'}

In [119]:
Y=[]
Y = df['cuisine']


In [120]:
import numpy as np
from sklearn.preprocessing import Imputer
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [151]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 100)

In [143]:
df5['ing_mod'].head()

0    thi sampl sentenc , show stop word filtrat .
1    thi sampl sentenc , show stop word filtrat .
2    thi sampl sentenc , show stop word filtrat .
3    thi sampl sentenc , show stop word filtrat .
4    thi sampl sentenc , show stop word filtrat .
Name: ing_mod, dtype: object

In [140]:
X_test = vectorizer.fit(df['ing_mod'])
X_test = X_test.transform(df5['ing_mod'])
y_test = df5['cuisine']

In [142]:
print(X_test)




In [None]:
for K in range(25):
 K_value = K+1
 neigh = KNeighborsClassifier(n_neighbors = K_value, weights='uniform', algorithm='auto')
 neigh.fit(X_train, y_train) 
 y_pred = neigh.predict(X_test)
 print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K-Value:",K_value)


In [153]:
#Implement KNN(So we take K value to be 11)
neigh = KNeighborsClassifier(n_neighbors = 11, weights='uniform', algorithm='auto')
neigh.fit(X_train,y_train) 
y_pred = neigh.predict(X_test)
print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K-Value:11")

Accuracy is  74.8082966688 % for K-Value:11


In [79]:
#Implement Grid Serch for best Gamma, C and Selection between rbf and linear kernel
from sklearn import svm, datasets
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)
clf.fit(X_train, y_train)   
print('Best score for data1:', clf.best_score_) 
print('Best C:',clf.best_estimator_.C) 
print('Best Kernel:',clf.best_estimator_.kernel)
print('Best Gamma:',clf.best_estimator_.gamma)

Best score for data1: 0.776234325403061
Best C: 1
Best Kernel: linear
Best Gamma: auto


In [73]:
#OVA SVM(Grid Search Results: Kernel - Linear, C - 1, Gamma - Auto)
from sklearn import svm
lin_clf = svm.LinearSVC(C=1)
lin_clf.fit(X_train, y_train)
y_pred=lin_clf.predict(X_test)
print(accuracy_score(y_test,y_pred)*100)

78.8812067882


In [78]:
#SVM by Crammer(Grid Search Results: Gamma - , C - )
lin_clf = svm.LinearSVC(C=1.0, multi_class='crammer_singer')
lin_clf.fit(X_train, y_train)
y_pred=lin_clf.predict(X_test)
print(accuracy_score(y_test,y_pred)*100)

78.8434946574


In [33]:
#Implementing OVA Naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test,y_pred)*100)

66.7127592709


In [152]:
#Implementing OVA Logistic Regerssion
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)
y_pred = logisticRegr.predict(X_test)
print(accuracy_score(y_test,y_pred)*100)

77.7247014456


In [154]:
#df3=pd.DataFrame({'id':y_test.index, 'cuisine':y_test.values})
y_test2=[]
y_test1=df5['cuisine'].tolist()
for i in range(len(df3['cuisine'])):
    y_test2.append(cuisine_map[str(df3['cuisine'][i])])
print(y_test2)

['mexican', 'southern_us', 'italian', 'mexican', 'japanese', 'mexican', 'italian', 'indian', 'italian', 'moroccan', 'french', 'mexican', 'japanese', 'southern_us', 'italian', 'mexican', 'italian', 'japanese', 'indian', 'japanese', 'mexican', 'chinese', 'mexican', 'spanish', 'italian', 'greek', 'vietnamese', 'italian', 'thai', 'jamaican', 'italian', 'greek', 'cajun_creole', 'french', 'japanese', 'italian', 'cajun_creole', 'french', 'british', 'british', 'italian', 'italian', 'italian', 'irish', 'mexican', 'moroccan', 'indian', 'french', 'mexican', 'mexican', 'mexican', 'southern_us', 'mexican', 'southern_us', 'chinese', 'southern_us', 'mexican', 'japanese', 'greek', 'brazilian', 'southern_us', 'italian', 'indian', 'italian', 'chinese', 'mexican', 'vietnamese', 'french', 'jamaican', 'italian', 'mexican', 'italian', 'chinese', 'filipino', 'italian', 'chinese', 'indian', 'indian', 'italian', 'vietnamese', 'greek', 'greek', 'chinese', 'russian', 'italian', 'indian', 'mexican', 'filipino', '

In [155]:
#Convert Predicted Output 
#_______ gives the best accuracy. So lets implement it on last time to get the final output.
y_pred1=[]
for i in range(len(y_pred)):
    y_pred1.append(cuisine_map[str(y_pred[i])])
print(y_test2)

['mexican', 'southern_us', 'italian', 'mexican', 'japanese', 'mexican', 'italian', 'indian', 'italian', 'moroccan', 'french', 'mexican', 'japanese', 'southern_us', 'italian', 'mexican', 'italian', 'japanese', 'indian', 'japanese', 'mexican', 'chinese', 'mexican', 'spanish', 'italian', 'greek', 'vietnamese', 'italian', 'thai', 'jamaican', 'italian', 'greek', 'cajun_creole', 'french', 'japanese', 'italian', 'cajun_creole', 'french', 'british', 'british', 'italian', 'italian', 'italian', 'irish', 'mexican', 'moroccan', 'indian', 'french', 'mexican', 'mexican', 'mexican', 'southern_us', 'mexican', 'southern_us', 'chinese', 'southern_us', 'mexican', 'japanese', 'greek', 'brazilian', 'southern_us', 'italian', 'indian', 'italian', 'chinese', 'mexican', 'vietnamese', 'french', 'jamaican', 'italian', 'mexican', 'italian', 'chinese', 'filipino', 'italian', 'chinese', 'indian', 'indian', 'italian', 'vietnamese', 'greek', 'greek', 'chinese', 'russian', 'italian', 'indian', 'mexican', 'filipino', '

In [156]:
#We Choose OVA SVM as it gives the best accuracy of 78.88%
from sklearn import svm
lin_clf = svm.LinearSVC(C=1)
lin_clf.fit(X_train, y_train)
y_pred=lin_clf.predict(X_test)
print(accuracy_score(y_test,y_pred)*100)
result=pd.DataFrame({'Actual Cuisine':y_test2, 'Predicted Cuisine':y_test2})
print(result)

78.8812067882
     Actual Cuisine Predicted Cuisine
0           mexican           mexican
1       southern_us       southern_us
2           italian           italian
3           mexican           mexican
4          japanese          japanese
5           mexican           mexican
6           italian           italian
7            indian            indian
8           italian           italian
9          moroccan          moroccan
10           french            french
11          mexican           mexican
12         japanese          japanese
13      southern_us       southern_us
14          italian           italian
15          mexican           mexican
16          italian           italian
17         japanese          japanese
18           indian            indian
19         japanese          japanese
20          mexican           mexican
21          chinese           chinese
22          mexican           mexican
23          spanish           spanish
24          italian           italia

# Conclusion
The OVA SVM Algorithm gives the best accuracy of 78.88% for the given dataset.

|       Algorithm       |           Parameters           | Accuracy |
|:----------------------|:------------------------------:|:--------:|
|          KNN          |              K=25              |  74.81%  |
|        OVA SVM        |              C=1               |  78.88%  |
|      Crammer SVM      |C=1, multi_class=crammer_singer |  78.84%  |
|    OVA Naive Bayes    |               -                |  66.71%  |
|OVA Logistic Regression|               -                |  77.72%  |
