# Let's cook model

Let's combine what we've found so far.

- [What are ingredients?](https://www.kaggle.com/rejasupotaro/what-are-ingredients) (Preprocessing & Feature extraction)
- [Representations for ingredients](https://www.kaggle.com/rejasupotaro/representations-for-ingredients)

The steps are as follows.

1. Load dataset
2. Remove outliers
3. Preprocess
4. Create model
5. Check local CV
6. Train model
7. Check predicted values
8. Make submission

In [1]:
import re
import numpy as np
import pandas as pd
from collections import defaultdict
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_validate
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, LabelEncoder


## 1. Load dataset

In [26]:
train = pd.read_json('train.json')
test = pd.read_json('test.json')

## 2. Remove outliers

I saw some weird recipes in the dataset.

- water => Japanese
- butter => Indian
- butter => French

Let's filter such single-ingredient recipes and see how it goes.

In [27]:
train['num_ingredients'] = train['ingredients'].apply(len)
train = train[train['num_ingredients'] > 1]
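
Before committing to the filter, it can help to look at what it drops. A minimal sketch on a toy frame; the rows below are made up for illustration and are not taken from `train.json`:

```python
import pandas as pd

# Toy stand-in for train.json; these rows are invented for illustration.
toy = pd.DataFrame({
    'cuisine': ['japanese', 'indian', 'french', 'greek'],
    'ingredients': [['water'], ['butter'], ['butter'], ['feta', 'olives', 'tomatoes']],
})

toy['num_ingredients'] = toy['ingredients'].apply(len)

# Rows the filter would remove: recipes with a single ingredient.
dropped = toy[toy['num_ingredients'] <= 1]
kept = toy[toy['num_ingredients'] > 1]

print(dropped[['cuisine', 'ingredients']])
print('kept {} of {} recipes'.format(len(kept), len(toy)))
```

Running the same two lines on the real `train` frame would show how many recipes are affected and which cuisines they come from.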

## 3. Preprocess

Currently, the preprocessing works as follows.

- convert to lowercase
- replace hyphens with spaces
- remove words that contain digits
- remove words of 2 characters or fewer
- remove words that contain a typographic apostrophe (’)
- lemmatize

This process could be improved.

In [28]:
lemmatizer = WordNetLemmatizer()
def preprocess(ingredients):
    ingredients_text = ' '.join(ingredients)
    ingredients_text = ingredients_text.lower()
    ingredients_text = ingredients_text.replace('-', ' ')
    words = []
    for word in ingredients_text.split():
        if re.findall('[0-9]', word): continue
        if len(word) <= 2: continue
        if '’' in word: continue
        word = lemmatizer.lemmatize(word)
        if len(word) > 0: words.append(word)
    return ' '.join(words)


for ingredient, expected in [
    ('Eggs', 'egg'),
    ('all-purpose flour', 'all purpose flour'),
    ('purée', 'purée'),
    ('1% low-fat milk', 'low fat milk'),
    ('half & half', 'half half'),
    ('safetida (powder)', 'safetida (powder)')
]:
    actual = preprocess([ingredient])
    assert actual == expected, '"{}" is expected but got "{}"'.format(expected, actual)
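
One possible direction for improving this step is to strip parenthesised notes and stray punctuation before lemmatizing. The sketch below uses only `re`; `clean_ingredient` is a hypothetical helper, not part of the notebook, and whether dropping notes like "(powder)" helps the score is an assumption that would need checking against local CV.

```python
import re

def clean_ingredient(text):
    """Hypothetical extra cleaning step, not part of the notebook's preprocess()."""
    text = text.lower()
    text = re.sub(r'\([^)]*\)', ' ', text)   # drop parenthesised notes, e.g. "(powder)"
    text = re.sub(r'[^\w\s]', ' ', text)     # replace remaining punctuation with spaces
    text = re.sub(r'\d+', ' ', text)         # drop digits
    return ' '.join(text.split())            # collapse repeated whitespace

for raw in ['safetida (powder)', '1% low-fat milk', 'Half & Half', 'purée']:
    print(repr(raw), '->', repr(clean_ingredient(raw)))
```

The output of such a helper could then be passed through the lemmatizer as before.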

In [29]:
train['x'] = train['ingredients']
test['x'] = test['ingredients']
train.head()

Unnamed: 0,cuisine,id,ingredients,num_ingredients,x
0,greek,10259,"[romaine lettuce, black olives, grape tomatoes...",9,"[romaine lettuce, black olives, grape tomatoes..."
1,southern_us,25693,"[plain flour, ground pepper, salt, tomatoes, g...",11,"[plain flour, ground pepper, salt, tomatoes, g..."
2,filipino,20130,"[eggs, pepper, salt, mayonaise, cooking oil, g...",12,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,indian,22213,"[water, vegetable oil, wheat, salt]",4,"[water, vegetable oil, wheat, salt]"
4,indian,13162,"[black pepper, shallots, cornflour, cayenne pe...",20,"[black pepper, shallots, cornflour, cayenne pe..."


I need to tune the parameters of TfidfVectorizer later.

In [30]:
train['x'] = train['ingredients'].map(";".join)
test['x'] = test['ingredients'].map(";".join)
train.head()

Unnamed: 0,cuisine,id,ingredients,num_ingredients,x
0,greek,10259,"[romaine lettuce, black olives, grape tomatoes...",9,romaine lettuce;black olives;grape tomatoes;ga...
1,southern_us,25693,"[plain flour, ground pepper, salt, tomatoes, g...",11,plain flour;ground pepper;salt;tomatoes;ground...
2,filipino,20130,"[eggs, pepper, salt, mayonaise, cooking oil, g...",12,eggs;pepper;salt;mayonaise;cooking oil;green c...
3,indian,22213,"[water, vegetable oil, wheat, salt]",4,water;vegetable oil;wheat;salt
4,indian,13162,"[black pepper, shallots, cornflour, cayenne pe...",20,black pepper;shallots;cornflour;cayenne pepper...


In [31]:
vectorizer = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    FunctionTransformer(lambda x: x.astype('float16'), validate=False)
)

x_train = vectorizer.fit_transform(train['x'].values)
x_train.sort_indices()
print(x_train)
x_test = vectorizer.transform(test['x'].values)

  (0, 188)	0.20752
  (0, 254)	0.13989
  (0, 531)	0.14563
  (0, 748)	0.33423
  (0, 971)	0.30396
  (0, 1101)	0.38843
  (0, 1107)	0.10529
  (0, 1184)	0.35034
  (0, 1545)	0.26636
  (0, 1892)	0.26099
  (0, 1896)	0.16455
  (0, 2024)	0.10205
  (0, 2210)	0.23914
  (0, 2326)	0.34277
  (0, 2435)	0.2301
  (0, 2808)	0.15186
  (1, 254)	0.17432
  (1, 685)	0.22754
  (1, 910)	0.20764
  (1, 1026)	0.18945
  (1, 1205)	0.19568
  (1, 1219)	0.26318
  (1, 1688)	0.41528
  (1, 1729)	0.21838
  (1, 1883)	0.12036
  :	:
  (39750, 2950)	0.25122
  (39750, 2983)	0.099548
  (39750, 2985)	0.1665
  (39750, 2995)	0.11426
  (39751, 208)	0.22693
  (39751, 254)	0.15771
  (39751, 498)	0.26343
  (39751, 554)	0.26123
  (39751, 559)	0.24463
  (39751, 586)	0.18616
  (39751, 605)	0.19629
  (39751, 873)	0.21521
  (39751, 1057)	0.14063
  (39751, 1107)	0.11865
  (39751, 1205)	0.29956
  (39751, 1219)	0.14063
  (39751, 1372)	0.27515
  (39751, 1897)	0.14832
  (39751, 1905)	0.24597
  (39751, 2024)	0.19482
  (39751, 2324)	0.41089
  (3975
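
One TfidfVectorizer parameter worth trying: with the default token pattern, the `;` separator only acts as a word boundary, so "olive oil" still becomes two tokens. A custom tokenizer can keep each ingredient whole instead. This is a minimal sketch on made-up recipes; `split_ingredients` is a hypothetical helper, and whether whole-ingredient tokens beat single words here is untested.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def split_ingredients(text):
    # Hypothetical tokenizer: one token per ';'-separated ingredient.
    return [token.strip() for token in text.split(';') if token.strip()]

# Made-up recipes in the same 'a;b;c' format as train['x'].
docs = [
    'olive oil;garlic;tomatoes',
    'soy sauce;garlic;sesame oil',
]

# token_pattern=None tells the vectorizer that only the tokenizer is used.
whole = TfidfVectorizer(tokenizer=split_ingredients, token_pattern=None, sublinear_tf=True)
words = TfidfVectorizer(sublinear_tf=True)  # default: split into single words

print(sorted(whole.fit(docs).vocabulary_))
print(sorted(words.fit(docs).vocabulary_))
```

The two vocabularies could also be combined with `make_union`, which is already imported at the top of the notebook.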

Encode cuisines to numeric values using LabelEncoder.

In [32]:
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train['cuisine'].values)
dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

{'brazilian': 0,
 'british': 1,
 'cajun_creole': 2,
 'chinese': 3,
 'filipino': 4,
 'french': 5,
 'greek': 6,
 'indian': 7,
 'irish': 8,
 'italian': 9,
 'jamaican': 10,
 'japanese': 11,
 'korean': 12,
 'mexican': 13,
 'moroccan': 14,
 'russian': 15,
 'southern_us': 16,
 'spanish': 17,
 'thai': 18,
 'vietnamese': 19}
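
The mapping matters again at submission time, because the classifier predicts integers and the submission file needs cuisine names. A small round-trip sketch on made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(['greek', 'indian', 'greek', 'thai'])

print(encoder.classes_)                   # classes are stored in sorted order
print(codes)                              # integer code of each input label
print(encoder.inverse_transform(codes))   # back to the original strings
```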

## 4. Create model

I've tried LogisticRegression, GaussianProcessClassifier, GradientBoostingClassifier, MLPClassifier, LGBMClassifier, SGDClassifier, and Keras, but SVC has worked best so far.

I need to look at the models and their parameters more closely.

In [34]:
estimator = SVC(
    C=80,
    kernel='rbf',
    gamma=1.7,
    coef0=1,  # no effect with kernel='rbf'; only 'poly' and 'sigmoid' use it
    cache_size=500,
)
classifier = OneVsRestClassifier(estimator, n_jobs=1)
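
For a closer look at the parameters, a grid search over `C` and `gamma` is one option. The sketch below runs on synthetic data from `make_classification` rather than the recipe features, and the grid values are placeholders, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for (x_train, y_train); not the recipe data.
X, y = make_classification(n_samples=150, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

search = GridSearchCV(
    OneVsRestClassifier(SVC(kernel='rbf')),
    # The 'estimator__' prefix reaches the SVC wrapped by OneVsRestClassifier.
    param_grid={'estimator__C': [1, 10, 100], 'estimator__gamma': [0.01, 0.1, 1.0]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

On the full recipe data each SVC fit is slow, so a small grid or a subsample is probably the practical way to start.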

## 5. Check local CV

TRUST YOUR LOCAL CV. TRUST YOUR LOCAL CV. TRUST YOUR LOCAL CV. I repeated 3 times since this is the most important thing.

Try different preprocessing steps and parameters while watching the local CV score.

In [11]:
scores = cross_validate(classifier, x_train, y_train, cv=3)
scores['test_score'].mean()

0.81200960588279558
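
When comparing preprocessing variants, it helps to score every variant on the same folds and to look at the spread across folds as well as the mean. A minimal sketch with a fixed `StratifiedKFold`; the synthetic data and the `LogisticRegression` stand-in are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for (x_train, y_train); not the recipe data.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Fixed seed: every experiment is scored on identical folds.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=cv,
                        return_train_score=True)

print('test  mean={:.3f} std={:.3f}'.format(scores['test_score'].mean(),
                                            scores['test_score'].std()))
print('train mean={:.3f}'.format(scores['train_score'].mean()))
```

A difference between two variants that is smaller than the fold-to-fold standard deviation is probably noise.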

In [35]:
from sklearn.model_selection import train_test_split

X_trainN, X_testN, y_trainN, y_testN = train_test_split(x_train, y_train, test_size=0.2)

In [36]:
classifier.fit(X_trainN, y_trainN)

OneVsRestClassifier(estimator=SVC(C=80, cache_size=500, class_weight=None, coef0=1,
  decision_function_shape='ovr', degree=3, gamma=1.7, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          n_jobs=1)

In [37]:
print("SVM classifier accuracy",classifier.score(X_testN, y_testN))

SVM classifier accuracy 0.812224877374


In [58]:
import itertools
from matplotlib import pyplot as plt
def plot_confusion_matrix(cm, classes, path, normalize=True, title='Confusion matrix', cmap=plt.cm.Blues):
    '''
    Plot a confusion matrix as a heat map and save it to <path>.png.
    With normalize=True the colors follow the row-normalized matrix, so the
    diagonal shows per-class recall. Each cell prints the raw count and,
    below it, the row-normalized value.
    '''
    # Always compute the row-normalized matrix; the cell text needs it either way.
    cm_normal = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    cm_color = cm_normal if normalize else cm

    plt.figure(figsize=(20, 20))
    plt.imshow(cm_color, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Use white text on dark cells so the numbers stay readable.
    thresh = cm_color.max() / 2.

    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        color = "white" if cm_color[i, j] > thresh else "black"
        plt.text(j, i, format(cm[i, j], 'd'),
                 horizontalalignment="center", color=color)
        plt.text(j, i + 0.25, format(cm_normal[i, j], '.2f'),
                 horizontalalignment="center", color=color)

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.savefig(path + '.png')
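
The normalization inside `plot_confusion_matrix` divides each row by its total, so the diagonal of the normalized matrix is the per-class recall, not the F1 score. A quick check on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for three classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)
# Same row normalization as in plot_confusion_matrix.
cm_normal = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

print(cm)
print(np.round(cm_normal, 2))
# The diagonal of the row-normalized matrix equals the per-class recall.
print(np.allclose(np.diag(cm_normal), recall_score(y_true, y_pred, average=None)))
```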

In [42]:
y_predict = classifier.predict(X_testN) 
y_predict_train = classifier.predict(X_trainN)

In [59]:
from sklearn.metrics import classification_report, confusion_matrix

# Map the integer predictions back to cuisine names for the text report.
y_predictclass = label_encoder.inverse_transform(y_predict)
y_trueclass = label_encoder.inverse_transform(y_testN)

cm_train = confusion_matrix(y_trainN, y_predict_train)
cm_test = confusion_matrix(y_testN, y_predict)

# confusion_matrix orders rows and columns by sorted label value,
# which is the order of label_encoder.classes_.
plot_confusion_matrix(cm_test, classes=label_encoder.classes_, path='plt1', normalize=True, title="SVM (rbf) Confusion Matrix (count/normalized) - test set")
plot_confusion_matrix(cm_train, classes=label_encoder.classes_, path='plt2', normalize=True, title="SVM (rbf) Confusion Matrix (count/normalized) - train set")

print(y_trueclass.shape, y_predictclass.shape)
report = classification_report(y_trueclass, y_predictclass)
print(report)



(7951,) (7951,)
              precision    recall  f1-score   support

   brazilian       0.84      0.66      0.74        89
     british       0.70      0.49      0.58       159
cajun_creole       0.78      0.69      0.73       293
     chinese       0.83      0.87      0.85       559
    filipino       0.83      0.64      0.72       138
      french       0.61      0.69      0.65       488
       greek       0.75      0.74      0.74       222
      indian       0.89      0.93      0.91       601
       irish       0.76      0.57      0.65       157
     italian       0.83      0.90      0.86      1592
    jamaican       0.96      0.81      0.88       120
    japanese       0.84      0.76      0.79       249
      korean       0.90      0.80      0.85       162
     mexican       0.92      0.92      0.92      1327
    moroccan       0.86      0.83      0.84       173
     russian       0.70      0.51      0.59       101
 southern_us       0.73      0.82      0.77       837
     spanis

In [29]:
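from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions for every training row, then a confusion matrix over them.
y_pred_cv = cross_val_predict(classifier, x_train, y_train, cv=3)
conf_mat = confusion_matrix(y_train, y_pred_cv)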


KeyboardInterrupt: 

## 6. Train model

Once I am confident in the model, I train it on the whole training set for submission.

In [12]:
classifier.fit(x_train, y_train)

OneVsRestClassifier(estimator=SVC(C=80, cache_size=500, class_weight=None, coef0=1,
  decision_function_shape='ovr', degree=3, gamma=1.7, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          n_jobs=1)

## 7. Check predicted values

Check whether the model fits the training data well enough.

In [50]:
y_pred = label_encoder.inverse_transform(classifier.predict(x_train))
y_true = label_encoder.inverse_transform(y_train)

print('accuracy score on train data: {}'.format(accuracy_score(y_true, y_pred)))

# classification_report can return a nested dict directly (scikit-learn 0.20 or newer),
# which avoids parsing the text report by hand.
report = classification_report(y_true, y_pred, output_dict=True)
pd.DataFrame(report).T

ValueError: Mix of label input types (string and number)

In [40]:
import itertools
from matplotlib import pyplot as plt
def plot_confusion_matrix(cm, classes, normalize=True, title='Confusion matrix', cmap=plt.cm.Blues):
    '''
    Plot a confusion matrix as a heat map.
    With normalize=True the colors follow the row-normalized matrix, so the
    diagonal shows per-class recall. Each cell prints the raw count and,
    below it, the row-normalized value.
    '''
    # Always compute the row-normalized matrix; the cell text needs it either way.
    cm_normal = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    cm_color = cm_normal if normalize else cm

    plt.figure(figsize=(20, 20))
    plt.imshow(cm_color, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Use white text on dark cells so the numbers stay readable.
    thresh = cm_color.max() / 2.

    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        color = "white" if cm_color[i, j] > thresh else "black"
        plt.text(j, i, format(cm[i, j], 'd'),
                 horizontalalignment="center", color=color)
        plt.text(j, i + 0.25, format(cm_normal[i, j], '.2f'),
                 horizontalalignment="center", color=color)

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()


In [18]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from matplotlib import pyplot as plt

y_predict = classifier.predict(x_test) 
y_predict_train = classifier.predict(x_train)

cm_lr_train = confusion_matrix(y_train, y_predict_train)
cm_lr_test = confusion_matrix(y_test, y_predict)

plot_confusion_matrix(cm_lr_test, classes=df.cuisine.unique(), normalize=True, title="Logistic Regression Confusion Matrix (count/normalized) - test set")
plot_confusion_matrix(cm_lr_train, classes=df.cuisine.unique(), normalize=True, title="Logistic Regression Confusion Matrix (count/normalized) - train set")


NameError: name 'y_test' is not defined

In [60]:
import eli5


In [61]:
weights = eli5.explain_weights_df(classifier, top=50)
print(weights)

AttributeError: 'OneVsRestClassifier' object has no attribute 'kernel'
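
An RBF-kernel SVC has no per-feature weights (`coef_` only exists for a linear kernel), so there is nothing for a weight explainer to read. One workaround, which swaps in a different model purely for inspection, is to fit a linear classifier on the same TF-IDF features and look at its coefficients. A minimal sketch on made-up recipes; the data and the choice of `LinearSVC` are assumptions, and the weights describe the linear model, not the RBF model above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up recipes; three per cuisine.
docs = ['soy sauce ginger rice', 'soy sauce sesame oil', 'rice ginger scallion',
        'olive oil basil tomato', 'tomato basil pasta', 'pasta olive oil garlic',
        'tortilla salsa cilantro', 'salsa avocado lime', 'cilantro lime tortilla']
labels = ['chinese'] * 3 + ['italian'] * 3 + ['mexican'] * 3

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
linear = LinearSVC().fit(X, labels)

# Order the vocabulary by column index so positions map back to words.
terms = np.array(sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get))
for cuisine, row in zip(linear.classes_, linear.coef_):
    top = terms[np.argsort(row)[::-1][:3]]   # three largest weights for this class
    print(cuisine, list(top))
```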

## 8. Make submission

It seems to be working well. Let's make a submission.

In [None]:
y_pred = label_encoder.inverse_transform(classifier.predict(x_test))
test['cuisine'] = y_pred
test[['id', 'cuisine']].to_csv('submission.csv', index=False)
test[['id', 'cuisine']].head()

That's it! Don't trust what I've done here. The score can be better. Please let me know if you find a better approach.