# Cuisine classifiers 1
This notebook can use either ./data/cleaned_cuisines.csv or ./data/cleaned_cuisines-MINE.csv

In [None]:
import pandas as pd

cuisines_df = pd.read_csv("./data/cleaned_cuisines.csv")
cuisines_df.head()

Unnamed: 0.1,Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,indian,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np

In [4]:
# Split the data into features (X) and labels (y) for training. The cuisine column holds the labels
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()

0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object

In [5]:
# Drop the "Unnamed: 0" column and the cuisine column, calling drop(). Save the rest of the data as trainable features
cuisines_feature_df = cuisines_df.drop(["Unnamed: 0", 'cuisine'], axis=1)
cuisines_feature_df.head()

Unnamed: 0,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


## Choosing the classifier

Scikit-learn groups classification under Supervised Learning, and in that category you will find many ways to classify. The variety (https://scikit-learn.org/stable/supervised_learning.html) is quite bewildering at first sight. The following methods all include classification techniques:
- Linear Models
- Support Vector Machines
- Stochastic Gradient Descent
- Nearest Neighbors
- Gaussian Processes
- Decision Trees
- Ensemble methods (Voting Classifier)
- Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)

Which classifier to choose? Often, running through several and looking for a good result is a way to test. Scikit-learn offers a side-by-side comparison on a created dataset: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
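
The "run through several and compare" idea can be sketched as follows. This is a minimal illustration, not the lesson's own code: it assumes a synthetic five-class dataset from `make_classification` as a stand-in for the cuisine features, and the list of candidates and all parameter values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cuisine data (assumption): 5 classes, 20 features
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10,
    n_classes=5, random_state=0,
)

# Candidate classifiers to compare (illustrative selection)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVC": SVC(kernel="linear"),
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate
results = {
    name: cross_val_score(clf, X, y, cv=5).mean()
    for name, clf in candidates.items()
}

# Print the candidates from best to worst
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

On the real data, `X` and `y` would be replaced by `cuisines_feature_df` and `cuisines_label_df`; the ranking on synthetic data says nothing about the ranking on the cuisine data.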

A better way than wildly guessing, however, is to follow the ideas on this downloadable cheat sheet: https://learn.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet?view=azureml-api-1&WT.mc_id=academic-77952-leestott Here, we discover that, for our multiclass problem, we have some choices: https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/2-Classifiers-1/images/cheatsheet.png?raw=true

Let's try to reason our way through different approaches given the constraints we have:
- **Neural networks are too heavy**. Given our clean but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task.
- **No two-class classifier**. We do not use a two-class classifier, so that rules out one-vs-all.
- **Decision tree or logistic regression could work**. A decision tree might work, or logistic regression for multiclass data.
- **Multiclass boosted decision trees solve a different problem**. A multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.

## Using Scikit-learn
We will be using Scikit-learn to analyze the data. However, there are many ways to use logistic regression in Scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

There are two important parameters, `multi_class` and `solver`, that need to be specified when we ask Scikit-learn to perform a logistic regression. The `multi_class` value selects the multiclass scheme, and the `solver` value selects the optimization algorithm. Not all solvers can be paired with all `multi_class` values.

According to the docs, in the multiclass case, the training algorithm:
- **Uses the one-vs-rest (OvR) scheme** if the `multi_class` option is set to `ovr`
- **Uses the cross-entropy loss** if the `multi_class` option is set to `multinomial`. (Currently the `multinomial` option is supported only by ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)

The 'scheme' here can either be 'ovr' (one-vs-rest) or 'multinomial'. Since logistic regression is really designed to support binary classification, these schemes allow it to better handle multiclass classification tasks: https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/
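
To make the two schemes concrete, here is a small sketch on synthetic five-class data (an assumption; it is not the cuisine dataset). It avoids the `multi_class` argument: one-vs-rest is obtained by wrapping the estimator in `OneVsRestClassifier`, and the multinomial scheme by fitting a plain `LogisticRegression` with the `lbfgs` solver, which recent Scikit-learn versions train with the cross-entropy loss on multiclass data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic five-class data (assumption; stands in for the cuisine features)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# One-vs-rest: one binary logistic regression is fitted per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Multinomial: a single model fitted over all classes at once
multinomial = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)

print(len(ovr.estimators_))     # number of binary models, one per class
print(multinomial.coef_.shape)  # (n_classes, n_features)

# Both schemes return one probability per class for each sample
ovr_proba = ovr.predict_proba(X[:3])
multi_proba = multinomial.predict_proba(X[:3])
print(ovr_proba.sum(axis=1), multi_proba.sum(axis=1))
```

The one-vs-rest probabilities come from independent binary models and are rescaled to sum to one, so they generally differ from the multinomial ones even when both models predict the same class.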

The 'solver' is defined as "the algorithm to use in the optimization problem": https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/2-Classifiers-1/images/solvers.png
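
As a rough illustration of swapping solvers, the sketch below fits the same multiclass logistic regression with each of the four solvers listed above as supporting the multinomial option. The data is synthetic (an assumption), and the scaling step and `max_iter` value are illustrative choices made so that `sag` and `saga` converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic five-class data (assumption; not the cuisine dataset)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
# sag and saga converge faster when features share a similar scale
X = StandardScaler().fit_transform(X)

# Fit the same model with each solver and record its training accuracy
solver_scores = {}
for solver in ["lbfgs", "newton-cg", "sag", "saga"]:
    clf = LogisticRegression(solver=solver, max_iter=2000, random_state=0)
    solver_scores[solver] = clf.fit(X, y).score(X, y)

for solver, score in solver_scores.items():
    print(f"{solver}: training accuracy {score:.3f}")
```

Because all four solvers minimize the same objective, their scores should be close; the practical differences are mostly in speed and in which penalties each solver supports.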

#### ---------------------------

In [6]:
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)

In [None]:
from sklearn.multiclass import OneVsRestClassifier

# Create a logistic regression with a one-vs-rest scheme and the liblinear solver.
# Passing multi_class='ovr' directly gives a deprecation warning:
# lr = LogisticRegression(multi_class='ovr', solver='liblinear')
# model = lr.fit(X_train, np.ravel(y_train))
# so wrap the estimator in OneVsRestClassifier instead.

lr = LogisticRegression(solver='liblinear')  # the lbfgs solver is an alternative to try
model = OneVsRestClassifier(lr).fit(X_train, np.ravel(y_train))

accuracy = model.score(X_test, y_test)
print("Accuracy is {}".format(accuracy))


Accuracy is 0.7964970809007507


In [14]:
# See model in action by testing one row of data (#50)
# gets the non-zero values
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50] != 0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')

ingredients: Index(['cabbage', 'clam', 'fish', 'kelp', 'scallion', 'soybean'], dtype='object')
cuisine: korean


In [None]:
# Check how confident the model is in this prediction by listing the class probabilities
test = X_test.iloc[50:51] # selects a single row but keeps it as a DataFrame (preserves column structure)
# iloc[50] - selects a single row and returns a pandas Series (loses column structure)
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)

topPrediction = resultdf.T.sort_values(by=[0], ascending=[False])
topPrediction.head()

[[3.92187198e-02 3.24869385e-04 3.97410070e-02 7.65396652e-01
  1.55318752e-01]]


Unnamed: 0,0
korean,0.765397
thai,0.155319
japanese,0.039741
chinese,0.039219
indian,0.000325


In [20]:
# Print a classification report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     chinese       0.72      0.69      0.70       240
      indian       0.94      0.89      0.92       251
    japanese       0.77      0.76      0.77       238
      korean       0.84      0.80      0.82       254
        thai       0.71      0.84      0.77       216

    accuracy                           0.80      1199
   macro avg       0.80      0.80      0.80      1199
weighted avg       0.80      0.80      0.80      1199
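
The precision and recall columns in the report can be derived from a confusion matrix. The sketch below shows the arithmetic on a tiny hand-made set of labels; the labels are illustrative only and are not taken from the cuisine data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny hand-made labels (illustrative only, not the cuisine data)
labels = ["indian", "korean", "thai"]
y_true = ["indian", "indian", "korean", "korean", "thai", "thai"]
y_hat = ["indian", "korean", "korean", "korean", "thai", "indian"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=labels)

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in the class

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

For the notebook's model, the same matrix would come from `confusion_matrix(y_test, y_pred)`.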

