## Cuisine Classifiers 3

In [2]:
import pandas as pd
cuisines_df = pd.read_csv("./data/cleaned_cuisines.csv")
cuisines_df.head()

Unnamed: 0.1,Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,indian,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [3]:
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()

0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object

In [4]:
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()

Unnamed: 0,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
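The `head()` output above still lists an `Unnamed: 0` column among the features. Columns with such names are usually row indices that an earlier `to_csv` call wrote out, so they are unlikely to be real ingredient features. The sketch below shows one possible way to drop every such column at once. The miniature DataFrame and the names `toy_df`, `index_like_cols`, `toy_labels` and `toy_features` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the cuisines table, including two leftover index columns.
toy_df = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2],
    "Unnamed: 0": [0, 1, 2],
    "cuisine": ["indian", "thai", "korean"],
    "almond": [0, 1, 0],
    "yogurt": [1, 0, 0],
})

# pandas names unlabeled CSV columns "Unnamed: N"; collect all of them.
index_like_cols = [col for col in toy_df.columns if col.startswith("Unnamed")]

# Labels are the cuisine column; features are everything that is neither
# a leftover index column nor the label itself.
toy_labels = toy_df["cuisine"]
toy_features = toy_df.drop(columns=index_like_cols + ["cuisine"])

print(list(toy_features.columns))
```

Whether the leftover column actually hurts the classifiers in this notebook is not something the outputs here can confirm; the sketch only shows how to exclude it.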


Scikit-learn offers a more granular cheat sheet for choosing estimators (scikit-learn's term for models such as classifiers):
https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/3-Classifiers-2/images/map.png

According to the map (cheat sheet), we can choose Linear SVC, since:
- we have more than 50 samples
- we want to predict a category
- we have labeled data
- we have fewer than 100k samples

If Linear SVC doesn't work, we can try the KNeighbors Classifier; if that doesn't work either, we can try SVC and Ensemble Classifiers.
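The path through the map described above can be written down as a small function. This is only a sketch of the branch discussed here, not of the whole cheat sheet; the function name `suggest_estimators` and its return strings are hypothetical.

```python
def suggest_estimators(n_samples, predicting_category, has_labels):
    """Return the estimators to try, in order, for the map branch described above."""
    if n_samples <= 50:
        # The branch above requires more than 50 samples.
        return ["not covered by this sketch: too few samples"]
    if not (predicting_category and has_labels):
        # Only the labeled classification branch is modeled here.
        return ["not covered by this sketch: not labeled classification"]
    if n_samples < 100_000:
        # Order of fallbacks as stated in the text.
        return ["Linear SVC", "KNeighbors Classifier", "SVC", "Ensemble Classifiers"]
    return ["not covered by this sketch: 100k samples or more"]


# Hypothetical sample count, in the same range as the dataset used here.
print(suggest_estimators(4000, predicting_category=True, has_labels=True))
```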

In [7]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
import numpy as np

# Split data in training and test
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
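The split above uses neither `random_state` nor `stratify`, so every run produces a different split and the class proportions in the test set can drift. The sketch below shows, on made-up data, how those two `train_test_split` parameters behave. The toy arrays and the names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr` and `y_te` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 4 binary features, 5 balanced classes.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 4))
y_toy = np.repeat(["chinese", "indian", "japanese", "korean", "thai"], 20)

# random_state makes the split repeatable; stratify keeps the class
# proportions of y_toy in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

labels, counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, counts)))
```

Adding these parameters to the notebook's own split would change which rows land in the test set, so the reports shown below would no longer match exactly.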

## Linear SVC classifier
Support Vector Classification (SVC) belongs to the Support Vector Machine (SVM) family of ML techniques. With this method you choose a 'kernel' that determines the shape of the boundary used to separate the classes. The 'C' parameter controls 'regularization': the strength of the regularization is inversely proportional to C, so a larger C penalizes misclassified training samples more heavily. The kernel can be one of several (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC); here we set it to 'linear' so that we get a linear SVC. Probability defaults to 'false'; here we set it to 'true' to gather probability estimates. We set the random state to '0' so that the data shuffling done while computing those probability estimates is reproducible.
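To see what `probability=True` adds, here is a minimal sketch on made-up two-class data. The blobs and the names `X_demo`, `y_demo`, `clf` and `proba` are hypothetical; only the `SVC` parameters mirror the ones discussed above.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two Gaussian blobs centered at (-2, -2) and (2, 2).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=-2.0, size=(30, 2)),
    rng.normal(loc=2.0, size=(30, 2)),
])
y_demo = np.array(["a"] * 30 + ["b"] * 30)

# Same settings as discussed above: linear kernel, C=10, probability estimates on.
clf = SVC(kernel="linear", C=10, probability=True, random_state=0)
clf.fit(X_demo, y_demo)

# One row per query point, one column per class, in the order of clf.classes_.
proba = clf.predict_proba([[-2.0, -2.0], [2.0, 2.0]])
print(clf.classes_)
print(proba.round(3))
```

Without `probability=True`, calling `predict_proba` is not available on an `SVC`; the price is a slower `fit`, because the probabilities are calibrated with an internal cross-validation.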

Start by creating a dictionary of classifiers. We will extend this dictionary progressively as we test.

In [10]:
C = 10
# Create different classifiers
classifiers = {
    'Linear SVC': SVC(kernel='linear', C=C, probability=True, random_state=0)
}

def trainAndShowResults(classifiersList):
    # Train each classifier in the given dictionary and print out a report
    for name, classifier in classifiersList.items():
        classifier.fit(X_train, np.ravel(y_train))

        # Predict on the held-out test set
        y_pred = classifier.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        # Despite the "(train)" label, this accuracy is measured on the test set
        print("Accuracy (train) for %s: %0.1f%%" % (name, accuracy * 100))
        print(classification_report(y_test, y_pred))

trainAndShowResults(classifiers)

Accuracy (train) for Linear SVC: 79.9%
              precision    recall  f1-score   support

     chinese       0.68      0.75      0.71       239
      indian       0.91      0.87      0.89       247
    japanese       0.78      0.77      0.77       222
      korean       0.87      0.75      0.81       242
        thai       0.78      0.85      0.81       249

    accuracy                           0.80      1199
   macro avg       0.80      0.80      0.80      1199
weighted avg       0.80      0.80      0.80      1199
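The columns of the report can be reproduced by hand. The sketch below computes precision, recall, f1-score and support for one class from two short, made-up label lists; `per_class_scores`, `y_true` and `y_hat` are hypothetical names, and the numbers have nothing to do with the cuisine results above.

```python
# Hypothetical true and predicted labels for six samples.
y_true = ["thai", "thai", "thai", "indian", "indian", "korean"]
y_hat = ["thai", "thai", "indian", "indian", "thai", "korean"]


def per_class_scores(true_labels, predicted_labels, label):
    """Return (precision, recall, f1, support) for a single class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support


for cuisine in ("indian", "korean", "thai"):
    print(cuisine, per_class_scores(y_true, y_hat, cuisine))
```

Precision answers "of the samples predicted as this cuisine, how many really were?", while recall answers "of the samples that really were this cuisine, how many were found?".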



## K-Neighbors classifier
K-neighbors is part of the "neighbors" family of ML methods, which can be used for both supervised and unsupervised learning. For classification, the method keeps the training samples and labels a new sample by a majority vote among its k nearest training samples, where k is a number chosen in advance.
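The voting idea can be sketched without scikit-learn. The code below is a bare-bones illustration using Euclidean distance, not the implementation behind `KNeighborsClassifier`; the tiny ingredient vectors and the names `train_X`, `train_y` and `knn_predict` are made up.

```python
from collections import Counter

import numpy as np

# Hypothetical training set: 4 binary ingredient flags per recipe.
train_X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])
train_y = ["indian", "indian", "indian", "thai", "thai"]


def knn_predict(query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    distances = np.linalg.norm(train_X - np.asarray(query), axis=1)
    nearest = np.argsort(distances, kind="stable")[:k]  # indices of the k closest rows
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict([0, 1, 1, 1]))
print(knn_predict([1, 1, 0, 0]))
```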

In [11]:
# could have just added to the classifiers above
classifiers = {
    'Linear SVC': SVC(kernel='linear', C=C, probability=True, random_state=0),
    # C (10) is reused here as the number of neighbors, not as a regularization parameter
    'KNN classifier': KNeighborsClassifier(n_neighbors=C)
}

trainAndShowResults(classifiers)

Accuracy (train) for Linear SVC: 79.9%
              precision    recall  f1-score   support

     chinese       0.68      0.75      0.71       239
      indian       0.91      0.87      0.89       247
    japanese       0.78      0.77      0.77       222
      korean       0.87      0.75      0.81       242
        thai       0.78      0.85      0.81       249

    accuracy                           0.80      1199
   macro avg       0.80      0.80      0.80      1199
weighted avg       0.80      0.80      0.80      1199

Accuracy (train) for KNN classifier: 75.0%
              precision    recall  f1-score   support

     chinese       0.64      0.78      0.70       239
      indian       0.87      0.83      0.85       247
    japanese       0.65      0.86      0.74       222
      korean       0.95      0.60      0.73       242
        thai       0.77      0.69      0.73       249

    accuracy                           0.75      1199
   macro avg       0.77      0.75      0.75      

KNN is a little worse than Linear SVC: 75.0% accuracy versus 79.9%.
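The comparison above rests on a single train/test split and a single neighbor count (10). The notebook imports `cross_val_score` without using it; the sketch below shows how it could be used to compare several neighbor counts. It runs on synthetic data from `make_classification`, so the scores say nothing about the cuisine dataset, and the names `X_cv`, `y_cv`, `mean_accuracy` and `best_k` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the cuisine features).
X_cv, y_cv = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Mean 5-fold cross-validated accuracy for a few candidate neighbor counts.
mean_accuracy = {}
for k in (1, 5, 10, 20):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_cv, y_cv, cv=5)
    mean_accuracy[k] = scores.mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
for k, acc in mean_accuracy.items():
    print("n_neighbors=%d: %0.3f" % (k, acc))
print("best:", best_k)
```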

## Support Vector classifier
Support Vector classifiers are part of the Support Vector Machine family of ML methods that are used for classification and regression tasks. SVMs "map training examples to points in space" to maximize the distance between two categories. New data is then mapped into the same space so that its category can be predicted.
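How that space is built depends on the kernel: `SVC()` with default settings uses an RBF kernel rather than a linear one. The sketch below contrasts the two on a synthetic dataset of concentric circles, where no straight line separates the classes. The data and the names `X_circ`, `y_circ`, `linear_acc` and `rbf_acc` are assumptions for illustration and unrelated to the cuisine data.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: an inner circle (class 1) inside an outer circle (class 0).
X_circ, y_circ = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_circ, y_circ, test_size=0.3, random_state=0
)

# A linear kernel can only draw a straight boundary in the original space.
linear_acc = SVC(kernel="linear").fit(Xc_train, yc_train).score(Xc_test, yc_test)

# The default RBF kernel can draw a closed, curved boundary.
rbf_acc = SVC().fit(Xc_train, yc_train).score(Xc_test, yc_test)

print("linear kernel accuracy: %0.2f" % linear_acc)
print("RBF kernel accuracy:    %0.2f" % rbf_acc)
```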

In [13]:
# could have just added to the classifiers above
classifiers = {
    'Linear SVC': SVC(kernel='linear', C=C, probability=True, random_state=0),
    'KNN classifier': KNeighborsClassifier(n_neighbors=C),
    'SVC': SVC()  # default settings: RBF kernel, C=1.0
}

trainAndShowResults(classifiers)

Accuracy (train) for Linear SVC: 79.9%
              precision    recall  f1-score   support

     chinese       0.68      0.75      0.71       239
      indian       0.91      0.87      0.89       247
    japanese       0.78      0.77      0.77       222
      korean       0.87      0.75      0.81       242
        thai       0.78      0.85      0.81       249

    accuracy                           0.80      1199
   macro avg       0.80      0.80      0.80      1199
weighted avg       0.80      0.80      0.80      1199

Accuracy (train) for KNN classifier: 75.0%
              precision    recall  f1-score   support

     chinese       0.64      0.78      0.70       239
      indian       0.87      0.83      0.85       247
    japanese       0.65      0.86      0.74       222
      korean       0.95      0.60      0.73       242
        thai       0.77      0.69      0.73       249

    accuracy                           0.75      1199
   macro avg       0.77      0.75      0.75      

That's pretty good.

## Ensemble Classifiers
Let's try some 'Ensemble Classifiers', specifically Random Forest and AdaBoost.
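An ensemble combines the predictions of many simple models. For a random forest in scikit-learn, the class probabilities of the forest are the average of the probabilities predicted by its individual trees. The sketch below checks this on synthetic data; the data and the names `X_rf`, `y_rf`, `forest`, `tree_probas` and `averaged` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the cuisine features).
X_rf, y_rf = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_rf, y_rf)

# Ask every tree for its class probabilities on the first five samples...
tree_probas = np.stack([tree.predict_proba(X_rf[:5]) for tree in forest.estimators_])

# ...and average over the trees (axis 0).
averaged = tree_probas.mean(axis=0)

print(np.allclose(averaged, forest.predict_proba(X_rf[:5])))
```

AdaBoost combines its models differently: it trains them one after another and weights them, so this averaging check applies to the random forest only.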

In [None]:
classifiers = {
    # Random forest: decision trees infused with randomness to avoid overfitting
    # n_estimators = number of trees
    'RFST': RandomForestClassifier(n_estimators=100),
    'ADA': AdaBoostClassifier(n_estimators=100)
}
trainAndShowResults(classifiers)

Accuracy (train) for RFST: 84.9%
              precision    recall  f1-score   support

     chinese       0.80      0.79      0.79       239
      indian       0.93      0.90      0.91       247
    japanese       0.85      0.82      0.83       222
      korean       0.89      0.83      0.86       242
        thai       0.80      0.91      0.85       249

    accuracy                           0.85      1199
   macro avg       0.85      0.85      0.85      1199
weighted avg       0.85      0.85      0.85      1199

Accuracy (train) for ADA: 69.8%
              precision    recall  f1-score   support

     chinese       0.62      0.64      0.63       239
      indian       0.81      0.85      0.83       247
    japanese       0.53      0.76      0.63       222
      korean       0.82      0.69      0.75       242
        thai       0.79      0.55      0.65       249

    accuracy                           0.70      1199
   macro avg       0.72      0.70      0.70      1199
weighted avg