# Build Classification Models

1. Working in this lesson's notebook.ipynb file, import the cleaned cuisines dataset along with the Pandas library:

In [13]:
import pandas as pd
cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
cuisines_df.head()

,Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,indian,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np

2. Divide the data into X (features) and y (labels) for training. The cuisine column can serve as the labels:

In [15]:
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()

0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object

3. Drop the Unnamed: 0 column and the cuisine column by calling drop(). Save the rest of the data as trainable features:

In [16]:
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()

,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
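
The same separation of labels and features can be tried on a tiny frame before running it on the full dataset. This is a minimal sketch, assuming a made-up three-row DataFrame; `toy_df` and its values are hypothetical and are not the lesson's data.

```python
import pandas as pd

# Hypothetical stand-in for cleaned_cuisines.csv: one leftover index column,
# one label column, and two one-hot ingredient columns.
toy_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "cuisine": ["indian", "thai", "indian"],
    "almond": [0, 1, 0],
    "yogurt": [1, 0, 0],
})

# Labels: the single column the model should learn to predict.
toy_labels = toy_df["cuisine"]

# Features: everything except the leftover index column and the label.
toy_features = toy_df.drop(columns=["Unnamed: 0", "cuisine"])

print(list(toy_features.columns))           # ['almond', 'yogurt']
print(toy_labels.value_counts().to_dict())  # {'indian': 2, 'thai': 1}
```

Checking `value_counts()` on the labels is a quick way to see how balanced the classes are before training.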


We will be using Scikit-learn to analyze our data. However, there are many ways to use logistic regression in Scikit-learn. Take a look at the parameters to pass.

Essentially, there are two important parameters, multi_class and solver, that we need to specify when we ask Scikit-learn to perform a logistic regression. The multi_class value selects how the model handles more than two classes. The solver value selects the optimization algorithm. Not all solvers can be paired with all multi_class values.

According to the docs, in the multiclass case, the training algorithm:

- Uses the one-vs-rest (OvR) scheme, if the multi_class option is set to ovr
- Uses the cross-entropy loss, if the multi_class option is set to multinomial. (Currently the multinomial option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)

🎓 The 'scheme' here can either be 'ovr' (one-vs-rest) or 'multinomial'. Since logistic regression is really designed to support binary classification, these schemes allow it to better handle multiclass classification tasks.

🎓 The 'solver' is defined as "the algorithm to use in the optimization problem".
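
To make the difference between the two schemes concrete, here is a minimal sketch on synthetic data (not the cuisines dataset). It assumes that wrapping a liblinear model in `OneVsRestClassifier` is an acceptable stand-in for `multi_class='ovr'`, and that `LogisticRegression` with the lbfgs solver fits a multinomial model on multiclass data by default. Recent scikit-learn releases deprecate the multi_class argument, so the sketch avoids it; check this against your installed version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class data, used only for illustration.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# One-vs-rest: one binary liblinear classifier is trained per class.
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear")).fit(X, y)

# Multinomial: a single model trained with cross-entropy loss over all classes.
# lbfgs is one of the solvers that supports this.
multinomial = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)

print(len(ovr.estimators_))     # number of binary models, one per class
print(multinomial.coef_.shape)  # one row of weights per class
print(ovr.score(X, y), multinomial.score(X, y))
```

Both approaches produce one probability per class for each row; they differ in whether the classes are fitted separately or jointly.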

The Scikit-learn documentation includes a table that explains how solvers handle the challenges presented by different kinds of data structures.

We can focus on logistic regression for our first training trial, since you learned about it in a previous lesson. Split your data into training and testing groups by calling train_test_split():

In [17]:
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
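
The call above shuffles differently on every run, so the accuracy you see will vary slightly. As a minimal sketch on hypothetical toy data, the example below shows two optional arguments of `train_test_split`: `random_state` to make the split repeatable and `stratify` to keep class proportions similar in both groups. Whether to use them in this lesson is a judgment call; the notebook itself does not.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features, two balanced classes.
X = np.arange(20).reshape(10, 2)
y = np.array(["thai"] * 5 + ["indian"] * 5)

# test_size=0.3 holds out 30% of the rows for testing.
# random_state fixes the shuffle so the split is repeatable.
# stratify=y keeps the class proportions similar in both groups.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```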

Since you are using the multiclass case, you need to choose what scheme to use and what solver to set. Use LogisticRegression with a multiclass setting and the liblinear solver to train.

1. Create a logistic regression with multi_class set to ovr and the solver set to liblinear:

In [18]:
lr = LogisticRegression(multi_class='ovr', solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))

accuracy = model.score(X_test, y_test)
print("Accuracy is {}".format(accuracy))

Accuracy is 0.8081734778982486
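
For a classifier, `score()` reports the fraction of test rows whose predicted label matches the true label. The sketch below illustrates that on synthetic binary data (an assumption made to keep the example small; it is not the cuisines data) by comparing `score()` with `accuracy_score` and a manual calculation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)
predictions = clf.predict(X_te)

via_score = clf.score(X_te, y_te)               # built-in shortcut
via_metric = accuracy_score(y_te, predictions)  # same number from sklearn.metrics
manual = (predictions == y_te).mean()           # fraction of correct predictions

print(via_score, via_metric, manual)
```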


2. You can see this model in action by testing one row of data (#50):

In [19]:
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')

ingredients: Index(['bread', 'chicken', 'egg', 'pepper', 'vegetable_oil', 'wheat'], dtype='object')
cuisine: japanese


3. Digging deeper, you can check how confident the model is in this prediction by listing the predicted probability for each cuisine:

In [20]:
# Select row 50 as a one-row DataFrame so the feature names are preserved
test = X_test.iloc[[50]]
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)

# Transpose so each cuisine is a row, then sort by probability (column 0)
topPrediction = resultdf.T.sort_values(by=0, ascending=False)
topPrediction.head()



,0
chinese,0.436315
japanese,0.337605
thai,0.171379
korean,0.039587
indian,0.015114
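
Reading the table takes a moment, so here is a small sketch in plain Python that does the comparison explicitly. The probabilities are copied from the output above, and the true label is the one printed for row 50. It assumes the predicted class is the one with the highest probability.

```python
# Probabilities copied from the output above (rounded to six decimals).
probabilities = {
    "chinese": 0.436315,
    "japanese": 0.337605,
    "thai": 0.171379,
    "korean": 0.039587,
    "indian": 0.015114,
}
actual = "japanese"  # label printed for test row 50

# The predicted class is assumed to be the one with the highest probability.
predicted = max(probabilities, key=probabilities.get)
ranked = sorted(probabilities, key=probabilities.get, reverse=True)

print(predicted)                 # chinese
print(ranked.index(actual) + 1)  # 2, the true cuisine is the second guess
print(predicted == actual)       # False
```

For this row the model's best guess is chinese, while the true cuisine, japanese, comes second. This is one of the misclassifications behind the accuracy of about 0.81.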


4. Get more detail by printing a classification report, as you did in the regression lessons:

In [21]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     chinese       0.71      0.70      0.70       218
      indian       0.95      0.92      0.93       252
    japanese       0.65      0.82      0.72       227
      korean       0.90      0.81      0.85       250
        thai       0.87      0.78      0.82       252

    accuracy                           0.81      1199
   macro avg       0.81      0.81      0.81      1199
weighted avg       0.82      0.81      0.81      1199
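
The columns of the report can be reproduced by hand. The sketch below computes precision, recall, and F1 per class in plain Python from hypothetical labels (six made-up dishes, not the lesson's test set). The helper `per_class_scores` is illustrative and is not part of Scikit-learn.

```python
from collections import Counter

# Hypothetical true and predicted labels for six dishes.
y_true = ["thai", "thai", "thai", "indian", "indian", "korean"]
y_pred = ["thai", "thai", "indian", "indian", "indian", "thai"]

def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correct hits
    fp = sum(t != label and p == label for t, p in pairs)  # false alarms
    fn = sum(t == label and p != label for t, p in pairs)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Support is the number of true rows per class.
support = Counter(y_true)
for label in sorted(support):
    p, r, f = per_class_scores(y_true, y_pred, label)
    print(f"{label:>8}  {p:.2f}  {r:.2f}  {f:.2f}  {support[label]}")
```

In the report, macro avg is the unweighted mean of the per-class values, while weighted avg weights each class by its support.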

