# Build Classification Models

We now load the data that was cleaned in the previous section.

In [46]:
import pandas as pd
cuisines_df = pd.read_csv('../data/cleaned_cuisines.csv')
cuisines_df.head()

Unnamed: 0.1,Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,indian,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()

0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object

Next, extract the feature data by dropping the index and label columns.

In [48]:
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()

Unnamed: 0,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


# Choosing your classifier

Scikit-learn offers a wide range of classification algorithms, including:

* Linear Models
* Support Vector Machines
* Stochastic Gradient Descent
* Nearest Neighbors
* Gaussian Processes
* Decision Trees
* Ensemble methods (voting Classifier)
* Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)

Which one should we choose? One approach is to try each of them and keep the classifier that best matches the real data. Another is to consult a summary that others have already compiled, such as the [ML Cheat sheet](https://docs.microsoft.com/zh-cn/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-15963-cxa).
Let's see if we can reason our way through different approaches given the constraints we have:

* Neural networks are too heavy. Given our clean, but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task.
* No two-class classifier. We do not use a two-class classifier, so that rules out one-vs-all.
* Decision tree or logistic regression could work. A decision tree might work, or logistic regression for multiclass data.
* Multiclass Boosted Decision Trees solve a different problem. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.
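
As a complement to reasoning through the options, the "try each one" approach can be sketched as below. This is a minimal sketch, not the notebook's own code: the data is a synthetic stand-in produced by `make_classification` (an assumption, so the block runs without the CSV file), and the choice of candidates and their settings is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cuisine features and labels (an assumption,
# so the sketch runs without ../data/cleaned_cuisines.csv).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=10,
    n_classes=5, random_state=0,
)

# Candidate classifiers; the selection and settings are illustrative only.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVC": SVC(kernel="linear"),
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
mean_scores = {
    name: cross_val_score(clf, X_demo, y_demo, cv=5).mean()
    for name, clf in candidates.items()
}

# Print the candidates from best to worst mean accuracy.
for name, score in sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `cuisines_feature_df` and `cuisines_label_df`. Cross-validation averages over several splits, so the ranking depends less on one lucky or unlucky train/test split.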

Here we decide to train on our data with logistic regression.

In [49]:
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
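
The split above is random, so the accuracy changes slightly from run to run. A possible refinement, shown here only as a sketch on hypothetical toy data, is to pass `random_state` for a reproducible split and `stratify` to keep the class proportions the same in the training and test parts. The names `demo_features` and `demo_labels` are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the feature and label frames (hypothetical data).
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(
    rng.integers(0, 2, size=(100, 4)),
    columns=["almond", "cumin", "yam", "yogurt"],
)
demo_labels = pd.Series(
    ["indian", "thai", "korean", "chinese", "japanese"] * 20,
    name="cuisine",
)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_features, demo_labels,
    test_size=0.3,
    random_state=42,       # fixed seed: the same split on every run
    stratify=demo_labels,  # keep class proportions equal in both parts
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts())
```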

Train a `LogisticRegression` model with the multiclass setting `multi_class='ovr'` (one-vs-rest) and the `liblinear` solver.

In [50]:
lr = LogisticRegression(multi_class='ovr',solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))

accuracy = model.score(X_test, y_test)
print("Accuracy is {}".format(accuracy))

Accuracy is 0.8040033361134279
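
Recent scikit-learn releases deprecate the `multi_class` argument of `LogisticRegression`, so the cell above may emit a warning depending on the installed version. One way to express the same one-vs-rest idea explicitly is the `OneVsRestClassifier` wrapper. The sketch below uses synthetic data as a stand-in (an assumption), and its accuracy is not comparable to the figure reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic five-class data standing in for the cuisine dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=10,
    n_classes=5, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# One binary liblinear logistic regression is fitted per class.
ovr_model = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr_model.fit(X_tr, y_tr)

ovr_accuracy = ovr_model.score(X_te, y_te)
print(f"Accuracy is {ovr_accuracy:.3f}")
print(f"Binary estimators fitted: {len(ovr_model.estimators_)}")
```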


We can test the model on a single row of the test set. Here `.iloc[50]` selects the row at position 50 (counting from 0).

In [51]:
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')

ingredients: Index(['cilantro', 'cinnamon', 'coriander', 'cumin', 'lemon_juice',
       'olive_oil'],
      dtype='object')
cuisine: indian


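Going one step further, we can look at the probability the model assigns to each cuisine for that same row. Here `ascending = [False]` sorts the values in descending order, from largest to smallest.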

In [52]:
test = X_test.iloc[50].values.reshape(1, -1)
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)

topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
topPrediction.head()



Unnamed: 0,0
indian,0.959354
thai,0.027043
japanese,0.007646
chinese,0.005537
korean,0.00042
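
The same ranking can be built a little more directly. The sketch below is an alternative under stated assumptions, not the notebook's method: it trains a small model on synthetic data with made-up feature names, selects one row with double brackets so it stays a one-row DataFrame (no reshape needed), and sorts the probabilities as a `pd.Series` indexed by `classes_`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical feature names and cuisine labels.
X_raw, y_idx = make_classification(
    n_samples=300, n_features=12, n_informative=6,
    n_classes=5, random_state=0,
)
cuisine_names = np.array(["chinese", "indian", "japanese", "korean", "thai"])
X_demo = pd.DataFrame(X_raw, columns=[f"f{i}" for i in range(12)])
y_demo = cuisine_names[y_idx]

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# iloc[[50]] (double brackets) returns a one-row DataFrame.
row = X_demo.iloc[[50]]
proba = demo_model.predict_proba(row)

# One probability per class, sorted from most to least likely.
ranked = pd.Series(proba[0], index=demo_model.classes_).sort_values(ascending=False)
print(ranked.head())
```

The probabilities for a row sum to 1, and the top entry is the class that `predict` returns for that row.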


Finally, we can look at the precision, recall, and F1 statistics of the predictions.

In [53]:
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

     chinese       0.78      0.72      0.75       242
      indian       0.90      0.86      0.88       247
    japanese       0.71      0.80      0.75       230
      korean       0.86      0.79      0.82       246
        thai       0.78      0.85      0.82       234

    accuracy                           0.80      1199
   macro avg       0.81      0.80      0.80      1199
weighted avg       0.81      0.80      0.80      1199
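
The report summarizes each class in a few numbers. A confusion matrix shows which classes get mistaken for which. The sketch below uses `confusion_matrix`, which is already imported earlier in the notebook, on a handful of hypothetical labels standing in for `y_test` and `y_pred`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, standing in for y_test and y_pred.
y_true_demo = ["indian", "thai", "thai", "korean", "indian", "korean"]
y_pred_demo = ["indian", "thai", "indian", "korean", "indian", "thai"]
labels = ["indian", "korean", "thai"]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

cm_df = pd.DataFrame(cm, index=labels, columns=labels)
cm_df.index.name = "true"
cm_df.columns.name = "predicted"
print(cm_df)
```

Each diagonal entry divided by its row sum is the recall of that class, and divided by its column sum it is the precision, which ties the matrix back to the report above.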

