# Model Comparison Lab

In this lab we will compare the performance of all the models we have learned about so far, using the car evaluation dataset.

## 1. Prepare the data

The [car evaluation dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/car/) is in the assets/datasets folder. By now you should be very familiar with this dataset.

1. Load the data into a pandas DataFrame.
2. Encode the categorical features properly: define a map that preserves the scale (assigning smaller numbers to words indicating smaller quantities).
3. Separate the features from the target into X and y.
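
The scale-preserving map described in the steps above could look roughly like the sketch below. The column names come from `car.csv` as shown in this lab; the integer codes, the `ORDINAL_MAPS` and `TARGET_MAP` dictionaries, the `encode_ordinal` helper, and the ordering of the target labels (`unacc` < `acc` < `good` < `vgood`) are assumptions made for illustration, and the tiny `sample` frame only stands in for the real file.

```python
import pandas as pd

# Hypothetical ordinal maps: smaller numbers stand for smaller quantities.
ORDINAL_MAPS = {
    'buying':   {'low': 0, 'med': 1, 'high': 2, 'vhigh': 3},
    'maint':    {'low': 0, 'med': 1, 'high': 2, 'vhigh': 3},
    'doors':    {'2': 2, '3': 3, '4': 4, '5more': 5},
    'persons':  {'2': 2, '4': 4, 'more': 5},
    'lug_boot': {'small': 0, 'med': 1, 'big': 2},
    'safety':   {'low': 0, 'med': 1, 'high': 2},
}
# Assumed ordering of the target labels, from worst to best.
TARGET_MAP = {'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3}


def encode_ordinal(df):
    """Return (X, y) with every categorical column mapped to an ordered integer."""
    X = pd.DataFrame(index=df.index)
    for column, mapping in ORDINAL_MAPS.items():
        # Cast to str first so numeric-looking values such as 2 match the keys.
        X[column] = df[column].astype(str).map(mapping)
    y = df['acceptability'].map(TARGET_MAP)
    return X, y


# Tiny hand-made sample in the same layout as car.csv, for illustration only.
sample = pd.DataFrame({
    'buying': ['vhigh', 'low'], 'maint': ['vhigh', 'med'],
    'doors': ['2', '5more'], 'persons': ['2', 'more'],
    'lug_boot': ['small', 'big'], 'safety': ['low', 'high'],
    'acceptability': ['unacc', 'vgood'],
})
X_ord, y_ord = encode_ordinal(sample)
print(X_ord)
print(y_ord)
```

Any value missing from a map becomes `NaN`, so checking `X_ord.isnull().any()` after encoding is a cheap way to catch a misspelled category.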

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import pandas as pd
df = pd.read_csv('/Users/generalassembly/Documents/repo-DC-DSI-3/DC-DSI-3/curriculum/06-week/6.05-random-forests/assets/datasets/car.csv')
df.head()


Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,acceptability
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [3]:
df.isnull().any()

buying           False
maint            False
doors            False
persons          False
lug_boot         False
safety           False
acceptability    False
dtype: bool

In [4]:
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['acceptability'])
X = pd.get_dummies(df.drop('acceptability', axis=1))

X.head() # did it work?


Unnamed: 0,buying_high,buying_low,buying_med,buying_vhigh,maint_high,maint_low,maint_med,maint_vhigh,doors_2,doors_3,...,doors_5more,persons_2,persons_4,persons_more,lug_boot_big,lug_boot_med,lug_boot_small,safety_high,safety_low,safety_med
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


In [5]:
y

array([2, 2, 2, ..., 2, 1, 3])

## 2. Useful preparation

Since we will compare several models, let's write a couple of helper functions.

1. Separate X and y between a train and test set, using 30% test set, random state = 42
    - make sure that the data is shuffled and stratified
2. Define a function called `evaluate_model`, that trains the model on the train set, tests it on the test, calculates:
    - accuracy score
    - confusion matrix
    - classification report
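
One possible shape for the two helpers described above is sketched here. The function body, its return values, and the synthetic data from `make_classification` are assumptions so that the example runs without `car.csv`; on the real data, `X` and `y` from section 1 would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit on the train set, predict on the test set, print and return the metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    print(acc)
    print(cm)
    print(report)
    return acc, cm, report


# Synthetic stand-in data (an assumption, not the car dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, n_informative=4,
                                     n_classes=3, random_state=42)

# 30% test set, shuffled and stratified on the target, random_state=42.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=42, shuffle=True,
                                          stratify=y_demo)

acc, cm, report = evaluate_model(KNeighborsClassifier(), X_tr, X_te, y_tr, y_te)
```

Returning the metrics as well as printing them makes it easy to collect the accuracy of each model for the comparison at the end of the lab.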

In [6]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# sklearn.cross_validation and sklearn.grid_search were merged into sklearn.model_selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn import metrics


# STEP 1: split X and y into training and testing sets (using random_state for reproducibility)
# Note: the split is shuffled by default but not stratified, because stratify=y is not passed.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

In [None]:
# print(metrics.accuracy_score(y_test, y_pred_tree))


## 3.a KNN

Let's start with `KNeighborsClassifier`.

1. Initialize a KNN model.
2. Evaluate its performance with the function you previously defined.
3. Find the optimal value of K using grid search.
    - Be careful about how you perform the cross validation in the grid search.
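
The warning about cross validation matters because an integer `cv` in `GridSearchCV` builds stratified folds without shuffling, which can be misleading when the rows of a file are ordered, as they appear to be in `car.csv`. A minimal sketch of one way around this is to pass an explicit shuffled `StratifiedKFold`. The synthetic data, the sorting step that mimics an ordered file, and the `random_state` value are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the car dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, n_informative=4,
                                     n_classes=3, random_state=42)

# Sort the rows by class to mimic a file whose rows are stored in order.
order = y_demo.argsort()
X_demo, y_demo = X_demo[order], y_demo[order]

# Shuffled, stratified folds instead of a plain integer cv.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

knn_search = GridSearchCV(KNeighborsClassifier(),
                          param_grid={'n_neighbors': [1, 2, 5, 10, 15]},
                          cv=cv)
knn_search.fit(X_demo, y_demo)

print(knn_search.best_params_)
print(knn_search.best_score_)
```

Fitting the search on the training split only, rather than on all of `X` and `y`, keeps the test set untouched for the final evaluation.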

In [54]:
score_total = []

In [59]:
x_axis=[]

In [68]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# make an instance of a KNeighborsClassifier object
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
y_pred_knn = knn.predict(X_test)
score_knn = metrics.accuracy_score(y_test, y_pred_knn)
print(score_knn)
print(classification_report(y_test, y_pred_knn))
confusion_matrix(y_test, y_pred_knn)




0.888246628131
             precision    recall  f1-score   support

          0       0.79      0.78      0.79       118
          1       0.56      0.47      0.51        19
          2       0.93      0.97      0.95       358
          3       0.92      0.50      0.65        24

avg / total       0.89      0.89      0.88       519



array([[ 92,   3,  23,   0],
       [  7,   9,   2,   1],
       [ 10,   0, 348,   0],
       [  7,   4,   1,  12]])

In [69]:
from sklearn.model_selection import GridSearchCV  # replaces the removed sklearn.grid_search module
n_neighbors = [1,2,5,10,15]


grid = GridSearchCV(estimator=knn, param_grid=dict(n_neighbors=n_neighbors), cv=5)

grid.fit(X, y)

# find the best parameters of our gridsearch model.
grid.best_params_
print(grid)
## Summarize the Results of the Grid Search
x=grid.best_score_
print(x)
print(grid.best_estimator_.n_neighbors)






GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 5, 10, 15]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.725115740741
5


## 3.b Bagging + KNN

Now that we have found the optimal K, let's wrap `KNeighborsClassifier` in a BaggingClassifier and see if the score improves.

1. Wrap the KNN model in a Bagging Classifier.
2. Evaluate performance.
3. Do a grid search only on the bagging classifier params.
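
A grid search restricted to the bagging parameters, leaving the wrapped KNN fixed, might be set up as in the sketch below. The grid values, the synthetic data, and the 3-fold shuffled split are assumptions chosen to keep the example small. The base model is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the car dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, n_informative=4,
                                     n_classes=3, random_state=42)

# KNN keeps a fixed K; only the bagging wrapper is tuned.
bagged_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5), random_state=42)

# Float values are fractions of the samples / features drawn for each base model.
bagging_grid = {
    'n_estimators': [5, 10],
    'max_samples': [0.5, 1.0],
    'max_features': [0.5, 1.0],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
bag_search = GridSearchCV(bagged_knn, param_grid=bagging_grid, cv=cv)
bag_search.fit(X_demo, y_demo)

print(bag_search.best_params_)
print(bag_search.best_score_)
```

Integer values for `max_samples` and `max_features` are read as absolute counts, so a grid such as `[10, 20, 30]` trains each base model on only 10 to 30 rows.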

In [30]:
from sklearn.ensemble import BaggingClassifier

knn = KNeighborsClassifier(n_neighbors=5)
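# The base model is passed positionally because its keyword name differs across
# scikit-learn versions (base_estimator in older releases, estimator in newer ones).
bagging = BaggingClassifier(knn)

bagging.fit(X_train,y_train)
y_pred_bagging = bagging.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_bagging))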
print(classification_report(y_test, y_pred_bagging))
confusion_matrix(y_test, y_pred_bagging)




0.894026974952
             precision    recall  f1-score   support

          0       0.79      0.78      0.79       118
          1       0.47      0.37      0.41        19
          2       0.94      0.98      0.96       358
          3       0.87      0.54      0.67        24

avg / total       0.89      0.89      0.89       519



array([[ 92,   5,  20,   1],
       [ 10,   7,   1,   1],
       [  6,   0, 352,   0],
       [  8,   3,   0,  13]])

In [81]:
max_f= [1,2,3]
max_vals = [10,20,30]
grid = GridSearchCV(estimator=bagging, param_grid=dict(max_samples=max_vals,max_features = max_f), cv=5)

grid.fit(X, y)

# find the best parameters of our gridsearch model.
grid.best_params_
grid.best_score_
print(grid)




GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10,...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_features': [1, 2, 3], 'max_samples': [10, 20, 30]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)


## 4. Logistic Regression

Let's see if logistic regression performs better.

1. Initialize LR and test on the train/test set.
2. Find the optimal params with grid search.
3. See if bagging improves the score.

In [70]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


# make an instance of a LogisticRegression object
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
y_pred_lg = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_lg))
print(classification_report(y_test, y_pred_lg))
confusion_matrix(y_test, y_pred_lg)



0.870905587669
             precision    recall  f1-score   support

          0       0.69      0.81      0.74       118
          1       0.36      0.21      0.27        19
          2       0.95      0.97      0.96       358
          3       1.00      0.25      0.40        24

avg / total       0.87      0.87      0.86       519



In [40]:
Cs= [1,2,3,4,5,6,7,8,9]
penalty= ['l1','l2']
grid = GridSearchCV(estimator=logreg, param_grid=dict(C=Cs,penalty=penalty))

grid.fit(X, y)

# find the best parameters of our gridsearch model.
grid.best_params_
print(grid)


GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)


In [38]:
logreg.get_params().keys()


['warm_start',
 'C',
 'n_jobs',
 'verbose',
 'intercept_scaling',
 'fit_intercept',
 'max_iter',
 'penalty',
 'multi_class',
 'random_state',
 'dual',
 'tol',
 'solver',
 'class_weight']

In [71]:
from sklearn.ensemble import BaggingClassifier

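# Base model passed positionally: the keyword is base_estimator or estimator depending on the scikit-learn version.
bagging = BaggingClassifier(logreg)

bagging.fit(X_train,y_train)
y_pred_bagging = bagging.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_bagging))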
print(classification_report(y_test, y_pred_bagging))
confusion_matrix(y_test, y_pred_bagging)



0.863198458574
             precision    recall  f1-score   support

          0       0.67      0.81      0.73       118
          1       0.22      0.11      0.14        19
          2       0.95      0.97      0.96       358
          3       1.00      0.12      0.22        24

avg / total       0.86      0.86      0.85       519



## 5. Decision Trees

Let's see if Decision Trees perform better.

1. Initialize DT and test on the train/test set.
2. Find the optimal params with grid search.
3. See if bagging improves the score.

In [72]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# make an instance of a DecisionTreeClassifier object
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
y_pred_dtree = dtree.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_dtree))
print(classification_report(y_test, y_pred_dtree))
confusion_matrix(y_test, y_pred_dtree)




0.957610789981
             precision    recall  f1-score   support

          0       0.95      0.89      0.92       118
          1       0.71      0.89      0.79        19
          2       0.99      0.99      0.99       358
          3       0.83      0.79      0.81        24

avg / total       0.96      0.96      0.96       519



In [44]:
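criterion = ['entropy', 'gini']
max_features = [0.5, 0.7, 0.9]
# NOTE: max_depth is documented as an integer number of levels (or None); these
# fractional values are kept as originally run but are not meaningful depths.
max_depth = [0.5, 0.7, 0.9]
grid = GridSearchCV(estimator=dtree, param_grid=dict(criterion=criterion, max_depth=max_depth, max_features=max_features))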

grid.fit(X, y)

# find the best parameters of our gridsearch model.
grid.best_params_
print(grid)


GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_features': [0.5, 0.7, 0.9], 'criterion': ['entropy', 'gini'], 'max_depth': [0.5, 0.7, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)


In [73]:
from sklearn.ensemble import BaggingClassifier

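# Base model passed positionally: the keyword is base_estimator or estimator depending on the scikit-learn version.
bagging = BaggingClassifier(dtree)

bagging.fit(X_train,y_train)
y_pred_bagging = bagging.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_bagging))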
print(classification_report(y_test, y_pred_bagging))
confusion_matrix(y_test, y_pred_bagging)




0.953757225434
             precision    recall  f1-score   support

          0       0.91      0.91      0.91       118
          1       0.68      0.89      0.77        19
          2       0.99      0.98      0.99       358
          3       0.83      0.79      0.81        24

avg / total       0.96      0.95      0.95       519



In [48]:
max_f= [1,2,3]
max_vals = [10,20,30]
grid = GridSearchCV(estimator=bagging, param_grid=dict(max_samples=max_vals,max_features = max_f), cv=5)

grid.fit(X, y)

# find the best parameters of our gridsearch model.
grid.best_params_
print(grid)


GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, ...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_features': [1, 2, 3], 'max_samples': [10, 20, 30]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)


## 7. Random Forest & Extra Trees

Let's see if Random Forest and Extra Trees perform better.

1. Initialize RF and ET and test on the train/test set.
2. Find the optimal params with grid search.

In [74]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier

# make an instance of a RandomForestClassifier object
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(X_train,y_train)
y_pred_rf = rf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
confusion_matrix(y_test, y_pred_rf)




0.924855491329
             precision    recall  f1-score   support

          0       0.84      0.90      0.87       118
          1       0.55      0.84      0.67        19
          2       0.99      0.97      0.98       358
          3       0.91      0.42      0.57        24

avg / total       0.93      0.92      0.92       519



In [51]:
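criterion = ['entropy', 'gini']
max_features = [0.5, 0.7, 0.9]
# NOTE: max_depth is documented as an integer number of levels (or None); these
# fractional values are kept as originally run but are not meaningful depths.
max_depth = [0.5, 0.7, 0.9]
grid = GridSearchCV(estimator=rf, param_grid=dict(criterion=criterion, max_depth=max_depth, max_features=max_features))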

grid.fit(X, y)

# find the best parameters of our gridsearch model.
grid.best_params_
print(grid)


GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_features': [0.5, 0.7, 0.9], 'criterion': ['entropy', 'gini'], 'max_depth': [0.5, 0.7, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)


In [75]:
from sklearn.ensemble import BaggingClassifier

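# Base model passed positionally: the keyword is base_estimator or estimator depending on the scikit-learn version.
bagging = BaggingClassifier(rf)

bagging.fit(X_train,y_train)
y_pred_bagging = bagging.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_bagging))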
print(classification_report(y_test, y_pred_bagging))
confusion_matrix(y_test, y_pred_bagging)




0.946050096339
             precision    recall  f1-score   support

          0       0.87      0.92      0.89       118
          1       0.72      0.68      0.70        19
          2       0.98      0.99      0.99       358
          3       0.94      0.67      0.78        24

avg / total       0.95      0.95      0.94       519



In [53]:
max_f= [1,2,3]
max_vals = [10,20,30]
grid = GridSearchCV(estimator=bagging, param_grid=dict(max_samples=max_vals,max_features = max_f), cv=5)

grid.fit(X, y)

# find the best parameters of our gridsearch model.
grid.best_params_
print(grid)


GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10,...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_features': [1, 2, 3], 'max_samples': [10, 20, 30]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)


## Extra Trees

In [77]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier

# make an instance of an ExtraTreesClassifier object
et = ExtraTreesClassifier(n_jobs=-1)
et.fit(X_train,y_train)
y_pred_et = et.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_et))
print(classification_report(y_test, y_pred_et))
confusion_matrix(y_test, y_pred_et)


0.961464354528
             precision    recall  f1-score   support

          0       0.93      0.92      0.93       118
          1       0.71      0.89      0.79        19
          2       0.99      0.99      0.99       358
          3       0.86      0.79      0.83        24

avg / total       0.96      0.96      0.96       519



array([[109,   5,   2,   2],
       [  1,  17,   0,   1],
       [  4,   0, 354,   0],
       [  3,   2,   0,  19]])

## 8. Model comparison

Let's compare the scores of the various models.

1. Make a bar chart of the scores of the best models. Which model wins on the train/test split?
2. Re-test all the models using 3-fold stratified, shuffled cross validation.
3. Make a bar chart with error bars of the average cross-validation scores. Is the winner the same?
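
The cross-validated comparison in the last two steps could be sketched as follows. The model settings, the synthetic data, and the `summary` table layout are assumptions for illustration; on the real data, `X` and `y` would replace `X_demo` and `y_demo`, and the tuned models from the earlier sections would replace the defaults used here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the car dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, n_informative=4,
                                     n_classes=3, random_state=42)

models = {
    'knn': KNeighborsClassifier(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'DecisionTrees': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'ExtraTreesClassifier': ExtraTreesClassifier(n_estimators=50, random_state=42),
}

# 3-fold stratified, shuffled cross validation shared by every model.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

rows = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    rows[name] = {'mean': scores.mean(), 'std': scores.std()}

# One row per model, with the mean score and its standard deviation across folds.
summary = pd.DataFrame(rows).T
print(summary)

# Bar chart of the mean scores, with the fold-to-fold spread as error bars.
ax = summary['mean'].plot(kind='bar', yerr=summary['std'], capsize=4)
ax.set_ylabel('accuracy')
```

Using the same `cv` object for every model means all of them are scored on identical folds, so differences in the bars reflect the models rather than the splits.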


In [78]:
score_total=[0.888246628131,0.725115740741,0.894026974952,0.870905587669,0.867052023121,0.959537572254,
             0.967244701349,0.924855491329,0.942196531792,0.961464354528]
x_axis = ['knn','knn_gridsearch','knn_bagging','Logistic Regression','bagging_Logistic Regression','DecisionTrees',
          'bagging_tree','Random Forest','bagging_rf','ExtraTreesClassifier']







In [None]:
result = pd.Series(score_total, index=x_axis)  # scores indexed by model name
result.plot(kind='bar');

## Bonus

We have encoded the data using a map that preserves the scale.
Would our results have changed if we had encoded the categorical data as binary variables using `pd.get_dummies` or `OneHotEncoder` instead?

1. Repeat the analysis for this scenario. Is it better?
2. Experiment with other models or other parameters. Can you beat your classmates' best score?