## Model Selection and Ensemble Methods

Today we will evaluate several models and build some ensemble methods. We will cover:

- Cross validation with k-fold
- Boosting
- Bagging
- Stacking

What steps will we take?

- Import dataset
- Preprocess data
- Training and Classification
- Conclusion

In [47]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

**Creating our dataset**

In [48]:
X, y = make_classification(n_samples=10000, n_features=15, random_state=42)

In [49]:
# Inspect the data
# X is a NumPy array of shape (10000, 15): one row per sample, one column per feature
# y is a 1-D NumPy array of shape (10000,): one target label per sample, either 0 or 1
# Print the first 50 rows of X

print(X[:50])

[[-1.14474066e+00 -1.11299633e+00  1.40631683e+00  4.00346023e-01
   1.10655032e+00  1.00020244e+00 -1.71572340e+00 -5.04978472e-01
   5.37300051e-01  6.25633913e-01 -5.53859324e-01 -1.31216288e+00
  -1.16249781e+00  1.11093702e+00  6.08045240e-01]
 [-1.00269530e+00  8.20862481e-01 -5.34430586e-01  8.10844344e-01
  -8.40368581e-01 -1.76056710e+00 -5.85908092e-01 -3.95690293e-02
  -4.65022740e-01  3.29737448e-01 -6.29083203e-01 -5.25604875e-01
  -1.09756927e+00 -2.75296653e-01  1.21757322e+00]
 [ 1.18863488e+00 -1.85701357e+00  2.46079214e+00 -1.08624787e-01
   2.24972627e-01  1.88919658e+00  1.25779229e+00 -7.80248245e-01
   8.80834303e-01  7.31413999e-01  7.99919825e-01 -3.86480075e-02
  -1.23934160e-01  1.97734857e+00  1.08332784e+00]
 [-1.15758793e+00  1.97758909e+00  1.80852140e+00  9.34791073e-01
  -1.78114728e+00 -3.63316461e-01  5.55376816e-01 -4.76009681e-01
  -1.54368514e+00  5.26410291e-02  6.22994702e-01  3.50211944e-01
   8.54796837e-01  2.68704616e+00 -1.36895776e-01]
 [-5

**Split the data into training and testing sets**

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [51]:
print(X_train.shape)

(7000, 15)
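The split above is not seeded, so the accuracies reported later will change from run to run. A minimal sketch of a reproducible variant follows. It assumes a fixed `random_state=42` and `stratify=y`; neither argument is part of the original cell, and using them would change the numbers printed in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic dataset as in the notebook
X, y = make_classification(n_samples=10000, n_features=15, random_state=42)

# Assumption: random_state and stratify are additions for illustration,
# they are not used in the original split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
# Fraction of class 1 in each split; stratification keeps the two nearly equal
print(y_train.mean(), y_test.mean())
```

Stratifying matters most when classes are imbalanced; here the classes are roughly balanced, so the main benefit is the fixed seed.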


**Build the classifiers**

In [52]:
# Initialize the classifiers

clf1 = KNeighborsClassifier()
clf2 = GaussianNB()
clf3 = DecisionTreeClassifier()

# Define the meta-classifier for the stacking classifier
# (logistic regression combines the base classifiers' predictions)

clf_meta = LogisticRegression()


In [53]:
# Create our ensemble classifiers

# Define the bagging classifier with a decision tree as the base classifier
clf_bagging = BaggingClassifier(estimator=clf3, n_estimators=10, random_state=42)

# Define the boosting classifier with a decision tree as the base classifier
clf_boosting = AdaBoostClassifier(estimator=clf3, n_estimators=10, random_state=42)

# Create our estimators for stacking classifier
estimators = [('knn', clf1), ('gnb', clf2), ('dt', clf3)]

# Define the stacking classifier with logistic regression as the meta classifier
clf_stack = StackingClassifier(estimators=estimators, final_estimator=clf_meta)
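To make the idea behind bagging concrete, here is a rough hand-written sketch: train each tree on a bootstrap sample of the training set, then combine the trees by majority vote. It is a simplification, not how `BaggingClassifier` is implemented (the library averages predicted probabilities when the base estimator provides them). The smaller dataset, the seeds, and the names `trees`, `votes`, and `manual_accuracy` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset than the notebook's, so the sketch runs quickly (assumption)
X, y = make_classification(n_samples=2000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rng = np.random.default_rng(42)
n_estimators = 10
trees = []
for _ in range(n_estimators):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# One row of 0/1 predictions per tree: shape (n_estimators, n_test_samples)
votes = np.stack([tree.predict(X_test) for tree in trees])

# Majority vote over the trees; a 5-5 tie is resolved in favor of class 1
y_pred_manual = (votes.mean(axis=0) >= 0.5).astype(int)
manual_accuracy = (y_pred_manual == y_test).mean()
print("Manual bagging accuracy:", manual_accuracy)
```

Each tree sees a slightly different dataset, so their errors are partly independent, and voting averages some of that variance away.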


**Evaluate our models**

In [54]:
from sklearn.metrics import accuracy_score

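# Fit each classifier on the training set and score it on the held-out test set
# (clf_meta is included here as a standalone logistic regression baseline)

for clf in (clf1, clf2, clf3, clf_meta, clf_bagging, clf_boosting, clf_stack):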
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, 'accuracy: ', accuracy_score(y_test, y_pred))


KNeighborsClassifier accuracy:  0.895
GaussianNB accuracy:  0.878
DecisionTreeClassifier accuracy:  0.8873333333333333
LogisticRegression accuracy:  0.89
BaggingClassifier accuracy:  0.9276666666666666
AdaBoostClassifier accuracy:  0.889
StackingClassifier accuracy:  0.9143333333333333
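One possible reason the AdaBoost score sits so close to the single decision tree: a fully grown tree usually fits the training set perfectly, and scikit-learn's AdaBoost stops boosting once an estimator reaches zero training error. The sketch below compares a fully grown base tree with depth-1 stumps. The smaller dataset, the seeds, and the names `ada_full` and `ada_stumps` are assumptions for illustration, and the scores are not expected to match the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset than the notebook's, so the sketch runs quickly (assumption)
X, y = make_classification(n_samples=2000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The base estimator is passed positionally so the sketch does not depend
# on the keyword name, which has changed across scikit-learn versions
ada_full = AdaBoostClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=10, random_state=42
)
ada_stumps = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1, random_state=0),
    n_estimators=10,
    random_state=42,
)

for name, model in [("full trees", ada_full), ("stumps", ada_stumps)]:
    model.fit(X_train, y_train)
    # estimators_ holds only the estimators that were actually fitted
    print(name, "fitted estimators:", len(model.estimators_),
          "test accuracy:", model.score(X_test, y_test))
```

If the fully grown variant reports a single fitted estimator, the "ensemble" is effectively one decision tree, which would explain near-identical accuracies.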


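On this single train/test split, the bagging classifier has the highest test accuracy (about 0.928), ahead of the stacking classifier (about 0.914), so we would choose it as our final model. Because the split is not seeded, the exact numbers will vary between runs, and a ranking based on one split is only a rough guide; the cross-validation below gives a more stable comparison.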

**Cross-validation**

In [56]:
# Cross-validation on the stacking classifier
# Run cross-validation on the training set only (X_train, y_train) and keep the
# test set out of it; otherwise information from the test data leaks into model selection

scores = cross_val_score(clf_stack, X_train, y_train, cv=5)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))

Cross-validation scores: [0.92285714 0.91714286 0.91857143 0.91142857 0.90857143]
Average cross-validation score: 0.92
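To show what `cross_val_score` does with `cv=5`, the sketch below runs the same five folds by hand. For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, so the manual loop should reproduce the library scores. `GaussianNB` stands in for the stacking classifier to keep the run short; the smaller dataset and the names `manual_scores` and `library_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Smaller dataset than the notebook's, so the sketch runs quickly (assumption)
X, y = make_classification(n_samples=2000, n_features=15, random_state=42)
model = GaussianNB()

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Fit a fresh, unfitted copy on four folds and score it on the remaining fold
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# Library equivalent of the loop above
library_scores = cross_val_score(model, X, y, cv=5)

print("Manual scores: ", manual_scores)
print("Library scores:", library_scores)
```

Every sample is used for validation exactly once, which is why the averaged score is less sensitive to a lucky or unlucky split than a single hold-out estimate.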


In [57]:
# Cross validation on bagging classifier
scores = cross_val_score(clf_bagging, X_train, y_train, cv=5)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))


Cross-validation scores: [0.93214286 0.92928571 0.93571429 0.92214286 0.92857143]
Average cross-validation score: 0.93
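The two cross-validation cells can be folded into one selection loop: score every candidate on the training set, pick the one with the best mean score, and only then evaluate that single model on the test set. This is a sketch under assumptions, not the notebook's exact procedure. The smaller dataset, the seeds, and the names `candidates`, `cv_means`, and `best_name` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset than the notebook's, so the sketch runs quickly (assumption)
X, y = make_classification(n_samples=2000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

candidates = {
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=10, random_state=42
    ),
    "stacking": StackingClassifier(
        estimators=[
            ("knn", KNeighborsClassifier()),
            ("gnb", GaussianNB()),
            ("dt", DecisionTreeClassifier(random_state=0)),
        ],
        final_estimator=LogisticRegression(),
    ),
}

# Model selection uses the training set only
cv_means = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# The test set is touched once, for the selected model
final_model = candidates[best_name].fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print("Mean CV accuracy per model:", cv_means)
print("Selected:", best_name, "| test accuracy:", test_accuracy)
```

Keeping the test set out of the selection step means the final test accuracy remains an honest estimate of performance on unseen data.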
