## Model Selection and Ensemble Methods

Today, we're going to evaluate some models and build some ensemble methods.

- k-fold cross-validation
- Boosting
- Bagging
- Stacking

What steps will we take?

- Import Dataset
- Preprocess Data
- Training and Classification
- Conclusion

**Importing our models**

In [31]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification


**Creating our dataset**

In [32]:
X, y = make_classification(n_samples=10000, n_features=15, random_state=42)

In [33]:
# Inspect the data
# X is a numpy array with 10000 rows and 15 columns
# each entry is a numerical value, a float
# y is a 1-D numpy array with 10000 entries
# each entry is a target value, either 0 or 1

print(y[:50])

[1 1 1 1 1 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 1 1 1 0 0 0 1 0 1
 1 0 1 0 1 0 0 0 0 1 0 0 1]


**Split the data into train, test**

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [35]:
# Inspect the data
# X_train has 7000 rows, X_test has 3000 rows
# Looks good

print(X_train.shape)

(7000, 15)


**Build the classifiers**

In [36]:
# Initialize the classifiers

from sklearn.naive_bayes import GaussianNB

clf1 = KNeighborsClassifier()
clf2 = GaussianNB()
clf3 = DecisionTreeClassifier()

# Define a meta-classifier for the stacking classifier
# Use Logistic Regression as the meta-classifier

clf_meta = LogisticRegression()

In [37]:
# Create our ensemble classifiers
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`,
# and the old name was removed in 1.4

# Define the bagging classifier with a decision tree base classifier
clf_bagging = BaggingClassifier(estimator=clf3, n_estimators=10, random_state=42)

# Define the boosting classifier with a decision tree base classifier
clf_boosting = AdaBoostClassifier(estimator=clf3, n_estimators=10, random_state=42)

# Create our estimators for the stacking classifier
estimators = [('knn', clf1), ('gnb', clf2), ('dt', clf3)]

# Define the stacking classifier with logistic regression as the meta-classifier
clf_stack = StackingClassifier(estimators=estimators, final_estimator=clf_meta, cv=10)

**Evaluate our models**

In [38]:
from sklearn.metrics import accuracy_score

# Fit and predict for each classifier

for clf in (clf1, clf2, clf3, clf_meta, clf_bagging, clf_boosting, clf_stack):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, 'accuracy: ', accuracy_score(y_test, y_pred))

KNeighborsClassifier accuracy:  0.8943333333333333
GaussianNB accuracy:  0.8713333333333333
DecisionTreeClassifier accuracy:  0.891
LogisticRegression accuracy:  0.8943333333333333
BaggingClassifier accuracy:  0.9296666666666666
AdaBoostClassifier accuracy:  0.893
StackingClassifier accuracy:  0.912


The bagging classifier has the highest test accuracy here (about 0.930), so we would choose it as our model.
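
As a rough illustration of what bagging does under the hood, the sketch below trains ten decision trees on bootstrap resamples and combines them by majority vote. It is a simplification and not scikit-learn's implementation: `BaggingClassifier` averages predicted probabilities when the base estimator provides them. The smaller dataset and the names `X_demo`, `votes`, and `manual_bagging_acc` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a smaller synthetic dataset than the one above, to keep the sketch fast
X_demo, y_demo = make_classification(n_samples=2000, n_features=15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)
n_trees = 10
votes = np.zeros((n_trees, len(X_te)), dtype=int)

for i in range(n_trees):
    # Bootstrap: draw len(X_tr) row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=i).fit(X_tr[idx], y_tr[idx])
    votes[i] = tree.predict(X_te)

# Majority vote across trees (labels are 0/1; ties go to class 1)
y_vote = (votes.mean(axis=0) >= 0.5).astype(int)
manual_bagging_acc = (y_vote == y_te).mean()

# Single unpruned tree on the full training set, for comparison
single_tree_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print("single tree accuracy:", single_tree_acc)
print("hand-rolled bagging accuracy:", manual_bagging_acc)
```

Each tree sees a slightly different sample, so their individual errors tend to differ, and voting averages some of that variance away. How much this helps depends on the data and the random seeds.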

**Cross-validation**

In [40]:
# Cross validation on stacking classifier

scores = cross_val_score(clf_stack, X_train, y_train, cv=5)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.3f}".format(scores.mean()))

Cross-validation scores: [0.91857143 0.91928571 0.90142857 0.91571429 0.91785714]
Average cross-validation score: 0.915
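
To make the k-fold step less of a black box, here is a minimal sketch of roughly what `cross_val_score(..., cv=5)` does for a classifier: split the data into five stratified folds, fit a fresh copy of the model on four of them, and score it on the fifth. It uses `GaussianNB` on a small synthetic dataset so that it runs quickly; the names `X_demo`, `manual_scores`, and `helper_scores` are assumptions made for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assumption: a small synthetic dataset, separate from the one used above
X_demo, y_demo = make_classification(n_samples=1000, n_features=15, random_state=42)
model = GaussianNB()

# For classifiers, an integer cv means stratified k-fold without shuffling
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # Fit a fresh, unfitted copy of the model on the training folds
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score (accuracy) on the held-out fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The helper should produce the same five accuracies
helper_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print("Manual fold scores:", manual_scores)
print("cross_val_score:   ", helper_scores)
```

Because every sample is held out exactly once, the mean of the fold scores is usually a steadier estimate than the accuracy from a single train/test split.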
