# Ensemble Learning

In [1]:
from sklearn import tree
from sklearn import ensemble
from sklearn import metrics
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
import numpy as np
import pandas as pd

In [2]:
# Note: read_csv treats the first line as a header by default;
# pass header=None if the file has no header row.
df = pd.read_csv("data_banknote_authentication.txt")
df = df.sample(frac=1)  # shuffle the rows (no random_state, so each run differs)
data = df.values
# 80/20 split: each tuple is (features, labels); the label is the last column
n_train = int(data.shape[0] * .8)
train_split = (data[:n_train, :-1], data[:n_train, -1])
test_split = (data[n_train:, :-1], data[n_train:, -1])

In [3]:
baseline_classification_tree = tree.DecisionTreeClassifier()
adjust_classification_tree = tree.DecisionTreeClassifier(max_features=4, min_samples_split = 4, max_depth = 30)
clf = baseline_classification_tree.fit(train_split[0], train_split[1])
pred = clf.predict(test_split[0])
print(f"Baseline Classification F1 Score: {metrics.f1_score(test_split[1], pred):.4f}")
clf = adjust_classification_tree.fit(train_split[0], train_split[1])
pred = clf.predict(test_split[0])
print(f"Adjusted Classification F1 Score: {metrics.f1_score(test_split[1], pred):.4f}")

Baseline Classification F1 Score: 0.9796
Adjusted Classification F1 Score: 0.9796


No significant change was seen when tuning the parameters of the model. The banknote dataset is easily separable and not sensitive to hyperparameter tuning. When a logistic regression model was trained on this dataset, the loss converged after several epochs and the model didn't suffer much from overfitting.

## Random Forest Classifier

In [4]:
forest_classifier = ensemble.RandomForestClassifier(n_estimators=100, max_depth=100, min_samples_split=2)
clf = forest_classifier.fit(train_split[0], train_split[1])
# cross_val_score clones the estimator and refits it on every fold,
# so the fit above does not influence the scores below
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(clf, data[:, :-1], data[:, -1], scoring='f1', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Random Forest F1 Score: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

Random Forest F1 Score: 0.992 (0.008)


## Gradient Boost Classifier

In [5]:
gradient_boost = ensemble.GradientBoostingClassifier(loss="log_loss", learning_rate=.2)
clf = gradient_boost.fit(train_split[0], train_split[1])
# Same protocol as for the random forest: 10 folds x 3 repeats = 30 F1 scores
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(clf, data[:, :-1], data[:, -1], scoring='f1', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Gradient Boost F1 Score: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

Gradient Boost F1 Score: 0.997 (0.005)


The gradient boosting classifier has a marginally higher mean F1 score (0.997 vs. 0.992) and a lower standard deviation (0.005 vs. 0.008) under repeated stratified 10-fold cross-validation. This means that the gradient boosting model did slightly, but not significantly, better than the random forest.
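To make "slightly, but not significantly" concrete, here is a rough sketch that compares the gap between the two reported means with the fold-to-fold spread. The threshold rule is an assumption chosen for illustration, not a formal significance test; a stricter comparison would use the per-fold scores themselves.

```python
# Summary statistics copied from the cross-validation outputs above
rf_mean, rf_std = 0.992, 0.008
gb_mean, gb_std = 0.997, 0.005

gap = gb_mean - rf_mean
# Rough heuristic (an assumption, not a formal test): treat the gap as
# notable only if it exceeds the larger of the two standard deviations.
threshold = max(rf_std, gb_std)
notable = gap > threshold
print(f"gap={gap:.3f}, threshold={threshold:.3f}, notable={notable}")
```

With these numbers the gap is smaller than either model's standard deviation, which is consistent with reading the two models as roughly equivalent on this dataset.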

Since the task for this project is to identify fake banknotes, precision is a good metric because it shows what fraction of the notes flagged as fake really are fake; low precision means that many genuine banknotes are being flagged. Flagging real banknotes as fake is very bad because it would mean that innocent people are being accused of a crime they did not commit. In addition, recall is a good metric because the primary goal of this task is finding fake banknotes, and recall measures what fraction of the fake notes are caught. The F1 score is the harmonic mean of precision and recall, so it evaluates the model on both criteria with a single number.
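As a small worked example of how the three metrics relate, the sketch below computes precision, recall, and F1 by hand. The labels are made up for illustration, and it assumes that class 1 means "fake" and is the positive class; the helper name `precision_recall_f1` is hypothetical and not part of scikit-learn.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels (assumption: 1 = fake, 0 = genuine), not taken from the dataset
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here three of the five flagged notes are fake (precision 0.6) and three of the four fake notes are caught (recall 0.75), so F1 lands between the two at about 0.667.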

In [6]:
baseline_classification_tree = tree.DecisionTreeClassifier()
clf = baseline_classification_tree.fit(train_split[0], train_split[1])
pred = clf.predict(test_split[0])
print(f"Baseline Accuracy: {np.sum(pred == test_split[1])/test_split[1].shape[0]:.4f}")
forest_classifier = ensemble.RandomForestClassifier(n_estimators=100, max_depth=100, min_samples_split=2)
clf = forest_classifier.fit(train_split[0], train_split[1])
pred = clf.predict(test_split[0])
print(f"Random Forest Accuracy : {np.sum(pred == test_split[1])/test_split[1].shape[0]:.4f}")
gradient_boost = ensemble.GradientBoostingClassifier(loss="log_loss", learning_rate=.2) 
clf = gradient_boost.fit(train_split[0], train_split[1])
pred = clf.predict(test_split[0])
print(f"Gradient Boost Accuracy : {np.sum(pred == test_split[1])/test_split[1].shape[0]:.4f}")

Baseline Accuracy: 0.9818
Random Forest Accuracy : 0.9891
Gradient Boost Accuracy : 0.9964


Although I chose the F1 score as my metric, any of the popular metrics would have been just as effective because my dataset is balanced and highly separable. As shown above, the accuracy metric yields scores similar to F1.
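The role of class balance can be illustrated with a short sketch. The confusion-matrix counts below are hypothetical, not measured on the banknote data, and the helper `accuracy_and_f1` is a name introduced here for illustration. Both scenarios contain the same four mistakes; only the class proportions differ.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical counts, not from the banknote dataset.
# Balanced: 100 positives and 100 negatives, 2 errors of each kind.
balanced = accuracy_and_f1(tp=98, fp=2, fn=2, tn=98)
# Imbalanced: 10 positives and 190 negatives, the same 2 errors of each kind.
imbalanced = accuracy_and_f1(tp=8, fp=2, fn=2, tn=188)

print(f"balanced:   accuracy={balanced[0]:.3f} f1={balanced[1]:.3f}")
print(f"imbalanced: accuracy={imbalanced[0]:.3f} f1={imbalanced[1]:.3f}")
```

In the balanced case accuracy and F1 agree (both 0.98), while in the imbalanced case accuracy stays at 0.98 but F1 drops to 0.80. This is why the choice of metric matters little here but would matter on a skewed dataset.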