# Ensemble Learning Exercise

#### Q1: If you have trained 5 different models on the same training data and they all achieve 95% precision, is there any chance you can combine these models to get better results? How?

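Yes. Combine them into a voting ensemble that predicts the class most of the models agree on. This often beats every individual model, and it works best when the models are very different from one another (different algorithms, or trained on different instances), because they then tend to make different kinds of errors.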
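To see why voting can help, here is a minimal sketch of the best case: a binary problem where every model is right 95% of the time and the models' errors are fully independent. Both are assumptions made for illustration (the question speaks of precision, and real models trained on the same data make correlated errors), and `majority_vote_accuracy` is a hypothetical helper name.

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """Probability that a strict majority of independent models is correct."""
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )

# Five models, each correct 95% of the time, errors assumed independent.
ensemble_acc = majority_vote_accuracy(5, 0.95)
print(f"{ensemble_acc:.4f}")  # about 0.9988 under the independence assumption
```

In practice the gain is smaller, since the independence assumption rarely holds.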

#### Q2: Difference between hard and soft voting classifier?

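In hard voting, each classifier predicts a class and the class with the most votes becomes the voting classifier's prediction. In soft voting, the class probabilities estimated by each classifier are averaged and the voting classifier predicts the class with the highest average probability. Soft voting gives more weight to confident votes, but it requires every classifier to be able to estimate class probabilities.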
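The two schemes can disagree. A small sketch with made-up probabilities for three classifiers on a single instance (the numbers are illustrative, not taken from any trained model):

```python
import numpy as np

# Rows: classifiers, columns: estimated probability of class 0 and class 1.
probas = np.array([
    [0.45, 0.55],   # weakly prefers class 1
    [0.45, 0.55],   # weakly prefers class 1
    [0.90, 0.10],   # strongly prefers class 0
])

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = probas.argmax(axis=1)
hard_pred = np.bincount(hard_votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
soft_pred = probas.mean(axis=0).argmax()

print(hard_pred, soft_pred)  # 1 0
```

Hard voting picks class 1 (two votes to one), while soft voting picks class 0 because the single confident classifier outweighs the two hesitant ones.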

#### Q3: Can we speed up the training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests or stacking ensembles?

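Bagging ensembles, pasting ensembles and random forests can all be distributed across servers, because each predictor is trained independently of the others. Boosting ensembles cannot: each predictor is built on the results of the previous one, so training is sequential. In a stacking ensemble, the predictors within a layer are independent and can be trained in parallel, but a layer can only be trained once the previous layer has finished.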
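As a rough illustration of that independence, scikit-learn's `BaggingClassifier` can train its predictors in parallel through `n_jobs`. Note that this spreads work over CPU cores on one machine, not over servers; the synthetic dataset and parameter values below are assumptions chosen to keep the sketch small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, used only for demonstration.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=20,
    bootstrap=True,   # sample with replacement (bagging); False gives pasting
    n_jobs=2,         # train predictors in two parallel workers
    random_state=42,
)
bag_clf.fit(X_demo, y_demo)
print(len(bag_clf.estimators_))  # 20
```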

#### Q4: Benefit of OOB evaluation?

With bagging, each predictor is trained on a bootstrap sample, so some training instances are never seen by it (its out-of-bag, or OOB, instances). Each predictor can be evaluated on its own OOB instances, so the ensemble can be evaluated without a separate validation set. This leaves more instances available for training, which may produce slightly better results.
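In scikit-learn this is requested with `oob_score=True`. A minimal sketch on a synthetic dataset (the dataset and hyperparameters are assumptions for illustration):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class dataset, used only for demonstration.
X_moons, y_moons = make_moons(n_samples=500, noise=0.3, random_state=42)

oob_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    oob_score=True,   # evaluate each instance with predictors that never saw it
    random_state=42,
)
oob_clf.fit(X_moons, y_moons)

# Accuracy estimate obtained without holding out a validation set.
print(oob_clf.oob_score_)
```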

#### Q5: What makes extra trees more random than random forests? How does this help and is it slower or faster than regular random forests?

When growing a tree, a random forest considers a random subset of features at each node and searches for the best possible threshold for each of them. Extra-Trees also considers a random subset of features, but it draws a random threshold for each one and keeps the best of those random splits. The extra randomness acts as a form of regularization, which can help when a random forest is overfitting. Because they skip the threshold search, Extra-Trees are faster to train than random forests; they are neither faster nor slower when making predictions.
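The training-time difference can be checked with a quick, informal timing comparison. This is only a sketch: the synthetic dataset and sizes are assumptions, and the measured times depend on the machine, so no particular result is claimed.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic dataset, used only for demonstration.
X_t, y_t = make_classification(n_samples=2000, n_features=20, random_state=42)

timings = {}
for model in (
    RandomForestClassifier(n_estimators=50, random_state=42),
    ExtraTreesClassifier(n_estimators=50, random_state=42),
):
    start = time.perf_counter()
    model.fit(X_t, y_t)
    timings[type(model).__name__] = time.perf_counter() - start

# Training time in seconds for each model.
print(timings)
```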

#### Load MNIST, split into test, training and validation. Train a random forest, an extra-trees & an SVM. Combine these using hard or soft voting into a classifier that outperforms the original ones.

In [28]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [17]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)

In [21]:
X = mnist['data']
y = mnist['target']

In [22]:
X_train_full = X[:60000]
y_train_full = y[:60000]
X_test = X[60000:]
y_test = y[60000:]

In [23]:
X_train, y_train = X_train_full[:50000], y_train_full[:50000]
X_valid, y_valid = X_train_full[50000:], y_train_full[50000:]

In [29]:
ext_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=42)

for clf in (ext_clf, rnd_clf, svm_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_valid)
    print(clf.__class__.__name__, accuracy_score(y_valid, y_pred))

ExtraTreesClassifier 0.9743
RandomForestClassifier 0.9736
LinearSVC 0.8662


In [32]:
voting_clf = VotingClassifier(
    estimators=[('ex', ext_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)

voting_clf.fit(X_train, y_train)

y_pred = voting_clf.predict(X_valid)

print(voting_clf.__class__.__name__, accuracy_score(y_valid, y_pred))

VotingClassifier 0.9737


In [33]:
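# Hard voting did not beat ExtraTreesClassifier on its own. Drop LinearSVC
# (the weakest model, and it has no predict_proba) and switch to soft voting.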

voting_clf = VotingClassifier(
    estimators=[('ex', ext_clf), ('rf', rnd_clf)],
    voting='soft'
)

voting_clf.fit(X_train, y_train)

y_pred = voting_clf.predict(X_valid)

print(voting_clf.__class__.__name__, accuracy_score(y_valid, y_pred))

VotingClassifier 0.9749


In [34]:
# Looking good. Refit on the full training set and compare on the test set.

for clf in (ext_clf, rnd_clf, voting_clf):
    clf.fit(X_train_full, y_train_full)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

ExtraTreesClassifier 0.9722
RandomForestClassifier 0.9705
VotingClassifier 0.9729


#### Create a stacking ensemble using the classifiers from the previous question. First train a blender using a new training set created with the results of predictions on the validation set.

In [38]:
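# Each entry is a (name, model) tuple, so unpack the model before calling predict.
estimators = [('ex', ext_clf), ('rf', rnd_clf)]

X_valid_pred = np.empty((len(X_valid), len(estimators)), dtype=np.float32)

for index, (name, estimator) in enumerate(estimators):
    # The models were last fit on X_train_full, which contains the validation
    # set. Refit on X_train so the validation predictions are truly held out.
    estimator.fit(X_train, y_train)
    X_valid_pred[:, index] = estimator.predict(X_valid)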
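The exercise then asks for a blender trained on the base classifiers' validation-set predictions. A minimal, self-contained sketch of that step follows. To keep it quick to run it uses the small `load_digits` dataset as a stand-in for MNIST; the split sizes, the choice of a random forest as blender and the helper name `stack_predictions` are all assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data: 60% training, 20% validation, 20% test.
X, y = load_digits(return_X_y=True)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# First layer: base classifiers trained on the training set only.
base_models = [
    ExtraTreesClassifier(n_estimators=50, random_state=42),
    RandomForestClassifier(n_estimators=50, random_state=42),
]
for model in base_models:
    model.fit(X_tr, y_tr)

def stack_predictions(models, X):
    """Build one column of predicted labels per base model."""
    return np.column_stack([m.predict(X) for m in models])

# Second layer: the blender learns from held-out validation predictions.
blender = RandomForestClassifier(n_estimators=100, random_state=42)
blender.fit(stack_predictions(base_models, X_val), y_val)

# Evaluate the whole stack on the test set.
y_pred = blender.predict(stack_predictions(base_models, X_te))
blend_acc = accuracy_score(y_te, y_pred)
print(blend_acc)
```

Training the blender on validation predictions rather than training predictions matters: the base models have already seen the training set, so their predictions there would look better than they really are.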