# Objective
Train a RandomForest, Extra-Trees, and SVM Classifier on the MNIST dataset, then combine them into an ensemble that outperforms each individual classifier.

### Load MNIST Dataset and Creat Train, Test, and Validation Sets
The exercise in the book requests that we split the data such that our training set, testing set, and validation set, contain 50,000 instances, 10,000 instances, and 10,000 instances- respectively. We can download the MNIST dataset directly from Scikit-Learn's datasets module:

In [1]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

mnist = fetch_openml('mnist_784', version=1)
X, y = mnist["data"], mnist["target"]

# Use train_test_split() twice to get the three sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 20000)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5)

  warn(


## Random Forest Classifier
First, let's build the Random Forest classifier. The exercise doesn't specify a certain number of trees, nor any other hyperparameters. Because the objective is to build an ensemble that out performs all individual classifiers, I will spend a little time trying to make the Random Tree Classifier good, but it is not my main priority in this exercise. Note that the Random Forest Classifier is an ensemble of it own, an ensemble of Decision Tree Classifiers.

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# We will begin by trying to find some good hyperparameters, not including the number of trees:
randforest_clf = RandomForestClassifier()

param_grid = [{'criterion':['gini', 'entropy'], 'max_features':[None, 'sqrt', 'log2']}]
grid_search = GridSearchCV(randforest_clf, param_grid, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)

{'criterion': 'gini', 'max_features': 'sqrt'}


Let's take the highest performing RandomForestClassifier from GridSearchCV and use it as our model. Interestingly, the Random Forest performed best using only a square root of the features, instead of all of them.

Let's test GridSearchCV's best estimator on the validation set and print its accuracy:

In [3]:
from sklearn.metrics import accuracy_score

randforest_clf = grid_search.best_estimator_
pred_y = randforest_clf.predict(X_val)

print("Number of Trees:", randforest_clf.n_estimators)
print("Accuracy:", accuracy_score(pred_y, y_val))

Number of Trees: 100
Accuracy: 0.9686


Wow! 96.8 percent without too much effort. I also printed the number of trees in the estimator- *I was curious*. Let's test this classifier for everything between 80 and 120 trees in the forest. Maybe we'll find an even better result.

In [4]:
import numpy as np

max_estimators = 120
min_estimators = 80
accuracies = []
lowest_accuracy = float("inf")

for num_estimators in range(min_estimators, max_estimators + 1):
    model = RandomForestClassifier(n_estimators=num_estimators, criterion='gini', max_features='log2')
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    acc_scr = accuracy_score(y_pred, y_val)
    accuracies.append(acc_scr)
    if acc_scr < lowest_accuracy:
        best_estimator_num = num_estimators
        best_accuracy = acc_scr

print("Best number of trees", best_estimator_num, "with an accuracy of", best_accuracy)

Best number of trees 120 with an accuracy of 0.9682


It appears that the best number of trees was the maximum number that I permitted. But notice that the accuracy is lower. We could continue to add trees for an eternity, but that would way overfit our data and the accuracy gains are very small. Let's stick with the default 100 trees for now and move onto the next classifier.
## Extra-Trees Classifier
The *Extremely Random Trees* classifier or *Extra-Trees* is a classifier that works very much like a Random Forest, but with one main difference, at each nod only a random subset of features is considered for splitting- *instead of all*. This introduces more randomness into the model. This technique trades more bias for less variance. Let's build an ExtraTreesClassifier with all the same hyperparameters as our RandomForestClassifier:

In [5]:
from sklearn.ensemble import ExtraTreesClassifier

extrees_clf = ExtraTreesClassifier(criterion='gini', max_features='log2')
extrees_clf.fit(X_train, y_train)

y_pred = extrees_clf.predict(X_val)

print("Accuracy:", accuracy_score(y_pred, y_val))

Accuracy: 0.9679


The Extra Trees Classifier performed very slightly better, *this could easily change because it's a sstochastic process*, than the Random Forest model. With two pretty strong classifiers, if the misclassified instances are independent- *they probably are not*- we could have a very strong ensemble.

# SVM Classifier
Though SVMs are usually used to predict whether an instance is one of two classes, it can be used for muticlass output, so long as we incorporate a one versus the rest scheme or something similair. Luckily, Scikit-Learn does this automatically when SVM classifiers are used on datasets with multiple outputs. Let's build a default SVM classifier:

In [6]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_val)

print("Accuracy:", accuracy_score(y_pred, y_val))

Accuracy: 0.9788


Our SVM Classifier worked really well, better than both of our previous classifiers, straight out of the box. In the interest of time, I'm just going to use the default classifier in my ensemble. We could definitely spend some time improving this Support Vector Classifier, but it would take quite awhile and it is not our objective.
## The Ensemble
To build our ensemble, we will use Scikit-Learn's VotingClassifier class. Essentially this class takes all the classifiers we've trained and combines them. Its called a VotingClassifier because it assigns an instance the class with the most votes from our three classifiers. This type of voting is called *hard voting*. If we weight the votes based on the confidences of each prediction, we'd be implementing *soft_voting*. Soft voting performs better more often, but let's see if we can get away with using hard voting for ensemble.

In [7]:
from sklearn.ensemble import VotingClassifier

vote_clf = VotingClassifier(estimators=[('rf', randforest_clf), ('et', extrees_clf), ('sv', svc)],
                            voting='hard')

for model in (randforest_clf, extrees_clf, svc, vote_clf):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    print(model.__class__.__name__, accuracy_score(y_pred, y_val))

RandomForestClassifier 0.9703
ExtraTreesClassifier 0.9683
SVC 0.9788
VotingClassifier 0.9746


All of our classifiers are performing pretty well, but the ensemble actually performed worse than our Support Vector classifier- *just slightly*. Let's see if we can remedy this with soft voting. To do this, we need access to the predict_proba method which is only available to Support Vector classifiers when probability=True.

In [8]:
from sklearn.ensemble import VotingClassifier

svc = SVC(probability=True)

vote_clf = VotingClassifier(estimators=[('rf', randforest_clf), ('et', extrees_clf), ('sv', svc)],
                            voting='soft')

for model in (randforest_clf, extrees_clf, svc, vote_clf):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    print(model.__class__.__name__, accuracy_score(y_pred, y_val))

RandomForestClassifier 0.9691
ExtraTreesClassifier 0.9684
SVC 0.9788
VotingClassifier 0.9787


With soft voting we were able to slightly outperform the three member classifiers. If we had a larger number of slightly less reliable models, it would have been very easy for the ensemble to outperform each individual classifier. But with very few classifiers, that were probably only misclassifying the hardest instances and thus having very similair misclassification sets, it was difficult to perform better than the best model. Luckily, weighting our votes with the confidence of each classification we were able to meet our objective!

In [10]:
X_new, y_new = X_train.append(X_val), y_train.append(y_val)

vote_clf.fit(X_new, y_new)
y_pred = vote_clf.predict(X_test)

print("Accuracy:", accuracy_score(y_pred, y_test))

  X_new, y_new = X_train.append(X_val), y_train.append(y_val)
  X_new, y_new = X_train.append(X_val), y_train.append(y_val)


Accuracy: 0.9817


Finally, we train the model on the training and validation sets together and then find our highest accuracy yet on the test set. Note also, an accuracy of >80 percent on the MNIST dataset, without using neural networks, is considered pretty good. We can't hope to gain much more accuracy using the ensemble that we we're asked to use.