Voting Classifier

We want to train a Random Forest Classifier, Extra Trees Classifier, Linear Support Vector Classifier, and an MLP Classifier.

We combine the four classifiers in a Voting Classifier since it often achieves a higher accuracy than the best classifier in the group.

It is a simple way to combine the predictions of each classifier: every classifier votes, and the ensemble predicts the class with the most votes.
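To make the voting rule concrete, here is a minimal sketch of hard (majority) voting in plain Python. The function name `hard_vote` and the toy predictions are made up for illustration and are not part of scikit-learn; ties are broken in favor of the smallest label, which is how the scikit-learn documentation describes its tie-breaking.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the majority label per sample.

    `predictions` holds one list per classifier, each containing
    that classifier's predicted label for every sample.
    """
    voted = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        # sorted() puts labels in ascending order and max() keeps the first
        # maximum it sees, so ties go to the smallest label.
        voted.append(max(sorted(counts), key=counts.get))
    return voted

# Hypothetical predictions of three classifiers on three digits
clf_a = [3, 7, 1]
clf_b = [3, 2, 1]
clf_c = [5, 7, 1]

majority = hard_vote([clf_a, clf_b, clf_c])
print(majority)  # [3, 7, 1]
```

Each sample gets the label that at least two of the three classifiers agree on, even when one classifier disagrees.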

In [None]:
import sklearn
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score

mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(np.uint8)

We load the MNIST data and convert the string labels to integers.

In [None]:
from sklearn.model_selection import train_test_split

x_train_val, x_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000)

x_train, x_val, y_train, y_val = train_test_split(
    x_train_val, y_train_val, test_size=10000)

We split our data into training, validation, and test sets.
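MNIST contains 70,000 images, so holding out 10,000 twice leaves 50,000 for training. The sketch below reproduces that arithmetic with a plain shuffled index split; `split_indices` is a hypothetical helper written for this illustration, not what `train_test_split` does internally.

```python
import random

def split_indices(n_samples, n_val, n_test, seed=0):
    """Shuffle sample indices and cut them into train, validation, and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(70_000, n_val=10_000, n_test=10_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 50000 10000 10000
```

The three index lists do not overlap, so the validation and test sets stay unseen during training.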

In [None]:
random_forest_classifier = RandomForestClassifier(n_estimators=100)
extra_trees_classifier = ExtraTreesClassifier(n_estimators=100)
svm_classifier = LinearSVC()
mlp_classifier = MLPClassifier()

estimators = [random_forest_classifier, extra_trees_classifier, svm_classifier, mlp_classifier]
for estimator in estimators:
    estimator.fit(x_train, y_train)

We fit the models with the training data.

In [None]:
[estimator.score(x_val, y_val) for estimator in estimators]

We check the validation accuracy of each model.

The linear SVC lags behind the others, scoring well below 95%.

In [None]:
grouped_classifiers = [
    ("random_forest_classifier", random_forest_classifier),
    ("extra_trees_classifier", extra_trees_classifier),
    ("svm_classifier", svm_classifier),
    ("mlp_classifier", mlp_classifier),
]

voting_classifier = VotingClassifier(grouped_classifiers)
voting_classifier.fit(x_train, y_train)
voting_classifier.score(x_val, y_val)

Grouping the classifiers into a voting classifier gives a validation accuracy of 0.9681.

Not bad: that is better than any of the individual classifiers.

In [None]:
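# "drop" removes the estimator from the ensemble (recent scikit-learn versions no longer accept None)
voting_classifier.set_params(svm_classifier="drop")
# Also remove the already fitted SVC so the ensemble does not need refitting
del voting_classifier.estimators_[2]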

Let's try getting rid of the SVC to see if our model improves. Dropping it also prepares us for soft voting, since LinearSVC does not provide predict_proba.

In [None]:
voting_classifier.estimators
voting_classifier.estimators_

In [None]:
voting_classifier.voting = "soft"

Let's use soft voting and predict the class with the highest probability, averaged over all the individual classifiers.

It often does better than hard voting because it gives more weight to highly confident votes.
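The difference between the two voting modes can be seen on a single sample. This is a minimal sketch in plain Python; `soft_vote` and the probability values are invented for illustration and assume every classifier outputs one probability per class.

```python
def soft_vote(probabilities):
    """Average per-class probabilities over classifiers and pick the best class.

    `probabilities` holds one list per classifier with one probability per class.
    Returns the winning class index and the averaged probabilities.
    """
    n_classifiers = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [
        sum(p[c] for p in probabilities) / n_classifiers
        for c in range(n_classes)
    ]
    return averaged.index(max(averaged)), averaged

# One sample, two classes, three hypothetical classifiers
probabilities = [
    [0.55, 0.45],  # weakly favors class 0
    [0.60, 0.40],  # weakly favors class 0
    [0.05, 0.95],  # strongly favors class 1
]

soft_label, averaged = soft_vote(probabilities)

# Hard voting on the same sample: each classifier votes for its top class
votes = [p.index(max(p)) for p in probabilities]
hard_label = max(sorted(set(votes)), key=votes.count)

print(hard_label, soft_label)  # 0 1
```

Hard voting picks class 0 because two of three classifiers lean that way, while soft voting picks class 1 because the third classifier is far more confident than the other two.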

In [None]:
voting_classifier.score(x_val, y_val)

We get 0.9679, about the same as hard voting (0.9681).

Let's try our test set.

In [None]:
voting_classifier.score(x_test, y_test)

We get 0.9657, which isn't as good as on our validation set but is still pretty good.

In [None]:
[estimator.score(x_test, y_test) for estimator in voting_classifier.estimators_]

On its own, our Extra Trees Classifier is still better with 0.9723.

Let's try making a stacking ensemble with a blender to see if our model improves.

We take the classifiers from earlier to make predictions on the validation set and make a new training set with the predictions.
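Before applying this to MNIST, here is a toy sketch of the blending idea in plain Python. The blender here is a simple lookup table, a hypothetical stand-in chosen for brevity (the notebook itself uses a random forest), and the held-out predictions and labels are made up.

```python
from collections import Counter, defaultdict

def fit_blender(base_predictions, true_labels):
    """Learn the most frequent true label for each tuple of base predictions."""
    table = defaultdict(Counter)
    for preds, label in zip(base_predictions, true_labels):
        table[tuple(preds)][label] += 1
    return {preds: counts.most_common(1)[0][0] for preds, counts in table.items()}

def blend(blender, preds, fallback):
    """Predict with the blender, using `fallback` for unseen prediction tuples."""
    return blender.get(tuple(preds), fallback)

# Held-out rows: (prediction of model A, prediction of model B)
holdout_predictions = [(1, 7), (1, 7), (1, 1), (7, 7), (1, 7)]
holdout_labels = [7, 7, 1, 7, 7]

blender = fit_blender(holdout_predictions, holdout_labels)
result = blend(blender, (1, 7), fallback=1)
print(result)  # 7
```

When the two base models disagree as (1, 7), the blender has learned from the held-out data that model B tends to be right, which a plain majority vote could not express.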

In [None]:
x_validation_predictions = np.empty((len(x_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    x_validation_predictions[:, index] = estimator.predict(x_val)

x_validation_predictions

In [None]:
random_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True)
random_forest_blender.fit(x_validation_predictions, y_val)
random_forest_blender.oob_score_

We just trained our blender and created a stacking ensemble.

Let's try it out on our test set and compare it to our voting classifier.

In [None]:
x_test_predictions = np.empty((len(x_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    x_test_predictions[:, index] = estimator.predict(x_test)
    
y_pred = random_forest_blender.predict(x_test_predictions)
accuracy_score(y_test, y_pred)

Our accuracy score is 0.9682, which is slightly better than our voting classifier's accuracy of 0.9657.

Great!