In this project, we load the MNIST dataset and split it into a training set, a validation set, and a test set. We then perform ensemble learning: we train several classifiers (a Random Forest, an Extra-Trees classifier, a linear SVM, and an MLP) and combine them with a voting classifier.

In [1]:
from sklearn.datasets import fetch_openml

X_mnist, y_mnist = fetch_openml('mnist_784', return_X_y=True, as_frame=False)
X_train, y_train = X_mnist[:50_000], y_mnist[:50_000]
X_valid, y_valid = X_mnist[50_000:60_000], y_mnist[50_000:60_000]
X_test, y_test = X_mnist[60_000:], y_mnist[60_000:]

In [10]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

In [4]:
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=17)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=17)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=17)
mlp_clf = MLPClassifier(random_state=17)

In [5]:
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(random_state=17)
Training the ExtraTreesClassifier(random_state=17)
Training the LinearSVC(max_iter=100, random_state=17, tol=20)
Training the MLPClassifier(random_state=17)


In [9]:
[estimator.score(X_valid, y_valid) for estimator in estimators]

[0.9731, 0.9753, 0.8728, 0.9627]

Next, we combine the classifiers into an ensemble, using a soft or hard voting classifier, with the goal of outperforming each of them on the validation set.

In [11]:
voting_clf = VotingClassifier([
    ('random_forest_clf', random_forest_clf),
    ('extra_trees_clf', extra_trees_clf),
    ('svm_clf', svm_clf),
    ('mlp_clf', mlp_clf)
])

voting_clf.fit(X_train, y_train)

In [12]:
voting_clf.score(X_valid, y_valid)

0.975
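
With its default settings, `VotingClassifier` uses hard voting: each classifier casts one vote per sample and the most frequent label wins. The following is a minimal sketch of that idea on made-up predictions; the array values and the `hard_vote` helper are illustrative assumptions, not part of the notebook or of scikit-learn's API.

```python
import numpy as np

# Hypothetical predictions from three classifiers for four samples
# (rows = classifiers, columns = samples). The values are made up.
predictions = np.array([
    [3, 7, 1, 0],
    [3, 2, 1, 0],
    [5, 7, 1, 9],
])

def hard_vote(preds):
    """Return the most frequent label per sample (ties go to the smallest label)."""
    n_classes = preds.max() + 1
    # Count the votes each class receives for every sample (one column at a time).
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    # counts has shape (n_classes, n_samples); pick the class with the most votes.
    return counts.argmax(axis=0)

majority = hard_vote(predictions)
print(majority)
```

A single dissenting classifier is outvoted whenever the other two agree, which is why an ensemble can match or beat its best member even when it includes a weak model.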

In [20]:
import numpy as np

# The VotingClassifier trains its sub-estimators on label-encoded (integer) targets,
# so convert the string labels to integers before scoring the sub-estimators directly.
y_valid_encoded = y_valid.astype(np.int64)

In [26]:
[estimator.score(X_valid, y_valid_encoded) for estimator in voting_clf.estimators_]

[0.9731, 0.9753, 0.8728, 0.9627]
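
The conversion to integers matters because the MNIST targets are strings, while the fitted sub-estimators predict integer class indices. Here is a small sketch of the mismatch, using made-up labels and a hypothetical `accuracy` helper (a plain element-wise comparison, not scikit-learn's scorer):

```python
import numpy as np

# Made-up labels in the same form as the MNIST targets: strings in an object array.
y_str = np.array(['5', '0', '4', '1'], dtype=object)

# Hypothetical predictions from a sub-estimator trained on integer-encoded labels.
y_pred = np.array([5, 0, 4, 7])

def accuracy(y_true, y_hat):
    """Fraction of positions where label and prediction compare equal."""
    pairs = zip(y_true.tolist(), y_hat.tolist())
    return sum(t == p for t, p in pairs) / len(y_true)

acc_str = accuracy(y_str, y_pred)                    # a string never equals an integer
acc_int = accuracy(y_str.astype(np.int64), y_pred)   # labels converted to integers first
print(acc_str, acc_int)
```

Because the digit strings `'0'` to `'9'` sort in the same order as the integers 0 to 9, casting with `astype(np.int64)` yields the same values a label encoder would assign here.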

Let's remove the SVM to see if performance improves. It is possible to remove an estimator by setting it to `"drop"` using `set_params()` like this:

In [28]:
voting_clf.set_params(svm_clf='drop')

In [37]:
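# set_params() only updates the declared estimators; the fitted SVM is still in
# named_estimators_ and estimators_, so remove it from both instead of refitting.
svm_clf_trained = voting_clf.named_estimators_.pop('svm_clf')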

In [41]:
svm_clf_trained

In [48]:
voting_clf.estimators_.remove(svm_clf_trained)

In [50]:
voting_clf.estimators_

[RandomForestClassifier(random_state=17),
 ExtraTreesClassifier(random_state=17),
 MLPClassifier(random_state=17)]

In [51]:
voting_clf.score(X_valid, y_valid)

0.9764
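
The difference between the declared `estimators` list and the fitted `estimators_` list can be seen on a small synthetic dataset. This is a sketch under assumptions: the dataset, the estimator names (`forest`, `logreg`, `svm`), and the hyperparameters are chosen only for illustration, and refitting is shown as the alternative to editing `estimators_` by hand.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Small synthetic dataset so the example runs in well under a second.
X, y = make_classification(n_samples=200, n_features=10, random_state=17)

clf = VotingClassifier([
    ('forest', RandomForestClassifier(n_estimators=10, random_state=17)),
    ('logreg', LogisticRegression(max_iter=1000)),
    ('svm', LinearSVC(random_state=17)),
])
clf.fit(X, y)

# Dropping an estimator changes the declared list immediately...
clf.set_params(svm='drop')
n_declared = sum(not isinstance(est, str) for _, est in clf.estimators)

# ...but the fitted list still holds all three models until the next fit.
n_fitted_before = len(clf.estimators_)

clf.fit(X, y)
n_fitted_after = len(clf.estimators_)

print(n_declared, n_fitted_before, n_fitted_after)
```

Refitting is the simpler route on small data. On MNIST it would mean retraining every remaining model, which is why removing the trained SVM from the fitted lists is the cheaper option in this notebook.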

In [52]:
voting_clf.voting = 'soft'
voting_clf.score(X_valid, y_valid)

0.9684

Hard voting performs better here (0.9764 versus 0.9684 on the validation set), so we switch back to it for the test set.
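
Hard and soft voting can disagree because soft voting averages the predicted class probabilities instead of counting votes. A minimal sketch with made-up probabilities (not taken from the models above) shows one confident classifier overriding two hesitant ones:

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for a single sample
# (rows = classifiers, columns = classes). The values are made up.
probas = np.array([
    [0.45, 0.55, 0.00],   # votes for class 1, barely
    [0.40, 0.60, 0.00],   # votes for class 1, barely
    [0.95, 0.05, 0.00],   # votes for class 0, confidently
])

# Hard voting: each classifier votes for its most likely class, majority wins.
hard_winner = np.bincount(probas.argmax(axis=1)).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
soft_winner = probas.mean(axis=0).argmax()

print(hard_winner, soft_winner)
```

Soft voting therefore helps only when the probability estimates are well calibrated; a poorly calibrated but overconfident model can pull the average in the wrong direction.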

In [53]:
voting_clf.voting = "hard"
voting_clf.score(X_test, y_test)

0.9723

In [54]:
[estimator.score(X_test, y_test.astype(np.int64))
 for estimator in voting_clf.estimators_]

[0.9694, 0.9705, 0.9655]

The best individual model (Extra-Trees) has a test error rate of about 2.95%, and the voting classifier brings it down to about 2.77%, which is roughly 6% fewer errors.
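
The relative error reduction follows directly from the test accuracies reported above. A short sketch of the arithmetic, with variable names chosen for illustration:

```python
# Test-set accuracies copied from the outputs above.
best_single_acc = 0.9705   # Extra-Trees, the best individual model
voting_acc = 0.9723        # hard-voting ensemble without the SVM

# Error rate is the complement of accuracy.
best_err = 1 - best_single_acc
voting_err = 1 - voting_acc

# Share of the best single model's errors that the ensemble eliminates.
relative_reduction = (best_err - voting_err) / best_err

print(f"{best_err:.2%} -> {voting_err:.2%}, {relative_reduction:.1%} fewer errors")
```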