# Chapter 7 Answers

## Voting Classifier
Exercise: Load the MNIST data (introduced in Chapter 3), and split it into a training set, a
validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for
validation, and 10,000 for testing). Then train various classifiers, such as a Random
Forest classifier, an Extra-Trees classifier, and an SVM. Next, try to combine
them into an ensemble that outperforms them all on the validation set, using a
soft or hard voting classifier. Once you have found one, try it on the test set. How
much better does it perform compared to the individual classifiers?


In [38]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [2]:
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)

In [3]:
# Hold out 10,000 instances as the test set (60,000 remain)
x_train, x_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=10000)

In [4]:
# Hold out another 10,000 instances as the validation set (50,000 remain for training)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=10000)

In [6]:
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(max_iter=100, tol=20, random_state=42)
mlp_clf = MLPClassifier(random_state=42)

In [7]:
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(x_train, y_train)

Training the RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
Training the ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                 

In [11]:
# Validation accuracy, in the order: random forest, extra-trees, SVM, MLP
for estimator in estimators:
    print(estimator.score(x_val, y_val))

0.9674
0.9715
0.0982
0.9645
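
The four scores above are printed without labels, so it helps to pair them with the order of the `estimators` list. The sketch below copies the scores by hand from the output and flags any that sit near chance level. It assumes the ten digit classes are roughly balanced, so that guessing gives about 10% accuracy. The 0.05 margin is an arbitrary illustrative threshold.

```python
# Names follow the order of the `estimators` list defined earlier.
names = ["random_forest_clf", "extra_trees_clf", "svm_clf", "mlp_clf"]
# Validation accuracies copied from the output above.
val_scores = [0.9674, 0.9715, 0.0982, 0.9645]

# Assumption: 10 roughly balanced classes, so random guessing scores about 0.10.
chance_level = 1 / 10
margin = 0.05  # illustrative threshold, not taken from the notebook

near_chance = []
for name, score in zip(names, val_scores):
    is_near_chance = abs(score - chance_level) < margin
    if is_near_chance:
        near_chance.append(name)
    flag = "  <- near chance" if is_near_chance else ""
    print(f"{name:>18}: {score:.4f}{flag}")
```

Only `svm_clf` is flagged. A plausible cause, though not verified here, is the very loose `tol=20` combined with `max_iter=100` in its constructor, which can stop the solver long before it converges.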


In [14]:
# voting defaults to "hard" (majority vote on predicted classes)
voting_clf = VotingClassifier([
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
])

In [15]:
voting_clf.fit(x_train,y_train)

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_lea

In [16]:
voting_clf.score(x_val,y_val)

0.9731

In [19]:
[estimator.score(x_val, y_val) for estimator in voting_clf.estimators_]


[0.9674, 0.9715, 0.0982, 0.9645]

In [20]:
# Drop the fitted SVM (index 2), whose validation accuracy is only 0.0982
del voting_clf.estimators_[2]

In [21]:
voting_clf.score(x_val,y_val)

0.9695

In [23]:
# Switch to soft voting (average of predicted class probabilities)
voting_clf.voting = "soft"
voting_clf.score(x_val, y_val)

0.9695

In [24]:
# Hard voting scores higher than soft voting on the validation set (0.9731 vs. 0.9695)
voting_clf.voting = "hard"
voting_clf.score(x_val, y_val)

0.9731
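
The two voting modes compared above can disagree on the same instance. The toy sketch below uses made-up probabilities for three hypothetical classifiers and two classes (none of these numbers come from the notebook). Hard voting counts each classifier's predicted class, while soft voting averages the class probabilities first.

```python
from collections import Counter

# Hypothetical predicted probabilities for ONE instance: one row per classifier,
# one column per class. These values are invented for illustration.
probas = [
    [0.90, 0.10],  # classifier A: confident in class 0
    [0.40, 0.60],  # classifier B: leans towards class 1
    [0.45, 0.55],  # classifier C: leans towards class 1
]

def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = [argmax(p) for p in probas]
hard_prediction = Counter(hard_votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then pick the largest.
n_classes = len(probas[0])
mean_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_prediction = argmax(mean_proba)

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Here two mildly confident classifiers outvote one very confident classifier under hard voting, while soft voting lets the confident one win. Which mode scores higher depends on how well calibrated the probabilities are, so it is worth checking both on the validation set, as the cells above do.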

In [27]:
voting_clf.score(x_test,y_test)

0.971

In [29]:
[estimator.score(x_test, y_test) for estimator in voting_clf.estimators_]


[0.9656, 0.9696, 0.9643]

## Stacking Ensemble
Exercise: Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set.

In [30]:
X_val_predictions = np.empty((len(x_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(x_val)

In [31]:
X_val_predictions


array([[5., 5., 9., 5.],
       [6., 6., 9., 6.],
       [5., 5., 9., 5.],
       ...,
       [0., 0., 9., 0.],
       [5., 5., 9., 5.],
       [9., 9., 9., 9.]], dtype=float32)
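
In the rows displayed above, the third column (the SVM's predictions) is always 9. A column that never changes carries no information for the blender. The sketch below checks for such columns using only the six rows visible in the output, copied by hand, so it is an illustration rather than a check on the full array.

```python
# Rows copied from the visible part of the output above (the full array is elided).
visible_rows = [
    [5, 5, 9, 5],
    [6, 6, 9, 6],
    [5, 5, 9, 5],
    [0, 0, 9, 0],
    [5, 5, 9, 5],
    [9, 9, 9, 9],
]
# Column order follows the `estimators` list.
names = ["random_forest_clf", "extra_trees_clf", "svm_clf", "mlp_clf"]

# zip(*rows) transposes the rows into columns; a constant column has one distinct value.
constant_columns = [
    name
    for name, column in zip(names, zip(*visible_rows))
    if len(set(column)) == 1
]
print(constant_columns)
```

On the real array, an equivalent check would be to count the distinct values in each column of `X_val_predictions`, for example with `np.unique`.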

In [33]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

In [34]:
rnd_forest_blender.oob_score_


0.969

In [36]:
X_test_predictions = np.empty((len(x_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(x_test)

In [37]:
y_pred = rnd_forest_blender.predict(X_test_predictions)


In [39]:
accuracy_score(y_test, y_pred)


0.97

This stacking ensemble reaches 0.970 on the test set, slightly below the voting classifier trained earlier (0.971) and roughly on par with the best individual classifier (Extra-Trees, 0.9696).
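
One alternative worth comparing, not used in the notebook above, is scikit-learn's built-in `StackingClassifier` (available from version 0.22). It trains the blender on out-of-fold predictions, so no separate validation set has to be set aside for it. The following is a minimal sketch on the small `load_digits` dataset rather than MNIST, to keep it fast. The choice of base estimators, `n_estimators=50`, `cv=3`, and the logistic-regression blender are all illustrative assumptions, not settings from this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small 8x8 digits dataset as a stand-in for MNIST (assumption: chosen for speed only).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

stack_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ],
    # The blender is trained on cross-validated predictions of the base estimators.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack_clf.fit(X_tr, y_tr)

stack_accuracy = stack_clf.score(X_te, y_te)
print(stack_accuracy)
```

By default the base estimators pass class probabilities to the blender rather than hard class labels, which gives it more to work with than the single predicted digit per classifier used in the cells above.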

