# Hands on Machine Learning - Chapter 07 - Ensemble Methods - Exercises

# Exercise 8 - Voting Classifier

Load the MNIST data and split it into training, test, and validation sets (50k/10k/10k). 

Then train various classifiers, such as a Random Forest classifier, an Extra Trees classifier, and an SVM.

Next, try to combine them into an ensemble that outperforms them all on the validation set using a soft or hard voting classifier. 

Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers? 

## Part 1 - Loading `MNIST` Data

In [0]:
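import numpy as np    # used for the prediction arrays in Exercise 9
from sklearn.datasets import fetch_openml
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score    # used to score the blender in Exercise 9
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC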

# Multilayer Perceptron classifier from sklearn's neural network module
from sklearn.neural_network import MLPClassifier

In [0]:
# Fetch the MNIST data with examples as 784-dimensional feature vectors
mnist = fetch_openml(name='mnist_784', version=1)

In [0]:
# Splitting into training/val and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(mnist.data, mnist.target, 
                                                            test_size=10000, random_state=42)

In [0]:
# Further splitting the training/val set into distinct training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, 
                                                  test_size=10000, random_state=42)

In [0]:
# Confirming that train/test/val sets have the right number of samples 
NUM_TRAIN = 50000
NUM_TEST = 10000
NUM_VAL = 10000
NUM_FEATURES = 784

# Training set 
assert len(X_train) == NUM_TRAIN 
assert len(y_train) == NUM_TRAIN
assert X_train.shape[-1] == NUM_FEATURES

# Test set 
assert len(X_test) == NUM_TEST
assert len(y_test) == NUM_TEST 
assert X_test.shape[-1] == NUM_FEATURES

# Validation set 
assert len(X_val) == NUM_VAL
assert len(y_val) == NUM_VAL
assert X_val.shape[-1] == NUM_FEATURES

## Part 2 - Training Individual Models

In [0]:
# Instantiate
svm_clf = LinearSVC(random_state=42)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
et_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
mlp_clf = MLPClassifier(random_state=42)

In [38]:
# Combine into list 
estimators = [svm_clf, rf_clf, et_clf, mlp_clf]

# Train all models individually
for estimator in estimators:
  print("Training estimator", estimator)
  estimator.fit(X_train, y_train)

Training estimator LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
          verbose=0)




Training estimator RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
Training estimator ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
     

In [39]:
# Evaluating individual estimators on the validation set
[estimator.score(X_val, y_val) for estimator in estimators]

[0.8397, 0.9692, 0.9715, 0.9639]

Order of scores: `[SVM, RF, ET, MLP]`
- SVM is outperformed by all other models, probably because a linear decision boundary introduces too much bias: it is not possible to fit linear decision boundaries that cleanly separate all 10 digits from each other. The pixel values were also left unscaled, which may have hurt `LinearSVC` further.
- RF has good performance. It reduces variance at the cost of slightly higher bias by training a large number of `DecisionTreeClassifier`s with bagging and, at each split, searching for the best feature only within a random subset of the features.
- ET introduces even more randomness: instead of searching for the best threshold for each candidate feature, it draws a random threshold for each one and keeps the best of these random splits. Surprisingly, this leads to better performance compared to a random forest with the same number of trees.
- MLP is non-linear, and is our first (unofficial) deep learning model, so it outperforms the SVM. It underperforms RF and ET, possibly because it is a single model rather than an ensemble, and thus does not benefit from 'the wisdom of the crowd'.


I am still keeping `LinearSVC` because it is very different from all the other classifiers, and thus unlikely to make the same errors as them. As such, it may improve scores in an ensemble.
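The idea that a very different classifier makes different errors can be checked by counting the errors that each pair of classifiers shares. The sketch below uses made-up predictions on six samples (the names `forest`, `trees`, and `linear` and all the numbers are hypothetical, not MNIST results); it only illustrates the bookkeeping.

```python
from itertools import combinations

# Hypothetical toy predictions from three classifiers on six samples
# (illustrative numbers only, not results from this notebook)
y_true = [5, 8, 2, 7, 6, 7]
predictions = {
    "forest": [5, 8, 2, 7, 6, 1],
    "trees":  [5, 8, 2, 7, 6, 1],
    "linear": [8, 8, 2, 7, 6, 7],
}


def error_set(y_pred, y_true):
    """Return the indices of the samples a classifier gets wrong."""
    return {i for i, (p, t) in enumerate(zip(y_pred, y_true)) if p != t}


errors = {name: error_set(pred, y_true) for name, pred in predictions.items()}

# Number of errors shared by each pair of classifiers:
# fewer shared errors means the pair is more diverse
shared = {
    (a, b): len(errors[a] & errors[b])
    for a, b in combinations(predictions, 2)
}
print(shared)
```

In this toy case the two tree-based models fail on the same sample, while the linear model fails on a different one, so a majority vote over all three would correct both errors.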

## Part 3 - Ensemble Classifier - Hard Voting

In [0]:
# List of tuples linking each estimator to a string of its name
named_estimators = [
        ('random_forest_clf', rf_clf), 
        ('extra_trees_clf', et_clf), 
        ('mlp_clf', mlp_clf),
        ('svm_clf', svm_clf),
]

In [0]:
# Build a hard voting classifier using this list of tuples
voting_clf = VotingClassifier(named_estimators)

In [73]:
# Fit the voting classifier to data
voting_clf.fit(X_train, y_train)



VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_lea

In [75]:
# Evaluating on validation set
voting_clf.score(X_val, y_val)

0.9706

### Effect of SVM 

Does removing the `LinearSVC` from the ensemble improve performance? 

In [76]:
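# 'drop' removes an estimator from the ensemble (passing None is no longer supported in newer scikit-learn versions)
voting_clf.set_params(svm_clf='drop')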

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_lea

In [81]:
# This will have updated the list of estimators in the voting classifier
voting_clf.estimators

[('random_forest_clf',
  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                         criterion='gini', max_depth=None, max_features='auto',
                         max_leaf_nodes=None, max_samples=None,
                         min_impurity_decrease=0.0, min_impurity_split=None,
                         min_samples_leaf=1, min_samples_split=2,
                         min_weight_fraction_leaf=0.0, n_estimators=100,
                         n_jobs=None, oob_score=False, random_state=42, verbose=0,
                         warm_start=False)),
 ('extra_trees_clf',
  ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fracti

In [82]:
# But the list of trained estimators has not been updated!
voting_clf.estimators_    # _ at the end means list of trained estimators

[RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=None, oob_score=False, random_state=42, verbose=0,
                        warm_start=False),
 ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                      criterion='gini', max_depth=None, max_features='auto',
                      max_leaf_nodes=None, max_samples=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs

In [0]:
# Delete the svm classifier from the list of trained classifiers
del voting_clf.estimators_[-1]

In [87]:
# Confirming that I deleted the right estimator
voting_clf.estimators_

[RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=None, oob_score=False, random_state=42, verbose=0,
                        warm_start=False),
 ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                      criterion='gini', max_depth=None, max_features='auto',
                      max_leaf_nodes=None, max_samples=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs

In [88]:
# Did it improve validation score?
voting_clf.score(X_val, y_val)

0.9736
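Deleting an entry from `estimators_` by hand avoids retraining, which matters on MNIST, but it edits fitted state directly. A cleaner (if slower) alternative is to refit after dropping the estimator. The sketch below shows this on the small bundled iris dataset; the estimator names and the choice of `LogisticRegression` are placeholders for illustration, not the models used above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Small bundled dataset so the sketch runs quickly (assumption: stands in for MNIST)
X, y = load_iris(return_X_y=True)

demo_clf = VotingClassifier([
    ("forest", RandomForestClassifier(n_estimators=10, random_state=42)),
    ("tree", DecisionTreeClassifier(random_state=42)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
demo_clf.fit(X, y)
n_before = len(demo_clf.estimators_)

# Dropping an estimator only changes the configuration...
demo_clf.set_params(logreg="drop")
n_after_set_params = len(demo_clf.estimators_)

# ...the fitted list is rebuilt only when fit is called again
demo_clf.fit(X, y)
n_after_refit = len(demo_clf.estimators_)

print(n_before, n_after_set_params, n_after_refit)
```

Refitting retrains every remaining estimator, so on a dataset the size of MNIST the manual deletion used above is a reasonable shortcut.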

## Part 4 - Soft Voting Classifier

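A soft voting classifier makes predictions based on class probabilities rather than majority class votes. For each example, it averages the predicted class probabilities across all classifiers and predicts the class with the highest average probability. This requires every estimator to implement `predict_proba`, which `LinearSVC` does not, so soft voting is only possible here because the SVM has been removed.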
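To make the difference between the two voting schemes concrete, here is a minimal pure-Python sketch for a single sample. The probabilities are invented for illustration (three classifiers, three classes) and are not outputs of the models above.

```python
from collections import Counter

# Hypothetical class probabilities from three classifiers for ONE sample
probas = [
    [0.45, 0.55, 0.00],   # classifier 1 narrowly prefers class 1
    [0.40, 0.60, 0.00],   # classifier 2 narrowly prefers class 1
    [0.95, 0.05, 0.00],   # classifier 3 is very confident in class 0
]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Hard voting: each classifier casts one vote for its most likely class
votes = [argmax(p) for p in probas]
hard_prediction = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then take the argmax
n_classifiers = len(probas)
n_classes = len(probas[0])
mean_proba = [
    sum(p[k] for p in probas) / n_classifiers for k in range(n_classes)
]
soft_prediction = argmax(mean_proba)

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Hard voting picks class 1 (two votes to one), while soft voting picks class 0 because the one confident classifier outweighs the two hesitant ones. Neither scheme is guaranteed to be better; it depends on how well calibrated the probabilities are.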

In [0]:
# Simply set voting parameter to soft 
voting_clf.voting = 'soft'

In [90]:
# Evaluate on the validation set again
voting_clf.score(X_val, y_val)

0.97

## Part 5 - Test Set Performance

In [0]:
# Resetting voting method to `hard` because majority votes led to better val set performance
voting_clf.voting = 'hard'

In [103]:
# Evaluate on the test set
voting_clf.score(X_test, y_test)

0.9704

In [97]:
# Compare with results of individual estimators
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

[0.0, 0.0, 0.0]

In [0]:
fitted_estimators = voting_clf.estimators_[:]

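Throughout this exercise, I was unable to get non-zero scores from the fitted estimators, whether through a list comprehension or a for loop.

The likely cause is label encoding: `VotingClassifier` encodes the targets with a `LabelEncoder` before fitting its sub-estimators, so the estimators in `estimators_` predict integer codes (`0` to `9`), while `y_test` from `fetch_openml` holds strings (`'0'` to `'9'`). An integer never equals a string, so every prediction counts as wrong. Scoring the fitted estimators against `voting_clf.le_.transform(y_test)` should give their real accuracies.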
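The zero scores are consistent with how `VotingClassifier` handles labels: it fits its sub-estimators on integer-encoded targets and only decodes them in its own `predict`. The sketch below reproduces the effect on the bundled iris dataset with string labels; the dataset and the two estimators are stand-ins chosen to keep the example fast, not the models from this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y_int = load_iris(return_X_y=True)
y_str = y_int.astype(str)   # string labels, like the MNIST targets from fetch_openml

demo_clf = VotingClassifier([
    ("forest", RandomForestClassifier(n_estimators=10, random_state=42)),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
demo_clf.fit(X, y_str)

sub_estimator = demo_clf.estimators_[0]
sub_pred = sub_estimator.predict(X)    # integer codes produced by the label encoder
ensemble_pred = demo_clf.predict(X)    # decoded back to the original string labels
print(sub_pred.dtype, ensemble_pred.dtype)

# Score the sub-estimator against labels encoded the same way it was trained
y_encoded = demo_clf.le_.transform(y_str)
sub_accuracy = float(np.mean(sub_pred == y_encoded))
print(sub_accuracy)
```

Applied to this notebook, the same idea would be to score each estimator in `voting_clf.estimators_` against `voting_clf.le_.transform(y_test)` rather than against `y_test` directly.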

I found that 
- soft voting classifiers do not always outperform hard voting classifiers. In this case, a hard voting classifier had a validation set score of 0.9736, whereas a soft voting classifier had a score of 0.97.
- removing the `LinearSVC` classifier improved validation set performance.

# Exercise 9 - Stacking Ensemble

Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions. 

Each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. 

This is a **blender**, and together with the classifiers, it forms a stacking ensemble. 

Evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. 

How does it compare to the voting classifier you trained earlier?

## Part 1 - Individual Classifiers 

Will recreate and retrain the 4 classifiers from the previous exercise.

In [0]:
# Make new classifiers
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(random_state=42)
mlp_clf = MLPClassifier(random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)

In [0]:
# Wrap them up in a list
estimators = [rnd_clf, mlp_clf, extra_trees_clf, svm_clf]

In [114]:
# Fit them to the training data
for estimator in estimators:
  estimator.fit(X_train, y_train)



## Part 2 - Predictions as Features

- Ensemble voting classifier consists of four classifiers (before dropping the SVM).
- Each of these classifiers will be used to make a prediction (a predicted class from `0` - `9`). 
- This means the predictions will be a 4-D vector `[pred_1, pred_2, pred_3, pred_4]`.
- There will be one such prediction for each validation set example `m`.
  - Not the training set. This will ensure that predictions made by the first layer of estimators are 'clean'.
- So the matrix of predictions will have shape `(m, 4)`. 
- These predictions will be used as features for training a **meta model**: a model that learns how best to combine predictions made by individual predictors so that the ensemble's prediction is as accurate as possible.

In [0]:
# One row per validation set sample, one column per estimator
NUM_ESTIMATORS = len(estimators)
X_val_predictions = np.empty((NUM_VAL, NUM_ESTIMATORS), dtype=np.float32)

In [0]:
# Store predictions for each validation set sample in the predictions array
for index, estimator in enumerate(estimators):
  X_val_predictions[:, index] = estimator.predict(X_val)

In [117]:
# Confirming predictions
X_val_predictions

array([[5., 5., 5., 8.],
       [8., 8., 8., 8.],
       [2., 2., 2., 2.],
       ...,
       [7., 7., 7., 7.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]], dtype=float32)

## Part 3 - Training Meta Model 

Will create a `RandomForestClassifier` that will use the predictions made on the validation set as training data to learn the mapping from validation set predictions to validation set classes.

Since we don't have a separate hold-out set for the blender, and since the random forest estimator uses bagging (meaning each tree's bootstrap sample leaves out about 37% of the training samples on average), we can use out-of-bag evaluation to get validation scores: each sample is scored only by the trees that never saw it.
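The 37% figure comes from the bootstrap itself: when `m` samples are drawn with replacement from `m` samples, a given sample is missed with probability `(1 - 1/m)^m`, which approaches `1/e ≈ 0.368` as `m` grows. A short sketch that checks this both analytically and by simulation (the sample size and seed are arbitrary choices):

```python
import math
import random


def never_drawn_fraction(m):
    """Probability that a given sample is absent from a bootstrap sample of size m."""
    return (1 - 1 / m) ** m


# Analytical values approach 1/e as m grows
for m in (10, 100, 10_000):
    print(m, round(never_drawn_fraction(m), 4))

limit = math.exp(-1)
print("limit:", round(limit, 4))

# Simulation: draw one bootstrap sample and count the indices that were never picked
rng = random.Random(42)
m = 10_000
drawn = {rng.randrange(m) for _ in range(m)}
simulated = 1 - len(drawn) / m
print("simulated:", round(simulated, 4))
```

Note that this fraction is per tree. Across the 200 trees of the blender, nearly every sample is seen by some trees and left out by others, which is what makes the out-of-bag score usable.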

In [0]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)

In [120]:
# Fit to the data
rnd_forest_blender.fit(X_val_predictions, y_val)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

In [121]:
# What is the out-of-bag (validation) score?
rnd_forest_blender.oob_score_

0.97

## Part 4 - Evaluating Meta Model 

To evaluate the stacked model, will have to create predictions for samples on the test set using the ensemble's predictors.

These predictions will then be fed to the trained meta model. 

In [0]:
# Creating array for test set predictions
X_test_predictions = np.empty((NUM_TEST, NUM_ESTIMATORS), dtype=np.float32)

In [0]:
# Storing predictions
for index, estimator in enumerate(estimators):
  X_test_predictions[:, index] = estimator.predict(X_test)

In [0]:
# Use these as features for the blender 
y_pred = rnd_forest_blender.predict(X_test_predictions)

In [127]:
accuracy_score(y_test, y_pred)

0.9656

- On the validation data, the random forest blender (out-of-bag score of 0.97) did not beat the hard voting classifier (0.9736) and only matched the soft voting classifier (0.97).
- The same trend appeared on the test set: the blender scored 0.9656, compared to 0.9704 for the hard voting classifier.
- This goes to show that stacking does not always improve results.
- Sometimes, a simple hard voting classifier may be the best way to aggregate the predictions of an ensemble.