## Exercises

1. If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?

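Yes. You can combine them into a voting ensemble, which will often perform better than the best individual model. Each model has its own biases and makes its own mistakes on particular inputs; as long as those mistakes are not strongly correlated, the majority vote tends to be correct even when an individual model is occasionally wrong. The ensemble works best when the models are very different from one another (for example, an SVM classifier, a Decision Tree classifier, and a Logistic Regression classifier).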
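The effect of majority voting can be checked with a small simulation. The sketch below assumes an idealized setting that the exercise does not guarantee: a binary task, five classifiers that are each correct 95% of the time, and errors that are fully independent. The function name `simulate_majority_vote` is illustrative.

```python
import random


def simulate_majority_vote(n_models=5, accuracy=0.95, n_trials=100_000, seed=42):
    """Estimate the accuracy of a majority vote over independent binary classifiers."""
    rng = random.Random(seed)
    votes_needed = n_models // 2 + 1  # strict majority
    ensemble_correct = 0
    for _ in range(n_trials):
        # Each model is right with probability `accuracy`, independently of the others.
        n_correct = sum(rng.random() < accuracy for _ in range(n_models))
        if n_correct >= votes_needed:
            ensemble_correct += 1
    return ensemble_correct / n_trials


ensemble_accuracy = simulate_majority_vote()
print(f"Majority-vote accuracy: {ensemble_accuracy:.4f}")
```

Under these assumptions the exact majority-vote accuracy is about 0.9988. Models trained on the same data tend to make correlated errors, so the gain in practice is usually smaller.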

2. What is the difference between hard and soft voting classifiers?

A hard voting classifier counts the votes of each classifier in the ensemble and picks the class with the most votes. A soft voting classifier averages the estimated class probabilities across the classifiers and picks the class with the highest average probability. This gives high-confidence votes more weight and often performs better, but it is only possible if every classifier in the ensemble can estimate class probabilities.
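The two voting schemes can disagree on the same instance. The following sketch uses made-up class probabilities from three hypothetical classifiers (the numbers are assumptions chosen for illustration) and computes both decisions by hand.

```python
from collections import Counter

# Hypothetical predicted probabilities for classes [0, 1] on a single instance.
probas = [
    [0.45, 0.55],  # classifier A: votes class 1, barely
    [0.40, 0.60],  # classifier B: votes class 1
    [0.95, 0.05],  # classifier C: votes class 0, very confidently
]
n_classes = len(probas[0])

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = [max(range(n_classes), key=p.__getitem__) for p in probas]
hard_winner = Counter(hard_votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then pick the largest.
mean_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_winner = max(range(n_classes), key=mean_proba.__getitem__)

print("hard:", hard_winner, "soft:", soft_winner)
```

Hard voting picks class 1 (two votes to one), while soft voting picks class 0 because the single confident classifier pulls the average probability of class 0 up to 0.6.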

3. Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, Random Forests, or stacking ensembles?

Since every predictor in a bagging ensemble is trained independently of the others, training can be sped up by distributing it across multiple servers. The same holds for pasting ensembles and Random Forests, for the same reason. Stacking ensembles can be partially distributed: the predictors within a given layer are independent of each other and can be trained in parallel, but each layer can only be trained once all the predictors in the previous layer have been trained. Boosting ensembles gain little from distribution, because each predictor is built to correct its predecessor, so training is inherently sequential.
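On a single machine, the same independence is what lets Scikit-Learn train bagging predictors in parallel. The sketch below uses CPU cores as a stand-in for servers; the toy dataset and hyperparameter values are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset (an assumption for illustration, not the exercise's data).
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,   # sample training instances with replacement (bagging)
    n_jobs=2,         # train the independent trees in two parallel workers
    random_state=42,
)
bag_clf.fit(X, y)
print(len(bag_clf.estimators_))
```

Boosting classes such as `AdaBoostClassifier` have no `n_jobs` parameter for the same reason given above: each predictor depends on the one before it.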

4. What is the benefit of out-of-bag evaluation?

Out-of-bag evaluation evaluates each predictor in a bagging ensemble on the instances it was not trained on (the ones left out of its bootstrap sample). This gives a fairly unbiased estimate of the ensemble's performance without setting aside a separate validation set, so more instances are available for training.
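To get a feel for how many instances each predictor never sees, the sketch below draws one bootstrap sample and measures the out-of-bag fraction. It assumes the bootstrap sample has the same size as the training set; the function name `oob_fraction` is illustrative.

```python
import random


def oob_fraction(n_instances=10_000, seed=42):
    """Fraction of instances that are never drawn in one bootstrap sample."""
    rng = random.Random(seed)
    # A bootstrap sample draws n_instances indices with replacement.
    sampled = {rng.randrange(n_instances) for _ in range(n_instances)}
    return 1 - len(sampled) / n_instances


fraction = oob_fraction()
print(f"Out-of-bag fraction: {fraction:.3f}")
```

For large training sets this fraction approaches 1/e, roughly 37%, so each predictor has a sizeable held-out set of its own.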

5. What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?

Like Random Forests, Extra-Trees consider only a random subset of the features at each split. However, while Random Forests search for the best threshold for each candidate feature, Extra-Trees use a random threshold for each feature. This extra randomness acts as a form of regularization: if a Random Forest overfits the training data, Extra-Trees may perform better. Because they skip the threshold search, Extra-Trees are faster to train than Random Forests, but they are neither faster nor slower when making predictions.
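The difference in how a threshold is chosen for one feature can be sketched in a few lines. This is a simplified illustration on a toy binary problem, not Scikit-Learn's implementation; the helper names and data are assumptions.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2 * p1 * (1 - p1)


def weighted_gini(values, labels, threshold):
    """Impurity of the two children produced by splitting at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Random Forest style: search every midpoint between sorted values."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: weighted_gini(values, labels, t))


def random_threshold(values, rng):
    """Extra-Trees style: one uniform draw in the feature's range, no search."""
    return rng.uniform(min(values), max(values))


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]

best_t = best_threshold(values, labels)
rand_t = random_threshold(values, random.Random(42))
print(best_t, weighted_gini(values, labels, best_t))
print(rand_t, weighted_gini(values, labels, rand_t))
```

The searched threshold is never worse than the random one on the training data, but it costs one impurity evaluation per candidate, which is where the training-time difference comes from.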

6. If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how?

If your AdaBoost ensemble underfits the training data, you can increase the number of estimators or reduce the regularization hyperparameters of the base estimator. You can also try slightly increasing the learning rate.
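As a rough illustration of these knobs, the sketch below fits two AdaBoost ensembles of decision stumps on a toy dataset and compares their training accuracy. The dataset and the specific hyperparameter values are assumptions picked for the example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset (an assumption for illustration).
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Few, heavily shrunk decision stumps: likely to underfit.
small_ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=5, learning_rate=0.5, random_state=42)
small_ada.fit(X, y)

# More estimators and a higher learning rate: fits the training set more closely.
large_ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200, learning_rate=1.0, random_state=42)
large_ada.fit(X, y)

small_train_acc = small_ada.score(X, y)
large_train_acc = large_ada.score(X, y)
print(small_train_acc, large_train_acc)
```

Raising `max_depth` on the base estimator is another way to relax regularization; higher training accuracy should still be checked against a validation set to avoid swinging into overfitting.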

7. If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?

You should decrease the learning rate. You can also use early stopping to find the right number of predictors, since an overfitting ensemble probably has too many.
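Both remedies can be combined in Scikit-Learn. The sketch below assumes a small synthetic regression task and uses the `n_iter_no_change` and `validation_fraction` parameters of `GradientBoostingRegressor` for built-in early stopping; the data and hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Noisy quadratic data (an assumption for illustration).
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(300, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(300)

gbrt = GradientBoostingRegressor(
    max_depth=2,
    learning_rate=0.05,       # lower learning rate: each tree contributes less
    n_estimators=500,         # upper bound on the number of trees
    n_iter_no_change=10,      # stop once the validation score stalls for 10 rounds
    validation_fraction=0.2,  # share of the training data held out internally
    random_state=42,
)
gbrt.fit(X, y)
print("Trees actually trained:", gbrt.n_estimators_)
```

After fitting, `n_estimators_` reports how many trees were kept, which may be well below the `n_estimators` upper bound when early stopping triggers.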

8. Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g. use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [5]:
import numpy as np
from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays instead of a pandas DataFrame
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)

In [6]:
from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)

In [7]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(random_state=42)
mlp_clf = MLPClassifier(random_state=42)

In [8]:
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    estimator.fit(X_train, y_train)



In [9]:
[estimator.score(X_val, y_val) for estimator in estimators]

[0.9692, 0.9715, 0.8626, 0.9606]

In [11]:
from sklearn.ensemble import VotingClassifier

named_estimators = [
    ('random_forest_clf', random_forest_clf),
    ('extra_trees_clf', extra_trees_clf),
    ('svm_clf', svm_clf),
    ('mlp_clf', mlp_clf)
]

In [12]:
voting_clf = VotingClassifier(named_estimators)
voting_clf.fit(X_train, y_train)



VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_lea

In [13]:
voting_clf.score(X_val, y_val)

0.9713

In [14]:
[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]

[0.9692, 0.9715, 0.8626, 0.9606]

In [15]:
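# Drop the SVM, the weakest classifier on the validation set
# ("drop" replaces None, which recent Scikit-Learn versions no longer accept)
voting_clf.set_params(svm_clf="drop")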

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_lea

In [16]:
voting_clf.estimators

[('random_forest_clf',
  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                         criterion='gini', max_depth=None, max_features='auto',
                         max_leaf_nodes=None, max_samples=None,
                         min_impurity_decrease=0.0, min_impurity_split=None,
                         min_samples_leaf=1, min_samples_split=2,
                         min_weight_fraction_leaf=0.0, n_estimators=100,
                         n_jobs=None, oob_score=False, random_state=42, verbose=0,
                         warm_start=False)),
 ('extra_trees_clf',
  ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fracti

In [18]:
voting_clf.estimators_

[RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=None, oob_score=False, random_state=42, verbose=0,
                        warm_start=False),
 ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                      criterion='gini', max_depth=None, max_features='auto',
                      max_leaf_nodes=None, max_samples=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs

In [19]:
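# set_params() does not retrain the ensemble, so also remove the
# already-fitted SVM (index 2) from the list of trained estimators
del voting_clf.estimators_[2]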

voting_clf.score(X_val, y_val)

0.9737

In [20]:
# Try soft voting: the three remaining classifiers can all estimate class probabilities
voting_clf.voting = 'soft'

voting_clf.score(X_val, y_val)

0.97

In [21]:
# Hard voting scored higher on the validation set, so use it on the test set
voting_clf.voting = 'hard'

voting_clf.score(X_test, y_test)

0.9711

In [22]:
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

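[0.9645, 0.9691, 0.9586]

On the test set, the hard voting ensemble reaches 97.11% accuracy, against 96.91% for the best individual classifier (Extra-Trees), so the voting classifier reduces the error rate only slightly.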

9. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier you trained earlier?

In [23]:
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)

In [24]:
X_val_predictions

array([[5., 5., 5., 5.],
       [8., 8., 8., 8.],
       [2., 2., 2., 2.],
       ...,
       [7., 7., 7., 7.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]], dtype=float32)

In [25]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, 
                                            oob_score=True, random_state=4)
rnd_forest_blender.fit(X_val_predictions, y_val)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=True, random_state=4, verbose=0,
                       warm_start=False)

In [26]:
rnd_forest_blender.oob_score_

0.97

In [27]:
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

In [28]:
y_pred = rnd_forest_blender.predict(X_test_predictions)

In [29]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

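0.9669

The blender reaches 96.69% accuracy on the test set, slightly below the hard voting classifier trained earlier (97.11%) and below the best individual classifier (Extra-Trees, 96.91%).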
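Scikit-Learn 0.22 and later also provide a `StackingClassifier` that automates the blender workflow. The sketch below is an alternative to the manual approach above, not a reproduction of it: it uses the small `load_digits` dataset instead of MNIST to keep it quick, only two base classifiers, and a Logistic Regression blender, all of which are assumptions made for illustration. Unlike the manual version, it builds the blender's training set from cross-validated predictions and feeds it class probabilities rather than hard predictions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small 8x8 digits dataset (a stand-in for MNIST, chosen for speed).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # the blender
    cv=3,  # out-of-fold predictions are used to train the blender
)
stacking_clf.fit(X_tr, y_tr)

stacking_accuracy = stacking_clf.score(X_te, y_te)
print(f"Stacking accuracy: {stacking_accuracy:.4f}")
```

Because the blender is trained on out-of-fold predictions, no separate validation set has to be held back for it.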