# Chapter 7: Ensemble Learning and Random Forests

<i>Ensemble Learning</i> is the technique of aggregating the decisions made by many Machine Learning models in order to get a final result. An ensemble learning algorithm is called an <i>Ensemble method</i>. An ensemble of Decision Trees is called a <i>Random Forest</i>. Models that win Machine Learning competitions often combine several Ensemble methods, e.g. the winner of the [Netflix Prize competition](http://netflixprize.com/).

## Voting Classifiers

A <i>hard voting</i> classifier is a model that train multiple classifiers and then predicts the class that gets the most votes. Even if the models are <i>weak learners</i> (slightly better than random guessing) then an ensemble could be a <i>strong learner</i> provided there are enough diverse models in the ensemble.

This is possible due to the fact that even if the models are just slightly better than random guessing, the more models' decisions you consider the more likely that the majority will select the correct class. However, this only is true if the models are different enough to not make the same errors while classifying data.

In [1]:
# Example of a VotingClassifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings("ignore")

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
  estimators=[('lr',log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
  voting='hard')

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
  clf.fit(X_train, y_train)
  print('{}:'.format(clf.__class__.__name__), clf.score(X_test, y_test))

LogisticRegression: 0.864
RandomForestClassifier: 0.88
SVC: 0.888
VotingClassifier: 0.896


## Bagging and Pasting

One way to use an Ensemble method is to train different types of classifiers, as shown above. Another is to train the same type of model on random subsets of the training set. Random sampling with replacement is called [bagging](http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf). Sampling without replacement is called [pasting](https://link.springer.com/article/10.1023/A:1007563306331).

Bagging and pasting classifiers using the statistical mode, just like hard voting classifiers. Regressors will tend to use the average. Each individual predictor has a higher bias on the whole training set, but aggregation reduces both bias and variance. Generally the bias remains similar to the bias of a single model but have a lower variance, so ensemble models are less likely to overfit the training data.

### Bagging and Pasting in Scikit-Learn

In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Setting bootstrap=False will have the classifier use pasting
# instead of bagging.
bag_clf = BaggingClassifier(
  DecisionTreeClassifier(), n_estimators=500,
  max_samples=100, bootstrap=True, n_jobs=-1)

bag_clf.fit(X_train, y_train)
bag_clf.score(X_test, y_test)

0.912

Bootstrapping introduces more diversity into the subsets that each predictor is trained on, so bagging ends up with slightly higher bias than pasting, but lower variance. Generally, bagging results in better models thant pasting.

### Out-of-Bag Evaluation

Since bagging randomly selects proper subsets of the training set to train each model, the instances not included in a particular subset used for training a single model is called an <i>out-of-bag</i> instance. You can evaluate an ensemble by taking an average of how each model does on its oob instances.

In [3]:
# An example of including the oob_score of a BaggingClassifier in evaluation.

bag_clf = BaggingClassifier(
  DecisionTreeClassifier(), n_estimators=500,
  bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.8986666666666666

In [4]:
# The score on the test set should be approximately the average oob_score_.

bag_clf.score(X_test, y_test)

0.912

In [5]:
# The oob_decision_function variable contains what the average probability
# of each instance belongs to each class when its an oob instance.

bag_clf.oob_decision_function_[:10]

array([[0.39408867, 0.60591133],
       [0.33      , 0.67      ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.10497238, 0.89502762],
       [0.35555556, 0.64444444],
       [0.00584795, 0.99415205],
       [1.        , 0.        ],
       [0.98265896, 0.01734104]])

## Random Patches and Random Subspaces

The `BaggingClassifier` also supports sampling the features as well, using the `max_features` and `bootstrap_features` hyperparameters. This is helpful for training sets with a large number of features.

Sampling both the training instances and features is called the <i>Random Patches</i> method. Keeping all training instances but sampling features is called the <i>Random Subspaces</i> method.

## Random Forest

A <i>Random Forest</i> is an ensemble of Decision Trees generally trained via the bagging method typically with `max_samples` set to the training set size.

In [6]:
# An example of training Scikit-Learn's RandomForest class.

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
rnd_clf.score(X_test, y_test)

0.92

With a few exceptions, `RandomForestClassifier` has all of the hyperparameters of both a `DecisionTreeClassifier` and `BaggingClassifier`.

In [0]:
# The following BaggingClassifier is equivalent to the previous
# RandomForestClassifier.

bag_clf = BaggingClassifier(
  DecisionTreeClassifier(splitter='random', max_leaf_nodes=16),
  n_estimators=500, max_samples=1., bootstrap=True, n_jobs=-1)

### Extra-Trees

When you train Random Forests, you can add additional randomness by having the Decision Trees use random thresholds for each feature rather than trying to find the optimal threshold. These forests are called [Extremely Randomized Trees](https://orbi.uliege.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf) (or <i>Extra-Trees</i> for short).

Extra-Trees take much less time to train and introduce less variance into the system for the price of more bias. It is difficult to tell whether a `RandomForestClassifier` or an `ExtraTreesClassifier` will perform better for a certain Machine Learning problem, so often you have to try both to see which is a better model to use.

### Feature Importance

Random Forest classifiers can also be used to determine <i>feature importance</i> which Scikit-Learn measures as a weighted average how often a feature is used to reduce a node's Gini impurity. The weights are how many training instances are associated in a node or its descendants.

Scikit-Learn's `RandomForestClassifier` computes the feature importances automatically during training, then scales the result so that the sum of the importances equals 1.

In [8]:
# An example of using RandomForestClassifier to determine and compare
# feature importances of a training set.

from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rnd_clf.fit(iris.data, iris.target)
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
  print(name, score)

sepal length (cm) 0.13251843668220378
sepal width (cm) 0.03417936630806347
petal length (cm) 0.4239420508436492
petal width (cm) 0.4093601461660837


## Boosting

<i>Boosting</i> refers to an Ensemble method which trains each model sequentially, learning from the mistakes of the past model. The two most popular boosting algorithms are [AdaBoost](https://www.sciencedirect.com/science/article/pii/S002200009791504X) (short for <i>Adaptive Boosting</i>) and <i>Gradient Boosting</i>.

### AdaBoost

Adaptive Boosting (AdaBoost) is a Ensemble method where each successive model puts more weight on training instances that its predecessor gets wrong. The Ensemble generalizes like models that use bagging and pasting, except the models decisions are weighted based on their accuracy on the training set.

Initially, each training instance's weight, $w^{\,(i)}$ is set to $\frac{1}{m}$. A first predictor is trained and its weighted error rate, $r_1$ is computed using the equation

$$ r_j = \frac{\underset{\large{\hat{y}_j^{\,(i)}\neq y^{\,(i)}}}{\sum\limits_{i=1}^m}w^{\,(i)}}{\sum\limits_{i=1}^m w^{\,(i)}} $$

where $\hat{y}_j^{\,(i)}$ is the $j$<sup>th</sup> predictor's prediction on $i$<sup>th</sup> instance. The weight each predictor is given in the final result is given by

$$ \alpha_j = \eta \log{\frac{1-r_j}{r_j}} $$

where $\eta$ is the <i>learning rate</i> hyperparameter (defaults to 1). The more accurate a predictor is, the higher its weight will be. Random guessing yields a weight of 0, and classifiers which do worse than random guessing are given a negative weight.

The weights of each instances for training subsequent classifiers is given by

$$ w^{\,(i)} \leftarrow \left\{\begin{matrix}
w^{\,(i)} && \text{if} \;\; \hat{y}^{\,(i)} = y^{\,(i)} \\
w^{\,(i)}\exp\left(\alpha_j\right) && \text{if} \;\; \hat{y}^{\,(i)} \neq y^{\,(i)}
\end{matrix} \right. $$

and then are normalized so that their sum equals 1. The algorithm then repeats this process for each classifier, and predicts new instances by seeing which class gets the most weighted votes. The Ensemble stops training when either it finds a model that is a perfect predictor or if it trains the maximum number of models specified. The final prediction is given by

$$ \hat{y}\,(\mathbf{x}) = \underset{\large{k}}{\text{argmax}}
\underset{\large{\hat{y}_j(\mathbf{x})\,=\,k}}{\sum\limits_{j\,=\,1}^N} \alpha_j $$

where $N$ is the total number of predictors.

Scikit-Learn uses a multiclass variation of AdaBoost called [Stagewise Additive Modeling using a Multiclass Exponential log loss](https://web.stanford.edu/~hastie/Papers/samme.pdf) (SAMME) and a variant which relies on class probabilities and generally performs better called SAMME.R.

In [0]:
# An example of using AdaBoost with Scikit-Learn.

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
  DecisionTreeClassifier(max_depth=1), n_estimators=200,
  algorithm='SAMME.R', learning_rate=0.5).fit(X_train, y_train)

### Gradient Boosting

[Gradient Boosting](http://statistics.berkeley.edu/sites/default/files/tech-reports/486.pdf) is another boosting algorithm which tries to fit each new model to the <i>residual errors</i> made by the previous model. Gradient boosting with Decision Trees is known as <i>Gradient Tree Boosting</i>, or <i>Gradient Boosted Regression Trees</i> (GBRT).

Below is an example of working through a regression task using GBRT on a noisy quadratic dataset.

In [0]:
# Creating the dataset.

import numpy as np

m = 1000
X = 10 * np.random.rand(m, 1) - 5
y = (0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)).flatten()

In [14]:
# Training a single Decision Tree Regressor.

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree_reg1 = DecisionTreeRegressor(max_depth=2).fit(X_train, y_train)
y_pred = tree_reg1.predict(X_test)
mean_squared_error(y_test, y_pred) ** .5

2.243887427704452

In [15]:
# Training a second and third Decision Tree on the residual error of their
# predecessor. We see that this method improves the performance of the
# regressor.

y_train2 = y_train - tree_reg1.predict(X_train)
tree_reg2 = DecisionTreeRegressor(max_depth=2).fit(X_train, y_train2)

y_train3 = y_train2 - tree_reg2.predict(X_train)
tree_reg3 = DecisionTreeRegressor(max_depth=2).fit(X_train, y_train3)

y_pred = tree_reg1.predict(X_test) + tree_reg2.predict(X_test) + \
         tree_reg3.predict(X_test)
mean_squared_error(y_test, y_pred) ** .5

1.5995127185876792

In [16]:
# A simpler way is to train a Gradient Boosted Decision Tree
# using Sci-Kit Learn.

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.)
gbrt.fit(X_train, y_train)
y_pred = gbrt.predict(X_test)
mean_squared_error(y_test, y_pred) ** .5

1.5995127185876792

The `learning_rate` hyperparameter controls how much influence each tree has on the final decision. A smaller learning rate requires more trees to train but tends to generalize better.

In [17]:
# The staged_predict() method allows us to compare the MSE of the GBRT
# as a function of the number of trees used for training. This is used to
# implement early stopping.

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors)

gbrt_best = GradientBoostingRegressor(max_depth=2,
                                      n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=69, n_iter_no_change=None, presort='auto',
             random_state=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)

In [0]:
# Implementing early stopping during training.

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float('inf')
error_going_up = 0
for n_estimators in range(1, 120):
  gbrt.n_estimators = n_estimators
  gbrt.fit(X_train, y_train)
  y_pred = gbrt.predict(X_val)
  val_error = mean_squared_error(y_val, y_pred)
  if val_error < min_val_error:
    min_val_error = val_error
    error_going_up += 1
  else:
    error_going_up += 1
    if error_going_up == 5:
      break

The `GradientBoostingRegressor` class also supports a `subsample` hyperparameter which lets you select the size of a random subset of the training data to train each tree. This method is called <i>Stochastic Gradient Boosting</i>.

## Stacking

The final Ensemble method is called <i>stacking</i> (short for <i>stacked generalization</i>) where we train a model to aggregate the results of each individual predictor in the ensemble.

One method of stacking is to use a <i>hold-out set</i>. While training, split the training set into 2 subsets. The predictors in the ensemble are trained with the first subset. Then, the predictors make predictions on the second subset and the aggregation model is trained using the predictors' outputs as its input features.

It is possible to split the training set into more than 2 subsets and train multiple layers of predictors using each subset. For instance, you can split the training set into 3 subsets. The first layer of predictors are trained using the first subset. Then their predictions on the 2<sup>nd</sup> subset are used the train the 2<sup>nd</sup> layer of predictors. Finally both layers make predictions on the 3<sup>rd</sup> subset to train the aggregation model which makes the final predictions.

## Exercises

### 1. If you have trained five different models on the exact same training data and they all have achieved 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?

Yes, it is possible to aggregate the models to get a better result. If the models make mistakes on different training samples, then its possible that by aggregating the results with hard voting. But, if the models make mistakes on the same instances, aggregating the results will not improve the performance.

### 2. What is the difference between hard and soft voting classifiers?

Hard voting is when an ensemble of classifiers make predictions about the class of a new instance and then the model chooses the most frequent class. Soft voting is when the predictors in the ensemble output the probabilites that a new instance belongs to each class, and the model selects the prediction with the highest probability.

### 3. Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests, or stacking ensembles?

Yes, you can speed up bagging ensembles by distributing the work across multiple servers. Once the subsets are chosen, you can send copies of the data to each server, so even with replacement bagging can be distributed. Pasting ensembles can also be distributed across multiple servers.

Boosting ensembles cannot be improved by distributing the work. Since each predictor is trained successively based on the errors made by the previous one, each predictor needs to wait for the previous to finish in order.

Random forests, which are ensembles of decision trees trained using bagging, can have their performance improved by distributing the work.

Stacking ensembles can distribute the training of each layer of the ensemble across different servers, but each layer still needs to be trained in series.

### 4. What is the benefit of out-of-bag evaluation?

You can use out-of-bag evaluation in order to evaluate each predictor that is trained during each phase of a bagging ensemble.

### 5. What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?

Extra-Trees are more random than Random Forests because they use random decision thresholds. This extra randomness reduces the model's variance, which helps if you are overfitting the training data. Extra-Trees are faster to train than regular Random Forests since they do not need to find the optimal splitting threshold at each node.

### 6. If your AdaBoost ensemble underfits the training data, what hyperparameters should you tweak and how?

You can add more estimators to the AdaBoost ensemble. You can also try decreasing the learning rate of the ensemble so that the model more variance. Also if the current model is using the SAMME algorithm, you can try using the SAMME.R algorithm as well.

### 7. If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?

You should increase the learning rate, since it will reduce the model's variance.

### 8. Load the MNIST dat and split it into a training set, a validation set, and a test set. Then train various classifiers such as Random Forest, Extra-Trees, and an SVM. Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier. How much better does it perform on the test set than the individual classifiers?

In [0]:
# Fetching the MNIST data set.

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, cache=True)
mnist.target = mnist.target.astype(np.int8)
sorted_indices = np.argsort(mnist.target)
X = mnist.data[sorted_indices]
y = mnist.target[sorted_indices]
rand_idx = np.random.permutation(len(X))
X, y = X[rand_idx], y[rand_idx]

In [20]:
# Using RandomizedSearchCV to find the best parameters for a RandomForest

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

X_train_val, X_test, y_train_val, y_test = \
  train_test_split(X, y, random_state=42, test_size=10000)

X_train, X_val, y_train, y_val = \
  train_test_split(X_train_val, y_train_val, random_state=42, test_size=10000)

param_dist = {
  'n_estimators': randint(200, 300),
  'max_depth': randint(5, 20),
  'max_features': randint(50, 150)
}

rnd_search = RandomizedSearchCV(RandomForestClassifier(), param_dist,
                                n_iter=10, cv=3)
rnd_search.fit(X_train[:1000], y_train[:1000])

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid='warn', n_iter=10, n_jobs=None,
          param_distributions={'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fa3940ba390>, 'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fa3940ba1d0>, 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fa3940d6fd0>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=0)

In [21]:
rnd_search.best_score_

0.881

In [22]:
rnd_search.best_params_

{'max_depth': 12, 'max_features': 52, 'n_estimators': 258}

In [23]:
rf_clf = rnd_search.best_estimator_
rf_clf.fit(X_train, y_train)
rf_clf.score(X_train, y_train)

0.98984

In [24]:
# The model is still overfitting, but able to get 96% accuracy on the validation
# set.

rf_clf.score(X_val, y_val)

0.9615

In [25]:
# Now training an Extra-Tree using RandomizedSearchCV.

from sklearn.ensemble import ExtraTreesClassifier

param_dist = {
  'n_estimators': randint(2, 10),
  'max_depth': randint(2, 10),
  'max_features': randint(10, len(X[0])),
}

rnd_search = RandomizedSearchCV(ExtraTreesClassifier(), param_dist,
                                n_iter=10, cv=5)
rnd_search.fit(X_train[:1000], y_train[:1000])

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=10, n_jobs=None,
          param_distributions={'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fa3940c56d8>, 'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fa3940c5588>, 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fa3940c5400>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=0)

In [26]:
rnd_search.best_score_

0.821

In [27]:
rnd_search.best_params_

{'max_depth': 7, 'max_features': 247, 'n_estimators': 9}

In [28]:
et_clf = rnd_clf.base_estimator_
et_clf.fit(X_train, y_train)
et_clf.score(X_train, y_train)

1.0

In [29]:
et_clf.score(X_val, y_val)

0.8681

In [30]:
# Reusing the SVM from chapter 5 exercise 9.

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

svm_clf = Pipeline([
  ('scaler', StandardScaler()),
  ('svm', SVC(C=8.43923386369637, gamma=0.002606261161097229))
])
svm_clf.fit(X_train, y_train)
svm_clf.score(X_train, y_train)

0.99992

In [31]:
svm_clf.score(X_val, y_val)

0.9639

In [32]:
# Aggregating the results using a VotingClassifier.

v_clf = VotingClassifier(estimators=[('rf', rf_clf), ('et', et_clf),
                                     ('svm', svm_clf)],
                         voting='hard', n_jobs=-1)
v_clf.fit(X_train, y_train)
v_clf.score(X_train, y_train)

0.99994

In [33]:
v_clf.score(X_val, y_val)

0.9662

In [34]:
# Comparing the performance of all of the classifiers on the test set.
# The voting classifier has a better result than all of the individual models.

for name, estimator in [('Random Forest', rf_clf), ('Extra-Trees', et_clf),
                        ('SVC', svm_clf), ('Voting Classifier', v_clf)]:
  print('{}:'.format(name), estimator.score(X_test, y_test))

Random Forest: 0.9571
Extra-Trees: 0.8707
SVC: 0.9623
Voting Classifier: 0.9648


### 9. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions. For each image in the test set, make predictions with all of the classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier?

In [0]:
# Forming a test set using the individual models' predictions on the validation
# set.

X_blend_train = list(zip(
  rf_clf.predict(X_val),
  et_clf.predict(X_val),
  svm_clf.predict(X_val)))
y_blend_train = y_val

In [43]:
blend_clf = RandomForestClassifier(n_estimators=200)
blend_clf.fit(X_blend_train, y_blend_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [44]:
# We see the stack ensemble does slightly better on the test set than the voting
# classifier.

from sklearn.metrics import accuracy_score

X_blend_test = list(zip(
  rf_clf.predict(X_test),
  et_clf.predict(X_test),
  svm_clf.predict(X_test)))
y_pred = blend_clf.predict(X_blend_test)
accuracy_score(y_test, y_pred)

0.9664