# Ensemble Learning and Random Forests

If you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an <i>ensemble</i>; thus, this technique is called <i>Ensemble Learning</i>, and an Ensemble Learning algorithm is called an <i>Ensemble method</i>.

A <i>Random Forest</i> is one of the most powerful Machine Learning algorithms available today, and by design it is an ensemble of Decision Trees.

You will often use Ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor.

### Voting Classifiers

Suppose you have trained a few classifiers, each one achieving about 80% accuracy. A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes.  This majority-vote classifier is called a <i>hard voting</i> classifier.

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble.  In fact, even if each classifier is a <i>weak learner</i> (meaning it does only slightly better than random guessing), the ensemble can still be a <i>strong learner</i> (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse.

Suppose you build an ensemble containing 1,000 classifiers that are individually correct only 51% of the time (barely better than random guessing).  If you predict the majority voted class, you can hope for up to 75% accuracy!  However, this is only true if all classifiers are perfectly independent, making uncorrelated errors, which is clearly not the case because they are trained on the same data.
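
The arithmetic behind this claim can be checked with the binomial distribution. The sketch below is only an illustration of the idealized case: it assumes every classifier is correct with the same probability and that all errors are fully independent, and the helper name `majority_vote_accuracy` is made up for this example. It counts only strict majorities, so a tied vote is treated as a miss.

```python
import math


def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of independent voters is correct.

    Assumes each classifier is right with probability p_correct and that
    the classifiers make fully independent errors (the idealized case).
    """
    log_p = math.log(p_correct)
    log_q = math.log(1.0 - p_correct)
    total = 0.0
    # Sum the binomial probabilities of getting more than half the votes right.
    # Working in log space avoids overflow for large ensembles.
    for k in range(n_classifiers // 2 + 1, n_classifiers + 1):
        log_pmf = (
            math.lgamma(n_classifiers + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n_classifiers - k + 1)
            + k * log_p
            + (n_classifiers - k) * log_q
        )
        total += math.exp(log_pmf)
    return total


for n in (1, 100, 1_000, 10_000):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The probability climbs as the ensemble grows, which is the law of large numbers at work. Real classifiers trained on the same data are correlated, so actual gains are smaller than this calculation suggests.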

Ensemble methods work best when the predictors are as independent from one another as possible.  One way to get diverse classifiers is to train them using very different algorithms.  This increases the chance that they will make very different types of errors, improving the ensemble's accuracy.  

#### Example 1: Training a voting classifier in Scikit-Learn composed of 3 diverse classifiers

In [1]:
from sklearn.datasets import load_iris, make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# iris_ = load_iris()
# X = iris_.data[:, 3:]
# y = iris_.target

X, y = make_moons(n_samples=10000, noise=0.15)

log_clf_ = LogisticRegression()
rnd_clf_ = RandomForestClassifier()
svm_clf_ = SVC()

voting_clf_ = VotingClassifier(
    estimators=[('lr', log_clf_), ('rf', rnd_clf_), ('svc', svm_clf_)],
    voting='hard'
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Every estimator, including the voting classifier, is fitted inside the loop
for clf in (log_clf_, rnd_clf_, svm_clf_, voting_clf_):
    clf.fit(X_train, y_train)
    y_pred_ = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred_))

LogisticRegression 0.8739393939393939
RandomForestClassifier 0.9896969696969697
SVC 0.990909090909091
VotingClassifier 0.9896969696969697


If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers.  This is called <i>soft voting</i>.  It often achieves higher performance than hard voting because it gives more weight to highly confident votes.  All you need to do is replace voting='hard' with voting='soft' and ensure that all classifiers can estimate class probabilities.  This is not the case for the SVC class by default, so you need to set its probability hyperparameter to True.

In [2]:
from sklearn.datasets import load_iris, make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# iris_ = load_iris()
# X = iris_.data[:, 3:]
# y = iris_.target

X, y = make_moons(n_samples=10000, noise=0.15)

log_clf_ = LogisticRegression()
rnd_clf_ = RandomForestClassifier()
svm_clf_ = SVC(probability=True)

voting_clf_ = VotingClassifier(
    estimators=[('lr', log_clf_), ('rf', rnd_clf_), ('svc', svm_clf_)],
    voting='soft'
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Every estimator, including the voting classifier, is fitted inside the loop
for clf in (log_clf_, rnd_clf_, svm_clf_, voting_clf_):
    clf.fit(X_train, y_train)
    y_pred_ = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred_))

LogisticRegression 0.8718181818181818
RandomForestClassifier 0.99
SVC 0.9921212121212121
VotingClassifier 0.9903030303030304
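
To see how soft voting can disagree with hard voting, here is a tiny hand-worked sketch in plain Python. The probabilities are invented for illustration and do not come from the classifiers trained above.

```python
from collections import Counter

# Hypothetical predicted probabilities for one instance: each row is one
# classifier's [P(class 0), P(class 1)]. The numbers are made up.
probas = [
    [0.55, 0.45],  # classifier A leans slightly toward class 0
    [0.55, 0.45],  # classifier B leans slightly toward class 0
    [0.05, 0.95],  # classifier C is very confident in class 1
]

# Hard voting: each classifier casts one vote for its most likely class.
votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
hard_prediction = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then take the argmax.
n_classes = len(probas[0])
mean_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_prediction = max(range(n_classes), key=mean_proba.__getitem__)

print("hard:", hard_prediction, "soft:", soft_prediction, "mean proba:", mean_proba)
```

Two lukewarm votes win under hard voting, while the single confident vote pulls the averaged probability toward class 1 under soft voting.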


### Bagging and Pasting

Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set.  When sampling is performed <i>with</i> replacement, this method is called <i>bagging</i>.  When sampling is performed <i>without</i> replacement, it is called <i>pasting</i>.  More on this here: https://homl.info/21

<b> In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor </b>

<u> Bagging and pasting involve training several predictors on different random samples of the training set </u>

Once all predictors are trained, the aggregation function is typically the statistical mode (the most frequent prediction) for classification, or the average for regression.  Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance.  The net result is that the ensemble has a similar bias but lower variance than a single predictor trained on the original training set.
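
As a quick illustration of the two aggregation functions, the sketch below uses only the standard library. The helper names and the sample predictions are hypothetical.

```python
from statistics import fmean, mode


def aggregate_classification(predictions):
    """Return the most frequent predicted class (the statistical mode)."""
    return mode(predictions)


def aggregate_regression(predictions):
    """Return the average of the predicted values."""
    return fmean(predictions)


class_votes = [1, 0, 1, 1, 0]        # made-up class predictions from five predictors
value_estimates = [3.1, 2.7, 2.9]    # made-up regression predictions from three predictors

print(aggregate_classification(class_votes))
print(aggregate_regression(value_estimates))
```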

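<b> Bagging and pasting scale very well, because the predictors can all be trained in parallel on different CPU cores or servers, and predictions can be made in parallel too </b>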

#### Example 2: Bagging and Pasting in Scikit-Learn

In [3]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf_ = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1
)
bag_clf_.fit(X_train, y_train)
y_pred = bag_clf_.predict(X_test)

The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities.

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, so the ensemble's variance is reduced.  Overall, bagging often results in better models.

### Out-of-Bag Evaluation

By default a BaggingClassifier samples <i>m</i> training instances with replacement (bootstrap=True), where <i>m</i> is the size of the training set.  This means that only about 63% of the training instances are sampled on average for each predictor: as <i>m</i> grows, this ratio approaches $1 - \exp(-1) \approx 0.63212$.  The remaining 37% of the training instances that are not sampled are called <i>out-of-bag (oob)</i> instances.  Note that they are not the same 37% for every predictor.
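
The 63% figure is easy to check numerically. The following sketch assumes nothing beyond the standard library; the helper `sampled_fraction` is a made-up name. It draws one bootstrap sample and compares the share of distinct instances with the closed-form value.

```python
import math
import random


def sampled_fraction(m: int, rng: random.Random) -> float:
    """Draw m indices with replacement from range(m); return the unique share."""
    drawn = {rng.randrange(m) for _ in range(m)}
    return len(drawn) / m


m = 10_000
rng = random.Random(42)
simulated = sampled_fraction(m, rng)
exact = 1 - (1 - 1 / m) ** m   # probability that a given instance is drawn at least once
limit = 1 - math.exp(-1)       # value the exact expression approaches as m grows
print(f"simulated={simulated:.4f} exact={exact:.4f} limit={limit:.4f}")
```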

Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set.  

In Scikit-Learn you can set oob_score=True when creating a BaggingClassifier to request an automatic oob evaluation after training.  

The oob decision function is available for each training instance through the oob_decision_function_ variable.

#### Example 3: Bagging with Out-of-Bag Evaluation

In [4]:
bag_clf_ = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, bootstrap=True, n_jobs=-1, oob_score=True
)

bag_clf_.fit(X_train, y_train)
y_pred_ = bag_clf_.predict(X_test)
accuracy_score_ = accuracy_score(y_test, y_pred_)
print(f'OOB Score: {bag_clf_.oob_score_}, Accuracy: {accuracy_score_}')

OOB Score: 0.9877611940298507, Accuracy: 0.9896969696969697
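
The oob decision function itself can be inspected as well. The sketch below is self-contained and uses its own small synthetic dataset (the `_demo` names are hypothetical), so it does not depend on the variables above. Since Decision Trees can estimate class probabilities, each row holds the oob class probabilities for one training instance.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.15, random_state=42)

bag_demo = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    oob_score=True,
    random_state=42,
)
bag_demo.fit(X_demo, y_demo)

# One row per training instance, one column per class.
oob_proba = bag_demo.oob_decision_function_
print(oob_proba.shape)
print(oob_proba[:3])

# Taking the most probable class per row gives the oob predictions.
oob_pred = bag_demo.classes_[oob_proba.argmax(axis=1)]
manual_oob_accuracy = (oob_pred == y_demo).mean()
print(manual_oob_accuracy, bag_demo.oob_score_)
```

With very few estimators some instances may never be left out of a bag, in which case their rows are undefined; 100 estimators makes that extremely unlikely here.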


### Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well.  Sampling is controlled by two hyperparameters: <b>max_features</b> and <b>bootstrap_features</b>.  They work the same way as <b>max_samples</b> and <b>bootstrap</b>, but for feature sampling instead of instance sampling.  Thus, each predictor will be trained on a random subset of the input features.

<b> This technique is particularly useful when you are dealing with high-dimensional inputs (such as images). </b>

Sampling both training instances and features is called the <i>Random Patches</i> method.  Keeping all training instances (by setting bootstrap=False and max_samples=1.0) but sampling features (by setting bootstrap_features=True and/or max_features to a value smaller than 1.0) is called the <i>Random Subspaces</i> method.
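
A minimal sketch of both configurations follows. The dataset is synthetic and the sampling ratios (0.7 and 0.5) are arbitrary choices for illustration, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20 features (an assumption made for this example).
X_hd, y_hd = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=42
)

# Random Patches: sample both training instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42,
)

# Random Subspaces: keep every training instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False,
    random_state=42,
)

for name, clf in (("patches", patches_clf), ("subspaces", subspaces_clf)):
    clf.fit(X_hd, y_hd)
    # estimators_features_ lists the feature indices each tree was trained on.
    print(name, "features per tree:", len(clf.estimators_features_[0]))
```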

### Random Forests

#### Example 4: Random Forest

In [5]:
'With few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.'

# The following code uses all available CPU cores to train a Random Forest classifier with 500 trees, each tree limited to a maximum of 16 leaf nodes.
from sklearn.ensemble import RandomForestClassifier

rnd_clf_ = RandomForestClassifier(
    n_estimators=500,
    max_leaf_nodes=16,
    n_jobs=-1
)

rnd_clf_.fit(X_train, y_train)
y_pred_rf_ = rnd_clf_.predict(X_test)
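
Because a Random Forest is essentially a bagging ensemble of Decision Trees that consider a random subset of features at each split, you can build a roughly equivalent model by hand. The sketch below is an approximation under that assumption, on its own synthetic data; the two models use different random streams, so their predictions will agree closely but not exactly.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_m, y_m = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100, max_leaf_nodes=16, random_state=42
)

# Bagged trees that pick the best split among a random subset of features.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=100,
    random_state=42,
)

forest.fit(X_tr, y_tr)
bagged_trees.fit(X_tr, y_tr)

# Share of test instances on which the two ensembles make the same prediction.
agreement = (forest.predict(X_te) == bagged_trees.predict(X_te)).mean()
print(f"prediction agreement: {agreement:.3f}")
```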

### Extra-Trees

When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting.  It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds.  A forest of such extremely random trees is called an <i>Extremely Randomized Trees</i> ensemble (or <i>Extra-Trees</i> for short).

Once again, this technique trades more bias for a lower variance (recall that high bias is characteristic of underfitting, while high variance is characteristic of overfitting).  It also makes Extra-Trees much faster to train than regular Random Forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.

<b>You can create an Extra-Trees classifier using Scikit-Learn's ExtraTreesClassifier class. Its API is identical to the RandomForestClassifier class </b>.

It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier.  Generally, the only way to know is to try both and compare them using cross-validation.
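
One way to run that comparison is sketched below, using a small synthetic dataset and 5-fold cross-validation. The dataset and hyperparameters are placeholders chosen only so the example runs quickly.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_cv, y_cv = make_moons(n_samples=600, noise=0.25, random_state=42)

cv_scores = {}
for model in (
    RandomForestClassifier(n_estimators=100, random_state=42),
    ExtraTreesClassifier(n_estimators=100, random_state=42),
):
    # Mean accuracy over 5 folds for each model.
    scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring="accuracy")
    cv_scores[type(model).__name__] = scores.mean()
    print(f"{type(model).__name__}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Which model comes out ahead depends on the dataset, so treat the result on this toy data as nothing more than a demonstration of the procedure.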

### Feature Importance

Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature.  Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average.  More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it. 

<b> You can access the result using the <i> feature_importances_ </i> variable </b>

#### Example 5: Access Random Forest Feature Importances

In [6]:
iris_ = load_iris()
rnd_clf_ = RandomForestClassifier(
    n_estimators=500,
    n_jobs=-1
)
rnd_clf_.fit(iris_['data'], iris_['target'])
for name, score in zip(iris_['feature_names'], rnd_clf_.feature_importances_):
    print(name, score)

sepal length (cm) 0.09711331661572171
sepal width (cm) 0.024867402780438135
petal length (cm) 0.4403408557941626
petal width (cm) 0.43767842480967756


### Boosting

<i> Boosting </i> refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.  <b> There are many boosting methods available, but by far the most popular are <i> AdaBoost </i> (short for Adaptive Boosting) and <i> Gradient Boosting </i>. </b>

### AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfit.  This is the technique used by AdaBoost.

For example, when training an AdaBoost classifier, the algorithm first trains a base classifier (such as a Decision Tree) and uses it to make predictions on the training set.  The algorithm then increases the relative weight of misclassified training instances.  Then it trains a second classifier, using the updated weights, and again makes predictions on the training set, updates the instance weights, and so on.

This sequential learning technique has some similarities with Gradient Descent, except that instead of tweaking a single predictor's parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set.

<b> There is one important drawback to this sequential learning technique: it cannot be parallelized, since each predictor can only be trained after the previous predictor has been trained and evaluated </b>

To make predictions, AdaBoost simply computes the predictions of all the predictors and weighs them using the predictor weights $\alpha_j$.  The predicted class is the one that receives the majority of weighted votes.
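
The weight bookkeeping can be sketched in a few lines of plain Python. The formulas used here are the standard AdaBoost ones (weighted error rate $r_j$, predictor weight $\alpha_j = \eta \log\frac{1 - r_j}{r_j}$, and multiplying the weights of misclassified instances by $\exp(\alpha_j)$ before normalizing); they are not spelled out in these notes, so treat them as an assumption. The function name `adaboost_round` is hypothetical.

```python
import math


def adaboost_round(weights, misclassified, learning_rate=1.0):
    """Perform one AdaBoost weight update.

    weights: current instance weights
    misclassified: booleans, True where the current predictor was wrong
    Assumes the weighted error rate is strictly between 0 and 1.
    Returns (alpha, new_weights).
    """
    # Weighted error rate of the current predictor.
    r = sum(w for w, wrong in zip(weights, misclassified) if wrong) / sum(weights)
    # Predictor weight: large for accurate predictors, near 0 for random ones.
    alpha = learning_rate * math.log((1 - r) / r)
    # Boost the weights of misclassified instances, then renormalize.
    boosted = [
        w * math.exp(alpha) if wrong else w
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(boosted)
    return alpha, [w / total for w in boosted]


# Ten instances with equal weights; the first two are misclassified.
weights = [0.1] * 10
misclassified = [True, True] + [False] * 8
alpha, new_weights = adaboost_round(weights, misclassified)
print(alpha, new_weights)
```

After the update, the two misclassified instances carry half of the total weight, so the next predictor is pushed to focus on them.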

<b>Scikit-Learn uses a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function).  If the predictors can estimate class probabilities (predict_proba()), Scikit-Learn can use a variant of SAMME called SAMME.R which relies on class probabilities rather than predictions and generally performs better.</b>

A Decision Stump is a Decision Tree with max_depth=1 and is the default base estimator for the AdaBoostClassifier class.  If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly regularizing the base estimator.

#### Example 6: AdaBoostClassifier

In [7]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf_ = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    algorithm='SAMME.R',  # deprecated in recent scikit-learn releases; omit this argument there
    learning_rate=0.5
)
ada_clf_.fit(X_train, y_train)
y_pred_ = ada_clf_.predict(X_test)
accuracy_score_ = accuracy_score(y_test, y_pred_)
print(f'Accuracy: {accuracy_score_}')

Accuracy: 0.993030303030303


### Gradient Boosting

Another very popular boosting algorithm is <i>Gradient Boosting</i>.  Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor.  However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the <i>residual errors</i> made by the previous predictor.

The learning_rate hyperparameter scales the contribution of each tree.  If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better.  This is a regularization technique called <i> shrinkage </i>.

In order to find the optimal number of trees, you can use early stopping.  A simple way to implement this is to use the staged_predict() method; it returns an iterator over the predictions made by the ensemble at each stage of the training.  It is also possible to implement early stopping by actually stopping training early.  You can do so by setting warm_start=True, which makes Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training.
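
Scikit-Learn's gradient boosting estimators also expose built-in early stopping through the n_iter_no_change and validation_fraction hyperparameters. The sketch below runs on synthetic data; the dataset and the specific values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X_reg, y_reg = make_regression(
    n_samples=500, n_features=10, noise=20.0, random_state=42
)

gbrt_es = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,          # upper bound on the number of trees
    validation_fraction=0.15,  # share of the training data held out internally
    n_iter_no_change=5,        # stop after 5 rounds without improvement
    random_state=42,
)
gbrt_es.fit(X_reg, y_reg)

# n_estimators_ reports how many trees were kept when training stopped.
print("trees actually fitted:", gbrt_es.n_estimators_)
```

This avoids writing the validation loop by hand, at the cost of holding out part of the training data inside fit().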

The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree. As you can probably guess by now, this technique trades a higher bias for a lower variance.  It also speeds up training considerably.  This is called <i>Stochastic Gradient Boosting</i>.

It is possible to use Gradient Boosting with other cost functions.  This is controlled by the loss hyperparameter.

It is worth noting that an optimized implementation of Gradient Boosting is available in the popular Python library XGBoost (Extreme Gradient Boosting).  XGBoost is often an important component of the winning entries in ML competitions.  XGBoost also offers several nice features, such as automatically taking care of early stopping.

#### Example 7: Manual Gradient Boosting and GradientBoostingRegressor

In [8]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score, mean_squared_error

# Load the diabetes data once and keep every feature except 'sex'
diabetes_ = load_diabetes(as_frame=True, scaled=True)
X = diabetes_['data'].loc[:, ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']]
y = diabetes_['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

tree_reg1_ = DecisionTreeRegressor(max_depth=2)
tree_reg1_.fit(X_train, y_train)

y2 = y_train - tree_reg1_.predict(X_train)
tree_reg2_ = DecisionTreeRegressor(max_depth=2)
tree_reg2_.fit(X_train, y2)

y3 = y2 - tree_reg2_.predict(X_train)
tree_reg3_ = DecisionTreeRegressor(max_depth=2)
tree_reg3_.fit(X_train, y3)

y_pred_ = sum(tree.predict(X_test).round() for tree in (tree_reg1_, tree_reg2_, tree_reg3_))
r2_ = r2_score(y_test, y_pred_)
print(f'R2: {r2_}')


R2: 0.4235663441968759


In [9]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt_ = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=0.7)
gbrt_.fit(X_train, y_train)
y_pred_ = gbrt_.predict(X_test)
r2_ = r2_score(y_test, y_pred_)
print(f'R2: {r2_}')

R2: 0.536799085946732


#### Example 8: Early Stopping

In [10]:
' Implementing staged_predict()'
import numpy as np

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state=42)

gbrt_ = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt_.fit(X_train, y_train)

errors = [
    mean_squared_error(y_val, y_pred) for y_pred in gbrt_.staged_predict(X_val)
]
bst_n_estimators_ = np.argmin(errors) + 1

gbrt_best_ = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators_)
gbrt_best_.fit(X_train, y_train)
y_pred_ = gbrt_best_.predict(X_test)
r2_ = r2_score(y_test, y_pred_)
print(f'R2: {r2_}')

R2: 0.5495590801562829


In [11]:
' Implementing warm_start=True'

gbrt_ = GradientBoostingRegressor(max_depth=2, warm_start=True)
min_val_error_ = float('inf')
error_going_up_ = 0
for n_estimators in range(1, 120):
    gbrt_.n_estimators = n_estimators
    gbrt_.fit(X_train, y_train)
    y_pred_ = gbrt_.predict(X_val)
    val_error_ = mean_squared_error(y_val, y_pred_)
    if val_error_ < min_val_error_:
        min_val_error_ = val_error_
        error_going_up_ = 0
    else:
        error_going_up_ += 1
        if error_going_up_ == 5:
            break # Early stopping mechanism

y_pred_ = gbrt_.predict(X_test)
r2_ = r2_score(y_test, y_pred_)
print(f'R2: {r2_}')

R2: 0.5562674641315657


In [12]:
#pip install xgboost

In [13]:
' Using XGBoost'
import xgboost

xgb_reg_ = xgboost.XGBRegressor()
# In recent xgboost releases, early_stopping_rounds is passed to the XGBRegressor constructor; passing it to fit() is deprecated
xgb_reg_.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=5)
# Evaluate on the test set, as in the previous cells
y_pred_ = xgb_reg_.predict(X_test)
r2_ = r2_score(y_test, y_pred_)
print(f'R2: {r2_}')

[0]	validation_0-rmse:66.19632
[1]	validation_0-rmse:59.65037
[2]	validation_0-rmse:55.09643
[3]	validation_0-rmse:55.44572
[4]	validation_0-rmse:54.75318
[5]	validation_0-rmse:54.72742
[6]	validation_0-rmse:55.46946
[7]	validation_0-rmse:55.31054
[8]	validation_0-rmse:55.98979
[9]	validation_0-rmse:56.59273
[10]	validation_0-rmse:56.73968




### Stacking

The last Ensemble method we will discuss in this chapter is called <i> stacking </i> (short for stacked generalization).  It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don't we train a model to perform this aggregation?

Ex. Three predictors predict values (3.1, 2.7, and 2.9) and the final predictor (called a <i>blender</i> or a <i>meta learner</i>) takes the predictions as inputs and makes the final prediction.

To train a blender, a common approach is to use a hold-out set.  Alternatively, it is possible to use out-of-fold predictions.  In some contexts this is called stacking, while using a hold-out set is called blending.  For many people these terms are synonymous.

It is actually possible to train several blenders (e.g. one using Linear Regression, another using Random Forest, etc.) to get a whole layer of blenders.  The trick is to split the training set into three subsets: the first one is used to train the first layer, the second one is used to create the training set used to train the second layer (using predictions made by the predictors of the first layer), and the third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer.)  Once this is done, we can make a prediction for a new instance by going through each layer sequentially.

Scikit-Learn has supported stacking directly since version 0.22 through its StackingClassifier and StackingRegressor classes, but it is also not too hard to roll out your own implementation (see the following exercises).
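
Scikit-Learn releases from 0.22 onward include a StackingClassifier, which trains the blender on out-of-fold predictions of the base predictors. The following is a minimal sketch on synthetic data, with base predictors and blender picked arbitrarily for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_s, y_s = make_moons(n_samples=1000, noise=0.25, random_state=42)
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(
    X_s, y_s, random_state=42
)

stack_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("svc", SVC(random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the blender
    cv=5,  # out-of-fold predictions are used to train the blender
)
stack_clf.fit(X_s_train, y_s_train)

stack_accuracy = stack_clf.score(X_s_test, y_s_test)
print(f"stacking accuracy: {stack_accuracy:.3f}")
```

Rolling your own version, as in the exercises below, is still a good way to understand what the class does internally.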

### Exercises

<b> 1. If you have trained five different models on the exact same training data, and they all achieve a 95% precision, is there any chance that you can combine these models to get better results?  If so, how? If not, why? </b>

My answer:

Ensemble models work best when the individual models have uncorrelated errors, in other words, when they have been trained on different data or trained in different ways.  In this case, we don't know what the five different models are.  If they are all different models then yes, there is a good chance that combining them into an ensemble and utilizing hard or soft voting will yield better results than any one model.  However, if they are all the same model, for example SVMs, trained on the exact same data, then it is unlikely that combining them will yield better results.

<b> 2. What is the difference between hard and soft voting classifiers? </b>

My answer:

Hard voting is essentially taking the mode of the class predictions from all of the individual models.  Soft voting is possible when every model in the ensemble can estimate class probabilities (predict_proba() for Scikit-Learn models).  In that case the ensemble averages the class probabilities over all of the models and predicts the class with the highest average probability, which gives more weight to highly confident votes.

<b> 3. Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers?  What about pasting ensembles, boosting ensembles, Random Forests, or stacking ensembles? </b>

My answer:

1. <b>Bagging</b>: Each predictor is trained independently on its own random sample drawn with replacement from the training set, so the predictors can be trained in parallel across servers for efficiency gains.
1. <b>Pasting</b>: Pasting works the same way except that each sample is drawn without replacement, so the predictors are again independent and can be trained in parallel for efficiency gains.
1. <b>Boosting</b>: Since each predictor needs the results of its predecessor to determine the new weights (or residuals), training is sequential and cannot be run in parallel.
1. <b>Random Forests</b>: A Random Forest is a bagging ensemble of Decision Trees, so its trees can be trained in parallel in the same way.
1. <b>Stacking</b>: The predictors within a given layer are independent of one another and can be trained in parallel, but a layer can only be trained once the predictors of the previous layer have finished training.

<b> 4. What is the benefit of out-of-bag evaluation? </b>

My answer:

Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set.  

<b> 5. What makes Extra-Trees more random than regular Random Forests?  How can this extra randomness help?  Are Extra-Trees slower or faster than regular Random Forests? </b>

My answer:

At each node of a tree in a Random Forest, a random subset of features is selected and the optimal threshold is found for those features.  In Extra-Trees that threshold is randomized as well.  You can technically create several forests using Extra-Trees on the same dataset and get several models that have loosely correlated errors, thus increasing the predictive power of an ensemble of these models.  Extra-Trees trades more bias for lower variance and speeds up training, since it skips the search for the best threshold.

<b> 6. If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how? </b>

My Answer:

Start by trying to increase n_estimators then if the model is still underfitting reduce other regularization hyperparameters on the base estimator.

<b> 7. If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate? </b>

My Answer:

Decrease the learning rate.

<b> 8. Load the MNIST data and split it into a training set, a validation set, and a test set (50,000, 10,000, 10,000 samples).  Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM classifier.  Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting.  Once you have found one, try it on the test set.  How much better does it perform compared to the individual classifiers? </b>

In [36]:
from keras.datasets import mnist

(train_X, train_y), (test_X, test_y) = mnist.load_data()
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, train_size=50000)
train_X.shape, val_X.shape, test_X.shape

((50000, 28, 28), (10000, 28, 28), (10000, 28, 28))

In [37]:
train_X = train_X.reshape(train_X.shape[0], train_X.shape[1] * train_X.shape[2])
val_X = val_X.reshape(val_X.shape[0], val_X.shape[1] * val_X.shape[2])
test_X = test_X.reshape(test_X.shape[0], test_X.shape[1] * test_X.shape[2])
train_X.shape, val_X.shape, test_X.shape

((50000, 784), (10000, 784), (10000, 784))

In [38]:
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

rnd_clf_ = RandomForestClassifier(n_estimators=100, max_depth=500, min_samples_split=5)
scv_clf_ = SVC(C=1.0, kernel='rbf', degree=3, probability=True)
ext_clf_ = ExtraTreesClassifier(n_estimators=100, max_depth=500, min_samples_split=5)
log_clf_ = LogisticRegression(C=0.5, penalty="l1", solver="saga", tol=0.1)

rnd_clf_.fit(train_X, train_y)
scv_clf_.fit(train_X, train_y)
ext_clf_.fit(train_X, train_y)
log_clf_.fit(train_X, train_y)

In [40]:
rnd_preds_ = rnd_clf_.predict(val_X)
scv_preds_ = scv_clf_.predict(val_X)
ext_preds_ = ext_clf_.predict(val_X)
log_preds_ = log_clf_.predict(val_X)

In [59]:
from scipy import stats
all_preds_ = np.array([rnd_preds_, scv_preds_, ext_preds_, log_preds_])
hard_vote_ = stats.mode(all_preds_, keepdims=True)[0].T
hard_vote_

array([[7],
       [1],
       [5],
       ...,
       [5],
       [1],
       [1]], dtype=uint8)

In [58]:
rnd_accuracy_score_ = accuracy_score(val_y, rnd_preds_)
scv_accuracy_score_ = accuracy_score(val_y, scv_preds_)
ext_accuracy_score_ = accuracy_score(val_y, ext_preds_)
log_accuracy_score_ = accuracy_score(val_y, log_preds_)
ensemble_accuracy_score_ = accuracy_score(val_y, hard_vote_.ravel())  # flatten the (n, 1) column of votes to 1-D
print(f'rnd: {rnd_accuracy_score_}, svc: {scv_accuracy_score_}, ext: {ext_accuracy_score_}, log: {log_accuracy_score_}, hard_vote: {ensemble_accuracy_score_}')

rnd: 0.9669, svc: 0.9769, ext: 0.9665, log: 0.9183, hard_vote: 0.9707


In [61]:
eclf1 = VotingClassifier(
    estimators=[('rnd', rnd_clf_), ('svc', scv_clf_), ('ext', ext_clf_), ('log', log_clf_)], 
    voting='hard'
)
eclf1.fit(train_X, train_y)
hard_vote_ = eclf1.predict(val_X)

In [63]:
rnd_accuracy_score_ = accuracy_score(val_y, rnd_preds_)
scv_accuracy_score_ = accuracy_score(val_y, scv_preds_)
ext_accuracy_score_ = accuracy_score(val_y, ext_preds_)
log_accuracy_score_ = accuracy_score(val_y, log_preds_)
ensemble_accuracy_score_ = accuracy_score(val_y, hard_vote_)
print(f'rnd: {rnd_accuracy_score_}, svc: {scv_accuracy_score_}, ext: {ext_accuracy_score_}, log: {log_accuracy_score_}, hard_vote: {ensemble_accuracy_score_}')

rnd: 0.9669, svc: 0.9769, ext: 0.9665, log: 0.9183, hard_vote: 0.9697


In [64]:
eclf2 = VotingClassifier(
    estimators=[('rnd', rnd_clf_), ('svc', scv_clf_), ('ext', ext_clf_), ('log', log_clf_)], 
    voting='soft',
    n_jobs=-1
)
eclf2.fit(train_X, train_y)
soft_vote_ = eclf2.predict(val_X)

In [65]:
rnd_accuracy_score_ = accuracy_score(val_y, rnd_preds_)
scv_accuracy_score_ = accuracy_score(val_y, scv_preds_)
ext_accuracy_score_ = accuracy_score(val_y, ext_preds_)
log_accuracy_score_ = accuracy_score(val_y, log_preds_)
ensemble_accuracy_score_ = accuracy_score(val_y, soft_vote_)
print(f'rnd: {rnd_accuracy_score_}, svc: {scv_accuracy_score_}, ext: {ext_accuracy_score_}, log: {log_accuracy_score_}, soft_vote: {ensemble_accuracy_score_}')

rnd: 0.9669, svc: 0.9769, ext: 0.9665, log: 0.9183, soft_vote: 0.9703


<b> Basically what I need to do from here is use cross-validation to optimize each model.  Apply some of the concepts of the chapter like boosting, bagging, pasting, etc. to get these accuracies really tight.  And then the most likely outcome is the ensemble will outperform all models.  In the current situation the SVC classifier is outperforming the ensemble by a little bit.  But training and optimizing these models can take hours or sometimes days and I would prefer to spend my time on other exercises and new chapters. </b>

<b> 9. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is your image's class.  Train a classifier on this new training set.  Congratulations, you have just trained a blender, and together with the classifiers it forms a stacking ensemble!  Now evaluate the ensemble on the test set.  For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions.  How does it compare to the voting classifier you trained earlier? </b>

In [72]:
val_preds_ = np.array([rnd_preds_, scv_preds_, ext_preds_, log_preds_]).T
svc_stack_ = SVC(C=1.0, kernel='rbf', degree=3, probability=True)
svc_stack_.fit(val_preds_, val_y)

In [73]:
rnd_preds_ = rnd_clf_.predict(test_X)
scv_preds_ = scv_clf_.predict(test_X)
ext_preds_ = ext_clf_.predict(test_X)
log_preds_ = log_clf_.predict(test_X)

test_preds_ = np.array([rnd_preds_, scv_preds_, ext_preds_, log_preds_]).T
stack_preds_ = svc_stack_.predict(test_preds_)
ensemble_accuracy_score_ = accuracy_score(test_y, stack_preds_)  # y_true first, then y_pred
print(f'stack_predictions: {ensemble_accuracy_score_}')

stack_predictions: 0.9654
