# Ensemble Learning

## Voting Classifiers

Create a hard voting classifier and train it on scikit-learn's moons dataset:

In [1]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(random_state=42))
    ]
)
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                             ('rf', RandomForestClassifier(random_state=42)),
                             ('svc', SVC(random_state=42))])


When you fit a VotingClassifier, it clones every estimator and fits the clones. The original estimators are available via the estimators attribute, while the fitted clones are available via the estimators_ attribute. If you prefer a dict rather than a list, you can use named_estimators or named_estimators_ instead. To begin, let’s look at each fitted classifier’s accuracy on the test set:

In [2]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))

lr = 0.864
rf = 0.896
svc = 0.896


In [3]:
# perform hard voting
voting_clf.predict(X_test[:1])[0]

1

In [4]:
# prediction of the individual classifiers
[clf.predict(X_test[:1])[0] for clf in voting_clf.estimators_]

[1, 1, 0]
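
Hard voting simply picks the class that most classifiers predicted. The snippet below is a minimal sketch of that idea for integer class labels; `majority_vote` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np

def majority_vote(predictions):
    """Return the most frequent class label among integer predictions."""
    votes = np.asarray(predictions)
    # bincount counts how often each non-negative integer label occurs;
    # argmax returns the smallest label when there is a tie
    return int(np.bincount(votes).argmax())

# the three individual predictions shown above
print(majority_vote([1, 1, 0]))
```

Two of the three classifiers vote for class 1, which matches the prediction of the voting classifier for this instance.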

In [5]:
# score of the hard voting classifier
voting_clf.score(X_test, y_test)

0.912

In [6]:
# train a soft voting classifier
# soft voting requires every classifier to have a predict_proba() method
voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True  # makes SVC provide predict_proba()
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)

0.92
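
Soft voting averages the class probabilities of all classifiers and picks the class with the highest mean. The sketch below uses made-up probabilities (an assumption for illustration, not values from the moons models) to show how a single confident classifier can overturn the majority.

```python
import numpy as np

# hypothetical class probabilities for one instance, one row per classifier
probas = np.array([
    [0.45, 0.55],   # weakly favors class 1
    [0.40, 0.60],   # weakly favors class 1
    [0.90, 0.10],   # strongly favors class 0
])

# hard voting: each classifier casts one vote for its most likely class
hard_votes = probas.argmax(axis=1)
hard_winner = int(np.bincount(hard_votes).argmax())

# soft voting: average the probabilities, then take the most likely class
mean_proba = probas.mean(axis=0)
soft_winner = int(mean_proba.argmax())

print("hard:", hard_winner, "soft:", soft_winner)
```

Here hard voting chooses class 1 (two votes to one), while soft voting chooses class 0 because the third classifier is much more confident than the other two.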

## Bagging and Pasting Ensembles

In [7]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# example of bagging
# bootstrap=False for pasting
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)

BaggingClassifier(estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=500, n_jobs=-1, random_state=42)


A BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier
can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with
decision tree classifiers.
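
To make the soft-voting behavior concrete, the sketch below averages the class probabilities of the individual trees by hand and compares the result with the ensemble's own predict_proba(). It uses a small separate dataset and the names `X_demo`, `y_demo`, and `bag_demo`, which are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=200, noise=0.30, random_state=42)
bag_demo = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20,
                             random_state=42)
bag_demo.fit(X_demo, y_demo)

# average the probabilities predicted by each fitted tree
manual_proba = np.mean(
    [tree.predict_proba(X_demo[:5]) for tree in bag_demo.estimators_], axis=0)
ensemble_proba = bag_demo.predict_proba(X_demo[:5])

print(np.allclose(manual_proba, ensemble_proba))
```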

## Out-of-bag Evaluation

In [8]:
# to use out-of-bag evaluation, set oob_score=True
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            oob_score=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.896

According to this OOB evaluation, this BaggingClassifier is likely to achieve about
89.6% accuracy on the test set. Let’s verify this:

In [9]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.92

The OOB decision function is also available through the oob_decision_function_ attribute:

In [10]:
bag_clf.oob_decision_function_[:3]

array([[0.32352941, 0.67647059],
       [0.3375    , 0.6625    ],
       [1.        , 0.        ]])
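
Out-of-bag evaluation works because sampling with replacement leaves some training instances out of each bootstrap sample. The following sketch, using only NumPy and an arbitrary sample size `m`, estimates how large that left-out share is when the bootstrap sample is as large as the training set.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000  # arbitrary training set size for this illustration

# draw m indices with replacement, as bagging does for one predictor
sampled = rng.integers(0, m, size=m)

# instances that were never drawn are "out-of-bag" for that predictor
in_bag_fraction = np.unique(sampled).size / m
oob_fraction = 1 - in_bag_fraction
print(round(oob_fraction, 3))
```

The fraction should land near 1/e (about 37%), so each predictor has a sizeable set of unseen instances to be evaluated on.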

## Random Forests

The RandomForestClassifier class is more convenient than building a BaggingClassifier around a DecisionTreeClassifier, and it is optimized for decision trees.

In [11]:
# Train a Random Forest Classifier with 500 trees, each limited to 16 leaf nodes.
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

In [12]:
# the following bagging classifier is equivalent to the previous RandomForestClassifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)

## Feature Importance

Inspect the importance of each feature of the Iris dataset using a RandomForestClassifier:

In [13]:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score,2), name)

0.11 sepal length (cm)
0.02 sepal width (cm)
0.44 petal length (cm)
0.42 petal width (cm)


Random forests are very handy to get a quick understanding of what features actually
matter, in particular if you need to perform feature selection.
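
As one possible way to turn these importances into a feature selection step, the sketch below wraps a random forest in scikit-learn's SelectFromModel and keeps the features whose importance is above the mean. The choice of threshold="mean" and of 100 trees is an assumption made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_iris, y_iris = load_iris(return_X_y=True)

# keep only the features whose importance exceeds the mean importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="mean")
X_selected = selector.fit_transform(X_iris, y_iris)

print(selector.get_support())  # boolean mask over the four features
print(X_selected.shape)
```

With importances like the ones printed above, only the two petal features pass the threshold.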

## Boosting

### AdaBoost

In [14]:
# train an AdaBoost classifier
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=30,
    learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=30, random_state=42)


### Gradient Boosting

In [15]:
# create a noisy quadratic dataset and fit a DecisionTreeRegressor to it
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.default_rng(seed=42).random((100, 1)) - 0.5
# y = 3x² + Gaussian noise
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.default_rng(seed=42).standard_normal(100)
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

DecisionTreeRegressor(max_depth=2, random_state=42)


In [16]:
# train a second DecisionTreeRegressor on the residual errors made by the first predictor
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(max_depth=2, random_state=43)


In [17]:
# train a third regressor on the residual errors made by the second
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(X, y3)

DecisionTreeRegressor(max_depth=2, random_state=44)


Now we have an ensemble containing three trees. It can make predictions on a new
instance simply by adding up the predictions of all the trees:

In [18]:
X_new = np.array([[-0.4], [0.], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

array([0.44237378, 0.02652534, 0.65823772])
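
The three manual steps above follow one pattern: fit a tree, subtract its predictions, and fit the next tree on what is left. A minimal sketch of that loop is shown below; `fit_residual_trees` and `predict_ensemble` are hypothetical helpers, and the toy data is generated separately so the notebook's X and y are left untouched.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_residual_trees(X, y, n_trees=3, max_depth=2):
    """Fit each tree on the residual errors left by the previous trees."""
    trees = []
    residual = y.copy()
    for i in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42 + i)
        tree.fit(X, residual)
        residual = residual - tree.predict(X)  # what is still unexplained
        trees.append(tree)
    return trees

def predict_ensemble(trees, X):
    """Add up the predictions of all trees."""
    return sum(tree.predict(X) for tree in trees)

# toy noisy quadratic data, similar in spirit to the dataset above
rng = np.random.default_rng(0)
X_toy = rng.random((100, 1)) - 0.5
y_toy = 3 * X_toy[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

trees = fit_residual_trees(X_toy, y_toy, n_trees=3)
mse_1 = np.mean((y_toy - predict_ensemble(trees[:1], X_toy)) ** 2)
mse_3 = np.mean((y_toy - predict_ensemble(trees, X_toy)) ** 2)
print(mse_1, mse_3)
```

Each additional tree can only lower (or leave unchanged) the training error, since it fits the leftover residuals with leaf means.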

In [19]:
# the following code creates the same ensemble as the previous one:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1, random_state=42)
gbrt.fit(X, y)

GradientBoostingRegressor(learning_rate=1, max_depth=2, n_estimators=3,
                          random_state=42)


In [20]:
# gradient boosting with early stopping
gbrt_best = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.05, n_estimators=500,
    n_iter_no_change=10, random_state=42)
gbrt_best.fit(X, y)

GradientBoostingRegressor(learning_rate=0.05, max_depth=2, n_estimators=500,
                          n_iter_no_change=10, random_state=42)


In [21]:
gbrt_best.n_estimators_ # not 500 thanks to early stopping

114
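
With n_iter_no_change set, GradientBoostingRegressor holds out part of the training data (validation_fraction, 10% by default) and stops once the validation score stops improving. A related idea can be sketched by hand with staged_predict(), which yields the predictions after each boosting stage. The toy data and the names `X_tr`, `X_val`, and `gbrt_full` below are assumptions for illustration, and this picks the stage with the lowest validation error rather than reproducing scikit-learn's exact stopping rule.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.random((300, 1)) - 0.5
y_toy = 3 * X_toy[:, 0] ** 2 + 0.05 * rng.standard_normal(300)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, random_state=42)

# train the full ensemble once, without early stopping
gbrt_full = GradientBoostingRegressor(max_depth=2, learning_rate=0.05,
                                      n_estimators=200, random_state=42)
gbrt_full.fit(X_tr, y_tr)

# validation error after 1, 2, ..., 200 trees
val_errors = [mean_squared_error(y_val, y_pred)
              for y_pred in gbrt_full.staged_predict(X_val)]
best_n = int(np.argmin(val_errors)) + 1  # stages are counted from 1
print(best_n)
```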

### Histogram-Based Gradient Boosting

In [22]:
# build a complete pipeline with a histogram-based gradient boosting (HGB) regressor for the California housing dataset
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder

hgb_reg = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ["ocean_proximity"]),
                            remainder="passthrough"),
    HistGradientBoostingRegressor(categorical_features=[0], random_state=42)
)
# example; data is not in this notebook
# hgb_reg.fit(housing, housing_labels) 

The whole pipeline is just as short as the imports! No need for an imputer, scaler, or a
one-hot encoder, so it’s really convenient.

## Stacking

In [23]:
# train a stacking classifier on the moons dataset created earlier
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5 # number of cross-validation folds
)
stacking_clf.fit(X_train, y_train)

StackingClassifier(cv=5,
                   estimators=[('lr', LogisticRegression(random_state=42)),
                               ('rf', RandomForestClassifier(random_state=42)),
                               ('svc', SVC(probability=True, random_state=42))],
                   final_estimator=RandomForestClassifier(random_state=43))


# Questions

8. Load the MNIST dataset (introduced in Chapter 3), and split it into a training set, a
validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for
validation, and 10,000 for testing). Then train various classifiers, such as a random
forest classifier, an extra-trees classifier, and an SVM classifier. Next, try to
combine them into an ensemble that outperforms each individual classifier on the
validation set, using soft or hard voting. Once you have found one, try it on the test
set. How much better does it perform compared to the individual classifiers? 

In [24]:
# fetch_openml("mnist_784") not working (http error) so I am using keras
from tensorflow import keras

(X_train_valid, y_train_valid), (X_test, y_test) = keras.datasets.mnist.load_data() # loads with shape (instances_num, 28, 28)

In [25]:
X_train_valid = X_train_valid.reshape(60_000, 784)
X_test = X_test.reshape(10_000, 784)

In [26]:
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_valid, y_train_valid, train_size=50_000, random_state=42,
    stratify=y_train_valid)  # stratified split

In [27]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import StandardScaler

models = [
    ("Random Forest", RandomForestClassifier(random_state=42, n_jobs=-1)),
    ("Extra Trees", ExtraTreesClassifier(random_state=42, n_jobs=-1)),
    ("Logistic Regression", LogisticRegression(random_state=42, max_iter=500)),
]

# the data needs scaling because of Logistic Regression; surprisingly, MNIST features do not
# all have the same range, according to https://www.openml.org/search?type=data&status=active&id=554
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)

In [28]:
for name, model in models:
    print("training", name)
    model.fit(X_train_scaled, y_train)
    print("Scaled Validation accuracy:", model.score(X_valid_scaled, y_valid), "\n")

training Random Forest
Scaled Validation accuracy: 0.9666 

training Extra Trees
Scaled Validation accuracy: 0.9701 

training Logistic Regression
Scaled Validation accuracy: 0.9124 



In [29]:
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42, max_iter=500)),
        ("rf", RandomForestClassifier(random_state=42, n_jobs=-1)),
        ("extra", ExtraTreesClassifier(random_state=42, n_jobs=-1))
    ],
    n_jobs=-1
)
voting_clf.fit(X_train_scaled, y_train)
print("Hard voting classifier score:", voting_clf.score(X_valid_scaled, y_valid))

Hard voting classifier score: 0.9673


In [30]:
voting_clf.voting = "soft"
print("Soft voting classifier score:", voting_clf.score(X_valid_scaled, y_valid))

Soft voting classifier score: 0.9518


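The main VotingClassifier hyperparameter tried here is voting (weights is another option). Neither variant outperforms the extra-trees classifier (0.9701 on the validation set), but hard voting comes close (0.9673), while soft voting does worse (0.9518).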

In [31]:
voting_clf.voting = "hard"
models.append(("Voting Classifier", voting_clf))

Let's see all the model performances on the test set:

In [32]:
for name, model in models:
    print(name, "test set accuracy:", model.score(X_test_scaled,y_test))

Random Forest test set accuracy: 0.9694
Extra Trees test set accuracy: 0.9712
Logistic Regression test set accuracy: 0.919
Voting Classifier test set accuracy: 0.9702


So close, but not enough: the voting classifier (0.9702) is just 0.001 behind extra trees (0.9712) on the test set.

9. Run the individual classifiers from the previous exercise to make predictions on the
validation set, and create a new training set with the resulting predictions: each
training instance is a vector containing the set of predictions from all your
classifiers for an image, and the target is the image’s class. Train a classifier on this
new training set. Congratulations—you have just trained a blender, and together
with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on
the test set. For each image in the test set, make predictions with all your
classifiers, then feed the predictions to the blender to get the ensemble’s
predictions. How does it compare to the voting classifier you trained earlier? Now
try again using a StackingClassifier instead. Do you get better performance? If
so, why?

In [33]:
# build an array with one column of predictions per model
def create_pred_arr(X, models):
    arr = [model.predict(X).reshape((-1,1)) for name, model in models]
    return np.concatenate(arr, axis=1)

In [34]:
extra_blender = ExtraTreesClassifier(random_state=42, n_jobs=-1)

extra_blender.fit(create_pred_arr(X_valid_scaled, models[:-1]), y_valid) # fit blender

ExtraTreesClassifier(n_jobs=-1, random_state=42)


In [35]:
print("my stacking ensemble test accuracy:", extra_blender.score(create_pred_arr(X_test_scaled, models[:-1]), y_test))

my stacking ensemble test accuracy: 0.9687


In [36]:
stacking = StackingClassifier(
    estimators=models[:-1],
    final_estimator=ExtraTreesClassifier(random_state=42, n_jobs=-1),
    n_jobs=-1,
    cv=5)
stacking.fit(X_train_scaled, y_train)

StackingClassifier(cv=5,
                   estimators=[('Random Forest',
                                RandomForestClassifier(n_jobs=-1, random_state=42)),
                               ('Extra Trees',
                                ExtraTreesClassifier(n_jobs=-1, random_state=42)),
                               ('Logistic Regression',
                                LogisticRegression(max_iter=500, random_state=42))],
                   final_estimator=ExtraTreesClassifier(n_jobs=-1, random_state=42),
                   n_jobs=-1)


In [37]:
print("sklearn's Stacking ensemble:", stacking.score(X_test_scaled, y_test))

sklearn's Stacking ensemble: 0.9761


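Finally, an ensemble outperforms extra trees. scikit-learn's implementation likely did better than my hand-made blender for two reasons. First, it uses cross-validation to get out-of-sample predictions from the base classifiers, so its blender is trained on all 50,000 training instances instead of only the 10,000 validation instances. Second, with stack_method='auto' it feeds the blender class probabilities (from predict_proba) rather than hard class labels, which carry less information.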
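
The idea of training a blender on out-of-sample predictions can be sketched by hand with cross_val_predict(). The example below uses the small moons dataset and two base models to keep it fast; the variable names and the choice of LogisticRegression as blender are assumptions for illustration, and this is not a reproduction of StackingClassifier's internals.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

base_models = [LogisticRegression(random_state=42),
               RandomForestClassifier(n_estimators=50, random_state=42)]

# out-of-fold class probabilities: every training instance is predicted
# by a model that never saw it, so the blender can use the whole training set
meta_train = np.hstack([
    cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")
    for model in base_models])

# refit the base models on the full training set for use at test time
for model in base_models:
    model.fit(X_tr, y_tr)
meta_test = np.hstack([model.predict_proba(X_te) for model in base_models])

blender = LogisticRegression(random_state=42)
blender.fit(meta_train, y_tr)
blender_accuracy = blender.score(meta_test, y_te)
print(blender_accuracy)
```

No separate hold-out set is consumed by the blender here, which is the main practical difference from the hand-made blender trained on the validation set.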