# Chapter 5.7 - Ensemble Learning: Bagging, Boosting and Stacking

**Ensemble learning** is a machine learning paradigm in which multiple models
(often called “weak learners”) are trained to **solve the same problem** and then
**combined** to get **better** results. The main hypothesis is that when **weak models**
are **correctly combined**, we can obtain **more accurate and/or more robust models**.
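
To make this hypothesis concrete, here is a minimal sketch of the idealised case of a majority vote. It assumes that every model is correct with the same probability and that the models' errors are fully independent, which real models rarely are. The function name `majority_vote_accuracy` is ours and is used for illustration only.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each voter is right 60% of the time; the vote improves as voters are added.
for n in (1, 5, 21, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))

# Voters that are worse than chance make the vote worse, not better.
print(21, round(majority_vote_accuracy(21, 0.4), 3))
```

Under these assumptions the vote only helps when each weak learner is better than chance, which is one reading of “correctly combined”.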

Ensemble methods are typically used to:

    • decrease variance (bagging, short for Bootstrap Aggregating)
    • reduce bias (boosting)
    • improve predictive performance (stacking).

In [4]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#import warnings
#warnings.filterwarnings("ignore")

# Bagging
(Short for “**b**ootstrap **agg**regat**ing**”.) Bagging aims at producing an ensemble model that is more robust than the individual models composing it.

The idea of bagging is simple: we want to fit several independent models and “average” their predictions in order to obtain a model with a lower variance. However, in practice we cannot fit fully independent models because that would require too much data. So we rely on the good “approximate properties” of bootstrap samples (representativity and independence) to fit models that are almost independent.
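
The effect of averaging, and why independence matters, can be checked with a small simulation. This is a sketch under simplifying assumptions: each model's error is Gaussian with unit variance, and every pair of models shares the same correlation `rho`. In that setting the variance of the average is `rho + (1 - rho) / n_models`. The function name `variance_of_average` is hypothetical.

```python
import math
import random
import statistics


def variance_of_average(n_models, rho, n_trials=2000, seed=0):
    """Empirical variance of the mean of n unit-variance errors with pairwise correlation rho."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_trials):
        shared = rng.gauss(0.0, 1.0)  # error component common to all models
        errors = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n_models)
        ]
        averages.append(statistics.fmean(errors))
    return statistics.pvariance(averages)


single_var = variance_of_average(1, rho=0.0)        # one model: variance close to 1
independent_var = variance_of_average(25, rho=0.0)  # close to 1 / 25
correlated_var = variance_of_average(25, rho=0.5)   # close to 0.5 + 0.5 / 25
print(single_var, independent_var, correlated_var)
```

With correlated models the variance cannot drop below `rho`, however many models are averaged, which is why bagging tries to make the base models as independent as the data allow.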

Bagging consists in fitting several base models on different bootstrap samples and building an ensemble model that “averages” the results of these weak learners.
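
The procedure can be sketched by hand in a few lines. This is an illustration on synthetic one-dimensional data, not the scikit-learn implementation used in the cells below. The base model is a single threshold, the data have 10% label noise, and all names (`fit_threshold`, `bootstrap_sample`, `fit_bagging`, `predict_bagging`) are ours.

```python
import random
from collections import Counter


def fit_threshold(xs, ys):
    """Return the threshold t with the fewest errors for the rule 'predict 1 if x > t'."""
    best_t, best_err = None, len(xs) + 1
    for t in sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def bootstrap_sample(xs, ys, rng):
    """Draw len(xs) observations with replacement."""
    idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
    return [xs[i] for i in idx], [ys[i] for i in idx]


def fit_bagging(xs, ys, n_models, rng):
    """Fit one base model per bootstrap sample."""
    return [fit_threshold(*bootstrap_sample(xs, ys, rng)) for _ in range(n_models)]


def predict_bagging(thresholds, x):
    """Aggregate the base models by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
# True rule is x > 0.5; about 10% of the labels are flipped.
ys = [int(x > 0.5) if rng.random() > 0.1 else int(x <= 0.5) for x in xs]

thresholds = fit_bagging(xs, ys, n_models=25, rng=rng)
grid = [i / 100 for i in range(101)]
accuracy = sum(predict_bagging(thresholds, x) == int(x > 0.5) for x in grid) / len(grid)
print("Bagged accuracy on a noise-free grid:", accuracy)
```

For classification the “average” is a majority vote; for regression it would be the mean of the base predictions.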

### Bagged Decision Trees for Classification

In [7]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url,names=names)

df.head()

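|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|--:|-----:|-----:|-----:|-----:|-----:|-----:|------:|----:|------:|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |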


In [8]:
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

array = df.values  # Convert the pandas DataFrame 'df' into a NumPy array.
x = array[:, 0:8]  # Select all rows and columns 0 to 7 (features) from the array (first 8 columns).
y = array[:, 8]    # Select all rows and column 8 (target/label) from the array (9th column, used for classification).

max_features = 3    # Set the maximum number of features that each base estimator (decision tree) can use for splitting.

# Define the cross-validation strategy using KFold 
kfold = model_selection.KFold(n_splits=10,              # 10 folds: each fold is held out once as the test set.
                              shuffle=True,             # Shuffle the data before splitting into folds.
                              random_state=2020)        # Use a fixed random state for reproducibility.

# Initialize a decision tree classifier as the base estimator for bagging
rf = DecisionTreeClassifier(max_features=max_features)  # limit of 3 features for splitting.

num_trees = 100   # Define the number of trees (or estimators) in the bagging ensemble model.

# Initialize a bagging classifier with 100 base estimators (decision trees) and a fixed random state.
model = BaggingClassifier(estimator=rf,          # Base estimator: decision tree classifier
                          n_estimators=num_trees,  # Define the number of trees (100)
                          random_state=2020)       # Fixed random state so the bootstrap samples are reproducible.

# Perform 10-fold cross-validation on the model using the feature set 'x' and target labels 'y'.
results = model_selection.cross_val_score(model, x, y, cv=kfold)

# Print the mean accuracy and standard deviation of the model across the 10 folds.
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))  # Results are reported as mean accuracy +/- standard deviation.

Accuracy: 0.77 (+/- 0.04)


### Random Forest Classification

In [10]:
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier


# Dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"  # URL of the Pima Indians Diabetes dataset
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # Column names for the dataset
df = pd.read_csv(url, names=names)  # Load the dataset from the URL into a pandas DataFrame

# Convert the DataFrame into a NumPy array
array = df.values  # Extract the values of the DataFrame as a NumPy array
x = array[:, 0:8]  # Select all rows and the first 8 columns (features) for input (X)
y = array[:, 8]    # Select all rows and the 9th column (target labels: class) for output (Y)

# Define the cross-validation strategy using KFold
kfold = model_selection.KFold(n_splits=10,        # 10 folds: each fold is held out once as the test set.
                              shuffle=True,       # Shuffle enabled
                              random_state=2020)  # Use a fixed random state for reproducibility. 

# Specify the number of trees (estimators) for the RandomForest model
num_trees = 100  # Use 100 decision trees in the RandomForest ensemble

# Specify the maximum number of features used to split each node in the RandomForest model
max_features = 3  # Use 3 features for splitting each node in the trees

# Initialize the RandomForestClassifier model
model = RandomForestClassifier(n_estimators=num_trees,     # 100 trees
                               max_features=max_features)  # limit to 3 features per split

# Perform cross-validation on the RandomForest model
results = model_selection.cross_val_score(model, x, y, cv=kfold)        # Perform cross-validation and store the results (accuracy for each fold)

# Print the mean accuracy and standard deviation from the 10-fold cross-validation
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))  # Print the average accuracy and its standard deviation across the folds

Accuracy: 0.77 (+/- 0.04)


# Boosting

In **sequential methods**, the combined weak models are **no longer** fitted **independently** of each other. The idea is to fit models **iteratively**, so that the training of the model at a given step depends on the models fitted at the previous steps. “Boosting” is the most famous of these approaches, and it produces an ensemble model that is in general **less biased** than the weak learners that compose it.

Boosting methods work in the same spirit as bagging methods: we build a family of models
that are aggregated to obtain a strong learner that performs better.

However, unlike bagging, which mainly aims at reducing variance, boosting is a technique
that consists in sequentially fitting multiple weak learners in a very adaptive way: each
model in the sequence is fitted giving more importance to observations in the dataset that
were badly handled by the previous models in the sequence. Intuitively, each new model
focuses its efforts on the observations that have been hardest to fit so far, so that we obtain,
at the end of the process, a strong learner with **lower bias** (although boosting can
also have the effect of reducing variance).

Boosting consists in iteratively fitting a weak learner, aggregating it into the ensemble model, and “updating” the training dataset to better take into account the strengths and weaknesses of the current ensemble model when fitting the next base model.
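
The loop described above can be sketched as a small discrete AdaBoost on one-dimensional data with labels in {-1, +1}. This is an illustration under our own assumptions, not the scikit-learn implementation used below: the “update” of the dataset is done by reweighting observations, the weak learner is a decision stump, and all function names are hypothetical. The toy target is an interval, which no single stump can represent.

```python
import math


def stump_predict(stump, x):
    """A stump (threshold, polarity) predicts polarity if x > threshold, else -polarity."""
    threshold, polarity = stump
    return polarity if x > threshold else -polarity


def fit_weighted_stump(xs, ys, weights):
    """Return (stump, weighted_error) for the stump with the lowest weighted error."""
    ordered = sorted(set(xs))
    # One threshold below all points (constant prediction) plus all midpoints.
    thresholds = [ordered[0] - 1.0] + [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best, best_err = None, float("inf")
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict((t, polarity), x) != y)
            if err < best_err:
                best, best_err = (t, polarity), err
    return best, best_err


def fit_adaboost(xs, ys, n_rounds):
    weights = [1.0 / len(xs)] * len(xs)  # start with uniform observation weights
    ensemble = []                        # list of (alpha, stump)
    for _ in range(n_rounds):
        stump, err = fit_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this learner in the vote
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified points, decrease the others.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict_adaboost(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score > 0 else -1


def training_accuracy(ensemble, xs, ys):
    return sum(predict_adaboost(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)


xs = [(i + 0.5) / 20 for i in range(20)]
ys = [1 if 0.3 < x < 0.7 else -1 for x in xs]  # positive only inside an interval

single = fit_adaboost(xs, ys, n_rounds=1)
boosted = fit_adaboost(xs, ys, n_rounds=3)
print("1 stump :", training_accuracy(single, xs, ys))
print("3 stumps:", training_accuracy(boosted, xs, ys))
```

A single stump is too biased to fit an interval, while the weighted vote of three stumps can represent it. This is the bias reduction mentioned above.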

### Adaboost Classifier

In [15]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score


# Load the breast cancer dataset from sklearn
breast_cancer = load_breast_cancer()

# Convert the data into a DataFrame for easier manipulation, using feature names as column headers
x = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)

# Convert the target into a categorical type, using the target names as labels
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)

# Label encode the target to convert it into integer labels (0 and 1)
encoder = LabelEncoder()
binary_encoded_y = pd.Series(encoder.fit_transform(y))  # Fit the encoder and transform the labels to integers

# Split the dataset into training and testing sets, using 75% for training and 25% for testing
train_x, test_x, train_y, test_y = train_test_split(x, binary_encoded_y, random_state=1)

# Initialize an AdaBoost classifier
clf_boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # Base estimator is a decision tree with a maximum depth of 1
    n_estimators=200,                     # Use 200 decision trees (weak learners) in the AdaBoost ensemble
    algorithm='SAMME'                     # Use SAMME to avoid the warning
)

# Fit the AdaBoost classifier to the training data
clf_boosting.fit(train_x, train_y)

# Make predictions on the test set
predictions = clf_boosting.predict(test_x)

# Print the F1 score and accuracy of the model
print("For Boosting : F1 Score {}, Accuracy {}".format(
    round(f1_score(test_y, predictions), 2),       # Compute the F1 score (rounded to 2 decimal places)
    round(accuracy_score(test_y, predictions), 2)  # Compute the accuracy score (rounded to 2 decimal places)
))

For Boosting : F1 Score 0.93, Accuracy 0.95


### Random Forest as a Bagging Classifier

In [17]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier


# Load the breast cancer dataset from sklearn
breast_cancer = load_breast_cancer()

# Convert the breast cancer data into a pandas DataFrame with feature names as columns
x = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)

# Convert the target values into categorical labels using the target names
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)

# Transforming string target (i.e., 'malignant', 'benign') to an integer (0 or 1)
encoder = LabelEncoder()                                # Initialize the LabelEncoder
binary_encoded_y = pd.Series(encoder.fit_transform(y))  # Encode target labels to integers and store in a pandas Series

# Split the dataset into training and testing sets (75% train, 25% test)
train_x, test_x, train_y, test_y = train_test_split(x,                 # Features
                                                    binary_encoded_y,  # Target labels
                                                    random_state=1)    # Set random_state for reproducibility

# Initialize a RandomForestClassifier with 200 decision trees and max depth of 1 (similar to bagging with weak learners)
clf_bagging = RandomForestClassifier(n_estimators=200,  # Number of decision trees (estimators)
                                     max_depth=1)       # Set max depth of the decision trees to 1 (weak learners)

# Train (fit) the RandomForestClassifier on the training data
clf_bagging.fit(train_x, train_y)

# Make predictions on the test data
predictions = clf_bagging.predict(test_x)

# Print the F1 score and accuracy score of the model
print("For Bagging : F1 Score {}, Accuracy {}".format(
    round(f1_score(test_y, predictions), 2),       # Calculate F1 score and round to 2 decimal places
    round(accuracy_score(test_y, predictions), 2)  # Calculate accuracy score and round to 2 decimal places
))

For Bagging : F1 Score 0.85, Accuracy 0.9


| Metric | Bagging | Boosting |
|:---------|:--------:|---------:|
|  Accuracy   |  0.90   |  0.95   |
|  F1-Score   |  0.85   |  0.93   |

# Stacking

**Stacking** mainly differs from **Bagging** and **Boosting** on two points:

- First, stacking often considers heterogeneous weak learners (different learning algorithms are combined), whereas bagging and boosting mainly consider homogeneous weak learners.
- Second, stacking learns to combine the base models using a meta-model, whereas bagging and boosting combine weak learners following deterministic algorithms.
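
Both points can be illustrated with a small, library-free sketch on synthetic data. Everything here is an assumption made for illustration. The two base models (`HistogramModel`, `KnnModel`) are toy estimators that each see only one feature. The meta-model is a deliberately tiny weighted blend with a threshold, fitted by grid search, which stands in for a logistic regression. The meta-model is trained on out-of-fold predictions, so it only sees base-model outputs for rows those models were not fitted on. The `Stacking` class in the cell below instead feeds the meta-learner probabilities predicted on the training rows themselves.

```python
import random


class HistogramModel:
    """Fraction of positives per bin of one feature (assumes values in [0, 1])."""
    def __init__(self, feature, n_bins=10):
        self.feature, self.n_bins = feature, n_bins

    def _bin(self, row):
        return min(int(row[self.feature] * self.n_bins), self.n_bins - 1)

    def fit(self, X, y):
        pos, tot = [0] * self.n_bins, [0] * self.n_bins
        for row, label in zip(X, y):
            pos[self._bin(row)] += label
            tot[self._bin(row)] += 1
        self.rate = [p / t if t else 0.5 for p, t in zip(pos, tot)]
        return self

    def predict_proba(self, row):
        return self.rate[self._bin(row)]


class KnnModel:
    """Fraction of positives among the k nearest training points on one feature."""
    def __init__(self, feature, k=40):
        self.feature, self.k = feature, k

    def fit(self, X, y):
        self.points = [(row[self.feature], label) for row, label in zip(X, y)]
        return self

    def predict_proba(self, row):
        v = row[self.feature]
        nearest = sorted(self.points, key=lambda p: abs(p[0] - v))[: self.k]
        return sum(label for _, label in nearest) / len(nearest)


def make_models():
    return [HistogramModel(feature=0), KnnModel(feature=1)]


def out_of_fold_features(X, y, n_folds=5):
    """Meta-features for each row, predicted by models that never saw that row."""
    n, meta = len(X), [None] * len(X)
    for fold in range(n_folds):
        held_out = set(range(fold, n, n_folds))
        fit_X = [X[i] for i in range(n) if i not in held_out]
        fit_y = [y[i] for i in range(n) if i not in held_out]
        models = [m.fit(fit_X, fit_y) for m in make_models()]
        for i in held_out:
            meta[i] = [m.predict_proba(X[i]) for m in models]
    return meta


def fit_blend(meta, y):
    """Meta-model: grid-search a and tau for the rule a*p0 + (1-a)*p1 > tau."""
    best, best_correct = None, -1
    for a in [i / 10 for i in range(11)]:
        for tau in [i / 20 for i in range(1, 20)]:
            correct = sum((a * p0 + (1 - a) * p1 > tau) == bool(label)
                          for (p0, p1), label in zip(meta, y))
            if correct > best_correct:
                best, best_correct = (a, tau), correct
    return best


rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(800)]
y = [int(x0 + x1 > 1.0) for x0, x1 in X]  # neither feature alone determines the label
train_X, train_y, test_X, test_y = X[:600], y[:600], X[600:], y[600:]

meta_train = out_of_fold_features(train_X, train_y)
a, tau = fit_blend(meta_train, train_y)
final_models = [m.fit(train_X, train_y) for m in make_models()]  # refit on all training rows


def stacked_predict(row):
    p0, p1 = (m.predict_proba(row) for m in final_models)
    return int(a * p0 + (1 - a) * p1 > tau)


def accuracy(predict):
    return sum(predict(row) == label for row, label in zip(test_X, test_y)) / len(test_y)


base_accs = [accuracy(lambda row, m=m: int(m.predict_proba(row) > 0.5)) for m in final_models]
stacked_acc = accuracy(stacked_predict)
print("Base models:", base_accs, "Stacked:", stacked_acc)
```

The grid includes the settings that reproduce each base model on its own, so on the out-of-fold data the blend can never do worse than the better base model. Whether it does better on new data depends on how complementary the base models are.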

In [21]:
from sklearn.ensemble import AdaBoostClassifier  
from sklearn.tree import DecisionTreeClassifier  
from sklearn.datasets import load_breast_cancer 
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix  
from sklearn.preprocessing import LabelEncoder  
from sklearn.metrics import accuracy_score 
from sklearn.metrics import f1_score 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 

# Load the breast cancer dataset
breast_cancer = load_breast_cancer()

# Create a DataFrame for the features of the breast cancer dataset
x = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names) 

# Create a Categorical object for the target values
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)  # Convert the target labels to categorical format

# Encode the categorical target labels into binary (0 or 1)
encoder = LabelEncoder()                                # Initialize the label encoder
binary_encoded_y = pd.Series(encoder.fit_transform(y))  # Apply the encoder to transform the target labels into binary format

# Split the dataset into training and testing sets (75% train, 25% test)
train_x, test_x, train_y, test_y = train_test_split(x,  # Input features
                                                    binary_encoded_y,   # Target labels
                                                    random_state=2020)  # Use a fixed random state for reproducibility

# Initialize the AdaBoostClassifier with a DecisionTreeClassifier as the base estimator (with a maximum depth of 1)
boosting_clf_ada_boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),  # Weak learner for boosting (decision stump)
                                            n_estimators=3,                       # Use 3 weak learners (decision stumps) in boosting
                                            algorithm='SAMME')                    # Use SAMME algorithm to avoid the FutureWarning

# Initialize the RandomForestClassifier for bagging (using decision trees with a max depth of 1)
bagging_clf_rf = RandomForestClassifier(n_estimators=200,   # Use 200 decision trees in the RandomForest model
                                        max_depth=1,        # Use weak learners (max depth = 1) for each tree
                                        random_state=2020)  # Use a fixed random state for reproducibility

# Another RandomForestClassifier to use for stacking (same settings as bagging_clf_rf)
clf_rf = RandomForestClassifier(n_estimators=200,   # 200 decision trees for RandomForest
                                max_depth=1,        # Weak learners (max depth = 1)
                                random_state=2020)  # Use fixed random state for reproducibility

# Initialize the AdaBoostClassifier for stacking
clf_ada_boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1,         # Weak learner for boosting (decision stump)
                                                          random_state=2020),  # Fixed random state
                                   n_estimators=3,     # 3 weak learners (decision stumps)
                                   algorithm='SAMME')  # Use SAMME algorithm to avoid the warning

# Initialize LogisticRegression for stacking as the final classifier (meta-learner)
clf_logistic_reg = LogisticRegression(solver='liblinear',  # Use liblinear solver for binary classification
                                      random_state=2020)   # Use fixed random state for reproducibility

# Custom exception to handle cases where the number of classifiers in stacking is less than 2
class NumberOfClassifierException(Exception):
    pass  # Define a custom exception to raise when the number of classifiers is insufficient

# Create a stacking class
class Stacking():
    '''
    This is a test class for stacking! Please feel free to modify it to fit your needs.
    We assume that at least the first N-1 classifiers have a predict_proba function.
    '''
    def __init__(self,classifiers):
        # Ensure there are at least 2 classifiers: one base classifier plus the meta-learner
        if len(classifiers) < 2:
            raise NumberOfClassifierException("You must fit your classifier with at least 2 classifiers")
        self._classifiers = classifiers  # Store the list of classifiers
        
    def fit(self, data_x, data_y):
        # Initialize stacked data with original features
        stacked_data_x = data_x.copy()       # Create a copy of the input data (features)
        
        # Train N-1 classifiers and stack their predicted probabilities
        for classifier in self._classifiers[:-1]:
            classifier.fit(data_x, data_y)  # Fit each classifier to the data
            stacked_data_x = np.column_stack((stacked_data_x, classifier.predict_proba(data_x))) # Stack the predicted probabilities of each classifier
        
        # Fit the final (meta) classifier on the stacked data
        last_classifier = self._classifiers[-1]      # Select the last classifier (meta-learner)
        last_classifier.fit(stacked_data_x, data_y)  # Fit the meta-learner on the stacked features and probabilities
        
    def predict(self, data_x):
        # Initialize stacked data with original features
        stacked_data_x = data_x.copy()  # Create a copy of the input data (features)
        
        # Stack predictions from N-1 classifiers
        for classifier in self._classifiers[:-1]:
            prob_predictions = classifier.predict_proba(data_x)                   # Get predicted probabilities
            stacked_data_x = np.column_stack((stacked_data_x, prob_predictions))  # Stack the predicted probabilities
        
        # Use the final (meta) classifier to make the final predictions
        last_classifier = self._classifiers[-1]         # Select the last classifier (meta-learner)
        return last_classifier.predict(stacked_data_x)  # Return the predictions made by the meta-learner

# Train the bagging classifier on the training data
bagging_clf_rf.fit(train_x, train_y)          # Fit the RandomForest bagging model on the training data

# Train the boosting classifier on the training data
boosting_clf_ada_boost.fit(train_x, train_y)  # Fit the AdaBoost model on the training data

# List of classifiers to use for stacking (RandomForest, AdaBoost, and LogisticRegression)
classifiers_list = [clf_rf, clf_ada_boost, clf_logistic_reg]  # Define the classifiers used in stacking

# Create an instance of the Stacking class
clf_stacking = Stacking(classifiers_list)  # Create a Stacking model with the classifiers
clf_stacking.fit(train_x, train_y)         # Train the stacking model on the training data

# Make predictions on the test data using the bagging model
predictions_bagging = bagging_clf_rf.predict(test_x)           # Predict the test labels using the bagging model

# Make predictions on the test data using the boosting model
predictions_boosting = boosting_clf_ada_boost.predict(test_x)  # Predict the test labels using the boosting model

# Make predictions on the test data using the stacking model
predictions_stacking = clf_stacking.predict(test_x)            # Predict the test labels using the stacking model

# Calculate the F1 score and accuracy
bagging_f1 = f1_score(test_y, predictions_bagging)         # Calculate the F1 score for the bagging model
bagging_ac = accuracy_score(test_y, predictions_bagging)   # Calculate the accuracy for the bagging model
boosting_f1 = f1_score(test_y, predictions_boosting)       # Calculate the F1 score for the boosting model
boosting_ac = accuracy_score(test_y, predictions_boosting) # Calculate the accuracy for the boosting model
stacking_f1 = f1_score(test_y, predictions_stacking)       # Calculate the F1 score for the stacking model
stacking_ac = accuracy_score(test_y, predictions_stacking) # Calculate the accuracy for the stacking model

# Print the F1 score and accuracy for the bagging, boosting, and stacking models
print("For Bagging : F1 Score {}, Accuracy {}".format(round(bagging_f1, 2), round(bagging_ac, 2)))     # Print F1 and accuracy for bagging
print("For Boosting : F1 Score {}, Accuracy {}".format(round(boosting_f1, 2), round(boosting_ac, 2)))  # Print F1 and accuracy for boosting
print("For Stacking : F1 Score {}, Accuracy {}".format(round(stacking_f1, 2), round(stacking_ac, 2)))  # Print F1 and accuracy for stacking

For Bagging : F1 Score 0.88, Accuracy 0.9
For Boosting : F1 Score 0.93, Accuracy 0.94
For Stacking : F1 Score 0.98, Accuracy 0.98


| Metric | Bagging | Boosting | Stacking |
|:---------|:--------:|:--------:|---------:|
|  Accuracy   |  0.90   |  0.94   |  0.98   |
|  F1-Score   |  0.88   |  0.93   |  0.98   |