[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorial_notebooks/9_ensemble_learning_solutions.ipynb) 

# Tutorial 9 - Ensemble Learning 

<span style="font-weight: bold; color: red;">This version includes solutions to the exercises. </span>

In this tutorial, we revisit our lecture on ensemble learning. The lecture introduced two new algorithms, Random Forest (RF) and Gradient Boosting (GBM), and, more generally, the space of *ensemble learners*, which generate a composite forecast by integrating the predictions of multiple *base models*. We will cover the following topics:

#### Contents

1. [Foundations of Ensemble Modeling and Forecast Combination](#foundations)  
2. [Bagging and Random Forest](#bagging-rf)  
3. [Gradient Boosting](#gradient-boosting)  
4. [Hyperparameter Tuning for Ensemble Models](#modelsel)

Within each section, you will find:
- A **Concept Recap** highlighting the key ideas from the lecture.
- A **Programming Demo** illustrating these ideas in Python.
- **Exercises** to deepen your understanding and practice these methods.


## Preliminaries

### Standard imports

In [2]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### The HMEQ data set
We continue using the "Home Equity" data set (HMEQ), which we can load and prepare using the helper function `get_HMEQ_credit_data`. This function is available in our course's module `bads_helper_functions.py`.

In [None]:
import bads_helper_functions as bads  # import module with bads helper functions
X, y = bads.get_HMEQ_credit_data()  # load the data 

print("Data loaded. Shape of X: ", X.shape, "Shape of y:", y.shape)

In [None]:
X  # preview the data   

<a id="foundations"></a>
# 1. Foundations of Ensemble Modeling and Forecast Combination

## 1.1 Concept Recap
In ensemble modeling, multiple models (referred to as “base learners” or “weak learners”) are combined to produce a single predictive model. The motivation behind ensemble modeling is that by combining various learners, we often get better generalization performance than using a single learner. Ensemble models help to:

- Reduce variance (stabilize predictions).
- Potentially reduce bias (when certain conditions are met).
- Provide more robust and reliable predictions across different problem domains.

Techniques include:

- Averaging Methods: e.g., Bagging (Bootstrap Aggregating)
- Boosting Methods: e.g., Gradient Boosting
- Heterogeneous approaches, which integrate different learning algorithms to form the base models. Examples include the Stacking algorithm, which combines base models using a meta-model (see, e.g., [this tutorial for a demo](https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/) if interested). The lecture did not elaborate on heterogeneous ensemble learners; neither will this tutorial. We mainly mention them for the sake of completeness. 
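To make the variance-reduction argument from the recap more tangible, here is a minimal simulation sketch. It rests on an idealized assumption that is not from the lecture data: each base model's forecast equals the true value plus *independent* noise. The names `n_models`, `true_value`, and the noise level are made up for illustration. Real base models are correlated, so the reduction you observe in practice will typically be smaller.

```python
import numpy as np

rng = np.random.default_rng(888)

n_trials = 10_000   # number of repeated (simulated) experiments
n_models = 25       # hypothetical number of base models in the ensemble
true_value = 0.3    # the quantity every base model tries to estimate

# Assumption: forecast of each base model = truth + independent noise (std. dev. 0.1)
forecasts = true_value + rng.normal(loc=0.0, scale=0.1, size=(n_trials, n_models))

single_var = forecasts[:, 0].var()            # variance of one base model's forecast
ensemble_var = forecasts.mean(axis=1).var()   # variance of the simple average forecast

print(f"Variance of a single model: {single_var:.5f}")
print(f"Variance of the average:    {ensemble_var:.5f}")
print(f"Ratio single / ensemble:    {single_var / ensemble_var:.1f}")
```

Under the independence assumption, the variance of the average shrinks roughly by a factor of `n_models`. This is the core intuition for why averaging stabilizes predictions.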

## 1.2 Programming Demo: Simple Averaging Ensemble
Below is a simple demonstration of how one might combine three different models (a Logistic Regression and two Decision Trees) by averaging their probability estimates. This is actually a heterogeneous ensemble, as we combine different types of models. The example is meant to convey the generality of ensemble modeling, which the lecture illustrated as follows: <br>
<br>
<img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/ensemble_learning.png" width="854" height="480" alt="Ensemble Learning">

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=888)  # partition the data into 75% training and 25% test

# Define base models
base_learners = [
    LogisticRegression(max_iter=1000, random_state=888),
    DecisionTreeClassifier(max_depth=3),
    DecisionTreeClassifier(max_depth=10)
]

# Train the models
for m in base_learners:
    m.fit(X_train, y_train)

# Predict probabilities
probs = []
for m in base_learners:
    yhat = m.predict_proba(X_test)[:, 1]
    probs.append(yhat)

# Simple average forecast 
ensemble_prob = np.mean(probs, axis=0)

# Performance evaluation
print('Performance evaluation in terms of AUC:')
print('-' * 50)
print(f'Ensemble \t {roc_auc_score(y_true=y_test, y_score=ensemble_prob):.3f}')
print('-' * 50)
for i, p in enumerate(probs):
    print(f'Base model {i} \t {roc_auc_score(y_true=y_test, y_score=p):.3f}')
print('-' * 50)


<a id="bagging-rf"></a>

# 2. Bagging and Random Forest

## 2.1 Concept Recap
Bagging (Bootstrap Aggregating):

In bagging, we train multiple instances of the same model class on different bootstrap samples (randomly sampled with replacement) of the original training set.
The final prediction is usually made by majority voting (classification) or averaging (regression).
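The two steps just described, bootstrap sampling and aggregation, can be sketched by hand in a few lines. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so that the snippet runs on its own) and averages predicted probabilities rather than taking a majority vote. It is meant as an illustration of the idea, not as a replacement for the `sklearn` implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; this is not the HMEQ data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=888)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=888)

rng = np.random.default_rng(888)
n_models = 25   # hypothetical ensemble size
bag_probs = []
for _ in range(n_models):
    # Bootstrap sample: draw len(Xtr) row indices with replacement
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    tree = DecisionTreeClassifier(random_state=888).fit(Xtr[idx], ytr[idx])
    bag_probs.append(tree.predict_proba(Xte)[:, 1])

# Aggregation: average the probability forecasts of all base models
bagged_prob = np.mean(bag_probs, axis=0)

# Single tree trained on the full training set for comparison
single_tree = DecisionTreeClassifier(random_state=888).fit(Xtr, ytr)
auc_single = roc_auc_score(yte, single_tree.predict_proba(Xte)[:, 1])
auc_bagged = roc_auc_score(yte, bagged_prob)
print(f"AUC single tree:  {auc_single:.3f}")
print(f"AUC bagged trees: {auc_bagged:.3f}")
```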

Random Forest:

Random Forest is an extension of bagging, which uses Decision Trees as base learners. Random Forest also uses feature sub-sampling at each split to decorrelate individual trees. Much empirical evidence suggests that Random Forest is a very effective and robust algorithm for a wide range of prediction tasks. It is also relatively simple to tune.

## 2.2 Programming Demo: Random Forest
Below, we train a simple Random Forest, measure its AUC and plot the ROC curve. To showcase the effectiveness of Random Forest, we add the results of the previous demo, our heterogeneous ensemble, to the ROC plot. 


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay

# Instantiate and train RandomForest
rf_model = RandomForestClassifier(n_estimators=100,  # the number of base model trees
                                   max_depth=5,      # the maximum depth of each tree
                                   random_state=888  # Random number seed. Recall that RF is stochastic due to bootstrap sampling and random subspace
                                   )
rf_model.fit(X_train, y_train)

# Evaluate
rf_probs = rf_model.predict_proba(X_test)[:,1]
print(f"Random Forest AUC: {roc_auc_score(y_true=y_test, y_score=rf_probs):.3f}")


In [None]:
# Visualize RF performance and compare to benchmarks
f, ax = plt.subplots(figsize=(8,6))
RocCurveDisplay.from_estimator(estimator=rf_model, X=X_test, y=y_test, ax=ax)
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=ensemble_prob, ax=ax, name="Simple avg. ensemble")
for i, p in enumerate(probs):
    RocCurveDisplay.from_predictions(y_true=y_test, y_pred=p, ax=ax, name=f"Base model {i}")
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red')
plt.show()


## 2.3 Exercise 1
Unlike RF, the Bagging algorithm is not specific to Decision Trees. It can be applied to any base learner.

Here is your task:<br>
- Train two Bagging classifiers, one using logistic regression and one using decision trees to train the base models. This allows for a comparison of which algorithm, logistic regression or decision tree, benefits more from bagging. An implementation of the Bagging algorithm is available in the `sklearn.ensemble` module. 
- For each Bagging classifier, use 50 base learners and set the random state to 888. 
  - For the logistic regression base learner, set `max_iter=1000`, and `random_state=888`
  - For the decision tree base learner, set `max_depth=3`
  - Note that these are the same settings as used in our first demo on the simple average ensemble. 
- Create an ROC chart to compare the models. Specifically:
  - Plot the ROC curve for the bagged logistic regression
  - Plot the ROC curve for the bagged decision tree
  - Plot the ROC curve for the non-bagged logistic regression trained in the first demo. To do this, simply reuse the already stored predictions `probs[0]`
  - Also add a ROC curve for the non-bagged decision tree trained. Again, simply reuse the available predictions `probs[1]`
- Add a legend to the plot to distinguish the different models.
    

<details> <summary>Hint on bagging </summary> Use the following scaffolding to configure the Bagging classifier:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=___),  # this argument was called base_estimator before scikit-learn 1.2
    n_estimators=___, 
    random_state=___
)
bagging_model.fit(X_train, y_train)
# Evaluate your model
```	
</details>

<details> <summary>Hint on the ROC curve </summary> Note that the previous ROC curve example illustrates how to draw curves for the non-bagged models. Assuming you have executed previous code cells, you can reuse the stored predictions `probs[0]` and `probs[1]` to plot the ROC curves for the non-bagged models as follows:

```python
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=probs[0], ax=ax, name=f"Logistic regression")
```	
</details>

In [None]:
from sklearn.ensemble import BaggingClassifier

# Define the Bagging classifiers
bagging_logreg = BaggingClassifier(
    estimator=LogisticRegression(max_iter=1000, random_state=888),  # called base_estimator before scikit-learn 1.2
    n_estimators=50,
    random_state=888
)

bagging_dtree = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),  # called base_estimator before scikit-learn 1.2
    n_estimators=50,
    random_state=888
)

# Train the Bagging classifiers
bagging_logreg.fit(X_train, y_train)
bagging_dtree.fit(X_train, y_train)

# Predict probabilities
yhat_logreg = bagging_logreg.predict_proba(X_test)[:, 1]
yhat_dtree = bagging_dtree.predict_proba(X_test)[:, 1]


# Plot ROC curves
fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=yhat_logreg, ax=ax, name="Bagged Logistic Regression")
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=yhat_dtree,  ax=ax, name="Bagged Decision Tree")
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=probs[0],    ax=ax, name=f"Logistic regression")
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=probs[1],    ax=ax, name=f"Decision Tree")
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red')
plt.show()

Examining the result of Exercise 1, what can you say about the performance of the bagged logistic regression and the bagged decision tree compared to the non-bagged models?

More interestingly, what factors might explain the observed effectiveness of bagging (or lack thereof)?

**Answer:**
- The bagged and non-bagged versions of each classifier perform almost identically. This implies that bagging was not effective in increasing predictive performance.
- It is not surprising that bagging failed in this case.
  - Logistic regression is a stable classifier. Fitting multiple instances on bootstrap samples does not yield much diversity. All base models will have very similar coefficients and, therefore, produce similar forecasts. Averaging over very similar forecasts does not boost performance. Thus, bagging is not expected to improve the performance of logistic regression or, more generally, of any **stable classifier**.
  - Decision trees, on the other hand, are unstable classifiers. They are sensitive to small changes in the training data. Therefore, we can legitimately expect bagging to boost performance. The reason it failed in this case is our configuration of the tree learning algorithm. By setting `max_depth=3`, we only allow it to produce shallow trees. This substantially limits the degree to which bootstrap sampling can yield diverse trees. Therefore, the base model trees in our ensemble are also very similar, just as with logistic regression. 

  To verify this, you could run another experiment that compares a single deep tree to a bagging ensemble with deep trees as base learners. Exercise 1 provides nearly all of the code needed to do this.  

In [None]:
# EXTRA: Comparison of deep tree to bagged deep tree
bagging_deep_tree = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),  # called base_estimator before scikit-learn 1.2
    n_estimators=50,
    random_state=888
)

bagging_deep_tree.fit(X_train, y_train)
yhat = bagging_deep_tree.predict_proba(X_test)[:, 1]

# Plot ROC curves. Note that we reuse the predictions from the deep tree trained
# in the demo on the simple average ensemble
fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=yhat, ax=ax, name="Bagged Deep Tree")
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=probs[2], ax=ax, name=f"Deep Tree")
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red')
plt.show()

# Compute the relative performance boost of bagging
auc_bag = roc_auc_score(y_true=y_test,   y_score=yhat)
auc_nobag = roc_auc_score(y_true=y_test, y_score=probs[2])
assert auc_bag >= auc_nobag
perf_boost = (auc_bag-auc_nobag)/auc_nobag
print(f'Bagging improved deep tree performance by {100*perf_boost:.2f}%')

<a id="gradient-boosting"></a>

# 3. Gradient Boosting
## 3.1 Concept Recap
Boosting builds an ensemble of weak learners in a sequential way, where each new learner attempts to correct the errors of the previous ensemble. Gradient Boosting is a general boosting framework:

- Models are added sequentially.
- Each subsequent model is trained to reduce the residual errors (or gradient of the loss) of the current ensemble.
- Learning rate (shrinkage) controls how strongly each new model influences the ensemble.
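The three points above can be sketched as a hand-written boosting loop. The example below is a simplified illustration under stated assumptions: it uses a synthetic regression problem and squared-error loss, for which the negative gradient equals the residual. The data, the tree depth, and the number of rounds are arbitrary choices for illustration. Library implementations such as `GradientBoostingClassifier` add further refinements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic toy regression data (assumption; purely for illustration)
rng = np.random.default_rng(888)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1   # shrinkage: how strongly each new tree influences the ensemble
n_rounds = 100        # number of sequentially added base models

# Start from a constant model (the mean of the target)
pred = np.full_like(y_toy, y_toy.mean())
mse_path = [np.mean((y_toy - pred) ** 2)]

for _ in range(n_rounds):
    # For squared-error loss, the negative gradient is simply the residual
    residuals = y_toy - pred
    # Fit a weak learner (shallow tree) to the residuals of the current ensemble
    weak_learner = DecisionTreeRegressor(max_depth=2, random_state=888).fit(X_toy, residuals)
    # Add the shrunken correction to the ensemble forecast
    pred += learning_rate * weak_learner.predict(X_toy)
    mse_path.append(np.mean((y_toy - pred) ** 2))

print(f"Training MSE before boosting: {mse_path[0]:.4f}")
print(f"Training MSE after boosting:  {mse_path[-1]:.4f}")
```

Note that the sketch tracks the *training* error only. A decreasing training error says nothing about overfitting, which is why the learning rate and the number of rounds are usually tuned on held-out data.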

Common implementations:
- GradientBoostingClassifier (in sklearn.ensemble)
- XGBoost, LightGBM, CatBoost (popular specialized libraries)

## 3.2 Programming Demo: Gradient Boosting
The code snippet below demonstrates training and evaluation of a gradient boosting classifier using the `GradientBoostingClassifier` implementation from `sklearn.ensemble`. While this implementation suffices for our purposes, it is worth noting that specialized libraries such as XGBoost, LightGBM, and CatBoost are often preferred in practice because they tend to train faster and scale better to large data sets. Thus, when working on real-world projects and with larger data sets, you should consider using one of these libraries.

In [None]:
# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier(n_estimators=100,
                                      learning_rate=0.1,
                                      max_depth=3,
                                      random_state=888)
gb_model.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_estimator(estimator=gb_model, X=X_test, y=y_test, ax=ax)
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red')
plt.show()



## 3.3 Exercise 2
All previous demos and exercises have used hand-picked hyperparameter values for the ensemble models, without any tuning. However, hyperparameter tuning is often crucial for achieving good performance.

Here is your task:
Drawing on the above demo of GBM, experiment with different settings for the hyperparameters `learning_rate` and `n_estimators`. Record the AUC values of different models with different hyperparameters to study how the hyperparameters affect performance.

<details> <summary>Hint</summary> 
One way to approach exercise 2 is to loop over different candidate settings for each hyperparameter and record the AUC in each round. For example, you could:<br><br>
1. Loop over a small set of learning_rate values, e.g. `[0.01, 0.1, 0.2]`. <br>
2. Loop over a small set of n_estimators values, e.g. `[50, 100, 200]`. <br>
3. Measure the AUC of the corresponding GBM model using `roc_auc_score()`. <br>
4. Keep track of the highest AUC value obtained thus far and the corresponding hyperparameters. <br>
  - To do this, you could use a variable `best_auc`, which you compare to the AUC obtained in the current iteration.<br>
  - Any time you observe the current AUC to be higher than the best AUC, update the best AUC. <br>
  - You can proceed in the same way with the hyperparameters.<br>

</details>


In [None]:
# Solution to Exercise 2
# Experiment with different settings for learning_rate and n_estimators
learning_rates = [0.01, 0.1, 0.2]
n_estimators_list = [50, 100, 200]

best_auc = 0
best_params = {}

for lr in learning_rates:
    for n_estimators in n_estimators_list:
        gb_model = GradientBoostingClassifier(n_estimators=n_estimators,
                                              learning_rate=lr,
                                              max_depth=3,
                                              random_state=888)  # same seed as in the other demos
        gb_model.fit(X_train, y_train)
        yhat = gb_model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_true=y_test, y_score=yhat)
        
        if auc > best_auc:
            best_auc = auc
            best_params = {'learning_rate': lr, 'n_estimators': n_estimators}
        
        print(f'Learning rate: {lr}, n_estimators: {n_estimators}, AUC: {auc:.3f}')

print(f'\nBest AUC: {best_auc:.3f} with parameters: {best_params}')

<a id="modelsel"></a>

# 4. Model Selection and Hyperparameter Tuning
Exercise 2 already touched on the task of model selection and hyperparameter tuning. This is a crucial step in the machine learning pipeline. Therefore, `sklearn` offers powerful functionality to support this step, which we will now explore. 

## 4.1 Concept Recap
Machine learning models rely on various parameters to make predictions. These parameters fall into two main categories:

### Model Parameters: 
These are learned directly from the data during training. Examples include the coefficients in linear regression or the split thresholds in a decision tree.

### Hyperparameters:
These are configurable settings chosen before training by the data scientist. They influence the training process and model behavior. Examples include the learning rate in gradient boosting, the number of trees in a Random Forest, or maximum tree depth in decision trees.
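The distinction can be illustrated with a small sketch. It assumes synthetic data from `make_classification` and a logistic regression with arbitrarily chosen settings. Hyperparameters are available as soon as the model object exists, whereas model parameters such as `coef_` only exist after calling `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption; not the HMEQ data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=888)

clf = LogisticRegression(C=0.5, max_iter=1000)

# Hyperparameters: chosen by us before training
hyperparams = clf.get_params()
print("Hyperparameters:", {k: hyperparams[k] for k in ['C', 'max_iter']})

# Model parameters: not available before training ...
has_coef_before = hasattr(clf, 'coef_')
print("Has coefficients before fit:", has_coef_before)

# ... and learned from the data during training
clf.fit(X_demo, y_demo)
print("Shape of learned coefficients:", clf.coef_.shape)
```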

Hyperparameter tuning (aka model selection) is the process of finding the optimal hyperparameters for a given model and data set. This is typically done using a search algorithm, such as grid search or random search. Recall our brief discussion of these methods in the lecture.
<br>
<img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/gridsearch.png" width="854" height="480" alt="Grid Search">
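Since this tutorial focuses on grid search, here is a brief sketch of the random search alternative mentioned above, using `RandomizedSearchCV` from `sklearn.model_selection`. The data are synthetic, and the sampling ranges, the number of sampled configurations (`n_iter=5`), and the number of folds are arbitrary assumptions chosen to keep the example fast.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption; not the HMEQ data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=888)

# Instead of fixed candidate values, we specify distributions to sample from
param_distributions = {
    'n_estimators': randint(20, 101),       # integers between 20 and 100
    'learning_rate': uniform(0.01, 0.19),   # floats between 0.01 and 0.20
}

random_search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(max_depth=3, random_state=888),
    param_distributions=param_distributions,
    n_iter=5,            # number of randomly sampled configurations
    cv=3,
    scoring='roc_auc',
    random_state=888,
)
random_search.fit(X_demo, y_demo)

print(f"Best CV AUC: {random_search.best_score_:.3f}")
print(f"Best parameters: {random_search.best_params_}")
```

Unlike grid search, the computational budget of random search is set directly through `n_iter` and does not grow with the number of hyperparameters considered.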


## 4.2 Programming Demo: Hyperparameter Tuning
A versatile implementation of grid search is available in the `GridSearchCV` class in `sklearn.model_selection`. It allows us to perform an exhaustive search over a specified parameter grid. The code snippet below demonstrates a solution to Exercise 2 using `GridSearchCV` to tune the hyperparameters of a gradient boosting classifier.


In [None]:
# Tune GBM hyperparameters with GridSearchCV
from sklearn.model_selection import GridSearchCV

# Set the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Perform grid search: this may take a while since we assess 3*3=9 models using 5-fold cross-validation for each
gb_model = GradientBoostingClassifier(max_depth=3, random_state=888)
grid_search = GridSearchCV(estimator=gb_model, param_grid=param_grid, cv=5, scoring='roc_auc', verbose=2)
grid_search.fit(X_train, y_train)

# Print performance statistic and configuration of the best model found during grid search
print(f'Best AUC: {grid_search.best_score_:.3f}')
print(f'Best parameters: {grid_search.best_params_}')



**Remarks:**
- When defining the parameter grid, you can specify multiple values for each hyperparameter. The grid search will then evaluate all possible combinations of these values. Note that the data type of the grid is a `dictionary`. It is crucial that you use the correct hyperparameter names as keys in the dictionary. These names must match the hyperparameter names of the model you are tuning.

- Beyond the learning algorithm and the corresponding parameter grid, `GridSearchCV` provides several other arguments to configure the search. Above, we used the `cv` argument to configure the number of cross-validation folds to use. A look into the comprehensive documentation illustrates all available options.

- Grid search is essentially an algorithm for optimization. We search for the *best* hyperparameters. This requires deciding on an objective. By default, `GridSearchCV` will use classification accuracy and $R^2$ for classification and regression models, respectively. To override this default and search for the hyperparameters yielding the highest AUC, we set the `scoring` argument to `roc_auc`. You can find a list of available metrics in the [sklearn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).
<br><br>
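To see which combinations a grid actually contains, you can enumerate them with `ParameterGrid` from `sklearn.model_selection`. The sketch below uses a dictionary named `demo_grid` (a name chosen for illustration) with the same candidate values as in the demo. As far as we can tell, the keys are processed in alphabetical order and the last key varies fastest, which matters whenever you rely on the order of the results.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as in the grid search demo; the name demo_grid is illustrative
demo_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
}

combos = list(ParameterGrid(demo_grid))
print(f"Number of combinations: {len(combos)}")

# Inspect the first few combinations to see the enumeration order
for combo in combos[:4]:
    print(combo)
```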

In case you seek more insights into the performance of different hyperparameters, you can extract a `dictionary` with detailed results as shown below.  

In [None]:
# Extract dictionary with results from grid search
results = grid_search.cv_results_

# Visualize performance grid as a heatmap
fig, ax = plt.subplots(figsize=(8, 6))
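# Pivot by parameter name so that rows (n_estimators) and columns (learning_rate)
# match the axis labels, independent of the order in which the grid was evaluated
scores = pd.DataFrame(results).pivot(index='param_n_estimators', columns='param_learning_rate', values='mean_test_score')
sns.heatmap(scores.astype(float), annot=True, fmt='.3f', ax=ax)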
plt.xlabel('Learning rate')
plt.ylabel('N estimators')
plt.title('Grid search performance')
plt.show()

In [None]:
# Lastly, we can extract the best model from the grid search and evaluate it on the test set
best_model = grid_search.best_estimator_
yhat = best_model.predict_proba(X_test)[:, 1]
print(f'Test AUC: {roc_auc_score(y_true=y_test, y_score=yhat):.3f}')

fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_estimator(estimator=grid_search.best_estimator_, X=X_test, y=y_test, ax=ax)
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red')
plt.show()

Note how grid search has facilitated a small further increase of the AUC compared to the untuned GBM classifier.

## 4.3 Exercise 3
In Exercise 2, you experimented with different hyperparameters for the GBM model. Now, you will do the same for the Random Forest model. More specifically,

- Identify three hyperparameters of the Random Forest model that you want to tune.
- Define a parameter grid with multiple values for each hyperparameter.
- Use `GridSearchCV` to find the hyperparameter values of a Random Forest classifier that maximize the **F1 score**.
- Report the performance of each hyperparameter combination and identify the best-performing combination. A simple print statement will suffice.
- Compare the performance of the tuned Random Forest model to the Random Forest model that we created in demo 2.2. Compare the two classifiers using ROC analysis and report the relative improvement in F1 due to hyperparameter tuning. 


In [None]:
from sklearn.metrics import f1_score
# Define the parameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [10, 100, 250],
    'max_depth': [3, 10, 20],
    'max_features': [1, 2, 4, 8]
}

# Perform grid search with RF classifier and F1 score as the evaluation metric (may take a while)
grid_search_rf = GridSearchCV(estimator=RandomForestClassifier(random_state=888), 
                              param_grid=param_grid_rf, 
                              cv=5, scoring='f1', verbose=2)
grid_search_rf.fit(X_train, y_train)



# Print the best parameters and the corresponding F1 score
print(f'Best F1 Score: {grid_search_rf.best_score_:.3f}')
print(f'Best parameters: {grid_search_rf.best_params_}')
rf_tuned = grid_search_rf.best_estimator_  # Extract the best model from the grid search

# Recreate Random Forest from 2.2 for comparison
rf_untuned = RandomForestClassifier( n_estimators=100,  # the number of base model trees
                                     max_depth=5,      # the maximum depth of each tree
                                     random_state=888  # Random number seed. Recall that RF is stochastic due to bootstrap sampling and random subspace
                                   )
rf_untuned.fit(X_train, y_train)

# Compare the performance of the tuned Random Forest model to the untuned Random Forest model using ROC analysis
fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_estimator(estimator=rf_tuned,   X=X_test, y=y_test, ax=ax, name="Tuned Random Forest")
RocCurveDisplay.from_estimator(estimator=rf_untuned, X=X_test, y=y_test, ax=ax, name="Untuned Random Forest")
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red')
plt.show()

# Compute the relative improvement in F1 score (using the untuned RF recreated above with the settings from Section 2.2)
f1_tuned = f1_score(  y_true=y_test, y_pred=rf_tuned.predict(X_test))
f1_untuned = f1_score(y_true=y_test, y_pred=rf_untuned.predict(X_test))
print(f'Test F1 Score with (without) tuning: {f1_tuned:.3f} ({f1_untuned:.3f})')

rel_improve = (f1_tuned - f1_untuned) / f1_untuned * 100
print(f'Relative improvement in F1 Score: {rel_improve:.2f}%')

In [None]:
# Report the performance of each hyperparameter combination
results_rf = grid_search_rf.cv_results_
for i in range(len(results_rf['params'])):
    print(f'Parameters: {results_rf["params"][i]}, F1 Score: {results_rf["mean_test_score"][i]:.3f}')