[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorial_notebooks/9_ensemble_learning_tasks.ipynb) 

# Tutorial 9 - Ensemble Learning 

In this tutorial, we revisit our lecture on ensemble learning, which introduced two new algorithms, Random Forest (RF) and Gradient Boosting (GBM), and, more generally, the space of *ensemble learners*: models that generate a composite forecast by integrating the predictions of multiple *base models*. We will cover the following topics:

#### Contents

1. [Foundations of Ensemble Modeling and Forecast Combination](#foundations)  
2. [Bagging and Random Forest](#bagging-rf)  
3. [Gradient Boosting](#gradient-boosting) 

Within each section, you will find:
- A **Concept Recap** highlighting the key ideas from the lecture.
- A **Programming Demo** illustrating these ideas in Python.
- **Exercises** to deepen your understanding and practice these methods.



## Preliminaries

### Standard imports

In [1]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### The HMEQ data set
We continue using the "Home Equity" data set (HMEQ), which we load and prepare using the helper function `get_HMEQ_credit_data` from our course module `bads_helper_functions.py`.

In [None]:
import bads_helper_functions as bads  # import module with bads helper functions
X, y = bads.get_HMEQ_credit_data()  # load the data 

print("Data loaded. Shape of X: ", X.shape, "Shape of y:", y.shape)

In [None]:
X  # preview the data   

<a id="foundations"></a>
# 1. Foundations of Ensemble Modeling and Forecast Combination

## 1.1 Concept Recap
In ensemble modeling, multiple models (referred to as “base learners” or “weak learners”) are combined to produce a single predictive model. The motivation behind ensemble modeling is that by combining various learners, we often get better generalization performance than using a single learner. Ensemble models help to:

- Reduce variance (stabilize predictions).
- Potentially reduce bias (when certain conditions are met).
- Provide more robust and reliable predictions across different problem domains.
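
The variance-reduction argument can be illustrated with a small simulation. The sketch below is not from the lecture; it assumes that every base model predicts the same quantity with an independent, equally noisy error, in which case averaging $M$ forecasts reduces the variance by a factor of $1/M$. All names and values (`n_models`, `true_value`, the noise level) are hypothetical. In practice, base model errors are correlated, so the reduction is usually smaller.

```python
import numpy as np

rng = np.random.default_rng(888)

n_models = 25      # hypothetical number of base models
n_trials = 10_000  # number of simulated forecasts per model
true_value = 0.3   # hypothetical quantity that every model tries to predict

# Assumption: forecast = truth + independent noise with standard deviation 0.1
forecasts = true_value + rng.normal(loc=0.0, scale=0.1, size=(n_trials, n_models))

single_var = forecasts[:, 0].var()           # variance of one base model
ensemble_var = forecasts.mean(axis=1).var()  # variance of the simple average

print(f"Variance of a single model: {single_var:.5f}")
print(f"Variance of the average:    {ensemble_var:.5f}")
print(f"Ratio (theory: 1/{n_models} = {1 / n_models:.3f}): {ensemble_var / single_var:.3f}")
```
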

Techniques include:

- Averaging Methods: e.g., Bagging (Bootstrap Aggregating)
- Boosting Methods: e.g., Gradient Boosting
- Heterogeneous approaches, which integrate different learning algorithms to form the base models. Examples include the Stacking algorithm, which combines base models using a meta-model (see, e.g., [this tutorial for a demo](https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/) if interested). The lecture did not elaborate on heterogeneous ensemble learners; neither will this tutorial. We mainly mention them for the sake of completeness. 


## 1.2 Programming Demo: Simple Averaging Ensemble
Below is a simple demonstration of how one might combine three different models (a Logistic Regression and two Decision Trees) by averaging their probability estimates. This is actually a heterogeneous ensemble, as we combine different types of models. The example is meant to convey the generality of ensemble modeling, which the lecture illustrated as follows: <br>
<br>
<img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/ensemble_learning.png" width="854" height="480" alt="Ensemble Learning">

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=888)  # partition the data into 75% training and 25% test

# Define base models
base_learners = [
    LogisticRegression(max_iter=1000, random_state=888),
    DecisionTreeClassifier(max_depth=3),
    DecisionTreeClassifier(max_depth=10)
]

# Train the models
for m in base_learners:
    m.fit(X_train, y_train)

# Predict probabilities
probs = []
for m in base_learners:
    yhat = m.predict_proba(X_test)[:, 1]
    probs.append(yhat)

# Simple average forecast 
ensemble_prob = np.mean(probs, axis=0)

# Performance evaluation
print('Performance evaluation in terms of AUC:')
print('-' * 50)
print(f'Ensemble \t {roc_auc_score(y_true=y_test, y_score=ensemble_prob):.3f}')
print('-' * 50)
for i, p in enumerate(probs):
    print(f'Base model {i} \t {roc_auc_score(y_true=y_test, y_score=p):.3f}')
print('-' * 50)


<a id="bagging-rf"></a>

# 2. Bagging and Random Forest

## 2.1 Concept Recap
Bagging (Bootstrap Aggregating):

In bagging, we train multiple instances of the same model class on different bootstrap samples (randomly sampled with replacement) of the original training set.
The final prediction is usually made by majority voting (classification) or averaging (regression).
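
To make the mechanics concrete, here is a minimal from-scratch sketch of bagging. It runs on synthetic stand-in data from `make_classification` (an assumption, so that the snippet runs on its own; it does not use the HMEQ data), and names such as `Xd_train` and `bag_probs` are chosen for illustration only. The sketch shows two ways to aggregate: averaging predicted probabilities and majority voting over class labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the HMEQ data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=5, random_state=888)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=888)

rng = np.random.default_rng(888)
n_models = 25                # odd number, so that majority votes cannot tie
n_train = Xd_train.shape[0]
bag_probs = []

for _ in range(n_models):
    # Bootstrap sample: draw n_train row indices with replacement
    idx = rng.integers(0, n_train, size=n_train)
    tree = DecisionTreeClassifier(max_depth=5, random_state=888)
    tree.fit(Xd_train[idx], yd_train[idx])
    bag_probs.append(tree.predict_proba(Xd_test)[:, 1])

# Aggregation 1: average the predicted probabilities
bagged_prob = np.mean(bag_probs, axis=0)

# Aggregation 2: majority vote over the class labels of the individual trees
tree_labels = np.array(bag_probs) >= 0.5
majority_label = (tree_labels.mean(axis=0) > 0.5).astype(int)

print(f"AUC of first single tree: {roc_auc_score(yd_test, bag_probs[0]):.3f}")
print(f"AUC of bagged trees:      {roc_auc_score(yd_test, bagged_prob):.3f}")
```

The only source of diversity among the trees here is the bootstrap sample, since every tree uses the same settings.
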

Random Forest:

Random Forest is an extension of bagging, which uses Decision Trees as base learners. Random Forest also uses feature sub-sampling at each split to decorrelate individual trees. Much empirical evidence suggests that Random Forest is a very effective and robust algorithm for a wide range of prediction tasks. It is also relatively simple to tune.
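
One way to see the decorrelation effect is to compare the average pairwise correlation between the predictions of individual trees with and without feature sub-sampling. The sketch below is an illustration on synthetic data (an assumption; not the HMEQ data), and `mean_tree_correlation` is a hypothetical helper written for this example. Setting `max_features=None` lets every split consider all features, which corresponds to plain bagging of trees, whereas `max_features="sqrt"` restricts each split to a random subset of features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption; not the HMEQ data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, random_state=888)


def mean_tree_correlation(max_features):
    """Hypothetical helper: average pairwise correlation of tree predictions."""
    forest = RandomForestClassifier(n_estimators=30, max_depth=3,
                                    max_features=max_features, random_state=888)
    forest.fit(X_demo, y_demo)
    # One row of predicted probabilities per tree in the forest
    tree_probs = np.array([tree.predict_proba(X_demo)[:, 1]
                           for tree in forest.estimators_])
    corr = np.corrcoef(tree_probs)
    # Average over the upper triangle, i.e., all distinct pairs of trees
    return corr[np.triu_indices_from(corr, k=1)].mean()


corr_bagging = mean_tree_correlation(max_features=None)     # all features per split
corr_subspace = mean_tree_correlation(max_features="sqrt")  # random feature subset per split

print(f"Mean tree correlation without feature sub-sampling: {corr_bagging:.3f}")
print(f"Mean tree correlation with feature sub-sampling:    {corr_subspace:.3f}")
```

A lower correlation means the trees make more diverse errors, which is what makes averaging them effective.
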

## 2.2 Programming Demo: Random Forest
Below, we train a simple Random Forest, measure its AUC and plot the ROC curve. To showcase the effectiveness of Random Forest, we add the results of the previous demo, our heterogeneous ensemble, to the ROC plot. 


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay

# Instantiate and train RandomForest
rf_model = RandomForestClassifier(n_estimators=100,  # the number of base model trees
                                   max_depth=5,      # the maximum depth of each tree
                                   random_state=888  # Random number seed. Recall that RF is stochastic due to bootstrap sampling and random subspace
                                   )
rf_model.fit(X_train, y_train)

# Evaluate
rf_probs = rf_model.predict_proba(X_test)[:,1]
print(f"Random Forest AUC: {roc_auc_score(y_true=y_test, y_score=rf_probs):.3f}")


In [None]:
# Visualize RF performance and compare to benchmarks
f, ax = plt.subplots(figsize=(8,6))
RocCurveDisplay.from_estimator(estimator=rf_model, X=X_test, y=y_test, ax=ax)
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=ensemble_prob, ax=ax, name="Simple avg. ensemble")
for i, p in enumerate(probs):
    RocCurveDisplay.from_predictions(y_true=y_test, y_pred=p, ax=ax, name=f"Base model {i}")
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red')
plt.show()


## Exercise 1
Unlike RF, the Bagging algorithm is not specific to Decision Trees. It can be applied to any base learner.

Here is your task:<br>
- Train two Bagging classifiers, one using logistic regression and one using decision trees to train the base models. This allows for a comparison of which algorithm, logistic regression or decision tree, benefits more from bagging. An implementation of the Bagging algorithm is available in the `sklearn.ensemble` module. 
- For each Bagging classifier, use 50 base learners and set the random state to 888. 
  - For the logistic regression base learner, set `max_iter=1000`, and `random_state=888`
  - For the decision tree base learner, set `max_depth=3`
  - Note that these are the same settings as used in our first demo on the simple average ensemble. 
- Create an ROC chart to compare the models. Specifically:
  - Plot the ROC curve for the bagged logistic regression
  - Plot the ROC curve for the bagged decision tree
  - Plot the ROC curve for the non-bagged logistic regression trained in the first demo. To do this, simply reuse the already stored predictions `probs[0]`
  - Also add an ROC curve for the non-bagged decision tree with `max_depth=3` trained in the first demo. Again, simply reuse the available predictions `probs[1]`
- Add a legend to the plot to distinguish the different models.
    

<details> <summary>Hint on bagging </summary> Use the following scaffolding to configure the Bagging classifier:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=___),  # this argument is called base_estimator in scikit-learn versions before 1.2
    n_estimators=___, 
    random_state=___
)
bagging_model.fit(X_train, y_train)
# Evaluate your model
```	
</details>

<details> <summary>Hint on the ROC curve </summary> Note that the previous ROC curve example illustrates how to draw curves for the non-bagged models. Assuming you have executed previous code cells, you can reuse the stored predictions `probs[0]` and `probs[1]` to plot the ROC curves for the non-bagged models as follows:

```python
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=probs[0], ax=ax, name="Logistic regression")
```	
</details>

In [7]:
# Solution to Exercise 1

Examining the result of Exercise 1, what can you say about the performance of the bagged logistic regression and the bagged decision tree compared to the non-bagged models?

More interestingly, what factors might explain the observed effectiveness of bagging (or lack thereof)?

**Answer:**


<a id="gradient-boosting"></a>

# 3. Gradient Boosting
## 3.1 Concept Recap
Boosting builds an ensemble of weak learners in a sequential way, where each new learner attempts to correct the errors of the previous ensemble. Gradient Boosting is a general boosting framework:

- Models are added sequentially.
- Each subsequent model is trained to reduce the residual errors (or gradient of the loss) of the current ensemble.
- Learning rate (shrinkage) controls how strongly each new model influences the ensemble.
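
The three points above can be sketched from scratch in a few lines. To keep the example short, it uses a regression task with squared error loss, for which the negative gradient equals the residuals; the classification demo further below relies on a different loss function. The data is synthetic and all names (`X_demo`, `mse_history`, etc.) are hypothetical. This is a simplified illustration, not how `sklearn` implements gradient boosting internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption, for illustration only)
rng = np.random.default_rng(888)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1  # shrinkage: how strongly each new tree influences the ensemble
n_stages = 50        # number of trees added sequentially

# Start from a constant prediction (the mean of the target)
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_stages):
    # For squared error loss, the negative gradient is the residual
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=888)
    tree.fit(X_demo, residuals)
    # Add the new tree's shrunken prediction to the current ensemble
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE before boosting:    {mse_history[0]:.4f}")
print(f"Training MSE after {n_stages} stages:   {mse_history[-1]:.4f}")
```

Note that the training error shown here always decreases as trees are added; it says nothing about performance on new data, which is why test set evaluation and the choice of `learning_rate` and the number of trees matter.
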

Common implementations:
- GradientBoostingClassifier (in sklearn.ensemble)
- XGBoost, LightGBM, CatBoost (popular specialized libraries)

## 3.2 Programming Demo: Gradient Boosting
The code snippet below demonstrates training and evaluation of a gradient boosting classifier using the `GradientBoostingClassifier` implementation from `sklearn.ensemble`. While this implementation suffices for our purposes, it is worth noting that specialized libraries like XGBoost, LightGBM, and CatBoost are often preferred in practice because they typically train faster and scale better to large data sets. Thus, when working on real-world projects and with larger data sets, you should consider using one of these libraries.

In [None]:
# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier(n_estimators=100,
                                      learning_rate=0.1,
                                      max_depth=3,
                                      random_state=42)
gb_model.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_estimator(estimator=gb_model, X=X_test, y=y_test, ax=ax)
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red')
plt.show()



## Exercise 2
All previous demos and exercises have used fixed, untuned hyperparameter values for the ensemble models, many of which coincide with the library defaults. However, hyperparameter tuning is often crucial for achieving good performance.

Here is your task:
Drawing on the above demo of GBM, experiment with different settings for the hyperparameters `learning_rate` and `n_estimators`. Record the AUC values of different models with different hyperparameters to study how the hyperparameters affect performance.

<details> <summary>Hint</summary> 
One way to approach Exercise 2 is to loop over different candidate settings for each hyperparameter and record the AUC in each round. For example, you could:<br><br>
1. Loop over a small set of `learning_rate` values, e.g. `[0.01, 0.1, 0.2]`. <br>
2. Loop over a small set of `n_estimators` values, e.g. `[50, 100, 200]`. <br>
3. Measure the AUC of the corresponding GBM model using `roc_auc_score()`. <br>
4. Keep track of the highest AUC value obtained thus far and the corresponding hyperparameters. <br>
  - To do this, you could use a variable `best_auc`, which you compare to the AUC obtained in the current iteration.<br>
  - Any time you observe the current AUC to be higher than the best AUC, update the best AUC. <br>
  - You can proceed in the same way with the hyperparameters.<br>
</details>


In [9]:
# Solution to Exercise 2