[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorial_notebooks/10_ensemble_learning_tasks.ipynb) 

# Tutorial 10 - Ensemble Learning 

In this tutorial, we revisit the ensemble learning lecture, which introduced *ensemble learners*, approaches that generate a composite forecast by integrating the predictions of multiple *base models*. In terms of concrete algorithms, we learned about Random Forest (RF) and Gradient Boosting (GBM).

This tutorial covers the following topics:

#### Table of Contents

1. [Foundations of Ensemble Modeling and Forecast Combination](#foundations)  
2. [Bagging and Random Forest](#bagging-rf)  
3. [Gradient Boosting](#gradient-boosting) 

Within each section, you find:
- A **Concept Recap** highlighting the key ideas from the lecture.
- A **Programming Demo** illustrating these ideas in Python.
- **Exercises** to deepen your understanding and practice these methods.



## Preliminaries

### Standard imports

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### The HMEQ data set
We continue using the "Home Equity" data set (HMEQ), which was introduced in [Tutorial 6](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorial_notebooks/6_classification_tasks.ipynb). That tutorial also provided a helper function, `prepare_hmeq_data`, to load and prepare the data. To avoid copying and pasting that (comprehensive) helper function into this and future tutorials, we moved it into a Python module, `bads_helper_functions.py`. The module is available in the [BADS GitHub repository](https://github.com/Humboldt-WI/bads). Below, we first import our custom module and then call the helper function from that module. For the code to run on your machine, you have to make sure that the **module is found by your Python interpreter**. While there are different ways to ensure this, the simplest is to save the module to the **same folder** in which you saved **this notebook**.
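If you prefer to keep the module in a different folder, one option is to add that folder to Python's module search path before the import. The snippet below is a minimal sketch; the folder name is a hypothetical placeholder that you would need to replace with the actual location on your machine.

```python
import sys
from pathlib import Path

# Hypothetical folder containing bads_helper_functions.py; adjust to your setup
module_dir = Path("path/to/folder/with/module")

# Add the folder to the module search path (only once)
if str(module_dir) not in sys.path:
    sys.path.append(str(module_dir))
```

After running this cell, `import bads_helper_functions as bads` should succeed, provided the folder really contains the module.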

In [None]:
import bads_helper_functions as bads  # import module with bads helper functions
# Load the data directly from GitHub
data_url = 'https://raw.githubusercontent.com/Humboldt-WI/bads/master/data/hmeq.csv'
hmeq = pd.read_csv(data_url)

X, y = bads.prepare_hmeq_data(hmeq)  
print("Data loaded. Shape of X: ", X.shape, "Shape of y:", y.shape)
X.info()

<a id="foundations"></a>
# 1. Foundations of Ensemble Modeling and Forecast Combination

## 1.1 Concept Recap
In ensemble modeling, multiple models (referred to as “base learners” or “weak learners”) are combined to produce a single predictive model. The motivation behind ensemble modeling is that by combining various learners, we often get better generalization performance than using a single learner. Ensemble models help to:

- Reduce variance (stabilize predictions).
- Potentially reduce bias (when certain conditions are met).
- Provide more robust and reliable predictions across different problem domains.

Techniques include:

- Averaging Methods: e.g., Bagging (Bootstrap Aggregating)
- Boosting Methods: e.g., Gradient Boosting
- Heterogeneous approaches, which integrate different learning algorithms to form the base models. Examples include the Stacking algorithm, which combines base models using a meta-model (see, e.g., [this tutorial for a demo](https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/) if interested). The lecture did not elaborate on heterogeneous ensemble learners; neither will this tutorial. We mention them mainly for the sake of completeness. 
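To make the variance-reduction argument more tangible, here is a small simulation. It is only a sketch under idealized assumptions: the "base models" are simulated as a hypothetical true value plus independent noise, and all numbers (true value, noise level, number of models) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(888)

n_models, n_cases = 10, 100_000
truth = 0.3        # hypothetical true value that every model tries to estimate
noise_sd = 0.2     # assumed standard deviation of each model's forecast error

# Each column holds the forecasts of one simulated base model:
# the true value plus independent, zero-mean noise
forecasts = truth + rng.normal(loc=0.0, scale=noise_sd, size=(n_cases, n_models))

single_var = forecasts[:, 0].var()            # variance of one base model
ensemble_var = forecasts.mean(axis=1).var()   # variance of the simple average

print(f"Variance of a single model:        {single_var:.4f}")
print(f"Variance of the average of {n_models} models: {ensemble_var:.4f}")
```

With independent errors, the variance of the average is roughly the single-model variance divided by the number of models. Real base models are trained on the same data and make correlated errors, so the gain is smaller in practice; this is why the correlation among base learners matters.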


## 1.2 Programming Demo: Simple Averaging Ensemble
Below is a simple demonstration of how one might combine three different models (a Logistic Regression and two Decision Trees) by averaging their probability estimates. This is actually a heterogeneous ensemble, as we combine different types of models. The example is meant to convey the generality of ensemble modeling, which the lecture illustrated as follows: <br>
<br>
<p align="left" class="alert">
  <img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/ensemble_learning.png" width="640" alt="Ensemble Learning">
</p>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

fix_seed = 888  # seed for random number generator

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25,
                                                    random_state=fix_seed)  # partition the data into 75% training and 25% test

# Define base models
base_learners = [
    LogisticRegression(max_iter=1000,    random_state=fix_seed),
    DecisionTreeClassifier(max_depth=3,  random_state=fix_seed),
    DecisionTreeClassifier(max_depth=10, random_state=fix_seed)    
]

# Train the models
for m in base_learners:
    m.fit(X_train, y_train)

# Predict probabilities
probs = []
for m in base_learners:
    yhat = m.predict_proba(X_test)[:, 1]
    probs.append(yhat)

# Simple average forecast 
ensemble_prob = np.mean(probs, axis=0)

# Performance evaluation
print('Performance evaluation in terms of AUC:')
print('-' * 100)
print(f'Ensemble \t {roc_auc_score(y_true=y_test, y_score=ensemble_prob):.3f}')
print('-' * 100)
for i, p in enumerate(probs):
    print(f'Base model {i} \t {roc_auc_score(y_true=y_test, y_score=p):.3f} \t ({base_learners[i]})')
print('-' * 100)


# Also report the pairwise correlation among the base learners
pd.DataFrame({"Logit": probs[0], "Shallow Tree": probs[1], "Deep Tree": probs[2]}).corr()


<a id="bagging-rf"></a>

# 2. Bagging and Random Forest

## 2.1 Concept Recap
*Bagging (Bootstrap Aggregating)*:

In bagging, we train multiple instances of the same model class on different bootstrap samples (randomly sampled with replacement) of the original training set.
The final prediction is usually made by majority voting (classification) or averaging (regression).
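
The bagging procedure can be sketched by hand in a few lines. The example below is an illustration rather than a reference implementation: it uses synthetic data from `make_classification` as a stand-in (not the HMEQ data), separate variable names so that it does not overwrite the notebook's `X_train` and `X_test`, and it averages predicted probabilities instead of taking a majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(888)

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=888)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=888)

n_models = 25
bag_probs = []
for _ in range(n_models):
    # Bootstrap sample: draw as many row indices as there are training rows, with replacement
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    tree = DecisionTreeClassifier(random_state=888).fit(Xtr[idx], ytr[idx])
    bag_probs.append(tree.predict_proba(Xte)[:, 1])

# Aggregate the base model forecasts by simple averaging
bagged_prob = np.mean(bag_probs, axis=0)

# Benchmark: one tree trained on the full training set
single_tree = DecisionTreeClassifier(random_state=888).fit(Xtr, ytr)
auc_single = roc_auc_score(yte, single_tree.predict_proba(Xte)[:, 1])
auc_bagged = roc_auc_score(yte, bagged_prob)

print(f"Single tree AUC:  {auc_single:.3f}")
print(f"Bagged trees AUC: {auc_bagged:.3f}")
```

Every tree sees a slightly different training set, which is the only source of diversity in plain bagging.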

*Random Forest (RF)*:

RF is an extension of bagging, which uses decision trees as base learners. RF also uses feature sub-sampling at each split to decorrelate individual trees. Much empirical evidence suggests that RF is an effective and robust algorithm for a wide range of prediction tasks. It is also relatively simple to tune.
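
One way to look at the decorrelation effect is to compare how strongly the individual trees of a forest agree with and without feature sub-sampling. The following sketch is an illustration under assumptions: it uses synthetic data (not HMEQ), measures agreement on the training data for simplicity, and `mean_tree_correlation` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, random_state=888)

def mean_tree_correlation(max_features):
    """Average pairwise correlation among the trees' probability predictions."""
    forest = RandomForestClassifier(n_estimators=30,
                                    max_features=max_features,
                                    min_samples_leaf=20,  # larger leaves give graded probability estimates
                                    random_state=888)
    forest.fit(X_demo, y_demo)
    # One row of predicted probabilities per tree in the forest
    tree_probs = np.array([tree.predict_proba(X_demo)[:, 1] for tree in forest.estimators_])
    corr = np.corrcoef(tree_probs)
    # Drop the diagonal (each tree correlates perfectly with itself)
    off_diagonal = corr[~np.eye(len(corr), dtype=bool)]
    return off_diagonal.mean()

corr_all = mean_tree_correlation(max_features=None)     # all features at each split (bagging only)
corr_sqrt = mean_tree_correlation(max_features="sqrt")  # random feature subset at each split

print(f"Mean correlation among trees, all features:   {corr_all:.3f}")
print(f"Mean correlation among trees, sqrt(features): {corr_sqrt:.3f}")
```

One would typically expect the trees to be less correlated when each split considers only a random subset of the features, although the size of the effect depends on the data.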

## 2.2 Programming Demo: Random Forest
To illustrate RF, we train a classification model, measure its AUC, and plot the ROC curve. To showcase the effectiveness of Random Forest, we add the results of the previous demo, our heterogeneous ensemble, to the ROC plot. Recall from the lecture that RF has several hyperparameters, which may deserve some tuning. For the demo, we use the default hyperparameter settings. 


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay

# Instantiate and train RandomForest
# Fix the random seed: RF is stochastic due to bootstrap sampling and random feature sub-sampling
rf_model = RandomForestClassifier(random_state=fix_seed)
rf_model.fit(X_train, y_train)

# Evaluate
rf_probs = rf_model.predict_proba(X_test)[:,1]
print(f"Random Forest AUC: {roc_auc_score(y_true=y_test, y_score=rf_probs):.3f}")


In [None]:
# Visualize RF performance and compare to benchmarks
f, ax = plt.subplots(figsize=(8,6))
RocCurveDisplay.from_estimator(estimator=rf_model, X=X_test, y=y_test, ax=ax)
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=ensemble_prob, ax=ax, name="Simple avg. ensemble")
for i, p in enumerate(probs):
    RocCurveDisplay.from_predictions(y_true=y_test, y_pred=p, ax=ax, name=f"Base model {i}")
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red')
plt.show()


## Exercise 1
Unlike RF, the Bagging algorithm is not specific to Decision Trees. It can be applied to any base learner.

Here is your task:<br>
- Train two Bagging classifiers, one using logistic regression and one using decision trees to train the base models. This allows for a comparison of which algorithm, logistic regression or decision tree, benefits more from bagging. An implementation of the Bagging algorithm is available in the `sklearn.ensemble` module. 
- For each Bagging classifier, use 50 base learners and set the random state to 888. 
  - For the logistic regression base learner, set `max_iter=1000`, and `random_state=888`
  - For the decision tree base learner, set `max_depth=3`
  - Note that these are the same settings as used in the above demo on the simple average ensemble. 
- Create an ROC chart to compare the models. Specifically:
  - Plot the ROC curve for the bagged logistic regression
  - Plot the ROC curve for the bagged decision tree
  - Plot the ROC curve for the non-bagged logistic regression trained in the first demo. To do this, simply reuse the already stored predictions `probs[0]`
  - Also add a ROC curve for the non-bagged decision tree trained in the first demo. Again, simply reuse the available predictions `probs[1]`
- Make sure all ROC curves are shown in one chart by setting the argument `ax` of the `RocCurveDisplay.from_estimator()` function, as is already illustrated in the above ROC chart. 
- Add a legend to the plot to distinguish the different models.
    

<details> <summary>Hint on bagging </summary> Use the following scaffolding to configure the Bagging classifier:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_model = BaggingClassifier(
    estimator=___,  # this argument was called base_estimator in scikit-learn versions before 1.2
    n_estimators=___, 
    random_state=___
)
bagging_model.fit(X_train, y_train)
# Evaluate your model

```	
</details>


In [None]:
# Solution to Exercise 1


Examining the result of Exercise 1, what can you say about the performance of the bagged logistic regression and the bagged decision tree compared to the non-bagged models?

More interestingly, what factors might explain the observed effectiveness of bagging (or lack thereof)?

**Your Answer:**


<a id="gradient-boosting"></a>

# 3. Gradient Boosting
## 3.1 Concept Recap
Boosting builds an ensemble of weak learners sequentially, where each new learner attempts to correct the errors of the previous ensemble. Gradient Boosting (GB) is a general boosting framework that offers a principled way to implement this *correction of errors* idea. Specifically, GB uses the negative gradient of the loss function with respect to the predictions of the current ensemble as the target variable for base model training. Note that, in the specific case of the least-squares loss function, $\frac{1}{2}(y - \hat{y})^2$, this negative gradient equals the model residual, $y - \hat{y}$. GB then fits base models to predict (and thereby reduce) the residuals of the present ensemble. For other loss functions, for example cross-entropy, the equivalence between negative gradients and residuals does not hold. Therefore, characterizing GB as a boosting approach that incrementally fits base models to the residuals of the current ensemble is helpful for building understanding, while describing GB as incrementally fitting base models to (negative) loss gradients is more general.

Either way, the GB algorithm incorporates a learning rate (shrinkage) hyperparameter that controls how strongly each new model influences the ensemble. A smaller learning rate means that each new model has a smaller impact, which can lead to better generalization but requires more base models (and thus more computation).
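
For the least-squares case, the residual-fitting view of GB can be sketched from scratch in a few lines. This is an illustration under assumptions, not the algorithm as implemented in any library: the regression data are synthetic, the base models are shallow regression trees, and the settings (`learning_rate = 0.1`, `n_rounds = 100`, `max_depth=3`) are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(888)

# Synthetic regression data (an assumption for illustration)
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=500)

learning_rate = 0.1   # shrinkage: how strongly each new tree influences the ensemble
n_rounds = 100        # number of base models

# Start from a constant forecast; the mean minimizes the squared loss
pred = np.full_like(y_demo, y_demo.mean())
trees = []
mse_path = [np.mean((y_demo - pred) ** 2)]

for _ in range(n_rounds):
    # Negative gradient of 0.5 * (y - pred)^2 with respect to pred, i.e., the residual
    residual = y_demo - pred
    # Fit the next base model to the residuals of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3, random_state=888).fit(X_demo, residual)
    # Add a shrunken version of the new tree's forecast to the ensemble
    pred += learning_rate * tree.predict(X_demo)
    trees.append(tree)
    mse_path.append(np.mean((y_demo - pred) ** 2))

print(f"Training MSE before boosting: {mse_path[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_path[-1]:.4f}")
```

Note that the sketch tracks the error on the training data only. A falling training error says nothing about generalization, which is why the learning rate and the number of base models are usually tuned on held-out data.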

Common GB implementations include:
- GradientBoostingClassifier (in sklearn.ensemble)
- XGBoost, LightGBM, CatBoost (popular specialized libraries)

## 3.2 Programming Demo: Gradient Boosting
The following code snippet demonstrates the training and evaluation of a GB classifier using the `GradientBoostingClassifier` implementation from `sklearn.ensemble`. While this implementation suffices for our purposes, it is worth noting that the specialized libraries like XGBoost, LightGBM, and CatBoost are preferred in practice due to their superior performance and scalability. Thus, when working on real-world projects or larger datasets, you should consider using one of these libraries.

In [None]:
# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier(n_estimators=100,
                                      learning_rate=0.1,
                                      max_depth=3,
                                      random_state=fix_seed)
gb_model.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_estimator(estimator=gb_model, X=X_test, y=y_test, ax=ax)
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red')
plt.show()



## Exercise 2
Previous demos and exercises did not elaborate on hyperparameters. Hyperparameter tuning is often useful when working with ensemble algorithms, particularly GB.

**Here is your task**:

Drawing on the GB demo, experiment with different settings for the hyperparameters `learning_rate` and `n_estimators`. Record the AUC values of different models with different hyperparameters to study how the hyperparameters affect performance.

One way to achieve this involves looping over different candidate settings for each hyperparameter and recording the corresponding AUC in each round. Alternatively, [tutorial 8]() on ML theory and practice introduced you to the `GridSearchCV` class from `sklearn.model_selection`, which facilitates hyperparameter tuning using grid search. Following a tuning run, the class provides detailed results from the candidate hyperparameter evaluations and can, therefore, also be used to solve this exercise. 

<details><summary>Hints on <code>GridSearchCV</code></summary> You can draw on the following scaffolding to configure the search grid for your GB classifier

```python
from sklearn.model_selection import GridSearchCV

# Set the hyperparameter grid
gb_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2]
}

grid_search = GridSearchCV(estimator=___, param_grid=gb_grid, scoring=___, cv=___)
grid_search.fit(___)

```	




</details>

In [None]:
# Solution to Exercise 2
