# Ensemble Methods

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgitlab.in2p3.fr%2Fenergy4climate%2Fpublic%2Feducation%2Fmachine_learning_for_climate_and_energy/master?filepath=book%2Fnotebooks%2F8_ensemble_methods.ipynb)

<div class="alert alert-block alert-warning">
    <b>Prerequisites</b>
    
- 
</div>

<div class="alert alert-block alert-info">
    <b>Learning Outcomes</b>

- Combine several models together;
- Understand the principles behind bootstrapping and boosting;
- Build intuition for specific models such as random forests and gradient boosting;
- Identify the important hyperparameters of random forests and gradient-boosted decision trees, as well as their typical values.
</div>

Ensemble methods:
- Bagging
- Boosting
- Random forest
- Voting / Stacking
- Bayesian methods for non-parametric regression

## Intuitions on tree-based methods

See [Scikit-learn course - Intuitions on tree-based methods](https://inria.github.io/scikit-learn-mooc/trees/slides.html).

## Intuitions on ensembles of tree-based methods



### The two tasks of ensemble learning

1. Develop a population of base learners from the training data;
2. Combine them to form a composite predictor.

## Ensemble methods based on bootstrapping (bagging)

- With cross-validation and bootstrapping (shuffled version), we have resampled the training data to assess the accuracy of a prediction or a parameter estimate.

- *Bagging* is the use of bootstrapping to improve the estimate or prediction itself.

### Bagging method

Given training data $\mathbf{Z} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$,

- Get $B$ bootstrap samples $\mathbf{Z}^{*b}, b = 1, \ldots, B$;
- Fit a model $\hat{f}^{*b}$ to each bootstrap sample;
- Given a new input $x$, generate predictions $\hat{f}^{*b}(x)$ from each model;
- Average the predictions to get the bagging prediction:

\begin{equation}
\hat{f}_\mathrm{bag}(x) = \frac{1}{B} \sum_{b = 1}^B \hat{f}^{*b}(x).
\end{equation}
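
The four steps above can be sketched in a few lines. This is a minimal illustration, not the notebook's own implementation: the helper `bagging_predict`, the toy dataset and the choice of an unpruned `DecisionTreeRegressor` as base model are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X, y, X_new, n_bootstrap=50, seed=0):
    """Average the predictions of trees fitted to bootstrap samples.

    Hypothetical helper written for this example."""
    rng = np.random.default_rng(seed)
    n_obs = X.shape[0]
    preds = np.empty((n_bootstrap, X_new.shape[0]))
    for b in range(n_bootstrap):
        # Bootstrap sample Z^{*b}: N indices drawn with replacement
        idx = rng.integers(0, n_obs, size=n_obs)
        # Fit a model f^{*b} to the bootstrap sample
        model = DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx])
        # Prediction f^{*b}(x) for the new inputs
        preds[b] = model.predict(X_new)
    # Bagging prediction: average over the B models
    return preds.mean(axis=0), preds

# Toy data (assumed for illustration, not the notebook's dataset)
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(40)
X_new = np.linspace(-3, 3, 5)[:, None]

f_bag, f_all = bagging_predict(X, y, X_new, n_bootstrap=25)
```

Here `f_all[b]` holds the prediction of the model fitted to the $b$-th bootstrap sample and `f_bag` is their average $\hat{f}_\mathrm{bag}(x)$.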

### Why bagging?

<div class="alert alert-block alert-info">
   Bagging reduces the variance.
</div>

- The base estimator $\hat{f}$ is too complex and overfits;
- Each bootstrap estimate $\hat{f}^{*b}$ also overfits;
- But averaging them reduces the variance of the prediction and thus its tendency to overfit.
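
To see why averaging helps, consider an idealized setting (an assumption, since bootstrap predictions are only approximately of this form): $B$ predictions that are identically distributed with variance $\sigma^2$ and positive pairwise correlation $\rho$. The variance of their average is then $\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$, so the second term vanishes as $B$ grows while the first one remains. The sketch below checks this formula numerically with simulated Gaussian "predictions"; the function name and the parameter values are chosen for illustration only.

```python
import numpy as np

def variance_of_average(sigma2, rho, n_models):
    """Variance of the mean of n_models equicorrelated predictions."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models

# Monte Carlo check with equicorrelated Gaussian "predictions"
rng = np.random.default_rng(0)
sigma2, rho, n_models, n_draws = 1.0, 0.3, 10, 200_000

# Covariance matrix: sigma2 on the diagonal, rho * sigma2 elsewhere
cov = sigma2 * (rho * np.ones((n_models, n_models))
                + (1.0 - rho) * np.eye(n_models))
draws = rng.multivariate_normal(np.zeros(n_models), cov, size=n_draws)

# Empirical variance of the average over the n_models predictions
empirical = draws.mean(axis=1).var()
theoretical = variance_of_average(sigma2, rho, n_models)
print(empirical, theoretical)
```

With these values the formula gives $0.3 + 0.7 / 10 = 0.37$, well below the variance $\sigma^2 = 1$ of a single prediction. The less correlated the base models, the larger the gain from averaging.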

### Bagging usage

- For regression / classification;
- The base learner can be any model, often a decision tree;
- The number of models $B$ is a hyperparameter controlling the regularization.
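
In scikit-learn, bagging is available through `BaggingRegressor` (and `BaggingClassifier` for classification). The following is a minimal sketch on a toy dataset assumed for the example; the hyperparameter values are arbitrary rather than recommended. The base learner is passed as the first positional argument because the name of the corresponding keyword differs between scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Toy data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0]**3 - 0.5 * (X[:, 0] + 1)**2 + 4.0 * rng.standard_normal(200)

bag = BaggingRegressor(
    DecisionTreeRegressor(),   # base learner, passed positionally
    n_estimators=100,          # number of models B
    max_samples=1.0,           # fraction of N drawn per bootstrap sample
    bootstrap=True,            # draw observations with replacement
    oob_score=True,            # score on the out-of-bag observations
    random_state=0,
)
bag.fit(X, y)

# R^2 estimated from the observations left out of each bootstrap sample
print(bag.oob_score_)
y_pred = bag.predict(np.array([[0.0], [2.0]]))
```

The out-of-bag score gives an estimate of the generalization performance without a separate validation set, which can help when choosing `n_estimators`.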

### How to generate the bootstrap samples?

Typically:
- Randomly draw $N$ observations from $\mathbf{Z}$ with replacement. To do so:
  - Use a random number generator to draw $N$ integers from $1$ to $N$. The same integer may be drawn multiple times (replacement);
  - Take these integers as indices to select input-output pairs in $\mathbf{Z}$;
- Repeat $B$ times.
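
The index-drawing step can be sketched with NumPy as follows. The values of $N$ and $B$ are arbitrary, and indices run from $0$ to $N - 1$ as is usual in Python. Because of the replacement, a bootstrap sample contains on average a fraction $1 - (1 - 1/N)^N$ of the distinct observations, which is close to 63% for large $N$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_bootstrap = 1000, 5   # N and B (arbitrary values)

# Each row holds the N indices (0-based) of one bootstrap sample;
# the same index may appear several times in a row (replacement)
indices = rng.integers(0, n_obs, size=(n_bootstrap, n_obs))

# Fraction of distinct observations in each bootstrap sample
unique_fraction = np.array(
    [np.unique(row).size / n_obs for row in indices])
print(unique_fraction)
```

The observations that are not selected in a given bootstrap sample are called out-of-bag observations for that sample.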

However, a smaller number of observations may also be drawn, with or without replacement.

This is illustrated below, where `max_samples` is the number of observations per bootstrap sample and `replace` controls whether observations are drawn with replacement.

In [6]:
# Numerical analysis modules
import numpy as np
import pandas as pd
# Plot modules
import matplotlib.pyplot as plt
import hvplot.pandas
import panel as pn
pn.extension()
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']

# Number of training samples
N = 20
    
# Noise standard deviation
sigma = 4.

# Initialize random number generator
rng = np.random.RandomState(2)

# Plot configuration
xlabel = 'Input'
ylabel = 'Target'
xlim = 3 * np.array([-1, 1])
ylim = np.array([-3.1, 2.])

def generate_data(n_samples=N):
    """Generate synthetic dataset. Returns `input_train`, `input_test`,
    `target_train`."""
    x_min, x_max = xlim
    x = rng.uniform(x_min, x_max, size=n_samples)
    noise = sigma * rng.randn(n_samples)
    y = x**3 - 0.5 * (x+1)**2 + noise
    y /= y.std()

    input_train = pd.Series(x, name=xlabel)
    input_test = pd.Series(
        np.linspace(x_max, x_min, num=300), name=xlabel)
    target_train = pd.DataFrame(y, index=x, columns=[ylabel])
    target_train.index.name = input_train.name

    return input_train, input_test, target_train

# Generate training set
input_train, input_test, target_train = generate_data(n_samples=N)

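def get_bootstrap_sample(max_samples=N, replace=True):
    """Draw `max_samples` training pairs, with or without replacement."""
    # Get random positional indices into the training set
    bootstrap_indices = rng.choice(
        len(input_train), size=max_samples, replace=replace)
    
    # Generate bootstrap sample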
    input_b = input_train.iloc[bootstrap_indices]
    target_b = target_train.iloc[bootstrap_indices]
    
    return input_b, target_b

def plot_bootstrap_sample(b, max_samples=N, replace=True):
    """Plot a new bootstrap sample over the training set. The argument `b`
    is not used in the body: it lets the button trigger a redraw."""
    input_b, target_b = get_bootstrap_sample(max_samples, replace)

    # Plot with hvplot
    return (target_train.hvplot.scatter(size=75, label='Training set') *
            target_b.hvplot.scatter(size=25, label='Bootstrap sample',
                                    xlim=xlim, ylim=ylim))
    
button = pn.widgets.Button(name='Resample bootstrap', button_type='primary')

In [7]:
pn.interact(plot_bootstrap_sample, b=button, max_samples=np.arange(N) + 1)

In [9]:
# Machine-learning modules
from sklearn import linear_model, tree, preprocessing, pipeline

# Make these estimators available
linear_reg = linear_model.LinearRegression()
poly_reg = pipeline.make_pipeline(
    preprocessing.PolynomialFeatures(degree=4), linear_reg)
spline_reg = pipeline.make_pipeline(
    preprocessing.SplineTransformer(), linear_reg)
tree_reg = tree.DecisionTreeRegressor(max_depth=3, random_state=0)

def get_bootstrap_predictions(
    base_estimator, n_estimators, max_samples, replace):
    # Get bootstrap estimates and predictions
    X_pred = input_test.values[:, None]
    y_preds = np.empty((n_estimators, X_pred.shape[0]))
    for b in range(n_estimators):
        input_b, target_b = get_bootstrap_sample(max_samples, replace)
        X_b = input_b.values[:, None]
        y_b = target_b[ylabel].values
        base_estimator.fit(X_b, y_b)
        y_preds[b] = base_estimator.predict(X_pred)
        
    return y_preds
        
def plot_bag(b, base_estimator=linear_reg, n_estimators=5,
             max_samples=N, replace=True, plot_bootstrap_predictions=True):
    y_preds = get_bootstrap_predictions(
        base_estimator, n_estimators, max_samples, replace)
    
    # Get bagged prediction
    target_pred_bag = pd.DataFrame(
        y_preds.mean(0), index=input_test, columns=[ylabel])
    
    p = (target_train.hvplot.scatter(size=75, label='Training set') *
            target_pred_bag.hvplot(label='Bagging prediction',
                                 xlim=xlim, ylim=ylim))
    
    # Add bootstrap predictions
    if plot_bootstrap_predictions:
        for i in range(n_estimators):
            target_pred_b = pd.DataFrame(
                y_preds[i], index=input_test, columns=[ylabel])
            label = 'Bootstrap prediction {}'.format(i)
            p *= target_pred_b.hvplot(line_dash='dashed', label=label)
        
    return p

button_multi = pn.widgets.Button(
    name='Resample all bootstraps', button_type='primary', width=700)
base_estimators = [linear_reg, poly_reg, spline_reg, tree_reg]
pn.interact(plot_bag, b=button_multi, base_estimator=base_estimators,
            n_estimators=np.arange(N * 2) + 1, max_samples=np.arange(N) + 1)

## Ensemble methods based on boosting

## To go further

- Bayesian models (Bishop 2006);
- Relationship between bootstrap, maximum likelihood and Bayesian methods (Chap. 8 in Hastie *et al.* 2009).

## References

- Chap. 8-10, 15 and 16 in [Hastie, T., Tibshirani, R., Friedman, J., 2009. *The Elements of Statistical Learning*, 2nd ed. Springer, New York.](https://doi.org/10.1007/978-0-387-84858-7)
- [Bishop, C., 2006. *Pattern Recognition and Machine Learning*, Information Science and Statistics. Springer-Verlag, New York.](https://www.cs.uoi.gr/~arly/courses/ml/tmp/Bishop_book.pdf)
- [Du, P., 2019. Ensemble Machine Learning-Based Wind Forecasting to Combine NWP Output With Data From Weather Station. *IEEE Transactions on Sustainable Energy* 10, 2133-2141.](https://doi.org/10/gnbgvj)

***
## Credit

[//]: # "This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education)."
Contributors include Bruno Deremble and Alexis Tantet.
Several slides and images are taken from the very good [Scikit-learn course](https://inria.github.io/scikit-learn-mooc/).

<br>

<div style="display: flex; height: 70px">
    
<img alt="Logo LMD" src="images/logos/logo_lmd.jpg" style="display: inline-block"/>

<img alt="Logo IPSL" src="images/logos/logo_ipsl.png" style="display: inline-block"/>

<img alt="Logo E4C" src="images/logos/logo_e4c_final.png" style="display: inline-block"/>

<img alt="Logo EP" src="images/logos/logo_ep.png" style="display: inline-block"/>

<img alt="Logo SU" src="images/logos/logo_su.png" style="display: inline-block"/>

<img alt="Logo ENS" src="images/logos/logo_ens.jpg" style="display: inline-block"/>

<img alt="Logo CNRS" src="images/logos/logo_cnrs.png" style="display: inline-block"/>
    
</div>

<hr>

<div style="display: flex">
    <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0; margin-right: 10px" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a>
    <br>This work is licensed under a &nbsp; <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div>