# Lecture 6.3: Ensemble Methods

In this lecture, we are going to train and compare a random forest and an AdaBoost model on a real dataset.

**Learning goals:**
- train a random forest classifier
- train an adaboost classifier
- visualize and compare the model decision boundaries
- analyse the effect of regularization parameters
- train a random forest regressor

## 1. Introduction

Let's try to improve our fake banknote detector from lecture 5.3. 🕵️‍♀️ We'll use the same [banknote authentication dataset](https://archive.ics.uci.edu/ml/datasets/banknote+authentication), and try to solve the fake/genuine classification task.

## 2. Classification

### 2.1 Data Munging

Let's load our `.csv` into a pandas `DataFrame`, and have a look at the dataset:

In [None]:
import pandas as pd

df = pd.read_csv('bank_note.csv')

df.head()

In [None]:
df.describe()

Recall that we are dealing with 4 features, and one binary label. The features are standardized, so no further preprocessing is necessary.

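We can create our feature matrix, `X`, and our label vector, `y`. We keep only `feature_2` and `feature_4`, so that the dataset and the decision boundaries can be plotted in two dimensions: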

In [None]:
X = df[['feature_2', 'feature_4']].values
y = df['is_fake'].values

And we can visualize the dataset to remember the complexity of the classification task:

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5,5), dpi=120)
ax = fig.add_subplot()

scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k', alpha=0.5)
ax.set_xlabel('feature_2')
ax.set_ylabel('feature_4')
ax.set_title('Banknote Classification')
handles, labels = scatter.legend_elements()
ax.legend(handles=handles, labels=['genuine', 'fake']);

The most important aspects to notice are:
* the data is _not separable_
* the relationship between `feature_2` and `feature_4` is _non-linear_

Just like last lecture, this should be a good test for our ensemble models!

### 2.2 Training


#### 2.2.1 Random Forests

`sklearn` separates random forest models for classification and regression. For this binary classification task, let's train a `RandomForestClassifier`:

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=0)
forest_clf = forest_clf.fit(X, y)

🧠 Can you list all the steps that sklearn had to go through to train this random forest? Take your time, there are a lot of things going on in that `.fit()` function!

Recall that random forests are an _ensemble_ of decision trees, and we can retrieve each tree with the `.estimators_` field:

In [None]:
print(f'This random forest is an ensemble of {len(forest_clf.estimators_)} decision trees')

In [None]:
forest_clf.estimators_[5]

Since we don't have the space to visualize 100 decision tree flow charts, let's directly plot the random forest's decision boundary with our helper functions:

In [None]:
import numpy as np
from matplotlib.lines import Line2D

def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy

def plot_decision_boundary(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    plot_decision_boundary(ax, clf, xx, yy, **params)


def plot_classification(ax, X, y, clf):
    X0, X1 = X[:, 0], X[:, 1]
    xx, yy = make_meshgrid(X0, X1)
    plot_contours(ax, clf, xx, yy,
                      cmap=plt.cm.coolwarm, alpha=0.8)
    scatter = ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k', alpha=1.0)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xlabel('x1')
    ax.set_ylabel('x2')
    ax.set_title('Bank Notes Classification')
    handles, labels = scatter.legend_elements()
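    ax.legend(handles=handles, labels=['genuine', 'fake'])

def compare_classification(X, y, clfs, titles):
    # plot the decision boundaries of several classifiers side by side
    # (same layout as compare_regression() in section 3)
    fig = plt.figure(figsize=(14, 4), dpi=100)
    for i, clf in enumerate(clfs):
        ax = fig.add_subplot(1, len(clfs), i+1)
        plot_classification(ax, X, y, clf)
        ax.set_title(titles[i])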

In [None]:
fig = plt.figure(figsize=(5,5), dpi=120)
ax = fig.add_subplot()
plot_classification(ax, X, y, forest_clf)

The decision boundary is considerably different from the single decision tree. It is still only made of vertical and horizontal lines, but this time it is far more _detailed_. This is because these are the combined lines of 100 decision boundaries.

Recall that these predictions are made by majority voting. For each point in this graph, the "votes" of the 100 decision trees are tallied:
- points with more `fake` votes than `genuine` votes are shown in red
- points with more `genuine` votes than `fake` votes are shown in blue
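
The voting step can be reproduced by hand. The sketch below is only an illustration: it uses a synthetic `make_moons` dataset (`X_demo`, `y_demo` are stand-in names, not the banknote data) and tallies the predictions of each tree in `.estimators_`. Note that sklearn's `RandomForestClassifier` actually averages the class probabilities predicted by each tree rather than counting hard votes; with fully grown trees, as here, the two approaches should give (almost) the same answer.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the banknote features (assumption, not the lecture data)
X_demo, y_demo = make_moons(n_samples=300, noise=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Each row holds the 0/1 predictions of one tree for every sample
tree_votes = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# Fraction of trees voting for class 1, then a simple majority vote
vote_share = tree_votes.mean(axis=0)
majority_vote = (vote_share > 0.5).astype(int)

# Compare the manual vote with the forest's own prediction
agreement = (majority_vote == forest.predict(X_demo)).mean()
print(f'Manual majority vote agrees with forest.predict on {agreement:.1%} of samples')
```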

#### 2.2.2 AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ab_clf = AdaBoostClassifier(random_state=0)
ab_clf = ab_clf.fit(X, y)

🧠 Can you list all the steps that sklearn had to go through to train this boosted ensemble? Take your time, there are a lot of things going on in that `.fit()` function!

Just like random forests, we can retrieve the different base models from the ensemble:

In [None]:
print(f'This AdaBoost model is an ensemble of {len(ab_clf.estimators_)} decision trees')

In [None]:
ab_clf.estimators_[5]

Notice how this estimator has `max_depth=1`: it is a "decision stump", i.e. a very underfit model which will shine when boosted! ✨

In fact we can visualise the decision boundary of a few of our decision stumps:

In [None]:
n_estimators_values = [0, 1, 2, 3]
decision_stumps = ab_clf.estimators_[:4]
titles = [f'estimator #{n}' for n in n_estimators_values]

compare_classification(X, y, decision_stumps, titles)

Hard to imagine how this could combine into an accurate non-linear model... let's visualise the decision boundary of our boosted ensemble:

In [None]:
fig = plt.figure(figsize=(5,5), dpi=120)
ax = fig.add_subplot()
plot_classification(ax, X, y, ab_clf)

Notice that the decision boundary is very much _not_ linear! By combining the decision stumps with different weights, the boosted ensemble effectively lets them "focus" on different areas of the dataset. And the result is decent 💪
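
To see the stumps adding up, one option is to track the accuracy of the ensemble after each boosting round. This is a minimal sketch on a synthetic `make_moons` dataset (`X_demo` and `y_demo` are stand-in names, not the banknote data), using the `.staged_score()` method of `AdaBoostClassifier`:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for the banknote features (assumption, not the lecture data)
X_demo, y_demo = make_moons(n_samples=300, noise=0.25, random_state=0)

ab = AdaBoostClassifier(random_state=0).fit(X_demo, y_demo)

# staged_score yields the training accuracy after each boosting round,
# i.e. using only the first 1, 2, 3, ... stumps of the ensemble
staged_accuracy = list(ab.staged_score(X_demo, y_demo))

n_total = len(staged_accuracy)
for n_rounds in sorted({1, min(5, n_total), n_total}):
    print(f'{n_rounds:>3} stump(s): training accuracy = {staged_accuracy[n_rounds - 1]:.3f}')
```

A single stump can only draw one straight line, so its accuracy is limited; the later entries of `staged_accuracy` show how much the weighted combination adds.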

### 2.3 Prediction

Let's test our models by asking them to predict a banknote in the small `genuine` cluster on the left-hand side of the graphs above. We'll use $feature\_2 = -1; feature\_4 = 0$:

In [None]:
x_predict = np.array([-1, 0]).reshape(1, 2)
print(f'Features: {x_predict}')

forest_clf_prediction = forest_clf.predict(x_predict)
print(f'Random Forest prediction: {forest_clf_prediction}')


ab_clf_prediction = ab_clf.predict(x_predict)
print(f'AdaBoost prediction: {ab_clf_prediction}')
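
Besides the predicted class, both ensembles can report how confident they are through `.predict_proba()`. The sketch below is an illustration on a synthetic `make_moons` dataset (`X_demo`, `y_demo` and `x_new` are stand-in names, not the banknote data):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Synthetic stand-in for the banknote features (assumption, not the lecture data)
X_demo, y_demo = make_moons(n_samples=300, noise=0.25, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)
ab = AdaBoostClassifier(random_state=0).fit(X_demo, y_demo)

# One new example with two features, shaped (1, 2) like x_predict above
x_new = np.array([[-1.0, 0.0]])

# One column per class, in the order given by .classes_
forest_proba = forest.predict_proba(x_new)
ab_proba = ab.predict_proba(x_new)

print(f'Classes: {forest.classes_}')
print(f'Random Forest probabilities: {forest_proba[0]}')
print(f'AdaBoost probabilities: {ab_proba[0]}')
```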

### 2.4 Analysis

#### 2.4.1 Regularization: number of decision trees

One important hyperparameter in the training of a random forest is the number of decision trees which form the ensemble. In sklearn, this is controlled with the `n_estimators` argument. Let's check out its effects on the combined model:

In [None]:
def train_forest(X, y, **kwargs):
    clf = RandomForestClassifier(random_state=0, **kwargs)
    return clf.fit(X, y)

n_estimators_values = [3, 10, 100]
forests = [train_forest(X, y, n_estimators=n) for n in n_estimators_values]
titles = [f'n_estimators={n}' for n in n_estimators_values]

compare_classification(X, y, forests, titles)

A bigger ensemble averages over more trees, which reduces overfitting and usually gives a more accurate combined model, although the gains flatten out as more trees are added. However, increasing `n_estimators` also increases the training time! Try training a random forest with 5000 decision trees and watch the seconds tick away 😪

In [None]:
train_forest(X, y, n_estimators=5000)
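
The plots above only show the training data. To put a number on the effect of `n_estimators`, one option is cross-validation, which this lecture has not used so far. The sketch below is an illustration on a synthetic `make_moons` dataset (`X_demo` and `y_demo` are stand-in names, not the banknote data):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the banknote features (assumption, not the lecture data)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)

cv_accuracy = {}
for n in (3, 10, 100):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    # 5-fold cross-validation: each fold is held out once for scoring
    scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    cv_accuracy[n] = scores.mean()
    print(f'n_estimators={n:>3}: mean CV accuracy = {cv_accuracy[n]:.3f}')
```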

#### 2.4.2 Regularization: minimum samples per leaf

💪💪 Investigate the effect of the `min_samples_leaf` argument on a random forest.
- you can use the exact same code structure as the section above
- you don't have to redefine the function `train_forest()`, since `**kwargs` will forward any _keyword argument_ to the classifier.
- pick a suitable range of parameter values. You can always change them and run the cell again!
- the unit test is having nice looking graphs 🙃
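
As a reminder of how `**kwargs` forwarding works, here is a minimal sketch that passes a different hyperparameter, `max_depth`, through the same function. The synthetic `make_moons` dataset and the names `X_demo`, `y_demo` and `shallow_forest` are only used for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

def train_forest(X, y, **kwargs):
    # kwargs collects every extra keyword argument into a dict,
    # and **kwargs unpacks that dict into the classifier's constructor
    clf = RandomForestClassifier(random_state=0, **kwargs)
    return clf.fit(X, y)

# Synthetic stand-in for the banknote features (assumption, not the lecture data)
X_demo, y_demo = make_moons(n_samples=200, noise=0.25, random_state=0)

# Both keyword arguments reach RandomForestClassifier unchanged
shallow_forest = train_forest(X_demo, y_demo, n_estimators=20, max_depth=3)
print(f'max_depth: {shallow_forest.max_depth}')
print(f'number of trees: {len(shallow_forest.estimators_)}')
```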


🧠 Define the effect of the `min_samples_leaf` parameter. It might help to check out the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
# INSERT YOUR CODE HERE

🧠🧠 How does `min_samples_leaf` affect the model's generalization? Why?

#### 2.4.3 Regularization: learning rate

On top of `n_estimators`, the regularization of boosted ensembles can be controlled with `learning_rate`. This scales the weight applied to each estimator at each boosting iteration: higher learning rates increase the contribution of each classifier, and lower learning rates shrink it:

In [None]:
def train_adaboost(X, y, **kwargs):
    clf = AdaBoostClassifier(random_state=0, **kwargs)
    return clf.fit(X, y)

learning_rate_values = [0.01, 0.1, 1]
adaboosts = [train_adaboost(X, y, learning_rate=lr, n_estimators=500) for lr in learning_rate_values]
titles = [f'learning_rate={lr}' for lr in learning_rate_values]

compare_classification(X, y, adaboosts, titles)

Notice that lower learning rates act as a _regularisation_ method, but too much regularisation affects the accuracy of the model!
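
One way to check the trade-off is to compare the accuracy on the training data with the accuracy on held-out data. This is a hedged sketch rather than part of the original analysis: it uses a synthetic `make_moons` dataset and a simple train/test split (`X_demo`, `y_demo` and `results` are stand-in names):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the banknote features (assumption, not the lecture data)
X_demo, y_demo = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

results = {}
for lr in (0.01, 0.1, 1.0):
    clf = AdaBoostClassifier(n_estimators=200, learning_rate=lr, random_state=0)
    clf.fit(X_train, y_train)
    # store (training accuracy, held-out accuracy) for each learning rate
    results[lr] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f'learning_rate={lr}: train={results[lr][0]:.3f}, test={results[lr][1]:.3f}')
```

A large gap between the two scores suggests overfitting, while two low scores suggest that the model is regularized too strongly.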

## 3. Regression

As mentioned in the lecture slides, decision trees can solve regression tasks. They do so by using _variance reduction_ instead of _homogeneity metrics_ to split each node, and by assigning _numerical values_ to each leaf node.
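
To make variance reduction concrete, here is a minimal sketch of how a single regression split could be chosen on a 1D feature. It is a toy illustration, not sklearn's implementation; `best_split`, `x_toy` and `y_toy` are hypothetical names:

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_variance) of the best single split on a 1D feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical feature values
        left, right = y_sorted[:i], y_sorted[i:]
        # weighted average of the label variance in the two child nodes
        score = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if score < best_score:
            # place the threshold halfway between the two neighbouring values
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_score = score
    return best_threshold, best_score

# Toy data with a clear step between x=4 and x=5
x_toy = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y_toy = np.array([1, 1, 1, 1, 5, 5, 5, 5], dtype=float)

threshold, score = best_split(x_toy, y_toy)

# Each leaf predicts the mean label of its training examples
left_value = y_toy[x_toy <= threshold].mean()
right_value = y_toy[x_toy > threshold].mean()
print(f'Best threshold: {threshold}, weighted variance after split: {score}')
print(f'Leaf values: {left_value} (left), {right_value} (right)')
```

A full regression tree repeats this search recursively in each child node, and over every feature when there is more than one.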

We'd like to try this out on our "instagram planning" dataset, and aim to predict the `actual_minutes` spent online from the originally `planned_minutes` (see notebook 3.7 for a more detailed exploration of this dataset).

Except this time, _you_ are going to compare and analyse these regression models! 

💪💪💪 Train and analyse a decision tree regressor & a random forest regressor on the instagram planning dataset. Some helper functions are supplied so you can focus on the machine learning bits 😎. Here's a list of the steps you should take to lead your analysis:
- load the `instagram_planning.csv` dataset into a `DataFrame`
- optionally visualize this dataset to refresh your memory
- create a feature matrix, `X`, and a label vector, `y`
- fit a [`DecisionTreeRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) model to the data (check the official documentation for details)
- optionally visualize the decision tree's nodes with `sklearn.tree.plot_tree()` to understand its prediction logic
- fit a [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) model to the data (check the official documentation for details)
- plot and compare their predictions with `compare_regression()`, which has the exact same interface as `compare_classification()`.
- the unit test is having nice looking graphs 🙃


In [None]:
def plot_regression(ax, X, y, reg):

    # plot the examples
    ax.scatter(X, y, alpha=0.6)

    # create feature matrix
    xmin, xmax = ax.get_xlim()
    x_line = np.linspace(xmin, xmax, 30).reshape(-1, 1)
    
    # predict
    y_line = reg.predict(x_line)

    # plot the hypothesis
    ax.plot(x_line, y_line, c='g', linewidth=3)

    # formatting
    ax.set_xlim(xmin, xmax)
    ax.set_xlabel('planned online time (min)')
    ax.set_ylabel('time spent online (min)')
    ax.set_title('Online Procrastination');
    
def compare_regression(X, y, regs, titles):
    fig = plt.figure(figsize=(14, 4), dpi=100)
    for i, reg in enumerate(regs):
        ax = fig.add_subplot(1, len(regs), i+1)
        plot_regression(ax, X, y, reg)
        ax.set_title(titles[i])

In [None]:
# INSERT YOUR CODE HERE

🧠 Take your time to think about what happens in the `.fit()` method of the `DecisionTreeRegressor`. Can you list the main similarities and differences with a decision tree classifier?


🧠 When plotting the predictions of the decision tree and random forest with `compare_regression()`, why are they so "bumpy" compared to linear or polynomial regression models?

🧠🧠 How are the predictions of each individual decision tree combined to make the numerical prediction of the random forest?

🧠 What do the `compare_regression()` plots show about random forest regressors and regularization?

💪💪 Feel free to investigate the effect of various parameters on the regression models! You can use the same code structure as for the "analysis" section with the random forest & AdaBoost classifiers above. Remember that all parameters to play with are listed in the official documentation.

## 4. Summary

Today we learned about **ensemble learning**, a method for training and combining **weak learners** into superpowered **ensemble models**. We first described **bagging**, which averages overfit models trained on random samples of the dataset to decrease their variance and improve their generalisation properties. **Random forests** are a successful example of a bagging ensemble of decision trees. We then studied **boosting**, which iteratively retrains underfit models focusing on their errors, and predicts using a weighted combination of the ensemble. **AdaBoost** is a simple example of boosted decision stumps. We applied these models to our banknote classification dataset, and introduced several **regularization** procedures: the number of estimators, the minimum samples per leaf, and the learning rate. Finally, we also tested ensemble methods on a **regression** task with our instagram_planning dataset.

# Resources

## Core Resources

- [sklearn documentation - ensemble methods](https://scikit-learn.org/stable/modules/ensemble.html)  
Official documentation about ensemble methods in sklearn
- [Introduction to random forests](https://victorzhou.com/blog/intro-to-random-forests/)  
Excellent visual blogpost which explains random forests in detail
- [Python data science handbook - random forests](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html)  
Practical code-along post about implementation of random forests with sklearn


## Additional Resources

- [random forest python](https://github.com/kevin-keraudren/randomforest-python)  
Implementation of random forest from scratch in python
- [Gentle Introduction to Gradient Boosting](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
- [Introduction to XGBoost](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)  
Comprehensive description of one of the most successful algorithms in data science: xgboost
- [args and kwargs demystified](https://realpython.com/python-kwargs-and-args/)  
Blog post about \*\*kwargs in Python
