# Ensembles 1: Random forests

## Learning objectives
- understand 
    - bootstrapping datasets
    - building ensembles by aggregating predictions
    - bagging
    - random forests
- implement
    - your first ensemble
    - a random forest

## Intro - ensembles
### "The collective is smarter than the individual"


If we attempted to build an ensemble with multiple instances of the same model on the same dataset, this wouldn't help. 
Why?
Because they would all be identical (other than the differences induced by any stochasticity in the optimisation process).
This means that they would all make the same mistakes, and combining their predictions would not give any improvement.

### Why do ensembles work?
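Mathematically, ensembles work when the mistakes made by different models are not strongly correlated.
If the errors are uncorrelated, they tend to cancel out when the predictions are combined; if all of the errors are correlated, the ensemble repeats the same mistakes as any single model.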

To make sure the model errors are not correlated, we can't just train the same model on the same dataset, otherwise all of the errors would be the same for each model.
We can make the predictions differ (and decorrelate the errors) in different ways, some of which we discuss below.
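
To make the effect of correlation concrete, here is a minimal simulation sketch. It is not part of the original notebook, and the names `n_models`, `n_trials` and `sigma` are illustrative assumptions. Each model's error is simulated as zero-mean Gaussian noise, and we compare the variance of the averaged error when the errors are independent with the variance when every model makes exactly the same error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials, sigma = 10, 100_000, 1.0  # illustrative values, not from the notebook

# Independent errors: each model draws its own noise
independent_errors = rng.normal(0, sigma, size=(n_trials, n_models))

# Perfectly correlated errors: every model shares the same noise
shared_error = rng.normal(0, sigma, size=(n_trials, 1))
correlated_errors = np.repeat(shared_error, n_models, axis=1)

# Variance of the ensemble's error after averaging across models
var_independent = independent_errors.mean(axis=1).var()
var_correlated = correlated_errors.mean(axis=1).var()

print(f"variance of averaged independent errors: {var_independent:.3f}")
print(f"variance of averaged correlated errors:  {var_correlated:.3f}")
```

With independent errors, the variance of the average should come out close to `sigma**2 / n_models`. With identical errors, averaging gives no reduction at all.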

### Bootstrapping datasets
The first way that we can make the models differ is by training them on different data.
We can "bootstrap" new datasets by sampling with replacement from the existing dataset.

Let's get our data and make some bootstrapped datasets.

In [20]:
import numpy as np
np.set_printoptions(suppress=True)
import matplotlib.pyplot as plt
import sklearn.datasets
from get_colors import colors

def get_data(sd=6, m=10, n_features=2, n_clusters=2):
    X, Y = sklearn.datasets.make_blobs(n_samples=m, n_features=n_features, centers=n_clusters, cluster_std=sd)
    return X, Y

def create_bootstrapped_dataset(existing_X, existing_Y, sample_weights=None):
    """Create a single bootstrapped dataset"""
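    # Sample as many indices as there are examples, with replacement (without `size`, only one index is drawn)
    idxs = np.random.choice(np.arange(len(existing_X)), size=len(existing_X), replace=True)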
    return existing_X[idxs], existing_Y[idxs]

def create_bootstrapped_datasets(X, Y, n):
    """Create n bootstrapped datasets"""
    datasets = []
    for d in range(n):
        x, y = create_bootstrapped_dataset(X, Y)
        datasets.append((x, y))
    return datasets

X, Y = get_data()
n_trees = 10
bootstrapped_datasets = create_bootstrapped_datasets(X, Y, n_trees)
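
As a quick sanity check on what sampling with replacement does, the sketch below draws one bootstrap sample of indices and measures how many distinct original examples it contains. It is an illustrative addition rather than part of the original notebook; the dataset size `m` is a hypothetical value, chosen large so that the fraction is stable.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000  # hypothetical dataset size, not the notebook's m=10

# Draw m indices with replacement, as in a single bootstrapped dataset
idxs = rng.choice(m, size=m, replace=True)

# Some examples appear several times, others not at all
unique_fraction = len(np.unique(idxs)) / m
print(f"bootstrap sample size: {len(idxs)}")
print(f"fraction of original examples included: {unique_fraction:.3f}")
```

For large `m`, the expected fraction of distinct examples approaches `1 - 1/e`, roughly 0.632, so each bootstrapped dataset leaves out about a third of the original examples. This is what makes the models trained on them differ.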


## Random forests
Random forests are ensembles of decision trees (many trees make a forest).

### Learning on feature subsets
Another way that we can make the predictions differ is by only allowing the model to make predictions based on a limited number of the features. 
That is, we train each model on examples that have been projected onto a subset of the features (a subspace of the feature space).

Let's use the sklearn `DecisionTreeClassifier` as our model, and train an ensemble of them on different random subspaces of features to create a random forest.

In [22]:
def project_into_subspace(X, feature_idxs):
    return X[:, feature_idxs]

projected_X = project_into_subspace(X, [0])
print(projected_X)
print(X)

[[-7.95532263]
 [13.00869119]
 [ 5.21951674]
 [-1.65583386]
 [-4.43587736]
 [-5.07070575]
 [-6.05277874]
 [ 1.36745376]
 [-1.22256703]
 [11.27625644]]
[[ -7.95532263  -6.5324316 ]
 [ 13.00869119   6.24425574]
 [  5.21951674   6.51036873]
 [ -1.65583386  -1.3158393 ]
 [ -4.43587736   0.32247454]
 [ -5.07070575   2.00740652]
 [ -6.05277874  -1.4526738 ]
 [  1.36745376  -3.05795278]
 [ -1.22256703   1.28295427]
 [ 11.27625644 -14.02775716]]
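
Putting the two ideas together, the following is a minimal sketch of an ensemble of `DecisionTreeClassifier` models, each trained on a bootstrapped dataset projected onto a random feature subset, with predictions combined by majority vote. The names `X_demo`, `Y_demo`, `n_sub_features` and `predict_forest`, and the values chosen for them, are assumptions made for illustration and are not taken from the notebook. Note also that scikit-learn's own `RandomForestClassifier` selects candidate features at each split rather than once per tree, so this sketch follows the per-tree subspace approach described above, not that implementation.

```python
import numpy as np
import sklearn.datasets
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Illustrative data: a larger, 4-feature version of the blobs used above
X_demo, Y_demo = sklearn.datasets.make_blobs(
    n_samples=200, n_features=4, centers=2, cluster_std=6, random_state=0
)

n_trees = 10
n_sub_features = 2  # assumed size of each tree's random feature subset

forest = []  # list of (tree, feature_idxs) pairs
for _ in range(n_trees):
    # Bootstrap the examples: sample row indices with replacement
    row_idxs = rng.choice(len(X_demo), size=len(X_demo), replace=True)
    # Choose a random subset of features for this tree (no repeats)
    feature_idxs = rng.choice(X_demo.shape[1], size=n_sub_features, replace=False)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_demo[row_idxs][:, feature_idxs], Y_demo[row_idxs])
    forest.append((tree, feature_idxs))

def predict_forest(forest, X):
    """Majority vote over the trees, assuming binary labels 0 and 1."""
    # Each tree only sees the features it was trained on
    votes = np.array([tree.predict(X[:, f]) for tree, f in forest])
    # Ties (an even split of votes) are resolved in favour of class 1
    return (votes.mean(axis=0) >= 0.5).astype(int)

predictions = predict_forest(forest, X_demo)
print("training accuracy:", np.mean(predictions == Y_demo))
```

Each tree has to remember its own `feature_idxs`, because at prediction time the input must be projected into the same subspace the tree was trained on.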
