# <font color='#B31B1'> Ensemble Methods </font>

So far we've seen how to construct a single decision tree. Now we'll see how to combine multiple trees into a more powerful ensemble.

In [21]:
from IPython.display import Image
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore")

import seaborn as sns
sns.set(rc={'figure.figsize':(6,6)}) 

## <font color='#B31B1'> California Housing Dataset </font>

We'll use the California housing dataset from scikit-learn, the goal of which is to predict the median house value of California districts (the target `Y` is in units of 100,000 dollars).

In [22]:
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X = data['data']
Y = data['target']

In [23]:
data_df = (pd.DataFrame(X, columns = data['feature_names'])
           .assign(Y = Y))

data_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


## <font color='#B31B1'> Bagging </font>
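Bagging is the process of generating a set of weak learners by training each one on a random bootstrapped sample of our dataset (i.e. a dataset sampled from our training data with replacement) and then combining their predictions. To show the power of bagging, we can use random trees: these trees use a *random* feature and *random* threshold to generate the split at each node and then, since this is a regression problem, predict the mean target value at the leaf. (Scikit-learn's `ExtraTreesRegressor`, used below, is slightly less random than this: it draws a random threshold for each candidate feature and keeps the best of those random splits.)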
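To make the idea concrete, here is a minimal numpy-only sketch of bagging by hand. Everything in it is an assumption made for illustration: the data is synthetic (not the housing data), and `fit_random_stump` / `predict_stump` are hypothetical helpers implementing a depth-1 random tree, not scikit-learn functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative only, not the housing dataset)
n, d = 500, 3
X_demo = rng.uniform(-1, 1, size=(n, d))
y_demo = np.sin(3 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(0, 0.1, size=n)

def fit_random_stump(X, y, rng):
    """Depth-1 tree: random feature, random threshold, mean target in each leaf."""
    j = rng.integers(X.shape[1])
    t = rng.uniform(X[:, j].min(), X[:, j].max())
    left = X[:, j] <= t
    left_mean = y[left].mean() if left.any() else y.mean()
    right_mean = y[~left].mean() if (~left).any() else y.mean()
    return j, t, left_mean, right_mean

def predict_stump(stump, X):
    j, t, left_mean, right_mean = stump
    return np.where(X[:, j] <= t, left_mean, right_mean)

# Bagging: fit each weak learner on its own bootstrap sample
stumps = []
for _ in range(50):
    idx = rng.integers(0, n, size=n)  # n row indices drawn with replacement
    stumps.append(fit_random_stump(X_demo[idx], y_demo[idx], rng))

all_preds = np.array([predict_stump(s, X_demo) for s in stumps])  # shape (50, n)

# Compare the average error of the individual stumps with the error of their average
individual_mse = ((all_preds - y_demo) ** 2).mean(axis=1)
bagged_mse = ((all_preds.mean(axis=0) - y_demo) ** 2).mean()
print('Mean MSE of individual stumps:', individual_mse.mean())
print('MSE of the bagged prediction: ', bagged_mse)
```

Because squared error is convex, the MSE of the averaged prediction can never exceed the average MSE of the individual learners, which is part of why averaging even very weak trees helps.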

In [24]:
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Random trees are usually used just in ensemble methods
# so we have to manually specify we only want one to start
random_tree = ExtraTreesRegressor(n_estimators = 1)

In [25]:
ExtraTreesRegressor?

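On its own, the random tree gets the cross-validated score below. Scikit-learn reports the *negated* mean squared error (so that a higher score is always better), which means the MSE here is about 0.90: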

In [26]:
cross_val_score(random_tree, X, Y,
                scoring="neg_mean_squared_error", 
                cv=3).mean()

-0.9024777591605524

We could bag by randomly generating the bootstrap samples ourselves... or we could use scikit-learn's BaggingRegressor or BaggingClassifier! We simply need to specify the number of weak learners.

In [7]:
from sklearn.ensemble import BaggingRegressor

In [31]:
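# The base learner is passed positionally: its keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2
bagged_random_trees = BaggingRegressor(ExtraTreesRegressor(n_estimators = 1),
                                        n_estimators = 10
                                       )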

Bagging the random trees together leads to a big jump in performance (the MSE drops from about 0.90 to about 0.47)... even though they're random trees!

In [32]:
cross_val_score(bagged_random_trees, X, Y,
                scoring="neg_mean_squared_error", 
                cv=3).mean()

-0.47476478556319784

We can also see how the performance changes as we change the number of estimators.

In [33]:
bagged_random_trees = BaggingRegressor(ExtraTreesRegressor(n_estimators = 1),
                                        n_estimators = 100
                                       )
cross_val_score(bagged_random_trees, X, Y,
                scoring="neg_mean_squared_error", 
                cv=3).mean()

-0.4167362638559448

In [34]:
bagged_random_trees = BaggingRegressor(ExtraTreesRegressor(n_estimators = 1),
                                        n_estimators = 200
                                       )
cross_val_score(bagged_random_trees, X, Y,
                scoring="neg_mean_squared_error", 
                cv=3).mean()

-0.4118286946534265

Here:
* increasing from 10 to 100 estimators improves performance noticeably (the MSE drops from about 0.475 to about 0.417).
* increasing from 100 to 200 estimators has almost no effect (about 0.417 to about 0.412).
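
One way to see why the gains level off, as a rough sketch: assume (an idealization, not something measured here) that the predictions $T_1(x), \dots, T_B(x)$ of the $B$ trees are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is then

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

The second term shrinks as $B$ grows, but the first does not, so adding more estimators eventually stops helping unless the trees themselves become less correlated.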

## <font color='#B31B1'> Random Forests </font>

A random forest is a bagging approach for trees that also restricts each split to a random subset of the features (to help decorrelate the trees). Scikit-learn offers a great implementation of random forests.

In addition to all the decision tree hyperparameters, random forests also let us choose the number of trees (`n_estimators`), whether to use bootstrapped samples for each tree (`bootstrap`), and the max number of features considered at each split (`max_features`).
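
To illustrate what limiting the features at a split means, here is a simplified numpy sketch. It is not scikit-learn's implementation: `best_split_on_feature_subset` is a hypothetical helper, and the data is synthetic. It searches only a random subset of `max_features` columns for the split with the lowest squared error.

```python
import numpy as np

def best_split_on_feature_subset(X, y, max_features, rng):
    """Find the best squared-error split, looking only at a random subset of features."""
    n_features = X.shape[1]
    candidates = rng.choice(n_features, size=max_features, replace=False)
    best = (np.inf, None, None)  # (sum of squared errors, feature, threshold)
    for j in candidates:
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values,
        # so both sides of every split are non-empty
        for threshold in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= threshold
            sse = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            if sse < best[0]:
                best = (sse, int(j), float(threshold))
    return candidates, best

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

candidates, (sse, feature, threshold) = best_split_on_feature_subset(
    X_demo, y_demo, max_features=3, rng=rng)
print('Features considered at this split:', candidates)
print('Chosen feature and threshold:', feature, threshold)
```

If the most informative feature is not among the candidates, the split has to use a weaker one. Different trees therefore end up with different structures, which is the decorrelation effect described above.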

In [38]:
from sklearn.ensemble import RandomForestRegressor

random_forest = RandomForestRegressor(n_estimators = 100)

cross_val_score(random_forest, X, Y,
                scoring="neg_mean_squared_error", 
                cv=3).mean()

-0.46905893268797305

In [36]:
RandomForestRegressor?

## <font color='#B31B1'> Gradient Boosting </font>
Recall that boosting is the process of sequentially training weak learners to create a powerful prediction. In gradient boosting, each subsequent model tries to approximate the gradient of the loss function evaluated at the current model's predictions, and we then take a step against it (almost mimicking gradient descent!). Let's try walking through a simple example manually.

In [39]:
#Start by splitting our data into training and testing
train_df = data_df.sample(frac=0.8)
test_df = data_df[~data_df.index.isin(train_df.index)]

X_tr = train_df.drop('Y',axis=1)
Y_tr = train_df['Y']

X_tst = test_df.drop('Y',axis=1)
Y_tst = test_df['Y']

We start by creating our initial predictions, here, by fitting a decision tree to our data.

In [40]:
# Start with our base prediction using a decision tree with a max depth of 5
from sklearn.tree import DecisionTreeRegressor

base_tree = DecisionTreeRegressor(max_depth=5)

base_tree.fit(X_tr, Y_tr)

#Current MSE
print('Our initial training MSE is ', np.mean((base_tree.predict(X_tr) - Y_tr)**2))

Our initial training MSE is  0.4879983243722546


Next, we want to compute the gradient so we can construct a training dataset for our second tree. Since our objective is mean squared error, the gradient with respect to each prediction is (up to a constant factor) $\hat{y} - y$, i.e. the negated residual.
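
Spelling this out as a quick derivation, with a factor of $\tfrac{1}{2}$ assumed in the loss for convenience (it only rescales the gradient):

$$L(y, \hat{y}) = \tfrac{1}{2}(\hat{y} - y)^2 \quad\Rightarrow\quad \frac{\partial L}{\partial \hat{y}} = \hat{y} - y$$

A gradient descent step with step size $\gamma$ would update each prediction as $\hat{y} \leftarrow \hat{y} - \gamma\,(\hat{y} - y)$. Boosting replaces the exact gradient with a tree $h(x)$ fit to it (the symbol $h$ is notation introduced here), giving $\hat{y} \leftarrow \hat{y} - \gamma\, h(x)$.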

In [46]:
# Gradient of the squared error w.r.t. the predictions: y_hat - y (the negated residuals)
residuals =  base_tree.predict(X_tr) - Y_tr

second_tree = DecisionTreeRegressor(max_depth=5)
second_tree.fit(X_tr, residuals)

DecisionTreeRegressor(max_depth=5)

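Next, we figure out the step size using a line search (we'll just manually try gamma values between 0 and 1). Because the second tree approximates the gradient, we subtract its scaled predictions from the base tree's.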

In [48]:
best_mse = np.inf
best_gamma = None

for gamma in np.linspace(0, 1, 100):
    mse =  np.mean((base_tree.predict(X_tr) - gamma*second_tree.predict(X_tr) - Y_tr)**2)
    if mse < best_mse:
        best_gamma = gamma
        best_mse = mse

print('The best step size was ', best_gamma,' for a new MSE of ', best_mse)

The best step size was  1.0  for a new MSE of  0.35254102596712683
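
The two steps above (fit a learner to the gradient, then line-search a step size) can be repeated in a loop. Here is a rough numpy-only sketch under stated assumptions: the data is synthetic and one-dimensional, the initial model is a constant rather than a depth-5 tree, and `fit_stump` is a hypothetical single-split learner standing in for the trees.

```python
import numpy as np

def fit_stump(x, target):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = (np.inf, None, None, None)  # (sse, threshold, left mean, right mean)
    values = np.unique(x)
    for candidate in (values[:-1] + values[1:]) / 2:
        left = x <= candidate
        left_mean, right_mean = target[left].mean(), target[~left].mean()
        sse = (((target[left] - left_mean) ** 2).sum()
               + ((target[~left] - right_mean) ** 2).sum())
        if sse < best[0]:
            best = (sse, candidate, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda z: np.where(z <= threshold, left_mean, right_mean)

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 6, size=300)
y_demo = np.sin(x_demo) + rng.normal(scale=0.1, size=300)

prediction = np.full_like(y_demo, y_demo.mean())  # initial model: a constant
mse_history = [np.mean((prediction - y_demo) ** 2)]

for _ in range(20):
    gradient = prediction - y_demo         # gradient of 1/2 (y_hat - y)^2
    learner = fit_stump(x_demo, gradient)  # weak learner fit to the gradient
    step = learner(x_demo)
    # Line search over a grid of step sizes (gamma = 0 means "no update")
    gammas = np.linspace(0, 1, 101)
    losses = [np.mean((prediction - g * step - y_demo) ** 2) for g in gammas]
    gamma = gammas[int(np.argmin(losses))]
    prediction = prediction - gamma * step  # step against the gradient
    mse_history.append(np.mean((prediction - y_demo) ** 2))

print('Training MSE before boosting:', mse_history[0])
print('Training MSE after 20 rounds:', mse_history[-1])
```

Since the grid includes gamma = 0, the training MSE can never increase from one round to the next. That guarantee only covers the training data, which is why a held-out comparison is still needed.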


We could now continue this process and try to add in a third tree and so on. Instead, let's show how to do this with scikit-learn.

In [17]:
from sklearn.ensemble import GradientBoostingRegressor

In [51]:
GradientBoostingRegressor?

The gradient boosted trees implementation allows us to pick a loss function, 
a fixed learning rate, and all the usual decision tree hyperparameters.
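
As a rough sketch of what the fixed learning rate does (the symbols $F_m$, $h_m$ and $\nu$ are notation introduced here, not names from the library): if $F_{m-1}$ is the ensemble after $m-1$ trees and $h_m$ is the next tree, fit to the *negative* gradient, the update is

$$F_m(x) = F_{m-1}(x) + \nu\, h_m(x)$$

This matches the manual example up to sign conventions: there the tree was fit to the gradient and subtracted, here it is fit to the negative gradient and added. With $\nu = 1$ each tree's prediction is added in full. A smaller $\nu$ shrinks each tree's contribution, which usually means more trees are needed to reach the same training error.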

In [53]:
grad_boost_tree = GradientBoostingRegressor(
                    loss = 'squared_error',  # squared-error loss (called 'ls' before scikit-learn 1.0)
                    learning_rate = 1)

grad_boost_tree.fit(X_tr, Y_tr)

print('The gradient boosted MSE is ', np.mean((grad_boost_tree.predict(X_tr) - Y_tr)**2))

The gradient boosted MSE is  0.16017001437539927


We can also compare the test set error. The second tree was fit to $\hat{y} - y$, so its scaled prediction is subtracted from the base tree's, just as in the line search:

In [54]:
print('The original tree MSE is ', np.mean((base_tree.predict(X_tst) - Y_tst)**2))
print('The one-step boosted tree MSE is ', np.mean((base_tree.predict(X_tst) - best_gamma*second_tree.predict(X_tst) - Y_tst)**2))
print('The gradient boosted test MSE is ', np.mean((grad_boost_tree.predict(X_tst) - Y_tst)**2))

The original tree MSE is  0.5142762445291711
The gradient boosted test MSE is  0.2669416618696991
