# Aggregating: Averaging, Bagging, and Random Forests

In [None]:
import numpy as np
import pandas as pd
import os
import xlrd
import seaborn as sns
from sklearn.model_selection import (train_test_split, GridSearchCV,
                                     cross_val_score)
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              RandomForestClassifier, ExtraTreesRegressor,
                              VotingRegressor)
from sklearn.metrics import r2_score
from sklearn.preprocessing import OneHotEncoder

## Agenda

Students will be able to (SWBAT):

- use `sklearn` to build voting models;
- describe the algorithm of bagging;
- describe the differences among simple bagging, random forest, and extra trees algorithms;
- implement bagging models in `sklearn`.

## Intro

The basic idea of building an ensemble model is to build a "meta-estimator" that aggregates predictions from several "base learners".

There are several ways to do this.

Most simply, we could build several models and then take an **average** of their predictions. But there are also more sophisticated techniques, of which we shall explore two in depth:
- **bagging**, which depends on the idea of bootstrapping; and
- **boosting**, which depends on the idea of using a base estimator's errors to train the next base estimator.

In this lesson we'll discuss averaging and bagging.

## Averaging

In [None]:
wb = xlrd.open_workbook('data/Sales Report.xls',
                        logfile=open(os.devnull, 'w'))
sales = pd.read_excel(wb)

sales.head()

In [None]:
sales.info()

In [None]:
sales.isna().sum().sum()

In [None]:
sales = sales.dropna()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    sales[['Discount', 'Profit', 'Category', 'Sub-Category']],
    sales['Sales'], random_state=42)

In [None]:
# `sparse_output` replaces the older `sparse` argument (scikit-learn >= 1.2)
ohe = OneHotEncoder(drop='first', sparse_output=False)
ohe.fit(X_train[['Category', 'Sub-Category']])

In [None]:
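# `get_feature_names_out` replaces the removed `get_feature_names` method
X_tr_cat = pd.DataFrame(ohe.transform(X_train[['Category', 'Sub-Category']]),
                        columns=ohe.get_feature_names_out(),
                        index=X_train.index)

# Rejoin the encoded categorical columns with the numeric columns by index
X_tr_ohe = X_tr_cat.merge(X_train[['Discount', 'Profit']],
                          left_index=True,
                          right_index=True)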

In [None]:
X_tr_ohe.head()

In [None]:
# Transform the test set with the encoder that was fitted on the training set
X_te_cat = pd.DataFrame(ohe.transform(X_test[['Category', 'Sub-Category']]),
                        columns=ohe.get_feature_names_out(),
                        index=X_test.index)

X_te_ohe = X_te_cat.merge(X_test[['Discount', 'Profit']],
                          left_index=True,
                          right_index=True)

In [None]:
X_te_ohe.head()

### Model 1

In [None]:
lr = LinearRegression()

lr.fit(X_tr_ohe, y_train)

In [None]:
scores = cross_val_score(estimator=lr, X=X_tr_ohe,
                        y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
lr.score(X_te_ohe, y_test)

### Model 2

In [None]:
knn = KNeighborsRegressor()

knn.fit(X_tr_ohe, y_train)

In [None]:
scores = cross_val_score(estimator=knn, X=X_tr_ohe,
                y=y_train, cv=10)
np.median(scores)

In [None]:
knn.score(X_te_ohe, y_test)

### Model 3

In [None]:
rt = DecisionTreeRegressor(random_state=42)

rt.fit(X_tr_ohe, y_train)

In [None]:
scores = cross_val_score(estimator=rt, X=X_tr_ohe,
               y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
rt.score(X_te_ohe, y_test)

### Averaging

To build our simple averaging meta-estimator, we'll just average the predictions of the three base estimators:

In [None]:
meta_preds = sum([lr.predict(X_te_ohe), knn.predict(X_te_ohe),
                  rt.predict(X_te_ohe)]) / 3

Now we can evaluate our meta-estimator:

In [None]:
r2_score(y_test, meta_preds)

#### Building a VotingRegressor

In [None]:
avg = VotingRegressor(estimators=[
    ('lr', lr),
    ('knn', knn),
    ('rt', rt)])
avg.fit(X_tr_ohe, y_train)

In [None]:
scores = cross_val_score(estimator=avg, X=X_tr_ohe,
               y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
# This should be the same as above!

avg.score(X_te_ohe, y_test)
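
To see why the two scores should agree, here is a minimal sketch on synthetic data (an assumption made so the cell runs on its own; it does not use the sales data). It averages the three base models' predictions by hand and compares the result with an unweighted `VotingRegressor`. The names `X_demo`, `manual_avg`, and `voter` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the sales data used above)
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0)

base_models = [('lr', LinearRegression()),
               ('knn', KNeighborsRegressor()),
               ('rt', DecisionTreeRegressor(random_state=42))]

# Manual averaging: fit each base model, then take the mean prediction
manual_avg = np.mean(
    [model.fit(Xd_train, yd_train).predict(Xd_test)
     for _, model in base_models],
    axis=0)

# VotingRegressor clones the base models, refits them, and averages
voter = VotingRegressor(estimators=base_models).fit(Xd_train, yd_train)
voter_preds = voter.predict(Xd_test)

# True when the two sets of predictions match up to floating-point error
print(np.allclose(manual_avg, voter_preds))
```

Because every base model here is deterministic once `random_state` is fixed, the refitted clones inside the `VotingRegressor` reproduce the hand-fitted models.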

### Weighted Averaging

This meta-estimator is not as good as the best of our base estimators, so in this case simple averaging did not work very well. Since the base estimators do not all perform equally well, however, we might instead build a meta-estimator by calculating a **weighted average** of their predictions, weighting (or biasing) it in favor of the stronger base estimators. The weights should sum to 1 so that the result stays on the scale of the target. Suppose we weight the tree 20%, the knn model 70%, and the linear regression 10%:

In [None]:
weighted_preds = sum([0.1 * lr.predict(X_te_ohe), 0.7 * knn.predict(X_te_ohe),
                     0.2 * rt.predict(X_te_ohe)])

Now we can evaluate this new meta-estimator:

In [None]:
r2_score(y_test, weighted_preds)

#### Weighted Averaging with the VotingRegressor

In [None]:
w_avg = VotingRegressor(estimators=[
    ('lr', lr),
    ('knn', knn),
    ('rt', rt)],
    weights=[0.1, 0.7, 0.2])
w_avg.fit(X_tr_ohe, y_train)

In [None]:
scores = cross_val_score(estimator=w_avg, X=X_tr_ohe,
                        y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
# This should be the same as above!

w_avg.score(X_te_ohe, y_test)
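
One detail worth checking is how `VotingRegressor` treats its `weights`. The sketch below, again on synthetic data and with illustrative names (`make_voter`, `X_w`), suggests that only the relative sizes matter: weights of `[1, 7, 2]` give the same predictions as `[0.1, 0.7, 0.2]`, and both match a hand-computed `np.average` over the fitted base models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the sales data used above)
X_w, y_w = make_regression(n_samples=200, n_features=4,
                           noise=5.0, random_state=1)


def make_voter(weights):
    """Build a fresh weighted VotingRegressor (hypothetical helper)."""
    return VotingRegressor(
        estimators=[('lr', LinearRegression()),
                    ('knn', KNeighborsRegressor()),
                    ('rt', DecisionTreeRegressor(random_state=42))],
        weights=weights)


voter_frac = make_voter([0.1, 0.7, 0.2]).fit(X_w, y_w)
voter_int = make_voter([1, 7, 2]).fit(X_w, y_w)

preds_fractions = voter_frac.predict(X_w)
preds_integers = voter_int.predict(X_w)

# Hand-computed weighted average over the fitted base models
manual_weighted = np.average(
    [est.predict(X_w) for est in voter_frac.estimators_],
    axis=0, weights=[0.1, 0.7, 0.2])

print(np.allclose(preds_fractions, preds_integers))
print(np.allclose(preds_fractions, manual_weighted))
```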

## Bagging

A single decision tree will often overfit your training data. Let's see if we have evidence of that in the current case:

In [None]:
rt.score(X_tr_ohe, y_train)

**Question**: What is this score? And why is it nearly equal to 1?

This nearly perfect score on the training data is already evidence of model overfitting. There are steps one can take to help with this, like limiting the depth of the tree. And of course we can use cross-validation to get a more honest estimate of model quality:

In [None]:
scores = cross_val_score(estimator=rt, X=X_tr_ohe,
                y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
rt.score(X_te_ohe, y_test)
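
As a rough illustration of the depth limit mentioned above, here is a minimal sketch on synthetic data (an assumption; the sales data is not used). It compares a fully grown tree with one capped at `max_depth=3`. The value 3 is an arbitrary choice for the sketch, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy stand-in data (assumption)
X_d, y_d = make_regression(n_samples=300, n_features=5,
                           noise=20.0, random_state=0)
Xd_tr, Xd_te, yd_tr, yd_te = train_test_split(X_d, y_d, random_state=0)

# A fully grown tree keeps splitting until its leaves are pure
deep_tree = DecisionTreeRegressor(random_state=42).fit(Xd_tr, yd_tr)

# A depth-limited tree has to stop early (arbitrary cap for illustration)
shallow_tree = DecisionTreeRegressor(max_depth=3,
                                     random_state=42).fit(Xd_tr, yd_tr)

for name, tree in [('deep', deep_tree), ('shallow', shallow_tree)]:
    print(name,
          'depth:', tree.get_depth(),
          'train R^2:', round(tree.score(Xd_tr, yd_tr), 3),
          'test R^2:', round(tree.score(Xd_te, yd_te), 3))
```

The fully grown tree memorizes the training set, while the capped tree cannot, so comparing each tree's train and test scores gives a feel for how much of the training score is memorization.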

But it's often better to do something else: Plant another tree!

Of course, if a second tree is going to be of any value, it has to be *different from* the first. Here's a good algorithm for achieving that:

### Bagging Algorithm

Take a random sample of your training data, drawn *with replacement*, and fit a decision tree to it. <br/>
Return the sampled points to the pool, draw a fresh sample, and fit another tree. Repeat. <br/>
When you've got as many trees as you like, combine all of the individual trees' predictions into a single overall prediction. (Most obviously, we could take the average of the predictions, but there are other methods we might try.)

<br/>

Because we're resampling our data with replacement, we're *bootstrapping*. <br/>
Because we're making use of our many samples' predictions, we're *aggregating*. <br/>
Because we're bootstrapping and aggregating all in the same algorithm, we're *bagging*.
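
The steps above can be sketched by hand in a few lines. This is a minimal illustration on synthetic data, not how `sklearn` implements bagging internally; the names `X_b`, `trees`, and `bagged_preds`, and the choice of 25 trees, are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the sales data used above)
X_b, y_b = make_regression(n_samples=200, n_features=4,
                           noise=10.0, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # arbitrary number of trees for the sketch
trees = []

for _ in range(n_trees):
    # Bootstrapping: draw row indices with replacement, same size as the data
    idx = rng.integers(0, len(X_b), size=len(X_b))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_b[idx], y_b[idx]))

# Aggregating: average the individual trees' predictions
all_preds = np.array([tree.predict(X_b) for tree in trees])
bagged_preds = all_preds.mean(axis=0)

# Each bootstrap sample repeats some rows and leaves others out
print('distinct rows in last sample:', len(np.unique(idx)), 'of', len(X_b))
```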

In [None]:
# Instantiate a BaggingRegressor (the default base estimator is a decision tree)

bag = BaggingRegressor(n_estimators=100,
                       verbose=1,
                       random_state=1)

In [None]:
# Fit it

bag.fit(X_tr_ohe, y_train)

In [None]:
# Cross-validation

scores = cross_val_score(estimator=bag, X=X_tr_ohe,
               y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
# Score on test

bag.score(X_te_ohe, y_test)

#### Change the base estimator

In [None]:
# `estimator` replaces the older `base_estimator` argument (scikit-learn >= 1.2)
bag = BaggingRegressor(estimator=knn,
                       random_state=1)

In [None]:
bag.fit(X_tr_ohe, y_train)

In [None]:
bag.score(X_te_ohe, y_test)

### Fitting a Random Forest

Let's add an extra layer of randomization: Instead of considering *all* the features of my model when optimizing the split at each node, I'll consider only a random subset of my features.

That's the essence of a random forest model. Note that there are now **two** levels of random sampling happening: To build a new tree, I'll be taking only some of my data points; and at any branching point in a tree, I'll be using only some of my features to determine the split.

In [None]:
# Instantiate a RandomForestRegressor

rfr = RandomForestRegressor(max_features='sqrt',
                            max_samples=0.5,
                            verbose=1,
                            random_state=1)

In [None]:
# Fit it

rfr.fit(X_tr_ohe, y_train)

In [None]:
# Cross-validation

scores = cross_val_score(estimator=rfr, X=X_tr_ohe,
               y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
# Score on test

rfr.score(X_te_ohe, y_test)
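
To make the two levels of sampling a little more concrete, here is a small sketch on synthetic data (an assumption; names such as `forest` and `X_f` are illustrative). It inspects the fitted trees inside a `RandomForestRegressor`: each tree records how many features it considers per split, and the forest's prediction is the mean of its trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with 9 features (assumption)
X_f, y_f = make_regression(n_samples=300, n_features=9,
                           noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20,
                               max_features='sqrt',  # feature sampling
                               max_samples=0.5,      # data sampling
                               random_state=1).fit(X_f, y_f)

# With 9 features, 'sqrt' means 3 candidate features at every split
print('features considered per split:', forest.estimators_[0].max_features_)

# Aggregation: the forest prediction is the average over its trees
tree_preds = np.array([tree.predict(X_f) for tree in forest.estimators_])
print(np.allclose(tree_preds.mean(axis=0), forest.predict(X_f)))
```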

### Fitting a Stand of Extremely Randomized Trees (Extra Trees)

Sometimes we might want even one more bit of randomization. Instead of always searching for the *optimal* split point for each candidate feature, we might draw a split point at random for each candidate feature and keep the best of those random splits. If we're doing that, then we've got extremely randomized trees.

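There are now **three** levels of randomization: sampling of data, sampling of features, and random selection of split points. Note that `sklearn`'s extra trees use the whole training set for every tree by default, so we pass `bootstrap=True` to keep the data sampling.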

In [None]:
# Instantiate an ExtraTreesRegressor

etr = ExtraTreesRegressor(max_features='sqrt',
                         max_samples=0.5,
                         bootstrap=True,
                         random_state=1)

In [None]:
# Fit it

etr.fit(X_tr_ohe, y_train)

In [None]:
# Cross-validation

scores = cross_val_score(estimator=etr, X=X_tr_ohe,
               y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
# Score on test

etr.score(X_te_ohe, y_test)
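
The random choice of split points can be seen directly on a single tree. The sketch below uses a one-feature synthetic data set with a step near `x = 5` (an assumption made for illustration) and compares the root split of an ordinary `DecisionTreeRegressor` with that of an `ExtraTreeRegressor`, the single-tree building block from `sklearn.tree`, across several seeds. The helper `root_threshold` is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor

# One feature, with a step in the target around x = 5 (synthetic assumption)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(200, 1))
y_step = np.where(x_demo[:, 0] > 5, 1.0, 0.0) + rng.normal(0, 0.1, size=200)


def root_threshold(model):
    """Fit a depth-1 tree and return the threshold of its root split."""
    return model.fit(x_demo, y_step).tree_.threshold[0]


# Optimal splitter: the seed does not change the chosen split point
best_thresholds = [
    root_threshold(DecisionTreeRegressor(max_depth=1, random_state=seed))
    for seed in range(5)]

# Random splitter: each seed draws a different split point
random_thresholds = [
    root_threshold(ExtraTreeRegressor(max_depth=1, random_state=seed))
    for seed in range(5)]

print('optimal splits:', np.round(best_thresholds, 2))
print('random splits: ', np.round(random_thresholds, 2))
```

Any single randomly split tree is usually a weaker model than an optimally split one; the hope is that averaging many such trees makes up for it.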