# Ensembles and Hyperparameter Tuning

In [None]:
import numpy as np
import pandas as pd
import os
import xlrd
import seaborn as sns
from sklearn.model_selection import (train_test_split, GridSearchCV,
                                     cross_val_score)
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              RandomForestClassifier, ExtraTreesRegressor,
                              VotingRegressor)
from sklearn.metrics import r2_score
from sklearn.preprocessing import OneHotEncoder

The basic idea of building an ensemble model is to build a "meta-estimator" that aggregates predictions from several "base learners".

There are several ways to do this.

Most simply, we could build several models and then take an **average** of their predictions. But there are also more sophisticated techniques, of which we shall explore two in depth:
- **bagging**, which depends on the idea of bootstrapping; and
- **boosting**, which depends on the idea of using a base estimator's errors to train the next base estimator.
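To make the boosting idea concrete, here is a minimal two-stage sketch. It is not a full boosting algorithm: it assumes synthetic data and shallow decision trees as base learners, and all the names (`X_demo`, `stage1`, `stage2`) are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the notebook's data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=200)

# Stage 1: fit a shallow tree to the target
stage1 = DecisionTreeRegressor(max_depth=2, random_state=0)
stage1.fit(X_demo, y_demo)
residuals = y_demo - stage1.predict(X_demo)

# Stage 2: fit a second shallow tree to the first tree's errors
stage2 = DecisionTreeRegressor(max_depth=2, random_state=0)
stage2.fit(X_demo, residuals)

# The boosted prediction adds the error correction to the first prediction
boosted_preds = stage1.predict(X_demo) + stage2.predict(X_demo)

mse_stage1 = np.mean((y_demo - stage1.predict(X_demo)) ** 2)
mse_boosted = np.mean((y_demo - boosted_preds) ** 2)
print(mse_stage1, mse_boosted)
```

On the training data the second stage cannot increase the squared error, since each leaf of the second tree predicts the mean residual of the points it contains.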

Let's start with the averaging technique:

## Averaging

In [None]:
wb = xlrd.open_workbook('data/Sales Report.xls',
                        logfile=open(os.devnull, 'w'))
sales = pd.read_excel(wb)

sales.head()

In [None]:
sales.info()

In [None]:
sales.isna().sum().sum()

In [None]:
sales = sales.dropna()

# Keep only the numeric columns, then drop the row identifier
sales = sales.select_dtypes(include='number').drop('Row ID', axis=1)

In [None]:
zip_dums = pd.get_dummies(sales['Postal Code'])
sales_zips = pd.concat([sales, zip_dums], axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(sales_zips.drop('Sales', axis=1),
                                                   sales_zips['Sales'],
                                                   random_state=6)

### Model 1

In [None]:
lr = LinearRegression()

lr.fit(X_train, y_train)

In [None]:
scores = cross_val_score(estimator=lr, X=X_train,
                        y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
lr.score(X_test, y_test)

### Model 2

In [None]:
knn = KNeighborsRegressor()

knn.fit(X_train, y_train)

Let's try cross-validating this model:

In [None]:
scores = cross_val_score(estimator=knn, X=X_train,
                y=y_train, cv=10)
np.median(scores)

In [None]:
knn.score(X_test, y_test)

### Model 3

In [None]:
rt = DecisionTreeRegressor(random_state=1)

rt.fit(X_train, y_train)

In [None]:
scores = cross_val_score(estimator=rt, X=X_train,
               y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
rt.score(X_test, y_test)

### Averaging

To build our simple averaging meta-estimator, we'll just average the predictions of the three base estimators:

In [None]:
meta_preds = sum([lr.predict(X_test), knn.predict(X_test),
                  rt.predict(X_test)]) / 3

Now we can evaluate our meta-estimator:

In [None]:
r2_score(y_test, meta_preds)

#### Building a VotingRegressor

In [None]:
avg = VotingRegressor(estimators=[
    ('lr', lr),
    ('knn', knn),
    ('rt', rt)])
avg.fit(X_train, y_train)

In [None]:
scores = cross_val_score(estimator=avg, X=X_train,
               y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
# This should be the same as above!

avg.score(X_test, y_test)
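If you'd like to convince yourself that the `VotingRegressor` really is just averaging, the sketch below compares the two approaches directly. It assumes a small synthetic dataset from `make_regression` rather than the sales data, and uses the names `Xd_train`, `Xd_test`, and so on to avoid overwriting the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=6)

base = [('lr', LinearRegression()),
        ('knn', KNeighborsRegressor()),
        ('rt', DecisionTreeRegressor(random_state=1))]

# Manual average of the three base estimators' predictions
manual = np.mean([est.fit(Xd_train, yd_train).predict(Xd_test)
                  for _, est in base], axis=0)

# VotingRegressor fits its own copies of the same estimators
voter = VotingRegressor(estimators=base).fit(Xd_train, yd_train)

same = np.allclose(manual, voter.predict(Xd_test))
print(same)
```

Fixing `random_state` on the tree matters here: without it, the manually fitted tree and the copy inside the `VotingRegressor` could break ties differently.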

### Weighted Averaging

This meta-estimator scores worse than one of our base estimators, so in this case simple averaging did not help. Since the decision tree is performing better than both the linear regression and the k-nearest-neighbors model, however, we might instead build a meta-estimator from a **weighted average** of the base estimators' predictions, weighting (or biasing) it in favor of the best-performing base estimator. Suppose we weight the tree 70%, the knn model 20%, and the linear regression 10%:

In [None]:
weighted_preds = sum([0.1 * lr.predict(X_test), 0.2 * knn.predict(X_test),
                     0.7 * rt.predict(X_test)])

Now we can evaluate this new meta-estimator:

In [None]:
r2_score(y_test, weighted_preds)

#### Weighted Averaging with the VotingRegressor

In [None]:
w_avg = VotingRegressor(estimators=[
    ('lr', lr),
    ('knn', knn),
    ('rt', rt)],
                       weights=[0.1, 0.2, 0.7])
w_avg.fit(X_train, y_train)

In [None]:
scores = cross_val_score(estimator=w_avg, X=X_train,
                        y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
# This should be the same as above!

w_avg.score(X_test, y_test)

## Bagging

A single decision tree will often overfit your training data. Let's see if we have evidence of that in the current case:

In [None]:
rt.score(X_train, y_train)

**Question**: What is this score? And why is it equal to 1?

This perfect score on the training data is already evidence of model overfitting. There are steps one can take to help with this, like limiting the depth of the tree. And of course we can use cross-validation to get a more honest estimate of model quality:

In [None]:
scores = cross_val_score(estimator=rt, X=X_train,
                y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
rt.score(X_test, y_test)
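Limiting the depth of a tree, as mentioned above, might look something like the sketch below. It assumes synthetic data from `make_regression` in place of the sales data, and the choice of `max_depth=3` is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=400, n_features=8,
                                 noise=25.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=6)

# An unrestricted tree keeps splitting until every leaf is pure
deep_tree = DecisionTreeRegressor(random_state=1).fit(Xd_train, yd_train)

# A depth-limited tree has to stop early and generalize
shallow_tree = DecisionTreeRegressor(max_depth=3,
                                     random_state=1).fit(Xd_train, yd_train)

for name, tree in [('deep', deep_tree), ('shallow', shallow_tree)]:
    print(name,
          'depth:', tree.get_depth(),
          'train R^2:', round(tree.score(Xd_train, yd_train), 3),
          'test R^2:', round(tree.score(Xd_test, yd_test), 3))
```

The unrestricted tree memorizes the training set, which is why its training score is perfect. Whether the shallow tree does better on the test set depends on the data.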

But it's often better to do something else: Plant another tree!

Of course, if a second tree is going to be of any value, it has to be *different from* the first. Here's a good algorithm for achieving that:

### Bagging Algorithm

Draw a random sample, with replacement, from your training data and fit a decision tree to it. <br/>
Return that sample to the pool and repeat with a fresh sample. <br/>
When you've got as many trees as you like, make use of all your individual trees' predictions to come up with some holistic prediction. (Most obviously, we could take the average of our predictions, but there are other methods we might try.)

<br/>

Because we're resampling our data with replacement, we're *bootstrapping*. <br/>
Because we're making use of our many samples' predictions, we're *aggregating*. <br/>
Because we're bootstrapping and aggregating all in the same algorithm, we're *bagging*.
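The algorithm above can be sketched by hand in a few lines. This is only an illustration under stated assumptions (synthetic data, 25 trees, bootstrap samples the same size as the training set), not how `BaggingRegressor` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=6,
                                 noise=20.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=6)

rng = np.random.default_rng(1)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap: draw row indices with replacement
    idx = rng.integers(0, len(Xd_train), size=len(Xd_train))
    tree = DecisionTreeRegressor(random_state=0)
    trees.append(tree.fit(Xd_train[idx], yd_train[idx]))

# Aggregate: average the individual trees' predictions
bagged_preds = np.mean([t.predict(Xd_test) for t in trees], axis=0)
print(bagged_preds[:5])
```

Each tree sees a slightly different version of the training data, which is what makes the trees differ from one another.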

In [None]:
# Instantiate a BaggingRegressor

bag = BaggingRegressor(max_features=0.5,
                       random_state=1)

In [None]:
# Fit it

bag.fit(X_train, y_train)

In [None]:
# Cross-validation

scores = cross_val_score(estimator=bag, X=X_train,
               y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
# Score on test

bag.score(X_test, y_test)

#### Change the base estimator

In [None]:
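# Pass the base estimator positionally: its keyword is `base_estimator`
# in older scikit-learn releases and `estimator` in newer ones.
bag = BaggingRegressor(knn, random_state=1)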

In [None]:
bag.fit(X_train, y_train)

In [None]:
bag.score(X_test, y_test)

### Fitting a Random Forest

Let's add an extra layer of randomization: Instead of using *all* the features of my model to optimize a branch at each node, I'll just choose a subset of my features.

That's the essence of a random forest model. Note that there are now **two** levels of random sampling happening: To build a new tree, I'll be taking only some of my data points; and at any branching point in a tree, I'll be using only some of my features to determine the split.

In [None]:
# Instantiate a RandomForestRegressor

rfr = RandomForestRegressor(max_features='sqrt',
                            max_samples=0.5,
                            random_state=1)

In [None]:
# Fit it

rfr.fit(X_train, y_train)

In [None]:
# Cross-validation

scores = cross_val_score(estimator=rfr, X=X_train,
               y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
# Score on test

rfr.score(X_test, y_test)
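To see the feature sampling at work, we can inspect one of the fitted trees inside a forest. The sketch below assumes synthetic data with 16 features, so that `max_features='sqrt'` should leave each split with 4 candidate features.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with 16 features (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=16,
                                 noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10,
                               max_features='sqrt',
                               max_samples=0.5,
                               random_state=1).fit(X_demo, y_demo)

first_tree = forest.estimators_[0]
print('trees in the forest:', len(forest.estimators_))
print('features considered per split:', first_tree.max_features_)
```

Note that `max_samples=0.5` controls the first level of sampling (how many rows each tree is trained on), while `max_features` controls the second (how many columns each split may consider).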

### Fitting a Stand of Extremely Randomized Trees (Extra Trees)

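Sometimes we might want even one more bit of randomization. Instead of always searching for the *optimal* split point, we might just draw split points at random. In Scikit-Learn's implementation, a random threshold is drawn for each candidate feature and the best of those random candidates is used as the split. If we're doing that, then we've got extremely randomized trees.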

There are now **three** levels of randomization: sampling of data, sampling of features, and random selection of branching paths.

In [None]:
# Instantiate an ExtraTreesRegressor

etr = ExtraTreesRegressor(max_features='sqrt',
                         max_samples=0.5,
                         bootstrap=True,
                         random_state=1)

In [None]:
# Fit it

etr.fit(X_train, y_train)

In [None]:
# Cross-validation

scores = cross_val_score(estimator=etr, X=X_train,
               y=y_train, cv=10)
scores

In [None]:
np.median(scores)

In [None]:
# Score on test

etr.score(X_test, y_test)
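One way to see how extra trees differ from a random forest is to compare the settings of their fitted trees. The sketch below assumes synthetic data and default hyperparameters. It also suggests why we passed `bootstrap=True` above: by default, `ExtraTreesRegressor` does not resample the rows.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=16,
                                 noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=5,
                               random_state=1).fit(X_demo, y_demo)
extra = ExtraTreesRegressor(n_estimators=5,
                            random_state=1).fit(X_demo, y_demo)

# How each ensemble's trees choose their split points
forest_splitter = forest.estimators_[0].get_params()['splitter']
extra_splitter = extra.estimators_[0].get_params()['splitter']
print('splitter:', forest_splitter, 'vs', extra_splitter)

# Whether each ensemble bootstraps its rows by default
print('bootstrap:', forest.bootstrap, 'vs', extra.bootstrap)
```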

## Gridsearching

One method of hyperparameter tuning is **gridsearching**. The idea is to build multiple models with different hyperparameter values and then see which one performs the best. The hyperparameters and the values to try form a sort of *grid* along which we are looking for the best performance.

Scikit-Learn has a `GridSearchCV` class whose `fit()` method runs this procedure. Note that this can be quite computationally expensive since:

- A model is constructed for each combination of hyperparameter values that we input; and
- Each model is cross-validated.
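To get a feel for how quickly the cost grows, we can count the fits before running anything. The sketch below uses a made-up grid (`demo_grid`) with a decision tree on the iris data, all of which are assumptions for illustration. Three values for each of two hyperparameters give nine combinations, and three-fold cross-validation then requires 27 fits, plus one final refit of the best model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier

# A hypothetical grid, chosen only for illustration
demo_grid = {'max_depth': [2, 4, 6],
             'min_samples_leaf': [1, 5, 10]}
n_folds = 3

# Every combination of values is one candidate model
n_combos = len(ParameterGrid(demo_grid))
n_fits = n_combos * n_folds
print(n_combos, 'candidates,', n_fits, 'cross-validation fits')

# The search itself evaluates exactly those candidates
X_iris, y_iris = load_iris(return_X_y=True)
demo_gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       param_grid=demo_grid, cv=n_folds)
demo_gs.fit(X_iris, y_iris)
print(len(demo_gs.cv_results_['params']))
```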

In [None]:
# GridSearching is computationally expensive, and the sales dataset is
# large, so we'll illustrate the tool with a smaller dataset.

penguins = sns.load_dataset('penguins')

In [None]:
penguins.info()

In [None]:
penguins.head()

### Data Prep

We'll try to predict species given the other columns' values. Let's dummy-out `island` and `sex`:

In [None]:
penguins.isna().sum().sum()

In [None]:
penguins = penguins.dropna()

In [None]:
y = penguins.pop('species')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    penguins, y, random_state=42)

In [None]:
X_train_cat = X_train.select_dtypes('object')

In [None]:
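# Note: `sparse_output` and `get_feature_names_out()` assume
# scikit-learn >= 1.2 (older releases used `sparse` and `get_feature_names()`)
ohe = OneHotEncoder(drop='first',
                    sparse_output=False)
dums = ohe.fit_transform(X_train_cat)
dums_df = pd.DataFrame(dums,
                       columns=ohe.get_feature_names_out(),
                       index=X_train_cat.index)
X_train_clean = pd.concat([X_train.select_dtypes('float64'),
                           dums_df], axis=1)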

In [None]:
X_train_clean.head()

In [None]:
rfc = RandomForestClassifier(random_state=1)

rfc.fit(X_train_clean, y_train)

In [None]:
scores = cross_val_score(estimator=rfc, X=X_train_clean,
               y=y_train, cv=10)
scores

In [None]:
np.median(scores)

### Preparing the Test Set

In [None]:
X_test_cat = X_test.select_dtypes('object')

# Reuse the encoder fitted on the training set; do not refit on the test data
test_dums = ohe.transform(X_test_cat)
test_dums_df = pd.DataFrame(test_dums,
                            columns=ohe.get_feature_names_out(),
                            index=X_test_cat.index)
X_test_clean = pd.concat([X_test.select_dtypes('float64'),
                          test_dums_df], axis=1)

In [None]:
rfc.score(X_test_clean, y_test)

### `GridSearchCV`

In [None]:
# Define the parameter grid

grid = {
    'max_features': ['sqrt', 'log2', 0.5],
    'criterion': ['gini', 'entropy']
}

**Question: How many models will we be constructing with this grid?**

In [None]:
# Initialize the gridsearch object with five-fold cross-validation

gs = GridSearchCV(estimator=rfc, param_grid=grid, cv=5)

In [None]:
gs.fit(X_train_clean, y_train)

In [None]:
gs.best_params_

In [None]:
gs.best_score_

In [None]:
gs.best_estimator_.score(X_test_clean, y_test)

In [None]:
gs.cv_results_
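The `cv_results_` attribute is a dictionary of arrays, which can be hard to read as raw output. One option is to load it into a DataFrame and sort by rank. The sketch below assumes the iris data and a smaller forest (`n_estimators=20`) so that it runs on its own. The same pattern should apply to the `gs` object above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data and model (assumptions for illustration)
X_iris, y_iris = load_iris(return_X_y=True)
demo_gs = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=1),
    param_grid={'max_features': ['sqrt', 'log2', 0.5],
                'criterion': ['gini', 'entropy']},
    cv=5)
demo_gs.fit(X_iris, y_iris)

# One row per hyperparameter combination, best-ranked first
results = (pd.DataFrame(demo_gs.cv_results_)
           .loc[:, ['params', 'mean_test_score',
                    'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(results)
```

The top row's `mean_test_score` matches `best_score_`, and its `params` entry matches `best_params_` unless several combinations tie for first place.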

## Exercise

Use a Random Forest Classifier to predict the category of price range for the phones in this dataset. Try tuning some hyperparameters using GridSearch, and then write up a short paragraph about your findings.

In [None]:
phones_train = pd.read_csv('data/train.csv')

phones_test = pd.read_csv('data/test.csv')