## Stacking

#### Table of Contents

- [Preliminaries](#Preliminaries)
- [Base Learners](#Base-Learners)
    - [Ridge](#Ridge)
    - [KNN](#KNN)
    - [RF](#RF)
    - [Best Base Learner](#Best-Base-Learner)
- [Average](#Average)
- [Weighted Average](#Weighted-Average)
- [Model 1: OLS Average](#Model-1:-OLS-Average)
- [Model 2: RF Aggregation](#Model-2:-RF-Aggregation)
- [Comparison](#Comparison)

*************
# Preliminaries
[TOP](#Stacking)

We will combine the following base learners, each predicting `pct_d_rgdp`, into an ensemble using the aggregation techniques listed in the table of contents:

1. Ridge Regression
2. KNN
3. RF

In [1]:
%run metrics.py
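
`metrics.py` itself is not shown in this notebook. As a rough guide to what the helpers used below do, here is a minimal sketch of `stdz`, `r2`, and `rmse`. These definitions are assumptions, not the course file; in particular the argument order `r2(yhat, y)` is inferred from how the function is called in later cells.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the helpers in metrics.py (not the original file).

def stdz(x):
    """Standardize to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

def r2(yhat, y):
    """R^2 = 1 - SSE/SST, with predictions passed first (assumed order)."""
    yhat, y = np.asarray(yhat, dtype=float), np.asarray(y, dtype=float)
    sse = np.sum((y - yhat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

def rmse(yhat, y):
    """Root mean squared error, same assumed argument order as r2."""
    yhat, y = np.asarray(yhat, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.mean((y - yhat) ** 2))

y_demo = pd.Series([1.0, 2.0, 3.0, 4.0])
print(r2(y_demo, y_demo))                      # perfect predictions
print(r2(np.full(4, y_demo.mean()), y_demo))   # always predicting the mean
print(stdz(y_demo).mean(), stdz(y_demo).std())
```

Note that $R^2$ can be negative on held-out data whenever a model predicts worse than the sample mean of the target.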

In [2]:
# utilities
import numpy as np
import pandas as pd
from tqdm import tqdm

# processing
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# algorithms
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

Loading in the data

In [9]:
df = pd.read_pickle('C:/Users/hubst/Econ490_group/class_data.pkl')

We are going to exclude the fixed effect features for `year` to reduce the number of features.

In [10]:
df = df.drop(columns = ['year', 'urate_bin', 'GeoName']).join([
    pd.get_dummies(df['urate_bin'], drop_first = True)
])

We choose to standardize all of our features.

Remember, we need to obtain the data for

- `train1`
- `train2`
- `test`

In [11]:
y = df['pct_d_rgdp']
x = df.drop(columns = 'pct_d_rgdp')

x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                   train_size = 2/3,
                                                   random_state = 490)

x_train1, x_train2, y_train1, y_train2 = train_test_split(x_train, y_train,
                                                         train_size = 1/2,
                                                         random_state = 490)
x_train1 = x_train1.apply(stdz)
x_train2 = x_train2.apply(stdz)
x_test = x_test.apply(stdz)

Removing what we do not need

In [12]:
%who

GridSearchCV	 KFold	 KNeighborsRegressor	 LinearRegression	 RandomForestRegressor	 RidgeCV	 acc	 df	 np	 
pd	 r2	 rmse	 stdz	 tqdm	 train_test_split	 x	 x_test	 x_train	 
x_train1	 x_train2	 y	 y_test	 y_train	 y_train1	 y_train2	 


In [13]:
del df, x_train, y_train

In [14]:
%who

GridSearchCV	 KFold	 KNeighborsRegressor	 LinearRegression	 RandomForestRegressor	 RidgeCV	 acc	 np	 pd	 
r2	 rmse	 stdz	 tqdm	 train_test_split	 x	 x_test	 x_train1	 x_train2	 
y	 y_test	 y_train1	 y_train2	 


***********
# Base Learners
[TOP](#Stacking)

In this demonstration, we are only going to use 3 base learners.
However, there is nothing stopping you from using more.
In fact, you may find that the more learners you have, the better your model.

However, once you start to include a larger number of base learners, you may want to consider using regularization to aggregate their predictions.
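
To illustrate the idea of a regularized aggregator, here is a minimal sketch on synthetic data. It is not part of the analysis below: the prediction matrix `preds`, the number of learners, and the choice of `LassoCV` as the stacker are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(490)
n, n_learners = 500, 20

# Synthetic stand-in for a matrix of base-learner predictions (one column each).
y = rng.normal(size=n)
preds = rng.normal(size=(n, n_learners))
# The first 3 "learners" track y with noise; the remaining 17 are pure noise.
preds[:, :3] += y[:, None]

# An L1-penalized stacker can shrink the weights of unhelpful learners toward zero.
stack_lasso = LassoCV(cv=5).fit(preds, y)

print(np.round(stack_lasso.coef_, 3))
print('learners with nonzero weight:',
      int(np.sum(stack_lasso.coef_ != 0)), 'of', n_learners)
```

With many base learners, an unpenalized stacker has more weights to estimate from the same `train2` sample, which is where a penalty like this may help.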

*************
## Ridge
[TOP](#Stacking)

We will be using a ridge regression function from `sklearn`, which means we do not need to append an intercept to the features.

In [15]:
reg_ridge = RidgeCV(alphas = 10.**np.linspace(-2, 5, num = 20),
                   cv = 5).fit(x_train1, y_train1)
reg_ridge.alpha_

615.8482110660254

In [16]:
r2_ridge = reg_ridge.score(x_test, y_test)
r2_ridge

0.030267374601945285

***************
## KNN
[TOP](#Stacking)

Remember that KNN is relatively slow at predicting, which also makes it slow to tune by cross-validation.
All the other models we have used so far are at least relatively fast at predicting.

**Why is KNN slow at predicting?** *Hint: it is in its name!*

Let the CV begin! We are going to set a hard limit of 100 on the number of neighbors.

In [17]:
%%time
param_grid = {
    'n_neighbors': [5, 10, 25, 50, 75, 100]
}

knn_cv = KNeighborsRegressor()

grid_search = GridSearchCV(knn_cv, param_grid,
                          cv = 5,
                          scoring = 'neg_mean_squared_error',
                          verbose = 2,
                          n_jobs = 4).fit(x_train1, y_train1)

best_knn = grid_search.best_params_
best_knn

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Wall time: 6.75 s


[Parallel(n_jobs=4)]: Done  30 out of  30 | elapsed:    6.3s finished


{'n_neighbors': 100}

Now we refit the model with the selected number of neighbors.

In [18]:
reg_knn = KNeighborsRegressor(n_neighbors = best_knn['n_neighbors'])
reg_knn.fit(x_train1, y_train1)

r2_knn = reg_knn.score(x_test, y_test)
r2_knn

0.02748795325609321

*************
## RF
[TOP](#Stacking)

In [19]:
%%time
reg_rf = RandomForestRegressor(n_estimators = 500,
                              max_features = 'sqrt',
                              random_state = 490,
                              n_jobs = 4).fit(x_train1, y_train1)

r2_rf = reg_rf.score(x_test, y_test)
r2_rf

Wall time: 9.37 s


-0.11536385809625882

**************
## Best Base Learner
[TOP](#Stacking)

We can print out the base learners' $R^2$ performance.

In [20]:
r2_base = {
    'r2_ridge': r2_ridge,
    'r2_knn': r2_knn,
    'r2_rf': r2_rf
}
print(r2_base, '\n')

best_base = max(r2_base, key = r2_base.get)

print(best_base, ':', r2_base[best_base])

{'r2_ridge': 0.030267374601945285, 'r2_knn': 0.02748795325609321, 'r2_rf': -0.11536385809625882} 

r2_ridge : 0.030267374601945285


**********
# Average
[TOP](#Stacking)

Remember that the coefficients (the weights) are predetermined for a simple average.
They are specifically set to the inverse of the number of base learners. 
To see this, let $j$ denote the base learner index.

$$
\begin{align*}
    \bar{f}(x) & = \frac{1}{3}\sum_{j=1}^3 f_j(x)\\
    & = \frac{1}{3}f_1(x) + \frac{1}{3}f_2(x) + \frac{1}{3}f_3(x)\\
    & = w_1 f_1(x) + w_2 f_2(x) + w_3 f_3(x)
\end{align*}
$$


$$
MATH!!!!!
$$
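
As a quick numerical check of the identity above, the sketch below compares a row-wise mean with a dot product against equal weights $w_j = 1/3$. The prediction values are made up for illustration.

```python
import numpy as np

# Hypothetical predictions from 3 base learners (columns) on 4 observations (rows).
preds = np.array([[ 1.5,  1.0,  2.1],
                  [ 0.2,  0.4,  0.0],
                  [ 3.0,  2.5,  3.5],
                  [-1.0, -0.5, -1.5]])

w = np.full(3, 1 / 3)              # w_j = 1/3 for every base learner

avg_by_mean = preds.mean(axis=1)   # simple average across learners
avg_by_weights = preds @ w         # the same average written as a weighted sum

print(np.allclose(avg_by_mean, avg_by_weights))
```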

In [21]:
df_test_yhat = pd.DataFrame({
    'ridge': reg_ridge.predict(x_test),
    'knn': reg_knn.predict(x_test),
    'rf': reg_rf.predict(x_test)
},
index = y_test.index)
df_test_yhat.head(1)

| fips | year | ridge | knn | rf |
|------|------|-------|-----|----|
| 17195 | 2003 | 1.518422 | 0.961779 | 2.104121 |


In [22]:
r2_avg = r2(df_test_yhat.mean(axis = 1), y_test)
r2_avg

0.025247656239703153

************
# Weighted Average
[TOP](#Stacking)

In order to estimate a weighted average, we need a grid of candidate weight vectors, each of which sums to one.

In [25]:
step_size = 0.1
wts = np.arange(0, 1 + step_size, step = step_size)

wts_grid = np.array([(x, y, z) for x in wts for y in wts for z in wts])

print(wts_grid.shape, '\n')

keep = wts_grid.sum(axis = 1) == 1
wts_grid = wts_grid[keep]
wts_grid.shape

(1331, 3) 



(62, 3)
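
One caveat about the filter above: the weights are floating-point numbers, so an exact `== 1` test can miss combinations whose sum differs from 1 only by rounding error. With 11 candidate values per learner there are 66 triples of tenths that sum to one, yet 62 rows survive the exact comparison. The sketch below shows a tolerance-based alternative with `np.isclose`. It is an illustration only and is not the code that produced the results in this notebook; the names `grid`, `n_exact`, and `n_close` are ours.

```python
import numpy as np

step_size = 0.1
wts = np.arange(0, 1 + step_size, step=step_size)
grid = np.array([(a, b, c) for a in wts for b in wts for c in wts])
sums = grid.sum(axis=1)

n_exact = int((sums == 1).sum())           # exact floating-point comparison
n_close = int(np.isclose(sums, 1).sum())   # comparison within a small tolerance

print(n_exact, n_close)
```

Switching to the tolerance-based filter would change the size of the grid and the row indices used in the cells below, so the reported outputs would need to be regenerated.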

We are going to be using the predicted values on `train2` to identify the optimal weights.

It is more efficient to compute these predictions only once, so we store them in a data frame.

In [26]:
df_train2_yhat = pd.DataFrame({
    'ridge': reg_ridge.predict(x_train2),
    'knn': reg_knn.predict(x_train2),
    'rf': reg_rf.predict(x_train2)
},
index = y_train2.index)
df_train2_yhat.head(1)

| fips | year | ridge | knn | rf |
|------|------|-------|-----|----|
| 48333 | 2014 | 2.683669 | 1.241488 | 9.149495 |


Now we identify the optimal weights.

In [28]:
r2_grid = {}

# score every candidate weight vector on train2
for i, w in enumerate(tqdm(wts_grid)):
    yhat = df_train2_yhat @ w
    r2_grid[i] = r2(yhat, y_train2)

100%|█████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 476.81it/s]


In [31]:
best_indx = max(r2_grid, key = r2_grid.get)
best_wts = wts_grid[best_indx]
best_wts

array([0.4, 0.5, 0.1])

Saving the $R^2$...

In [32]:
yhat = df_test_yhat @ best_wts.T

r2_wtd_avg = r2(yhat, y_test)
r2_wtd_avg

0.03451625498360489

**************
# Model 1: OLS Average
[TOP](#Stacking)



**How is OLS an average?**

Well, with a slight abuse of notation, recall that in this case OLS takes the form

$$\hat{y} = \beta_0 + \hat{y}_1 \beta_1 + \hat{y}_2 \beta_2 + \hat{y}_3 \beta_3 $$

Here, $\beta_1$, $\beta_2$, and $\beta_3$ act as weights that are not constrained to sum to 1.
$\beta_0$ is a *bias* term. 
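
To make the formula concrete, here is a minimal sketch on synthetic data showing that a fitted `LinearRegression` stacker predicts exactly the bias plus the $\beta$-weighted base predictions. The target `y` and the matrix `yhat_base` are invented for the example and stand in for `y_train2` and the base-learner predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(490)
n = 200

# Synthetic target and three noisy "base-learner" predictions of it.
y = rng.normal(size=n)
yhat_base = np.column_stack(
    [y + rng.normal(scale=s, size=n) for s in (0.5, 1.0, 2.0)]
)

stack = LinearRegression().fit(yhat_base, y)

# yhat = beta_0 + yhat_1 * beta_1 + yhat_2 * beta_2 + yhat_3 * beta_3
manual = stack.intercept_ + yhat_base @ stack.coef_

print(np.allclose(manual, stack.predict(yhat_base)))
print('sum of weights:', stack.coef_.sum())   # nothing forces this to equal 1
```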

In [33]:
stack_ols = LinearRegression().fit(df_train2_yhat, y_train2)
print(stack_ols.intercept_, stack_ols.coef_)

-0.2047342283163347 [0.36456238 0.58770383 0.11487815]


In [34]:
stack_ols.intercept_ + sum(stack_ols.coef_)

0.8624101323576985

In [35]:
r2_stack_ols = stack_ols.score(df_test_yhat, y_test)
r2_stack_ols

0.034619873469761586

**************
# Model 2: RF Aggregation
[TOP](#Stacking)

We can also use different models as stackers.

Here we will use a random forest. 
We will use the usual `max_features = 'sqrt'`; however, we will also add `max_depth = 2` to keep the trees shallow.

In [37]:
stack_rf = RandomForestRegressor(n_estimators = 50,
                                max_features = 'sqrt',
                                max_depth = 2,
                                random_state = 490,
                                n_jobs = 3)
stack_rf.fit(df_train2_yhat, y_train2)

r2_stack_rf = stack_rf.score(df_test_yhat, y_test)
r2_stack_rf

0.031020593654738526

****************
# Comparison
[TOP](#Stacking)
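
The test-set scores reported in the earlier cells can be collected side by side. The sketch below is one way to do it; the values are copied (rounded to four decimals) from the outputs above, and the labels are ours. In the notebook itself you would pass the saved variables (`r2_ridge`, `r2_avg`, and so on) instead of literals.

```python
import pandas as pd

# Test R^2 values copied (rounded) from the outputs reported above.
r2_all = pd.Series({
    'ridge': 0.0303,
    'knn': 0.0275,
    'rf': -0.1154,
    'average': 0.0252,
    'weighted average': 0.0345,
    'stack: OLS': 0.0346,
    'stack: RF': 0.0310,
}, name='test R^2')

print(r2_all.sort_values(ascending=False))
```

On this split, the OLS stacker and the weighted average score slightly above the best single learner (ridge), while the simple average scores below it.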

In [38]:
%whos

Variable                Type                     Data/Info
----------------------------------------------------------
GridSearchCV            ABCMeta                  <class 'sklearn.model_sel<...>on._search.GridSearchCV'>
KFold                   ABCMeta                  <class 'sklearn.model_selection._split.KFold'>
KNeighborsRegressor     ABCMeta                  <class 'sklearn.neighbors<...>ion.KNeighborsRegressor'>
LinearRegression        ABCMeta                  <class 'sklearn.linear_mo<...>._base.LinearRegression'>
RandomForestRegressor   ABCMeta                  <class 'sklearn.ensemble.<...>t.RandomForestRegressor'>
RidgeCV                 ABCMeta                  <class 'sklearn.linear_model._ridge.RidgeCV'>
acc                     function                 <function acc at 0x0000022530A82310>
best_base               str                      r2_ridge
best_indx               int                      41
best_knn                dict                     n=1
best_wts              