![title](img/logo_white_full.png)

# Stacking Tutorial

Stacking (or stacked generalization) is a method of combining different models with the aim of **improving predictive performance**. In its simplest form, a set of base estimators is trained, and their predictions are stacked together to serve as the training data for another model.

The gain in performance is usually small. However, if **predictive accuracy** is crucial for our clients and a 1% increase in accuracy can mean a huge competitive advantage, then stacking is a great fit for them!

This part of the tutorial is largely based on Dawid Kopczyk's [post about stacking](http://dkopczyk.quantee.co.uk/stacking/). Please read it before continuing, as this notebook contains only the implementation.

The tutorial is based on the [Allstate Claims Severity Kaggle competition data](https://www.kaggle.com/c/allstate-claims-severity).

In [20]:
import pickle # Load and save Python objects
from copy import copy as make_copy

import numpy as np # Arrays
import pandas as pd # Data-Frames
from plotly.offline import init_notebook_mode # Plotly

from sklearn.linear_model import LinearRegression, Lasso # Linear models
from sklearn.ensemble import RandomForestRegressor # Random Forest
from lightgbm import LGBMRegressor # LightGBM Regressor

from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, make_scorer # MAE

import warnings # Ignore annoying warnings
warnings.filterwarnings('ignore')

# Required for Jupyter to produce in-line Plotly graphs
init_notebook_mode(connected=True)

---
## Data
Load the training and testing data that we created in the Data Processing Tutorial.

In [49]:
with open('data/data.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
print(X_train.shape)
print(X_test.shape)

(141238, 130)
(47080, 130)


---
## A note on MAE
Mean Absolute Error (MAE) is the metric used for evaluation in this Kaggle competition. In the Data Processing Tutorial, we log-transformed the losses to produce a less skewed distribution. Since MAE should be calculated on the untransformed losses, let's define a custom scorer that maps both the targets and the predictions back with the exponential function before computing the error.

In [50]:
def mae_from_logs_score(y_true, y_pred):
    return mean_absolute_error(np.exp(y_true), np.exp(y_pred))

mae_from_logs_scorer = make_scorer(mae_from_logs_score, greater_is_better=False)
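
To see why the back-transformation matters, here is a small worked example using only the standard library. The loss values and the 10% prediction error are made up for illustration, and the sketch assumes a plain natural-log transform, as implied by the use of `np.exp` in the scorer above.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error of two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical losses in original units, with predictions that are 10% too high
losses = [1000.0, 2000.0, 4000.0]
predictions = [1100.0, 2200.0, 4400.0]

# What the models actually work with: natural logs of the values
log_losses = [math.log(v) for v in losses]
log_predictions = [math.log(v) for v in predictions]

# MAE on the log scale: every row contributes log(1.1), whatever the claim size
mae_log_scale = mae(log_losses, log_predictions)

# MAE after mapping back with exp: expressed in original loss units
mae_original_scale = mae([math.exp(v) for v in log_losses],
                         [math.exp(v) for v in log_predictions])

print(round(mae_log_scale, 4))
print(round(mae_original_scale, 2))
```

On the log scale the error is about 0.095 for every row, while on the original scale the same predictions are off by 100, 200 and 400, giving an MAE of about 233. Only the second number is comparable to the competition metric.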

---
## Stacking
Let’s define the base estimators of our stacking procedure and the stacking regressor itself. Lasso, Random Forest, and LightGBM regressors are used at the base level, and Linear Regression is used as the stacking estimator.

In [52]:
base_reg = [Lasso(alpha=0.001, random_state=2019), 
            RandomForestRegressor(random_state=2019),
            LGBMRegressor(n_jobs=-1, random_state=2019)]
stck_reg = LinearRegression()

We can check each model's MAE separately:

In [53]:
for reg in base_reg:
    
    # Fit model
    reg.fit(X_train, y_train)
    
    # Predict
    y_pred = reg.predict(X_test)
    
    # Calculate MAE
    print('The MAE calculated on testing data for {0} = {1:.2f}'.format(reg.__class__.__name__, 
                                                                        mae_from_logs_score(y_test, y_pred)))

The MAE calculated on testing data for Lasso = 1292.62
The MAE calculated on testing data for RandomForestRegressor = 1258.01
The MAE calculated on testing data for LGBMRegressor = 1152.14


The best model is LightGBM with MAE=1152.14. Can we improve it with stacking? :)

We will implement a function that creates hold-out predictions, which are stacked together across folds to form a meta-feature column for each estimator. We need to pass:
* ```reg``` – an object representing the base regressor,
* ```X```, ```y``` – training data and target,
* ```cv``` – a scikit-learn cross-validation object, for example ```KFold```.

The recipe is simple: divide ```X``` and ```y``` into folds according to the passed ```cv```; for each hold-out fold, fit the estimator on the remaining folds, create predictions for the hold-out rows, and save them to the meta-feature column.
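
The recipe can be seen on a toy example with plain NumPy. This is only an illustrative sketch: the data, the consecutive fold split via `np.array_split` (which mirrors what an unshuffled `KFold` does), and the stand-in "model" that simply predicts the mean of the training targets are assumptions made for the example, not part of the tutorial's pipeline.

```python
import numpy as np

# Toy data: 8 rows, one feature, target equal to the feature (illustrative only)
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.arange(8, dtype=float)

n_splits = 4
all_idx = np.arange(len(X))
folds = np.array_split(all_idx, n_splits)  # consecutive blocks of row indices

meta_feature = np.full(len(X), np.nan)
for hold_out_idx in folds:
    # Rows outside the current fold form the training part
    train_idx = np.setdiff1d(all_idx, hold_out_idx)

    # Stand-in "model": predict the mean of the training targets
    prediction = y[train_idx].mean()

    # Each hold-out row receives a prediction from a model that never saw it
    meta_feature[hold_out_idx] = prediction

print(meta_feature)
```

After the loop every row has exactly one prediction, and none of them was produced by a model trained on that row. This is what keeps the meta features from leaking the target into the stacking estimator.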

In [57]:
def hold_out_predict(reg, X, y, cv):
        
    """Performing cross validation hold out predictions for stacking"""
    
    # Initialize
    meta_features = np.zeros(X.shape[0]) 
    n_splits = cv.get_n_splits(X, y)
    
    # Loop over folds
    print("Starting hold out prediction with {} splits for {}.".format(n_splits, reg.__class__.__name__))
    for train_idx, hold_out_idx in cv.split(X, y): 
        
        # Split data
        X_train = X[train_idx]    
        y_train = y[train_idx]
        X_hold_out = X[hold_out_idx]

        # Fit estimator to K-1 parts and predict on hold out part
        est = make_copy(reg)
        est.fit(X_train, y_train)
        y_hold_out_pred = est.predict(X_hold_out)
        
        # Fill in meta features
        meta_features[hold_out_idx] = y_hold_out_pred

    return meta_features

Then, we use the function just defined to create meta features from all base estimators. We will use 4-fold CV. Additionally, we retrain each base model on the full training data and use its predictions on the test set as the testing meta features.

In [58]:
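# Define 4-fold CV (no random_state: it has no effect without shuffle=True,
# and newer scikit-learn versions raise an error for that combination)
cv = KFold(n_splits=4)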

# Loop over regressors to produce meta features
meta_train = []
for reg in base_reg:
    
    # Create hold out predictions for a regressor
    meta_train_reg = hold_out_predict(reg, X_train, y_train, cv)
    
    # Gather meta training data
    meta_train.append(meta_train_reg)
    
meta_train = np.array(meta_train).T

meta_test = []
for reg in base_reg:
    
    # Refit on the full training data and predict on the test set
    reg.fit(X_train, y_train)
    meta_test_reg = reg.predict(X_test)
    
    # Gather meta testing data
    meta_test.append(meta_test_reg)
    
meta_test = np.array(meta_test).T

Starting hold out prediction with 4 splits for Lasso.
Starting hold out prediction with 4 splits for RandomForestRegressor.
Starting hold out prediction with 4 splits for LGBMRegressor.


With ```meta_train``` and ```meta_test``` in hand, we are ready to fit the stacking regressor.

In [59]:
# Fit model
stck_reg.fit(meta_train, y_train)

# Predict
y_pred = stck_reg.predict(meta_test)
print('The MAE calculated on testing data for stacking {0} = {1:.2f}'.format(stck_reg.__class__.__name__, 
                                                                             mae_from_logs_score(y_test, y_pred)))

The MAE calculated on testing data for stacking LinearRegression = 1149.12
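
Because the stacking estimator is a linear regression, the final prediction is a weighted sum of the base predictions plus an intercept. The sketch below illustrates this on synthetic data with NumPy only: the target, the three noisy "base model" columns and their noise levels are invented for the example, and `np.linalg.lstsq` stands in for the ordinary least-squares fit that `LinearRegression` performs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic target and three "base model" predictions with different noise levels
y = rng.normal(size=500)
meta = np.column_stack([
    y + rng.normal(scale=0.8, size=500),  # weak model
    y + rng.normal(scale=0.5, size=500),  # medium model
    y + rng.normal(scale=0.2, size=500),  # strong model
])

# Ordinary least squares with an intercept column
design = np.column_stack([np.ones(len(y)), meta])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, weights = coef[0], coef[1:]

# Compare squared error of each base column with the stacked combination
stacked = design @ coef
base_mse = ((meta - y[:, None]) ** 2).mean(axis=0)
stacked_mse = ((stacked - y) ** 2).mean()

print("weights:", np.round(weights, 3))
print("base MSE:", np.round(base_mse, 3))
print("stacked MSE:", round(float(stacked_mse), 3))
```

On the data it was fitted to, the least-squares combination can never have a higher mean squared error than any single base column, since using one column alone is a special case of the weighted sum. That guarantee does not carry over to MAE or to unseen data, which is why the improvement on the test set has to be measured rather than assumed.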


**The stacking regressor's MAE (1149.12) is lower than that of the best single base model (1152.14)!**
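
For comparison, scikit-learn (version 0.22 and later) ships a `StackingRegressor` that wraps the same steps: out-of-fold predictions for training the final estimator, and base models refitted on the full training data for prediction. The sketch below is not the code used in this notebook. It runs on synthetic data from `make_regression` rather than the Allstate data, leaves out LightGBM to keep the dependencies minimal, and uses a small `n_estimators` purely to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Allstate data)
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=2019)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2019)

stack = StackingRegressor(
    estimators=[
        ("lasso", Lasso(alpha=0.001, random_state=2019)),
        ("rf", RandomForestRegressor(n_estimators=20, random_state=2019)),
    ],
    final_estimator=LinearRegression(),
    cv=4,  # out-of-fold predictions from 4 folds feed the final estimator
)
stack.fit(X_tr, y_tr)

meta_te = stack.transform(X_te)  # one column of base predictions per estimator
y_pred = stack.predict(X_te)
print(meta_te.shape, y_pred.shape)
```

Writing the loop by hand, as in this notebook, makes each step visible and easy to customize. The built-in class is convenient once the procedure is understood.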