## Stacked Generalization

The procedure for 5-fold stacking can be described as follows:

1. Split the full training set into two disjoint sets (here, train and holdout).

2. Train several base models on the first part (train).

3. Use these base models to predict on the second part (holdout).

4. Repeat steps 1-3 five times, each time with a different fold as the holdout, so that every training row receives a holdout prediction. Use the holdout predictions as the inputs and the correct responses (target variable) as the outputs to train a higher-level learner called the meta-model.


- For the test set, we could either average the test predictions of the base models fitted on each fold, or refit each base model on the whole training set and then predict. Generally speaking, either way is fine because the models never see the test set during training.
- If we ran 10 models using the same procedure, our meta-model would have 10 input features.
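
The steps above can be sketched roughly as follows. This is a minimal illustration, not the `stacking_regression` helper used later in this notebook: the synthetic data from `make_regression`, the two base models, and the names `S_train` and `S_test` are all assumptions made for the example. It uses the "average the per-fold test predictions" variant for the test set.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; not the housing data used in this notebook)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, y_train = X[:150], y[:150]
X_test = X[150:]

# Two illustrative base models (hypothetical choices)
base_models = [
    Ridge(alpha=1.0),
    RandomForestRegressor(n_estimators=50, max_depth=4, random_state=0),
]

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One column of holdout (out-of-fold) predictions per base model
S_train = np.zeros((X_train.shape[0], len(base_models)))
S_test = np.zeros((X_test.shape[0], len(base_models)))

for j, model in enumerate(base_models):
    test_preds = np.zeros((X_test.shape[0], kf.get_n_splits()))
    for i, (tr_idx, ho_idx) in enumerate(kf.split(X_train)):   # step 1: split
        fold_model = clone(model)                              # fresh, unfitted copy
        fold_model.fit(X_train[tr_idx], y_train[tr_idx])       # step 2: train
        S_train[ho_idx, j] = fold_model.predict(X_train[ho_idx])  # step 3: predict holdout
        test_preds[:, i] = fold_model.predict(X_test)
    S_test[:, j] = test_preds.mean(axis=1)                     # average over the 5 folds

# Step 4: the meta-model learns from the holdout predictions
meta_model = LinearRegression()
meta_model.fit(S_train, y_train)
stacked_pred = meta_model.predict(S_test)
```

With two base models the meta-model sees two input features, one per column of `S_train`.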

![img](https://s3.amazonaws.com/nycdsabt01/stacking.jpg)

Borrowed from [Faron](https://www.kaggle.com/getting-started/18153#post103381)

As a quick note, one should try a few diverse models. In my experience, a good stacking solution is often composed of at least:
- 2 or 3 GBMs/XGBs/LightGBMs (one with low depth, one with medium and one with high)
- 1 or 2 Random Forests (again as diverse as possible: one low depth, one high)
- 1 linear model
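
As a rough illustration of such a mix, scikit-learn's built-in `StackingRegressor` (available since version 0.22) can combine these model families without any custom stacking code. The synthetic data, the estimator names, and every hyperparameter value below are assumptions chosen only to keep the example small and fast; they are not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import ElasticNet, LinearRegression

# Synthetic stand-in data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Diverse first-layer models: shallow and deep boosting, shallow and deep forests, one linear model
estimators = [
    ("gbm_shallow", GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=0)),
    ("gbm_deep", GradientBoostingRegressor(n_estimators=50, max_depth=6, random_state=0)),
    ("rf_shallow", RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)),
    ("rf_deep", RandomForestRegressor(n_estimators=50, max_depth=None, random_state=0)),
    ("linear", ElasticNet(alpha=0.1, random_state=0)),
]

# cv=5 makes the meta-model train on 5-fold out-of-fold predictions
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression(), cv=5)
stack.fit(X[:150], y[:150])
preds = stack.predict(X[150:])
```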

In [37]:
from sklearn.ensemble import VotingClassifier

In [38]:
from sklearn.linear_model import ElasticNet, LinearRegression as lr
from sklearn.ensemble import GradientBoostingRegressor as gbr, RandomForestRegressor as rfr

In [39]:
# Useful if you are debugging the function inside another .py script
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [40]:
import pandas as pd

houses_train = pd.read_csv('../Data/encoded_houses_train.csv')
houses_test = pd.read_csv('../Data/encoded_houses_test.csv')

In [41]:
houses_test.SalePrice.head()

0    0
1    0
2    0
3    0
4    0
Name: SalePrice, dtype: int64

In [42]:
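# Small demo subset: .loc slicing is label-based and inclusive, so this takes rows 1 through 20
X_train = houses_train.loc[1:20, houses_train.columns != "SalePrice"].values  # convert to np.array
y_train = houses_train.loc[1:20, houses_train.columns == "SalePrice"].values  # 2-D array, shape (20, 1)

# Build the column mask from houses_test itself rather than from houses_train
X_test = houses_test.loc[1:20, houses_test.columns != "SalePrice"].values  # convert to np.array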

In [43]:
y_train

array([[ 181500.],
       [ 223500.],
       [ 140000.],
       [ 250000.],
       [ 143000.],
       [ 307000.],
       [ 200000.],
       [ 129900.],
       [ 118000.],
       [ 129500.],
       [ 345000.],
       [ 144000.],
       [ 279500.],
       [ 157000.],
       [ 132000.],
       [ 149000.],
       [  90000.],
       [ 159000.],
       [ 139000.],
       [ 325300.]])

In [44]:
from stacking import stacking_regression
from sklearn.metrics import mean_squared_error
import numpy as np

In [45]:
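def rmsle(y, y_pred):
    # Root mean squared logarithmic error; log1p matches the target transform and stays defined at 0
    return np.sqrt(mean_squared_error(np.log1p(y), np.log1p(y_pred)))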

In [46]:
models = [
    # linear model, ElasticNet = lasso + ridge
    ElasticNet(random_state=0),
    
    # conservative random forest model
    rfr(random_state=0,
        n_estimators=1000, max_depth=6,  max_features='sqrt'),
    
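    # aggressive random forest model
    # ('auto' meant "use all features" for regressors and was removed in scikit-learn 1.3; 1.0 is equivalent)
    rfr(random_state=0, 
        n_estimators=1000, max_depth=9,  max_features=1.0),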
    
    # conservative gbm model
    gbr(random_state=0, learning_rate = 0.005, max_features='sqrt',
        min_samples_leaf=15, min_samples_split=10, 
        n_estimators=3000, max_depth=3),
    
    # aggressive gbm model
    gbr(random_state = 0, learning_rate = 0.01, max_features='sqrt',
        min_samples_leaf=10, min_samples_split=5, 
        n_estimators = 1000, max_depth = 9)
    
#     XGBRegressor(max_depth=3, 
#                         learning_rate=0.05, 
#                         n_estimators=1000, # Number of boosted trees to fit
#                         silent=False, # print messages while running 
#                         objective='reg:linear', 
#                         booster='gbtree', # Specify which booster to use: gbtree, gblinear or dart
#                         #for dart see http://xgboost.readthedocs.io/en/latest/tutorials/dart.html 
#                         n_jobs=-1, # Number of parallel threads used to run xgboost. (replaces nthread)
#                         gamma=0,  # Minimum loss reduction required to make a further partition on a leaf node of the tree.
#                         min_child_weight=1, # Minimum sum of instance weight(hessian) needed in a child
#                         max_delta_step=0, # Maximum delta step we allow each tree’s weight estimation to be
#                         subsample=1, # Subsample ratio of the training instance
#                         colsample_bytree=1, # Subsample ratio of columns when constructing each tree
#                         colsample_bylevel=1, # Subsample ratio of columns for each split, in each level
#                         reg_alpha=0, # L1 regularization term on weights
#                         reg_lambda=1, # L2 regularization term on weights
#                         scale_pos_weight=1, # Balancing of positive and negative weights
#                         base_score=0.5, # The initial prediction score of all instances, global bias
#                         random_state=743, 
#                         missing=None) 
    
    
    ]

# The `normalize` argument was removed in scikit-learn 1.2; scale the inputs separately if needed
meta_model = lr()

In [47]:
%%time
stacking_prediction = stacking_regression(models, ElasticNet, X_train, y_train, X_test,
                               transform_target=np.log1p, transform_pred = np.expm1, 
                               metric=rmsle, verbose=1)

metric: [rmsle]

model 0: [ElasticNet]
    ----
    MEAN:   [0.38737070]

model 1: [RandomForestRegressor]


  instance.fit(X_tr, transformer(y_tr, func = transform_target))
  instance.fit(X_tr, transformer(y_tr, func = transform_target))
  instance.fit(X_tr, transformer(y_tr, func = transform_target))


    ----
    MEAN:   [0.22332059]

model 2: [RandomForestRegressor]


  instance.fit(X_tr, transformer(y_tr, func = transform_target))
  instance.fit(X_tr, transformer(y_tr, func = transform_target))
  instance.fit(X_tr, transformer(y_tr, func = transform_target))


    ----
    MEAN:   [0.22897977]

model 3: [GradientBoostingRegressor]


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


    ----
    MEAN:   [0.38424706]

model 4: [GradientBoostingRegressor]


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


    ----
    MEAN:   [0.38424706]



  y = column_or_1d(y, warn=True)


TypeError: fit() missing 1 required positional argument: 'y'
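
This `TypeError` is consistent with the meta-model argument being the `ElasticNet` class itself rather than an instance such as `ElasticNet()` or `meta_model`. The source of `stacking_regression` is not shown here, so this is an assumption: if the helper calls `.fit(...)` directly on whatever it receives, the first positional argument is bound to `self` and `y` goes missing. The sketch below reproduces the mechanism with a hypothetical `TinyMeanRegressor` class, used only so the example runs without any library.

```python
class TinyMeanRegressor:
    """Hypothetical stand-in for any estimator that exposes fit(X, y)."""

    def fit(self, X, y):
        # "Fit" by storing the mean of the targets
        self.mean_ = sum(y) / len(y)
        return self


X = [[1.0], [2.0], [3.0]]
y = [10.0, 20.0, 30.0]

# Calling fit on the class: X is bound to `self`, y is bound to `X`, and `y` is missing
try:
    TinyMeanRegressor.fit(X, y)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Calling fit on an instance works as intended
fitted = TinyMeanRegressor().fit(X, y)
```

Under that assumption, passing an instance as the second argument of `stacking_regression` would likely avoid the error.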

In [48]:
S_train.head()

NameError: name 'S_train' is not defined

In [None]:
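from stacking import transformer

# Debugging the failing step: fit must be called on an instance, not on the ElasticNet class.
# S_train (the holdout predictions) is not defined in this notebook, so it has to be obtained first.
ElasticNet().fit(S_train, transformer(y_train, func = np.log1p))

transformer(y_train, func = np.log1p)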

**Having more models than necessary in an ensemble may hurt.**

Let's say we have a library of trained models. A greedy forward-selection approach usually works well:
- Start with an ensemble of a few well-performing models
- Loop through every other model in the library, adding each in turn to the current ensemble
- Keep the best-performing ensemble configuration
- Repeat until the metric stops improving

If you are using linear regression as the meta-model, make sure you have **diverse/uncorrelated** first-layer models.

During each loop iteration, it is wise to consider only a subset of the library models, which acts as a form of regularization for model selection.

Repeating the procedure a few times and bagging the results reduces the risk of overfitting introduced by model selection.
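
The greedy forward selection described above might look like the sketch below. Several simplifications are assumptions of this example rather than part of the recipe: the "library" is a dictionary of simulated holdout predictions (the true target plus noise of different sizes), the ensemble is a plain average, the starting ensemble is the single best model, and the per-iteration subsampling and bagging steps are left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
y = rng.normal(size=n)

# Hypothetical library of holdout predictions: the truth plus noise of varying size
noise_scales = [0.3, 0.4, 0.5, 0.8, 1.5, 3.0]
library = {
    f"model_{i}": y + rng.normal(scale=s, size=n)
    for i, s in enumerate(noise_scales)
}


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def ensemble_score(names):
    # Score a simple average of the chosen models' predictions
    blend = np.mean([library[name] for name in names], axis=0)
    return rmse(y, blend)


# Start from the single best model (a simplification of "a few well-performing models")
selected = [min(library, key=lambda name: ensemble_score([name]))]
best_score = ensemble_score(selected)

while True:
    candidates = [name for name in library if name not in selected]
    if not candidates:
        break
    # Try adding each remaining model and keep the best resulting ensemble
    scores = {name: ensemble_score(selected + [name]) for name in candidates}
    best_candidate = min(scores, key=scores.get)
    if scores[best_candidate] >= best_score:
        break  # metric has stopped improving
    selected.append(best_candidate)
    best_score = scores[best_candidate]
```

In practice the score should be computed on holdout predictions, as here, rather than on in-sample fits, otherwise the selection itself overfits.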

Kaggle also hosts another [great walkthrough](https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python) of stacking, using the famous Titanic dataset.

R users can call the `h2o.stackedEnsemble()` function from the [H2O package](https://h2o-release.s3.amazonaws.com/h2o/rel-ueno/2/docs-website/h2o-docs/data-science/stacked-ensembles.html) directly.

### Success formula (personal opinion)

50% - feature engineering

30% - model diversity

10% - luck

10% - proper ensembling
 - Voting
 - Averaging
 - Stacking