# Stacking concept + Pictures + Code
***

# Stacking concept
***

1. We predict the train set and the test set with one or more 1st level models, and then use these predictions as features for one or more 2nd level models.
2. Any model can be used as a 1st level model or a 2nd level model.
3. To avoid overfitting on the train set we use cross-validation: in each fold we fit the model on the remaining folds and predict only the out-of-fold (OOF) part of the train set, so every train example is predicted by a model that was not fitted on it.
4. Common practice is to use from 3 to 10 folds.
5. Predicting the test set:
   * **Variant A:** In each fold we predict the test set, so after all folds are complete we take the mean (or the mode, for predicted class labels) of the temporary test set predictions made in each fold.
   * **Variant B:** We do not predict the test set during the cross-validation cycle. After all folds are complete we perform an additional step: fit the model on the full train set and predict the test set once. This approach takes more time because it requires one additional fit.
6. As an example we look at stacking implemented with a single 1st level model and 3-fold cross-validation.
7. Pictures:
   * **Variant A:** Three pictures describe the three folds of cross-validation. After all three folds are complete we get a single train feature and a single test feature to use with the 2nd level model.
   * **Variant B:** The first three pictures describe the three folds of cross-validation (as in Variant A) that produce a single train feature, and the fourth picture describes the additional step that produces a single test feature.
8. We can repeat this cycle with other 1st level models to get more features for the 2nd level model.
9. Animations of Variant A and Variant B are also provided.
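
The "mean (mode)" step of Variant A can be illustrated with a small NumPy-only sketch. The helper name `combine_fold_predictions` and the toy arrays are hypothetical and are not part of vecstack; the sketch assumes that class labels are non-negative integers.

```python
import numpy as np

def combine_fold_predictions(test_preds, regression=True):
    """Combine per-fold test predictions of shape (n_samples, n_folds)."""
    test_preds = np.asarray(test_preds)
    if regression:
        # Regression: average the fold predictions row by row
        return test_preds.mean(axis=1).reshape(-1, 1)
    # Classification: most frequent label per row
    # (ties resolve to the smallest label, because argmax returns the first maximum)
    labels = test_preds.astype(int)
    modes = np.array([np.bincount(row).argmax() for row in labels])
    return modes.reshape(-1, 1)

# Hypothetical predictions: 2 test examples, 3 folds
reg_preds = np.array([[1.0, 2.0, 3.0],
                      [4.0, 4.0, 7.0]])
clf_preds = np.array([[0, 1, 1],
                      [2, 2, 0]])

reg_combined = combine_fold_predictions(reg_preds, regression=True)
clf_combined = combine_fold_predictions(clf_preds, regression=False)
print(reg_combined.ravel())  # row means: 2.0 and 5.0
print(clf_combined.ravel())  # row modes: 1 and 2
```

In both cases the result is a single column, so the test feature has the same layout as the OOF train feature.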

# Variant A
***
![Fold 1 of 3](https://github.com/vecxoz/vecstack/raw/master/pic/dia1.png "Fold 1 of 3")
***
![Fold 2 of 3](https://github.com/vecxoz/vecstack/raw/master/pic/dia2.png "Fold 2 of 3")
***
![Fold 3 of 3](https://github.com/vecxoz/vecstack/raw/master/pic/dia3.png "Fold 3 of 3")

# Variant A. Animation
***

![Variant A. Animation](https://github.com/vecxoz/vecstack/raw/master/pic/animation1.gif "Variant A. Animation")

# Variant B
***
![Step 1 of 4](https://github.com/vecxoz/vecstack/raw/master/pic/dia4.png "Step 1 of 4")
***
![Step 2 of 4](https://github.com/vecxoz/vecstack/raw/master/pic/dia5.png "Step 2 of 4")
***
![Step 3 of 4](https://github.com/vecxoz/vecstack/raw/master/pic/dia6.png "Step 3 of 4")
***
![Step 4 of 4](https://github.com/vecxoz/vecstack/raw/master/pic/dia7.png "Step 4 of 4")

# Variant B. Animation
***

![Variant B. Animation](https://github.com/vecxoz/vecstack/raw/master/pic/animation2.gif "Variant B. Animation")

# Code
***

# Import

In [1]:
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import make_scorer
from sklearn.linear_model import LinearRegression
from vecstack import stacking

# Prepare data

In [2]:
boston = load_boston()
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
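
If `load_boston` is not available in the installed scikit-learn version, one possible substitute is a synthetic dataset of the same shape as the Boston data (506 rows, 13 features). This is only a stand-in under that assumption: the parameters of `make_regression` below are arbitrary choices, and all scores and predictions printed in the following cells would differ.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Boston data (assumption, not the original data)
X, y = make_regression(n_samples=506, n_features=13, n_informative=8,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Same split sizes as in the original example: 404 train and 102 test examples
print(X_train.shape, X_test.shape)
```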

# Stacking. Variant A. Regression task

## 1. Implementation A from scratch

In [3]:
# 1st level model
model = LinearRegression()
# Number of folds
n_folds = 3
# Empty array to store out-of-fold predictions (single column)
S_train_A_scratch = np.zeros((X_train.shape[0], 1))
# Empty array to store temporary test set predictions made in each fold
S_test_temp = np.zeros((X_test.shape[0], n_folds))
# Empty list to store scores from each fold
scores = []
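# Split initialization
# (random_state is omitted: it has no effect when shuffle=False,
# and recent scikit-learn versions raise an error if it is set)
kf = KFold(n_splits=n_folds, shuffle=False)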

# Loop across folds
for fold_counter, (tr_index, te_index) in enumerate(kf.split(X_train, y_train)):
    
    # Split data and target
    X_tr = X_train[tr_index]
    y_tr = y_train[tr_index]
    X_te = X_train[te_index]
    y_te = y_train[te_index]
    
    # Fit
    _ = model.fit(X_tr, y_tr)
    
    # Predict out-of-fold part of train set
    S_train_A_scratch[te_index, :] = model.predict(X_te).reshape(-1, 1)
    
    # Predict test set
    S_test_temp[:, fold_counter] = model.predict(X_test)
    
    # Print score of current fold
    score = mean_absolute_error(y_te, S_train_A_scratch[te_index, :])
    scores.append(score)
    print('fold %d: [%.8f]' % (fold_counter, score))
    
# Compute mean of temporary test set predictions to get final test set prediction
S_test_A_scratch = np.mean(S_test_temp, axis=1).reshape(-1, 1)

# Mean OOF score + std
print('\nMEAN:   [%.8f] + [%.8f]' % (np.mean(scores), np.std(scores)))

# Full OOF score
# Note: the FULL score differs slightly from the MEAN score because the folds
# contain different numbers of examples (404 is not divisible by 3).
# With n_folds=4 the two scores would be identical for this metric.
print('FULL:   [%.8f]' % (mean_absolute_error(y_train, S_train_A_scratch)))

fold 0: [3.38044832]
fold 1: [3.21036931]
fold 2: [3.49229353]

MEAN:   [3.36103705] + [0.11591064]
FULL:   [3.36071216]
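
The small gap between MEAN and FULL can be reproduced on toy numbers. With 404 examples and 3 folds, `KFold` produces folds of 135, 135 and 134 examples, so the mean of the per-fold scores weights the smaller fold slightly more than the full OOF score does. The sketch below uses a hypothetical helper `mean_and_full_mae` and made-up absolute errors to show the same effect.

```python
import numpy as np

def mean_and_full_mae(abs_errors, fold_sizes):
    """Return (mean of per-fold MAEs, MAE over all examples)."""
    # Split the error vector into consecutive folds of the given sizes
    boundaries = np.cumsum(fold_sizes)[:-1]
    folds = np.split(abs_errors, boundaries)
    mean_of_folds = np.mean([fold.mean() for fold in folds])
    full = abs_errors.mean()
    return mean_of_folds, full

# Made-up absolute errors for 8 examples
abs_errors = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Equal folds (4 + 4): the two scores coincide
equal_mean, equal_full = mean_and_full_mae(abs_errors, [4, 4])
# Unequal folds (3 + 5): the smaller fold is over-weighted in the mean
unequal_mean, unequal_full = mean_and_full_mae(abs_errors, [3, 5])

print(equal_mean, equal_full)      # 4.5 4.5
print(unequal_mean, unequal_full)  # 4.0 4.5
```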


## 2. Implementation A using standard Scikit-learn tools

There is no standard Scikit-learn tool that directly implements Variant A.  
Please see the corresponding implementation for Variant B below.
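
A possible workaround, assuming a scikit-learn version in which `cross_validate` accepts `return_estimator=True` (0.20 or later), is to combine two standard functions: `cross_val_predict` for the OOF train feature and the per-fold estimators returned by `cross_validate` for the averaged test feature. This is a sketch on synthetic data with hypothetical `_demo` names, not a replacement for the from-scratch implementation above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (KFold, cross_val_predict,
                                     cross_validate, train_test_split)

# Synthetic data, used only to keep the sketch self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

kf = KFold(n_splits=3, shuffle=False)

# OOF predictions for the train set
S_train_demo = cross_val_predict(LinearRegression(), X_tr_demo, y_tr_demo,
                                 cv=kf).reshape(-1, 1)

# One fitted estimator per fold (same deterministic splits, because shuffle=False)
cv_result = cross_validate(LinearRegression(), X_tr_demo, y_tr_demo,
                           cv=kf, return_estimator=True)

# Predict the test set with each fold's estimator and average the predictions
S_test_temp_demo = np.column_stack(
    [est.predict(X_te_demo) for est in cv_result['estimator']])
S_test_demo = S_test_temp_demo.mean(axis=1).reshape(-1, 1)

print(S_train_demo.shape, S_test_demo.shape)
```

Note that in this sketch every fold model is fitted twice (once by each function), so the manual loop remains the more economical option.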

## 3. Implementation A using vecstack

In [4]:
models = [LinearRegression()]
S_train_A_vecstack, S_test_A_vecstack = stacking(models, 
                                                 X_train, y_train, X_test, 
                                                 regression=True, 
                                                 n_folds=n_folds,
                                                 mode='oof_pred_bag', 
                                                 random_state=0, 
                                                 verbose=2)

task:       [regression]
metric:     [mean_absolute_error]
mode:       [oof_pred_bag]
n_models:   [1]

model 0:    [LinearRegression]
    fold 0: [3.38044832]
    fold 1: [3.21036931]
    fold 2: [3.49229353]
    ----
    MEAN:   [3.36103705] + [0.11591064]
    FULL:   [3.36071216]



## Compare results

In [5]:
print('%s\n\n%s' % (S_train_A_scratch[:5], S_train_A_vecstack[:5]))

[[ 32.87287178]
 [ 22.02957522]
 [ 27.16855956]
 [ 23.77791521]
 [  7.70569251]]

[[ 32.87287178]
 [ 22.02957522]
 [ 27.16855956]
 [ 23.77791521]
 [  7.70569251]]


In [6]:
print('%s\n\n%s' % (S_test_A_scratch[:5], S_test_A_vecstack[:5]))

[[ 24.95478501]
 [ 23.63277494]
 [ 29.34879363]
 [ 12.0744784 ]
 [ 21.46079309]]

[[ 24.95478501]
 [ 23.63277494]
 [ 29.34879363]
 [ 12.0744784 ]
 [ 21.46079309]]


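## Results from the two implementations above are identical

In [7]:
# np.array_equal compares whole arrays and also works when there is more than one column
print('Train:\n')
print(np.array_equal(S_train_A_scratch, S_train_A_vecstack))

print('\nTest:\n')
print(np.array_equal(S_test_A_scratch, S_test_A_vecstack))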

Train:

True

Test:

True


# Stacking. Variant B. Regression task

## 1. Implementation B from scratch

In [8]:
# 1st level model
model = LinearRegression()
# Number of folds
n_folds = 3
# Empty array to store out-of-fold predictions (single column)
S_train_B_scratch = np.zeros((X_train.shape[0], 1))
# Empty list to store scores from each fold
scores = []
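# Split initialization
# (random_state is omitted: it has no effect when shuffle=False,
# and recent scikit-learn versions raise an error if it is set)
kf = KFold(n_splits=n_folds, shuffle=False)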

# Loop across folds
for fold_counter, (tr_index, te_index) in enumerate(kf.split(X_train, y_train)):
    
    # Split data and target
    X_tr = X_train[tr_index]
    y_tr = y_train[tr_index]
    X_te = X_train[te_index]
    y_te = y_train[te_index]
    
    # Fit
    _ = model.fit(X_tr, y_tr)
    
    # Predict out-of-fold part of train set
    S_train_B_scratch[te_index, :] = model.predict(X_te).reshape(-1, 1)
    
    # Print score of current fold
    score = mean_absolute_error(y_te, S_train_B_scratch[te_index, :])
    scores.append(score)
    print('fold %d: [%.8f]' % (fold_counter, score))
    
# Fit on full train set and predict test set once
_ = model.fit(X_train, y_train)
S_test_B_scratch = model.predict(X_test).reshape(-1, 1)

# Mean OOF score + std
print('\nMEAN:   [%.8f] + [%.8f]' % (np.mean(scores), np.std(scores)))

# Full OOF score
# Note: the FULL score differs slightly from the MEAN score because the folds
# contain different numbers of examples (404 is not divisible by 3).
# With n_folds=4 the two scores would be identical for this metric.
print('FULL:   [%.8f]' % (mean_absolute_error(y_train, S_train_B_scratch)))

fold 0: [3.38044832]
fold 1: [3.21036931]
fold 2: [3.49229353]

MEAN:   [3.36103705] + [0.11591064]
FULL:   [3.36071216]


## 2. Implementation B using standard Scikit-learn tools

In [9]:
# 1st level model
model = LinearRegression()

# Scorer for cross_val_score
scorer = make_scorer(mean_absolute_error)

# Fit and predict out-of-fold parts of train set
S_train_B_sklearn = cross_val_predict(model, 
                                      X_train, y=y_train, 
                                      cv=n_folds, n_jobs=1, 
                                      verbose=0).reshape(-1, 1)

# Fit on full train set and predict test set once
_ = model.fit(X_train, y_train)
S_test_B_sklearn = model.predict(X_test).reshape(-1, 1)

# Compute scores
# Note: an additional run of cross_val_score is needed here,
# because cross_val_predict does not report per-fold scores
scores = cross_val_score(model, 
                         X_train, y=y_train,                          
                         cv=n_folds, n_jobs=1, 
                         scoring=scorer, verbose=0)

# Print score of each fold
for fold_counter, score in enumerate(scores):
    print('fold %d: [%.8f]' % (fold_counter, score))
    
# Mean OOF score + std
print('\nMEAN:   [%.8f] + [%.8f]' % (np.mean(scores), np.std(scores)))

# Full OOF score
# Note: the FULL score differs slightly from the MEAN score because the folds
# contain different numbers of examples (404 is not divisible by 3).
# With n_folds=4 the two scores would be identical for this metric.
print('FULL:   [%.8f]' % (mean_absolute_error(y_train, S_train_B_sklearn)))

fold 0: [3.38044832]
fold 1: [3.21036931]
fold 2: [3.49229353]

MEAN:   [3.36103705] + [0.11591064]
FULL:   [3.36071216]
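
For readers on scikit-learn 0.22 or later, `StackingRegressor` appears to follow the same scheme as Variant B: the final estimator is trained on cross-validated predictions of the base estimators, while the base estimators used for prediction are fitted on the full train set. The sketch below is an assumption-laden illustration on synthetic data (the `_demo` names and the `Ridge` final estimator are arbitrary choices), not part of the original comparison.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data, used only to keep the sketch self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

stack = StackingRegressor(
    estimators=[('lr', LinearRegression())],  # 1st level model(s)
    final_estimator=Ridge(),                  # 2nd level model
    cv=3,
)
stack.fit(X_tr_demo, y_tr_demo)

# transform() returns the 1st level predictions used as features by the final estimator
S_test_demo = stack.transform(X_te_demo)
y_pred_demo = stack.predict(X_te_demo)

print(S_test_demo.shape)
print('MAE: %.4f' % mean_absolute_error(y_te_demo, y_pred_demo))
```

Unlike the cells above, this class hides the intermediate train feature and goes straight to the 2nd level prediction, so it is convenient for end-to-end use but less transparent for inspection.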


## 3. Implementation B using vecstack

In [10]:
models = [LinearRegression()]
S_train_B_vecstack, S_test_B_vecstack = stacking(models, 
                                                 X_train, y_train, X_test, 
                                                 regression=True, 
                                                 n_folds=n_folds,
                                                 mode='oof_pred', 
                                                 random_state=0, 
                                                 verbose=2)

task:       [regression]
metric:     [mean_absolute_error]
mode:       [oof_pred]
n_models:   [1]

model 0:    [LinearRegression]
    fold 0: [3.38044832]
    fold 1: [3.21036931]
    fold 2: [3.49229353]
    ----
    MEAN:   [3.36103705] + [0.11591064]
    FULL:   [3.36071216]

    Fitting on full train set...



## Compare results

In [11]:
print('%s\n\n%s\n\n%s' % (S_train_B_scratch[:5], S_train_B_sklearn[:5], S_train_B_vecstack[:5]))

[[ 32.87287178]
 [ 22.02957522]
 [ 27.16855956]
 [ 23.77791521]
 [  7.70569251]]

[[ 32.87287178]
 [ 22.02957522]
 [ 27.16855956]
 [ 23.77791521]
 [  7.70569251]]

[[ 32.87287178]
 [ 22.02957522]
 [ 27.16855956]
 [ 23.77791521]
 [  7.70569251]]


In [12]:
print('%s\n\n%s\n\n%s' % (S_test_B_scratch[:5], S_test_B_sklearn[:5], S_test_B_vecstack[:5]))

[[ 24.89012999]
 [ 23.72488246]
 [ 29.37213304]
 [ 12.14010251]
 [ 21.4468654 ]]

[[ 24.89012999]
 [ 23.72488246]
 [ 29.37213304]
 [ 12.14010251]
 [ 21.4468654 ]]

[[ 24.89012999]
 [ 23.72488246]
 [ 29.37213304]
 [ 12.14010251]
 [ 21.4468654 ]]


## Results from the three implementations above are identical

In [13]:
# np.array_equal compares whole arrays and also works when there is more than one column
print('Train:\n')
print(np.array_equal(S_train_B_scratch, S_train_B_sklearn))
print(np.array_equal(S_train_B_scratch, S_train_B_vecstack))
print(np.array_equal(S_train_B_sklearn, S_train_B_vecstack))

print('\nTest:\n')
print(np.array_equal(S_test_B_scratch, S_test_B_sklearn))
print(np.array_equal(S_test_B_scratch, S_test_B_vecstack))
print(np.array_equal(S_test_B_sklearn, S_test_B_vecstack))

Train:

True
True
True

Test:

True
True
True
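
The remaining steps from the concept list, repeating the cycle for several 1st level models and then fitting a 2nd level model on the resulting features, could look roughly as follows. This is a sketch of Variant B on synthetic data: the helper `variant_b_feature`, the `_demo` names and the choice of models are hypothetical and serve only as an illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeRegressor

def variant_b_feature(model, X_train, y_train, X_test, n_folds=3):
    """Return one OOF train column and one test column for a single model."""
    # OOF predictions for the train set
    s_train = cross_val_predict(clone(model), X_train, y_train,
                                cv=n_folds).reshape(-1, 1)
    # Fit on the full train set and predict the test set once
    s_test = clone(model).fit(X_train, y_train).predict(X_test).reshape(-1, 1)
    return s_train, s_test

# Synthetic data, used only to keep the sketch self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Each 1st level model contributes one column to S_train and one to S_test
first_level = [LinearRegression(),
               DecisionTreeRegressor(max_depth=4, random_state=0)]
columns = [variant_b_feature(m, X_tr_demo, y_tr_demo, X_te_demo) for m in first_level]
S_train_demo = np.hstack([c[0] for c in columns])
S_test_demo = np.hstack([c[1] for c in columns])

# 2nd level model: fit on the stacked train features, predict from the stacked test features
second_level = Ridge()
second_level.fit(S_train_demo, y_tr_demo)
y_pred_demo = second_level.predict(S_test_demo)

print(S_train_demo.shape, S_test_demo.shape)
print('2nd level MAE: %.4f' % mean_absolute_error(y_te_demo, y_pred_demo))
```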
