---
layout: post
title:  "Stacking"
date:   2023-05-18 10:14:54 +0700
categories: MachineLearning
---

# TOC

- [Stacked generalization](#stacked-generalization)
- [Variants](#variants)

# Stacked generalization

Stacking, also called stacked generalization, is one of the popular ensembling techniques. Like bagging and boosting, stacking combines the predictions of multiple base models. It differs from bagging in two ways: stacking usually combines different kinds of machine learning models, whereas bagging typically repeats a single kind of model (often a decision tree), and every base model is trained on the full training set rather than on a bootstrap sample of it. It differs from boosting in that stacking trains a new model to combine the predictions, whereas boosting trains models sequentially, each one focusing on the errors of the previous ones. Stacking can also be seen as a more sophisticated version of cross validation: instead of using held-out predictions only to pick the best model, it uses them to learn how to combine the models.

A general stacking model includes two levels: base models and the meta model:

- Base models learn directly from the training set, and their outputs are used as input for the meta model. The base models can be any models: decision tree, SVM, neural network, and so on. Since they learn in different ways, their outputs and errors tend to be less correlated with each other. To avoid overfitting, the outputs passed to the meta model can be produced with k-fold cross validation, so that each prediction comes from a model that did not see that instance during training.

- Meta models take the outputs of the base models as input and predict the labels of the problem. This is how the predictions of the base models are combined. The meta model can be simple: linear regression to output a real value for a regression task, or logistic regression to output class probabilities for a classification task.

<img src="https://raw.githubusercontent.com/kpokrass/dsc-3-final-project-online-ds-ft-021119/master/stacked_schema.png">

# Variants

## K fold cross validation

Suppose we have a dataset with N observations $$ D = \{(x_n, y_n), n = 1, ..., N\} $$, where $$ y_n $$ is the class value and $$ x_n $$ is the attribute vector of the nth instance. Let's split the data into J roughly equal parts $$ D_1, ..., D_J $$ (so the K in K-fold is J here). As in a usual cross validation task, let $$ D_j $$ be the test set and $$ D^{(-j)} = D - D_j $$ be the training set for the jth fold. Now we assemble L learning algorithms, which are called level-0 generalizers. Each learning algorithm l is trained on the training set $$ D^{(-j)} $$ and results in the model $$ M_l^{(-j)} $$. At the end, the final level-0 models $$ M_l, l = 1, ..., L $$ are trained on all the data in D.

For each $$ x_n \in D_j $$, the test set of the jth fold, let $$ z_{ln} $$ be the prediction of $$ M_l^{(-j)} $$ on $$ x_n $$. After all the cross validation folds are done, the dataset assembled from the outputs of the L models is $$ D_{CV} = \{(y_n, z_{1n}, ..., z_{Ln}), n = 1, ..., N \} $$. This is called the level-1 data, since it is built from the outputs of the level-0 models.

We then choose a level-1 generalizer and train it on the level-1 data to obtain the model M, which predicts y as a function of $$ (z_1, ..., z_L) $$. This is called the level-1 model.

For the classification process, the models $$ M_l $$ are used together with M. For a new instance, the models $$ M_l, l = 1, ..., L $$ output a vector $$ (z_1, ..., z_L) $$. This vector is then the input for the level-1 model M, which outputs the final classification result for that instance. This is the original stacked generalization process.
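The whole procedure can be sketched in plain Python. This is only a minimal illustration under assumptions that are not in the text above: the two level-0 learners (`fit_1nn`, `fit_centroid`), the table-lookup level-1 learner (`fit_table`), the way folds are assigned (`n % n_folds`) and the toy data are all hypothetical choices made to keep the example short.

```python
from collections import Counter, defaultdict

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def fit_1nn(X, y):
    """Level-0 learner (assumed): label of the closest training point."""
    def predict(x):
        dists = [sq_dist(x, xi) for xi in X]
        return y[dists.index(min(dists))]
    return predict

def fit_centroid(X, y):
    """Level-0 learner (assumed): class whose mean vector is closest."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    centroids = {c: [sum(col) / len(pts) for col in zip(*pts)]
                 for c, pts in groups.items()}
    def predict(x):
        return min(centroids, key=lambda c: sq_dist(x, centroids[c]))
    return predict

def fit_table(Z, y):
    """Level-1 learner (assumed): majority true label per tuple of level-0 outputs."""
    votes = defaultdict(Counter)
    for z, yi in zip(Z, y):
        votes[tuple(z)][yi] += 1
    default = Counter(y).most_common(1)[0][0]  # fallback for unseen tuples
    def predict(z):
        seen = votes.get(tuple(z))
        return seen.most_common(1)[0][0] if seen else default
    return predict

def level1_data(X, y, learners, n_folds):
    """Z[n][l] is the prediction of M_l^(-j) on x_n, where x_n belongs to D_j."""
    N = len(X)
    Z = [[None] * len(learners) for _ in range(N)]
    for j in range(n_folds):
        test_idx = [n for n in range(N) if n % n_folds == j]   # D_j
        train_idx = [n for n in range(N) if n % n_folds != j]  # D^(-j)
        X_tr = [X[n] for n in train_idx]
        y_tr = [y[n] for n in train_idx]
        for l, fit in enumerate(learners):
            model = fit(X_tr, y_tr)                            # M_l^(-j)
            for n in test_idx:
                Z[n][l] = model(X[n])
    return Z

def fit_stack(X, y, learners, meta_fit, n_folds=3):
    Z = level1_data(X, y, learners, n_folds)   # level-1 data D_CV (without y)
    meta = meta_fit(Z, y)                      # level-1 model M
    finals = [fit(X, y) for fit in learners]   # final M_l, trained on all of D
    def predict(x):
        return meta([m(x) for m in finals])
    return predict

# Toy data: two well separated clusters.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2), (0.0, 0.3),
     (1.0, 0.9), (0.8, 1.0), (0.9, 0.7), (0.7, 0.8), (1.0, 1.0), (0.8, 0.8)]
y = ["a"] * 6 + ["b"] * 6

stack = fit_stack(X, y, [fit_1nn, fit_centroid], fit_table)
print(stack((0.1, 0.1)), stack((0.9, 0.9)))
```

Note that the level-1 model is trained only on out-of-fold predictions, while the level-0 models used at prediction time are refit on all of D, which matches the description above.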

Apart from hard class predictions, the level-0 models can also output class probabilities. Assume there are I classes. For model $$ M_l^{(-j)} $$ and an instance $$ x_n $$ in $$ D_j $$, the output of the model is the probability vector $$ P_{ln} = (P_{l1}(x_n), ..., P_{lI}(x_n)) $$, where $$ P_{li}(x_n) $$ is the probability that the model assigns to the ith class.

The level-1 data would be all the class probability vectors from all L models, together with the true class:

$$ D'_{CV} = \{(y_n, P_{1n}, ..., P_{Ln}), n = 1, ..., N \} $$

The level-1 model trained on this dataset would be called M'.
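As a rough sketch of how the probability-based level-1 data could be assembled, the snippet below builds one row per instance from out-of-fold probability vectors. The two probabilistic learners (`fit_knn_proba`, `fit_centroid_proba`), the fixed class order in `CLASSES`, the fold assignment and the toy data are assumptions for illustration only. It also assumes every class appears in every training fold.

```python
import math

CLASSES = ["a", "b"]  # I = 2 classes, in a fixed order (assumed)

def fit_knn_proba(X, y, k=3):
    """Level-0 learner (assumed): class frequencies among the k nearest points."""
    def predict_proba(x):
        nearest = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))[:k]
        return [sum(y[i] == c for i in nearest) / len(nearest) for c in CLASSES]
    return predict_proba

def fit_centroid_proba(X, y):
    """Level-0 learner (assumed): softmax over negative distances to class centroids."""
    centroids = []
    for c in CLASSES:
        pts = [xi for xi, yi in zip(X, y) if yi == c]
        centroids.append([sum(col) / len(pts) for col in zip(*pts)])
    def predict_proba(x):
        w = [math.exp(-math.dist(x, m)) for m in centroids]
        return [wi / sum(w) for wi in w]
    return predict_proba

def level1_proba_data(X, y, learners, n_folds=3):
    """Rows (y_n, P_1n, ..., P_Ln), each P_ln predicted by M_l^(-j) for x_n in D_j."""
    rows = [None] * len(X)
    for j in range(n_folds):
        train = [n for n in range(len(X)) if n % n_folds != j]   # D^(-j)
        X_tr = [X[n] for n in train]
        y_tr = [y[n] for n in train]
        models = [fit(X_tr, y_tr) for fit in learners]
        for n in range(j, len(X), n_folds):                      # instances of D_j
            rows[n] = (y[n], *[m(X[n]) for m in models])
    return rows

X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2), (0.0, 0.3),
     (1.0, 0.9), (0.8, 1.0), (0.9, 0.7), (0.7, 0.8), (1.0, 1.0), (0.8, 0.8)]
y = ["a"] * 6 + ["b"] * 6

rows = level1_proba_data(X, y, [fit_knn_proba, fit_centroid_proba])
print(rows[0])
```

A level-1 learner would typically flatten each row into L × I numeric features, which gives it more information than the L hard labels alone.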

There are reasons why stacking in this way improves the overall performance. First, each base model adds to the coverage of the training set. For example, if base model 1 can cover (predict well) 60% of the dataset and base model 2 can cover 30%, together they can cover at most 90%, each in its own way. Second, a meta learner can combine the outputs of all base models in a non-trivial way, instead of just using a winner-takes-all or simple averaging method. With winner-takes-all, if we simply choose the most confident output, it is just like using that base learner alone: we haven't used the other base learners and their sophistication. Similarly, taking the average of the base learners' outputs fails to take into account the intricacies of each learner. By learning the combination, we combine the outputs better and can increase the overall performance.

## Restacking

One way to improve the stack is to pass the original training features, together with the outputs of the base models, into the next level learner.
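In code, restacking amounts to concatenating each original feature vector with the corresponding base model outputs before training the next level. The helper name `restack_rows` and the numbers below are hypothetical, and the base outputs are assumed to be numeric (for example, predicted probabilities).

```python
def restack_rows(X, Z):
    """Concatenate original features x_n with base model outputs z_n, row by row."""
    return [list(x) + list(z) for x, z in zip(X, Z)]

X = [(5.1, 3.5), (6.2, 2.9)]   # original features (made-up values)
Z = [(0.9, 0.8), (0.2, 0.4)]   # out-of-fold outputs of two base models (made-up values)

level1_inputs = restack_rows(X, Z)
print(level1_inputs)
```

In scikit-learn, `StackingClassifier` and `StackingRegressor` expose a similar option through the `passthrough=True` parameter.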

## Generate test and aggregate

We can generate multiple predictions for the test set, for example one from each model trained during cross validation, and average them.
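A minimal sketch of the averaging step, assuming the multiple predictions come from J models that each predict the same test instances (the source of the predictions and the helper name `average_predictions` are assumptions):

```python
def average_predictions(fold_predictions):
    """fold_predictions[j][m] is the prediction of model j for test instance m."""
    n_models = len(fold_predictions)
    return [sum(col) / n_models for col in zip(*fold_predictions)]

# Three models, two test instances (made-up values).
fold_predictions = [
    [0.25, 1.0],
    [0.50, 0.5],
    [0.75, 0.0],
]
averaged = average_predictions(fold_predictions)
print(averaged)
```

For class probabilities the same averaging can be applied to each class column separately.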

## Increase the levels

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2021/03/Screenshot-from-2021-03-30-15-30-32.png">

Apart from the level-1 meta learners, we can add level-2 learners as well before arriving at the final prediction.