# Ensemble

## <center> Introduction

In general, ensemble learning trains several models on the same task, each on different data. Combining these models usually performs better than any single model.

We will introduce three ensemble methods: bagging, boosting, and stacking.

## <center> Bagging

### How to train

First, sample $t$ sub-datasets (you choose $t$) from the whole dataset of $N$ examples, where each sub-dataset has $N'$ examples (usually $N' = N$). Sampling is done with replacement, so a sub-dataset may contain the same example more than once and will generally not be identical to the original dataset.

Next, use these $t$ sub-datasets to train $t$ models (the same kind of model, but with different learned parameters).

Get the final model by averaging the outputs of these models (or by voting, for classification).
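The training steps above can be sketched as follows. This is a minimal illustration, assuming a regression task and a straight-line base model; the helper names (`bootstrap_indices`, `fit_line`, `bagging_fit`, `bagging_predict`) and the synthetic data are made up for this example.

```python
import numpy as np

def bootstrap_indices(n, n_prime, rng):
    """Sample n_prime indices from range(n) with replacement."""
    return rng.integers(0, n, size=n_prime)

def fit_line(x, y):
    """Least-squares fit of y = a * x + b; returns (a, b)."""
    a, b = np.polyfit(x, y, deg=1)
    return a, b

def bagging_fit(x, y, t, rng):
    """Train t models, each on its own bootstrap sub-dataset (N' == N)."""
    n = len(x)
    models = []
    for _ in range(t):
        idx = bootstrap_indices(n, n, rng)
        models.append(fit_line(x[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """Final model: average the predictions of all base models."""
    preds = np.array([a * x + b for a, b in models])
    return preds.mean(axis=0)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

models = bagging_fit(x, y, t=10, rng=rng)
y_hat = bagging_predict(models, x)
```

For classification, the averaging step would typically be replaced by a majority vote over the base models' predicted labels.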

### How to do validation

Out-of-bag (OOB) validation: some examples in the original dataset do not appear in a given sub-dataset. We can use the examples that a model did not see during training to test that model.
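A small sketch of how the out-of-bag examples for one sub-dataset could be found, assuming $N' = N$ and sampling with replacement. The variable names are illustrative. For large $N$, the fraction of examples left out of a sub-dataset is close to $(1 - 1/N)^N \approx e^{-1} \approx 0.37$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# One bootstrap sub-dataset, drawn with replacement (N' == N).
idx = rng.integers(0, n, size=n)

# Mark which original examples were drawn at least once.
in_bag = np.zeros(n, dtype=bool)
in_bag[idx] = True

# Out-of-bag examples: never seen by the model trained on this sub-dataset,
# so they can serve as validation data for that model.
oob_idx = np.flatnonzero(~in_bag)
oob_fraction = oob_idx.size / n
```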

### Tips

To obtain models that are different enough from each other, resampling the training data alone may not be sufficient. It may need to be combined with other techniques.

## <center> Boosting

### What is Boosting?
Boosting is an ensemble method for improving model performance. It may sound unbelievable, but boosting can indeed drive the error rate on the training data to 0%. To use boosting, your base classifier must have an error rate below 50%. We will explain later why the error rate must be below 50%.

### General process:
* We discuss classification problems here.
* Start with a weak classifier $f_1(x)$ as the first classifier.
* Find the next classifier $f_2(x)$ that complements $f_1(x)$.
* Find the third classifier $f_3(x)$ that complements $f_2(x)$.
* ...
* Aggregate all the classifiers.
* The classifiers are learned sequentially.

### How to get next classifier?
* Training on different training datasets yields different classifiers.
* Now the problem is: how do we obtain different training datasets?
    * Resample your data, as we did in bagging.
    * Reweight your data.
    
        For example, suppose you have $(x^1, \hat{y}^1, u^1 = 1), (x^2, \hat{y}^2, u^2 = 1), (x^3, \hat{y}^3, u^3 = 1)$, where $x^n$ is the feature vector, $\hat{y}^n$ is the label, and $u^n$ is the weight of the $n$-th example. We can change the effective training data by changing $u^n$. The weight does not alter $x^n$ or $\hat{y}^n$ themselves; it scales how much each example contributes to the loss.

        Accordingly, the loss function is modified to $L(f) = \sum_n u^n l(f(x^n), \hat{y}^n)$ (original: $L(f) = \sum_n l(f(x^n), \hat{y}^n)$), a weighted version of the loss function.
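As a rough illustration of the weighted loss $L(f) = \sum_n u^n l(f(x^n), \hat{y}^n)$, the sketch below uses the 0/1 loss as $l$ and a made-up set of predictions; `weighted_loss` is a hypothetical helper, not part of any library.

```python
import numpy as np

def weighted_loss(per_example_loss, u):
    """L(f) = sum_n u^n * l(f(x^n), y_hat^n)."""
    return float(np.sum(u * per_example_loss))

# Predictions of a hypothetical classifier on three examples.
y_true = np.array([1, -1, 1])
y_pred = np.array([1, 1, 1])

# 0/1 loss per example: only the second example is misclassified.
l = (y_pred != y_true).astype(float)

# With all weights equal to 1, this is the original (unweighted) loss.
loss_uniform = weighted_loss(l, np.array([1.0, 1.0, 1.0]))

# Raising the weight of the misclassified example raises the loss.
loss_reweighted = weighted_loss(l, np.array([0.5, 2.0, 0.5]))
```

With the weights above, the same classifier looks worse on the reweighted data, which is what pushes the next classifier to focus on that example.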
        
**Now, let's introduce some algorithms to do boosting**

## <center> Algorithm: Adaboost

### Initialisation

Suppose we have $f_1(x)$ with error rate $\epsilon_1$ less than $0.5$ on the original training dataset (with all weights equal to 1).

### How to calculate error rate on weighted data sets?
$$
\epsilon_1 = \frac{\sum_n u_1^n \delta(f_1(x^n) \ne \hat{y}^n)}{Z_1}, Z_1 = \sum_n u_1^n, \epsilon_1 < 0.5
$$
where $Z_1$ is the sum of the weights over the dataset, and $u_i^n$ is the weight of the $n$-th example at the $i$-th iteration.
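The weighted error rate can be computed directly from its definition. The snippet below is a small sketch with made-up labels and predictions; `weighted_error` is a hypothetical helper name.

```python
import numpy as np

def weighted_error(y_pred, y_true, u):
    """epsilon = sum_n u^n * [f(x^n) != y_hat^n] / Z, with Z = sum_n u^n."""
    miss = (y_pred != y_true)
    return float(np.sum(u * miss) / np.sum(u))

y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])   # the second example is misclassified

# Uniform weights: one mistake out of four examples.
eps_uniform = weighted_error(y_pred, y_true, np.ones(4))

# Tripling the weight of the misclassified example: 3 / (1 + 3 + 1 + 1).
eps_skewed = weighted_error(y_pred, y_true, np.array([1.0, 3.0, 1.0, 1.0]))
```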

### What kind of reweighted data we want to obtain?

We want to reweight the data such that $f_1(x)$ fails on the reweighted dataset, that is, it does no better than random classification. Since we use binary classification as the example, random classification obtains a $50\%$ error rate.

Looking at the error rate formula, the error rate rises if we increase the weights of misclassified examples and decrease the weights of correctly classified examples.

### How to reweight training data?

* If $x^n$ is misclassified by $f_i$ ($f_i(x^n) \ne \hat{y}^n$):
   
    new $u_{i+1}^n = u_i^n \cdot d_i$
  
* If $x^n$ is correctly classified by $f_i$ ($f_i(x^n) = \hat{y}^n$):

    new $u_{i+1}^n = u_i^n / d_i$

### How to compute $d_i$

Go back to the formula for the error rate. Recall that we are talking about binary classification.

We know 

$$
\epsilon_1 = \frac{\sum_n u_1^n \delta(f_1(x^n) \ne \hat{y}^n)}{Z_1}, Z_1 = \sum_n u_1^n, \epsilon_1 < 0.5
$$

and we want $f_1$ to get a $50\%$ error rate on the reweighted dataset, that is

$$
\epsilon_1' = \frac{\sum_n u_2^n \delta(f_1(x^n) \ne \hat{y}^n)}{Z_2} = 0.5
$$

$Z_2$ is the sum of the weights. It can be split into the sum of the weights of misclassified examples plus the sum of the weights of correctly classified examples:

$$
\frac{\sum_n u_2^n \delta(f_1(x^n) \ne \hat{y}^n)}{Z_2} = 
\frac{\sum_n u_2^n \delta(f_1(x^n) \ne \hat{y}^n)}                    
     {\displaystyle\sum_{f_1(x^n) = \hat{y}^n}u_2^n + \displaystyle\sum_{f_1(x^n) \ne \hat{y}^n}u_2^n} 
$$

Recalling how we reweight the data by $d_1$,

$$
=
\frac{\sum_n u_1^n \cdot d_1 \delta(f_1(x^n) \ne \hat{y}^n)}
{\displaystyle\sum_{f_1(x^n) = \hat{y}^n}u_1^n / d_1 + \displaystyle\sum_{f_1(x^n) \ne \hat{y}^n}u_1^n \cdot d_1}
$$

Taking the reciprocal of $\epsilon_1'$,

$$
\frac{1}{\epsilon_1'} = \frac
{\displaystyle\sum_{f_1(x^n) = \hat{y}^n}u_1^n / d_1 + \displaystyle\sum_{f_1(x^n) \ne \hat{y}^n}u_1^n \cdot d_1}
{\sum_n u_1^n \cdot d_1 \delta(f_1(x^n) \ne \hat{y}^n)}
= 2
$$

then,
$$
= 
\frac
{\displaystyle\sum_{f_1(x^n) = \hat{y}^n}u_1^n / d_1}
{\sum_n u_1^n \cdot d_1 \delta(f_1(x^n) \ne \hat{y}^n)} 
+ 
\frac
{\displaystyle\sum_{f_1(x^n) \ne \hat{y}^n}u_1^n \cdot d_1}
{\sum_n u_1^n \cdot d_1 \delta(f_1(x^n) \ne \hat{y}^n)} 
= 
\frac
{\displaystyle\sum_{f_1(x^n) = \hat{y}^n}u_1^n / d_1}
{\sum_n u_1^n \cdot d_1 \delta(f_1(x^n) \ne \hat{y}^n)} 
+
1
$$

Since $\delta(...)$ returns either $0$ or $1$,
$$
= \frac
{\displaystyle\sum_{f_1(x^n) = \hat{y}^n}u_1^n / d_1}
{\displaystyle\sum_{f_1(x^n) \ne \hat{y}^n} u_1^n \cdot d_1 } 
+
1
=
2
$$

Now, we get

$$
\frac
{\displaystyle\sum_{f_1(x^n) = \hat{y}^n}u_1^n / d_1}
{\displaystyle\sum_{f_1(x^n) \ne \hat{y}^n} u_1^n \cdot d_1 } 
=
1 
\rightarrow
\displaystyle\sum_{f_1(x^n) = \hat{y}^n}u_1^n / d_1
=
\displaystyle\sum_{f_1(x^n) \ne \hat{y}^n} u_1^n \cdot d_1
\rightarrow
\frac{1}{d_1} \cdot \displaystyle\sum_{f_1(x^n) = \hat{y}^n}u_1^n  
=
d_1 \cdot \displaystyle\sum_{f_1(x^n) \ne \hat{y}^n} u_1^n 
$$

Continuing, recall the definition of $\epsilon_1$: the error rate is the ratio between the total weight of misclassified examples and the total weight.

$$
\rightarrow
d_1^2 = 
\frac{\displaystyle\sum_{f_1(x^n) = \hat{y}^n}u_1^n}
{\displaystyle\sum_{f_1(x^n) \ne \hat{y}^n} u_1^n}
= 
\frac{(1 - \epsilon_1)\cdot Z_1}
{\epsilon_1 \cdot Z_1}
=
\frac{(1 - \epsilon_1)}
{\epsilon_1}
$$

Therefore,

$$
d_i =
\sqrt{
\frac{1 - \epsilon_i}
{\epsilon_i}} > 1
$$

Remember that we require $\epsilon_i < 0.5$, so $d_i > 1$. We wanted to increase the weights of misclassified examples by multiplying by $d_i$, and this is exactly what we get here.
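The result can be checked numerically. The sketch below (with a hypothetical helper `reweight` and made-up data) computes $d_1 = \sqrt{(1 - \epsilon_1)/\epsilon_1}$, reweights the examples, and confirms that the same classifier then has a weighted error rate of $0.5$. It assumes $0 < \epsilon_1 < 0.5$.

```python
import numpy as np

def reweight(y_pred, y_true, u):
    """Return (new weights, epsilon, d) for one reweighting step.

    Assumes 0 < epsilon < 0.5, so that d is finite and greater than 1.
    """
    miss = (y_pred != y_true)
    eps = np.sum(u * miss) / np.sum(u)
    d = np.sqrt((1.0 - eps) / eps)
    # Misclassified examples are multiplied by d, the others divided by d.
    u_new = np.where(miss, u * d, u / d)
    return u_new, eps, d

y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])   # predictions of a hypothetical f_1
u1 = np.ones(4)

u2, eps1, d1 = reweight(y_pred, y_true, u1)

# Error rate of the same f_1 on the reweighted data.
eps1_new = np.sum(u2 * (y_pred != y_true)) / np.sum(u2)
```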

### Conclusion of Adaboost

* Given training data and initial weights $\{(x^1, \hat{y}^1, u_1^1), ..., (x^k, \hat{y}^k, u_1^k), ..., (x^N, \hat{y}^N, u_1^N)\}$, where every initial weight is 1 and the labels are $\hat{y}^n \in \{+1, -1\}$.
* Repeat for $t = 1, ..., T$:
    * Train a weak classifier $f_t(x)$ with weights $\{u_t^1, ..., u_t^N\}$.
    * Compute $\epsilon_t = \frac{\sum_{f_t(x^n) \ne \hat{y}^n} u_t^n }{Z_t}$.
    * Update the weights:
        * Introduce $\alpha_t = \ln{d_t} = \ln{\sqrt{(1 - \epsilon_t)/\epsilon_t}}$.
        * If $f_t(x^n) \ne \hat{y}^n$, then $u_{t+1}^n = u_t^n \cdot d_t = u_t^n \cdot \exp(\alpha_t)$.
        * If $f_t(x^n) = \hat{y}^n$, then $u_{t+1}^n = u_t^n / d_t = u_t^n \cdot \exp(-\alpha_t)$.
* Final classifier (aggregation function): we use $H(x)$ to denote the aggregated result.
    * Uniform weighting: $H(x) = sign(\sum^T_{t=1} f_t(x))$. If the sum is positive, predict class $+1$; if it is negative, predict class $-1$.
    * This is a poor choice, since it ignores that the classifiers differ in accuracy.
    * Weighted aggregation: $H(x) = sign(\sum^T_{t=1} \alpha_t f_t(x))$. We use this one, so a classifier with a lower error rate $\epsilon_t$ (larger $\alpha_t$) gets a larger say.
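The whole procedure can be sketched in a few lines. This is an illustrative implementation, not a reference one: it assumes labels in $\{+1, -1\}$, one-dimensional inputs, and decision stumps (a single threshold) as the weak classifier. The function names and the toy dataset are assumptions made for this example.

```python
import numpy as np

def stump_predict(x, thr, sign):
    """Predict `sign` where x > thr and `-sign` elsewhere."""
    return np.where(x > thr, sign, -sign)

def fit_stump(x, y, u):
    """Return (epsilon, thr, sign) of the stump with the lowest weighted error."""
    xs = np.sort(x)
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
    best = None
    for thr in thresholds:
        for sign in (1, -1):
            pred = stump_predict(x, thr, sign)
            eps = np.sum(u * (pred != y)) / np.sum(u)
            if best is None or eps < best[0]:
                best = (eps, thr, sign)
    return best

def adaboost(x, y, T):
    """Train up to T stumps; returns a list of (alpha, thr, sign)."""
    u = np.ones(len(x))                      # initial weights are all 1
    ensemble = []
    for _ in range(T):
        eps, thr, sign = fit_stump(x, y, u)
        if eps >= 0.5:                       # no better than random: stop
            break
        eps = max(eps, 1e-10)                # guard against a perfect stump
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # alpha_t = ln(d_t)
        pred = stump_predict(x, thr, sign)
        # exp(+alpha) for misclassified examples, exp(-alpha) for correct ones
        u = u * np.exp(-y * pred * alpha)
        ensemble.append((alpha, thr, sign))
    return ensemble

def predict(ensemble, x):
    """H(x) = sign(sum_t alpha_t * f_t(x))."""
    g = sum(alpha * stump_predict(x, thr, sign) for alpha, thr, sign in ensemble)
    return np.sign(g)

# Toy dataset (an assumption): no single stump classifies it perfectly.
x = np.arange(10, dtype=float)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

ensemble = adaboost(x, y, T=3)
train_error = np.mean(predict(ensemble, x) != y)
```

On this toy dataset, the best single stump misclassifies three of the ten examples, while the weighted combination of three stumps classifies all training examples correctly.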

### Why does the error rate go down to 0 on the training dataset?

Final classifier: $H(x) = sign(\sum^T_{t=1} \alpha_t f_t(x))$

Then the final training error rate is

$$
\frac{1}{N} \sum_n \delta(H(x^n) \ne \hat{y}^n) 
=^{(1)}
\frac{1}{N} \sum_n \delta(\hat{y}^n g(x^n) < 0)
\le^{(2)}
\frac{1}{N} \sum_n \exp(-\hat{y}^n g(x^n))
=^{(3)}
\frac{1}{N}Z_{T+1}
=^{(4)}
\prod_{t=1}^T 2\sqrt{\epsilon_t(1 - \epsilon_t)}
$$

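Each factor $2\sqrt{\epsilon_t(1 - \epsilon_t)}$ is less than 1 whenever $\epsilon_t < 0.5$, so the product on the right shrinks as $T$ grows. As long as every $\epsilon_t$ stays bounded away from $0.5$, this upper bound on the training error rate converges to 0.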
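To get a feel for how fast the bound shrinks, the sketch below evaluates $\prod_{t=1}^T 2\sqrt{\epsilon_t(1 - \epsilon_t)}$ under the simplifying assumption that every weak classifier has the same error rate $\epsilon_t = 0.4$. The helper `error_bound` is hypothetical.

```python
import numpy as np

def error_bound(eps_list):
    """Product over t of 2 * sqrt(eps_t * (1 - eps_t))."""
    eps = np.asarray(eps_list, dtype=float)
    return float(np.prod(2.0 * np.sqrt(eps * (1.0 - eps))))

# Assumption for illustration: every round has error rate 0.4.
bound_10 = error_bound([0.4] * 10)
bound_100 = error_bound([0.4] * 100)
bound_500 = error_bound([0.4] * 500)
```

Even with weak classifiers that are only slightly better than random, the bound decays geometrically in $T$.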


### Breakdown

**(1)**: Define $g(x^n) = \sum^T_{t=1} \alpha_t f_t(x^n)$, so that $H(x^n) = sign(g(x^n))$. The left side of (1) counts the incorrectly predicted examples. These are the examples with $H(x^n) \ne \hat{y}^n$, which are exactly those where $g(x^n)$ and $\hat{y}^n$ have opposite signs (one positive, the other negative), that is, $\hat{y}^n g(x^n) < 0$.

**(2)**: $\exp(-\hat{y}^n g(x^n))$ is an upper bound of $\delta(\hat{y}^n g(x^n) < 0)$. This can be verified by plotting both functions.
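The bound in (2) can also be checked case by case. Writing $z = \hat{y}^n g(x^n)$ (a shorthand introduced only for this check):

$$
\delta(z < 0) =
\begin{cases}
1 \le \exp(-z), & z < 0 \quad (\text{because } -z > 0) \\
0 < \exp(-z), & z \ge 0
\end{cases}
$$

In both cases $\delta(z < 0) \le \exp(-z)$, which is the inequality used in step (2).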

**(3)**: $Z_{T+1} = \sum_n u_{T+1}^n$. 
* We know $u_1^n = 1$.
* The two update rules can be written as one: $u_{t+1}^n = u_t^n \cdot \exp(-\hat{y}^n f_t(x^n) \alpha_t)$, because $\hat{y}^n f_t(x^n) = -1$ for a misclassified example and $+1$ for a correctly classified one.
* By induction, $u^n_{T+1} = \prod_{t=1}^T\exp(-\hat{y}^n f_t(x^n) \alpha_t)$.
* Therefore $Z_{T+1} = \sum_n \prod_{t=1}^T\exp(-\hat{y}^n f_t(x^n) \alpha_t) = \sum_n\exp(-\hat{y}^n \sum_{t=1}^T \alpha_t f_t(x^n) ) = \sum_n\exp(-\hat{y}^n g(x^n))$.

**(4)**: Recall $\epsilon_t = \frac{\sum_{f_t(x^n) \ne \hat{y}^n} u_t^n}{Z_t}$, so the misclassified examples carry total weight $Z_t \epsilon_t$ and the correctly classified ones carry $Z_t (1 - \epsilon_t)$.
* $Z_1 = N$
* $Z_{t+1} = Z_t \cdot \epsilon_t \cdot \exp(\alpha_t) + Z_t \cdot (1 - \epsilon_t) \cdot \exp (- \alpha_t)$
* = $Z_t \cdot \epsilon_t \cdot \sqrt{(1-\epsilon_t)/\epsilon_t} + Z_t \cdot (1-\epsilon_t) \cdot \sqrt{\epsilon_t/(1-\epsilon_t)}$
* = $Z_t \cdot 2 \cdot \sqrt{\epsilon_t (1 - \epsilon_t)}$
* Therefore, $Z_{T+1} = N \prod_{t=1}^T 2 \cdot \sqrt{\epsilon_t(1 - \epsilon_t)}$
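The recursion $Z_{t+1} = Z_t \cdot 2\sqrt{\epsilon_t(1 - \epsilon_t)}$ can be verified numerically for a single step. The sketch below uses arbitrary positive weights and a made-up classifier that misclassifies five examples; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

u = rng.uniform(0.5, 2.0, size=n)        # arbitrary positive weights u_t^n
y = rng.choice([-1, 1], size=n)          # labels in {-1, +1}
pred = y.copy()
pred[:5] *= -1                           # hypothetical f_t: wrong on 5 examples

z_t = u.sum()
eps = u[pred != y].sum() / z_t
alpha = 0.5 * np.log((1.0 - eps) / eps)  # alpha_t = ln sqrt((1 - eps) / eps)

# Weight update: exp(+alpha) if misclassified, exp(-alpha) if correct.
u_next = u * np.exp(-y * pred * alpha)

z_next = u_next.sum()
z_closed_form = z_t * 2.0 * np.sqrt(eps * (1.0 - eps))
```

The two values `z_next` and `z_closed_form` agree up to floating-point error, whatever the starting weights are.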

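### Test error rate decreases even after training error rate reaches 0

The quantity that boosting drives down is the upper bound $\frac{1}{N} \sum_n \exp(-\hat{y}^n g(x^n))$, not the 0/1 error itself. Even after every training example is classified correctly, this bound keeps decreasing as $\hat{y}^n g(x^n)$ grows, so additional classifiers keep pushing the margin $\hat{y}^n g(x^n)$ larger. A larger margin can make the classifier more robust, which is why the test error rate may continue to decrease.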

### Gradient Boosting(general version)

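Gradient boosting is a more general formulation of boosting. AdaBoost can be viewed as a special case of it, obtained with the exponential loss $\sum_n \exp(-\hat{y}^n g(x^n))$.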

## <center> Stacking

### Introduction

Suppose we already have several trained models for the task.

We use the outputs of these models as the input to train a final model for the same task.

The final model can be, for example, a small neural network.

However, you need to split the training data into two parts: the first part to train the base models, and the second part to train the final model.
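A minimal sketch of this idea is shown below. It is only an illustration under several assumptions: a synthetic regression task, two deliberately weak base models that each see a single feature, and a linear regression as the final model instead of a neural network. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Split the training data: part A for the base models, part B for the final model.
X_a, y_a = X[:100], y[:100]
X_b, y_b = X[100:], y[100:]

def fit_1d(x, target):
    """Least-squares slope for target ~ w * x (no intercept)."""
    return float(x @ target / (x @ x))

# Two base models, each trained on part A using only one feature.
w0 = fit_1d(X_a[:, 0], y_a)
w1 = fit_1d(X_a[:, 1], y_a)

def base_outputs(X):
    """Stack the base models' predictions as columns."""
    return np.column_stack([w0 * X[:, 0], w1 * X[:, 1]])

# Final model: linear regression on the base outputs, trained on part B.
coef, *_ = np.linalg.lstsq(base_outputs(X_b), y_b, rcond=None)

def stacked_predict(X):
    return base_outputs(X) @ coef

# Compare one base model with the stacked model (measured on part B).
mse_base0 = np.mean((w0 * X_b[:, 0] - y_b) ** 2)
mse_stacked = np.mean((stacked_predict(X_b) - y_b) ** 2)
```

In practice a further held-out set would be needed to evaluate the stacked model fairly, since part B is already used to train the final model.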