## Voting Classifiers
* *Hard voting* predicts the class that receives the most votes from the individual classifiers (a majority vote).
* *Soft voting* predicts the class with the highest probability averaged over all the individual classifiers. It often achieves higher performance than hard voting because it gives more weight to highly confident votes.
* For classifiers that do not output probabilities, such as SVC, see [Platt scaling](https://en.wikipedia.org/wiki/Platt_scaling). It fits a logistic function on top of the SVM decision function and calibrates its parameters using cross-validation.
* **Why does ensembling work?** *Law of large numbers.* Think of tossing a slightly biased coin many times: the more you toss it, the more likely it is that the favored side (say, heads) makes up the majority of the outcomes. Similarly, an ensemble of *weak learners* can still be a *strong learner* (achieving high accuracy), provided there are enough weak learners and they are sufficiently diverse.
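
The difference between hard and soft voting can be tried out with a minimal sketch like the one below. The moons dataset, the three base classifiers, and all hyperparameter values are assumptions chosen for illustration; they are not prescribed by these notes.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy dataset (an assumption for illustration).
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


def make_estimators():
    # probability=True makes SVC expose predict_proba (calibrated internally
    # with Platt scaling), which soft voting requires.
    return [
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ]


hard_clf = VotingClassifier(estimators=make_estimators(), voting="hard")
soft_clf = VotingClassifier(estimators=make_estimators(), voting="soft")

hard_acc = hard_clf.fit(X_train, y_train).score(X_test, y_test)
soft_acc = soft_clf.fit(X_train, y_train).score(X_test, y_test)
print(f"hard voting accuracy: {hard_acc:.3f}")
print(f"soft voting accuracy: {soft_acc:.3f}")
```

Which of the two scores is higher depends on the data and the base classifiers, so it is worth comparing both on a validation set.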


## Bagging and Pasting
* *Bagging*: sampling *with* replacement. *Pasting*: *without* replacement.
* [Bagging vs. pasting](https://stats.stackexchange.com/questions/219193/when-should-the-pasting-ensemble-method-be-used-instead-of-bagging): when the dataset is small, bagging is usually the better choice; pasting may be preferable with large sample sizes and for external validation.
* In Scikit-Learn, the choice is made with the hyperparameter **bootstrap** (*=True* for bagging, *=False* for pasting).
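
A minimal sketch of switching between the two with `BaggingClassifier` follows. The dataset, `n_estimators=100`, and `max_samples=100` are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


def fit_ensemble(bootstrap):
    # Each tree is trained on 100 instances drawn from the training set:
    # with replacement if bootstrap=True (bagging), without if False (pasting).
    clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=100,
        max_samples=100,
        bootstrap=bootstrap,
        random_state=42,
    )
    return clf.fit(X_train, y_train)


bagging = fit_ensemble(bootstrap=True)
pasting = fit_ensemble(bootstrap=False)

print("bagging accuracy:", bagging.score(X_test, y_test))
print("pasting accuracy:", pasting.score(X_test, y_test))

# Pasting never repeats an instance within one predictor's sample.
print("unique indices in first pasting sample:",
      len(set(pasting.estimators_samples_[0])))
```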

### Out-of-Bag Evaluation
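* In Scikit-Learn, set **oob_score=True** to request an automatic out-of-bag (OOB) evaluation after training. Each predictor is evaluated on the training instances it never saw during training, so the resulting score can serve as an estimate of test set performance.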
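
As a rough illustration, the OOB score can be compared with the accuracy on a held-out test set. The dataset and hyperparameters here are assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=200,
    bootstrap=True,   # OOB evaluation only makes sense with bootstrap sampling
    oob_score=True,
    random_state=42,
)
bag_clf.fit(X_train, y_train)

oob_acc = bag_clf.oob_score_
test_acc = bag_clf.score(X_test, y_test)
print(f"OOB accuracy:  {oob_acc:.3f}")
print(f"test accuracy: {test_acc:.3f}")

# Per-instance OOB class probabilities are also available.
print(bag_clf.oob_decision_function_[:3])
```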

## Random Forests

* Sampling both training instances and features is called *Random Patches*; keeping all training instances but sampling features is called *Random Subspaces*.
* RF searches for the best feature among a random subset of features when growing trees in order to introduce extra randomness. It trades a higher bias for a lower variance, generally yielding an overall better model. 
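
The sampling schemes above can be sketched with Scikit-Learn as follows. The synthetic dataset and every hyperparameter value are assumptions. Note one difference: `BaggingClassifier` samples features once per predictor, whereas a Random Forest samples a feature subset at every split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

# Random Patches: sample both training instances and features.
patches = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=0,
).fit(X, y)

# Random Subspaces: keep all training instances, sample only features.
subspaces = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=True,
    random_state=0,
).fit(X, y)

# Random Forest: a random feature subset is considered at each split.
forest = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", random_state=0
).fit(X, y)

print("features drawn per tree (patches):", len(patches.estimators_features_[0]))
print("instances per tree (subspaces):", len(set(subspaces.estimators_samples_[0])))
```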

### Extra-Trees
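* Builds on RF by also using a random threshold for each candidate feature, rather than searching for the best possible threshold as regular Decision Trees do. The result is called an *Extremely Randomized Trees* (Extra-Trees) ensemble.
* Much faster to train, since finding the best threshold for each feature at every node is one of the most time-consuming parts of growing a tree.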
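
A minimal comparison sketch, assuming a synthetic dataset and default-like hyperparameters. Which of the two generalizes better is hard to tell in advance, so cross-validation is used here to compare them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
# Same API as RandomForestClassifier, but split thresholds are drawn at random.
et = ExtraTreesClassifier(n_estimators=100, random_state=0)

rf_scores = cross_val_score(rf, X, y, cv=5)
et_scores = cross_val_score(et, X, y, cv=5)
print(f"Random Forest mean CV accuracy: {rf_scores.mean():.3f}")
print(f"Extra-Trees mean CV accuracy:   {et_scores.mean():.3f}")
```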

### Feature Importance
* Important features are likely to appear closer to the root, so it is possible to estimate a feature's importance by computing the average depth at which it appears across all trees in the forest.
* [Gini importance](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp): "Every time a split of a node is made on feature $n$ the gini impurity criterion for the two children nodes is less than the parent node. Adding up the gini decreases for each individual feature over all trees in the forest gives a fast feature importance that is often very consistent with the permutation importance measure."
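
In Scikit-Learn, a fitted forest exposes impurity-based importances through `feature_importances_`, normalized to sum to 1. The sketch below uses the iris dataset as an illustrative assumption.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(iris.data, iris.target)

# Impurity-based (Gini) importances, one value per feature.
importances = rf.feature_importances_
for name, score in zip(iris.feature_names, importances):
    print(f"{name:20s} {score:.3f}")
```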

## Boosting
* Train predictors *sequentially*, each one trying to correct the errors of its predecessor.

### AdaBoost
* Initialize each instance weight $w^{(i)}$ to $\frac{1}{m}$, where $m$ is the number of training instances.
* Calculate the *weighted error rate* of the $j^{\text{th}}$ predictor
$$r_j = \dfrac{\underset{\hat{y}_j^{(i)}\neq y^{(i)}}{\sum_{i=1}^m}w^{(i)}}{\sum_{i=1}^mw^{(i)}}$$
* Compute predictor weight
$$\alpha_j=\eta\log\dfrac{1-r_j}{r_j}$$
where $\eta$ is the learning rate. Note that if a predictor is just guessing randomly ($r_j\approx0.5$), then its weight will be close to zero.
* Update instance weight, for  $i=1,...,m$
$$w^{(i)}=\begin{cases}
w^{(i)} & \text{if } \hat{y}_j^{(i)}= y^{(i)}\\
w^{(i)}\exp(\alpha_j) & \text{if } \hat{y}_j^{(i)}\neq y^{(i)}
\end{cases}$$
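and then normalize all instance weights (i.e., divide by $\sum_{i=1}^m w^{(i)}$). Note that as $r_j \to 0$ (a nearly perfect predictor), $\alpha_j \to \infty$, so the weights of the few remaining misclassified instances become very large. Conversely, a predictor that is wrong more often than not ($r_j > 0.5$) gets a negative weight.
* Finally, AdaBoost predicts the class that receives the largest total of predictor weights, where $N$ is the number of predictors: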
$\hat{y}(x) =\underset{k}{\text{argmax}}\underset{\hat{y}_j(x) = k}{\sum_{j=1}^N}\alpha_j$
* If the AdaBoost ensemble is overfitting the training set, reduce the number of estimators or regularize the base estimator more strongly.
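
The steps above can be sketched from scratch as follows. This is an illustrative implementation for a binary problem with decision stumps as base predictors, not Scikit-Learn's `AdaBoostClassifier`. The function names `adaboost_fit` and `adaboost_predict`, the dataset, and the clipping of $r_j$ to avoid an infinite $\alpha_j$ are all assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_estimators=50, eta=1.0):
    m = len(X)
    w = np.full(m, 1.0 / m)                 # initialize w^(i) = 1/m
    predictors, alphas = [], []
    for _ in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        wrong = stump.predict(X) != y
        r = w[wrong].sum() / w.sum()        # weighted error rate r_j
        r = np.clip(r, 1e-10, 1 - 1e-10)    # assumption: keep alpha_j finite
        alpha = eta * np.log((1 - r) / r)   # predictor weight alpha_j
        w = np.where(wrong, w * np.exp(alpha), w)  # boost misclassified
        w = w / w.sum()                     # normalize instance weights
        predictors.append(stump)
        alphas.append(alpha)
    return predictors, np.array(alphas)


def adaboost_predict(predictors, alphas, X, classes):
    # Sum the predictor weights voting for each class, then take the argmax.
    votes = np.zeros((len(X), len(classes)))
    for clf, alpha in zip(predictors, alphas):
        pred = clf.predict(X)
        for k, c in enumerate(classes):
            votes[pred == c, k] += alpha
    return classes[np.argmax(votes, axis=1)]


X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
classes = np.unique(y)
predictors, alphas = adaboost_fit(X, y)
y_pred = adaboost_predict(predictors, alphas, X, classes)
train_acc = np.mean(y_pred == y)
print(f"training accuracy: {train_acc:.3f}")
```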


### Gradient Boosting
* Instead of tweaking the instance weights at every iteration, [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting#Informal_introduction) fits each new predictor to the *residual errors* made by the previous predictor. **Why residuals?** The linked article explains:

"A generalization of this idea to loss functions other than squared error - and to classification and ranking problems - follows from the observation that residuals $y - F(x)$ for a given model are the negative gradients (with respect to $F(x)$) of the squared error loss function $\frac{1}{2}(y - F(x))^2$. So, gradient boosting is a gradient descent algorithm; and generalizing it entails "plugging in" a different loss and its gradient."

Classification and ranking models fit into the same "gradient descent" setting: the ensemble outputs a real-valued decision function or ranking score, and the gradient of the chosen loss is taken with respect to that output.
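
For the squared error case, the residual-fitting idea can be sketched by hand with a few regression trees. The noisy quadratic data, the tree depth, and the number of trees are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic toy data (an assumption for illustration).
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

trees = []
residual = y.copy()
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)                      # fit the current residuals
    residual = residual - tree.predict(X)      # what is still unexplained
    trees.append(tree)


def ensemble_predict(X_new):
    # The ensemble prediction is the sum of all trees' predictions.
    return sum(tree.predict(X_new) for tree in trees)


mse_one = np.mean((y - trees[0].predict(X)) ** 2)
mse_all = np.mean((y - ensemble_predict(X)) ** 2)
print(f"training MSE with 1 tree:  {mse_one:.5f}")
print(f"training MSE with 3 trees: {mse_all:.5f}")
```

Each added tree can only lower (or leave unchanged) the training error here, which is why regularization such as shrinkage and early stopping matters.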
* *Stochastic Gradient Boosting* trains each tree on a random subsample of the training instances.
* The learning_rate hyperparameter scales the contribution of each tree. Setting it to a low value is a regularization technique called *shrinkage*; it typically requires more trees but often generalizes better.
* To find the optimal number of trees, use early stopping.
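
These three points can be combined in a short sketch using `GradientBoostingRegressor`. The data and the values of `subsample`, `learning_rate`, and `n_estimators` are assumptions. Early stopping is done here by measuring the validation error at each stage with `staged_predict` and retraining with the best number of trees.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(400, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(400)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

params = dict(
    max_depth=2,
    learning_rate=0.1,   # shrinkage
    subsample=0.8,       # stochastic gradient boosting
    random_state=42,
)

gbrt = GradientBoostingRegressor(n_estimators=200, **params)
gbrt.fit(X_train, y_train)

# Validation error after 1, 2, ..., 200 trees.
val_errors = [
    mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)
]
best_n = int(np.argmin(val_errors)) + 1
print("best number of trees:", best_n)

gbrt_best = GradientBoostingRegressor(n_estimators=best_n, **params)
gbrt_best.fit(X_train, y_train)
```

Scikit-Learn can also stop training automatically through the `n_iter_no_change` and `validation_fraction` parameters, which avoids training the full 200 trees first.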

**Note:** although boosting is often better at reducing bias, its predictors must be trained sequentially, so training cannot be parallelized the way it can for bagging.

## Stacking
* Train a *meta learner* (or *blender*) that takes the predictions of the different predictors as inputs and makes the final prediction.
* A stacking ensemble can have multiple layers, as long as each meta learner is trained on a hold-out set (or on out-of-fold predictions) rather than on the data its input predictors were trained on.
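
A single-layer stacking ensemble can be sketched with Scikit-Learn's `StackingClassifier`, which trains the meta learner on out-of-fold predictions produced by cross-validation. The dataset, the base predictors, and the choice of logistic regression as the meta learner are assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the meta learner
    cv=5,  # out-of-fold predictions are used to train the meta learner
)
stack_clf.fit(X_train, y_train)

stack_acc = stack_clf.score(X_test, y_test)
print(f"stacking accuracy: {stack_acc:.3f}")
```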

## Resources

* Detailed [implementation](https://github.com/ageron/handson-ml/blob/master/07_ensemble_learning_and_random_forests.ipynb) of this chapter by [Aurélien Géron](https://github.com/ageron)
* [Bagging VS Pasting](https://stats.stackexchange.com/questions/219193/when-should-the-pasting-ensemble-method-be-used-instead-of-bagging)
* [Gini importance](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp)
* Previous [notes](https://github.com/Roger-Li/Notes/blob/master/note_cmu_ml/10_boosting.pdf) with a step-by-step demonstration of AdaBoost.