# Ensemble Methods

In this chapter, we will discuss ensemble methods in depth: why they exist, what they have to offer, and where to use them.  
Ensemble methods are a type of machine learning algorithm that combines multiple base models (we call them *"weak learners"*) in order to produce a single, stronger predictive model (a *"strong learner"*). Ensemble methods are very popular in machine learning competitions because they often produce the best predictive models. In this chapter, we will cover the following subjects:
- Bagging
- Boosting
- Stacking  

# Why Ensemble Methods?

As we've mentioned above, ensemble methods often produce the best results. Let's go back to our *1000 people* example. We know through common sense that asking a lot of people for their opinion on a topic will often lead to a better result than asking only one person. But why?  

There are a few reasons. Of course, we're not citing research right now, but think about it: different people means different skillsets, different backgrounds, different experiences, different viewpoints. By attacking a problem from so many different angles, we're bound to reach a result that's close to the truth (or at least closer than if we only had one person's opinion).  

This variety is what gives ensemble methods their power. We combine the opinions of many different models to reach a more accurate, more consistent result.
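
To put a number on the intuition, here's a minimal sketch. It assumes (and these assumptions are ours, not something established earlier) that every "person" is right with the same probability and that their mistakes are independent, which real models rarely are. The function name `majority_vote_accuracy` is made up for this example.

```python
from math import comb

def majority_vote_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is right.

    Assumes each voter is correct with the same probability p_correct
    and that voters make their mistakes independently of each other.
    """
    majority = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(majority, n_voters + 1)
    )

# Each voter alone is right 60% of the time; odd group sizes avoid ties.
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

The printed accuracy grows with the size of the group, even though no individual voter got any better. If the voters all made the same mistakes, the gain would disappear, which is why variety matters so much.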

#### What if we didn't use multiple models?

Well, we've already seen what happens. When forced to resort to a single model, we have to make multiple decisions. These are failure points. Each decision is something that requires a human to make a correct prediction about the best direction the model should take.  

Things like data availability or cleanliness (think clean = organized, tabular data; dirty = random words on a page), feature engineering (choosing which parts of the data should be used where), model selection (thinking about and choosing a model that can describe the data accurately) and hyperparameter tuning (experimenting with different values and picking the best combination) are all decisions that can lead to failure.  

In the end, we have no choice but to make decisions ourselves about those variables, and we're bound to make mistakes. Those mistakes add up. That's why ensemble methods come in handy: they allow us to average out the mistakes made by each model, so that the overall result is more accurate than any single model could have produced.  

Another way to think about it: we've discussed the [bias-variance tradeoff](../otherConcepts/bias%26variance.ipynb). Choosing one single model forces us to have this tradeoff. Averaging across multiple models would allow us to keep a better balance between bias and variance, without explicitly having to search for and choose parameters and whatnot.
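
A small worked formula makes the averaging argument more concrete. The symbols are assumptions introduced only for this sketch: we average $M$ models $f_1, \dots, f_M$, each prediction has the same variance $\sigma^2$, and any two predictions have the same correlation $\rho$.

$$
\begin{align}
\operatorname{Var}\left(\frac{1}{M}\sum_{m=1}^{M} f_m(x)\right) &= \rho \, \sigma^2 + \frac{1-\rho}{M} \, \sigma^2
\end{align}
$$

With uncorrelated models ($\rho = 0$) the variance shrinks to $\sigma^2 / M$; with identical models ($\rho = 1$) averaging gains nothing. The bias of the average is simply the average of the individual biases, so plain averaging mostly helps on the variance side of the tradeoff.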

# What do they have to offer?

We've already covered some of the benefits of these methods. Essentially, what we get is a more robust model that was built somewhat automatically (Note: this really depends on how you build your ensemble). Besides this, we avoid the curse of having to choose a model's characteristics + parameters (like choosing the degree of a polynomial model in a regression problem), since combining multiple models (that fit to multiple features) often generates a high-dimensional fit. In other words, multiple simple models get combined into a complex model that can fit data points with a lot of dimensions (even highly non-linear distributions in some cases).  

### !!

<span style="color:red">It's quite important</span> not to couple *the idea of ensemble methods* with the idea of a model. Think of this approach as a meta-model, since it can be applied to pretty much any kind of algorithm we can find out there. Indeed, most commonly we find ensemble methods applied to decision trees, but it's not a requirement. It depends on the problem, our choices, our ideas, our experience and many more factors like these.

# Where to use them?

Ensemble methods are especially popular in machine learning competitions. This is because they allow for great results without much hassle. Of course, they are not limited to competitions. Any situation where we would have to deal with a lot of hyper-parameters, non-linear data points, or a big dataset would be a good fit for ensemble methods.  
As a rule of thumb, if you're not sure what to do, try ensemble methods. They're a good starting point. If you see a problem and instantly think "Ah, yes. I'll use `<insert method name>`", then you probably know why you're doing it, so maybe ensemble methods won't be the best first choice (although in general they seem to outperform simplistic models).

# Bagging

Bagging stands for *bootstrap aggregating*. It's a method that combines multiple models to produce a more accurate result. The idea is to train multiple models on different subsets of the data, and then average their results. Think of it as a way to average out the mistakes made by each model.  

Let's consider our previous Random Forest. What we did was train multiple decision trees with random subsets of the data and random subsets of the features. Then we combined them all together to get a final result.  

Picking out random subsets of the data is called *bootstrapping*. We do this because we want to avoid overfitting. If we trained all our models on the same data, they would all be the same. We want to avoid this. "Bootstrapping" is a term that comes from statistics, and it refers to the process of sampling data with replacement (You pick $X$ samples randomly, but you don't discard them after picking them out; each sample has the chance of being picked any number of times).  
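
Here's a minimal sketch of sampling with replacement, using only the standard library. The helper name `bootstrap_sample` and the stand-in dataset are made up for this example.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(0)    # fixed seed, only so the run is repeatable
data = list(range(1000))  # stand-in dataset: 1000 distinct observations

sample = bootstrap_sample(data, rng)
unique_fraction = len(set(sample)) / len(data)
print(f"sample size: {len(sample)}, unique observations: {unique_fraction:.1%}")
```

Because picked samples are not discarded, some observations show up several times and others not at all. For a large dataset, a bootstrap sample of the same size contains on average about 63% of the distinct observations ($1 - 1/e$), which is what makes each model see a slightly different dataset.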

In our case, the RF process used bootstrapping to create different subsets of the data (and features). Then, it used each subset to train a decision tree. Finally, it combined (aggregated) the results of all the trees to get a final result. That's bagging.  
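
The three steps (bootstrap, train, aggregate) can be sketched in a few lines. This is a toy illustration, not the Random Forest we used earlier: the weak learner is a hypothetical one-dimensional decision stump, only rows are resampled (no feature subsets), and the data is made up.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a 1-D decision stump: predict 1 when x > threshold, else 0.

    Tries every observed x as the threshold and keeps the one with the
    fewest training errors.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def fit_bagged_stumps(points, n_models, rng):
    """Bootstrap the data n_models times and fit one stump per resample."""
    models = []
    for _ in range(n_models):
        resample = [rng.choice(points) for _ in range(len(points))]  # bootstrap
        models.append(fit_stump(resample))
    return models

def predict_bagged(models, x):
    """Aggregate by majority vote over the individual stump predictions."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Toy data (made up for this sketch): the true label is 1 when x > 0.5,
# and roughly 10% of the labels are flipped to act as noise.
points = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    points.append((x, y))

models = fit_bagged_stumps(points, n_models=25, rng=rng)
print(predict_bagged(models, 0.9), predict_bagged(models, 0.1))
```

Each call to `fit_stump` is independent of the others, which is why the loop in `fit_bagged_stumps` could be run in parallel. An odd number of models is used here simply to avoid tied votes.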

### Advantages

- It's a very simple method that can be applied to any kind of model.
- It can be used to reduce overfitting (it lowers variance).
- (!!!) Models can be trained in parallel. This is a big one, since it makes efficient use of modern hardware, which makes bagging a fast & scalable method.

### Disadvantages

- Limited by the quality of its components; bagging relies on the overall accuracy of the many models it combines. When faced with a really complex problem with high-dimensional data, a single, more specialized model might be a better fit.
- It's designed to average results, which reduces variance but does little about bias; as an addendum to the previous point, if the base models underfit (high bias), bagging will not fix that on its own.
- When data is insufficient, trying to sample it multiple times would probably just lead to repeated samples; it would probably not lead to a much better solution compared to one single model that takes the whole dataset as input.
- It's both flexible and rigid. Since you rely on random choices, there is a chance you will simply be unlucky (just like there are chances to get great results). To improve the chances of a good result, we would increase the number of models (thus increasing diversity) + make sure to properly sample the data (avoiding sampling bias); One other important factor (although this applies to **any kind of model**): trash data, trash results. Collecting & organizing data is even more important than picking out the right model.


### Final note on Bagging

Although we've talked about "multiple models", it's important to mention that the vision of bagging mainly relies on multiple models *of the same type*. In our case, we had the example of fitting multiple DTs. They are multiple models, but the same kind of model.  

If we require different types of models, we get away from the idea of bootstrapping, which means we're not really doing bagging anymore. Instead, depending on the case, it would probably be classified as a combination of both bagging and stacking (more on that later).

# Boosting

This family of methods shares its goal with bagging: we combine multiple models to create a strong learner that performs better. The difference is that boosting is an iterative process. We start with a single model, and then we add more models to it. Each new model is trained on the data that the previous models failed to fit.  

Let's put it another way: whereas bagging allows us to create many models in parallel (all at the same time, since they do not depend on each other in any way), boosting relies on continuously fitting weak learners to pieces of data that the previous models failed to fit.  

This generally results in reducing bias (at the risk of increasing variance).  
To sum up: *Boosting is the process of iteratively fitting multiple models (weak learners) on observations (data) which the previous attempts failed to fit correctly (or had trouble doing so either way). We can use it for both regression and classification.*

Since our main goal is to reduce bias, the models considered for boosting are usually ones with high bias and low variance. If we think about DTs, we would choose shallow trees, since they do not allow for overfitting to take place.  
Another point of view is that we choose such models because we cannot train in parallel, therefore we need to make sure that each training process is as quick as possible. As such, simple models are preferred.  

We are now faced with a few questions:  
- how do we know which observations to focus on?
- how do we combine our models?

We'll discuss 2 methods: AdaBoost and Gradient Boosting. We'll see how they differ in their approaches and what these differences mean for the final result.


# AdaBoost = Adaptive Boosting

Simply put, this method attempts to find a set of weights that are applied to each of the weak learners, so that the final result is as desired.  
$$
\begin{align}
\text{Final result} &= \sum_{i=1}^{n} \alpha_i \cdot \text{Weak learner}_i \\
\text{where } \alpha_i &\text{ is the weight of the } i^{th} \text{ weak learner}
\end{align}
$$  

How do we find these weights then? This is a tough optimization problem. What works better (although it does not necessarily guarantee the *best* result) is to iteratively fit weak learners to the data so that we find our weights step by step. Here's how it works:

We define our models such that they are dependent on the previous models:
$$
\begin{align}
s_l(x) = s_{l-1}(x) + c_l \cdot w_l(x)
\end{align}
$$
where $s_l(x)$ is the result of the strong learner after $l$ steps, $s_{l-1}(x)$ is the result after the previous step, $c_l$ is the weight of the $l^{th}$ weak learner (it plays the role of $\alpha_i$ above), and $w_l(x)$ is the result of the $l^{th}$ weak learner.  
Models with $s$ are the strong learners, and models with $w$ are the weak learners. We define our result as the strong learner up to the $l^{th}$ step.  

We choose $w$ and $c$ such that $s_l$ is the model that best fits the data.  
What does "best fits the data" mean? Of course, we turn to the loss function. We want an error score that we can minimize to find the best model.  

As such, our $w$ and $c$ are parameters chosen such that the loss function is minimized. Mathematically, we write:
$$
\begin{align}
(c_l, w_l) &= \arg \min_{c, w} \mathcal{L}(s_{l-1}(x) + c \cdot w(x), y) \\
&= \arg \min_{c,w} \sum_{n=1}^{N} \mathcal{L}(s_{l-1}(x_n) + c \cdot w(x_n), y_n) \\
\text{where } \mathcal{L} &\text{ is the loss function}
\end{align}
$$

This process allows us to optimize "locally", instead of trying to directly find the best combination of models for the whole dataset.
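
Here's a minimal sketch of this step-by-step fit. The text above leaves $\mathcal{L}$ open, so the code makes two assumptions of its own: the loss is the exponential loss $e^{-y \cdot s(x)}$ with labels in $\{-1, +1\}$, and the weak learners are one-dimensional decision stumps. Under the exponential loss the best $c_l$ for a given $w_l$ has a closed form, $c_l = \frac{1}{2} \ln \frac{1 - \text{err}}{\text{err}}$, where err is the weighted error of $w_l$. All function names and the toy data are made up for this example.

```python
import math

def stump_predict(stump, x):
    """Weak learner w(x): returns polarity if x > threshold, else -polarity."""
    threshold, polarity = stump
    return polarity if x > threshold else -polarity

def strong_predict(ensemble, x):
    """s_l(x): the weighted sum of all weak learners added so far."""
    return sum(c * stump_predict(stump, x) for c, stump in ensemble)

def fit_stagewise(xs, ys, n_rounds):
    """Greedy fit of s_l = s_{l-1} + c_l * w_l under exponential loss."""
    ensemble = []  # list of (c_l, w_l) pairs
    candidates = [(t, p) for t in xs for p in (1, -1)]
    for _ in range(n_rounds):
        # Per-sample loss of the current strong learner, exp(-y * s_{l-1}(x)),
        # normalised so it can be used as a weight distribution.
        raw = [math.exp(-y * strong_predict(ensemble, x)) for x, y in zip(xs, ys)]
        total = sum(raw)
        weights = [r / total for r in raw]

        # w_l: the stump with the smallest weighted error.
        def weighted_error(stump):
            return sum(d for d, x, y in zip(weights, xs, ys)
                       if stump_predict(stump, x) != y)

        w_l = min(candidates, key=weighted_error)
        err = min(max(weighted_error(w_l), 1e-10), 1 - 1e-10)  # avoid log(0)

        # c_l: closed-form minimiser of the exponential loss for this w_l.
        c_l = 0.5 * math.log((1 - err) / err)
        ensemble.append((c_l, w_l))
    return ensemble

# Toy data (made up for this sketch): label +1 inside (0.3, 0.7), else -1.
# No single stump can fit this, but a weighted sum of stumps can.
xs = [(n + 0.5) / 20 for n in range(20)]
ys = [1 if 0.3 < x < 0.7 else -1 for x in xs]

ensemble = fit_stagewise(xs, ys, n_rounds=3)
for l in range(1, len(ensemble) + 1):
    partial = ensemble[:l]
    loss = sum(math.exp(-y * strong_predict(partial, x)) for x, y in zip(xs, ys))
    hits = sum((strong_predict(partial, x) > 0) == (y > 0) for x, y in zip(xs, ys))
    print(f"round {l}: exp-loss = {loss:.3f}, training accuracy = {hits}/{len(xs)}")
```

Notice how the weights answer the question of which observations to focus on: points that the current strong learner gets wrong have a large loss, so they dominate the choice of the next weak learner. With a different loss function the closed form for $c_l$ would no longer apply, and both $c$ and $w$ would have to be found some other way.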