<a href="https://colab.research.google.com/github/MaralAminpour/ML-BME-Course-UofA-Fall-2023/blob/main/Week-7-Ensemble-learning/Ensemble_learning_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning Objectives:

This week, our focus is on Ensemble Learning:

- **Understanding**: We want you to get a clear idea of what ensemble learning is and why it's useful.

- **Types of Ensembles**: We'll explore two main types:
  - **Homogeneous Ensembles**: Here, we use the same type of model multiple times. Examples include **bagging**, **random forests**, and **boosting**.

  - **Heterogeneous Ensembles**: This involves using different types of models together. We'll touch on this briefly.

  - **Sequential vs. Parallel Methods**: We'll differentiate between these two. Sequential methods build on the previous model's output, whereas parallel methods operate concurrently, aiming to improve accuracy and reduce overfitting.

- **Practical Work**: We'll build a Bagging model using decision trees. If you missed our last session, don't worry; there's a ready-to-use code module for you.

- **Using Scikit-Learn**: We'll practice implementing our learned methods using this popular tool.

By the end of the week, you should:

- Know what ensemble learning is and its benefits.

- Understand the difference between bagging, boosting, and other methods.

- Be able to create a basic Bagging model.

- Know how to use these methods in Scikit-Learn.

- Have a basic idea about stacking (another ensemble method).

# Ensemble Learning

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/orchestra.png' width=500px>

Ensemble Learning is a technique in machine learning where multiple models (often referred to as "base models") are trained and their predictions are combined to produce a final result.

Ensemble learning is a concept that has been around for a while. At its core, it's based on the principle that gathering opinions from multiple sources often leads to better outcomes than relying on just one source. This idea is commonly referred to as the "wisdom of the crowd."

This week, we'll apply this principle to machine learning. By combining simple models, like the decision trees we discussed last week, we aim to create more accurate and robust predictive models that perform better on new, unseen data.


**Ensemble Learning Analogy:** Ensemble learning is like an orchestra performing a symphony. Each individual musician (model) has their own instrument (algorithm) and plays their part (makes a prediction). While each musician is talented on their own, it's when they all come together under the guidance of the conductor (ensemble method) that they produce a harmonious and powerful performance (more accurate prediction). Just as a single out-of-tune instrument can be drowned out by the harmony of the entire orchestra, the errors from a single model in ensemble learning can be offset by the correct predictions of other models. The collective output is often more beautiful and impactful than any solo performance.

## Toy example (combining by voting)

Let's break down a simple example to see how combining multiple systems can improve accuracy:

Imagine you have 10 samples, and they're all marked as positive or "1".

Now, we have three systems or classifiers named A, B, and C. Individually, each correctly identifies 8 of the 10 samples (80% accuracy). Here's what their results look like:

- A's results: {1,1,0,1,1,1,1,1,1,0}
- B's results: {0,1,1,1,1,1,1,0,1,1}
- C's results: {1,1,1,0,1,1,1,0,1,1}

Even though each is right 80% of the time, they mostly make different mistakes. Only sample 8 is missed by more than one of them.

So, how do we get better results? We use "majority voting". This means if at least two systems say a sample is "1", then we go with "1" as the final answer.

When we do this for our example, the combined result is {1,1,1,1,1,1,1,0,1,1}. Together, they're right 90% of the time!

This magic happens because the systems' mistakes don't often overlap. When two systems get it right, they can correct the third system's mistake.

However, this only works when the systems make different mistakes. If all three systems were often wrong about the same samples, then majority voting wouldn't improve accuracy.
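The vote can be checked with a few lines of Python. This is a minimal sketch that uses the three prediction lists from the example; the helper names `accuracy` and `majority_vote` are made up for illustration and are not part of the course code.

```python
# True labels: all ten samples are positive.
y_true = [1] * 10

# Predictions copied from the example above.
preds = {
    "A": [1, 1, 0, 1, 1, 1, 1, 1, 1, 0],
    "B": [0, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    "C": [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
}

def accuracy(y_pred, y_true):
    """Fraction of positions where prediction and label agree."""
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

def majority_vote(prediction_lists):
    """Per-sample vote: predict 1 when more than half of the models say 1."""
    n_models = len(prediction_lists)
    return [int(2 * sum(votes) > n_models) for votes in zip(*prediction_lists)]

y_mv = majority_vote(list(preds.values()))

for name, y_pred in preds.items():
    print(name, accuracy(y_pred, y_true))
print("majority vote", y_mv, accuracy(y_mv, y_true))
```

Printing the per-model accuracies next to the voted accuracy makes it easy to see which samples the vote repairs and which it cannot.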


When using ensembles that are correlated, the impact on improving accuracy can be minimal. Consider three correlated classifiers, each with a prediction accuracy of 70%. The predictions from these classifiers are:

- $ \hat{y}_A $ = {1,1,1,0,1,1,1,1,0,0} 70% correct
- $ \hat{y}_B $ = {1,1,1,0,1,1,1,1,0,0} 70% correct
- $ \hat{y}_C $ = {1,1,1,0,1,1,1,1,0,0} 70% correct
- $ \hat{y}_{mv} $ = {1,1,1,0,1,1,1,1,0,0} 70% correct

Because these predictions are identical, the classifiers make exactly the same mistakes, so the majority vote $ \hat{y}_{mv} $ simply reproduces them and stays at 70% accuracy.

In general, when models A, B, and C are highly correlated and their errors largely overlap, we see little or no improvement through majority voting.

## Basics of Probability Theory and Application to Classifiers

 To understand how combining multiple classifiers can enhance accuracy, we can use probability theory.

1. **Basics of Probability Theory**: Before diving deep, it's important to grasp some fundamental probability concepts.

    - **Independent Events**: These are events that don't influence one another. The probability of all of them occurring is the product of their individual probabilities. Mathematically:
$ P(\hat{y}_A \cap \hat{y}_B \cap \hat{y}_C) = P(\hat{y}_A) \times P(\hat{y}_B) \times P(\hat{y}_C) $

    - **Mutually Exclusive Events**: These are events that can't occur at the same time. The combined probability for such events is the sum of their individual probabilities. For instance:
    
      $$ P(\hat{y}_A \cup \hat{y}_B) = P(\hat{y}_A) + P(\hat{y}_B) $$

2. **Application to Classifiers**: When dealing with multiple classifiers, like in our example, these principles come into play. Each classifier might have an accuracy of 70%, but they don't necessarily make mistakes on the same data points. This diversity in their errors can be leveraged.

    - If one classifier mispredicts an outcome, the others might get it right. So, by taking a majority vote from the classifiers, we can 'correct' these individual errors.

    - Picture this as having three friends, each good at answering 70% of the questions in a quiz. If they were to work together, taking a collective decision on each question, their combined expertise could lead to an even higher score.

3. **Potential Outcomes**:
    - **Best Case**: All classifiers predict the right outcome, leading to 100% accuracy for that particular instance.

    - **Average Case**: If their errors are independent, combining them through majority voting can push the accuracy up to about 78% across a larger set of data.

    - **Worst Case**: If the classifiers all make the same mistakes, then combining them won't help, and the accuracy remains at 70%.

In essence, the principle behind combining classifiers is to capitalize on their individual strengths and offset their weaknesses. This method often leads to a more robust and accurate predictive system, as confirmed by our probability theory.

## Toy example (combining by voting)

In this example, we're working with three separate prediction models. Each of these models gets things right about 70% of the time (each prediction $ \hat{y}_i $ is correct with probability 0.7).

From this setup, four distinct situations can arise for any given example:


1. **All three independent models are correct:**
All three models hit the mark. Since each model's prediction doesn't affect the others, the combined chance of all being right is 0.7 (the chance one is right) times itself twice. Doing the math: $0.7 \times 0.7 \times 0.7 = 0.343$.

$ P(\hat{y}_1 = 1, \hat{y}_2 = 1, \hat{y}_3 = 1) = 0.7 \times 0.7 \times 0.7 = 0.343 $

2. **Two models are correct (with 3 different possible mutually exclusive combinations)**:

Two models get it right, and one doesn't. There are three ways this can play out: either the first, second, or third model could be the one that's off. These scenarios can't all occur at once, so they're separate from one another. For each of these cases, the probability is $0.7 \times 0.7 \times 0.3$. Since there are three such cases, the total chance for this situation is $3 \times 0.7 \times 0.7 \times 0.3 = 0.441$.

- $ P(\hat{y}_1 = 1, \hat{y}_2 = 1, \hat{y}_3 = 0) $ OR

- $ P(\hat{y}_1 = 1, \hat{y}_2 = 0, \hat{y}_3 = 1) $ OR

- $ P(\hat{y}_1 = 0, \hat{y}_2 = 1, \hat{y}_3 = 1) $

$ 0.7 \times 0.7 \times 0.3 + 0.7 \times 0.3 \times 0.7 + 0.3 \times 0.7 \times 0.7 = 0.147 + 0.147 + 0.147 = 0.441 $

3. **Two models are wrong:**

It's also possible for two models to miss the mark and only one to get it right. Calculating in the same way, the combined chance for this scenario is 0.189.

- $ P(\hat{y}_1 = 1, \hat{y}_2 = 0, \hat{y}_3 = 0) $ OR

- $ P(\hat{y}_1 = 0, \hat{y}_2 = 1, \hat{y}_3 = 0) $ OR

- $ P(\hat{y}_1 = 0, \hat{y}_2 = 0, \hat{y}_3 = 1) $

$ 0.7 \times 0.3 \times 0.3 + 0.3 \times 0.7 \times 0.3 + 0.3 \times 0.3 \times 0.7 = 0.063 + 0.063 + 0.063 = 0.189 $

4. **All three are wrong:**

Lastly, there's the slim chance that all three models mess up. This has a likelihood of $0.3 \times 0.3 \times 0.3 = 0.027$.

$ P(\hat{y}_1 = 0, \hat{y}_2 = 0, \hat{y}_3 = 0) = 0.3 \times 0.3 \times 0.3 = 0.027 $


**Probability of all possible events** = 0.343 + 0.441 + 0.189 + 0.027 = 1

When you put all these situations together, the probabilities add up to 1, which makes sense because these four scenarios cover every possible outcome.
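The four cases can also be obtained by brute-force enumeration. The sketch below assumes, as the example does, that the three models are independent and each is correct with probability 0.7; the variable names are illustrative only.

```python
from itertools import product

p_correct = 0.7   # accuracy of each independent model (from the example)
n_models = 3

# prob_by_count[k] = probability that exactly k models are correct
prob_by_count = [0.0] * (n_models + 1)

# Each outcome is a tuple such as (1, 0, 1): 1 = model correct, 0 = model wrong.
for outcome in product([1, 0], repeat=n_models):
    prob = 1.0
    for correct in outcome:
        # Independence lets us multiply the per-model probabilities.
        prob *= p_correct if correct else (1 - p_correct)
    # Outcomes are mutually exclusive, so their probabilities add.
    prob_by_count[sum(outcome)] += prob

for k in range(n_models, -1, -1):
    print(f"{k} correct: {prob_by_count[k]:.3f}")
print(f"total: {sum(prob_by_count):.3f}")
```

The inner multiplication is the independence rule and the accumulation is the mutually-exclusive rule from the probability basics section.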


## continued: Toy example (combining by voting)

In this scenario (each $ \hat{y}_i $ is correct 70% of the time), we want to know the chance that at least 2 predictions out of 3 are accurate. So, we need to look at situations where either exactly 2 predictions are correct or all 3 are.

From our calculations, the probability that exactly 2 are right is about 44%, or 0.441.

For a majority vote with 3 members, we can expect 4 classes of mutually exclusive outcomes.
In other words:
Probability of all possible events = $0.343 + 0.441 + 0.189 + 0.027 = 1$

If we add the cases where all 3 predictions hit the mark, the overall accuracy rises to about 78% (0.784).

When taking a majority vote of 3 predictions, the result will be correct when at least 2 of the predictions are correct.
- The total probability of any combination of results where exactly 2 are right is $0.441$.
- Thus majority voting will correct a single model's mistake ~ 44% of the time.
- Adding the cases where all 3 are correct, the ensemble will be correct on average $0.441 + 0.343 = 0.784$, i.e. 78.4% of the time.
- This increases to about 83.7% of the time if we instead combine 5 classifiers.

A cool thing to highlight is that the more independent models we use, the better our results. For example, if we combine the votes of 5 models, the accuracy climbs to about 83.7%.
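The same calculation generalizes to any odd number of voters through the binomial distribution. The function below is a sketch under the same assumption of independent models with equal accuracy; `majority_vote_accuracy` is a name invented for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """P(more than half of n_models independent models are correct); n_models odd."""
    k_min = n_models // 2 + 1   # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

for n in (1, 3, 5, 11):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Real classifiers are rarely fully independent, so these numbers are best read as an upper guide rather than a guarantee.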


## Understanding the Bias-Variance Tradeoff

When we discuss prediction models, prediction errors can be decomposed into two main subcomponents we care about: error due to "bias" and error due to "variance". There is a tradeoff between a model's ability to minimize bias and variance. Understanding these two types of error can help us diagnose model results and avoid the mistake of over- or under-fitting.

Think of understanding bias and variance like playing darts on a target board. The bullseye in the middle is the perfect prediction. The farther away the darts land from the bullseye, the more off our predictions are.

Every time we make a model using different training data, it's like throwing a dart. Sometimes, the data is great and our dart (prediction) is close to the bullseye. But sometimes, if our data has odd values or outliers, our dart might land far off.

So, with all these darts (models), we can see a pattern on the board. This pattern helps us understand different scenarios of bias and variance, both high and low.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/target.webp' width=500px>

[Source](http://scott.fortmann-roe.com/docs/BiasVariance.html)

### **Understanding Bias and Variance**

**Bias Error:** This is the difference between what our model predicts on average and the actual true values. If there's a high bias error, it means our model isn't doing a great job because it's missing key patterns.

**Variance Error:** This is about the consistency of our model's predictions. If there's high variance, it means the model's predictions change a lot depending on the sample data. If it's too high, the model might fit our training data too closely, but do a poor job with new data.

**Noise (or Irreducible Error):** No matter how good we get, our measurements won't always be perfect. This error comes from things out of our control, like if there's an error in how we measure something.

### **Understanding Over- and Under-Fitting**

At its root, dealing with bias and variance is really about dealing with over- and under-fitting. Bias is reduced and variance is increased in relation to model complexity. As more and more parameters are added to a model, the complexity of the model rises and variance becomes our primary concern while bias steadily falls. For example, as more polynomial terms are added to a linear regression, the resulting model's complexity grows. In other words, bias has a negative first-order derivative in response to model complexity while variance has a positive slope.
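The polynomial example can be tried out directly. The sketch below is only an illustration: the sine-curve ground truth, the noise level, the sample sizes, and the chosen degrees are all assumptions made for this demo, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine curve (a hypothetical ground truth)."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit; a higher degree means a more complex model.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error can only go down as terms are added, while the test error typically stops improving once the model starts fitting noise; comparing the two columns is a quick way to spot under- and over-fitting.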


<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/bias_varaince.png' width=300px>




## Ensemble Learning: overview

The aim is to combine many weak learners whose errors are uncorrelated into one strong predictor by reducing variance and/or bias.

There are a few ways to do this.

First, there are homogeneous ensembles. These use the same kind of machine learning model for every base learner. Examples are bagging, random forests, and boosting. We'll mostly talk about these in the videos.

Second, there are heterogeneous ensembles, also called stacking models. These combine different types of models, like logistic regression, support vector machines, and random forests. You can even include homogeneous ensembles as members.

We'll only briefly talk about heterogeneous ensembles this week, but what you learn about homogeneous ensembles will help you understand how to build a stacking model.

To sum up:

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/review.jpg' width=400px>

Homogeneous ensembles use the same type of model:
- Bagging
- Random Forests
- Boosting

Heterogeneous ensembles use different types of models:
- Stacking


## Types of Ensemble Learning

The goal is to combine several simple (weak) learners into one strong learner. There are two main ways to do this, and each affects the prediction error differently.


<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/sequential.webp' width=500px>


The first way, called "Parallel," trains many simple learners at the same time. They work independently, and their predictions are combined at the end. This lowers the variance of the final prediction by averaging out individual mistakes. Examples are Voting, Bagging, and Random Forests.

The second way, called "Sequential," adds one simple learner after another. Each new learner pays more attention to the examples its predecessors got wrong. This mainly lowers the bias of the ensemble and makes the final prediction more accurate. This is mostly seen in Boosting methods, like AdaBoost, GentleBoost, and Gradient Boosted Trees.

So, in short:

Parallel:
- Many learners are trained at the same time.
- Lowers variance by averaging out individual mistakes.
- Examples: Voting, Bagging, Random Forests.

Sequential:
- Learners are added one by one.
- Each new learner focuses more on past mistakes.
- Makes the final prediction more accurate by lowering bias.
- Can also lower variance, although boosting can still overfit when the data are noisy.
- Examples: AdaBoost, GentleBoost, Gradient Boosted Trees (all Boosting methods).

Knowing the difference between these two approaches is important because they reduce different components of the error.

## Summary

Ensemble learning combines predictions from many models. These models can either be of the same kind, called "homogeneous," or of different kinds, called "heterogeneous."

In the first approach, called "Parallel," many models are trained at the same time and their results are combined at the end. This reduces the variance of the final prediction. Examples of this are Voting, Bagging, and Random Forests.

In the second approach, called "Sequential" (or "Serial"), models are added one by one. Each new model pays more attention to the mistakes made before. This reduces the bias of the final prediction. This method is known as Boosting.

## So, to sum up:

- Ensemble learning methods aggregate predictions made from many different learning models

- They can either combine methods of the same class (homogeneous) or different classes (heterogeneous)

- They can be trained in parallel and combined at the end to reduce model variance (voting/Bagging/Forests)

- Or trained in serial, one after another, upweighting misclassified examples each time to reduce bias (and decrease variance) - Boosting


## Bagging (Also called Bootstrap Aggregating)

Bagging, also known as Bootstrap Aggregating, is a technique that tries to make the final prediction more stable (reduce model variance). **How?** By combining many models, typically decision trees, each trained on a different bootstrap sample of the main data set.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/bootstrapping.webp' width=500px>

Bagging fits a base learner (classifier) on each random sample drawn from the original dataset (bootstrapping). Because the ensemble is parallel, the base learners are trained independently of each other, and each one sees a slightly different version of the data.
Next, bagging combines (aggregates) the predictions of all the learners, by averaging their outputs for regression or by majority vote for classification, to get the final result.
The Random Forest (RF) algorithm can reduce the overfitting problem of single decision trees. A random forest is an ensemble of decision trees: it builds a forest of many randomized decision trees.
RF and Bagging work almost the same way. The difference is that, at each node, RF considers only a random subset of the features and picks the best split among them.
The resulting diversity among the trees reduces the variance and gives smoother predictions.


The idea behind bagging rests on a simple statistical rule: when you average many independent estimates, the variance of the average is much smaller than the variance of any single estimate. Let $T$ be the total number of models (or trees) and suppose each model's prediction has variance $\sigma^2$. If the predictions are independent, the variance of their average is $\sigma^2 / T$, that is, $1/T$ of the variance of a single model. In practice, bagged models are trained on overlapping data and are therefore correlated, so the reduction is smaller than this ideal.
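The $1/T$ rule is easy to check by simulation. In this sketch each "model" is just an independent random number with variance 1, which is an idealization chosen for illustration; real bagged trees are correlated and will not reach this bound.

```python
import random
import statistics

random.seed(0)

def variance_of_average(n_models, n_trials=20_000):
    """Empirical variance of the mean of n_models independent unit-variance guesses."""
    averages = []
    for _ in range(n_trials):
        guesses = [random.gauss(0.0, 1.0) for _ in range(n_models)]
        averages.append(sum(guesses) / n_models)
    return statistics.pvariance(averages)

variances = {T: variance_of_average(T) for T in (1, 5, 25)}
for T, var in variances.items():
    print(f"T={T:2d}  variance of average={var:.3f}  expected 1/T={1 / T:.3f}")
```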

Last time, we got our hands dirty with creating a decision tree. Today, we're taking it a step further. We'll see how this single decision tree can be a small but vital part of the bigger bagging method. By averaging out the predictions from many trees, bagging can significantly reduce the variance of the final prediction.

But here comes a common question: If we start with just one main set of data, **where do all these varied data pieces, used to train different trees, come from?** The simple answer is that these varied data sets are created by taking different samples from the main data set. This sampling method is what allows bagging to have so many unique decision trees, even when working from one main dataset.

## Bootstrapping

The bootstrap method is a statistical technique that generates multiple new datasets from one original dataset. To do this, random data points are selected from the original dataset, and after each selection, that data point is put back (or "replaced") so it can potentially be chosen again. This process of sampling with replacement is called "resampling".

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/bootstrapping.webp' width=500px>

Imagine a bag full of marbles, each marble representing a data point. Now, if you were to draw a marble, note its color, and then put it back into the bag before drawing again, you're essentially practicing the bootstrap method.

Because each data point (or marble) has the same chance of being picked every time, over multiple draws, some will be selected more often than others, while some might not be picked at all. This randomness ensures that each subset we create has a unique combination of data points.

The introduction of such randomness alters the mean (average) and standard deviation (a measure of the spread of data points) of these subsets compared to the original dataset. These changes introduced by bootstrapping can make machine learning models more robust. Essentially, by training the model on slightly different versions of the data, it becomes better equipped to handle a variety of scenarios, thereby enhancing its reliability and performance.

In the context of ensemble methods, which involve using multiple models (or "base learners" and "classifiers") to make predictions, bootstrapping plays a pivotal role. Each of these models is trained on a different bootstrap subset, allowing them to learn from different "perspectives" of the data. When their individual predictions are combined, the final result is often more accurate and reliable than any single model could produce on its own.



**Bootstrap Sampling**

1. Take the original dataset $X$ which has $N$ training examples.

2. Create $T$ copies, denoted as $\{ \hat{X}_t \}^{T}_{t=1}$, by sampling with replacement.

   - Each $\hat{X}_t$ contains $N$ examples (or rows).
   
   - Each $\hat{X}_t$ is different because some examples might appear more than once, while others might not be included at all.

For large sample sizes, Bootstrap sampling will closely match the sampling distribution derived from the entire population.
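The two steps above can be sketched with NumPy. The helper `draw_bootstrap` is a hypothetical stand-in written for this illustration (it is not the notebook's `bootstrap_sample`), and the dataset is just the integers 0 to 999.

```python
import numpy as np

def draw_bootstrap(X, rng):
    """Return N rows drawn from X uniformly at random, with replacement."""
    n = len(X)
    idx = rng.integers(0, n, size=n)   # row indices; repeats are allowed
    return X[idx], idx

rng = np.random.default_rng(0)
X = np.arange(1000)   # stand-in dataset with N = 1000 rows
T = 5                 # number of bootstrap copies

samples = [draw_bootstrap(X, rng) for _ in range(T)]
for t, (X_hat, idx) in enumerate(samples, start=1):
    unique_fraction = len(np.unique(idx)) / len(X)
    print(f"copy {t}: size={len(X_hat)}, unique rows={unique_fraction:.2f}")
```

Each copy has the same number of rows as the original, but only a fraction of the original rows appear in it because some are drawn more than once.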


## Why Bootstrapping Works?

Run the next code cell from `Notebook 7.Ensemble_Learning.ipynb`. Using its `bootstrap_sample` function, we can illustrate how bootstrapping helps in approximating sampling distributions that would typically require estimation from entire populations.

Suppose we had a vast random real-world population. For this illustration, we'll create an extensive array containing 100,000 integers ranging from 0 to 19. Imagine this array as representing a real-world population, say, if we wanted to gather brain scans from every human. It would be impractical, given our resources, to sample everyone.

In such scenarios, we'd take as large a sample as possible, which we'll refer to as DATA_sample.

Our objective is then to evaluate the efficacy of estimating a sample mean by continually taking bootstrapped samples from our DATA_sample. We want to compare this to the results we'd obtain if we had the means to consistently draw substantial samples from the entire population.
1. Create a very large random population (in real-world examples this would be unavailable)
```python
import numpy as np
DATA_population = np.random.randint(0, 20, 100000)  # integers 0..19 (upper bound is exclusive)
```

2. Simulate the process of generating random samples from this population for different sample sizes
```python
DATA_sample = np.random.choice(DATA_population, sample_size)
```

3. Take bootstrapped samples:
```python
sample = bootstrap_sample(DATA_sample)
```

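- Repeat this $n\_samples$ times, sampling with replacement each time.

- Record the mean of each bootstrapped sample, giving $n\_samples$ bootstrapped estimates of the mean.

4. Compare against the means of $n\_samples$ samples drawn from the true population (again, not practical to obtain in reality)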
```python
population_sample = np.random.choice(DATA_population, sample_size)
```
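Put together, the four steps might look like the sketch below. It does not use the notebook's `bootstrap_sample`; resampling is done inline with NumPy's `Generator.choice`, and the values of `sample_size` and `n_samples` are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Simulated population (unavailable in a real study): integers 0..19.
DATA_population = rng.integers(0, 20, 100_000)

sample_size = 1000   # assumed value for illustration
n_samples = 500      # number of resamples (assumed)

# 2. The one sample we could realistically collect.
DATA_sample = rng.choice(DATA_population, sample_size)

# 3. Bootstrapped means: resample DATA_sample with replacement.
bootstrap_means = np.array([
    rng.choice(DATA_sample, sample_size, replace=True).mean()
    for _ in range(n_samples)
])

# 4. Reference: means of fresh samples drawn from the population.
population_means = np.array([
    rng.choice(DATA_population, sample_size).mean()
    for _ in range(n_samples)
])

print(f"population mean:       {DATA_population.mean():.2f}")
print(f"bootstrap means:       center={bootstrap_means.mean():.2f}, spread={bootstrap_means.std():.3f}")
print(f"population-sample means: center={population_means.mean():.2f}, spread={population_means.std():.3f}")
```

The bootstrap distribution is centered on the mean of `DATA_sample` rather than on the population mean, but its spread is a good estimate of the true sampling variability.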


## Plotting resulting distributions as histograms

When we visualize the results for various sample sizes, we expect that, with a sufficiently large sample, both distributions are centered around the true mean, which is 9.51.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/result1.png' width=700px>

In our initial illustration where the sample size is 1, it's unsurprising that our approximation is off-mark. Given that DATA_sample contains only a single row, no matter how many times we bootstrap from it, the result remains constant - in this instance, it's 3, which deviates significantly from the mean.

Yet, as we increase the sample size, especially beyond 1000, bootstrapping begins to provide an accurate representation of the genuine sampling distribution. The accuracy improves further, reaching a near-perfect overlap when the sample size is a tenth of the full population's size.

## Decision tree

In machine learning, decision trees have a huge impact on decision-based analysis problems. They cover both classification and regression. As the name implies, they use a tree-like model containing nodes and leaves.

In the image below, the model's root is at the top—it's an upside-down tree! Here, **conditions are internal nodes** and **outcomes are leaf nodes**.
Notice that the decision tree contains several **if-else statements** in a strict order, making the model **inflexible**. It gives rise to **overfitting**, which occurs when a function fits the data too well. That means the model will be accurate only on the data it's been trained on. The model fails to work if there is **a slight change** in the training data or new data.

Using one decision tree can be problematic and might not be stable enough; however, using **multiple decision trees and combining their results** often works much better. Combining multiple classifiers in a prediction model is called ensembling. The basic idea of ensemble methods such as bagging is to reduce the error by reducing the variance.



## Ensemble methods: blind men and the elephant

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/elephant.jpg' width=500px>


**Ensemble methods** are techniques that combine the decisions from several base machine learning (ML) models into one predictive model that performs better than its members. Consider the fable of the blind men and the elephant depicted in the image above. The blind men are each describing an elephant from their own point of view. Their descriptions are all correct but incomplete. Their understanding of the elephant would be more accurate and realistic if they came together to discuss and combined their descriptions.

## Bagging (Bootstrap Aggregating)

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/Ensemble_Bagging.svg.png' width=500px>


Bagging, which stands for "Bootstrap Aggregating", is a machine learning ensemble method designed to improve the accuracy and robustness of a model. Here's a breakdown of how it works and its primary benefits:

1. **Bootstrapping:** Bagging begins by creating multiple subsets of the original dataset. These subsets are formed by randomly sampling the dataset with replacement, which means some data points might be repeated in a single subset while others might be left out.

2. **Training Multiple Models:** For each of these bootstrapped subsets, a separate model (often a decision tree, but it can be any type) is trained. As a result, you end up with multiple independently trained models.

3. **Aggregation of Predictions:** When making predictions, all the models in the ensemble cast their "votes". For regression problems, the final output is usually the average of all the model outputs. For classification problems, the final prediction is typically the class that gets the majority vote from all the models.


In summary:

1. **Create bootstrapped samples** from data sets
2. **Train a separate weak learner** (e.g. Trees) on each Bootstrap sample
3. **Test data on each weak learner**
4. **Aggregate results**; how?

    * For classification: majority voting
        $$ f(X) = \text{mode}\{f_1(X), f_2(X), \dots, f_T(X)\} $$

    * For regression: averaging
        $$ f(X) = \frac{1}{T} \sum_{t=1}^{T} f_t(X) $$

**Note:** We will code this from scratch in our tutorial.
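The aggregation step on its own is short enough to sketch here. The two helper functions and the toy predictions are invented for this illustration; `predictions[t][i]` is taken to mean the output of model $t$ for sample $i$.

```python
from collections import Counter
from statistics import fmean

def aggregate_classification(predictions):
    """Majority vote (mode) across models for each sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

def aggregate_regression(predictions):
    """Average across models for each sample."""
    return [fmean(outputs) for outputs in zip(*predictions)]

# Three models, three samples (classification).
class_preds = [
    ["red", "blue", "red"],
    ["red", "red", "blue"],
    ["blue", "red", "red"],
]

# Three models, two samples (regression).
reg_preds = [
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
]

voted = aggregate_classification(class_preds)
averaged = aggregate_regression(reg_preds)
print(voted)
print(averaged)
```

With two classes and an odd number of models the vote can never tie; in other settings `most_common` simply returns the label it encountered first among the tied ones.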


## Key Benefits of Bagging:

- **Reduction in Variance:** By averaging the predictions from multiple models, bagging tends to reduce the variance, making the final prediction more stable and less susceptible to the fluctuations in the training data.

- **Handles Overfitting:** Individual models, especially complex ones like deep decision trees, can sometimes overfit to their training data. By averaging predictions over multiple models, bagging can mitigate the overfitting problem.

- **Parallel Computation:** Since each model is trained independently, bagging is inherently parallelizable, making it efficient to compute, especially with modern multi-core processors or distributed computing environments.

- **Out-of-Bag Evaluation:** One unique advantage of bagging is the ability to use out-of-bag (OOB) samples (samples not used in a particular bootstrapped set) to validate the performance of the model, eliminating the need for a separate validation set.

In essence, bagging leverages the power of multiple models to achieve better generalization and accuracy than would be possible with any single model.

**Comparison between Bagging and Decision Trees Performance:**

For a decision tree, training on this dataset yields an accuracy of 0.875. Notably, the blue examples in the data are incorrectly labeled; they should be in the red class.

In contrast, the bagged model correctly classifies nearly all the data points. It makes a single error, misclassifying an isolated blue point.

**Understanding Out-of-Bag (OOB) Error in Bagging**

The Out-of-Bag (OOB) error provides a way to gauge the performance of a bagging ensemble without requiring a separate validation set. Here's how it works and why it's important:

**Evaluating Error in a Bootstrapped Ensemble Approach:**
When we use techniques such as bagging, which involves bootstrapped training sets, a certain portion of the training data doesn't get included in each bootstrap sample. This non-included data serves as an automatic "left-out" or "out-of-bag" set.

**Key Features of OOB Error Calculation:**
1. **Automatic Left-out Set:** With every bootstrapped training set, some data points don't get selected. For sizable training datasets, roughly 37% of the data points aren't included in a particular bootstrap sample.
  
2. **Validation through OOB:** The OOB samples can be used to validate or test the model. For instance, for a given data point, $x_i$, which is left out of a particular bootstrap sample, we can test the model trained on that sample using $x_i$.

3. **Prediction Aggregation:** For a given left-out data point, $x_i$, predictions from all the bootstrap samples that didn't contain $x_i$ are aggregated. This means if we had 100 bootstrap samples and $x_i$ was left out in 37 of them, we would get 37 predictions for $x_i$ which we can then average (for regression problems) or take a majority vote (for classification problems).

4. **Accuracy/Prediction Score:** The predictions from the models, based on the OOB samples, are aggregated to compute a single accuracy or prediction score for the entire ensemble.

5. **Estimation for Classification:** For classification tasks, the OOB error is the complement of the accuracy obtained from the OOB samples, that is, OOB error = 1 − OOB accuracy. For example, if the accuracy is 90%, the OOB error would be 10%.
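The points above can be checked with a short sketch. It first measures how much of the data a single bootstrap sample leaves out (the theoretical value for large datasets is $1/e \approx 0.368$), then asks Scikit-Learn for the OOB accuracy of a bagged ensemble. The `make_moons` toy data, the sample sizes, and the variable names are assumptions chosen for illustration; they are not the dataset used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 10_000

# Draw one bootstrap sample (n indices, with replacement) and measure
# the fraction of original points that were never selected.
sample = rng.integers(0, n, size=n)
left_out_fraction = 1 - np.unique(sample).size / n
print(f"left-out fraction: {left_out_fraction:.3f} (theory: {np.exp(-1):.3f})")

# Toy data (an assumption, not the notebook's dataset).
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

# oob_score=True makes scikit-learn score every training point using
# only the trees whose bootstrap sample did not contain that point.
# The base estimator is passed positionally so the call works across
# scikit-learn versions that name this argument differently.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    oob_score=True,
    random_state=0,
)
bag.fit(X, y)

oob_accuracy = bag.oob_score_
oob_error = 1 - oob_accuracy
print(f"OOB accuracy: {oob_accuracy:.3f}, OOB error: {oob_error:.3f}")
```

No separate validation set is created here: the OOB estimate comes entirely from the training data.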

**Visual Representation:** The accompanying graph titled "OOB error for Bagged Classifier" plots the OOB error against the number of trees in the ensemble. From the graph, we can observe how the OOB error fluctuates as the number of trees increases. This is a valuable visualization as it can guide the decision on the optimal number of trees to use in the ensemble. Typically, as more trees are added, the OOB error tends to stabilize or decrease, indicating the ensemble's improved performance.
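A curve of this kind could be reproduced roughly as follows. This is a sketch on assumed toy data (`make_moons`), with an arbitrary list of ensemble sizes; it prints the values rather than plotting them, and the numbers will differ from the graph described above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption, not the notebook's dataset).
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

tree_counts = [25, 50, 100, 200]  # ensemble sizes to compare (arbitrary choice)
oob_errors = []

for n_trees in tree_counts:
    bag = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=n_trees,
        oob_score=True,
        random_state=0,
    )
    bag.fit(X, y)
    # OOB error is the complement of the OOB accuracy.
    oob_errors.append(1 - bag.oob_score_)

for n_trees, err in zip(tree_counts, oob_errors):
    print(f"{n_trees:4d} trees -> OOB error {err:.3f}")
```

Passing `tree_counts` and `oob_errors` to a plotting function such as `matplotlib.pyplot.plot` would give the corresponding graph.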

## Boosting

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/boosting.svg.png' width=500px>

Boosting is sequential: the models are trained one after another, and each new model is trained with knowledge of the errors made by the models before it. When a base classifier misclassifies a training example (highlighted by a red box in the figure), the weight of that example is increased (over-weighting), so that the next base learner concentrates on classifying it correctly.

After training these models, their results are combined, typically through a weighted vote or weighted sum, to make the final predictions.
There are several popular variations of boosting, including AdaBoost, Gradient Boosting, and XGBoost. AdaBoost re-weights the training examples each time a new model is added to the ensemble. Gradient boosting instead fits each new model to the negative gradient of the loss function with respect to the current ensemble's predictions (for squared-error loss, these are simply the residuals).
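To make the re-weighting step concrete, here is a minimal hand-written sketch of AdaBoost for two classes, using decision stumps as base learners. It is a simplified illustration under stated assumptions (toy `make_moons` data, labels recoded to -1/+1, 20 boosting rounds) and is not the exact algorithm implemented in Scikit-Learn.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption); recode labels from {0, 1} to {-1, +1}.
X, y01 = make_moons(n_samples=300, noise=0.3, random_state=0)
y = 2 * y01 - 1

n_rounds = 20
weights = np.full(len(y), 1 / len(y))  # start with uniform example weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Train a weak learner (depth-1 tree) on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error of this learner, clipped to avoid division by zero.
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this learner's vote strength

    # y * pred is -1 for misclassified examples, so their weights grow;
    # correctly classified examples are down-weighted.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of all stump votes.
votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_acc = np.mean(np.sign(votes) == y)
print(f"training accuracy after {n_rounds} rounds: {train_acc:.3f}")
```

In practice you would normally use `sklearn.ensemble.AdaBoostClassifier` or `sklearn.ensemble.GradientBoostingClassifier` rather than a hand-written loop; the sketch is only meant to show where the over-weighting of misclassified examples happens.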

## When not to use Bagging

- Bagging seeks to create uncorrelated predictions by training on multiple datasets generated by bootstrap sampling from the original dataset

- Useful for datasets that have high variance and are noisy

- If predictions across trees vary considerably, bagging can really help

BUT

- If predictions are stable across trees, then bagging may even lead to a degradation of the result

**In detail:**

### Understanding Bagging
- **Uncorrelated Predictions**:

Bagging's primary objective is to produce uncorrelated predictions. This means that it aims to ensure that individual models (like decision trees) aren't making the same errors. By having a diverse set of predictions, the aggregated result is often more robust and accurate.

- **Bootstrap Sampling**:

Bagging achieves this diversity by training multiple models on different versions of the dataset. These versions are created using bootstrap sampling: each one is drawn at random from the original data with replacement, so some points appear more than once and others not at all.

- **Dealing with Noisy Data**:

Bagging shines when the data are noisy or the base models have high variance. Individual models may overfit to different parts of the data, but when their predictions are aggregated, these individual errors tend to cancel out, leading to a model that generalizes better.

- **Varying Predictions**:

If the individual models (trees, in many cases) produce very different predictions for the same data point, bagging can be extremely beneficial. By combining these diverse predictions, bagging often arrives at a more accurate and stable result.

### The Caveat
However, like all techniques, bagging isn't universally beneficial.

- **Stable Predictions Across Models**:

If the base models are producing very similar or stable predictions, bagging might not offer much advantage. The principle behind bagging is to bring together diverse models to counteract their individual errors. If there's little diversity in the predictions, then bagging doesn't have much to "work with."

- **Potential Degradation**:

In scenarios where predictions are already stable across trees (or other base models), bagging might not only be redundant but might also degrade performance. Each bootstrap sample contains only about 63% of the distinct training points, so every base model is trained on less information than a single model fit to the full dataset. When there is little variance to average away, that loss is not offset by the benefits of diversity.

In conclusion, while bagging is a powerful tool for many machine learning tasks, it's essential to understand the nature of your data and the behavior of your models. If your base models are already performing consistently and with low variance, the added complexity of bagging might be unnecessary and could even be counterproductive.
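One rough way to check whether a base model is "stable" is to train it on several bootstrap samples and see how often the resulting models disagree with their own majority vote. The sketch below does this for a fully grown decision tree and for logistic regression. The toy data, the train/test split, the number of bootstrap models, and the helper `bootstrap_disagreement` are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Toy data (an assumption); first 350 points for training, rest for testing.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, y_train = X[:350], y[:350]
X_test = X[350:]

def bootstrap_disagreement(make_model, n_models=25):
    """Average fraction of test points on which a bootstrap-trained model
    disagrees with the majority vote of all bootstrap-trained models."""
    preds = []
    for seed in range(n_models):
        # Draw a bootstrap sample of the training set (with replacement).
        Xb, yb = resample(X_train, y_train, random_state=seed)
        preds.append(make_model().fit(Xb, yb).predict(X_test))
    preds = np.array(preds)
    majority = (preds.mean(axis=0) >= 0.5).astype(int)
    return float(np.mean(preds != majority))

tree_disagreement = bootstrap_disagreement(
    lambda: DecisionTreeClassifier(random_state=0)
)
logreg_disagreement = bootstrap_disagreement(lambda: LogisticRegression())

print(f"decision tree disagreement:       {tree_disagreement:.3f}")
print(f"logistic regression disagreement: {logreg_disagreement:.3f}")
```

A high disagreement rate suggests an unstable, high-variance base model that bagging is likely to help; a rate close to zero suggests that the bootstrap models are nearly identical and that bagging has little to average away.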


## Extension of Bagged Ensembles of Decision Trees: Decision Forests

In the world of machine learning, ensembles have become a powerful way to improve the performance of individual models. One such ensemble technique is the bagging of decision trees, which has led to the evolution of Decision Forests. But what differentiates a Decision Forest from a simple bagged ensemble of decision trees? Let's dive deep into this extension.

### The Core Idea
Decision Forests, often known as Random Forests, are an evolution of the bagging technique applied specifically to decision trees. The primary goal remains consistent: reduce overfitting and enhance the model's accuracy by combining multiple trees' predictions. However, Decision Forests introduce additional layers of randomness to achieve these goals more effectively.

### Key Features of Decision Forests:

1. **Random Feature Selection**:
   - In traditional decision trees, at each node, the best feature is chosen to split the data based on some criterion (like Gini impurity or information gain).
   - Decision Forests add a twist to this. Instead of allowing every feature to be considered for a split at each node, they randomly select a subset of features. This subset is then evaluated, and the best feature from this subset is used for the split.
   - This approach increases randomness, ensuring that individual trees in the forest are not just different due to bootstrapped data samples but also because of the features they consider at each decision point.

2. **De-correlation of Models**:
   - By introducing this randomness in feature selection, Decision Forests further de-correlate the individual trees. This means that the errors or biases of one tree are less likely to be replicated across the forest, leading to a more robust overall model.

3. **Reduction in Model Variance**:
   - Variance refers to how much a model's predictions might change if it were trained on a different set of data. High variance is characteristic of overfitting.
   - The increased randomization in Decision Forests means that individual trees might have higher variance, but because the trees are less correlated with one another, averaging their predictions cancels more of that variance, leading to an ensemble with reduced overall variance.

4. **Increased Stability Against Feature Noise**:
   - In real-world data, not all features are equally informative. Some might contain 'noise' or irrelevant information.
   - By considering only a subset of features at each decision point, Decision Forests increase the chances that noisy features are left out in many of the splits, thus providing a natural resistance against overfitting due to such noisy features.
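In Scikit-Learn, the difference between plain bagging and a Decision Forest comes down to whether each split is restricted to a random subset of features. The sketch below fits both on assumed synthetic data with many uninformative features; the dataset parameters and the choice `max_features="sqrt"` are illustrative assumptions, and which model scores higher will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption): 20 features, only 5 of them informative.
X, y = make_classification(
    n_samples=600,
    n_features=20,
    n_informative=5,
    n_redundant=0,
    random_state=0,
)

# Plain bagging: every tree may consider all 20 features at every split.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    oob_score=True,
    random_state=0,
).fit(X, y)

# Random forest: each split considers only a random subset of features,
# here sqrt(20), roughly 4 features per split.
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    oob_score=True,
    random_state=0,
).fit(X, y)

print(f"bagging OOB accuracy:       {bag.oob_score_:.3f}")
print(f"random forest OOB accuracy: {forest.oob_score_:.3f}")
```

The `max_features` parameter controls the amount of extra randomness: smaller values de-correlate the trees more strongly but make each individual tree weaker, so it is usually treated as a hyperparameter to tune.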

### Conclusion:
Decision Forests represent an advanced stage in the evolution of ensemble methods, particularly for decision trees. By introducing controlled randomness in the decision-making process, they effectively combat common pitfalls like overfitting and high variance. This ensures that the ensemble not only benefits from the wisdom of multiple models but also that each of those models brings a unique perspective, making the overall ensemble more powerful and reliable.