<a href="https://colab.research.google.com/github/MaralAminpour/ML-BME-Course-UofA-Fall-2023/blob/main/Week-7-Ensemble-learning/Ensemble_learning_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning Objectives:

This week, our focus is on Ensemble Learning:

- **Understanding**: We want you to get a clear idea of what ensemble learning is and why it's useful.

- **Types of Ensembles**: We'll explore two main types:
  - **Homogeneous Ensembles**: Here, we use the same type of model multiple times. Examples include **bagging**, **random forests**, and **boosting**.

  - **Heterogeneous Ensembles**: This involves using different types of models together. We'll touch on this briefly.

  - **Sequential vs. Parallel Methods**: We'll differentiate between these two. Sequential methods (such as boosting) train models one after another, each building on the previous model's output, whereas parallel methods (such as bagging) train their models independently of one another, aiming to improve accuracy and reduce overfitting.

- **Practical Work**: We'll build a Bagging model using decision trees. If you missed our last session, don't worry; there's a ready-to-use code module for you.

- **Using Scikit-Learn**: We'll practice implementing our learned methods using this popular tool.

By the end of the week, you should:

- Know what ensemble learning is and its benefits.

- Understand the difference between bagging, boosting, and other methods.

- Be able to create a basic Bagging model.

- Know how to use these methods in Scikit-Learn.

- Have a basic idea about stacking (another ensemble method).

# Ensemble Learning

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/orchestra.png' width=500px>

Ensemble Learning is a technique in machine learning where multiple models (often referred to as "base models") are trained and their predictions are combined to produce a final result.

Ensemble learning is a concept that has been around for a while. At its core, it's based on the principle that gathering opinions from multiple sources often leads to better outcomes than relying on just one source. This idea is commonly referred to as the "wisdom of the crowd."

This week, we'll apply this principle to machine learning. By combining simple models, like the decision trees we discussed last week, we aim to create more accurate and robust predictive models that perform better on new, unseen data.


**Ensemble Learning Analogy**: Ensemble learning is like an orchestra performing a symphony. Each individual musician (model) has their own instrument (algorithm) and plays their part (makes a prediction). While each musician is talented on their own, it's when they all come together under the guidance of the conductor (ensemble method) that they produce a harmonious and powerful performance (more accurate prediction). Just as a single out-of-tune instrument can be drowned out by the harmony of the entire orchestra, the errors from a single model in ensemble learning can be offset by the correct predictions of other models. The collective output is often more beautiful and impactful than any solo performance.

## Toy example (combining by voting)

Let's break down a simple example to see how combining multiple systems can improve accuracy:

Imagine you have 10 samples, and they're all marked as positive or "1".

Now, we have three systems or classifiers named A, B, and C. Individually, each is right most of the time, but not always. Here's what their results look like:

- A's results: {1,1,0,1,1,1,1,1,1,0}
- B's results: {0,1,1,1,1,1,1,0,1,1}
- C's results: {1,1,1,0,1,1,1,0,1,1}

Each classifier gets 8 of the 10 samples right (80%), but they make different mistakes: A is wrong on samples 3 and 10, B on samples 1 and 8, and C on samples 4 and 8.

So, how do we get better results? We use "majority voting". This means if at least two systems say a sample is "1", then we go with "1" as the final answer.

When we do this for our example, the combined result is {1,1,1,1,1,1,1,0,1,1}. Now, they're right 90% of the time together! The only remaining error is sample 8, where B and C are both wrong.

This magic happens because the systems' mistakes don't often overlap. When two systems get it right, they can correct the third system's mistake.

However, this only works when the systems make different mistakes. If all three systems were often wrong about the same samples, then majority voting wouldn't improve accuracy.
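
The voting step above can be reproduced in a few lines of Python. This is a minimal sketch that uses the three prediction lists from the example; the helper names `accuracy` and `majority_vote` are introduced here purely for illustration and are not part of any library.

```python
# True labels: all 10 samples are positive.
y_true = [1] * 10

# Predictions of the three classifiers from the example above.
y_a = [1, 1, 0, 1, 1, 1, 1, 1, 1, 0]
y_b = [0, 1, 1, 1, 1, 1, 1, 0, 1, 1]
y_c = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]


def accuracy(y_pred, y_true):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(y_pred, y_true))
    return correct / len(y_true)


def majority_vote(*predictions):
    """Per-sample majority vote over binary (0/1) predictions."""
    n_models = len(predictions)
    return [1 if sum(votes) > n_models / 2 else 0 for votes in zip(*predictions)]


y_mv = majority_vote(y_a, y_b, y_c)

# Print each classifier's predictions and accuracy, then the vote's.
for name, y_pred in [("A", y_a), ("B", y_b), ("C", y_c), ("vote", y_mv)]:
    print(f"{name}: {y_pred} accuracy = {accuracy(y_pred, y_true):.0%}")
```

Samples where only one classifier errs are corrected by the vote; a sample where two classifiers err at the same time is not.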


When the classifiers in an ensemble are correlated, combining them improves accuracy very little. Consider three perfectly correlated classifiers, each with a prediction accuracy of 70%. Their predictions, and the resulting majority vote $ \hat{y}_{mv} $, are:

- $ \hat{y}_A $ = {1,1,1,0,1,1,1,1,0,0} (70% correct)
- $ \hat{y}_B $ = {1,1,1,0,1,1,1,1,0,0} (70% correct)
- $ \hat{y}_C $ = {1,1,1,0,1,1,1,1,0,0} (70% correct)
- $ \hat{y}_{mv} $ = {1,1,1,0,1,1,1,1,0,0} (70% correct)

Because the three classifiers make exactly the same mistakes, there is never a correct majority to outvote an error, so majority voting leaves the accuracy at 70%.

More generally, the more the errors of models A, B, and C overlap, the less majority voting can improve on any single model.
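
To see the effect of correlation concretely, the sketch below applies a per-sample majority vote to three identical prediction lists (the correlated case above). The helper functions are illustrative only.

```python
y_true = [1] * 10

# Three perfectly correlated classifiers: identical predictions.
y_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
y_b = list(y_a)
y_c = list(y_a)


def accuracy(y_pred, y_true):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)


def majority_vote(*predictions):
    """Per-sample majority vote over binary (0/1) predictions."""
    n_models = len(predictions)
    return [1 if sum(votes) > n_models / 2 else 0 for votes in zip(*predictions)]


y_mv = majority_vote(y_a, y_b, y_c)

print("vote:", y_mv)
print("single-model accuracy:", accuracy(y_a, y_true))  # 0.7
print("ensemble accuracy:", accuracy(y_mv, y_true))     # 0.7
```

The vote simply reproduces the shared predictions, errors included.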

## Basics of Probability Theory and Application to Classifiers

To understand how combining multiple classifiers can enhance accuracy, we can use probability theory.

1. **Basics of Probability Theory**: Before diving deep, it's important to grasp some fundamental probability concepts.

    - **Independent Events**: These are events that don't influence one another. The probability of all of them occurring is the product of their individual probabilities. Mathematically:

      $$ P(\hat{y}_A \cap \hat{y}_B \cap \hat{y}_C) = P(\hat{y}_A) \times P(\hat{y}_B) \times P(\hat{y}_C) $$

    - **Mutually Exclusive Events**: These are events that can't occur at the same time. The combined probability for such events is the sum of their individual probabilities. For instance:
    
      $$ P(\hat{y}_A \cup \hat{y}_B) = P(\hat{y}_A) + P(\hat{y}_B) $$

2. **Application to Classifiers**: When dealing with multiple classifiers, like in our example, these principles come into play. Each classifier might have an accuracy of 70%, but they don't necessarily make mistakes on the same data points. This diversity in their errors can be leveraged.

    - If one classifier mispredicts an outcome, the others might get it right. So, by taking a majority vote from the classifiers, we can 'correct' these individual errors.

    - Picture this as having three friends, each good at answering 70% of the questions in a quiz. If they were to work together, taking a collective decision on each question, their combined expertise could lead to an even higher score.

3. **Potential Outcomes**:
    - **Best Case**: On a given instance, all classifiers predict the right outcome, so the majority vote is correct for that instance.

    - **Average Case**: If the three classifiers err independently, majority voting pushes the expected accuracy from 70% up to about 78% across a larger set of data.

    - **Worst Case**: If all classifiers make the same mistakes, combining them won't help, and the accuracy remains at 70%.

In essence, the principle behind combining classifiers is to capitalize on their individual strengths and offset their weaknesses. This method often leads to a more robust and accurate predictive system, as confirmed by our probability theory.
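
The average case can also be checked empirically. The following is a small Monte Carlo sketch, assuming three classifiers that are each correct with probability 0.7 and err independently of one another; the function name and the constants are chosen here for illustration. The estimate should land close to the 78% quoted above.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

P_CORRECT = 0.7     # assumed accuracy of each individual classifier
N_MODELS = 3        # size of the ensemble
N_TRIALS = 100_000  # number of simulated samples


def simulate_majority_accuracy(p_correct, n_models, n_trials):
    """Estimate how often a majority of independent classifiers is correct."""
    wins = 0
    for _ in range(n_trials):
        # Each classifier is right independently with probability p_correct.
        n_right = sum(random.random() < p_correct for _ in range(n_models))
        if n_right > n_models / 2:
            wins += 1
    return wins / n_trials


estimate = simulate_majority_accuracy(P_CORRECT, N_MODELS, N_TRIALS)
print(f"Estimated ensemble accuracy: {estimate:.3f}")
```

The independence assumption is what makes the gain possible; real classifiers trained on the same data are usually partly correlated, so the gain in practice tends to be smaller.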

## Toy example (combining by voting)

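In this example, we're working with three separate prediction models. Each of these models is correct about 70% of the time, and we assume their errors are independent. In the formulas below, $ \hat{y}_i = 1 $ means that model $i$ is correct and $ \hat{y}_i = 0 $ means that it is wrong.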

From this setup, four distinct situations can arise for any given example:


1. **All three independent models are correct:**

All three models hit the mark. Since each model's prediction doesn't affect the others, the combined chance of all being right is 0.7 (the chance one is right) times itself twice. Doing the math: $0.7 \times 0.7 \times 0.7 = 0.343$.

$ P(\hat{y}_1 = 1, \hat{y}_2 = 1, \hat{y}_3 = 1) = 0.7 \times 0.7 \times 0.7 = 0.343 $

2. **Two models are correct (with 3 different possible mutually exclusive combinations)**:

Two models get it right, and one doesn't. There are three ways this can play out: either the first, second, or third model could be the one that's off. These scenarios can't all occur at once, so they're separate from one another. For each of these cases, the probability is $0.7 \times 0.7 \times 0.3$. Since there are three such cases, the total chance for this situation is $3 \times 0.7 \times 0.7 \times 0.3 = 0.441$.

- $ P(\hat{y}_1 = 1, \hat{y}_2 = 1, \hat{y}_3 = 0) = 0.7 \times 0.7 \times 0.3 = 0.147 $ OR

- $ P(\hat{y}_1 = 1, \hat{y}_2 = 0, \hat{y}_3 = 1) = 0.7 \times 0.3 \times 0.7 = 0.147 $ OR

- $ P(\hat{y}_1 = 0, \hat{y}_2 = 1, \hat{y}_3 = 1) = 0.3 \times 0.7 \times 0.7 = 0.147 $

Total: $ 0.147 + 0.147 + 0.147 = 0.441 $

3. **Two models are wrong:**

It's also possible for two models to miss the mark and only one to get it right. Calculating in the same way, the combined chance for this scenario is $3 \times 0.7 \times 0.3 \times 0.3 = 0.189$.

- $ P(\hat{y}_1 = 1, \hat{y}_2 = 0, \hat{y}_3 = 0) = 0.7 \times 0.3 \times 0.3 = 0.063 $ OR

- $ P(\hat{y}_1 = 0, \hat{y}_2 = 1, \hat{y}_3 = 0) = 0.3 \times 0.7 \times 0.3 = 0.063 $ OR

- $ P(\hat{y}_1 = 0, \hat{y}_2 = 0, \hat{y}_3 = 1) = 0.3 \times 0.3 \times 0.7 = 0.063 $

Total: $ 0.063 + 0.063 + 0.063 = 0.189 $

4. **All three are wrong:**

Lastly, there's the slim chance that all three models mess up. This has a likelihood of $0.3 \times 0.3 \times 0.3 = 0.027$.

$ P(\hat{y}_1 = 0, \hat{y}_2 = 0, \hat{y}_3 = 0) = 0.3 \times 0.3 \times 0.3 = 0.027 $


**Probability of all possible events** = 0.343 + 0.441 + 0.189 + 0.027 = 1

When you put all these situations together, the probabilities add up to 1, which makes sense because these four scenarios cover every possible outcome.
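
The four cases can be verified by brute force. The sketch below enumerates all 8 correct/wrong combinations for three independent models, assuming each is correct with probability 0.7, and groups their probabilities by the number of correct models. The variable names are illustrative.

```python
from itertools import product

P_CORRECT = 0.7  # probability that a single model is correct
N_MODELS = 3

# prob_by_count[k] = probability that exactly k models are correct
prob_by_count = {k: 0.0 for k in range(N_MODELS + 1)}

# 1 = model is correct, 0 = model is wrong; enumerate all 2**3 = 8 outcomes.
for outcome in product([1, 0], repeat=N_MODELS):
    prob = 1.0
    for is_correct in outcome:
        # Independence: multiply the per-model probabilities.
        prob *= P_CORRECT if is_correct else 1 - P_CORRECT
    # Mutually exclusive outcomes: add them up within each group.
    prob_by_count[sum(outcome)] += prob

for k in range(N_MODELS, -1, -1):
    print(f"{k} correct: {prob_by_count[k]:.3f}")
print(f"total: {sum(prob_by_count.values()):.3f}")
```

The loop uses exactly the two rules from the probability section: multiplication for independent events and addition for mutually exclusive ones.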


## Toy example (combining by voting), continued

In this scenario (each $ \hat{y}_i $ is correct 70% of the time), we want to know the chance that at least 2 predictions out of 3 are accurate. So, we need to look at situations where either exactly 2 predictions are correct or all 3 are.

From our calculations, the probability that exactly 2 predictions are correct is 0.441, or about 44%.

A majority vote with 3 members has 4 classes of mutually exclusive outcomes (3, 2, 1, or 0 correct predictions), and their probabilities sum to 1:

$0.343 + 0.441 + 0.189 + 0.027 = 1$

Adding the probability that all 3 predictions are correct (0.343) to the 0.441 for exactly 2 gives an overall accuracy of 0.784, or about 78%.

When taking a majority vote of 3 predictions, the result will be correct when at least 2 of the predictions are correct.
- The total probability of the outcomes where exactly 2 are right is $0.441$.
- Thus majority voting corrects a single model's mistake about 44% of the time.
- Adding the cases where all 3 are correct, the ensemble is correct with probability $0.441 + 0.343 = 0.784$, that is, about 78% of the time.
- This increases to about 84% ($0.837$) if we instead combine 5 models.

More generally, as long as the models are independent and each is right more often than wrong, adding models improves the accuracy of the majority vote.
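
The same calculation extends to any odd number of models through the binomial distribution. The sketch below assumes independent models that are each correct with probability 0.7, and an odd ensemble size so that ties cannot occur; `majority_vote_accuracy` is a name introduced here for illustration.

```python
from math import comb


def majority_vote_accuracy(p_correct, n_models):
    """P(more than half of n independent models are correct)."""
    k_min = n_models // 2 + 1  # smallest winning number of correct votes
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


for n in (1, 3, 5):
    print(f"{n} model(s): {majority_vote_accuracy(0.7, n):.3f}")
# 1 model(s): 0.700
# 3 model(s): 0.784
# 5 model(s): 0.837
```

Calling the function with larger odd values of `n_models` shows the accuracy continuing to rise, but only under the independence assumption stated above.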


## Bias and Variance

Understanding bias and variance is like playing darts on a target board. The bullseye in the middle is the perfect prediction. The farther away the darts land from the bullseye, the more off our predictions are.

Every time we make a model using different training data, it's like throwing a dart. Sometimes, the data is great and our dart (prediction) is close to the bullseye. But sometimes, if our data has odd values or outliers, our dart might land far off.

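So, with all these darts (models), we can see a pattern on the board. If the darts cluster away from the bullseye, the models have high bias; if they are widely scattered, the models have high variance. This pattern helps us understand different scenarios of bias and variance, both high and low.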

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-7-Ensemble-learning/imgs/target.webp' width=500px>
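
The four patterns on the target can be imitated with a short simulation. This is only a sketch of the analogy, not a measurement of any real model: the `offset` and `spread` values are made-up numbers chosen to produce the four scenarios, and `throw_darts` and `summarize` are hypothetical helper names.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable


def throw_darts(offset, spread, n_throws=1000):
    """Simulate dart positions around a bullseye at (0, 0).

    offset: shift of the aim point along each axis (plays the role of bias)
    spread: standard deviation of the scatter (plays the role of variance)
    """
    return [
        (random.gauss(offset, spread), random.gauss(offset, spread))
        for _ in range(n_throws)
    ]


def summarize(throws):
    """Return (bias, scatter) for a list of (x, y) dart positions."""
    mean_x = sum(x for x, _ in throws) / len(throws)
    mean_y = sum(y for _, y in throws) / len(throws)
    # Bias: distance from the centre of the cluster to the bullseye.
    bias = math.hypot(mean_x, mean_y)
    # Scatter: average distance of each dart from the centre of the cluster.
    scatter = sum(math.hypot(x - mean_x, y - mean_y) for x, y in throws) / len(throws)
    return bias, scatter


# Hypothetical (offset, spread) settings for the four scenarios.
scenarios = {
    "low bias, low variance": (0.0, 0.2),
    "low bias, high variance": (0.0, 1.0),
    "high bias, low variance": (2.0, 0.2),
    "high bias, high variance": (2.0, 1.0),
}

results = {name: summarize(throw_darts(*params)) for name, params in scenarios.items()}

for name, (bias, scatter) in results.items():
    print(f"{name:26s} bias = {bias:.2f}, scatter = {scatter:.2f}")
```

Each dart stands for one model trained on a different sample of data, so the bias column describes how far the models are off on average and the scatter column describes how much they disagree with one another.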
