<a href="https://colab.research.google.com/github/Metallicode/Math/blob/main/Boosting_Methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Boosting Methods

*Boosting works by combining multiple weak learners to create a strong learner. **A weak learner** is a model that performs only slightly better than random guessing, while **a strong learner** is a model that has high accuracy on the task.*

* Sequential Learning: Unlike Random Forests, which train each tree independently, boosting methods train learners sequentially. Each subsequent model is built to correct the mistakes of its predecessor.

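* Weighted Training Data: In AdaBoost-style boosting, instances that were misclassified by the current weak learner are given higher weights after each round, so subsequent learners focus on them, while correctly classified instances are given lower weights. Gradient boosting achieves a similar effect by fitting each new learner to the remaining errors (residuals) instead of reweighting instances.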

* Aggregation: The final prediction is typically a weighted combination of the predictions of all the weak learners. For regression tasks, this is often a weighted sum. For classification tasks, it can be a weighted majority vote.

* Regularization via Shrinkage: Boosting algorithms often introduce a learning rate (or shrinkage) parameter. This slows down the learning process by shrinking the contribution of each weak learner, which often leads to better performance and less overfitting.

* Strengths: Often provides higher accuracy than other algorithms, especially on tabular data.
Can be effective on imbalanced datasets, because later learners focus on the examples the ensemble still gets wrong, which are often from the underrepresented class.

* Challenges: More prone to overfitting than bagging methods, especially with noisy data or when the dataset is small.
Typically slower to train than bagging methods like Random Forests, because the learners must be built one after another.
Less interpretable than a single decision tree.

* Hyperparameters: Boosting methods come with their own set of hyperparameters (such as the depth of the trees, the learning rate, and the number of trees) that need to be tuned for optimal performance.

*In essence, boosting is a versatile and powerful method in the realm of machine learning, particularly for structured/tabular data. It leverages the idea that "the whole is greater than the sum of its parts," turning a series of weak models into a highly accurate combined model.*

##Examples of Boosting Algorithms:

> **AdaBoost (Adaptive Boosting)**: One of the first successful boosting algorithms. It adjusts the weights of misclassified instances and combines learners through weighted majority voting.

> **Gradient Boosting Machines (GBM)**: Builds trees sequentially, where each tree tries to correct the residuals (the differences between the predicted and true values) of the previous one. It generalizes the boosting procedure to optimize arbitrary loss functions.

> **XGBoost, LightGBM, CatBoost**: Modern and efficient implementations of gradient boosting that offer faster computation and additional features.
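
The residual-fitting idea behind gradient boosting, together with the learning rate (shrinkage) mentioned earlier, can be sketched in plain Python. This is a minimal illustration for squared-error regression on a single feature, not how any of the libraries above are implemented. The function names (`fit_stump`, `gradient_boost`, `predict`, `mse`) and the toy data are assumptions made up for this example.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump: returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds, learning_rate=0.1):
    """Return (base_prediction, learning_rate, stumps) fitted on squared-error residuals."""
    base = sum(ys) / len(ys)               # start from the mean of the targets
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)   # the new learner models what is still wrong
        stumps.append(stump)
        # shrinkage: only add a fraction of the new learner's prediction
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (made up for illustration)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.0, 4.9]
for rounds in (0, 1, 10, 50):
    print(rounds, round(mse(gradient_boost(xs, ys, rounds), xs, ys), 4))
```

The printed training error should shrink as the number of rounds grows. A smaller `learning_rate` makes each round contribute less, so more rounds are needed to reach the same training error.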

##Soft vs. Hard Voting


When using ensemble methods, especially in the context of classification, there are two main ways to aggregate predictions from multiple classifiers: soft voting and hard voting.

**Hard Voting:**
> Each classifier in the ensemble "votes" for a class label.
The class label that gets the majority of the votes is chosen as the final predicted class.
Example: Suppose you have three classifiers and two possible classes (A and B). If the classifiers predict A, A, and B, respectively, then the ensemble prediction using hard voting would be A, as it's the majority class predicted.
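
Hard voting as described above can be sketched in a few lines. The function name `hard_vote` is an assumption for this illustration, and ties are broken in favor of the label seen first, which is just one possible convention.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the largest number of classifiers."""
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(predicted_labels).most_common(1)[0][0]


# Three classifiers predict A, A and B
print(hard_vote(["A", "A", "B"]))  # A
```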

**Soft Voting:**

> Each classifier provides a probability for each class label.
The probabilities for each class are averaged across all classifiers.
The class with the highest average probability is chosen as the final predicted class.
Example: Using the same scenario with two possible classes (A and B), let's say the classifiers provide the following probabilities for class A: 0.7, 0.2, and 0.8. The average probability for class A is (0.7 + 0.2 + 0.8) / 3 ≈ 0.567. With only two classes, the average for class B is 1 − 0.567 = 0.433, which is lower, so the ensemble prediction using soft voting is A.
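
The soft voting procedure above can be sketched as follows. The function name `soft_vote` and the dictionary-based input format are assumptions chosen for readability, not a library API.

```python
def soft_vote(probabilities):
    """Average per-class probabilities across classifiers and pick the best class.

    probabilities: one dict per classifier, mapping class label -> probability.
    Returns (winning_label, averaged_probabilities).
    """
    classes = probabilities[0].keys()
    averages = {
        label: sum(p[label] for p in probabilities) / len(probabilities)
        for label in classes
    }
    return max(averages, key=averages.get), averages


winner, averages = soft_vote([
    {"A": 0.7, "B": 0.3},
    {"A": 0.2, "B": 0.8},
    {"A": 0.8, "B": 0.2},
])
print(winner, {label: round(p, 3) for label, p in averages.items()})
```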

**Which to Use?:**

**Soft voting** is generally preferred over hard voting, especially when the classifiers are well-calibrated. The reason is that soft voting takes into account the confidence levels of individual classifiers, leading to potentially more accurate ensemble predictions.

**Hard voting** can be more robust in cases where the probability outputs from classifiers aren't reliable or well-calibrated.

*In conclusion, while hard voting is a simple majority rule approach, soft voting provides a more nuanced way to aggregate predictions by considering the confidence of individual models.*
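
The two schemes do not always agree. The sketch below uses made-up probabilities for a two-class problem in which two classifiers lean slightly toward B while a third is very confident in A. All values and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical P(class A) from three binary classifiers; P(class B) = 1 - P(class A)
prob_a = [0.45, 0.45, 0.95]

# Hard voting: each classifier votes for its most likely class, the majority wins
votes = ["A" if p > 0.5 else "B" for p in prob_a]
hard_winner = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities first, then pick the more likely class
mean_prob_a = sum(prob_a) / len(prob_a)
soft_winner = "A" if mean_prob_a > 0.5 else "B"

print(votes, hard_winner)                  # ['B', 'B', 'A'] B
print(round(mean_prob_a, 3), soft_winner)  # 0.617 A
```

Whether the soft result is the better one depends on how trustworthy that 0.95 is, which is why calibration matters when choosing between the two.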

##AdaBoost

AdaBoost, short for "Adaptive Boosting", is one of the most popular ensemble methods that employ the boosting technique.


**AdaBoost in Simple Terms:**

> Imagine you're trying to differentiate between cats and dogs, but you're not very good at it. So, you ask a friend (a "weak learner") to help. Your friend makes some mistakes but gets some right.

> Instead of getting upset at the mistakes, you focus more on the pictures your friend got wrong and ask another friend to try and classify those.

> This second friend also makes mistakes, but again, instead of being upset, you focus on the pictures that are still wrong.

> You repeat this process with many friends. Each friend tries to correct the mistakes of the previous one.

> In the end, you combine all your friends' decisions. Each friend gets a say, but friends who were more confident and accurate have a louder voice in the final decision.

*This process is basically AdaBoost! Each friend is a "weak learner" (often a simple decision tree). Mistakes made by previous learners are given more emphasis by subsequent learners. In the end, you combine all their decisions for a final, strong decision.*

##AdaBoost Formula

1. Weight Initialization: Each data point is given an equal weight initially.

2. For Each Learner:

> * Fit the learner on the dataset, taking the current weights into account.
> * Calculate the weighted error of the learner.
> * Compute the learner's weight in the final decision. Learners with lower error have more weight (a louder voice).
> * Increase the weights of the misclassified points, so the next learner pays more attention to them.

3. Final Output: Combine the decisions of all learners. Each learner's decision is weighted by its alpha, which is larger the lower the learner's error.
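
The three steps above can be put together in a short from-scratch sketch. It assumes labels in {-1, +1} and uses decision stumps (one threshold on one feature) as weak learners, with the error and alpha formulas from this section. The function names, the weight renormalization after each round, and the toy data are assumptions for illustration; this is not a reference implementation.

```python
import math


def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[feature] <= threshold else -polarity
                         for row in X]
                error = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best


def stump_predict(stump, row):
    feature, threshold, polarity = stump
    return polarity if row[feature] <= threshold else -polarity


def adaboost_fit(X, y, n_rounds):
    n = len(X)
    w = [1.0 / n] * n                        # 1. equal initial weights
    ensemble = []                            # list of (alpha, stump) pairs
    for _ in range(n_rounds):                # 2. for each learner
        error, *stump = fit_stump(X, y, w)   # fit on weighted data, get its error
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, tuple(stump)))
        # y * prediction is +1 when correct and -1 when wrong, so correct points
        # are scaled by e^-alpha and misclassified points by e^alpha
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]         # renormalize so the weights sum to 1
    return ensemble


def adaboost_predict(ensemble, row):
    # 3. the sign of the alpha-weighted sum of the learners' votes
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate: +1 on both ends, -1 in the middle
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_fit(X, y, n_rounds=3)
print([adaboost_predict(ensemble, row) for row in X])  # [1, 1, -1, -1, 1, 1]
```

On this toy data the best single stump misclassifies two of the six points, while the combination of three stumps classifies all of them correctly.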

###Error of a Learner

```
learner_error = sum_of_weights_of_misclassified_points / total_sum_of_weights

```
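
As a small worked example of this formula, the sketch below computes the weighted error for four points with made-up, unnormalized weights. The function name `weighted_error` is an assumption.

```python
def weighted_error(weights, y_true, y_pred):
    """Sum of the weights of misclassified points divided by the total weight."""
    misclassified = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    return misclassified / sum(weights)


weights = [1, 2, 3, 4]   # hypothetical weights, total = 10
y_true = [1, -1, 1, -1]
y_pred = [1, 1, 1, -1]   # only the second point (weight 2) is wrong
print(weighted_error(weights, y_true, y_pred))  # 0.2
```

One of the four points is wrong, but the error is 0.2 and not 0.25, because the misclassified point carries less than a quarter of the total weight.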


###Weight of a learner (alpha)

```
alpha = (1/2) * natural_log((1 - error) / error)
```
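
To get a feel for this formula, the sketch below evaluates alpha for a few error values. The function name `learner_weight` and the small `eps` clamp (which avoids dividing by zero or taking log(0) when the error is exactly 0 or 1) are assumptions added for this example.

```python
import math


def learner_weight(error, eps=1e-10):
    """alpha = 0.5 * ln((1 - error) / error), with error kept inside (0, 1)."""
    error = min(max(error, eps), 1 - eps)
    return 0.5 * math.log((1 - error) / error)


for error in (0.1, 0.3, 0.5, 0.7):
    print(error, round(learner_weight(error), 3))
# 0.1 1.099
# 0.3 0.424
# 0.5 0.0
# 0.7 -0.424
```

A learner with error 0.5 (no better than guessing on the weighted data) gets no say, and a learner with error above 0.5 gets a negative alpha, which effectively flips its vote.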

###Updating Weights of Data Points

**For correctly classified points:**

```
new_w = old_w * np.e**-alpha
```

**For misclassified points:**
```
new_w = old_w * np.e**alpha
```
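
The two update rules can be combined into one helper and checked on a tiny example. The function name `update_weights` is an assumption, and so is the final renormalization step: it is common practice so that the weights keep summing to 1, but it is not strictly required when the error is divided by the total weight as above.

```python
import math


def update_weights(weights, y_true, y_pred, alpha):
    """Scale misclassified points by e^alpha and correct ones by e^-alpha."""
    updated = [w * math.exp(alpha if t != p else -alpha)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return [w / total for w in updated]  # renormalize (assumption, see above)


weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]                 # only the last point is misclassified
alpha = 0.5 * math.log((1 - 0.2) / 0.2)     # weighted error is 0.2
new_weights = update_weights(weights, y_true, y_pred, alpha)
print([round(w, 3) for w in new_weights])   # [0.125, 0.125, 0.125, 0.125, 0.5]
```

After the update, the single misclassified point carries half of the total weight, so the next learner cannot do well without getting it right.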

###Final Prediction

Mathematically, if h_t(x) ∈ {−1, +1} is the prediction of the t-th weak learner for an input x and α_t is the weight of that learner, then the aggregated score H(x) and the final output are given by

```
H(x) = sum(α_t * h_t(x) for t in 1..T)

Output = sign(H(x))

(where T is the total number of weak learners)

```
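
The sketch below evaluates this formula for three learners with made-up alphas. The function name `final_prediction` is an assumption, and treating a score of exactly 0 as +1 is an arbitrary tie-breaking choice.

```python
def final_prediction(alphas, learner_outputs):
    """Return (label, score), where score is the alpha-weighted sum of +1/-1 votes."""
    score = sum(a * h for a, h in zip(alphas, learner_outputs))
    return (1 if score >= 0 else -1), score


alphas = [0.3, 0.4, 0.9]   # hypothetical learner weights
outputs = [-1, -1, 1]      # two weaker learners say -1, the strongest says +1
label, score = final_prediction(alphas, outputs)
print(label, round(score, 3))  # 1 0.2
```

A plain majority vote would return -1 here. The weighted vote returns +1 because the most accurate learner has a louder voice than the other two combined.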