# Gradient Boosting Step-by-Step

Gradient Boosting is an ensemble learning method that builds a strong predictive model by combining multiple weak models (typically shallow decision trees) sequentially. Each new model is fit to the errors of the current ensemble, so that adding it reduces the loss function.

---
## **Step 1: Initialize the Model**
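We start by initializing the model with a constant prediction that minimizes the total loss. This is the **mean** of the targets for squared-error regression, or the **log-odds** of the positive class for binary classification with log loss.
In general: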

$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$

where:
- $F_0(x)$ is the initial model (a constant, since it does not depend on $x$),
- $\gamma$ is the candidate constant prediction,
- $L(y_i, \gamma)$ is the loss function (e.g., squared error),
- $y_i$ are the actual values.
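
As a minimal sketch of this initialization, the snippet below computes the constant for the two cases mentioned above. The function names are hypothetical, and the log-odds case assumes labels in {0, 1} with both classes present.

```python
import math

def initial_prediction_mse(y):
    """Constant that minimizes squared error: the mean of the targets."""
    return sum(y) / len(y)

def initial_prediction_log_loss(y):
    """Constant that minimizes log loss: the log-odds of the positive class.

    Assumes labels are 0/1 and that both classes occur at least once.
    """
    p = sum(y) / len(y)
    return math.log(p / (1 - p))

y_reg = [3.0, 5.0, 10.0]
f0_reg = initial_prediction_mse(y_reg)       # 6.0

y_clf = [1, 1, 1, 0]
f0_clf = initial_prediction_log_loss(y_clf)  # log(0.75 / 0.25) = log(3)
```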
---
## **Step 2: Compute Residuals (Negative Gradients)**
At each iteration $m$, compute the negative gradient of the loss with respect to the current prediction for every training example. These values, called pseudo-residuals, point in the direction of steepest descent of the loss function.

$r_{im} = - \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}$

where:
- $r_{im}$ is the pseudo-residual (negative gradient) for example $i$ at iteration $m$,
- $F_{m-1}(x)$ is the prediction from the previous iteration.

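For squared-error loss, $L(y_i, F(x_i)) = \frac{1}{2}(y_i - F(x_i))^2$ (the factor $\frac{1}{2}$ only rescales the gradient), the pseudo-residual is the ordinary residual: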

$r_{im} = y_i - F_{m-1}(x_i)$

For log loss in binary classification, with $y_i \in \{0, 1\}$ and $F(x)$ interpreted as the log-odds:

$r_{im} = y_i - p_{m-1}(x_i), \quad p_{m-1}(x_i) = \frac{1}{1 + e^{-F_{m-1}(x_i)}}$
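
The two special cases above can be sketched in a few lines of plain Python. The helper names are hypothetical, `f_prev` stands for the list of previous predictions $F_{m-1}(x_i)$, and the sigmoid here is not hardened against overflow for very large negative scores.

```python
import math

def sigmoid(z):
    """Map a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def residuals_mse(y, f_prev):
    """Pseudo-residuals for squared error: y_i - F_{m-1}(x_i)."""
    return [yi - fi for yi, fi in zip(y, f_prev)]

def residuals_log_loss(y, f_prev):
    """Pseudo-residuals for log loss: y_i - p_{m-1}(x_i), labels in {0, 1}."""
    return [yi - sigmoid(fi) for yi, fi in zip(y, f_prev)]

r_reg = residuals_mse([3.0, 5.0, 10.0], [6.0, 6.0, 6.0])
print(r_reg)  # [-3.0, -1.0, 4.0]

r_clf = residuals_log_loss([1, 0], [0.0, 0.0])
print(r_clf)  # [0.5, -0.5]
```

In both cases the pseudo-residual is positive when the current prediction is too low and negative when it is too high.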

---
## **Step 3: Fit a Weak Learner (Decision Tree)**
Fit a weak learner (e.g., a small decision tree) to predict the residuals:

$h_m(x) = \arg\min_{h} \sum_{i=1}^{n} (r_{im} - h(x_i))^2$

where:
- $h_m(x)$ is the weak learner (decision tree), chosen to approximate the pseudo-residuals $r_{im}$ in the least-squares sense.
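
To make the least-squares fit concrete, here is a sketch of the simplest possible tree: a one-split "stump" on a single numeric feature. `fit_stump` is a hypothetical helper written for illustration; real implementations search over many features and grow deeper trees.

```python
def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) on one feature by least squares.

    Returns a function h(x_new) that predicts the mean residual of the
    side of the best split that x_new falls on.
    """
    best = None  # (sse, threshold, left_value, right_value)
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighboring feature values
        left = [ri for xi, ri in zip(x, r) if xi <= threshold]
        right = [ri for xi, ri in zip(x, r) if xi > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((ri - left_value) ** 2 for ri in left)
               + sum((ri - right_value) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    if best is None:  # only one distinct feature value: no split is possible
        mean_r = sum(r) / len(r)
        return lambda x_new: mean_r
    _, threshold, left_value, right_value = best
    return lambda x_new: left_value if x_new <= threshold else right_value

x = [1.0, 2.0, 3.0, 4.0]
residuals = [-2.0, -2.0, 2.0, 2.0]
h = fit_stump(x, residuals)
print(h(1.5), h(3.5))  # -2.0 2.0
```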
---
## **Step 4: Compute Step Size (Line Search)**
Find the best step size $\gamma_m$ by minimizing the loss:

$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))$

This one-dimensional line search determines how far to move along $h_m$. The direction itself was already fixed when the weak learner was fit in Step 3.
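
The minimization over $\gamma$ can be sketched as follows. For squared error it has a closed form; for other losses a numeric search is one option. Both function names are hypothetical, and the ternary search assumes the total loss is convex in $\gamma$ on the chosen interval.

```python
def line_search_mse(y, f_prev, h):
    """Closed-form step for squared error: gamma = sum(r * h) / sum(h^2)."""
    num = sum((yi - fi) * hv for yi, fi, hv in zip(y, f_prev, h))
    den = sum(hv * hv for hv in h)
    return num / den if den > 0 else 0.0

def line_search_numeric(loss, y, f_prev, h, lo=0.0, hi=10.0, iters=100):
    """Ternary search for gamma on [lo, hi]; assumes convexity in gamma."""
    def total(g):
        return sum(loss(yi, fi + g * hv) for yi, fi, hv in zip(y, f_prev, h))
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if total(m1) < total(m2):
            hi = m2  # the minimum lies to the left of m2
        else:
            lo = m1  # the minimum lies to the right of m1
    return (lo + hi) / 2

y = [1.0, 2.0, 4.0]
f_prev = [2.0, 2.0, 2.0]
h = [-0.5, 0.0, 1.0]  # weak-learner outputs h_m(x_i)

gamma_exact = line_search_mse(y, f_prev, h)
gamma_numeric = line_search_numeric(
    lambda yi, fi: 0.5 * (yi - fi) ** 2, y, f_prev, h
)
print(gamma_exact)  # 2.0
```

On this toy data the numeric search agrees with the closed-form value up to floating-point tolerance.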

---

## **Step 5: Update the Model**
Update the model by adding the new weak learner:

$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$

where:
- $F_m(x)$ is the updated model,
- $h_m(x)$ is the weak learner,
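- $\gamma_m$ is the step size found in Step 4. In practice it is usually multiplied by a fixed learning rate (shrinkage factor) $\nu \in (0, 1]$, giving $F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)$.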
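
A short sketch of the update on the training predictions. `update_predictions` is a hypothetical helper, and `learning_rate` is an assumption added for illustration: an extra shrinkage constant commonly used in practice, where `1.0` reproduces the update formula exactly as written above.

```python
def update_predictions(f_prev, h, gamma, learning_rate=1.0):
    """Return F_m(x_i) = F_{m-1}(x_i) + learning_rate * gamma * h_m(x_i)."""
    return [fi + learning_rate * gamma * hv for fi, hv in zip(f_prev, h)]

f_prev = [2.0, 2.0, 2.0]
h = [-0.5, 0.0, 1.0]

f_full = update_predictions(f_prev, h, gamma=2.0)
print(f_full)  # [1.0, 2.0, 4.0]

# With shrinkage, each round takes a smaller step toward the same targets.
f_small = update_predictions(f_prev, h, gamma=2.0, learning_rate=0.1)
```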
---
## **Step 6: Repeat for Multiple Iterations**
Repeat **Steps 2 to 5** for $M$ iterations, or stop earlier if a stopping criterion is met (for example, the loss on a held-out validation set stops improving).
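
Putting the loop together, the following is a minimal sketch for squared-error regression on a single feature, not a production implementation. Everything in it is an assumption made for illustration: the stump weak learner, the helper names, the `learning_rate` shrinkage constant, and the simple training-loss tolerance used as the stopping criterion.

```python
def fit_stump(x, r):
    """Hypothetical weak learner: a one-split regression tree on one feature."""
    best = None  # (sse, threshold, left_value, right_value)
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lv) ** 2 for ri in left)
               + sum((ri - rv) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    if best is None:  # no split possible
        mean_r = sum(r) / len(r)
        return lambda x_new: mean_r
    _, t, lv, rv = best
    return lambda x_new: lv if x_new <= t else rv


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1, tol=1e-8):
    f0 = sum(y) / len(y)                                   # Step 1
    preds = [f0] * len(y)
    stages = []                                            # (weight, h_m) pairs
    for _ in range(n_rounds):                              # Step 6
        if mse(y, preds) < tol:                            # stopping criterion
            break
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # Step 2
        h = fit_stump(x, residuals)                        # Step 3
        h_vals = [h(xi) for xi in x]
        denom = sum(v * v for v in h_vals)
        gamma = (sum(r * v for r, v in zip(residuals, h_vals)) / denom
                 if denom > 0 else 0.0)                    # Step 4
        preds = [pi + learning_rate * gamma * v
                 for pi, v in zip(preds, h_vals)]          # Step 5
        stages.append((learning_rate * gamma, h))
    return f0, stages, preds


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 3.0, 3.1, 6.0, 6.2]
f0, stages, preds = gradient_boost(x, y)
initial_mse = mse(y, [f0] * len(y))
final_mse = mse(y, preds)
print(f"MSE before: {initial_mse:.3f}, after {len(stages)} rounds: {final_mse:.3f}")
```

With exact line search and a learning rate in $(0, 1]$, the training error cannot increase from one round to the next, which is why a small learning rate trades more rounds for smoother progress.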

---

## **Step 7: Final Prediction**
After $M$ iterations, the final model is:

$F_M(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m h_m(x)$
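
Evaluating the final model is just a weighted sum, as this sketch shows. The `stages` list is a hand-written stand-in for fitted weak learners (illustrative values, not the output of any training run), and `predict_proba` applies the sigmoid from Step 2 for the log-loss case.

```python
import math

def predict(x_new, f0, stages):
    """Evaluate F_M(x) = F_0 + sum over m of gamma_m * h_m(x) for one input."""
    return f0 + sum(gamma * h(x_new) for gamma, h in stages)

def predict_proba(x_new, f0, stages):
    """For log loss, convert the raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-predict(x_new, f0, stages)))

# Hypothetical fitted components: (gamma_m, h_m) pairs.
f0 = 2.0
stages = [
    (0.5, lambda x: 1.0 if x > 3 else -1.0),
    (0.25, lambda x: 2.0 if x > 5 else 0.0),
]

print(predict(6.0, f0, stages))  # 3.0
print(predict(1.0, f0, stages))  # 1.5
```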

---
## **Advantages of Gradient Boosting**
✅ Handles non-linear relationships well  
✅ Works with both regression and classification  
✅ Reduces bias by correcting errors sequentially, while shrinkage and subsampling help control variance  
✅ Highly customizable with hyperparameters like learning rate, tree depth, and loss functions  

---
## **Disadvantages of Gradient Boosting**
❌ Can be slow to train, since trees are built one after another rather than in parallel  
❌ Prone to overfitting if not regularized  
❌ Requires careful tuning of hyperparameters
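
As one illustration of the regularization and tuning knobs mentioned above, here is a sketch using scikit-learn's `GradientBoostingRegressor`. The dataset is synthetic and the hyperparameter values are arbitrary starting points chosen for the example, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, used only to make the example runnable.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of trees (M)
    learning_rate=0.05,       # shrinkage applied to each tree's contribution
    max_depth=3,              # keeps each tree a weak learner
    subsample=0.8,            # fit each tree on a random 80% of the rows
    validation_fraction=0.2,  # held-out share used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
model.fit(X, y)
print("trees actually fitted:", model.n_estimators_)
```

Early stopping via `n_iter_no_change` addresses both overfitting and training time, since boosting halts once the validation score stops improving.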