# Gradient Boosting: Note | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2045%20Gradient%20Boosting)

Gradient Boosting is a powerful ensemble technique that builds a strong predictive model by sequentially combining multiple weak learners (often shallow decision trees). Each new model is trained to correct the errors of the ensemble built so far by approximating the negative gradient of a loss function.

---

## 1. Mathematical Background

We want to find a model \( F(x) \) that minimizes a differentiable loss function \( L(y, F(x)) \) summed over a training set \( \{(x_i, y_i)\}_{i=1}^{n} \). The objective is to solve:

$$
F = \underset{F}{\operatorname{arg\,min}} \sum_{i=1}^{n} L(y_i, F(x_i))
$$

We represent \( F(x) \) as an additive model:

$$
F(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m \, h_m(x)
$$

where:
- \( F_0(x) \) is the initial model (often a constant).
- \( h_m(x) \) is the \( m \)-th weak learner.
- \( \gamma_m \) is the weight (step size) for the \( m \)-th learner.

At each iteration \( m \), we compute the pseudo-residuals:

$$
r_{im} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{m-1}}
$$
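
As a worked example, the pseudo-residuals take a simple form for two common losses. The choice of losses here is an assumption made for illustration: squared error with a factor of \( \tfrac{1}{2} \), and binary log-loss with labels \( y_i \in \{0, 1\} \) where \( F(x) \) is interpreted as the log-odds.

$$
L(y_i, F(x_i)) = \tfrac{1}{2}\left(y_i - F(x_i)\right)^2
\quad\Longrightarrow\quad
r_{im} = y_i - F_{m-1}(x_i)
$$

$$
L(y_i, F(x_i)) = -\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right],
\quad p_i = \frac{1}{1 + e^{-F(x_i)}}
\quad\Longrightarrow\quad
r_{im} = y_i - p_i \,\Big|_{F = F_{m-1}}
$$

In both cases the pseudo-residual is "target minus current prediction", measured on the scale the loss works with.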

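Next, we fit a weak learner \( h_m(x) \) to these pseudo-residuals (typically by least squares, using \( (x_i, r_{im}) \) as the training pairs) and determine the optimal step size \( \gamma_m \) via a one-dimensional line search: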

$$
\gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) + \gamma \, h_m(x_i)\right)
$$
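
The line search can be sketched numerically for the squared-error case, where the arg min has a closed form. The arrays `y`, `F_prev`, and `h` below are hypothetical toy values standing in for the targets, the current predictions \( F_{m-1}(x_i) \), and the weak learner outputs \( h_m(x_i) \); they are not taken from any dataset in this note.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy values: targets, current predictions F_{m-1}(x_i),
# and the outputs h_m(x_i) of an already-fitted weak learner.
y = rng.normal(size=50)
F_prev = np.full(50, y.mean())
h = 0.5 * (y - F_prev) + 0.1 * rng.normal(size=50)


def squared_error(gamma):
    """Total loss sum_i 1/2 * (y_i - F_prev_i - gamma * h_i)^2."""
    return 0.5 * np.sum((y - F_prev - gamma * h) ** 2)


# Closed-form minimizer for squared error: project the residuals onto h.
residuals = y - F_prev
gamma_closed = np.dot(residuals, h) / np.dot(h, h)

# Brute-force check of the same arg min over a grid of candidate step sizes.
grid = np.linspace(0.0, 4.0, 4001)
gamma_grid = grid[np.argmin([squared_error(g) for g in grid])]

print("closed form:", gamma_closed)
print("grid search:", gamma_grid)
```

For losses without a closed form, the same one-dimensional minimization is usually done numerically.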

Finally, the model is updated:

$$
F_{m}(x) = F_{m-1}(x) + \nu \, \gamma_m \, h_m(x)
$$

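Here, \( \nu \) (nu) is the learning rate, also called shrinkage, with \( 0 < \nu \le 1 \). It scales the contribution of each new learner: smaller values usually need more iterations \( M \) but tend to reduce overfitting. The additive model written earlier corresponds to \( \nu = 1 \); with shrinkage, every term \( \gamma_m \, h_m(x) \) is multiplied by \( \nu \).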

---

## 2. Python Implementation Step by Step

Below is an example using scikit-learn’s `GradientBoostingClassifier` on a synthetic dataset.

### Step 1: Import Libraries and Create a Dataset

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a toy binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, 
                           n_redundant=5, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```

### Step 2: Initialize and Train the Gradient Boosting Model

```python
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                    max_depth=3, random_state=42)

# Train the model on the training data
gb_clf.fit(X_train, y_train)
```
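
To connect the fitted classifier back to Section 1, the sketch below inspects the ensemble and its raw additive score \( F(x) \). It rebuilds the same data and model so that it runs on its own, and it assumes the default log-loss objective, for which the predicted probability of the positive class is the sigmoid of the raw score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Same synthetic data and hyperparameters as in Steps 1 and 2.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                    max_depth=3, random_state=42)
gb_clf.fit(X_train, y_train)

# One regression tree per boosting iteration (a single column for binary targets).
print("Trees fitted:", gb_clf.estimators_.shape)

# Raw additive score F(x), before it is mapped to a probability.
raw_score = gb_clf.decision_function(X_test)

# For the default log-loss, P(y = 1 | x) is the sigmoid of the raw score.
proba_from_score = 1.0 / (1.0 + np.exp(-raw_score))
proba = gb_clf.predict_proba(X_test)[:, 1]
print("Max difference:", np.max(np.abs(proba - proba_from_score)))
```

Note that even for classification, the weak learners are regression trees: they are fitted to real-valued pseudo-residuals, not to class labels.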

### Step 3: Make Predictions and Evaluate the Model

```python
from sklearn.metrics import accuracy_score

# Predict on the test set
y_pred = gb_clf.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```
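
Because the model is built one tree at a time, its predictions can also be evaluated after each boosting iteration. The following sketch uses `staged_predict` for this; it recreates the data and model from Steps 1 and 2 so that it is self-contained. Picking the tree count on the test split is done here only for illustration, and a separate validation split would normally be used instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same synthetic data and hyperparameters as in Steps 1 and 2.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                    max_depth=3, random_state=42)
gb_clf.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
staged_accuracy = [accuracy_score(y_test, y_stage)
                   for y_stage in gb_clf.staged_predict(X_test)]

# Index of the best stage, converted to a 1-based number of trees.
best_n_trees = int(np.argmax(staged_accuracy)) + 1

print("Accuracy with all trees:", staged_accuracy[-1])
print("Best number of trees on this split:", best_n_trees)
```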

### Step 4: Conceptual Illustration of One Boosting Iteration (Regression Example)

For regression using mean squared error (MSE), the process can be outlined as follows:

1. **Initialization:**  
   Set the initial prediction \( F_0 \) as the mean of the target values:
   
   $$
   F_0 = \frac{1}{n}\sum_{i=1}^{n} y_i
   $$

2. **Compute Pseudo-Residuals:**  
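   For squared-error loss \( L = \tfrac{1}{2}(y - F)^2 \), the pseudo-residuals are the ordinary residuals (without the factor \( \tfrac{1}{2} \) they are only rescaled by a constant):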
   
   $$
   r_i = y_i - F_0
   $$

3. **Fit a Weak Learner:**  
   Train a simple decision tree regressor on the residuals:
   
   ```python
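   from sklearn.tree import DecisionTreeRegressor
   
   # Illustration only: treat the 0/1 labels from Step 1 as a numeric target
   F0 = y_train.mean()   # initial constant prediction F_0
   r = y_train - F0      # pseudo-residuals for squared error
   
   weak_learner = DecisionTreeRegressor(max_depth=3, random_state=42)
   weak_learner.fit(X_train, r)
   predictions = weak_learner.predict(X_train)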
   ```

4. **Determine Optimal Step Size (\( \gamma \)):**  
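   For squared-error loss with a regression tree fitted by least squares, each leaf already predicts the mean residual of its samples, so the line search gives \( \gamma = 1 \).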
   
   ```python
   gamma = 1
   ```

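5. **Update the Model:**  
   Add the weak learner's output \( h_1(x) \) (the `predictions` array on the training set), scaled by the step size and the learning rate \( \nu \):
   
   $$
   F_1(x) = F_0 + \nu \, \gamma \, h_1(x)
   $$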

6. **Repeat:**  
   In a full gradient boosting model, repeat steps 2 to 5 for \( M \) iterations, each time computing the residuals against the latest model \( F_{m-1}(x) \) rather than \( F_0 \).
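
The steps above can be put together in a short from-scratch loop. This is a minimal sketch under stated assumptions, not scikit-learn's implementation: it uses a hypothetical regression dataset from `make_regression` (the earlier steps use classification data), squared-error loss with \( \gamma_m = 1 \), and illustrative values for `n_rounds` and `learning_rate`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical regression data, generated only for this illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

n_rounds = 50          # M: number of boosting iterations (assumed value)
learning_rate = 0.1    # nu: shrinkage applied to every tree (assumed value)

# Initialization: constant prediction equal to the target mean.
F0 = y.mean()
F = np.full(len(y), F0)
trees = []
train_mse = [np.mean((y - F) ** 2)]

for m in range(n_rounds):
    residuals = y - F                      # pseudo-residuals for squared error
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X, residuals)                 # weak learner h_m
    F += learning_rate * tree.predict(X)   # gamma_m = 1 for squared error
    trees.append(tree)
    train_mse.append(np.mean((y - F) ** 2))


def predict(X_new):
    """Evaluate F_M(x) = F_0 + nu * sum_m h_m(x) on new inputs."""
    return F0 + learning_rate * sum(tree.predict(X_new) for tree in trees)


print("Training MSE before boosting:", train_mse[0])
print("Training MSE after boosting: ", train_mse[-1])
```

The training error shrinks at every round because each tree is a least-squares fit to the current residuals; held-out error is what should guide the choice of `n_rounds` and `learning_rate`.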

---

## 3. Conclusion

This note has explained the key ideas behind Gradient Boosting:

- **Model Representation:**  
  $$ F(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m \, h_m(x) $$

- **Pseudo-Residuals Computation:**  
  $$ r_{im} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{m-1}} $$

- **Model Update Rule:**  
  $$ F_{m}(x) = F_{m-1}(x) + \nu \, \gamma_m \, h_m(x) $$