# **Gradient Boosting Regressor Model Theory**


## Theory
GradientBoostingRegressor is a machine learning model that uses the gradient boosting algorithm for regression tasks. It builds an ensemble of decision trees in a sequential manner, where each tree corrects the errors of the previous one. The algorithm minimizes a loss function (typically mean squared error for regression) by fitting each new tree to the residuals of the previous predictions. 

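The model function for GradientBoostingRegressor is the weighted sum of the outputs of all decision trees (in practice this sum is added to the constant initial prediction described in the training process):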

$$ f(x) = \sum_{t=1}^{T} \alpha_t h_t(x) $$

Where:
- $f(x)$ is the predicted output.
- $T$ is the number of trees.
- $\alpha_t$ is the weight of the $t$-th tree. In scikit-learn this is the constant `learning_rate`, shared by all trees.
- $h_t(x)$ is the prediction of the $t$-th tree for input $x$.
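
The weighted sum above can be sketched in a few lines of plain Python. This is only an illustration: the `stump` helper, the thresholds, and the weights are hypothetical values chosen by hand, not anything produced by scikit-learn.

```python
# Hypothetical toy ensemble: each "tree" is a depth-1 stump written as a plain function.
def stump(threshold, left_value, right_value):
    """Return h(x): left_value if x <= threshold, else right_value."""
    def h(x):
        return left_value if x <= threshold else right_value
    return h


# Three hand-picked stumps h_1, h_2, h_3 (illustrative values only).
trees = [stump(2.0, -1.0, 1.0), stump(4.0, -0.5, 0.5), stump(1.0, -0.2, 0.1)]
weights = [0.5, 0.5, 0.5]  # alpha_t, here the same for every tree


def predict(x):
    """f(x) = sum over t of alpha_t * h_t(x)."""
    return sum(alpha * h(x) for alpha, h in zip(weights, trees))


print(predict(3.0))
print(predict(0.0))
```

Each tree contributes a small correction, and the prediction is simply their weighted total.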

## Model Training

### Forward Pass

During training, the forward pass in GradientBoostingRegressor involves iteratively building decision trees, each of which focuses on predicting the residuals (errors) of the previous trees. This helps the model correct errors made by previous trees and refine the predictions.

### Cost Function

The cost function for GradientBoostingRegressor is the loss function used to evaluate how well the model's predictions match the actual outputs. In the case of regression, this is typically the mean squared error (MSE):

$$ J(\alpha) = \frac{1}{m} \sum_{i=1}^{m}(f(x^{(i)}) - y^{(i)})^2 $$

Where:
- $J(\alpha)$ is the cost function (MSE).
- $m$ is the number of training examples.
- $f(x^{(i)})$ is the predicted output for the $i$-th example.
- $y^{(i)}$ is the actual output for the $i$-th example.

### Gradient Boosting

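GradientBoostingRegressor performs gradient descent in function space. At each stage it computes the gradient of the cost function with respect to the current predictions and fits a new tree to the negative gradient, evaluated at every training example. For the MSE cost, this negative gradient is proportional to the residual $y^{(i)} - f_{t-1}(x^{(i)})$, which is why each tree is described as fitting the residuals of the previous prediction. This process is repeated for each tree.

The target that each tree is fit to is:

$$ h_t(x^{(i)}) \approx -\left.\frac{\partial J}{\partial f(x^{(i)})}\right|_{f = f_{t-1}} $$

Where:
- $h_t(x)$ is the tree added at stage $t$.
- $f_{t-1}$ is the model after $t-1$ trees.
- $\frac{\partial J}{\partial f(x^{(i)})}$ is the gradient of the cost function with respect to the prediction for the $i$-th example.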
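
For the MSE cost defined in the previous section, the negative gradient can be worked out explicitly. The short derivation below is a sketch that treats each prediction $f(x^{(i)})$ as a free variable, which is an assumption about how the gradient formula is meant to be read:

$$ \frac{\partial J}{\partial f(x^{(i)})} = \frac{\partial}{\partial f(x^{(i)})} \frac{1}{m} \sum_{j=1}^{m}(f(x^{(j)}) - y^{(j)})^2 = \frac{2}{m}(f(x^{(i)}) - y^{(i)}) $$

$$ -\frac{\partial J}{\partial f(x^{(i)})} = \frac{2}{m}(y^{(i)} - f(x^{(i)})) $$

Up to the constant factor $\frac{2}{m}$, the negative gradient is the residual $y^{(i)} - f(x^{(i)})$, so fitting a tree to the negative gradient and fitting it to the residuals amount to the same thing for this loss.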

## Training Process

The training process in GradientBoostingRegressor involves the following steps:
1. **Initial prediction**: Start with an initial prediction, which is typically the mean of the target variable.
2. **Compute residuals**: Calculate the residuals, which are the differences between the actual target values and the current model predictions.
3. **Fit decision trees**: Fit a decision tree to the residuals. This tree learns to predict the errors of the current model.
4. **Update prediction**: Add the predictions of the new tree to the current prediction.
5. **Repeat**: Repeat the above steps until the desired number of trees is reached or the model stops improving.

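GradientBoostingRegressor uses a learning rate ($\alpha$, the `learning_rate` parameter in scikit-learn, not to be confused with its `alpha` parameter) to control the contribution of each new tree. Smaller learning rates usually need more trees to reach the same training error but tend to generalize better, while larger learning rates fit the training data in fewer trees and may increase the risk of overfitting.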
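
The training steps can be sketched from scratch in plain Python. This is a minimal illustration under simplifying assumptions (one input feature, depth-1 trees, squared-error loss); the helper names `fit_stump` and `gradient_boost` and the toy data are hypothetical, and scikit-learn's actual implementation is considerably more elaborate.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree to 1-D inputs by minimizing squared error."""
    best = None
    # Try every split point except the largest x, so both sides stay non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Return a prediction function built by squared-error gradient boosting."""
    init = sum(ys) / len(ys)                    # step 1: initial prediction (mean)
    preds = [init] * len(xs)
    trees = []
    for _ in range(n_trees):                    # step 5: repeat
        residuals = [y - p for y, p in zip(ys, preds)]   # step 2: residuals
        tree = fit_stump(xs, residuals)                  # step 3: fit a tree
        trees.append(tree)
        # step 4: add the new tree, shrunk by the learning rate
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]

    def model(x):
        return init + learning_rate * sum(tree(x) for tree in trees)

    return model


# Toy data (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]


def training_mse(model):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


base = gradient_boost(xs, ys, n_trees=0)      # only the initial mean prediction
model = gradient_boost(xs, ys, n_trees=50, learning_rate=0.1)
print("training MSE with 0 trees :", training_mse(base))
print("training MSE with 50 trees:", training_mse(model))
```

Rerunning with a different `learning_rate` or `n_trees` shows how the two settings trade off against each other on the training error.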

By iteratively improving the model through decision trees, GradientBoostingRegressor can learn complex patterns in the data and achieve high accuracy in regression tasks.


## **Model Evaluation**

### 1. Mean Squared Error (MSE)

**Formula:**
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{\text{true}_i} - y_{\text{pred}_i})^2
$$

**Description:**
- **Mean Squared Error (MSE)** is a widely used metric for evaluating the accuracy of regression models.
- It measures the average squared difference between the predicted values ($y_{\text{pred}}$) and the actual target values ($y_{\text{true}}$).
- The squared differences are averaged across all data points in the dataset.

**Interpretation:**
- A lower MSE indicates a better fit of the model to the data, as it means the model's predictions are closer to the actual values.
- MSE is sensitive to outliers because the squared differences magnify the impact of large errors.
- **Limitations:**
  - MSE can be hard to interpret because it is in squared units of the target variable.
  - It disproportionately penalizes larger errors due to the squaring process.
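
A small hand-computed case makes the effect of squaring concrete. The function name and the numbers below are illustrative assumptions. Both prediction sets have the same total absolute error, but concentrating it in one point quadruples the MSE.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0, 9.0]
y_spread = [2.0, 6.0, 6.0, 10.0]    # every prediction is off by 1
y_outlier = [3.0, 5.0, 7.0, 13.0]   # three exact predictions, one off by 4

print(mean_squared_error(y_true, y_spread))   # 1.0
print(mean_squared_error(y_true, y_outlier))  # 4.0
```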

---

### 2. Root Mean Squared Error (RMSE)

**Formula:**
$$
\text{RMSE} = \sqrt{\text{MSE}}
$$

**Description:**
- **Root Mean Squared Error (RMSE)** is a variant of MSE that provides the square root of the average squared difference between predicted and actual values.
- It is often preferred because it is in the same unit as the target variable, making it more interpretable.

**Interpretation:**
- Like MSE, a lower RMSE indicates a better fit of the model to the data.
- RMSE is also sensitive to outliers, because the errors are squared before they are averaged; taking the square root afterwards does not undo this.
- **Advantages over MSE:**
  - RMSE provides a more intuitive interpretation since it is in the same scale as the target variable.
  - It can be more directly compared to the values of the actual data.
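
The unit argument can be seen in a short sketch. The prices below are made-up values in thousands of dollars, used only for illustration.

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error, in the units of the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical house prices in thousands of dollars.
prices_true = [200.0, 250.0, 300.0, 350.0]
prices_pred = [210.0, 240.0, 310.0, 340.0]

rmse = root_mean_squared_error(prices_true, prices_pred)
# The MSE here is 100.0 (thousand dollars squared); the RMSE is 10.0 thousand dollars.
print(rmse)  # 10.0
```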

---

### 3. R-squared ($R^2$)

**Formula:**
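$$
R^2 = 1 - \frac{\text{SSR}}{\text{SST}}
$$
where $\text{SSR} = \sum_{i=1}^{n} (y_{\text{true}_i} - y_{\text{pred}_i})^2$ is the sum of squared residuals and $\text{SST} = \sum_{i=1}^{n} (y_{\text{true}_i} - \bar{y}_{\text{true}})^2$ is the total sum of squares.

**Description:**
- **R-squared ($R^2$)**, also known as the coefficient of determination, measures the proportion of the variance in the dependent variable ($y_{\text{true}}$) that is explained by the model's predictions ($y_{\text{pred}}$).
- It is at most 1, which indicates a perfect fit. A value of 0 means the model does no better than always predicting the mean of the target, and the value becomes negative when the model does worse than that, which can happen on held-out data.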

**Interpretation:**
- A higher $R^2$ value suggests that the model explains a larger proportion of the variance in the target variable.
- However, $R^2$ does not provide information about the goodness of individual predictions or whether the model is overfitting or underfitting.
- **Limitations:**
  - $R^2$ computed on the training data can be misleading in cases of overfitting. Even if it is high, the model may not generalize well to unseen data, so it should also be checked on held-out data.
  - It doesn’t penalize for adding irrelevant predictors, so adjusted $R^2$ is often preferred for models with multiple predictors.
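
The formula can be checked on three simple cases. The sketch below assumes SSR means the sum of squared residuals and SST the total sum of squares around the mean of the targets; the function name and data are illustrative. Note that the last case gives a negative value, since the predictions are worse than always predicting the mean.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SSR / SST, with SSR the sum of squared residuals."""
    mean_true = sum(y_true) / len(y_true)
    ssr = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ssr / sst


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

print(r_squared(y_true, y_true))                      # perfect predictions: 1.0
print(r_squared(y_true, [3.0] * 5))                   # always the mean: 0.0
print(r_squared(y_true, [5.0, 4.0, 3.0, 2.0, 1.0]))   # reversed order: -3.0
```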

---

### 4. Adjusted R-squared

**Formula:**
$$
\text{Adjusted } R^2 = 1 - \left(1 - R^2\right) \frac{n-1}{n-p-1}
$$
where $n$ is the number of data points and $p$ is the number of predictors.

**Description:**
- **Adjusted R-squared** adjusts the R-squared value to account for the number of predictors in the model, penalizing terms that add complexity without improving the fit.
- Unlike $R^2$, it can decrease if the additional predictors do not improve the model significantly.

**Interpretation:**
- A higher adjusted $R^2$ suggests that the model's explanatory power holds up once the number of predictors is taken into account, although it does not by itself rule out overfitting.
- It is especially useful when comparing models with different numbers of predictors.
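
A quick numeric sketch shows the size of the penalty. The function name and the values of $R^2$, $n$, and $p$ below are hypothetical, chosen only to illustrate the formula.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n data points and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for adjusted R^2 to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Same plain R^2 of 0.90 on 50 data points, but with different numbers of predictors.
print(adjusted_r_squared(0.90, n=50, p=5))    # roughly 0.889
print(adjusted_r_squared(0.90, n=50, p=20))   # roughly 0.831
```

With the same plain $R^2$, the model that needs more predictors receives the lower adjusted score.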

---

### 5. Mean Absolute Error (MAE)

**Formula:**
$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_{\text{true}_i} - y_{\text{pred}_i}|
$$

**Description:**
- **Mean Absolute Error (MAE)** measures the average of the absolute errors between the predicted and actual values.
- MAE is less sensitive to outliers than MSE and RMSE because it does not square the errors, so a large error contributes in proportion to its size rather than its square.

**Interpretation:**
- MAE provides a straightforward understanding of the average error magnitude.
- A lower MAE suggests better model accuracy, but it may not highlight the impact of large errors as much as MSE or RMSE.
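
The difference in how MAE and MSE react to a single large error can be sketched as follows. The function names and data are illustrative assumptions; the only change between the two prediction sets is the last value.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of squared differences, shown here for comparison."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 12.0, 14.0, 16.0, 18.0]
y_pred = [11.0, 11.0, 15.0, 15.0, 18.0]           # errors: 1, 1, 1, 1, 0
y_pred_outlier = [11.0, 11.0, 15.0, 15.0, 38.0]   # last error becomes 20

print(mean_absolute_error(y_true, y_pred))           # 0.8
print(mean_absolute_error(y_true, y_pred_outlier))   # 4.8
print(mean_squared_error(y_true, y_pred))            # 0.8
print(mean_squared_error(y_true, y_pred_outlier))    # 80.8
```

The single outlier multiplies the MAE by 6 but the MSE by 101, which is why MAE is usually described as the more robust of the two.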

## sklearn template [scikit-learn: GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor)

### class sklearn.ensemble.GradientBoostingRegressor(*, loss='squared_error', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, max_depth=3, min_impurity_decrease=0.0, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=1e-4, ccp_alpha=0.0)

| **Parameter**          | **Description**                                                                                                                                                                | **Default**     |
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|
| `loss`                 | Loss function to be minimized: `'squared_error'` (least squares, named `'ls'` before scikit-learn 1.0), `'absolute_error'`, `'huber'`, or `'quantile'`.                        | `'squared_error'` |
| `learning_rate`        | Shrinks the contribution of each tree by this factor. There is a trade-off with `n_estimators`: lower values generally need more trees.                                       | `0.1`           |
| `n_estimators`         | Number of boosting stages to be run.                                                                                                                                             | `100`           |
| `subsample`            | Proportion of samples used for fitting each tree. Values between 0.0 and 1.0.                                                                                                | `1.0`           |
| `criterion`            | The function to measure the quality of a split. `'friedman_mse'` is the default; `'squared_error'` is the alternative.                                                         | `'friedman_mse'`|
| `min_samples_split`    | Minimum number of samples required to split an internal node.                                                                                                                  | `2`             |
| `min_samples_leaf`     | Minimum number of samples required to be at a leaf node.                                                                                                                       | `1`             |
| `max_depth`            | Maximum depth of the individual trees.                                                                                                                                           | `3`             |
| `min_impurity_decrease`| Minimum impurity decrease required for a split to occur.                                                                                                                        | `0.0`           |
| `random_state`         | Controls the randomness of the estimator. Used for reproducibility.                                                                                                            | `None`          |
| `max_features`         | The number of features to consider when looking for the best split. Can be integer, float, or specific strings.                                                               | `None`          |
| `alpha`                | The alpha-quantile of the huber loss function and the quantile loss function. Used if `loss='huber'` or `loss='quantile'`.                                                    | `0.9`           |
| `verbose`              | Controls the verbosity of the output.                                                                                                                                           | `0`             |
| `max_leaf_nodes`       | Maximum number of leaf nodes in the trees.                                                                                                                                       | `None`          |
| `warm_start`           | If True, reuse the solution of the previous call to fit and add more estimators.                                                                                                | `False`         |
| `validation_fraction`  | Proportion of training data set aside as validation set for early stopping.                                                                                                     | `0.1`           |
| `n_iter_no_change`     | Number of iterations with no improvement in validation score before stopping training.                                                                                         | `None`          |
| `tol`                  | Tolerance for the early stopping. Training will stop if the validation score is not improving by at least `tol` for `n_iter_no_change` iterations.                             | `1e-4`          |
| `ccp_alpha`            | Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than `ccp_alpha` is chosen. No pruning by default. | `0.0`           |

---

| **Attribute**          | **Description**                                                                                                                                                                |
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `n_estimators_`        | Number of boosting stages (i.e., trees) actually fitted. Equals `n_estimators` unless early stopping (`n_iter_no_change`) ends training sooner.                                |
| `feature_importances_` | Impurity-based feature importances.                                                                                                                                             |
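| `oob_improvement_`     | Improvement in loss on the out-of-bag samples relative to the previous iteration. Only available if `subsample < 1.0`.                                                         |
| `train_score_`         | The training loss at each iteration, computed on the in-bag samples (the full training set if `subsample=1.0`).                                                                |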
| `estimators_`          | The collection of fitted sub-estimators (decision trees).                                                                                                                      |
| `n_features_in_`       | The number of features seen during fitting.                                                                                                                                     |
| `feature_names_in_`    | Names of features seen during fitting (if applicable).                                                                                                                         |

---

| **Method**             | **Description**                                                                                                                                                                |
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `fit(X, y)`            | Fit the gradient boosting model to the input data `X` and target values `y`.                                                                                                  |
| `predict(X)`           | Predict regression target for the input data `X`.                                                                                                                              |
| `score(X, y)`          | Return the coefficient of determination (R² score) for the prediction.                                                                                                        |
| `get_params()`         | Gets the parameters of the GradientBoostingRegressor model.                                                                                                                     |
| `set_params(**params)` | Sets the parameters of the GradientBoostingRegressor model.                                                                                                                     |
| `apply(X)`             | Apply trees in the ensemble to `X`, returning the leaf indices for each tree.                                                                                                  |
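
The parameters, attributes, and methods listed in these tables can be combined as in the following sketch. The synthetic dataset from `make_regression` and the specific parameter values are assumptions made for illustration, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset (an assumption for illustration).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingRegressor(
    learning_rate=0.1,
    n_estimators=500,          # upper bound; early stopping may use fewer trees
    max_depth=3,
    validation_fraction=0.1,   # share of the training data held out for early stopping
    n_iter_no_change=10,       # stop after 10 stages without validation improvement
    random_state=0,
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("trees actually fitted:", model.n_estimators_)
print("R^2:", model.score(X_test, y_test))
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
```

Setting `n_iter_no_change` turns on early stopping, so `n_estimators_` reports how many of the requested stages were kept.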





# XXXXXXXX regression - Example

## Data loading

## Data processing

## Plotting data

## Model definition

## Model evaluation