# **XGBoostRegressor Model Theory**


## Theory
XGBoostRegressor (exposed in the Python package as `XGBRegressor`) is an implementation of gradient boosting designed for regression tasks. It builds an ensemble of decision trees sequentially, where each new tree is trained to correct the residuals (errors) of the ensemble built so far. XGBoost applies several optimizations, including L1 and L2 regularization, parallelized tree construction, and native handling of missing values, to improve training speed and generalization. Its efficiency, scalability, and accuracy make it a popular choice for regression problems with large datasets.

The model function for XGBoostRegressor is the weighted sum of the outputs from all decision trees:

$$ f(x) = \sum_{t=1}^{T} \alpha_t h_t(x) $$

Where:
- $f(x)$ is the predicted output.
- $T$ is the number of trees.
- $\alpha_t$ is the weight of the $t$-th tree (in XGBoost, each tree's contribution is scaled by the learning rate).
- $h_t(x)$ is the prediction of the $t$-th tree for input $x$.
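
The weighted sum above can be illustrated with a small sketch. It assumes each "tree" is a hypothetical one-split stump stored as a `(threshold, left_value, right_value)` tuple; the names `stump_predict` and `ensemble_predict` are illustrative and not part of the XGBoost API.

```python
# Illustrative only: each "tree" is a one-split stump on a single feature,
# stored as (threshold, left_value, right_value). This is an assumption made
# for the example, not how XGBoost stores its trees.
def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x < threshold else right_value


def ensemble_predict(stumps, weights, x):
    # f(x) = sum over t of alpha_t * h_t(x)
    return sum(alpha * stump_predict(h, x) for alpha, h in zip(weights, stumps))


stumps = [(2.0, 1.0, 3.0), (5.0, -0.5, 0.5)]
weights = [1.0, 0.1]

# For x = 4.0: the first stump returns 3.0, the second returns -0.5.
prediction = ensemble_predict(stumps, weights, 4.0)
print(prediction)
```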

## Model Training

### Forward Pass

In the forward pass, the prediction for an input is obtained by passing it through every tree in the ensemble and summing the weighted outputs. During training, XGBoostRegressor adds trees iteratively, with each new tree fit to the residuals (errors) of the trees already in the ensemble. L1 and L2 regularization control overfitting and improve the model's ability to generalize.

### Cost Function

XGBoostRegressor uses a loss function (usually mean squared error for regression) that measures how far the predictions are from the actual target values. The cost function is:

$$ J(\alpha) = \frac{1}{m} \sum_{i=1}^{m}(f(x^{(i)}) - y^{(i)})^2 $$

Where:
- $J(\alpha)$ is the cost function (MSE).
- $m$ is the number of training examples.
- $f(x^{(i)})$ is the predicted output for the $i$-th example.
- $y^{(i)}$ is the actual output for the $i$-th example.

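XGBoost also includes regularization terms in the objective function that penalize tree complexity (the number of leaves and the size of the leaf weights), helping to prevent overfitting:

$$ J(\alpha) = \text{Loss}(f(x), y) + \sum_{t=1}^{T} \left( \gamma K_t + \frac{1}{2} \lambda \|w_t\|^2 \right) $$

Where $K_t$ is the number of leaves in the $t$-th tree, $w_t$ is the vector of its leaf weights, $\gamma$ is the penalty per leaf, and $\lambda$ is the L2 regularization parameter. An L1 penalty on the leaf weights (`reg_alpha`) can be added in the same way.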
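
As a rough sketch of how a regularized objective is assembled, the snippet below adds a complexity penalty to the MSE loss. It assumes the penalty form described in the XGBoost documentation (a cost `gamma` per leaf plus half of `lam` times the sum of squared leaf weights); the function names and the representation of a tree as a plain list of leaf weights are hypothetical.

```python
def mse_loss(y_true, y_pred):
    m = len(y_true)
    return sum((p - y) ** 2 for p, y in zip(y_pred, y_true)) / m


def tree_penalty(leaf_weights, gamma, lam):
    # gamma * (number of leaves) + 0.5 * lambda * (sum of squared leaf weights)
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def regularized_objective(y_true, y_pred, trees, gamma=0.0, lam=1.0):
    # "trees" is a list of leaf-weight lists, one list per tree (an assumption
    # made for this example).
    penalty = sum(tree_penalty(leaves, gamma, lam) for leaves in trees)
    return mse_loss(y_true, y_pred) + penalty


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
trees = [[1.0, -1.0], [0.5, 0.5, 2.0]]

objective = regularized_objective(y_true, y_pred, trees, gamma=0.1, lam=1.0)
print(objective)
```

Raising `gamma` or `lam` increases the penalty for the same predictions, which is what pushes training toward smaller trees with smaller leaf values.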

### Gradient Boosting

XGBoostRegressor uses gradient boosting, where each new tree is fit to the negative gradient of the loss function with respect to the current prediction. For squared error, this negative gradient is proportional to the residual, so the model improves by focusing on the examples where it makes large errors. XGBoost additionally uses second-order (Hessian) information when choosing splits and leaf values.

The target for each tree is approximately:

$$ h_t(x) \approx -\frac{\partial J}{\partial f(x)} \bigg|_{f = f_{t-1}} $$

Where:
- $h_t(x)$ is the tree added at stage $t$.
- $\frac{\partial J}{\partial f(x)}$ is the gradient of the cost function with respect to the prediction, evaluated at the prediction of the previous stage, $f_{t-1}(x)$.
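
To make the gradient step concrete, the sketch below computes per-example gradients and Hessians, then the value a single leaf would take. It assumes a per-example loss of $\frac{1}{2}(f(x) - y)^2$ (so the gradient is the prediction minus the target and the Hessian is 1) and the second-order leaf formula $w = -G / (H + \lambda)$ used by XGBoost, where $G$ and $H$ are the sums of gradients and Hessians in the leaf. The function names are illustrative.

```python
def grad_hess_squared_error(y_true, y_pred):
    # Assumes per-example loss 0.5 * (pred - y)^2:
    # gradient = pred - y, hessian = 1.
    grads = [p - y for p, y in zip(y_pred, y_true)]
    hess = [1.0 for _ in y_true]
    return grads, hess


def optimal_leaf_weight(grads, hess, lam=1.0):
    # w = -G / (H + lambda), with G and H summed over the examples in the leaf.
    return -sum(grads) / (sum(hess) + lam)


y_true = [10.0, 12.0, 14.0]
y_pred = [11.0, 11.0, 11.0]  # current ensemble prediction for each example

grads, hess = grad_hess_squared_error(y_true, y_pred)
leaf_weight = optimal_leaf_weight(grads, hess, lam=1.0)
unregularized = optimal_leaf_weight(grads, hess, lam=0.0)
print(leaf_weight, unregularized)
```

With `lam=0.0` the leaf value equals the mean residual of the examples in the leaf; a positive `lam` shrinks it toward zero.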

## Training Process

The training process for XGBoostRegressor involves the following steps:
1. **Initial prediction**: Start with a constant initial prediction (the `base_score`), for example the mean of the target variable.
2. **Compute residuals**: Calculate the residuals, which are the differences between the actual target values and the current model predictions.
3. **Fit decision trees**: Fit a new decision tree to the residuals. XGBoost uses gradient information and regularization to build trees that reduce errors while avoiding overfitting.
4. **Update prediction**: Add the output of the new tree, scaled by the learning rate, to the current prediction.
5. **Repeat**: Repeat the process until a predefined number of trees has been added or further improvement is minimal.
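
The steps above can be sketched from scratch in a few lines. This is a simplified illustration, not XGBoost's implementation: it assumes a single numeric feature, uses one-split stumps in place of full trees, and omits regularization and second-order information. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-split stump on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x < threshold else right_value


def boost(xs, ys, n_trees=50, learning_rate=0.3):
    base = sum(ys) / len(ys)                      # step 1: initial prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):                      # step 5: repeat
        residuals = [y - p for y, p in zip(ys, preds)]   # step 2: residuals
        stump = fit_stump(xs, residuals)                 # step 3: fit a tree
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)  # step 4: update
                 for p, x in zip(preds, xs)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

base, stumps, preds = boost(xs, ys)
initial_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, preds)
print(initial_mse, final_mse)
```

The training error after boosting is much lower than the error of the constant initial prediction, which is the behavior the five steps describe.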

XGBoost also supports early stopping, where training can halt if the model's performance on a validation set does not improve after a certain number of rounds. Additionally, hyperparameters like learning rate, number of trees, and tree depth can be tuned to balance bias and variance.
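
The early-stopping rule can be sketched independently of any library. The function below is a hypothetical helper, not XGBoost code: it assumes a list of validation errors, one per boosting round, and a `patience` value playing the role of XGBoost's `early_stopping_rounds` setting.

```python
def best_round_with_early_stopping(validation_errors, patience):
    """Return (best_round, stopped_at) for a sequence of validation errors."""
    best_round, best_error = 0, float("inf")
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_round, best_error = round_index, error
        elif round_index - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop training.
            break
    return best_round, round_index


errors = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.60]
best_round, stopped_at = best_round_with_early_stopping(errors, patience=3)
print(best_round, stopped_at)
```

In this example training stops at round 5, so the lower error at round 6 is never reached. That is the trade-off when choosing the patience value: too small and training may stop before a late improvement, too large and extra rounds are wasted.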

By iteratively improving predictions with decision trees and using advanced optimization techniques, XGBoostRegressor can learn complex patterns in the data and provide accurate predictions for regression tasks.


## **Model Evaluation**

### 1. Mean Squared Error (MSE)

**Formula:**
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{\text{true}_i} - y_{\text{pred}_i})^2
$$

**Description:**
- **Mean Squared Error (MSE)** is a widely used metric for evaluating the accuracy of regression models.
- It measures the average squared difference between the predicted values ($y_{\text{pred}}$) and the actual target values ($y_{\text{true}}$).
- The squared differences are averaged across all data points in the dataset.

**Interpretation:**
- A lower MSE indicates a better fit of the model to the data, as it means the model's predictions are closer to the actual values.
- MSE is sensitive to outliers because the squared differences magnify the impact of large errors.
- **Limitations:**
  - MSE can be hard to interpret because it is in squared units of the target variable.
  - It disproportionately penalizes larger errors due to the squaring process.
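
A minimal sketch of the MSE formula in plain Python, with a second set of predictions (made up for illustration) that contains one large error to show the sensitivity to outliers:

```python
def mean_squared_error(y_true, y_pred):
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
y_pred_outlier = [2.0, 5.0, 8.0, 19.0]  # same, but the last prediction is off by 10

mse_clean = mean_squared_error(y_true, y_pred)            # (1 + 0 + 1 + 0) / 4
mse_outlier = mean_squared_error(y_true, y_pred_outlier)  # (1 + 0 + 1 + 100) / 4
print(mse_clean, mse_outlier)
```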

---

### 2. Root Mean Squared Error (RMSE)

**Formula:**
$$
\text{RMSE} = \sqrt{\text{MSE}}
$$

**Description:**
- **Root Mean Squared Error (RMSE)** is a variant of MSE that provides the square root of the average squared difference between predicted and actual values.
- It is often preferred because it is in the same unit as the target variable, making it more interpretable.

**Interpretation:**
- Like MSE, a lower RMSE indicates a better fit of the model to the data.
- RMSE is also sensitive to outliers, because the errors are squared before they are averaged; taking the square root afterwards only restores the original units.
- **Advantages over MSE:**
  - RMSE provides a more intuitive interpretation since it is in the same scale as the target variable.
  - It can be more directly compared to the values of the actual data.
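
A short sketch of RMSE, using made-up values chosen so that every prediction is off by exactly 10 units of the target:

```python
import math


def root_mean_squared_error(y_true, y_pred):
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    return math.sqrt(mse)


y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 310.0]

rmse = root_mean_squared_error(y_true, y_pred)
print(rmse)
```

Here the MSE is 100 (in squared units), while the RMSE is 10, which matches the size of the individual errors and is therefore easier to compare with the target values.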

---

### 3. R-squared ($R^2$)

**Formula:**
$$
R^2 = 1 - \frac{\text{SSR}}{\text{SST}}
$$

**Description:**
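- Here $\text{SSR} = \sum_{i=1}^{n} (y_{\text{true}_i} - y_{\text{pred}_i})^2$ is the sum of squared residuals and $\text{SST} = \sum_{i=1}^{n} (y_{\text{true}_i} - \bar{y}_{\text{true}})^2$ is the total sum of squares around the mean of the target.
- **R-squared ($R^2$)**, also known as the coefficient of determination, measures the proportion of the variance in the dependent variable ($y_{\text{true}}$) that is explained by the model's predictions ($y_{\text{pred}}$).
- A value of 1 indicates a perfect fit, and 0 indicates that the model does no better than always predicting the mean of the target. On held-out data $R^2$ can be negative when the model performs worse than that constant baseline.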

**Interpretation:**
- A higher $R^2$ value suggests that the model explains a larger proportion of the variance in the target variable.
- However, $R^2$ does not provide information about the goodness of individual predictions or whether the model is overfitting or underfitting.
- **Limitations:**
  - $R^2$ can be misleading in cases of overfitting, especially with very flexible models such as large ensembles of deep trees. Even if $R^2$ is high on the training data, the model may not generalize well to unseen data.
  - It doesn’t penalize for adding irrelevant predictors, so adjusted $R^2$ is often preferred for models with multiple predictors.
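
The sketch below computes $R^2$ from the sum of squared residuals and the total sum of squares, using made-up predictions for three cases: a close fit, a constant prediction equal to the mean, and a prediction that is worse than the mean (which yields a negative value).

```python
def r2_score(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # squared residuals
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

r2_good = r2_score(y_true, [1.1, 1.9, 3.2, 3.8])      # close fit
r2_baseline = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
r2_bad = r2_score(y_true, [4.0, 3.0, 2.0, 1.0])       # worse than the mean
print(r2_good, r2_baseline, r2_bad)
```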

---

### 4. Adjusted R-squared

**Formula:**
$$
\text{Adjusted } R^2 = 1 - \left(1 - R^2\right) \frac{n-1}{n-p-1}
$$
where $n$ is the number of data points and $p$ is the number of predictors.

**Description:**
- **Adjusted R-squared** adjusts the R-squared value to account for the number of predictors in the model, helping to prevent overfitting when adding more terms to the model.
- Unlike $R^2$, it can decrease if the additional predictors do not improve the model significantly.

**Interpretation:**
- A higher adjusted $R^2$ suggests that the model is not just overfitting, but has genuine explanatory power with the number of predictors taken into account.
- It is especially useful when comparing models with different numbers of predictors.
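
A small sketch of the adjustment, assuming an $R^2$ of 0.90 on 101 data points and comparing a model with 5 predictors against one with 50 (values chosen for illustration):

```python
def adjusted_r2(r2, n, p):
    # n: number of data points, p: number of predictors
    if n - p - 1 <= 0:
        raise ValueError("adjusted R^2 requires n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


adj_few = adjusted_r2(0.90, n=101, p=5)    # 1 - 0.1 * 100 / 95
adj_many = adjusted_r2(0.90, n=101, p=50)  # 1 - 0.1 * 100 / 50
print(adj_few, adj_many)
```

With the same raw $R^2$, the model that needs 50 predictors gets a visibly lower adjusted score than the one that needs only 5.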

---

### 5. Mean Absolute Error (MAE)

**Formula:**
$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_{\text{true}_i} - y_{\text{pred}_i}|
$$

**Description:**
- **Mean Absolute Error (MAE)** measures the average of the absolute errors between the predicted and actual values.
- Compared with MSE and RMSE, MAE is less sensitive to outliers because it does not square the errors; a large error contributes in proportion to its size rather than to its square.

**Interpretation:**
- MAE provides a straightforward understanding of the average error magnitude.
- A lower MAE suggests better model accuracy, but it may not highlight the impact of large errors as much as MSE or RMSE.
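
The sketch below computes MAE and compares how much it grows, relative to MSE, when one prediction is off by 10 (the values are made up for illustration):

```python
def mean_absolute_error(y_true, y_pred):
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def mean_squared_error(y_true, y_pred):
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
y_pred_outlier = [2.0, 5.0, 8.0, 19.0]  # last prediction is off by 10

mae_clean = mean_absolute_error(y_true, y_pred)            # (1 + 0 + 1 + 0) / 4
mae_outlier = mean_absolute_error(y_true, y_pred_outlier)  # (1 + 0 + 1 + 10) / 4

# Growth factor caused by the single large error, for each metric.
mae_growth = mae_outlier / mae_clean
mse_growth = (mean_squared_error(y_true, y_pred_outlier)
              / mean_squared_error(y_true, y_pred))
print(mae_growth, mse_growth)
```

The single large error multiplies MAE by 6 but MSE by 51, which illustrates why MAE is the more robust of the two when outliers are present.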

## XGBoost template [XGBoost: XGBRegressor](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor)

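### class xgboost.XGBRegressor(*, objective='reg:squarederror', booster='gbtree', n_estimators=100, learning_rate=0.1, max_depth=3, min_child_weight=1, gamma=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, n_jobs=1, missing=None, importance_type='weight', tree_method='auto', predictor='auto', verbosity=1)

The defaults shown in this signature and in the table below appear to follow an older release of the scikit-learn wrapper. In recent releases many of these arguments default to `None` and are resolved by the core library, so confirm the values against the linked API reference for the installed version.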

| **Parameter**          | **Description**                                                                                                                                                                | **Default**     |
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|
| `objective`            | Specify the learning task and corresponding objective function. For regression, use 'reg:squarederror'.                                                                       | `'reg:squarederror'` |
| `booster`              | Type of boosting model to use. Options: 'gbtree', 'gblinear', 'dart'.                                                                                                          | `'gbtree'`      |
| `n_estimators`         | Number of boosting rounds (trees).                                                                                                                                              | `100`           |
| `learning_rate`        | Step size shrinking to prevent overfitting.                                                                                                                                     | `0.1`           |
| `max_depth`            | Maximum depth of a tree.                                                                                                                                                        | `3`             |
| `min_child_weight`     | Minimum sum of instance weight (hessian) in a child.                                                                                                                           | `1`             |
| `gamma`                | Minimum loss reduction required to make a further partition.                                                                                                                   | `0`             |
| `subsample`            | Fraction of samples used to grow each tree.                                                                                                                                     | `1`             |
| `colsample_bytree`     | Fraction of features to consider when building each tree.                                                                                                                     | `1`             |
| `colsample_bylevel`    | Fraction of features to consider for each level.                                                                                                                               | `1`             |
| `reg_alpha`            | L1 regularization term on weights.                                                                                                                                              | `0`             |
| `reg_lambda`           | L2 regularization term on weights.                                                                                                                                              | `1`             |
| `scale_pos_weight`     | Controls the balance of positive and negative weights for unbalanced classes.                                                                                                  | `1`             |
| `base_score`           | The initial prediction score.                                                                                                                                                   | `0.5`           |
| `random_state`         | Seed for reproducibility.                                                                                                                                                       | `0`             |
| `n_jobs`               | Number of parallel threads used during training.                                                                                                                               | `1`             |
| `missing`              | Value used to represent missing data in the input dataset.                                                                                                                     | `None`          |
| `importance_type`      | The method to calculate feature importance. Options: 'weight', 'gain', 'cover'.                                                                                               | `'weight'`      |
| `tree_method`          | The tree construction algorithm. Options: 'auto', 'exact', 'approx', 'hist', 'gpu_hist'.                                                                                       | `'auto'`        |
| `predictor`            | The type of predictor to use. Options: 'auto', 'cpu_predictor', 'gpu_predictor'.                                                                                               | `'auto'`        |
| `verbosity`            | The level of messages to print during training.                                                                                                                                 | `1`             |

---

| **Attribute**          | **Description**                                                                                                                                                                |
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `feature_importances_` | Feature importances computed during training.                                                                                                                                  |
| `get_booster()` (method) | The trained booster (XGBoost model object); it is retrieved through this method rather than through a `booster_` attribute.                                       |
| `n_features_in_`       | Number of features seen during fitting.                                                                                                                                         |
| `best_iteration`       | Best boosting round found by early stopping (available when early stopping is used).                                                                                           |

---

| **Method**             | **Description**                                                                                                                                                                |
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `fit(X, y)`            | Fit the model to the training data `X` and target values `y`.                                                                                                                 |
| `predict(X)`           | Predict regression target for the input data `X`.                                                                                                                              |
| `score(X, y)`          | Return the R² score for the prediction.                                                                                                                                         |
| `get_params()`         | Get the parameters of the XGBRegressor model.                                                                                                                                   |
| `set_params(**params)` | Set the parameters of the XGBRegressor model.                                                                                                                                   |
| `get_booster()`        | Returns the trained booster (model) as a `Booster` object.                                                                                                                     |
| `get_booster().get_dump()` | Dump the model in human-readable format (`get_dump()` is a method of the `Booster` object returned by `get_booster()`).                                                    |




# XGBoost regression - Example

## Data loading

## Data processing

## Plotting data

## Model definition

## Model evaluation