<h1 id="random-forest-regressor-algorithm" align="center">📈 XGBoost Regressor 📈</h1>

<center><i>Combining Different Models to Get Accurate Predictions</i></center>

----

<h1 id='brief-description'>📝 Brief Description</h1>

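`XGBoost` (eXtreme Gradient Boosting) is a `Machine Learning Ensemble Model`: it trains a sequence of decision trees, each one fitted to correct the errors of the trees before it, and adds their outputs together in order to get accurate results.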

<center>
<img src='https://media.geeksforgeeks.org/wp-content/uploads/20210707140912/Bagging.png' />
</center>
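
To make the boosting idea concrete, here is a minimal sketch of plain gradient boosting for squared error, assuming one-split trees ("stumps") as the weak learners. This is not XGBoost's actual implementation (which adds regularization, second-order gradients and many optimizations); the helper names `fit_stump`, `predict_stump` and `boost`, and the toy sine dataset, are hypothetical and exist only for illustration.

```python
import numpy as np


def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    # Skip the largest value so the right-hand side is never empty
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = y.mean()
    prediction = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - prediction  # errors left by the ensemble so far
        stump = fit_stump(x, residuals)
        # Each new learner only contributes a fraction (learning_rate) of its fit
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


# Toy data (hypothetical): a noisy sine wave
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

base, stumps, fitted = boost(x, y)
rmse_before = np.sqrt(np.mean((y - base) ** 2))
rmse_after = np.sqrt(np.mean((y - fitted) ** 2))
print('RMSE with the mean only:', rmse_before)
print('RMSE after boosting:', rmse_after)
```

Each round looks only at what the previous rounds got wrong, which is why the learners must be trained one after another.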

<br />

**✔️ Pros:**

```
- Gradient Boosting is an easy-to-read and easy-to-interpret algorithm, which makes most of its predictions easy to handle;

- Boosting is a resilient and robust method that curbs over-fitting quite easily;

- XGBoost performs very well on small and medium-sized structured datasets, including data with subgroups and not too many features;

- Less feature engineering is required (no need to scale or normalize the data, and missing values are handled well);

- Feature importance is available (the model outputs the importance of each feature, which can be used for feature selection);

- Handles large datasets well;

- Good execution speed;

- Good model performance (it has been used in many winning Kaggle solutions);

- Less prone to overfitting.
```

<br />

**❌ Cons:**

```
- XGBoost tends to perform less well on very sparse and unstructured data (for example images or raw text);

- The algorithm is very sensitive to outliers, since every learner is forced to fix the errors of its predecessors;

- The overall method is hard to scale: each estimator is built from the errors of the previous ones, so the boosting rounds run one after another and are difficult to streamline;

- Difficult to interpret and hard to visualize;

- Overfitting is possible if the parameters are not tuned properly;

- Harder to tune, as there are many hyperparameters.
```

<br />

**📛 Some XGB Regressor Properties:**

```
- objective: loss function the model minimizes (e.g. reg:squarederror)
- n_estimators: number of boosting rounds, i.e. trees in the ensemble
- learning_rate: shrinkage factor applied to each tree's contribution; lower values learn more slowly and usually need more trees
- colsample_bytree: fraction of the features sampled for each tree
- max_depth: maximum depth of each tree
- n_jobs: number of parallel threads used during the training and prediction steps

- early_stopping_rounds: number of consecutive rounds without improvement on the evaluation set after which training stops
- eval_set: dataset(s) used for evaluation (commonly the validation set)
- verbose: whether the training log is printed to the screen
```
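
The early stopping rule can be sketched in a few lines of plain Python. This is only an illustration of the idea, not XGBoost's internal code: the helper `early_stopping_round` is hypothetical, and the validation errors below are made-up numbers.

```python
def early_stopping_round(validation_errors, patience):
    """Return (stop_round, best_round), both 0-indexed.

    Training stops at the first round where the validation error has not
    improved on its best value for `patience` consecutive rounds.
    """
    best_error = float('inf')
    best_round = -1
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_error = error
            best_round = round_index
        elif round_index - best_round >= patience:
            return round_index, best_round
    # Never triggered: every round was used
    return len(validation_errors) - 1, best_round


# Made-up validation errors, one per boosting round
errors = [9.0, 8.1, 7.6, 7.5, 7.7, 7.6, 7.8, 7.9, 8.0, 8.2]
stop_round, best_round = early_stopping_round(errors, patience=5)
print(stop_round, best_round)  # 8 3
```

Here the best error appears at round 3, and training stops at round 8 after five rounds in a row without improvement.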

----

<h1 id='reach-me'>ℹ️ Further Information</h1>
<br/>

For further information, check out these four videos from the *[StatQuest with Josh Starmer](https://www.youtube.com/@statquest)* YouTube channel:

- *[XGBoost Part 1 (of 4): Regression](https://www.youtube.com/watch?v=OtD8wVaFm6E)*
- *[XGBoost Part 2 (of 4): Classification](https://www.youtube.com/watch?v=8b1JEDvenQU)*
- *[XGBoost Part 3 (of 4): Mathematical Details](https://www.youtube.com/watch?v=ZVFeW798-2I)*
- *[XGBoost Part 4 (of 4): Crazy Cool Optimizations](https://www.youtube.com/watch?v=oRrKeUCEbq8)*

----

<h1 id='example-code'>💻 Example Code</h1>
<br/>

Let's use the `xgboost` package to demonstrate how to create, fit, make predictions with and evaluate a simple `XGBoost Regressor Model`.

For the evaluation, we will use the `Root Mean Squared Error (RMSE)` metric. It takes the difference between each predicted value and the real one, squares it, calculates the mean of the squared differences and then takes the square root of the result. The method can be represented by the following equation:

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$

where $\hat{y}_i$ are the predicted values, $y_i$ are the real values and $n$ is the number of samples.
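
As a quick worked example, RMSE can be computed by hand with NumPy. The helper `rmse` and the four sample values are hypothetical and only illustrate the arithmetic: square the errors, average them, take the square root.

```python
import numpy as np


def rmse(real_values, predicted_values):
    """Root Mean Squared Error between two equally sized sequences."""
    real_values = np.asarray(real_values, dtype=float)
    predicted_values = np.asarray(predicted_values, dtype=float)
    squared_errors = (predicted_values - real_values) ** 2
    return float(np.sqrt(squared_errors.mean()))


real = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]

# Errors are -1, 0, 2, 0, so the squared errors are 1, 0, 4, 0 and their mean is 1.25
result = rmse(real, predicted)
print(result)  # about 1.118
```

Because the errors are squared before averaging, one large miss weighs more than several small ones.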


Now, let's hop into the code!!

```python
# Setting up the environment
import math

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error

SEED = 2000
X_MIN = 0
X_MAX = 100
Y_MIN = 0
Y_MAX = 20

TRAIN_SAMPLES = 800
VALID_SAMPLES = 2

np.random.seed(SEED)
pd.set_option('display.max_rows', 15)
pd.set_option('display.max_columns', 15)
```

```python
# Generating fake dataset
# X and y are independent random integers (upper bounds are exclusive)
X_train = np.random.randint(X_MIN, X_MAX, TRAIN_SAMPLES)
X_valid = np.random.randint(X_MIN, X_MAX, VALID_SAMPLES)
y_train = np.random.randint(Y_MIN, Y_MAX, TRAIN_SAMPLES)
y_valid = np.random.randint(Y_MIN, Y_MAX, VALID_SAMPLES)

# Wrapping the arrays into single-column DataFrames
X_train = pd.DataFrame(X_train, columns=['X'])
X_valid = pd.DataFrame(X_valid, columns=['X'])
y_train = pd.DataFrame(y_train, columns=['y'])
y_valid = pd.DataFrame(y_valid, columns=['y'])
```

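```python
# Creating the model
# Note: early_stopping_rounds is set here (xgboost 1.6 or newer) because passing
# it to fit() has been deprecated since 1.6 and is no longer accepted by recent releases
xgb_model = xgb.XGBRegressor(
    objective='reg:squarederror'
    , n_estimators=250
    , learning_rate=0.10
    , colsample_bytree=0.70
    , max_depth=3
    , n_jobs=4
    , early_stopping_rounds=5
)
```

```python
# Training and making predictions
# The validation set is monitored to decide when to stop early
xgb_model.fit(
    X_train, y_train
    , eval_set=[(X_valid, y_valid)]
    , verbose=False
)

print('Training Done!')

predictions = xgb_model.predict(X_valid)

print('Predictions Done!')
```

```
Training Done!
Predictions Done!
```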

```python
# Evaluation
rmse = math.sqrt(mean_squared_error(y_valid, predictions))
# R² scores as percentages (computed for reference, not printed below)
train_score = round(xgb_model.score(X_train, y_train) * 100, 2)
valid_score = round(xgb_model.score(X_valid, y_valid) * 100, 2)

print('Root Mean Squared Error (RMSE):', rmse)
```

```
Root Mean Squared Error (RMSE): 7.5749505298437265
```


**OBS.:** *since the goal of this Kernel is to explain what the `XGB Regressor Algorithm` is and how to apply it, we have not done any Data Preprocessing or Transformation, and the features and targets are independent random numbers, so our model's evaluation is quite poor. Do not worry about it 😂*
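
One way to put an RMSE value in context is to compare it with a constant baseline that always predicts the mean of the target. The sketch below assumes targets drawn like the ones in this Kernel (random integers from 0 to 19). It uses NumPy's `default_rng`, so it does not reproduce the exact arrays generated above, and the variable names are hypothetical.

```python
import numpy as np

# Random targets with the same range and size as the training set above
rng = np.random.default_rng(2000)
y = rng.integers(0, 20, size=800)

# Baseline: always predict the mean of the targets
baseline_prediction = np.full(len(y), y.mean())
baseline_rmse = float(np.sqrt(np.mean((y - baseline_prediction) ** 2)))

print('Baseline RMSE (always predicting the mean):', baseline_rmse)
```

For integers drawn uniformly from 0 to 19 the standard deviation is about 5.77, so when the target is pure noise no model can be expected to do much better than that on unseen data.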

----

Thanks so much for today, see ya!! 👋👋

<br/>
<h1 id='reach-me'>📫 Reach Me</h1>
<br/>

> **Email:** **[csfelix08@gmail.com](mailto:csfelix08@gmail.com?)**

> **Linkedin:** **[linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)**

> **Instagram:** **[instagram.com/c0deplus/](https://www.instagram.com/c0deplus/)**

> **Portfolio:** **[CSFelix.io](https://csfelix.github.io/)**

> **Kaggle:** **[DSFelix](https://www.kaggle.com/dsfelix)**