## INTRO

Gradient Boosting is another ensemble technique, but unlike Random Forest, which builds its trees independently, it builds trees sequentially. Each new tree is fit to correct the errors of the trees before it by following the negative gradient of a loss function (hence the name “Gradient”). Combining many weak learners into one strong predictor in this way is what “Boosting” refers to.

Like Random Forest, Gradient Boosting can be used for supervised regression and classification problems. In this project, we will showcase a regression problem.
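
To make the sequential idea concrete, here is a minimal hand-rolled sketch of boosting with squared-error loss, where the negative gradient is simply the residual. It is an illustration only, not how `GradientBoostingRegressor` is implemented internally; the names `n_rounds`, `prediction`, and `trees`, and the choice of `max_depth=3`, are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Same synthetic data as used in this project
X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)

learning_rate = 0.1  # shrinks each tree's contribution
n_rounds = 50        # hypothetical number of boosting rounds for this sketch

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y, y.mean())
initial_mse = mean_squared_error(y, prediction)

trees = []
for _ in range(n_rounds):
    # For squared-error loss, the negative gradient is the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X, residuals)
    # Add a scaled-down correction from the new tree
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = mean_squared_error(y, prediction)
print("Training MSE before boosting:", initial_mse)
print("Training MSE after boosting: ", final_mse)
```

Each round fits a small tree to what the current ensemble still gets wrong, so the training error shrinks step by step.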


### BUSINESS PROBLEM

Given a dataset on cars, predict car prices based on features such as brand, mileage, and year of manufacture.

#### DATA GATHERING

In [1]:
# Import statements

from sklearn.ensemble import GradientBoostingRegressor

# make_regression generates a synthetic regression dataset; here it stands in for a real-world
# car-price dataset (the features are anonymous numeric columns, not actual car attributes).
from sklearn.datasets import make_regression

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error

In [3]:
# Generate synthetic regression dataset
# n_samples=1000 specifies the number of samples (or rows) in the dataset. In this case, the dataset will
# contain 1,000 samples.
# n_features=5 determines the number of independent variables (or features) in the dataset. Here, there are 5
# features (columns in the input X, that is, predictor variables).
# noise=0.1 adds random noise (standard deviation of 0.1) to the target variable (y) to simulate real-world 
# data imperfections.
# random_state=42 sets a seed for the random number generator to ensure reproducibility. When using the same
# random_state, you’ll get the same dataset every time you run the code.

X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)
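
As a quick sanity check on what `make_regression` produces, the sketch below asks for the ground-truth coefficients via `coef=True` and confirms that the target is a linear combination of the features plus noise with a standard deviation of roughly 0.1. The variable names `coef` and `noise_part` are assumptions for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True additionally returns the coefficients of the underlying linear model
X, y, coef = make_regression(
    n_samples=1000, n_features=5, noise=0.1, random_state=42, coef=True
)

print("X shape:", X.shape)   # rows = samples, columns = features
print("y shape:", y.shape)
print("True coefficients:", coef)

# What remains after removing the linear signal is the injected noise
noise_part = y - X @ coef
print("Std of the noise component:", np.std(noise_part))
```

Because the underlying relationship is linear with very little noise, most of the error a tree-based model makes on this data comes from approximating a smooth linear surface with piecewise-constant trees.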

A glimpse of the features (X) and the target (y):

In [9]:
X

array([[ 2.05654356,  0.60685059,  0.48268789, -1.13088844,  0.42009449],
       [-0.79919201, -0.64596418, -0.18289644, -0.48274352,  1.37487642],
       [ 1.07600714, -0.79602586, -0.75196933,  0.02131165, -0.31905394],
       ...,
       [ 0.49968511,  0.2394045 ,  1.48724616,  0.47200227, -0.58005324],
       [-0.64148691,  0.01914778, -0.66198218,  0.48787228,  0.42588721],
       [ 0.91539028, -0.83305606, -1.77624633, -0.54954027, -0.08059975]])

In [10]:
y

array([ 7.49056131e+01, -4.16495880e+01, -2.41828662e+01, -8.70945046e+01,
       -8.61220294e+01, -3.82099532e+01,  5.92690820e+00,  7.13778728e+01,
        5.99126017e+01, -1.50969118e+01, -6.62185812e+00, -7.03069531e+01,
        1.47467968e+01,  5.51303165e+01,  8.61219002e+01,  5.81299895e+01,
       -8.98491200e+01, -4.18401281e+01, -6.93561507e+01, -8.15057182e+01,
       -1.60562797e+02, -4.89033782e+01,  3.35958333e+00,  3.14380808e+01,
        1.30548314e+01,  3.51909271e+01,  6.79636910e+01, -4.97747030e+01,
       -2.90641540e+01, -4.98120494e+01, -9.09613377e+01,  1.13737237e+01,
       -3.48320487e+01,  4.46810892e+01,  1.12303695e+02, -3.39523070e+01,
        1.73901516e+01, -1.26712274e+02, -8.75390090e+01,  1.57514909e+02,
        1.22693052e+02, -3.66901854e+01,  5.03243801e+01, -3.99676527e+01,
       -7.18372950e+01, -1.02186746e+02, -1.12360567e+01,  5.78318563e+01,
        1.11333975e+02,  1.02179801e+02, -1.55316583e+01,  3.43318665e+01,
        5.96308043e+01, -

#### DATA ASSESSMENT/CLEANING

We will skip this stage and concentrate on building the model, since the dataset is synthetic and already clean.

#### EDA

We will skip this stage as well, for the same reason: the dataset is synthetic and already clean.

#### MODEL BUILDING

In [13]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [15]:
# Now, actually building the model and assigning it the name gb_reg, short for Gradient Boosting Regression.
# n_estimators=100 specifies the number of decision trees (or estimators) to be built. The model will build 100
# weak learners (small decision trees) sequentially. Increasing this number can improve accuracy but may also
# increase computation time and risk overfitting.
# learning_rate=0.1 determines the contribution of each tree to the final prediction. A smaller learning rate
# (e.g., 0.1) means each tree’s contribution is scaled down, leading to slower but more robust learning.
# Smaller values often require more trees (n_estimators) to achieve high performance, but they reduce overfitting.
# random_state=42 sets the seed for the random number generator, ensuring the results are reproducible.

gb_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

#### MODEL TRAINING

In [16]:
# We pass the train dataset through the gb_reg model to train it

gb_reg.fit(X_train, y_train)
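
To see how `n_estimators` and `learning_rate` interact, one option is `staged_predict`, which yields the ensemble's predictions after each successive tree. The sketch below repeats the data generation and split so that it runs on its own; `stage_mse` and `best_n_trees` are names chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gb_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_reg.fit(X_train, y_train)

# Test MSE after 1 tree, 2 trees, ..., 100 trees
stage_mse = [
    mean_squared_error(y_test, stage_pred)
    for stage_pred in gb_reg.staged_predict(X_test)
]

best_n_trees = int(np.argmin(stage_mse)) + 1
print("Test MSE after the first tree:", stage_mse[0])
print("Test MSE after all trees:     ", stage_mse[-1])
print("Number of trees with the lowest test MSE:", best_n_trees)
```

If the lowest test error occurs at the final tree, the model is likely still improving and may benefit from more trees or a larger learning rate.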

#### MODEL TESTING

In [18]:
# Now we pass the test dataset (without the labels) for it to predict the labels

y_pred = gb_reg.predict(X_test)

# To see the model's prediction of the car prices
y_pred

array([ -51.87418974,  -75.55244649,   55.85684489,   -1.0929767 ,
         55.17545487,   21.856484  ,  -26.70212004,  -53.33523196,
         52.17255673, -110.82056932,   54.06518379, -124.26298491,
         98.47624438,   22.8122768 ,   11.48476414,   -9.41004559,
         11.86319778,   -6.96226806,   64.5870439 ,  -32.36511709,
         84.67346986,  -15.76247761,  -14.77628682,   15.2637696 ,
        -91.27851721,   72.77900802,   49.89040879,  -44.54897858,
         79.21978519,   32.62953123,   38.31216751,    2.38394416,
        -15.26620962,   30.73844098,  -44.63161898,   51.35784011,
         -5.93946669,   12.89553192,  -46.64760223,  -74.0672976 ,
         23.84864571,  -56.97532354,   59.74096298,  -15.33037577,
         71.34587169,  122.70505143,   54.52943969,  -46.67245065,
        -85.94074303,  -50.59294706,   89.78083657,   63.0674407 ,
          4.97654481,  -52.62811483,   33.32165697,  -70.36214618,
         89.41074733,    5.3149054 ,  -94.58989229,  -21.71911

#### MODEL EVALUATION/SCORING

In [19]:
# Mean squared error: the average of the squared differences between predicted and actual values

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

Mean Squared Error: 145.00529307521242


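To judge whether this value is acceptable, note that the MSE is expressed in squared units of the target, so it cannot be compared with the target directly. Take its square root first (the RMSE, here √145.005 ≈ 12.04), or compute the MAE, and compare that with the mean or range of the target variable (y). A good error is often a small percentage (e.g., <10%) of the mean or range of the target value.

So, we will calculate the mean of the target variable here.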

In [27]:
import numpy as np

# Calculate the mean of the target variable
mean_y = np.mean(y)

# seeing the result
print("Mean of the target variable:", mean_y)

Mean of the target variable: 0.8977208639410601


In [29]:
0.1 * mean_y

0.08977208639410601

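10% of the mean (0.8977) is about 0.09. However, this comparison is not informative here: the target produced by make_regression is centered near zero, so almost any error looks large relative to its mean. Note also that 145 is the MSE, in squared units, not an MAE. The range of the target is a better yardstick.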

In [25]:
# Finding the range of the target variable
range_y = np.max(y) - np.min(y)

# Print the result
print("Range of the target variable:", range_y)

Range of the target variable: 396.2213956626842


In [30]:
0.1 * range_y  

39.62213956626842

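The MSE of 145 is in squared units, so it cannot be set against the range directly. Its square root, the RMSE (≈ 12.04), is about 3% of the range of the target (396.22), well under the 10% threshold.

Thus, by this rule of thumb, the model's predictions are reasonably accurate, although tuning the hyperparameters could reduce the error further.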
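
Because the MSE is in squared units, a like-for-like comparison with the range of the target needs a metric in the target's own units. The sketch below computes the RMSE and MAE alongside the MSE and expresses the RMSE as a fraction of the range. It repeats the earlier steps so that it runs on its own; the 10% threshold is the rule of thumb used in this project, not a universal standard.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gb_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_reg.fit(X_train, y_train)
y_pred = gb_reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                        # back in the units of the target
mae = mean_absolute_error(y_test, y_pred)  # also in the units of the target
r2 = r2_score(y_test, y_pred)              # share of variance explained

range_y = np.max(y) - np.min(y)
rmse_ratio = rmse / range_y

print("MSE: ", mse)
print("RMSE:", rmse)
print("MAE: ", mae)
print("R^2: ", r2)
print("RMSE as a fraction of the target range:", rmse_ratio)
print("Within the 10% rule of thumb:", rmse_ratio < 0.10)
```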