In [None]:
# SPDX-FileCopyrightText: 2023 Machine-Learning-OER-Collection
# SPDX-License-Identifier: CC-BY-4.0

### Example code for a gradient boosting classification

Welcome back! 

One important ensemble method is boosting. Boosting methods build basic estimators sequentially and try to reduce the bias of the combined estimator. By combining several weak models, the desired result is a powerful ensemble. Unlike bagging, boosting is a sequential process. The training of the weak learners is sequential (as opposed to parallel), and each model attempts to correct its predecessor. 

The Model:

Boosting is an ensemble method that fits a sequence of weak learners to repeatedly modified versions of the data. In classic boosting algorithms such as AdaBoost, the predictions of the weak learners are then combined by a weighted majority vote to produce the final prediction. The gradient boosting algorithm is a particular boosting method. It builds the model additively: at each step, a new weak learner is fitted to the negative gradient of the loss function with respect to the current prediction, and its output is added to the ensemble.

<img src="../img/gradient_boosting.svg" alt='Example Gradient Boosting' height="500" width='1000'>

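- Weak learner: Also called base learner. Weak learners, such as shallow trees, are combined. By default, sklearn uses 100 shallow regression trees (`n_estimators=100`, `max_depth=3`) as weak learners. Trees with a depth of one are called decision stumps.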
- Loss function or cost function: The goal is to minimize the loss function, which is a global measure of the loss or error occurring in any available decisions.


Gradient Boosted Trees often use shallow trees with a depth of one to five, which reduces memory requirements and speeds up prediction. However, they are more sensitive to parameter adjustments than random forests. Nevertheless, they can provide improved accuracy if the parameters are chosen optimally.

In addition to the pre-pruning method and the number of trees in the ensemble, the learning rate is another important parameter in gradient boosting. It determines how much each tree tries to correct the mistakes of the previous trees. A higher learning rate allows the trees to make larger corrections and, thus, more complex models. Adding more trees to the ensemble by increasing n_estimators can also lead to increased model complexity, as the model has more opportunities to correct errors in the training data.


See [Gradient Tree Boosting](https://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting) in the scikit-learn User Guide for more information about the algorithm.


Reference:

Example code for a Gradient Boosting Classification model by julia from the repo [machine-learning-OER-Basics](https://github.com/Machine-Learning-OER-Collection/Machine-Learning-OER-Basics) is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).

We'll start with a basic pipeline which includes the following steps:

Steps:

1. Preparation of the data set:

    - Transformation:
        - Encode categorical values to numerical (OneHotEncoding etc.)

<br>

2. Training of the data:
    - Split the data into training and testing sets
    - Instantiate the Classifier specifying hyperparameters
    - Train the Classifier on the training data

<br>

3. Evaluate the performance of the Classifier
    - Classification report


##### Used libraries in the code:
* [pandas](https://pandas.pydata.org/docs/index.html#) for data analysis
* [scikit-learn](https://scikit-learn.org/stable/index.html) for machine learning
* [seaborn](https://seaborn.pydata.org/) for statistical data visualization
* [matplotlib](https://matplotlib.org/) for data visualization
* [RandomOverSampler](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html) for resampling the data

From sklearn:

`from sklearn.preprocessing import OneHotEncoder`

`from sklearn.model_selection import train_test_split`

`from sklearn.ensemble import GradientBoostingClassifier`

`from sklearn.metrics import classification_report`

__________________________________________________



Each tree can only provide good predictions on part of the data. So more and more trees are added to iteratively improve performance. <br>
We build a strong learner from a set of weak learners. <br>
In a binary classification task, a weak learner performs only slightly better than random guessing. The probability of error of a strong learner, in contrast, can be made arbitrarily small. Boosting focuses on reducing bias rather than variance.
 
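In AdaBoost, an earlier boosting method, data points misclassified by the previous base classifier are given more weight when training the next classifier in the sequence. Once all classifiers are trained, their predictions are combined using weighted majority voting. Gradient boosting instead fits each new learner to the negative gradient of the loss and sums the outputs of all learners.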

In a classification task, the sub-estimator is a regressor, not a classifier. This is because the sub-estimators are being trained for the prediction of (negative) gradients, which are always continuous quantities.
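
To make this concrete, here is a minimal hand-written sketch of gradient boosting for binary classification with log loss. It is a simplification and not scikit-learn's implementation (which, among other things, refines the leaf values after fitting each tree). The synthetic data and all names ending in `_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

# Synthetic binary data set (assumption, not the car auction data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_loss(y_true, raw_score):
    # Clip probabilities to avoid log(0)
    p = np.clip(sigmoid(raw_score), 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

learning_rate_demo = 0.1
n_stages_demo = 50

# Start from a constant raw score: the log-odds of the positive class
prior_demo = y_demo.mean()
raw_demo = np.full(len(y_demo), np.log(prior_demo / (1 - prior_demo)))
initial_loss_demo = mean_log_loss(y_demo, raw_demo)

trees_demo = []
for _ in range(n_stages_demo):
    # Negative gradient of the log loss w.r.t. the raw score: y - p (continuous values)
    residual = y_demo - sigmoid(raw_demo)
    # The weak learner is a regressor, even though the task is classification
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residual)
    # Add the shrunken tree output to the current raw score
    raw_demo += learning_rate_demo * tree.predict(X_demo)
    trees_demo.append(tree)

final_loss_demo = mean_log_loss(y_demo, raw_demo)
print(f'Training log loss: {initial_loss_demo:.3f} -> {final_loss_demo:.3f}')
```

Each tree is trained on continuous residuals, which is why a regression tree is used. The training loss shrinks from stage to stage because every tree takes a small step in the direction of the negative gradient.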

**Read in the data set**

We'll use the cleaned data set from the EDA for this tutorial.

Check for missing values: The originally downloaded example data set uses question marks ("?") for missing values. When reading such a file, the `na_values` parameter of `pd.read_csv` converts the question marks to NaN (not a number), which pandas can detect. Because we use the data set from the EDA, the missing values are already handled and the parameter is not needed here.

In [1]:
import pandas as pd

df = pd.read_csv('../../../../kick_after_EDA.csv')

In [2]:
# Display the data frame
pd.set_option('display.max_columns', None) # if not set, displayed columns will be truncated
df.head(2)

Unnamed: 0,IsBadBuy,PurchDate,Auction,VehicleAge,Make,Model,Trim,SubModel,Color,Transmission,WheelType,VehOdo,Nationality,Size,TopThreeAmericanName,CurrentAuctionAveragePrice,CurrentAuctionCleanPrice,CurrentRetailAveragePrice,CurrentRetailCleanPrice,BYRNO,VNST,VehBCost,IsOnlineSale,WarrantyCost
0,0,1260144000,ADESA,3,MAZDA,MAZDA3,i,4D SEDAN I,RED,AUTO,Alloy,89046,OTHER ASIAN,MEDIUM,OTHER,7451.0,8552.0,11597.0,12409.0,21973,FL,7100.0,0,1113
1,0,1260144000,ADESA,5,DODGE,1500 RAM PICKUP 2WD,ST,QUAD CAB 4.7L SLT,WHITE,AUTO,Alloy,93593,AMERICAN,LARGE TRUCK,CHRYSLER,7456.0,9222.0,11374.0,12791.0,19638,FL,7600.0,0,1053


# Basic Pipeline

### First step - Transformation

#### Encode the categorical values | One-Hot-Encoding

As scikit-learn states in the documentation, the tree-based algorithms currently do not support categorical variables. To use both numerical and categorical features for the algorithm, the categorical values must be transformed into numeric values. Depending on the value type, the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn-preprocessing-onehotencoder), [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn-preprocessing-ordinalencoder), or [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) is helpful. NOTE: Use LabelEncoder only for the target variable! The target variable for this data set is numerical, so there is no need for encoding.

First, we import the OneHotEncoder from sklearn.preprocessing. Next, we determine the categorical features and assign them to a new data frame.

In [3]:
from sklearn.preprocessing import OneHotEncoder

# Define the categorical feature
categorical_features = ['Auction', 'Make', 'Model', 'Trim', 'SubModel', 'Color', 'Transmission',
       'WheelType', 'Nationality', 'Size', 'TopThreeAmericanName', 'VNST']

# Write categorical columns to new data frame
categorical_data = df[categorical_features]

Now, we assign the OneHotEncoder to the variable encoder. The `fit()` method of the encoder learns the categories of each categorical feature. Finally, we use the `transform()` method to transform the categorical features into numerical features.

In [4]:
# Assign OneHotEncoder to variable
encoder = OneHotEncoder()

# Encoder learns the categories
encoder.fit(categorical_data)

# Transform categorical data into encoded array
encoded_data = encoder.transform(categorical_data)

Create a data frame with encoded data

The method `toarray()` converts the encoded data into a numpy array. The array is then converted into a data frame with `pd.DataFrame()`. The column names are extracted from the encoder with `get_feature_names_out()`.

Using the `concat()` method, the encoded data is concatenated with the original data set. The original columns are dropped with the `drop()` method.

In [5]:
# Create df
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(categorical_features))

# Concatenate the encoded data frame with the remaining columns of original data frame
df_encoded = pd.concat([df.drop(categorical_features, axis=1), encoded_df], axis=1)
print(f'Amount of features after using OneHotEncoder: {df_encoded.shape[1]}')

Amount of features after using OneHotEncoder: 2144


The OneHotEncoder converts the categorical values to numerical values. It creates a new column for each category and assigns a 1 or 0 to the column. The 1 represents the existence of the category, the 0 represents the non-existence. We have 12 categorical features in our data set, all with many different values, which explains the large number of columns. Dimensionality reduction is not covered in this notebook for now. You can find more information on [methods here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) or create a notebook and contribute to this repository.

**Split the target variable (y) from the features (X)**

The [Gradient Boosting Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn-ensemble-gradientboostingclassifier) takes X and y as input. The target variable (y) is the variable that should be predicted. The features (X) are the variables used to predict the target variable. 

In [6]:
# Split the encoded df into X and y
X = df_encoded.drop('IsBadBuy', axis=1) # Drop the target column from the features
y = df_encoded['IsBadBuy']

### Second step - Training of the data

Split the data into training and testing sets

The [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function splits the data set into training and test sets. The test_size parameter specifies the proportion of the test set. Here, we set it to 0.33, so 33% of the data is used for testing and 67% for training. If you omit the parameter, the default of 0.25 splits the data set into 25% test and 75% training data. The default parameters are fine for a first run, so don't worry.

The random_state parameter ensures that the split is always the same. You can set it to any integer. If you don't set the parameter, the split will differ each time you run the code. We set the parameter to 42.

X_train and y_train are the _training_ sets for the features and the target variable. X_test and y_test are the _test_ sets for the features and the target variable.
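
A small sketch of both parameters on made-up data (the arrays `X_toy` and `y_toy` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 2 features each
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

# Two splits with the same random_state
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42)

print(len(Xa_train), len(Xa_test))        # sizes of the training and test set
print(np.array_equal(Xa_test, Xb_test))   # identical split for the same seed
```

With `test_size=0.25`, 25 of the 100 samples end up in the test set, and repeating the call with the same `random_state` reproduces the split.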

In [7]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [8]:
from collections import Counter

print(f'Training set shape {Counter(y_train)}')
print(f'Testing set shape {Counter(y_test)}')

Training set shape Counter({0: 42568, 1: 5982})
Testing set shape Counter({0: 20988, 1: 2926})


**Train the GradientBoostingClassifier**

Let's go line by line through the code:

We'll import the GradientBoostingClassifier from sklearn.ensemble. 

Instantiate the Classifier specifying parameters: We'll use almost all the default parameters now. You can find more information about the parameters in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn-ensemble-gradientboostingclassifier).

Let's dive into the parameters we set:

* The `loss` parameter is set to the default value log_loss. This is called [logistic loss or cross-entropy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html). It quantifies the difference between the predicted probabilities (y_pred) and the true labels (y_true) for a given sample. 

* The `learning_rate` parameter is set to the default of 0.1. The learning rate shrinks the contribution of each tree by the factor learning_rate. There is a trade-off between learning_rate and n_estimators. A higher learning rate means that each individual tree can make stronger corrections. That allows for more complex models.

* `n_estimators` is set to the default of 100. This parameter determines the number of boosting stages to perform. Gradient boosting is quite robust to overfitting. Therefore, a higher number typically results in better performance. Values must be integers of at least 1.

* For the parameter `subsample`, we use the default of 1.0. This parameter determines the fraction of samples to be used for fitting the individual base learners. If it is less than 1.0, this results in Stochastic Gradient Boosting. This parameter interacts with the n_estimators parameter. Choosing subsample < 1.0 will reduce the variance and increase the bias. Values must be greater than 0.0 and at most 1.0.

* We use the default for `criterion`, which is 'friedman_mse'. This function measures the quality of a split: 'friedman_mse' is the mean squared error with an improvement score by [Friedman](https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation-A-gradient-boosting-machine/10.1214/aos/1013203451.full). The default is generally the best, as it may give a better approximation in some cases.

* For the `min_samples_split`, we use the default of 2. This determines the minimum number of samples necessary to split an internal node.

* The `min_samples_leaf` parameter, here 1, determines the minimum number of training samples that must be present in each leaf node of the decision tree. If a split point is considered at any depth, it is only allowed if the left and right branches resulting from that split have at least 'min_samples_leaf' training samples each.

* The `min_weight_fraction_leaf` parameter is the minimum weighted fraction of the total sum of weights (of all input samples) required to be at a leaf node. We use the default of 0.0. The samples have the same weight if sample_weight is not specified.

* The `max_depth` is the depth of each regression estimator, here 3. The maximum depth limits the number of nodes in the tree. This parameter can be adjusted for best performance. The best value depends on the interaction of the input variables. If it is set to None, the nodes will be expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

* The `min_impurity_decrease` parameter is set to the default value of 0.0. A node will be split if the split causes a decrease in impurity greater than or equal to this value.

* The `init` parameter, here None, specifies an estimator used to generate initial predictions during the boosting process. It must implement both the _fit_ and _predict_proba_ methods. By default, the boosting algorithm uses a DummyEstimator to predict class priors, and setting it to 'zero' will initialize raw predictions to zero.

* The `random_state` parameter controls the random seed used for the tree estimator at each boosting iteration and for random feature permutation at each split. Additionally, it influences the random splitting of the training data to create a validation set when 'n_iter_no_change' is specified. We set the parameter to integer 42.

* The `max_features` parameter determines the number of features considered for finding the best split. We use the default value None, using all features (n_features). Using a smaller max_features than n_features value reduces variance and increases bias. The algorithm continues searching for a split until at least one valid partition of the node samples is found, even if it needs to inspect more than max_features features.

* We set `verbose` to 0, so no messages are printed. Setting it to >= 1 will print messages for progress and performance.

* `max_leaf_nodes` is set to the default value of None. This parameter specifies the maximum number of leaf nodes. The default value of None means that the number of leaf nodes is unlimited.

* The parameter `warm_start` is set to False. The parameter controls whether the solution from the previous fit call should be reused, adding more estimators to the existing ensemble.

* The `validation_fraction` parameter, with a default value of 0.1, represents the proportion of training data reserved as the validation set for early stopping. It is only utilized when the 'n_iter_no_change' parameter is set to an integer and should be within the range (0.0, 1.0).

* The `n_iter_no_change` parameter determines whether early stopping is used during training if the validation score does not improve. If set to a number, a portion of the training data is reserved for validation, and training is stopped if the validation score does not improve for the last 'n_iter_no_change' iterations. The split into training and validation data is stratified. By default, it is set to None, which disables early stopping.

* The `tol` parameter represents the tolerance for early stopping. If 'n_iter_no_change' is set to a number, training will stop when the loss does not improve by at least 'tol' for 'n_iter_no_change' iterations consecutively. The default value _1e-4_ means that training stops if the loss does not improve at least 0.0001 in the specified number of iterations.

* For `ccp_alpha`, we leave the default value of 0.0. The ccp_alpha parameter, also known as the complexity parameter, regulates the pruning of subtrees in decision trees.
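
The three early-stopping parameters `n_iter_no_change`, `validation_fraction`, and `tol` work together. The following sketch shows one possible setting on synthetic data. The data, the chosen values, and the names `X_es`, `y_es`, and `es_model` are assumptions for illustration and are not used in the rest of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data set (assumption, not the car auction data)
X_es, y_es = make_classification(n_samples=1000, n_features=10, random_state=42)

es_model = GradientBoostingClassifier(
    n_estimators=300,         # upper limit for the number of boosting stages
    learning_rate=0.1,
    n_iter_no_change=5,       # stop after 5 stages without sufficient improvement
    validation_fraction=0.1,  # 10% of the training data is held out for validation
    tol=1e-4,                 # minimum improvement that counts as progress
    random_state=42,
)
es_model.fit(X_es, y_es)

# n_estimators_ holds the number of stages selected by early stopping
print(f'Boosting stages actually fitted: {es_model.n_estimators_}')
```

If early stopping triggers, `n_estimators_` is smaller than the `n_estimators` limit; otherwise both values are equal.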


As mentioned, if you use the default parameters, you don't need to specify them in the GradientBoostingClassifier.

In [9]:
# Import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier


# Instantiate GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=42)

And now, we train the model. The `fit()` method trains the model on the training data X_train and y_train.

In [10]:
# Fit model to training set
model.fit(X_train, y_train)

The method `predict()` predicts the target variable for the test data X_test. The predicted values are assigned to the variable `y_pred`.

In [11]:
# Predict test set labels
y_pred = model.predict(X_test)

### Third step - Evaluation of the performance

Let's compute the `classification_report()` using the true labels (y_test) and the predicted labels (y_pred). 

The precision, recall and f1_score are explained [here](/supervised_learning/classification/k_nearest_neighbors/text/f1_score.md).

In [12]:
# Evaluate the model
from sklearn.metrics import classification_report

# Predictions on the test data by using the trained model
y_pred = model.predict(X_test)

target_names=['Class 0', 'Class 1']
print(classification_report(y_test, y_pred, target_names=target_names, digits=5))

              precision    recall  f1-score   support

     Class 0    0.87834   0.99933   0.93494     20988
     Class 1    0.60000   0.00718   0.01418      2926

    accuracy                        0.87794     23914
   macro avg    0.73917   0.50325   0.47456     23914
weighted avg    0.84429   0.87794   0.82228     23914



This concludes the basic pipeline for a Gradient Boosting Classifier. For Class 0, the F1-score is 0.93, which is high and indicates a good balance between precision and recall. However, for Class 1, the F1-score is only 0.01, showing poor performance due to a recall of 0.01. Let's see if we can improve the performance by balancing the data set.
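
The F1-score is the harmonic mean of precision and recall. As a quick check, the following sketch recomputes it from the precision and recall values printed in the report above (the helper `f1_from` is a hypothetical name introduced here):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the classification report above
f1_class_0 = f1_from(0.87834, 0.99933)
f1_class_1 = f1_from(0.60000, 0.00718)

print(f'Class 0: {f1_class_0:.5f}, Class 1: {f1_class_1:.5f}')
```

The results agree with the report up to rounding. Because the harmonic mean is dominated by the smaller value, the very low recall for Class 1 pulls its F1-score close to zero despite a precision of 0.60.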

-----------

#### Imbalanced data set

The manipulation of [imbalanced data](/poc/supervised_learning/decision_tree/explainer/data_sets.md) sets has an impact on the performance of the model. This tutorial uses the [RandomOverSampler](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html) method to balance the data set. The method randomly selects samples from the minority class with replacement until their number matches the number of samples of the majority class. For different methods, see the [imbalanced-learn](https://imbalanced-learn.org/stable/references/index.html) documentation.

**RandomOverSampler**

The RandomOverSampler technique balances the data by randomly picking samples from the minority class with replacement and adds them to the training data set. The test data set remains unchanged.
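
The idea can be sketched in a few lines of NumPy. This is only an illustration of random oversampling on made-up data, not the implementation used by imbalanced-learn; all variable names are assumptions.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)

# Hypothetical imbalanced toy data: 8 samples of class 0, 2 samples of class 1
X_small = np.arange(20).reshape(10, 2)
y_small = np.array([0] * 8 + [1] * 2)

# Indices of the minority class and number of samples missing for balance
minority_idx = np.flatnonzero(y_small == 1)
n_missing = int((y_small == 0).sum() - len(minority_idx))

# Draw minority samples with replacement and append them to the data
extra_idx = rng.choice(minority_idx, size=n_missing, replace=True)
X_balanced = np.vstack([X_small, X_small[extra_idx]])
y_balanced = np.concatenate([y_small, y_small[extra_idx]])

print(Counter(y_balanced.tolist()))
```

The added rows are exact copies of existing minority samples, so no new information is created; the classifier simply sees the minority class more often.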


Let's apply the oversampling method to our data set. NOTE: The oversampling method should only be applied to the training data set. So we need to split the data set into training and test sets first. We use the same parameters as in our basic model.

We name the split data set `X_train_ros, X_test_ros, y_train_ros, y_test_ros` for better distinction.

In [13]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train_ros, X_test_ros, y_train_ros, y_test_ros = train_test_split(X, y, test_size=0.33, random_state=42)

Consider the following parameters:
* `sampling_strategy='auto'` specifies that all classes but the majority class are resampled. For over-samplers it is equivalent to 'not majority'. As we have only two classes here, only the minority class is resampled. The number of samples in the minority class will be equal to the number of samples in the majority class.
* The parameter `random_state=42` ensures that the results are reproducible.

We import the `RandomOverSampler` class from the `imblearn.over_sampling` module and assign an instance to the variable ros. 

In [14]:
from imblearn.over_sampling import RandomOverSampler
# Instantiate RandomOverSampler
ros = RandomOverSampler(random_state=42, sampling_strategy='auto')

Now we call the method `fit_resample()` on the training data set. The method returns the resampled data set and the resampled labels. We assign the resampled data set to the variable X_train_resampled and the resampled labels to the variable y_train_resampled.

In [15]:
# Resample the training set only (the test set stays untouched)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train_ros, y_train_ros)

from collections import Counter
# Class counts before and after oversampling
print(f'Original data set shape {Counter(y_train_ros)}')
print(f'Resampled data set shape {Counter(y_train_resampled)}')

Original data set shape Counter({0: 42568, 1: 5982})
Resampled data set shape Counter({0: 42568, 1: 42568})


Classes 0 and 1 are now equally represented in the training set.

Let's train the model again with the oversampled data set and the same parameter settings. We name it ros_model.

In [16]:
# Import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Instantiate GradientBoostingClassifier
ros_model = GradientBoostingClassifier(random_state=42)

# Fit model to training set
ros_model.fit(X_train_resampled, y_train_resampled)

# Predict test set labels
y_pred_ros = ros_model.predict(X_test_ros)

#### Fourth step - Evaluation of the performance


In [17]:
# Evaluate the model
from sklearn.metrics import classification_report

# Predictions on the test data by using the trained model
y_pred_ros = ros_model.predict(X_test_ros) # create variable predicted labels (y_pred_ros)

target_names=['Class 0', 'Class 1']
print(classification_report(y_test_ros, y_pred_ros, target_names=target_names)) # Taking into account the true labels (y_test_ros) and the predicted labels (y_pred_ros)

              precision    recall  f1-score   support

     Class 0       0.93      0.63      0.75     20988
     Class 1       0.20      0.68      0.31      2926

    accuracy                           0.63     23914
   macro avg       0.57      0.65      0.53     23914
weighted avg       0.84      0.63      0.70     23914



Compared to the basic pipeline, the recall for the minority class has increased from 0.01 to 0.68. That is a good score considering that all we did was balance the data set.

The precision for class 0 is quite good. When the model predicts class 0, it is correct 93% of the time. However, the recall (0.63) is relatively low, indicating that it doesn't capture all class 0 instances, missing 37% of them.

The precision for class 1 is very low at 0.20, indicating that the model is correct only 20% of the time when predicting class 1. The low precision for class 1 could be the reason for the low macro-averaged F1-score of 0.53.

Note that applying a boosting method requires a large data set so that each learner has enough samples to learn from.

That concludes this notebook. The methods you can use for improvement are various and cover dimensionality reduction, feature selection, feature engineering, and many more. Again, always remember that a slight variation in the data can lead to a completely different model.

Let's return to our business case. The objective of the task is to predict whether the vehicle purchased at auction is a bad buy.
If we compare the three models, we can see this is a challenging task. Nevertheless, we get results that provide a basis for further methods. Extracting the first results from the available data is possible with only a few lines of code. But once these lines of code are written, the iterative process of improving the model begins. There should always be close collaboration with the business side/stakeholders, as the data or the target may change. Accordingly, information must be obtained and updated on an ongoing basis. 

The next step would be to look at the generalizability of each model with [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance) and to try to improve the performance of the model by examining the [most important features](https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-feature-importance) and tuning the [parameters](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn-model-selection-gridsearchcv). Gradient Boosting shows good results when the parameters are tuned carefully.
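
As a starting point, cross-validation could look like the following sketch. It runs on synthetic, imbalanced data rather than on the car auction data; the data, the reduced `n_estimators=50`, the F1 scoring, and the names `X_cv`, `y_cv`, and `cv_model` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data set (assumption): roughly 85% class 0, 15% class 1
X_cv, y_cv = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=42)

cv_model = GradientBoostingClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation, scored with the F1-score of class 1
cv_scores = cross_val_score(cv_model, X_cv, y_cv, cv=5, scoring='f1')

print(f'F1 per fold: {cv_scores.round(3)}')
print(f'Mean F1: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})')
```

The spread of the scores across the folds gives a first impression of how stable the model is on unseen data.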