# Gradient Boosting (Regression)
This notebook implements a Gradient Boosting model, built on decision trees, for predicting the time of the next event. It uses scikit-learn's `GradientBoostingRegressor`, which reportedly worked well for other groups, who got an RMSE of < 3 hours with it.

In [39]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, RepeatedKFold


In [40]:
# Config variables
training_data_path = '../../datasets/bpi_2012_train_eng.csv'
testing_data_path = '../../datasets/bpi_2012_test_eng.csv'

n_samples = -1  # not referenced anywhere else in this notebook

params = {
    "n_estimators": 1000,
    "max_depth": 4,
    "min_samples_split": 5,
    "learning_rate": 0.01,
    "loss": "squared_error",
}
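
With a learning rate as low as 0.01, whether 1000 trees are enough (or more than needed) depends on the data. One way to check is `staged_predict`, which yields the predictions after each additional tree. The sketch below is only an illustration: it uses synthetic data from `make_regression` and a smaller, hypothetical `demo_params` so that it runs quickly, and none of these names come from the notebook itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with the real features and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Hypothetical, smaller variant of the notebook's params; loss is left at its default.
demo_params = {
    "n_estimators": 300,
    "max_depth": 4,
    "min_samples_split": 5,
    "learning_rate": 0.01,
}
model = GradientBoostingRegressor(**demo_params, random_state=0).fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n_estimators trees.
val_rmse = [
    np.sqrt(mean_squared_error(y_val, pred)) for pred in model.staged_predict(X_val)
]

best_n = int(np.argmin(val_rmse)) + 1
print(f"Lowest validation RMSE {min(val_rmse):.2f} at {best_n} trees")
```

If the lowest validation RMSE sits at the very last tree, the model has probably not converged yet and more trees or a higher learning rate may help.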

# 1. Loading and preparing the data

In [41]:
# Loading and splitting the datasets
df_train = pd.read_csv(training_data_path)
df_train = df_train.set_index('event_index').drop('Unnamed: 0', axis=1)
df_train = df_train.dropna()

df_test = pd.read_csv(testing_data_path)
df_test = df_test.set_index('event_index').drop('Unnamed: 0', axis=1)
df_test = df_test.dropna()


In [42]:
# Make dummy variables from the event type for modelling purposes.
# Note: join() only applies lsuffix/rsuffix to column names present in both frames.
event_dummies_train = pd.get_dummies(df_train['event'])
next_event_dummies_train = pd.get_dummies(df_train['nextEvent'])
event_dummies_train = event_dummies_train.join(next_event_dummies_train, lsuffix="_e", rsuffix="_ne")

event_dummies_test = pd.get_dummies(df_test['event'])
next_event_dummies_test = pd.get_dummies(df_test['nextEvent'])
event_dummies_test = event_dummies_test.join(next_event_dummies_test, lsuffix="_e", rsuffix="_ne")

# Put the dummy variables back in the dataframe as columns
df_train_dummies = pd.concat([df_train, event_dummies_train], axis=1)
df_test_dummies = pd.concat([df_test, event_dummies_test], axis=1)


# Keep only rows whose next event name starts with 'O' (WIP, only a test)
df_train_dummies = df_train_dummies[df_train_dummies['nextEvent'].str[0] == 'O']
df_test_dummies = df_test_dummies[df_test_dummies['nextEvent'].str[0] == 'O']


# Drop the raw categorical columns and the 'W_Valideren aanvraag' dummies
df_train_dummies = df_train_dummies.drop(columns=['event', 'nextEvent', 'W_Valideren aanvraag_e', 'W_Valideren aanvraag_ne'])
df_test_dummies = df_test_dummies.drop(columns=['event', 'nextEvent', 'W_Valideren aanvraag_e', 'W_Valideren aanvraag_ne'])

# Drop the features the model doesn't use
X_train = df_train_dummies.drop(columns=['startTime', 'completeTime', 'REG_DATE', 'case', 'AMOUNT_REQ', 'org:resource', 'nextEventTime', 'nextEventTimeRel'])
X_test = df_test_dummies.drop(columns=['startTime', 'completeTime', 'REG_DATE', 'case', 'AMOUNT_REQ', 'org:resource', 'nextEventTime', 'nextEventTimeRel'])

# Construct the target values from the training and testing sets, after the row filtering above
Y_train = df_train_dummies['nextEventTimeRel']
Y_test = df_test_dummies['nextEventTimeRel']

X_train.head()


event_index,startTimeRel,indexInCase,dayOfWeek,dayOfMonth,A_ACCEPTED_e,A_ACTIVATED_e,A_APPROVED_e,A_CANCELLED_e,A_DECLINED_e,A_FINALIZED_e,...,O_CREATED_ne,O_DECLINED_ne,O_SELECTED_ne,O_SENT_ne,O_SENT_BACK_ne,W_Afhandelen leads_ne,W_Beoordelen fraude_ne,W_Completeren aanvraag_ne,W_Nabellen incomplete dossiers_ne,W_Nabellen offertes_ne
20727,7330,5,1,18,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
26832,71541,6,0,24,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
38154,184811,13,4,4,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
83625,1096843,11,0,19,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
87976,142592,7,0,12,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
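
Because the dummy variables are built separately for the training and testing sets, the two frames only share the same columns if every event type occurs in both splits. A minimal sketch of one way to guard against a mismatch is to reindex the test encoding onto the training columns. The toy frames and variable names below are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the 'event' column of df_train and df_test (assumption).
toy_train = pd.DataFrame({"event": ["A_ACCEPTED", "O_SENT", "O_CREATED"]})
toy_test = pd.DataFrame({"event": ["O_SENT", "O_SENT"]})

# Encoding each split separately yields different column sets.
dummies_train = pd.get_dummies(toy_train["event"], dtype=int)
dummies_test = pd.get_dummies(toy_test["event"], dtype=int)

# Reindex the test encoding onto the training columns: categories missing
# from the test split become all-zero columns, and categories unseen in
# training are dropped.
dummies_test_aligned = dummies_test.reindex(columns=dummies_train.columns, fill_value=0)

print(list(dummies_train.columns))
print(list(dummies_test_aligned.columns))
```

In the notebook, the same idea would amount to reindexing `X_test` onto `X_train.columns` before calling `predict`.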


# 2. Training the model and making the predictions

In [43]:
from sklearn.metrics import mean_squared_error
import time
start_time = time.time()

# Creating and fitting the model
xgbr = GradientBoostingRegressor(**params)
xgbr.fit(X_train, Y_train)

# Predicting the values of our test dataset
xgbr_pred = xgbr.predict(X_test)
rmse = np.sqrt(mean_squared_error(Y_test, xgbr_pred))

# Reporting the test-set RMSE (in the units of nextEventTimeRel)
print(f'RMSE score: {rmse}')

# Ending time
end_time = time.time()
print(f'\r\nThe execution of Gradient Boosting (Regression) took {round(end_time - start_time)} seconds')

RMSE score: 440346.56170794996

The execution of Gradient Boosting (Regression) took 43 seconds
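
The RMSE above is expressed in the units of `nextEventTimeRel`, which makes it hard to compare with the 3-hour figure mentioned in the introduction. The sketch below converts it under the assumption, not confirmed anywhere in this notebook, that the target is measured in seconds.

```python
# RMSE printed by the cell above, in the units of nextEventTimeRel.
rmse_raw = 440346.56170794996

# Assumption: nextEventTimeRel is measured in seconds.
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24

rmse_hours = rmse_raw / SECONDS_PER_HOUR
rmse_days = rmse_hours / HOURS_PER_DAY

print(f"RMSE: {rmse_hours:.1f} hours ({rmse_days:.1f} days)")
```

If the assumption holds, the error is roughly 122 hours, or about 5 days, which is far above the < 3 hours reported by other groups.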


# 3. Applying to Housing dataset

Since the RMSE on the BPI dataset is still quite bad, we need to check whether the model has been set up incorrectly or whether the problem lies in the feature selection/engineering. To that end, the next cells apply the same Gradient Boosting regressor, with the same hyperparameters, to a different dataset, the Boston housing dataset, and check its performance there.

In [44]:
# Load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df_housing = pd.read_csv(url, header=None)

# Summarize shape
display(df_housing.shape)

# Summarize first few lines
display(df_housing.head())

(506, 14)

,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [45]:
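# Split data into input and output columns (the last column is the target)
from sklearn.model_selection import train_test_split
X, y = df_housing.iloc[:, :-1], df_housing.iloc[:, -1]

# Hold out 10% for testing; note that this overwrites the BPI X_train and X_test from above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=13)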

In [46]:
from sklearn.metrics import mean_squared_error
import time
start_time = time.time()

# Creating and fitting the model
xgbr_housing = GradientBoostingRegressor(**params)
xgbr_housing.fit(X_train, y_train)

# Predicting the values of our test dataset
xgbr_housing_pred = xgbr_housing.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, xgbr_housing_pred))

# Reporting the test-set RMSE
print(f'RMSE score: {rmse}')

# Ending time
end_time = time.time()
print(f'\r\nThe execution of Gradient Boosting (Regression) for Housing took {round(end_time - start_time)} seconds')

RMSE score: 3.3251748416794116

The execution of Gradient Boosting (Regression) for Housing took 3 seconds


The resulting RMSE of about 3.3 on the Housing dataset is reasonable. The RMSE of the naive baseline for this set comes in at around 6.6, and other sources ([Kaggle](https://www.kaggle.com/code/tolgahancepel/boston-housing-regression-analysis/notebook)) show Gradient Boosting performing close to other machine learning models such as Random Forest, Linear Regression and Support Vector Regression.
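
A baseline comparison of this kind can be sketched with scikit-learn's `DummyRegressor`. The example assumes that "naive" means always predicting the mean of the training targets, which the notebook does not specify. It runs on synthetic data from `make_regression` so that it is self-contained, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); swap in the housing X and y from the cells above.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=13)

# Naive baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))

# Smaller model than the notebook's params, to keep the demo quick.
gbr = GradientBoostingRegressor(
    n_estimators=200, max_depth=3, learning_rate=0.05, random_state=13
)
gbr.fit(X_tr, y_tr)
gbr_rmse = np.sqrt(mean_squared_error(y_te, gbr.predict(X_te)))

print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"Gradient Boosting RMSE: {gbr_rmse:.2f}")
```

Running the same comparison on the BPI data would show whether the model there beats the mean predictor at all, which helps put the raw RMSE into context.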

We can thus conclude that the issue most likely lies not with the model itself, but with the feature selection/engineering on the BPI dataset.
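
One possible starting point for investigating the features is the fitted model's `feature_importances_` attribute. The sketch below uses synthetic data with hypothetical column names (`feature_0` to `feature_5`), of which only two are informative by construction. It illustrates the technique and is not an analysis of the BPI features.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (assumption): 6 features, only 2 of them informative.
X_arr, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=2, noise=5.0, random_state=13
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

model = GradientBoostingRegressor(
    n_estimators=200, max_depth=3, learning_rate=0.05, random_state=13
).fit(X_demo, y_demo)

# feature_importances_ is impurity-based and normalised to sum to 1.
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Applied to the BPI model (`xgbr` with the columns of the BPI `X_train`), a ranking like this would show which engineered features the trees actually rely on.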