# Decision Tree Regressor

In this notebook, a Decision Tree Regressor model is used to predict the duration of a disruption. It uses the feature variables that were selected and prepared in "DataPrep.ipynb".

In [None]:
# import the libraries used
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from baseline import calculate_baseline
import matplotlib.pyplot as plt
import pandas as pd
import pickle
from math import sqrt
from tqdm import tqdm
from sklearn.tree import plot_tree
import numpy as np
from scipy.stats import norm
import seaborn as sns

In [None]:
# load the data (already prepared in another file)
model_df = pd.read_pickle('data/model_df.pkl')
model_df.sample(5)

A Decision Tree Regressor is trained on the features prepared earlier. First, a train-test split is made so that the model is trained and tested on separate data. Then models are trained with different values of max_depth. The results are plotted, and the plots show which depth works well for the model.

In [None]:
X = model_df.drop('anm_tot_fh', axis=1)
y = model_df['anm_tot_fh']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
depths = range(1, 16) 

rmse = []
r2 = []

# Train a DTR model for each max_depth
for depth in tqdm(depths):
    regressor = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=0.05, criterion='squared_error', random_state=42)
    regressor.fit(X_train, y_train)
    predictions = regressor.predict(X_test)
    rmse.append(sqrt(mean_squared_error(y_test, predictions)))
    r2.append(r2_score(y_test, predictions))

Below we inspect which depth suits this model. We choose a max_depth of 10, because the curve flattens out there. With a higher max_depth the model would quickly start to overfit.

In [None]:
# Two plots side by side, first one showing RMSE and second one showing R2 score
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

ax1.plot(depths, rmse, marker='o', linestyle='-', color='b')
ax1.set_title('Depth vs. RMSE for Decision Tree Regressor')
ax1.set_xlabel('Max Depth')
ax1.set_ylabel('RMSE')
ax1.set_xticks(depths)
ax1.grid(True)

ax2.plot(depths, r2, marker='o', linestyle='-', color='b')
ax2.set_title('Depth vs. R2 for Decision Tree Regressor')
ax2.set_xlabel('Max Depth')
ax2.set_ylabel('R2')
ax2.set_xticks(depths)
ax2.grid(True)

plt.show()
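The overfitting concern mentioned above can be checked by comparing the training RMSE with the test RMSE at each depth: a gap that widens as depth grows suggests the tree is memorising the training data. The following is a minimal sketch on synthetic stand-in data (an assumption, not the notebook's dataset); all `demo_*` names are hypothetical, and `min_samples_leaf` is left at its default so the effect is easy to see.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): a noisy sine curve
rng = np.random.default_rng(42)
demo_X = rng.uniform(0, 10, size=(500, 1))
demo_y = np.sin(demo_X[:, 0]) + rng.normal(0, 0.5, size=500)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

demo_depths = range(1, 16)
demo_train_rmse, demo_test_rmse = [], []
for depth in demo_depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(demo_X_train, demo_y_train)
    # RMSE on the data the tree has seen versus data it has not seen
    demo_train_rmse.append(np.sqrt(mean_squared_error(demo_y_train, tree.predict(demo_X_train))))
    demo_test_rmse.append(np.sqrt(mean_squared_error(demo_y_test, tree.predict(demo_X_test))))

# A growing gap between test and training RMSE is a sign of overfitting
demo_gap = [test - train for train, test in zip(demo_train_rmse, demo_test_rmse)]
for depth, gap in zip(demo_depths, demo_gap):
    print(f"max_depth={depth:2d}  test RMSE - train RMSE = {gap:.3f}")
```

In the notebook itself, `min_samples_leaf=0.05` forces every leaf to hold at least 5% of the training samples, which caps the tree at 20 leaves and may explain why the curves flatten at larger depths.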

Train the model with the chosen max_depth. The model is pickled and saved so that it can be used in the GUI. After that, the model is compared with the baseline.

In [None]:
max_depth = 4
regressor = DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=0.05, criterion='squared_error', random_state=42)

regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

# Use separate names so the per-depth `rmse` and `r2` lists from the depth loop are not overwritten
test_rmse = sqrt(mean_squared_error(y_test, y_pred))
test_r2 = r2_score(y_test, y_pred)

# Pickle the fitted regressor; the file is created if it does not exist yet
with open('models/DecisionTreeRegressor.pkl', 'wb') as file:
    pickle.dump(regressor, file)

print("Root Mean Squared Error: ", test_rmse)
print("R-squared (R2) Score: ", test_r2)

baseline_rmse, baseline_r2 = calculate_baseline(model_df)
print('Baseline RMSE: ', baseline_rmse)
print('Baseline R2: ', baseline_r2)
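The implementation of `calculate_baseline` lives in the `baseline` module and is not shown in this notebook. For illustration only, a common baseline for regression is to always predict the mean of the training targets. The sketch below does this with scikit-learn's `DummyRegressor`; the helper name `mean_baseline_scores` is hypothetical and may differ from what `calculate_baseline` actually does.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score


def mean_baseline_scores(y_train, y_test):
    """Return (RMSE, R2) of always predicting the training mean (hypothetical helper)."""
    dummy = DummyRegressor(strategy="mean")
    # DummyRegressor ignores the features, so a placeholder column is enough
    dummy.fit(np.zeros((len(y_train), 1)), y_train)
    pred = dummy.predict(np.zeros((len(y_test), 1)))
    return np.sqrt(mean_squared_error(y_test, pred)), r2_score(y_test, pred)


# Toy values (assumption): the training mean is 20, so both test errors are 5
demo_baseline_rmse, demo_baseline_r2 = mean_baseline_scores(
    np.array([10.0, 20.0, 30.0]), np.array([15.0, 25.0])
)
print(demo_baseline_rmse, demo_baseline_r2)
```

A mean predictor scores an R2 close to 0 by construction, which makes it a convenient reference point for judging how much a fitted model adds.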

## Conclusie

The model is only marginally better than the baseline: the RMSE is very slightly lower and the R2 score is very slightly higher.

# Decision Tree Regressor Probability Notes

Estimate a probability for a DTR prediction, using the mean squared error within each leaf (its square root serving as the standard deviation).

In [None]:
# Calculate rmse for every leaf in the tree
leaf_nodes = [i for i in range(regressor.tree_.node_count) if regressor.tree_.children_left[i] == regressor.tree_.children_right[i]]

rmse_per_leaf = {}

for idx in leaf_nodes:
    samples_in_node = regressor.tree_.n_node_samples[idx]
    if samples_in_node > 0:
        node_rmse = sqrt(regressor.tree_.impurity[idx] * samples_in_node / (samples_in_node + 1)) 
        rmse_per_leaf[idx] = node_rmse
        print("Leaf Node {} has RMSE {}".format(idx, node_rmse))
        
# Get the model's prediction per leaf (the mean training target stored in the leaf)
pred_per_leaf = {idx: regressor.tree_.value[idx][0][0] for idx in rmse_per_leaf}
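One way to turn a leaf's mean and spread into a probability is to assume that the durations within a leaf are roughly normally distributed. That is an assumption, and durations are often right-skewed, so treat the result as a rough estimate. The sketch below uses synthetic data and a hypothetical helper `prob_exceeds`; it takes the leaf impurity (the MSE of the training targets in that leaf) as the variance.

```python
import numpy as np
from scipy.stats import norm
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), not the notebook's dataset
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(400, 1))
demo_y = 30 + 10 * demo_X[:, 0] + rng.normal(0, 15, size=400)

demo_tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=0.05, random_state=42)
demo_tree.fit(demo_X, demo_y)


def prob_exceeds(tree, x_row, threshold):
    """P(target > threshold) for the leaf x_row lands in, under a normal approximation."""
    leaf_id = tree.apply(x_row)[0]                   # index of the leaf for this sample
    mean = tree.tree_.value[leaf_id][0][0]           # leaf prediction
    std = np.sqrt(tree.tree_.impurity[leaf_id])      # sqrt of the leaf MSE
    return norm.sf(threshold, loc=mean, scale=std)   # survival function = 1 - CDF


x_new = np.array([[5.0]])
print("P(duration > 100):", prob_exceeds(demo_tree, x_new, 100.0))
```

By symmetry of the normal distribution, the probability of exceeding the leaf's own prediction is exactly 0.5 under this approximation, which is one reason the empirical per-leaf distribution can be more informative for skewed data.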

In [None]:
# plt.figure(figsize=(20, 12)) 
# plot_tree(regressor, filled=True, proportion=True, impurity=False, precision=2, feature_names=X.columns, node_ids=True)

In [None]:
leaf_indices = regressor.apply(X)

samples_in_leaves = {}

# Iterate over unique leaf indices
unique_leaf_indices = np.unique(leaf_indices)
for leaf_index in unique_leaf_indices:
    # Select the target values (y) that belong to the current leaf
    samples_in_leaves[leaf_index] = y[leaf_indices == leaf_index].tolist()


In [None]:
leaf = 4
durations = np.array(samples_in_leaves[leaf])
mean_prediction = pred_per_leaf[leaf]

percentile_95 = np.percentile(durations, 95)
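The cell above hard-codes a single leaf. The same idea can be applied to any new sample by first looking up the leaf it falls into and then taking a percentile of the historical targets in that leaf. Below is a minimal sketch on synthetic right-skewed data (an assumption); the helper `leaf_percentile` is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic right-skewed targets (assumption) to mimic durations
rng = np.random.default_rng(1)
demo_X = rng.uniform(0, 10, size=(400, 1))
demo_y = rng.exponential(scale=20 + 5 * demo_X[:, 0])

demo_tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=0.05, random_state=42)
demo_tree.fit(demo_X, demo_y)

# Leaf index of every historical sample, computed once
demo_leaf_ids = demo_tree.apply(demo_X)


def leaf_percentile(tree, x_row, y_hist, leaf_ids, q):
    """q-th percentile of the historical targets that share a leaf with x_row."""
    leaf_id = tree.apply(x_row)[0]
    return np.percentile(y_hist[leaf_ids == leaf_id], q)


x_new = np.array([[7.5]])
demo_p50 = leaf_percentile(demo_tree, x_new, demo_y, demo_leaf_ids, 50)
demo_p95 = leaf_percentile(demo_tree, x_new, demo_y, demo_leaf_ids, 95)
print(f"median: {demo_p50:.1f}, 95th percentile: {demo_p95:.1f}")
```

Because it relies only on the observed values, this empirical approach makes no assumption about the shape of the distribution within a leaf.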

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

values_below_mean_prediction = durations[durations < mean_prediction]
values_above_mean_prediction = durations[durations >= mean_prediction]
percentage_below_mean = len(values_below_mean_prediction) / len(durations) * 100
percentage_above_mean = len(values_above_mean_prediction) / len(durations) * 100

# Histogram of all durations in the leaf; the bars not covered by the orange overlay lie below the prediction
_, bins, _ = ax.hist(durations, bins=30, density=False, alpha=0.6, color='b',
                     label=f'Values below prediction: {percentage_below_mean:.2f}%')
ax.axvline(mean_prediction, color='r', linestyle='--', label=f'Prediction: {mean_prediction:.2f} min')
ax.axvline(percentile_95, color='g', linestyle='--', linewidth=2, label=f'95% of the data: {percentile_95:.2f} min')

# Overlay the values at or above the prediction, reusing the same bin edges
ax.hist(values_above_mean_prediction, bins=bins, alpha=0.6, color='orange',
        label=f'Values above prediction: {percentage_above_mean:.2f}%')

ax.set_xlabel('Disruption duration (minutes)')
ax.set_ylabel('Frequency')
ax.set_title('Disruption duration histogram')

# Labels are attached to the artists themselves, so each legend entry matches its artist
ax.legend()

plt.show()


In [None]:
from app.PlotPrediction import plot_prediction

fig, ax = plt.subplots(figsize=(12, 8))

plot_prediction(regressor, X_test.iloc[0:0+1], ax, X, y)

plt.show()