# Traffic Accident Random Forest Walkthrough

**Author:** Lars Tovar  
**Programme:** HBO-ICT, Hogeschool Windesheim  
**Course:** Machine Learning

---
## Research question

> **Can we use machine learning to predict whether a traffic accident is 'Slight', 'Serious' or 'Fatal', based on environmental and road characteristics from the UK Road Safety dataset (2005-2017)?**

### Sub-questions

1. Which environmental and road characteristics are available in the dataset?
2. Which features correlate most strongly with the severity of an accident?
3. Which ML algorithm performs best for this classification task?

---
## 1. Imports

In [9]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, balanced_accuracy_score
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

---
## 2. Loading the data

In [10]:
X_train_df = pd.read_csv("train_set.csv")
X_val_df   = pd.read_csv("validation_set.csv")
X_test_df  = pd.read_csv("test_set.csv")

y_train = pd.read_csv("y_train.csv").squeeze()  # read the labels and convert each single-column DataFrame to a Series for scikit-learn compatibility
y_val   = pd.read_csv("y_val.csv").squeeze()
y_test  = pd.read_csv("y_test.csv").squeeze()

---
## 3. Random forest  

Fitted with class imbalance taken into account.

In [11]:
rf = RandomForestClassifier(
    n_estimators=400,
    random_state=42,
    n_jobs=-1,
    class_weight="balanced",  # weight classes inversely to their frequency to counter class imbalance
    max_features="sqrt"
)

rf.fit(X_train_df, y_train)
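
As a rough illustration of what `class_weight="balanced"` does: scikit-learn documents the balanced weight of a class as `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula in plain Python. The class counts are an assumption for illustration only: they are the validation-set supports reported further down in this notebook, because the training-set distribution is not shown.

```python
# Balanced class weights: n_samples / (n_classes * class_count).
# ASSUMPTION: these counts are the validation-set supports, used as a
# stand-in for the training distribution, which the notebook does not show.
class_counts = {0: 17268, 1: 2485, 2: 247}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Rare classes receive a proportionally larger weight.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

With these counts, a sample of class 2 weighs roughly 70 times as much as a sample of class 0.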




---
## 4. Evaluation on the validation set

In [12]:
y_val_pred = rf.predict(X_val_df)  # predictions on the validation set

print("VAL Balanced accuracy:", balanced_accuracy_score(y_val, y_val_pred))
print("\nVAL Classification report:\n")
print(classification_report(y_val, y_val_pred, digits=4))
print("\nVAL Confusion matrix:\n")
print(confusion_matrix(y_val, y_val_pred))  # confusion matrix to see which classes are often confused

VAL Balanced accuracy: 0.33819223729297904

VAL Classification report:

              precision    recall  f1-score   support

           0     0.8639    0.9944    0.9246     17268
           1     0.1681    0.0080    0.0154      2485
           2     0.6000    0.0121    0.0238       247

    accuracy                         0.8597     20000
   macro avg     0.5440    0.3382    0.3212     20000
weighted avg     0.7742    0.8597    0.8005     20000


VAL Confusion matrix:

[[17171    96     1]
 [ 2464    20     1]
 [  241     3     3]]
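
Balanced accuracy is the unweighted mean of the per-class recalls, so it can be checked by hand from the confusion matrix above. The following is a minimal sketch in plain Python; it assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
# Validation confusion matrix copied from the output above.
# Rows are true classes, columns are predicted classes.
confusion = [
    [17171, 96, 1],
    [2464, 20, 1],
    [241, 3, 3],
]

# Recall of class i = correctly predicted samples of i / all true samples of i.
recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]

# Balanced accuracy = mean of the per-class recalls.
balanced_accuracy = sum(recalls) / len(recalls)

print("Per-class recall:", [round(r, 4) for r in recalls])
print("Balanced accuracy:", round(balanced_accuracy, 4))
```

The high overall accuracy is driven almost entirely by class 0; the recalls for classes 1 and 2 are close to zero, which pulls the balanced accuracy down to about one third.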


---
## 5. Tuning

In [13]:
param_dist = {
    "n_estimators": [200],  # candidate values for the randomized search
    "max_depth": [10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
    "bootstrap": [True],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

rf_base = RandomForestClassifier(
    random_state=42,
    n_jobs=-1,               # use all cores because of the dataset size
    class_weight="balanced"  # account for the imbalanced classes
)

search = RandomizedSearchCV(
    estimator=rf_base,
    param_distributions=param_dist,
    n_iter=25,               # number of parameter combinations to sample
    scoring="f1_macro",      # weighs all classes equally
    cv=cv,
    n_jobs=-1,               # also run the search on all cores to finish in reasonable time
    random_state=42,
    verbose=1
)

search.fit(X_train_df, y_train)

print("Best params:", search.best_params_)
print("Best CV f1_macro:", search.best_score_)  # best parameters and their CV score

best_rf = search.best_estimator_  # keep the best model


Fitting 3 folds for each of 25 candidates, totalling 75 fits




Best params: {'n_estimators': 200, 'min_samples_split': 10, 'max_features': 'log2', 'max_depth': 20, 'bootstrap': True}
Best CV f1_macro: 0.37039575361811167
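
To see how much of the search space was actually covered, the grid can be enumerated directly. This is a small sketch using only the standard library; it copies `param_dist`, `n_iter` and the number of folds from the tuning cell and is not part of the original notebook.

```python
from itertools import product

# Same search space as in the tuning cell.
param_dist = {
    "n_estimators": [200],
    "max_depth": [10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
    "bootstrap": [True],
}

# Every possible combination of the listed values.
combinations = list(product(*param_dist.values()))
n_combinations = len(combinations)

n_iter = 25    # candidates sampled by the randomized search
n_splits = 3   # folds in the stratified cross-validation

# Each sampled candidate is fitted once per fold.
n_fits = min(n_iter, n_combinations) * n_splits

print("Possible combinations:", n_combinations)
print("Cross-validation fits:", n_fits)
```

With 25 of the 27 possible combinations sampled, the randomized search here is nearly exhaustive, and the 75 fits match the log line printed by the search.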


---
## 5. Evaluating the model after tuning

In [14]:
y_val_pred = best_rf.predict(X_val_df)

print("VAL Balanced accuracy:", balanced_accuracy_score(y_val, y_val_pred))
print("\nVAL Classification report:\n")
print(classification_report(y_val, y_val_pred, digits=4))
print("\nVAL Confusion matrix:\n")
print(confusion_matrix(y_val, y_val_pred))


VAL Balanced accuracy: 0.39531461469689716

VAL Classification report:

              precision    recall  f1-score   support

           0     0.8804    0.8507    0.8653     17268
           1     0.1826    0.1976    0.1898      2485
           2     0.0544    0.1377    0.0780       247

    accuracy                         0.7608     20000
   macro avg     0.3725    0.3953    0.3777     20000
weighted avg     0.7835    0.7608    0.7716     20000


VAL Confusion matrix:

[[14690  2138   440]
 [ 1843   491   151]
 [  153    60    34]]
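
The search optimised `f1_macro`, which is the unweighted mean of the per-class F1 scores. As a hedged sanity check, the sketch below recomputes precision, recall and F1 per class from the confusion matrix above in plain Python, again assuming that rows are true classes and columns are predictions.

```python
# Validation confusion matrix of the tuned model, copied from the output above.
# Rows are true classes, columns are predicted classes.
confusion = [
    [14690, 2138, 440],
    [1843, 491, 151],
    [153, 60, 34],
]

n_classes = len(confusion)
f1_scores = []

for k in range(n_classes):
    true_positives = confusion[k][k]
    predicted_as_k = sum(confusion[r][k] for r in range(n_classes))  # column sum
    actually_k = sum(confusion[k])                                   # row sum

    precision = true_positives / predicted_as_k
    recall = true_positives / actually_k
    f1_scores.append(2 * precision * recall / (precision + recall))

# Macro F1 gives every class the same weight, regardless of its support.
macro_f1 = sum(f1_scores) / n_classes

print("Per-class F1:", [round(f, 4) for f in f1_scores])
print("Macro F1:", round(macro_f1, 4))
```

Because each class counts equally, the low F1 scores of classes 1 and 2 dominate the macro average even though together they make up under 14% of the validation samples.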


---
## 6. Final evaluation on the test set

In [15]:
y_test_pred = best_rf.predict(X_test_df)

print("TEST Balanced accuracy:", balanced_accuracy_score(y_test, y_test_pred))
print("\nTEST Classification report:\n")
print(classification_report(y_test, y_test_pred, digits=4))
print("\nTEST Confusion matrix:\n")
print(confusion_matrix(y_test, y_test_pred))


TEST Balanced accuracy: 0.39528443735208735

TEST Classification report:

              precision    recall  f1-score   support

           0     0.8820    0.8514    0.8664     17269
           1     0.1908    0.2089    0.1995      2484
           2     0.0508    0.1255    0.0723       247

    accuracy                         0.7627     20000
   macro avg     0.3745    0.3953    0.3794     20000
weighted avg     0.7859    0.7627    0.7738     20000


TEST Confusion matrix:

[[14703  2138   428]
 [ 1814   519   151]
 [  153    63    31]]
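
To put the test scores in context, it can help to compare them with a trivial classifier that always predicts the most frequent class. This baseline is not part of the original notebook; the sketch below is hypothetical and only reuses the class supports from the test report above.

```python
# Test-set class supports, copied from the classification report above.
supports = {0: 17269, 1: 2484, 2: 247}

# Hypothetical baseline: always predict the most frequent class.
majority_class = max(supports, key=supports.get)
total = sum(supports.values())

# Accuracy equals the share of the majority class.
baseline_accuracy = supports[majority_class] / total

# Recall is 1 for the majority class and 0 for every other class.
baseline_recalls = [
    1.0 if label == majority_class else 0.0 for label in supports
]
baseline_balanced_accuracy = sum(baseline_recalls) / len(baseline_recalls)

print("Baseline accuracy:", round(baseline_accuracy, 4))
print("Baseline balanced accuracy:", round(baseline_balanced_accuracy, 4))
```

Against this baseline, the tuned forest's plain accuracy (0.7627) is lower, while its balanced accuracy (0.3953) is above one third. This is why the imbalance-aware metrics are the more informative ones for this task.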


---
## 7. Feature importance  

Which features were the most important?

In [16]:
importances = pd.Series(best_rf.feature_importances_, index=X_train_df.columns)
top20 = importances.sort_values(ascending=False).head(20)
print(top20)

Longitude                                        0.151041
Latitude                                         0.149439
Hour                                             0.122248
DayOfWeek_fromDate                               0.080406
Speed_limit                                      0.062303
Urban_or_Rural_Area_Urban                        0.032075
Pedestrian_Crossing-Physical_Facilities          0.027160
Light_Conditions_Daylight                        0.022402
1st_Road_Class_Unclassified                      0.019886
Road_Type_Single carriageway                     0.019562
Road_Surface_Conditions_Dry                      0.017204
Road_Surface_Conditions_Wet or damp              0.016917
1st_Road_Class_B                                 0.016246
Light_Conditions_Darkness - lights lit           0.015657
Weather_Conditions_Fine no high winds            0.015403
2nd_Road_Class_Unclassified                      0.013817
1st_Road_Class_C                                 0.013722
Junction_Detai