# Traffic Accident Logistic Regression Walkthrough

**Author:** Koen van der Hoeven  
**Programme:** HBO-ICT, Hogeschool Windesheim  
**Course:** Machine Learning

---
## Research question

**Can we use machine learning to predict whether a traffic accident is 'Slight', 'Serious' or 'Fatal', based on environmental and road characteristics from the UK Road Safety dataset (2005-2017)?**

### Sub-questions

1. Which environmental and road characteristics are available in the dataset?
2. Which features correlate most strongly with the severity of an accident?
3. Which ML algorithm performs best on this classification task?

---
## 1. Imports

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, balanced_accuracy_score
from sklearn.model_selection import GridSearchCV




---
## 2. Loading the data

We load the preprocessed data that was exported from the EDA notebook.

In [None]:
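# Feature matrices for the train, validation and test splits
X_train_df = pd.read_csv("train_set.csv")
X_val_df   = pd.read_csv("validation_set.csv")
X_test_df  = pd.read_csv("test_set.csv")

# Targets: squeeze() turns a one-column DataFrame into a Series
# (this assumes each y file contains a single column)
y_train = pd.read_csv("y_train.csv").squeeze()
y_val   = pd.read_csv("y_val.csv").squeeze()
y_test  = pd.read_csv("y_test.csv").squeeze()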
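
Before fitting, it can help to check how the severity classes are distributed, since the rest of the notebook relies on handling class imbalance. The snippet below is a minimal sketch using only the standard library. The helper `class_distribution` and the toy labels are hypothetical and made up for illustration; they do not reflect the real dataset proportions.

```python
from collections import Counter


def class_distribution(labels):
    """Return {class: (count, share)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in counts.items()}


# Toy labels for illustration only; NOT the real dataset proportions.
toy_labels = ["Slight"] * 8 + ["Serious"] * 3 + ["Fatal"] * 1

for cls, (n, share) in class_distribution(toy_labels).items():
    print(f"{cls:<8} n={n:<3} share={share:.2%}")
```

In the notebook, `class_distribution(y_train)` should work the same way, because iterating over a pandas Series yields its values.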

---
## 3. Logistic regression with class weights

In [None]:
logreg = LogisticRegression(
    solver="lbfgs",
    max_iter=2000,
    class_weight="balanced",       # addresses the class imbalance, as described in the report
    n_jobs=None
)

logreg.fit(X_train_df, y_train)
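
To make `class_weight="balanced"` more concrete: scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * count_of_class)`, so rare classes get larger weights. The sketch below reproduces that formula in plain Python. The helper name `balanced_class_weights` and the toy labels are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy labels for illustration only; NOT the real dataset proportions.
toy_labels = ["Slight"] * 8 + ["Serious"] * 3 + ["Fatal"] * 1

weights = balanced_class_weights(toy_labels)
for cls, w in weights.items():
    print(f"{cls:<8} weight={w:.3f}")
```

With these toy counts the rarest class ('Fatal') gets a weight eight times larger than the most common class ('Slight'), so misclassifying a rare accident costs the model more during training.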

---
## 4. Evaluation on the validation set

In [None]:


y_val_pred = logreg.predict(X_val_df)

print("Balanced accuracy:", balanced_accuracy_score(y_val, y_val_pred))
print("\nClassification report:\n")
print(classification_report(y_val, y_val_pred, digits=4))

print("Confusion matrix:\n")
print(confusion_matrix(y_val, y_val_pred))
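
As an aid to reading the output above: in scikit-learn's confusion matrix, rows are true classes and columns are predicted classes, and balanced accuracy is the unweighted mean of per-class recall. The sketch below shows both in plain Python. The helper names and the toy predictions are hypothetical; skipping classes without true samples is a simplification made here.

```python
def confusion_counts(y_true, y_pred, classes):
    """Build a matrix where rows are true classes and columns are predictions."""
    index = {cls: i for i, cls in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def balanced_accuracy(matrix):
    """Unweighted mean of per-class recall (diagonal entry / row sum)."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    return sum(recalls) / len(recalls)


# Toy data for illustration only; NOT results from the real model.
classes = ["Fatal", "Serious", "Slight"]
y_true = ["Slight"] * 6 + ["Serious"] * 3 + ["Fatal"] * 1
y_pred = (["Slight"] * 4 + ["Serious"] * 2   # the six 'Slight' cases
          + ["Serious"] * 2 + ["Slight"]     # the three 'Serious' cases
          + ["Serious"])                     # the single 'Fatal' case

matrix = confusion_counts(y_true, y_pred, classes)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(matrix)
print("Plain accuracy:   ", round(plain_accuracy, 4))
print("Balanced accuracy:", round(balanced_accuracy(matrix), 4))
```

In this toy case plain accuracy (0.6) looks better than balanced accuracy (about 0.44), because the missed 'Fatal' case counts as a full class with zero recall. That is the reason to prefer balanced accuracy on imbalanced data.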


---
## 5. Tuning C with grid search

In [None]:


# C is the inverse regularization strength: smaller values regularize more strongly
param_grid = {"C": [0.01, 0.1, 1, 3, 10]}

grid = GridSearchCV(
    LogisticRegression(
        solver="lbfgs",
        max_iter=2000,
        class_weight="balanced"
    ),
    param_grid=param_grid,
    scoring="f1_macro",   # weights all classes equally, not just the largest one
    cv=3,
    n_jobs=-1
)

grid.fit(X_train_df, y_train)
print("Best C:", grid.best_params_["C"])
print("Best CV f1_macro:", grid.best_score_)

best_logreg = grid.best_estimator_
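
The `f1_macro` score used for the grid search is the unweighted mean of the per-class F1 scores, so a class with few samples counts as much as the largest class. The sketch below computes it in plain Python. The helper `macro_f1` and the toy predictions are hypothetical, and assigning an F1 of 0 to a class with no true or predicted samples is an assumption made here for simplicity.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Assumption: score an empty class as 0.0
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data for illustration only; NOT results from the real model.
y_true = ["Slight"] * 6 + ["Serious"] * 3 + ["Fatal"] * 1
y_pred = (["Slight"] * 4 + ["Serious"] * 2
          + ["Serious"] * 2 + ["Slight"]
          + ["Serious"])

score = macro_f1(y_true, y_pred)
print("Macro F1:", round(score, 4))
```

Here the missed 'Fatal' class contributes an F1 of 0 and pulls the macro average down to about 0.41, even though most of the 'Slight' cases are predicted correctly.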

---
## 6. Test score

In [None]:

y_test_pred = best_logreg.predict(X_test_df)

print("Test set balanced accuracy:", balanced_accuracy_score(y_test, y_test_pred))
print("\nTest set classification report:\n")
print(classification_report(y_test, y_test_pred, digits=4))

print("Test set confusion matrix:\n")
print(confusion_matrix(y_test, y_test_pred))
