# Accident Severity Project 

This merged notebook combines the **clean, production-ready pipeline** (no leakage, ColumnTransformer pipeline, sensible defaults) with the **exploratory experiments** from the original notebook (KNN, Decision Tree, bagging/boosting experiments, elbow plots, extra visualizations).

**How it's organized**:
1. Data load & inspection
2. Cleaning: sentinel handling, leakage decision, identifiers removal
3. Feature engineering & cardinality reduction
4. Preprocessing Pipeline (ColumnTransformer)
5. Main Models: Random Forest & Logistic Regression (GridSearchCV)
6. Experiments: KNN (elbow + bagging/boosting), DecisionTree with corrected GridSearch
7. Evaluation, permutation importance, saving pipeline

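Run the notebook sequentially. If runtime is long, reduce the number of CV folds (`n_splits`, `cv`) or shrink the parameter grids (`rf_param_grid`, `lr_param_grid`, `knn_param_grid`, `dt_param_grid`) in the model cells.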

In [23]:
# Imports & global settings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from pathlib import Path
sns.set(style='darkgrid')

RANDOM_STATE = 42
DATA_PATH = '../data/merged_data.csv' # Path to the merged dataset


In [24]:
# Load merged data
df = pd.read_csv(DATA_PATH)
print('Loaded dataframe shape:', df.shape)
display(df.head())


Loaded dataframe shape: (46707, 43)



Unnamed: 0,status,collision_index,collision_year,collision_reference,location_easting_osgr,location_northing_osgr,longitude,latitude,police_force,legacy_collision_severity,...,urban_or_rural_area,did_police_officer_attend_scene_of_collision,trunk_road_flag,lsoa_of_collision_location,enhanced_severity_collision,vehicle_count,avg_vehicle_age,heavy_vehicle_count,casualty_count,avg_casualty_age
0,Unvalidated,2024010486807,2024,10486807,527188.0,184782.0,,,1,3,...,-1,3,-1,-1,-1,2.0,-1.0,0.0,0.0,
1,Unvalidated,2024010486821,2024,10486821,528936.0,194721.0,,,1,3,...,-1,3,-1,-1,-1,3.0,-1.0,0.0,0.0,
2,Unvalidated,2024010486824,2024,10486824,552699.0,185940.0,,,1,3,...,-1,1,-1,-1,-1,2.0,-1.0,0.0,0.0,
3,Unvalidated,2024010486825,2024,10486825,545623.0,177185.0,,,1,3,...,-1,1,-1,-1,-1,2.0,-1.0,0.0,0.0,
4,Unvalidated,2024010486828,2024,10486828,536554.0,178468.0,,,1,3,...,-1,1,-1,-1,-1,1.0,-1.0,0.0,0.0,


In [25]:
# Quick inspection
display(df.info())
display(df.isnull().sum().sort_values(ascending=False).head(40))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46707 entries, 0 to 46706
Data columns (total 43 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   status                                        46707 non-null  object 
 1   collision_index                               46707 non-null  object 
 2   collision_year                                46707 non-null  int64  
 3   collision_reference                           46707 non-null  object 
 4   location_easting_osgr                         46623 non-null  float64
 5   location_northing_osgr                        46623 non-null  float64
 6   longitude                                     0 non-null      float64
 7   latitude                                      0 non-null      float64
 8   police_force                                  46707 non-null  int64  
 9   legacy_collision_severity                     46707 non-null 

None

longitude                                       46707
latitude                                        46707
avg_casualty_age                                16384
avg_vehicle_age                                  1868
location_northing_osgr                             84
location_easting_osgr                              84
collision_year                                      0
collision_reference                                 0
police_force                                        0
status                                              0
collision_index                                     0
number_of_vehicles                                  0
legacy_collision_severity                           0
number_of_casualties                                0
date                                                0
local_authority_district                            0
local_authority_ons_district                        0
day_of_week                                         0
time                        

In [26]:
# Convert the -1 sentinel to NaN in numeric columns.
# -1 is treated as a sentinel only when it is the column minimum.
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
cols_replaced = []
for c in num_cols:
    if df[c].min() == -1:
        cnt = int((df[c] == -1).sum())
        df[c] = df[c].replace(-1, np.nan)
        cols_replaced.append((c, cnt))
print('Columns where -1 replaced with NaN (col, count):', cols_replaced)


Columns where -1 replaced with NaN (col, count): [('local_authority_district', 46707), ('first_road_number', 1), ('speed_limit', 35), ('junction_detail', 1331), ('junction_control', 19539), ('second_road_class', 501), ('second_road_number', 19143), ('pedestrian_crossing_human_control', 8950), ('pedestrian_crossing_physical_facilities', 8952), ('road_surface_conditions', 370), ('special_conditions_at_site', 8971), ('carriageway_hazards', 8969), ('urban_or_rural_area', 46707), ('trunk_road_flag', 46707), ('lsoa_of_collision_location', 46707), ('enhanced_severity_collision', 18964), ('avg_vehicle_age', 44839), ('avg_casualty_age', 383)]
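
Several counts in the output above equal the full row count (46707), so `local_authority_district`, `urban_or_rural_area`, `trunk_road_flag` and `lsoa_of_collision_location` are entirely missing after the conversion. The same holds for `avg_vehicle_age`, whose 44839 sentinels plus 1868 existing NaNs add up to 46707. A minimal sketch of how such columns could be listed follows; the toy frame and the name `all_missing` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the sentinel pattern (values are made up for illustration)
toy = pd.DataFrame({
    'speed_limit': [30, -1, 60],              # one sentinel, rest valid
    'trunk_road_flag': [-1, -1, -1],          # sentinel in every row
    'avg_vehicle_age': [-1.0, np.nan, -1.0],  # sentinels plus an existing NaN
})

# Same replacement as in the notebook, applied to the whole toy frame
cleaned = toy.replace(-1, np.nan)

# Columns with no usable value left after the conversion
all_missing = cleaned.columns[cleaned.isna().all()].tolist()
print('Entirely missing columns:', all_missing)
```

Columns listed this way carry no information for a model, so dropping them explicitly may be clearer than relying on a later imputation step to handle them.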


In [27]:
# Decide on leakage: by default we drop casualty-derived features for a pre-event model.
# The substring 'casualty' also covers 'avg_casualty' and 'casualty_count'.
# Note that it does not match plural names such as 'number_of_casualties'.
leakage_keywords = ['casualty']
leak_cols = [c for c in df.columns if any(k in c.lower() for k in leakage_keywords)]
print('Detected possible casualty-derived columns:', leak_cols)
# Dropped by default; for a retrospective model, comment out the next line
df = df.drop(columns=leak_cols)
print('Shape after dropping casualty-derived columns:', df.shape)


Detected possible casualty-derived columns: ['casualty_count', 'avg_casualty_age']
Shape after dropping casualty-derived columns: (46707, 41)
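
Only two columns were detected above, although the inspection output also lists `number_of_casualties`. The sketch below shows why: `'casualty'` is not a substring of `'casualties'`. The shorter stem `'casualt'` is a hypothetical alternative, shown as an assumption about what a stricter filter could look like and not as the notebook's method.

```python
# A few column names taken from the notebook's outputs
columns = ['casualty_count', 'avg_casualty_age',
           'number_of_casualties', 'number_of_vehicles']

# Matching on the full word, as the notebook does
matched_word = [c for c in columns if 'casualty' in c.lower()]

# Matching on a shorter stem (assumption: plural forms should be caught too)
matched_stem = [c for c in columns if 'casualt' in c.lower()]

print('Keyword "casualty":', matched_word)
print('Stem "casualt":   ', matched_stem)
```

Other columns may deserve a manual look as well. For instance, the name `enhanced_severity_collision` suggests an alternative severity coding, but whether it leaks the target is an assumption that the outputs here do not confirm.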


In [28]:
# Drop identifiers and fully-empty geolocation columns
ids_to_drop = ['collision_index', 'collision_reference', 'status']
df = df.drop(columns=ids_to_drop, errors='ignore')
# Drop latitude/longitude only when the column is entirely missing
empty_geo = [c for c in ('latitude', 'longitude') if c in df.columns and df[c].isna().all()]
df = df.drop(columns=empty_geo)
print('Shape after dropping ids/empty geos:', df.shape)


Shape after dropping ids/empty geos: (46707, 36)


In [29]:
# Feature engineering: datetime features
if 'date' in df.columns and 'time' in df.columns:
    df['datetime'] = pd.to_datetime(df['date'].astype(str) + ' ' + df['time'].astype(str), errors='coerce')
    df['hour_of_day'] = df['datetime'].dt.hour
    df['day_of_week'] = df['datetime'].dt.day_name()
    df['month'] = df['datetime'].dt.month
    df = df.drop(columns=['date','time','datetime'])
    print('Created temporal features: hour_of_day, day_of_week, month')

# Reduce cardinality: for object columns with more than TOP_N distinct values,
# keep the TOP_N most frequent values and map the rest to 'OTHER'
TOP_N = 50
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
high_card = [c for c in cat_cols if df[c].nunique() > TOP_N]
for c in high_card:
    top_vals = df[c].value_counts().nlargest(TOP_N).index
    df[c] = df[c].where(df[c].isin(top_vals), other='OTHER')
print('Reduced high-cardinality columns:', high_card)


Created temporal features: hour_of_day, day_of_week, month
Reduced high-cardinality columns: ['local_authority_ons_district', 'local_authority_highway']
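
The reduction above picks the most frequent values from the full dataframe, before the train/test split. A possible variant, sketched here under the assumption that the categories should be learned from training rows only, separates the "fit" and "apply" steps. The helper names `fit_top_categories` and `apply_top_categories` are hypothetical.

```python
import pandas as pd

def fit_top_categories(train_col: pd.Series, top_n: int) -> pd.Index:
    """Return the top_n most frequent values seen in the training column."""
    return train_col.value_counts().nlargest(top_n).index

def apply_top_categories(col: pd.Series, top_vals: pd.Index, other: str = 'OTHER') -> pd.Series:
    """Keep values found in top_vals and map everything else to `other`."""
    return col.where(col.isin(top_vals), other=other)

# Toy data: 'A' and 'B' are frequent in training, 'D' appears only at test time
train_col = pd.Series(['A', 'A', 'B', 'B', 'C'])
test_col = pd.Series(['A', 'C', 'D'])

top_vals = fit_top_categories(train_col, top_n=2)
test_reduced = apply_top_categories(test_col, top_vals)
print(test_reduced.tolist())
```

With this split, values that are rare or unseen in training fall into the same `'OTHER'` bucket at prediction time.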


In [30]:
# Prepare target & features
from sklearn.model_selection import train_test_split

TARGET = 'legacy_collision_severity'
assert TARGET in df.columns, f"Target column {TARGET} not found."

df = df[~df[TARGET].isna()].copy()
X = df.drop(columns=[TARGET])
y = df[TARGET].astype('int')

print('Target distribution:')
display(y.value_counts(normalize=True))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE)
print('Train/test shapes:', X_train.shape, X_test.shape)


Target distribution:


legacy_collision_severity
3    0.751707
2    0.233627
1    0.014666
Name: proportion, dtype: float64

Train/test shapes: (37365, 35) (9342, 35)
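
The target is heavily imbalanced, which is why the models below use `class_weight='balanced'` and macro F1. As a rough worked example, the sketch below computes what a "always predict the majority class" baseline would score, and the weights implied by scikit-learn's documented balanced formula `n_samples / (n_classes * count)`. It assumes the training split has the same proportions as the full data, which stratification should approximately guarantee.

```python
# Class proportions copied from the target distribution above
proportions = {3: 0.751707, 2: 0.233627, 1: 0.014666}
n_classes = len(proportions)

# Baseline: always predict the most frequent class
majority = max(proportions, key=proportions.get)
p = proportions[majority]
baseline_accuracy = p
# For the majority class: precision = p, recall = 1, so F1 = 2p / (1 + p).
# The other classes are never predicted, so their F1 is 0.
baseline_macro_f1 = (2 * p / (1 + p)) / n_classes

# 'balanced' weights: n_samples / (n_classes * count) = 1 / (n_classes * proportion)
balanced_weights = {c: 1 / (n_classes * q) for c, q in proportions.items()}

print(f'Baseline accuracy: {baseline_accuracy:.4f}')
print(f'Baseline macro F1: {baseline_macro_f1:.4f}')
print('Balanced weights:', {c: round(w, 2) for c, w in balanced_weights.items()})
```

The gap between the two baseline numbers illustrates why accuracy alone would be a misleading metric for this target.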


In [31]:
# Preprocessing pipeline with ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = X_train.select_dtypes(include=['object','category']).columns.tolist()

print('Numeric cols:', len(numeric_cols), 'Categorical cols:', len(categorical_cols))

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
    ('num', num_pipeline, numeric_cols),
    ('cat', cat_pipeline, categorical_cols)
], n_jobs=-1)


Numeric cols: 32 Categorical cols: 3


In [32]:
# Main models: RandomForest and LogisticRegression (GridSearchCV)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rf_pipe = Pipeline([('preproc', preprocessor),
                    ('clf', RandomForestClassifier(random_state=RANDOM_STATE, class_weight='balanced', n_jobs=-1))])

rf_param_grid = {
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [10, None],
    'clf__max_features': ['sqrt']
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
rf_gs = GridSearchCV(rf_pipe, rf_param_grid, cv=cv, scoring='f1_macro', n_jobs=-1, verbose=1)
print('Fitting RandomForest GridSearch (this may take a while)...')
rf_gs.fit(X_train, y_train)
print('Best RF params:', rf_gs.best_params_)
print('Best RF CV score:', rf_gs.best_score_)

lr_pipe = Pipeline([('preproc', preprocessor),
                    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=RANDOM_STATE))])
lr_param_grid = {'clf__C': [0.1, 1.0, 10.0]}
lr_gs = GridSearchCV(lr_pipe, lr_param_grid, cv=cv, scoring='f1_macro', n_jobs=-1, verbose=1)
print('Fitting Logistic Regression GridSearch...')
lr_gs.fit(X_train, y_train)
print('Best LR params:', lr_gs.best_params_)
print('Best LR CV score:', lr_gs.best_score_)


Fitting RandomForest GridSearch (this may take a while)...
Fitting 3 folds for each of 4 candidates, totalling 12 fits
Best RF params: {'clf__max_depth': None, 'clf__max_features': 'sqrt', 'clf__n_estimators': 100}
Best RF CV score: 0.8624100020973753
Fitting Logistic Regression GridSearch...
Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best LR params: {'clf__C': 0.1}
Best LR CV score: 0.6637070179022316


## Experiments: KNN, K-Optimal (Elbow), Bagging/Boosting, Decision Tree GridSearch

In [39]:
# KNN experiment: elbow method, grid search and a Bagging wrapper
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Elbow method (error rate) on a small sample for speed
sample = X_train.sample(n=min(2000, X_train.shape[0]), random_state=RANDOM_STATE)
y_sample = y_train.loc[sample.index]
# The preprocessor is fitted once on the whole sample, outside the CV loop,
# so these error rates are a quick exploratory estimate only
X_sample_trans = preprocessor.fit_transform(sample)
error_rates = []
K_vals = range(1, 16)
for k in K_vals:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_sample_trans, y_sample, cv=3, scoring='accuracy', n_jobs=-1)
    error_rates.append(1 - scores.mean())
print('Error rates (1-accuracy) for K=1..15:', error_rates)

# Small grid search for KNN inside the full pipeline (can be slow on the full training set)
knn_pipe = Pipeline([('preproc', preprocessor), ('clf', KNeighborsClassifier())])
knn_param_grid = {'clf__n_neighbors': [3,5,7]}
knn_gs = GridSearchCV(knn_pipe, knn_param_grid, cv=3, scoring='f1_macro', n_jobs=-1, verbose=1)
knn_gs.fit(X_train, y_train)
print('Best KNN params:', knn_gs.best_params_, 'Best CV score:', knn_gs.best_score_)

# Bagging KNN
bag_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=knn_gs.best_params_['clf__n_neighbors']), n_estimators=10, random_state=RANDOM_STATE)
bag_pipe = Pipeline([('preproc', preprocessor), ('clf', bag_knn)])
bag_pipe.fit(X_train, y_train)
print('Trained Bagging KNN (10 estimators)')
print('CV score:', cross_val_score(bag_pipe, X_train, y_train, cv=3, scoring='f1_macro', n_jobs=-1).mean())



Error rates (1-accuracy) for K=1..15: [np.float64(0.16600483542012778), np.float64(0.20200335267801528), np.float64(0.13449731590661118), np.float64(0.13899581740661204), np.float64(0.13299806553179871), np.float64(0.12399430915173049), np.float64(0.12399806102954525), np.float64(0.114998056527292), np.float64(0.12400106253179721), np.float64(0.11749905827866858), np.float64(0.12400106253179721), np.float64(0.12300006153079612), np.float64(0.12600081340711033), np.float64(0.12250106178142162), np.float64(0.13100056578317443)]
Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best KNN params: {'clf__n_neighbors': 3} Best CV score: 0.5962162564442296
Trained Bagging KNN (10 estimators)
CV score: 0.5889541911374631
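
The elbow values printed above can be read off programmatically. The sketch below copies them rounded to four decimals and picks the K with the lowest error. Note that this curve uses accuracy on a 2000-row sample, whereas the grid search above uses macro F1 over K in {3, 5, 7}, so the two selections are not directly comparable.

```python
# Error rates (1 - accuracy) for K = 1..15, rounded from the output above
error_rates = [0.1660, 0.2020, 0.1345, 0.1390, 0.1330,
               0.1240, 0.1240, 0.1150, 0.1240, 0.1175,
               0.1240, 0.1230, 0.1260, 0.1225, 0.1310]
k_values = list(range(1, len(error_rates) + 1))

# K with the smallest cross-validated error on the sample
best_k, best_error = min(zip(k_values, error_rates), key=lambda pair: pair[1])
print(f'Lowest error {best_error:.4f} at K={best_k}')
```

Because the lowest error falls outside the searched grid, widening `knn_param_grid` might be worth trying, although a better accuracy does not guarantee a better macro F1 on an imbalanced target.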


In [34]:
# Decision Tree with corrected GridSearch (valid parameters)
from sklearn.tree import DecisionTreeClassifier
dt_pipe = Pipeline([('preproc', preprocessor), ('clf', DecisionTreeClassifier(random_state=RANDOM_STATE))])

dt_param_grid = {
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [3, 5, 10, None],
    'clf__min_samples_leaf': [1, 2, 5],
    'clf__min_samples_split': [2, 5, 10]
}
dt_gs = GridSearchCV(dt_pipe, dt_param_grid, cv=3, scoring='f1_macro', n_jobs=-1, verbose=1)
dt_gs.fit(X_train, y_train)
print('Best DT params:', dt_gs.best_params_, 'Best score:', dt_gs.best_score_)


Fitting 3 folds for each of 72 candidates, totalling 216 fits
Best DT params: {'clf__criterion': 'gini', 'clf__max_depth': 3, 'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2} Best score: 0.8688383672584731
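
The fit counts reported by `GridSearchCV` follow directly from the grid sizes: the number of candidates is the product of the list lengths, multiplied by the number of folds. The sketch below reproduces the three counts printed in this notebook; the helper `n_fits` is a hypothetical name used only for illustration.

```python
from math import prod

def n_fits(param_grid: dict, n_splits: int) -> tuple:
    """Return (number of candidates, total fits) for an exhaustive grid search."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * n_splits

# Grids copied from the notebook cells
rf_param_grid = {'clf__n_estimators': [100, 200],
                 'clf__max_depth': [10, None],
                 'clf__max_features': ['sqrt']}
lr_param_grid = {'clf__C': [0.1, 1.0, 10.0]}
dt_param_grid = {'clf__criterion': ['gini', 'entropy'],
                 'clf__max_depth': [3, 5, 10, None],
                 'clf__min_samples_leaf': [1, 2, 5],
                 'clf__min_samples_split': [2, 5, 10]}

for name, grid in [('RF', rf_param_grid), ('LR', lr_param_grid), ('DT', dt_param_grid)]:
    candidates, fits = n_fits(grid, n_splits=3)
    print(f'{name}: {candidates} candidates, {fits} fits')
```

This makes it easy to estimate the cost of a grid before running it, for example when deciding which lists to shorten.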


In [35]:
# Final evaluation: RF, LR, KNN and Decision Tree (best estimators from the grids) plus Bagging KNN
from sklearn.metrics import classification_report, confusion_matrix, f1_score

models = {
    'RandomForest': rf_gs.best_estimator_,
    'LogisticRegression': lr_gs.best_estimator_,
    'KNN': knn_gs.best_estimator_,
    'DecisionTree': dt_gs.best_estimator_,
    'BaggingKNN': bag_pipe
}

for name, model in models.items():
    print('---', name, '---')
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred, digits=4))
    print('Confusion matrix:')
    print(confusion_matrix(y_test, y_pred))
    print('Macro F1:', f1_score(y_test, y_pred, average='macro'))
    print('\n')


--- RandomForest ---
              precision    recall  f1-score   support

           1     1.0000    0.6423    0.7822       137
           2     0.9750    0.7142    0.8244      2183
           3     0.9122    0.9944    0.9516      7022

    accuracy                         0.9238      9342
   macro avg     0.9624    0.7836    0.8527      9342
weighted avg     0.9282    0.9238    0.9194      9342

Confusion matrix:
[[  88    1   48]
 [   0 1559  624]
 [   0   39 6983]]
Macro F1: 0.8527368658687259


--- LogisticRegression ---
              precision    recall  f1-score   support

           1     0.1702    0.8102    0.2814       137
           2     0.7949    0.7778    0.7863      2183
           3     0.9399    0.8772    0.9075      7022

    accuracy                         0.8530      9342
   macro avg     0.6350    0.8218    0.6584      9342
weighted avg     0.8947    0.8530    0.8700      9342

Confusion matrix:
[[ 111    4   22]
 [ 113 1698  372]
 [ 428  434 6160]]
Macro F1: 0.6
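
As a worked check of the metric, macro F1 can be recomputed by hand from a confusion matrix. The sketch below uses the RandomForest matrix printed above (rows are true classes 1, 2, 3; columns are predicted classes) and the identity F1 = 2·TP / (actual + predicted) for each class.

```python
# RandomForest confusion matrix from the output above
cm = [[88, 1, 48],
      [0, 1559, 624],
      [0, 39, 6983]]

def per_class_f1(matrix):
    """F1 per class: 2*TP / (row sum + column sum)."""
    n = len(matrix)
    scores = []
    for i in range(n):
        tp = matrix[i][i]
        actual = sum(matrix[i])                          # true members of class i
        predicted = sum(matrix[r][i] for r in range(n))  # rows predicted as class i
        denom = actual + predicted
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

f1_scores = per_class_f1(cm)
macro_f1 = sum(f1_scores) / len(f1_scores)  # unweighted mean over classes
print([round(s, 4) for s in f1_scores], round(macro_f1, 4))
```

The per-class values match the report (0.7822, 0.8244, 0.9516). Because the mean is unweighted, class 1 with 137 test rows has the same influence as class 3 with 7022.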

In [36]:
# Choose the best model by macro F1 on test set and save pipeline
from sklearn.metrics import f1_score
best_name, best_score, best_model = None, -1, None
for name, model in models.items():
    score = f1_score(y_test, model.predict(X_test), average='macro')
    if score > best_score:
        best_score = score
        best_name = name
        best_model = model
print('Best on test set:', best_name, 'with macro F1:', best_score)
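model_path = Path('../model/accident_severity_pipeline_merged.joblib')
model_path.parent.mkdir(parents=True, exist_ok=True)  # create ../model if it does not exist
joblib.dump(best_model, model_path)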
print('Saved merged best pipeline to model/accident_severity_pipeline_merged.joblib')


Best on test set: DecisionTree with macro F1: 0.8721791527427373
Saved merged best pipeline to model/accident_severity_pipeline_merged.joblib
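
Choosing the winner by its test-set score means the reported test score is no longer an unbiased estimate for that model. One possible alternative, sketched below, selects on the cross-validated scores already printed in this notebook and keeps the test set for a single final report. The dictionary simply copies those recorded scores, rounded to four decimals.

```python
# Mean CV macro F1 scores copied from the outputs above
cv_scores = {
    'RandomForest': 0.8624,
    'LogisticRegression': 0.6637,
    'KNN': 0.5962,
    'BaggingKNN': 0.5890,
    'DecisionTree': 0.8688,
}

# Pick the model with the highest cross-validated score
best_by_cv = max(cv_scores, key=cv_scores.get)
print('Best by CV:', best_by_cv, 'with macro F1:', cv_scores[best_by_cv])
```

Here both criteria agree on the Decision Tree, so the saved pipeline would be the same. The CV-based choice only changes how the test score can be interpreted.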


In [37]:
# Permutation importance (sampled) for the chosen best_model
from sklearn.inspection import permutation_importance
# Seeded sample so the importance estimates are reproducible
X_test_sample = X_test.sample(n=min(2000, X_test.shape[0]), random_state=RANDOM_STATE)
y_test_sample = y_test.loc[X_test_sample.index]
r = permutation_importance(best_model, X_test_sample, y_test_sample, n_repeats=10, random_state=RANDOM_STATE, n_jobs=-1)
# Attempt to retrieve feature names
try:
    num_feats = numeric_cols
    ohe = best_model.named_steps['preproc'].named_transformers_['cat'].named_steps['ohe']
    ohe_names = ohe.get_feature_names_out(categorical_cols).tolist() if hasattr(ohe, 'get_feature_names_out') else []
    feature_names = num_feats + ohe_names
    imp = pd.Series(r.importances_mean, index=feature_names).sort_values(ascending=False)
    display(imp.head(30))
except Exception as e:
    print('Could not map all feature names:', e)
    print('Raw importances (first 40):', r.importances_mean[:40])


Could not map all feature names: Length of values (35) does not match length of index (141)
Raw importances (first 40): [0.      0.      0.      0.      0.      0.      0.      0.      0.
 0.      0.      0.      0.      0.      0.      0.      0.      0.
 0.      0.      0.      0.      0.      0.      0.      0.      0.
 0.      0.      0.26935 0.      0.      0.      0.      0.     ]
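
The message above reports 35 importances against 141 names. This is expected when `permutation_importance` is given a whole pipeline: it permutes the raw input columns, so the result has one entry per column of the input frame, not one per one-hot feature. The sketch below illustrates this on made-up toy data; the column names and the toy pipeline are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data: the target is fully determined by 'signal'
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    'signal': rng.integers(0, 3, size=n).astype(float),
    'noise': rng.normal(size=n),
    'weekday': rng.choice(['Mon', 'Tue', 'Wed'], size=n),
})
y_toy = X_toy['signal'].astype(int)

toy_pipe = Pipeline([
    ('preproc', ColumnTransformer([
        ('num', 'passthrough', ['signal', 'noise']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['weekday']),
    ])),
    ('clf', DecisionTreeClassifier(random_state=0)),
])
toy_pipe.fit(X_toy, y_toy)

r_toy = permutation_importance(toy_pipe, X_toy, y_toy, n_repeats=5, random_state=0)

# One importance per raw input column, so the input column names are the right index
imp = pd.Series(r_toy.importances_mean, index=X_toy.columns).sort_values(ascending=False)
print(imp)
```

Applied to this notebook, the same idea would index `r.importances_mean` with `X_test_sample.columns`. The raw output above also shows a single non-zero importance (about 0.269), which may be worth checking against the leakage decision made earlier.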
