# UK Road Safety - Supervised Learning Project
This notebook predicts accident severity (Slight, Serious, Fatal) using UK Road Safety Data (STATS19) from data.gov.uk.  
We load, merge, and preprocess data from the Accidents, Casualties, and Vehicles datasets, then train a Random Forest model for multi-class classification.

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
import requests
import io

## Load Data
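Load the last 5 years of Accidents, Casualties, and Vehicles data from CSV URLs, merge on 'collision_index', and free memory. Because both joins use only 'collision_index', the merged frame holds one row per casualty-vehicle combination within each collision, not one row per collision.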

In [11]:
# Accidents
accidents_url = "https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-last-5-years.csv"
accidents_df = pd.read_csv(io.StringIO(requests.get(accidents_url).text), low_memory=False)

# Casualties
casualties_url = "https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-last-5-years.csv"
casualties_df = pd.read_csv(io.StringIO(requests.get(casualties_url).text), low_memory=False)

# Vehicles
vehicles_url = "https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-vehicle-last-5-years.csv"
vehicles_df = pd.read_csv(io.StringIO(requests.get(vehicles_url).text), low_memory=False)

# Merge Datasets on 'collision_index'
df = accidents_df.merge(casualties_df, on='collision_index', how='left', suffixes=('', '_cas'))
df = df.merge(vehicles_df, on='collision_index', how='left', suffixes=('', '_veh'))

# Free memory
del accidents_df, casualties_df, vehicles_df
import gc
gc.collect()

2276
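Joining casualties and vehicles on `collision_index` alone pairs every casualty with every vehicle in the same collision. The toy frames below (invented values, not STATS19 records) illustrate that fan-out and one possible alternative: also joining on `vehicle_reference`, on the assumption that this column links each casualty to the vehicle they were in. This is a sketch of the idea, not the notebook's actual merge.

```python
import pandas as pd

# Toy data: collision A has 2 casualties and 2 vehicles, collision B has 1 of each
collisions = pd.DataFrame({"collision_index": ["A", "B"], "collision_severity": [3, 2]})
casualties = pd.DataFrame({
    "collision_index": ["A", "A", "B"],
    "vehicle_reference": [1, 2, 1],   # assumed link to the vehicle table
    "age_of_casualty": [34, 61, 19],
})
vehicles = pd.DataFrame({
    "collision_index": ["A", "A", "B"],
    "vehicle_reference": [1, 2, 1],
    "vehicle_type": [9, 19, 9],
})

with_casualties = collisions.merge(casualties, on="collision_index", how="left")

# Join on collision_index only: every casualty is paired with every vehicle
cross = with_casualties.merge(
    vehicles, on="collision_index", how="left", suffixes=("", "_veh")
)

# Join on both keys: each casualty is paired only with their own vehicle
linked = with_casualties.merge(
    vehicles, on=["collision_index", "vehicle_reference"], how="left"
)

print("rows with collision_index only:", len(cross))
print("rows with collision_index + vehicle_reference:", len(linked))
```

Which grain is appropriate depends on the question: a per-collision target such as `collision_severity` is repeated on every fanned-out row, which inflates the weight of multi-vehicle collisions.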

## Explore Data
Display the first few rows, info, and columns to understand the merged dataset.

In [12]:
print(df.head())
print(df.info())
print(df.columns)

  collision_index  collision_year collision_ref_no  location_easting_osgr  \
0   2021170H10421            2021        170H10421               447098.0   
1   2021170H10421            2021        170H10421               447098.0   
2   2021170H10421            2021        170H10421               447098.0   
3   2021170H10421            2021        170H10421               447098.0   
4   2021170H11231            2021        170H11231               450486.0   

   location_northing_osgr  longitude   latitude  police_force  \
0                532997.0  -1.270905  54.689833            17   
1                532997.0  -1.270905  54.689833            17   
2                532997.0  -1.270905  54.689833            17   
3                532997.0  -1.270905  54.689833            17   
4                533118.0  -1.218333  54.690592            17   

   collision_severity  number_of_vehicles  ...  age_of_driver  \
0                   3                   2  ...             -1   
1               

## Verify Features and Target
Check if selected features and target exist, inspect missing values, data types, and target distribution.

In [13]:
features = ['time', 'weather_conditions', 'road_surface_conditions', 'speed_limit', 'vehicle_type', 'age_of_casualty', 'sex_of_casualty', 'vehicle_manoeuvre']
target = 'collision_severity'

missing_features = [f for f in features + [target] if f not in df.columns]
if missing_features:
    print("Missing columns:", missing_features)
else:
    print("All features and target are present.")

print("Missing values in features and target:")
print(df[features + [target]].isnull().sum())

print("Data types:")
print(df[features + [target]].dtypes)

print("Target distribution:")
print(df[target].value_counts())

All features and target are present.
Missing values in features and target:
time                       0
weather_conditions         0
road_surface_conditions    0
speed_limit                0
vehicle_type               0
age_of_casualty            0
sex_of_casualty            0
vehicle_manoeuvre          0
collision_severity         0
dtype: int64
Data types:
time                       object
weather_conditions          int64
road_surface_conditions     int64
speed_limit                 int64
vehicle_type                int64
age_of_casualty             int64
sex_of_casualty             int64
vehicle_manoeuvre           int64
collision_severity          int64
dtype: object
Target distribution:
collision_severity
3    920089
2    277034
1     24164
Name: count, dtype: int64
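The zero counts above only rule out `NaN`. STATS19 uses `-1` as its "data missing or out of range" code (the `age_of_driver` value of -1 in the preview is one instance), so a second check for that sentinel may be worthwhile. The sketch below runs on a toy frame with invented values; `count_sentinels` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def count_sentinels(frame, sentinel=-1):
    """Count how often the sentinel code appears in each column."""
    return (frame == sentinel).sum()

# Toy frame with invented values; -1 stands for "missing or out of range"
toy = pd.DataFrame({
    "age_of_casualty": [34, -1, 61, -1],
    "speed_limit": [30, 30, -1, 60],
})

sentinel_counts = count_sentinels(toy)
print(sentinel_counts)

# Converting the sentinel to NaN lets isnull() and dropna() see it
cleaned = toy.replace(-1, np.nan)
print(cleaned.isnull().sum())
```

Applied to the real data, the same call would be `count_sentinels(df[features + [target]])`.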


## Handle Outliers
Detect and cap outliers in numerical features using the IQR method to prevent model skewing.

In [14]:
# Detect and cap outliers with the IQR rule (numerical features only)
numerical_features = ['speed_limit', 'age_of_casualty']  # Add more if needed

for col in numerical_features:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Count values outside the fences before capping
    n_outliers = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()
    print(f"{col}: {n_outliers} outliers detected")
    # Cap outliers at the fences
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)

speed_limit: 17 outliers detected
age_of_casualty: 667 outliers detected
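To make the IQR rule concrete, here is a small worked example on invented numbers. `iqr_bounds` is a hypothetical helper that mirrors the bound calculation used above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented values: eight ordinary points and one extreme point
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

lower, upper = iqr_bounds(values)          # Q1 = 3, Q3 = 7, IQR = 4
capped = values.clip(lower=lower, upper=upper)

print("fences:", lower, upper)
print("max before:", values.max(), "max after:", capped.max())
```

Here the fences are -3 and 13, so only the value 100 is pulled in, to 13.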


## Feature Selection
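Reduce irrelevant features: drop columns that are more than 50% null or zero, cap extreme values at the 1st and 99th percentiles, drop features with low mutual information with the target (used here as a proxy for Weight of Evidence), and drop one feature from each highly correlated pair (absolute Pearson correlation > 0.75).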

In [15]:
from sklearn.feature_selection import mutual_info_classif
import numpy as np

# Convert 'time' ("HH:MM") to minutes since midnight; anything unparseable becomes NaN
def to_minutes(value):
    if isinstance(value, str) and ':' in value:
        hours, minutes = value.split(':')[:2]
        return int(hours) * 60 + int(minutes)
    return np.nan

if 'time' in df.columns:
    df['time'] = df['time'].apply(to_minutes)

# 1. Drop features with >50% nulls or 0s (but keep if categorical with <=3 unique values, as 0 may be a flag)
threshold = 0.5
features_to_drop = []
for col in df.columns:
    if col in [target]: continue
    null_pct = df[col].isnull().mean()
    zero_pct = (df[col] == 0).mean()
    unique_vals = df[col].nunique()
    # Drop if high nulls, or high zeros but only if not a potential flag (i.e., if >3 unique values or numerical)
    if null_pct > threshold or (zero_pct > threshold and unique_vals > 3):
        features_to_drop.append(col)
        print(f"Dropping {col}: {null_pct:.2%} nulls, {zero_pct:.2%} zeros, {unique_vals} unique")

df = df.drop(columns=features_to_drop)
features = [f for f in features if f not in features_to_drop]

# 2. Handle extreme outliers (cap at 1st and 99th percentiles instead of dropping rows)
for col in numerical_features:
    if col in df.columns:
        lower_cap = df[col].quantile(0.01)
        upper_cap = df[col].quantile(0.99)
        df[col] = df[col].clip(lower=lower_cap, upper=upper_cap)
        print(f"Capped {col} at {lower_cap:.2f} - {upper_cap:.2f}")

# 3. Weight of Evidence (using mutual information as proxy for relevance)
mi_scores = mutual_info_classif(df[features], df[target])
mi_df = pd.DataFrame({'feature': features, 'mi_score': mi_scores})
mi_df = mi_df.sort_values('mi_score', ascending=False)
print("Mutual Information Scores:")
print(mi_df)

# Drop features with MI < 0.01 (low relevance)
low_mi = mi_df[mi_df['mi_score'] < 0.01]['feature'].tolist()
df = df.drop(columns=low_mi)
features = [f for f in features if f not in low_mi]
print(f"Dropped low MI features: {low_mi}")

# 4. Drop highly correlated features (>0.75 Pearson)
# Only on numeric features
numeric_features = [f for f in features if df[f].dtype in ['int64', 'float64']]
if numeric_features:
    corr_matrix = df[numeric_features].corr()
    high_corr = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > 0.75:
                colname = corr_matrix.columns[i]
                high_corr.append(colname)
                print(f"Dropping {colname} due to high correlation with {corr_matrix.columns[j]}")

    df = df.drop(columns=high_corr)
    features = [f for f in features if f not in high_corr]

print(f"Final features: {features}")

Dropping junction_detail: 0.00% nulls, 51.43% zeros, 8 unique
Dropping pedestrian_crossing_human_control_historic: 0.00% nulls, 82.38% zeros, 5 unique
Dropping pedestrian_crossing_physical_facilities_historic: 0.00% nulls, 67.34% zeros, 8 unique
Dropping pedestrian_crossing: 0.00% nulls, 77.69% zeros, 10 unique
Dropping special_conditions_at_site: 0.00% nulls, 82.95% zeros, 10 unique
Dropping carriageway_hazards_historic: 0.00% nulls, 83.51% zeros, 8 unique
Dropping carriageway_hazards: 0.00% nulls, 93.27% zeros, 14 unique
Dropping collision_adjusted_severity_serious: 0.00% nulls, 57.12% zeros, 103112 unique
Dropping pedestrian_location: 0.00% nulls, 92.07% zeros, 12 unique
Dropping pedestrian_movement: 0.00% nulls, 92.07% zeros, 11 unique
Dropping car_passenger: 0.00% nulls, 81.97% zeros, 5 unique
Dropping bus_or_coach_passenger: 0.00% nulls, 98.86% zeros, 7 unique
Dropping pedestrian_road_maintenance_worker: 0.00% nulls, 98.21% zeros, 4 unique
Dropping casualty_adjusted_severity_seri
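Most of the selected features are integer category codes, so it may help to tell `mutual_info_classif` to treat them as discrete and to fix `random_state` for repeatable scores. The sketch below uses synthetic data, not the notebook's features, to show the call and how an informative column separates from a noise column.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic target with three classes, coded 1..3 like collision_severity
y = rng.integers(1, 4, size=600)

informative = y % 2                      # fully determined by the target
noise = rng.integers(0, 5, size=600)     # drawn independently of the target
X = np.column_stack([informative, noise])

# discrete_features=True treats every column as categorical codes
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

for name, score in zip(["informative", "noise"], scores):
    print(f"{name}: {score:.3f}")
```

Continuous columns such as `time` in minutes would need `discrete_features` passed as a boolean mask or index list instead of `True`.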

## Preprocess Data
Drop rows with missing values in features/target, encode categorical features, split into train/test sets, and oversample the minority classes in the training set with SMOTE. Resampling happens after the split so that the test set contains only real collisions.

In [21]:
from imblearn.over_sampling import SMOTE

target = 'collision_severity'  # Severity column from the collision (accidents) table

df = df.dropna(subset=features + [target])

# Encode any remaining string columns, with a fresh encoder per column
for col in features:
    if df[col].dtype == 'object':
        df[col] = LabelEncoder().fit_transform(df[col])

X = df[features]
y = df[target]

# Split first so that no synthetic samples leak into the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Handle imbalance with SMOTE on the training set only
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)


ImportError: cannot import name '_is_pandas_df' from 'sklearn.utils.validation' (c:\Users\beckk\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\utils\validation.py)
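The `ImportError` above most likely means the installed imbalanced-learn release was built against a different scikit-learn version: `_is_pandas_df` is a private scikit-learn helper, so it can change between releases. A first diagnostic step is to print both installed versions and compare them with the compatibility notes in the imbalanced-learn release notes. `installed_versions` is a hypothetical helper built only on the standard library.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(["scikit-learn", "imbalanced-learn"])
for name, ver in versions.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Upgrading both packages together in the same environment is usually the simplest way to bring them back in step.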

## Train Model
Train a Random Forest classifier with balanced class weights to handle imbalanced data.

In [17]:
model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
model.fit(X_train, y_train)

RandomForestClassifier(class_weight='balanced', random_state=42)
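To see what `class_weight='balanced'` does, the weights can be reproduced by hand. scikit-learn documents the formula as `n_samples / (n_classes * count_of_class)`. The sketch below applies it to the full-dataset target counts printed earlier; the stratified training split has nearly the same proportions, so this assumes the training data has not already been resampled.

```python
# Target counts from the "Target distribution" output above
counts = {1: 24164, 2: 277034, 3: 920089}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' weighting: rarer classes get proportionally larger weights
weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

The rarest class (code 1) ends up weighted roughly 38 times more heavily than the most common one (code 3). If the training set has already been balanced by oversampling, these weights collapse to about 1 each and the option has little extra effect.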


## Compare Models
Train and evaluate multiple models on the balanced dataset.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'SVM': SVC(random_state=42)
}

# Use a separate loop variable so the Random Forest stored in `model` is not overwritten
for name, clf in models.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {acc:.2f}")
    print(classification_report(y_test, y_pred))
    print("-" * 50)

## Evaluate Model
Predict on test set and print classification report with precision, recall, and F1-score for each class.

In [18]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.05      0.68      0.10      4833
           2       0.27      0.35      0.31     55407
           3       0.84      0.50      0.63    184018

    accuracy                           0.47    244258
   macro avg       0.39      0.51      0.34    244258
weighted avg       0.69      0.47      0.54    244258
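Accuracy is hard to read on a dataset this imbalanced, so it helps to compare against a trivial baseline that always predicts the most common class. The arithmetic below uses only the support and recall figures from the report above; no model is retrained.

```python
# Support and per-class recall copied from the classification report above
support = {1: 4833, 2: 55407, 3: 184018}
recall = {1: 0.68, 2: 0.35, 3: 0.50}

total = sum(support.values())

# Baseline: always predict the most frequent class
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total
baseline_macro_recall = 1 / len(support)   # recall 1 on one class, 0 on the rest

# Model figures reconstructed from the report
model_accuracy = sum(recall[c] * support[c] for c in support) / total
model_macro_recall = sum(recall.values()) / len(recall)

print(f"baseline accuracy:     {baseline_accuracy:.2f}")
print(f"model accuracy:        {model_accuracy:.2f}")
print(f"baseline macro recall: {baseline_macro_recall:.2f}")
print(f"model macro recall:    {model_macro_recall:.2f}")
```

The model's accuracy (0.47) sits well below the always-majority baseline (about 0.75), while its macro recall (0.51) is above the baseline's 0.33. The class weighting trades overall accuracy for better coverage of the rare classes, at the cost of very low precision on class 1.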



## Make Predictions
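Run the trained model on a single hand-built collision record as a sanity check. The input must supply every feature the model was trained on.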

In [20]:
sample_input = pd.DataFrame({
    'weather_conditions': [1],
    'road_surface_conditions': [0],
    'speed_limit': [30],
    'vehicle_type': [9],
    'sex_of_casualty': [1],
    'vehicle_manoeuvre': [5]
})
prediction = model.predict(sample_input)
# STATS19 codes collision_severity as 1 = Fatal, 2 = Serious, 3 = Slight
severity_labels = {1: "Fatal", 2: "Serious", 3: "Slight"}
print("Predicted Severity:", severity_labels[prediction[0]])

Predicted Severity: Slight
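A hand-built sample only works if its columns match the training features in name and order. The sketch below shows one way to enforce that before calling `predict`. `build_sample` is a hypothetical helper, and the feature list is copied from the sample above, not read from the fitted model.

```python
import pandas as pd

def build_sample(values, feature_order):
    """Build a one-row frame whose columns follow feature_order exactly."""
    missing = [name for name in feature_order if name not in values]
    if missing:
        raise ValueError(f"Missing feature values: {missing}")
    return pd.DataFrame([values])[list(feature_order)]

# Assumed to match the columns the model was fitted on
feature_order = [
    "weather_conditions", "road_surface_conditions", "speed_limit",
    "vehicle_type", "sex_of_casualty", "vehicle_manoeuvre",
]

# Keys may be given in any order; the helper reorders them
sample = build_sample(
    {
        "speed_limit": 30, "vehicle_type": 9, "weather_conditions": 1,
        "road_surface_conditions": 0, "vehicle_manoeuvre": 5, "sex_of_casualty": 1,
    },
    feature_order,
)
print(list(sample.columns))
```

In the notebook, `feature_order` could be the final `features` list, and the result would be passed to `model.predict(sample)`.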
