# Introduction

**Road traffic accidents are a significant public health issue, causing considerable loss of life and injury. Predicting how severe a casualty's injuries are likely to be can help in developing strategies to reduce that impact. In this project, we use a machine learning approach to predict casualty severity from factors such as the casualty type, the age and sex of the casualty, and the casualty's home area type.**

# Understand the Dataset

**Importing necessary libraries**

In [None]:
import pandas as pd
import numpy as np

**Loading the dataset**

In [None]:
df = pd.read_csv('/kaggle/input/road-accidents-data-2022/dft-road-casualty-statistics-casualty-provisional-mid-year-unvalidated-2022 (1).csv')

**Checking the first few rows of the dataset**

In [None]:
df.head()

**Checking the shape of the dataset**

In [None]:
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns.")

**Checking the data types of the columns**

In [None]:
df.dtypes

**Checking for missing values**

In [None]:
df.isnull().sum()
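
**Checking for sentinel-coded missing values**

Note that `isnull()` only counts true NaN values. If a dataset encodes missing data with a sentinel code instead, those entries will not appear in the count above. The sketch below assumes `-1` as the sentinel, which is an assumption to verify against the dataset's data guide; the frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; -1 is an ASSUMED sentinel for "missing"
toy = pd.DataFrame({
    "age_of_casualty": [34, -1, 52, 19],
    "sex_of_casualty": [1, 2, -1, -1],
})

# isnull() finds nothing here because -1 is an ordinary integer
nan_counts = toy.isnull().sum()

# Count sentinel values per column
sentinel_counts = (toy == -1).sum()

# Convert the sentinel to NaN so pandas treats it as missing
toy_clean = toy.replace(-1, np.nan)

print(sentinel_counts)
print(toy_clean.isnull().sum())
```

If sentinel values do turn up, they would need to be imputed or dropped before scaling, since a value like `-1` would otherwise be treated as a real age or category.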

**Summary statistics of the dataset**

In [None]:
df.describe(include='all')

# Exploratory Data Analysis (EDA)

**Importing necessary libraries for EDA**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

**Univariate Analysis**

In [None]:
# Let's analyze the 'casualty_severity' column
df['casualty_severity'].value_counts().plot(kind='bar')
plt.title('Casualty Severity Counts')
plt.xlabel('Casualty Severity')
plt.ylabel('Count')
plt.show()
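
**Quantifying class imbalance**

A bar chart shows the imbalance visually; class proportions put a number on it. This is a minimal sketch on made-up labels (the series `severity` is hypothetical and not drawn from the dataset).

```python
import pandas as pd

# Hypothetical severity labels, not taken from the dataset
severity = pd.Series([3, 3, 3, 3, 3, 3, 3, 2, 2, 1], name="casualty_severity")

# normalize=True returns fractions instead of raw counts
proportions = severity.value_counts(normalize=True).sort_index()
print(proportions)
```

On the real data, the same call would be `df['casualty_severity'].value_counts(normalize=True)`.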

**Bivariate Analysis**

In [None]:
# Let's analyze the relationship between 'casualty_severity' and 'age_of_casualty'
plt.figure(figsize=(10,6))
sns.boxplot(x='casualty_severity', y='age_of_casualty', data=df)
plt.title('Age of Casualty vs Casualty Severity')
plt.xlabel('Casualty Severity')
plt.ylabel('Age of Casualty')
plt.show()

**Visualizing the data**

In [None]:
# Let's visualize the distribution of 'age_of_casualty'
sns.histplot(df['age_of_casualty'], kde=True)
plt.title('Age of Casualty Distribution')
plt.xlabel('Age of Casualty')
plt.ylabel('Frequency')
plt.show()

# Feature Engineering

In [None]:
# Import necessary libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

**Feature Selection**

In [None]:
# Selecting relevant columns for the model. This can be changed according to the problem at hand.
# Note: 'casualty_severity' is the target; it is kept here and separated from the features later.
features = ['accident_year', 'vehicle_reference', 'casualty_reference', 'casualty_class', 'sex_of_casualty', 'age_of_casualty', 'age_band_of_casualty', 'casualty_severity', 'pedestrian_location', 'pedestrian_movement', 'car_passenger', 'bus_or_coach_passenger', 'pedestrian_road_maintenance_worker', 'casualty_type', 'casualty_home_area_type', 'casualty_imd_decile']
df = df[features]

**Feature Transformation**

In [None]:
df = df[features].copy()

In [None]:
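# Label-encode categorical variables, keeping one fitted encoder per column
# so the original codes can be recovered later with inverse_transform
categorical_features = ['casualty_class', 'sex_of_casualty', 'car_passenger', 'bus_or_coach_passenger', 'pedestrian_road_maintenance_worker', 'casualty_type', 'casualty_home_area_type']
label_encoders = {}
for feature in categorical_features:
    label_encoders[feature] = LabelEncoder()
    df[feature] = label_encoders[feature].fit_transform(df[feature])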

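# Standard Scaling for numerical variables.
# Note: fitting here uses the full dataset, so test-set statistics influence the scaling;
# for a stricter evaluation, fit the scaler on the training split only.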
scaler = StandardScaler()
numerical_features = ['accident_year', 'vehicle_reference', 'casualty_reference', 'age_of_casualty', 'age_band_of_casualty', 'pedestrian_location', 'pedestrian_movement', 'casualty_imd_decile']
for feature in numerical_features:
    df[feature] = scaler.fit_transform(df[feature].values.reshape(-1, 1))

df.head()
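
**An alternative encoding: one-hot**

Label encoding assigns an integer to each category, which a linear model such as logistic regression will read as an ordered quantity. One common alternative is one-hot encoding. The sketch below is not the approach used in this notebook; it only illustrates the idea on a made-up frame (`toy` and its values are hypothetical) using scikit-learn's `ColumnTransformer`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame reusing three column names from the notebook
toy = pd.DataFrame({
    "casualty_class": [1, 2, 3, 1],
    "sex_of_casualty": [1, 2, 1, 2],
    "age_of_casualty": [20, 40, 60, 80],
})

preprocess = ColumnTransformer(
    transformers=[
        # One binary column per category value
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["casualty_class", "sex_of_casualty"]),
        # Zero mean, unit variance for the numerical column
        ("num", StandardScaler(), ["age_of_casualty"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocess.fit_transform(toy)
print(encoded.shape)
```

Here three class values and two sex values become five binary columns, followed by the scaled age column.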

# Model Building

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

**Define features and target variable**

In [None]:
X = df.drop('casualty_severity', axis=1)
y = df['casualty_severity']

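**Split the data into training and testing sets**

In [None]:
# stratify=y keeps the class proportions of casualty_severity the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

**Scale the features**

In [None]:
# Fit the scaler on the training set only, then apply it to the test set,
# so that test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)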
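
**What stratification does**

The `stratify` argument of `train_test_split` keeps the class proportions the same in the training and test sets, which matters when one severity class is rare. A minimal sketch on synthetic labels (the arrays `X_toy` and `y_toy` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 / 15 / 5 across three classes
y_toy = np.array([3] * 80 + [2] * 15 + [1] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Count each class in the 20-sample test set
test_counts = {int(c): int((y_te == c).sum()) for c in np.unique(y_toy)}
print(test_counts)
```

Without `stratify`, a random 20% sample could easily contain no examples of the rarest class at all.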

**Choose a model**

In [None]:
model = LogisticRegression(max_iter=1000)

**Train the model**

In [None]:
model.fit(X_train, y_train)

# Model Evaluation

**Import necessary libraries**

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score

**Make predictions on the test set**

In [None]:
y_pred = model.predict(X_test)

**Evaluate the model**

In [None]:
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Precision
precision = precision_score(y_test, y_pred, average='weighted')
print(f'Precision: {precision}')

# Recall
recall = recall_score(y_test, y_pred, average='weighted')
print(f'Recall: {recall}')

# F1 Score
f1 = f1_score(y_test, y_pred, average='weighted')
print(f'F1 Score: {f1}')
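
**Per-class metrics**

Weighted averages are dominated by the largest class, so they can look good even when a rare class is never predicted. A confusion matrix and per-class report make this visible. The sketch below uses made-up labels (`y_true_toy`, `y_pred_toy`) for a classifier that always predicts the majority class.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels; the "model" always predicts the majority class 3
y_true_toy = [3, 3, 3, 3, 2, 2, 1, 3, 3, 3]
y_pred_toy = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[1, 2, 3])

# zero_division=0 suppresses warnings for classes that are never predicted
report = classification_report(
    y_true_toy, y_pred_toy, labels=[1, 2, 3], zero_division=0, output_dict=True
)

print(cm)
print({label: report[label]["recall"] for label in ["1", "2", "3"]})
```

On the real data, the equivalent calls would take `y_test` and `y_pred`.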

**Perform cross-validation**

In [None]:
# 10-fold Cross Validation
cv_scores = cross_val_score(model, X, y, cv=10)

print(f'Cross Validation Scores: {cv_scores}')
print(f'Average CV Score: {cv_scores.mean()}')
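
**Cross-validation with scaling inside each fold**

When features are scaled before cross-validation, every fold's held-out data has already influenced the scaler. Wrapping the scaler and the model in a pipeline refits the scaler on each fold's training portion. This is a sketch on synthetic data from `make_classification`, not the notebook's dataset; the choice of five folds and the `f1_macro` scorer are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced three-class problem
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

# The scaler is refit on the training portion of every fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep class proportions similar across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_toy, y_toy, cv=cv, scoring="f1_macro")

print(scores)
print(scores.mean())
```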

# Model Optimization

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from tqdm import tqdm
from sklearn.model_selection import ParameterGrid
from joblib import Parallel, delayed

In [None]:
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [2, 4],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Score one set of hyperparameters with 3-fold cross-validation on the training set,
# so the test set is not used for model selection
from sklearn.model_selection import cross_val_score

def fit_model(params):
    candidate = RandomForestClassifier(random_state=42, **params)
    score = cross_val_score(candidate, X_train, y_train, cv=3).mean()
    return score, params

# Create a list of all possible combinations of hyperparameters
param_list = list(ParameterGrid(param_grid))

# Evaluate every combination in parallel and track progress with tqdm
results = Parallel(n_jobs=-1, verbose=1)(
    delayed(fit_model)(params) for params in tqdm(param_list)
)

# Pick the hyperparameters with the best mean cross-validation score
best_score, best_params = max(results, key=lambda x: x[0])

print(f"Best parameters: {best_params}")
print(f"Best mean CV score: {best_score}")
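
**The same search with GridSearchCV**

The manual loop above can also be written with `GridSearchCV`, which is imported earlier but not used. It scores each combination by cross-validation on the training data, so the test set stays untouched until the end. This is a sketch on synthetic data with a deliberately small grid (`small_grid` is hypothetical), not a drop-in replacement for the search above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the real features and target
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Small illustrative grid to keep the example fast
small_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0), small_grid, cv=3, n_jobs=1
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)          # mean cross-validation score of the best combination

# The test set is used once, after the search has finished
held_out_score = search.score(X_te, y_te)
print(held_out_score)
```

After fitting, `search.best_estimator_` holds the model refit on the whole training set with the best parameters.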

# Model Deployment

In [None]:
# Import necessary libraries
import joblib

# Save the model to a file
joblib.dump(model, 'model.pkl')

print("Model dumped!")

In [None]:
# Load the model from the file
best_estimator_from_joblib = joblib.load('model.pkl')
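
**Saving the scaler together with the model**

A saved model on its own expects inputs that were scaled exactly as they were during training. One option is to bundle the scaler and the classifier in a single pipeline and save that instead. The sketch below uses synthetic data and a hypothetical file name (`pipeline.pkl`) written to a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features and target
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)

# Scaler and classifier travel together as one object
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_toy, y_toy)

# Round trip through a file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw (unscaled) inputs
same_predictions = np.array_equal(pipeline.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```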

In [None]:
# Define the features used in the model
features_used_in_model = df.drop('casualty_severity', axis=1).columns.tolist()

# Get coefficients from the model.
# coef_ has one row per class; row 0 belongs to the first label in classes_
coefficients = best_estimator_from_joblib.coef_[0]

# Pair each feature name with its coefficient
coefficients_df = pd.DataFrame({
    'Feature': features_used_in_model,
    'Coefficient': coefficients
})

# Sort the DataFrame by absolute value of coefficients
coefficients_df = coefficients_df.sort_values('Coefficient', key=abs, ascending=False)

print(coefficients_df)
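
**Coefficients for every class**

Because `coef_[0]` covers only one class, it can help to lay out all rows of `coef_` side by side. This is a sketch on synthetic three-class data; the feature names `f0` to `f3` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem with four features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = ["f0", "f1", "f2", "f3"]  # placeholder names

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# One row per class label, one column per feature
coef_table = pd.DataFrame(clf.coef_, index=clf.classes_, columns=feature_names)
print(coef_table)
```

With the notebook's model, the same table could be built from `best_estimator_from_joblib.coef_`, `best_estimator_from_joblib.classes_`, and `features_used_in_model`.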


# Documentation

**Data Collection and Preprocessing**

*The data I used in this project comes from the UK Department for Transport's road casualty statistics: the provisional, unvalidated mid-year 2022 casualty file. Each row describes one casualty, including their class, age, sex, and injury severity.*

*The first step in the data preprocessing was to clean the data. I kept only the columns relevant to the casualty and checked for missing values. For categorical variables, I used label encoding to convert them into numerical values that could be used in my machine learning model. For numerical variables, I used standard scaling so that all features had zero mean and unit variance.*

**Exploratory Data Analysis**

*I performed exploratory data analysis to understand the data better and identify any patterns or trends. I visualized the distribution of casualty severity, the relationship between casualty age and severity, and the distribution of casualty age. This helped me understand how unevenly the severity classes are represented and whether age might be useful in predicting severity.*

**Model Building**

*I divided the dataset into a training set (80%) and a test set (20%). I trained a logistic regression model on the training set as the baseline. A Random Forest Classifier, which handles mixed feature types well and is fairly robust to overfitting, was explored in the optimization step.*

**Model Evaluation**

*I evaluated the model's performance on the test set using accuracy and class-weighted precision, recall, and F1 score. I also performed 10-fold cross-validation to check that performance was stable across different splits of the data.*

**Model Optimization**

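*To try to improve on the baseline, I ran a small grid search over Random Forest hyperparameters (`n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`), enumerating the combinations with `ParameterGrid` and fitting them in parallel with joblib rather than using `GridSearchCV`. I also inspected the logistic regression coefficients, which gave me insights into which factors carried the most weight in predicting the severity of injuries.*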

**Model Deployment**

*Once I was satisfied with the model's performance, I used the joblib library to save the trained logistic regression model to a file (`model.pkl`), which can be loaded later to make predictions.*

**Challenges and Solutions**

*One of the challenges I faced was deciding which of the many columns to rely on. I used the model's coefficients to identify the most influential features and focus on them. Another challenge was the imbalance in the target variable. I addressed this by using stratified sampling to ensure that my training and test sets had the same proportion of each class.*

**Conclusion**

*This project demonstrated how machine learning can be used to predict the severity of road traffic accidents. The model I built can be used by traffic authorities and policymakers to understand the factors that contribute to the severity of accidents and develop strategies to reduce their impact. Future work could involve incorporating more data, such as weather conditions and road conditions, to improve the model's accuracy.*