-------

**INTRODUCTION**

------

This project analyzes road traffic crash data to classify each state, per quarter, as high-risk or low-risk. Two supervised learning models, Logistic Regression and Random Forest, are trained on input features such as the number of people injured, the number killed, and the contributing factors recorded for the crashes.


In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, f1_score


In [2]:
# Load the cleaned dataset
cleaned_road = pd.read_csv('C:/Users/acer/Desktop/Data_science_projects/cleaned_road_data.csv')

In [3]:
cleaned_road

Unnamed: 0.1,Unnamed: 0,Quarter,State,Total_Crashes,Num_Injured,Num_Killed,Total_Vehicles_Involved,SPV,DAD,PWR,FTQ,Other_Factors,Year,Casualty_count,Fatality_rate,Vehicle_crash_ratio
0,0,Q4 2020,Abia,30,146,31,37,19,0,0,0,18,2020,177,1.033333,1.233333
1,1,Q4 2020,Adamawa,77,234,36,94,57,0,0,0,37,2020,270,0.467532,1.220779
2,2,Q4 2020,Akwa Ibom,22,28,7,24,15,0,0,1,8,2020,35,0.318182,1.090909
3,3,Q4 2020,Anambra,72,152,20,83,43,1,0,0,39,2020,172,0.277778,1.152778
4,4,Q4 2020,Bauchi,154,685,90,140,74,0,0,0,66,2020,775,0.584416,0.909091
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
513,513,Q1 2024,Rivers,15,32,4,21,17,0,0,0,4,2024,36,0.266667,1.400000
514,514,Q1 2024,Sokoto,24,122,41,52,41,0,0,0,11,2024,163,1.708333,2.166667
515,515,Q1 2024,Taraba,38,98,17,38,17,0,0,0,21,2024,115,0.447368,1.000000
516,516,Q1 2024,Yobe,39,234,13,55,38,0,0,0,17,2024,247,0.333333,1.410256


In [4]:

cleaned_road = cleaned_road.drop(columns=['Unnamed: 0'])

In [5]:

# Define high-risk states based on the mean of Total_Crashes
threshold = cleaned_road['Total_Crashes'].mean()
cleaned_road['High_Risk'] = (cleaned_road['Total_Crashes'] > threshold).astype(int)
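
As a small illustration of the labelling rule above, the sketch below applies the same mean-threshold logic to the five Q4 2020 rows shown in the preview. It uses only the standard library, and the names `sample_crashes`, `sample_threshold` and `sample_labels` are made up for this example. Because it sees five rows rather than the full dataset, its threshold, and therefore some of its labels, differ from the notebook's.

```python
from statistics import mean

# Total_Crashes for the first five rows of the preview (Q4 2020)
sample_crashes = {
    "Abia": 30,
    "Adamawa": 77,
    "Akwa Ibom": 22,
    "Anambra": 72,
    "Bauchi": 154,
}

# Threshold is the mean of this five-row sample only, not of the full dataset
sample_threshold = mean(sample_crashes.values())

# A record is flagged when its crash count is strictly above the threshold
sample_labels = {
    state: int(crashes > sample_threshold)
    for state, crashes in sample_crashes.items()
}

print(sample_threshold)
print(sample_labels)
```

In the full dataset Adamawa (77) and Anambra (72) are labelled 0 while Bauchi (154) is labelled 1, which implies that the full-data mean is at least 77 and below 154.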


In [6]:
cleaned_road

Unnamed: 0,Quarter,State,Total_Crashes,Num_Injured,Num_Killed,Total_Vehicles_Involved,SPV,DAD,PWR,FTQ,Other_Factors,Year,Casualty_count,Fatality_rate,Vehicle_crash_ratio,High_Risk
0,Q4 2020,Abia,30,146,31,37,19,0,0,0,18,2020,177,1.033333,1.233333,0
1,Q4 2020,Adamawa,77,234,36,94,57,0,0,0,37,2020,270,0.467532,1.220779,0
2,Q4 2020,Akwa Ibom,22,28,7,24,15,0,0,1,8,2020,35,0.318182,1.090909,0
3,Q4 2020,Anambra,72,152,20,83,43,1,0,0,39,2020,172,0.277778,1.152778,0
4,Q4 2020,Bauchi,154,685,90,140,74,0,0,0,66,2020,775,0.584416,0.909091,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
513,Q1 2024,Rivers,15,32,4,21,17,0,0,0,4,2024,36,0.266667,1.400000,0
514,Q1 2024,Sokoto,24,122,41,52,41,0,0,0,11,2024,163,1.708333,2.166667,0
515,Q1 2024,Taraba,38,98,17,38,17,0,0,0,21,2024,115,0.447368,1.000000,0
516,Q1 2024,Yobe,39,234,13,55,38,0,0,0,17,2024,247,0.333333,1.410256,0


In [7]:
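# Encode categorical features: Quarter and State
# A separate encoder per column keeps both mappings available for inverse_transform
quarter_encoder = LabelEncoder()
state_encoder = LabelEncoder()

cleaned_road['Quarter'] = quarter_encoder.fit_transform(cleaned_road['Quarter'])
cleaned_road['State'] = state_encoder.fit_transform(cleaned_road['State'])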
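
One detail of label encoding worth checking here: `LabelEncoder` numbers the unique values in sorted order, and for strings such as `'Q4 2020'` that order is lexicographic rather than chronological. The sketch below is a pure-Python stand-in for that behaviour, not the scikit-learn implementation. The quarter labels other than `'Q4 2020'` and `'Q1 2024'` are assumed for illustration, and `chronological_key` is a hypothetical helper.

```python
# Pure-Python stand-in for a label encoder: sort the unique values,
# then number them in that order.
quarters = ["Q4 2020", "Q1 2021", "Q2 2021", "Q1 2024"]

lexicographic_codes = {q: i for i, q in enumerate(sorted(set(quarters)))}


def chronological_key(label):
    """Turn 'Q4 2020' into (2020, 4) so that sorting follows the calendar."""
    quarter, year = label.split()
    return int(year), int(quarter[1:])


chronological_codes = {
    q: i for i, q in enumerate(sorted(set(quarters), key=chronological_key))
}

print(lexicographic_codes)   # 'Q4 2020' receives the largest code
print(chronological_codes)   # 'Q4 2020' receives the smallest code
```

If the time order of quarters matters to the model, an explicit mapping like `chronological_codes` may be a better fit than the default sorted order. For `State`, where no order is meaningful, one-hot encoding is a common alternative for linear models.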

In [8]:
# Drop 'Total_Crashes' to avoid data leakage
X_cleaned = cleaned_road.drop(columns=['Total_Crashes', 'High_Risk'])
y = cleaned_road['High_Risk']
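
Dropping `Total_Crashes` removes the column itself, but two of the remaining features look like ratios built on it. The sketch below assumes, as the preview values suggest, that `Fatality_rate = Num_Killed / Total_Crashes` and `Vehicle_crash_ratio = Total_Vehicles_Involved / Total_Crashes`. Under that assumption it checks whether the crash count can be recovered from the retained columns, using three rows copied from the preview.

```python
# Values copied from the preview rows:
# (Total_Crashes, Num_Killed, Total_Vehicles_Involved, Fatality_rate, Vehicle_crash_ratio)
preview_rows = {
    "Abia": (30, 31, 37, 1.033333, 1.233333),
    "Adamawa": (77, 36, 94, 0.467532, 1.220779),
    "Bauchi": (154, 90, 140, 0.584416, 0.909091),
}

recovered = {}
for state, (crashes, killed, vehicles, fatality_rate, vehicle_ratio) in preview_rows.items():
    # Invert each assumed ratio to get back an estimate of Total_Crashes
    from_vehicles = round(vehicles / vehicle_ratio)
    from_killed = round(killed / fatality_rate)
    recovered[state] = (from_vehicles, from_killed)
    print(f"{state}: actual={crashes}, from vehicles={from_vehicles}, from killed={from_killed}")
```

If the same relationship holds across the whole dataset, the target is still derivable from the features, which could explain very high scores. Re-running the models without `Fatality_rate` and `Vehicle_crash_ratio` would be one way to test this.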


In [9]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_cleaned, y, test_size=0.2, random_state=23)


**Base Model**

In [10]:
# Scale the features for Logistic Regression
scaler = StandardScaler()

# Scale the training data
X_train_scaled = scaler.fit_transform(X_train)

# Scale the testing data
X_test_scaled = scaler.transform(X_test)

# Initialize Logistic Regression with class weight handling
lr_model = LogisticRegression(class_weight='balanced', random_state=23)

# Fit the Logistic Regression model
lr_model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred_lr = lr_model.predict(X_test_scaled)

# Evaluate the Logistic Regression model
print("Confusion Matrix (Logistic Regression):\n", confusion_matrix(y_test, y_pred_lr))
print("\nClassification Report (Logistic Regression):\n", classification_report(y_test, y_pred_lr))
print("\nWeighted F1 Score (Logistic Regression): {:.4f}".format(f1_score(y_test, y_pred_lr, average='weighted')))

Confusion Matrix (Logistic Regression):
 [[60  2]
 [ 0 42]]

Classification Report (Logistic Regression):
               precision    recall  f1-score   support

           0       1.00      0.97      0.98        62
           1       0.95      1.00      0.98        42

    accuracy                           0.98       104
   macro avg       0.98      0.98      0.98       104
weighted avg       0.98      0.98      0.98       104


Weighted F1 Score (Logistic Regression): 0.9808
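
To make the link between the confusion matrix and the report explicit, the following sketch recomputes precision, recall, F1 and the weighted F1 score directly from the matrix printed above. It is plain Python rather than scikit-learn, and `per_class_metrics` and `weighted_f1` are helper names introduced only for this example. Rows are taken as true classes and columns as predicted classes, matching the `confusion_matrix` convention.

```python
def per_class_metrics(matrix):
    """Return {class: (precision, recall, f1, support)} for a square confusion
    matrix whose rows are true classes and columns are predicted classes."""
    n = len(matrix)
    results = {}
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column total
        actual_k = sum(matrix[k])                              # row total (support)
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[k] = (precision, recall, f1, actual_k)
    return results


def weighted_f1(matrix):
    """Average the per-class F1 scores, weighting each class by its support."""
    metrics = per_class_metrics(matrix)
    total = sum(support for *_, support in metrics.values())
    return sum(f1 * support for _, _, f1, support in metrics.values()) / total


lr_baseline_matrix = [[60, 2], [0, 42]]
for label, (p, r, f1, support) in per_class_metrics(lr_baseline_matrix).items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
print(f"weighted F1 = {weighted_f1(lr_baseline_matrix):.4f}")
```

For class 1, precision is 42 / (42 + 2) and recall is 42 / 42, which is where the 0.95 and 1.00 in the report come from.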


In [11]:

# Train the Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42, class_weight='balanced')
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf_model.predict(X_test)


In [12]:
# Evaluate the Random Forest model
print("Confusion Matrix (Random Forest):\n", confusion_matrix(y_test, y_pred_rf))
print("\nClassification Report (Random Forest):\n", classification_report(y_test, y_pred_rf))
print("\nWeighted F1 Score (Random Forest): {:.4f}".format(f1_score(y_test, y_pred_rf, average='weighted')))

Confusion Matrix (Random Forest):
 [[59  3]
 [ 3 39]]

Classification Report (Random Forest):
               precision    recall  f1-score   support

           0       0.95      0.95      0.95        62
           1       0.93      0.93      0.93        42

    accuracy                           0.94       104
   macro avg       0.94      0.94      0.94       104
weighted avg       0.94      0.94      0.94       104


Weighted F1 Score (Random Forest): 0.9423


**Hyper-parameter Tuning**

In [13]:
# Define the parameter grid for Logistic Regression
param_grid_logreg = {
    'C': [0.1, 1, 10],          # Regularization strength
    'solver': ['liblinear', 'saga']  # Solver types
}

# Initialize Logistic Regression
logreg = LogisticRegression(max_iter=10000)

# Initialize GridSearchCV
grid_search_logreg = GridSearchCV(logreg, param_grid_logreg, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the model
grid_search_logreg.fit(X_train_scaled, y_train)

# Get the best parameters and the best score
print(f"Best Parameters (Logistic Regression): {grid_search_logreg.best_params_}")
print(f"Best Cross-Validation Score (Logistic Regression): {grid_search_logreg.best_score_}")

# Get the best model after grid search
logreg_best_model = grid_search_logreg.best_estimator_



Best Parameters (Logistic Regression): {'C': 10, 'solver': 'liblinear'}
Best Cross-Validation Score (Logistic Regression): 0.9855421686746988
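
In the search above, the scaler was fitted on the whole training set before cross-validation, so each validation fold has already contributed to the scaling statistics. One possible alternative, sketched below, wraps the scaler and the classifier in a `Pipeline` so that scaling is refitted inside every fold. The data comes from `make_classification` as a synthetic stand-in (an assumption, not the crash dataset), and the step names `scaler` and `clf` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): swap in X_cleaned and y for real use
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=23)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=23, stratify=y_demo
)

# Scaling and the classifier travel together, so each CV fold fits its own scaler
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=10000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__solver": ["liblinear", "saga"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Held-out accuracy: {search.score(X_te, y_te):.3f}")
```

With this layout the raw, unscaled test features can be passed straight to `search.score` or `search.predict`, since the fitted pipeline applies the scaler itself.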


In [14]:
# Evaluate on test set
logreg_test_score = logreg_best_model.score(X_test_scaled, y_test)
print(f"Test Set Accuracy (Logistic Regression): {logreg_test_score}")

# Make predictions on the test set
y_pred_lr = logreg_best_model.predict(X_test_scaled)

# Confusion Matrix
print("Confusion Matrix (Logistic Regression):\n", confusion_matrix(y_test, y_pred_lr))

# Classification Report
print("\nClassification Report (Logistic Regression):\n", classification_report(y_test, y_pred_lr))

# Weighted F1 Score
weighted_f1 = f1_score(y_test, y_pred_lr, average='weighted')
print("\nWeighted F1 Score (Logistic Regression): {:.4f}".format(weighted_f1))

Test Set Accuracy (Logistic Regression): 1.0
Confusion Matrix (Logistic Regression):
 [[62  0]
 [ 0 42]]

Classification Report (Logistic Regression):
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        62
           1       1.00      1.00      1.00        42

    accuracy                           1.00       104
   macro avg       1.00      1.00      1.00       104
weighted avg       1.00      1.00      1.00       104


Weighted F1 Score (Logistic Regression): 1.0000


**Logistic Regression Conclusion**

Logistic Regression stands out as the stronger model in this analysis. The baseline model (class_weight='balanced', default C) reached 98% accuracy and a weighted F1 score of 0.9808 on the test set. Its recall for high-risk states was 1.00, so no high-risk state was missed, and its precision was 0.95 for high-risk states and 1.00 for non-high-risk states. After hyperparameter tuning (best parameters C=10, solver='liblinear', with a cross-validation accuracy of 0.9855), the model classified all 104 test records correctly, giving an accuracy and weighted F1 score of 1.00.

These metrics highlight Logistic Regression's reliability, especially in avoiding false negatives, which matters most when the goal is to flag high-risk states for government or other decision-making bodies. Hyperparameter tuning removed the two false positives left by the baseline model, further supporting its suitability for this task.

In [15]:
# Define the parameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [50, 100, 200],  # Number of trees
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4]     # Minimum number of samples required at a leaf node
}

# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
grid_search_rf = GridSearchCV(rf_model, param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the model
grid_search_rf.fit(X_train, y_train)

# Get the best parameters and the best score
print(f"Best Parameters (Random Forest): {grid_search_rf.best_params_}")
print(f"Best Cross-Validation Score (Random Forest): {grid_search_rf.best_score_}")



Best Parameters (Random Forest): {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best Cross-Validation Score (Random Forest): 0.963767264178666


In [16]:
# Evaluate on test set
rf_best_model = grid_search_rf.best_estimator_
rf_test_score = rf_best_model.score(X_test, y_test)
print(f"Test Set Accuracy (Random Forest): {rf_test_score}")

# Get predictions for evaluation
y_pred_rf = rf_best_model.predict(X_test)

# Confusion matrix
print("\nConfusion Matrix (Random Forest):\n", confusion_matrix(y_test, y_pred_rf))

# Classification report
print("\nClassification Report (Random Forest):\n", classification_report(y_test, y_pred_rf))

# Weighted F1 Score
print("\nWeighted F1 Score (Random Forest): {:.4f}".format(f1_score(y_test, y_pred_rf, average='weighted')))

Test Set Accuracy (Random Forest): 0.9423076923076923

Confusion Matrix (Random Forest):
 [[59  3]
 [ 3 39]]

Classification Report (Random Forest):
               precision    recall  f1-score   support

           0       0.95      0.95      0.95        62
           1       0.93      0.93      0.93        42

    accuracy                           0.94       104
   macro avg       0.94      0.94      0.94       104
weighted avg       0.94      0.94      0.94       104


Weighted F1 Score (Random Forest): 0.9423


**Random Forest Conclusion**

Random Forest, while slightly behind Logistic Regression, also delivered strong results. Hyperparameter tuning selected n_estimators=200, max_depth=None, min_samples_split=2, and min_samples_leaf=1. With these settings, the model achieved an overall accuracy of 94% and a weighted F1 score of 0.9423 on the test set, the same figures as the untuned baseline.

The confusion matrix shows that Random Forest correctly identified 39 of the 42 high-risk records and missed 3, giving a recall of 0.93 for the high-risk class. Precision for that class was also 0.93, since 3 non-high-risk records were flagged as high-risk. For non-high-risk records, recall and precision were both 0.95, a strong result but slightly weaker than that of Logistic Regression.

While Random Forest does not outperform Logistic Regression, it offers value in capturing complex relationships within the data. Its strengths may complement Logistic Regression in ensemble methods or when more intricate patterns need exploration.

#### **OVERALL CONCLUSION**
Both models were tuned with 5-fold cross-validated grid search using accuracy as the scoring metric. On the test set, Logistic Regression reached 100% accuracy and Random Forest reached 94%. These models can help identify high-risk states so that mitigation efforts can be targeted where they are most needed.