Modeling:

The modeling process trains and evaluates two machine-learning models:
1. Random Forest, and
2. Gradient Boosting.

Both models predict crash severity from key features such as weather conditions, speed limits, and road characteristics. Each model is scored on accuracy, precision, recall, and F1-score, with the aim of identifying the best predictors of high-severity crashes in Chicago in 2022.
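
One common way to see which features a fitted tree ensemble relies on is its impurity-based feature importances. The sketch below is only an illustration on synthetic data: the column names are hypothetical stand-ins, not the columns of Crash_data.csv, and the ranking it prints says nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the crash data; the column names are hypothetical.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
feature_names = ["weather", "speed_limit", "road_surface", "lighting"]
X_demo = pd.DataFrame(X_demo, columns=feature_names)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; a larger value means the feature
# contributed more to the splits across the forest.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a definitive one.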

In [6]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
import numpy as np

In [7]:
# Load the dataset into a new DataFrame, Car_Crash
Car_Crash = pd.read_csv('Crash_data.csv')

In [8]:
# Split dataset into features and target
X = Car_Crash.drop(columns=['MOST_SEVERE_INJURY'])
y = Car_Crash['MOST_SEVERE_INJURY']

In [9]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
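
Because the severity classes are heavily imbalanced, a plain random split can leave the rare classes unevenly spread between the training and test sets. A possible alternative, sketched here on toy labels rather than the crash data, is to pass `stratify` to `train_test_split`. Stratification needs at least two rows in every class, so it may not work unchanged if a class such as FATAL has a single row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 90 majority labels and 10 minority labels (illustrative only).
y_toy = np.array(["NO INJURY"] * 90 + ["INJURY"] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the 90/10 class ratio in both the training and the test portion.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
minority_in_test = int((y_te == "INJURY").sum())
print(minority_in_test, "minority rows out of", len(y_te))
```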

In [10]:
# Initialize the scaler
scaler = StandardScaler()

In [11]:
# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

In [12]:
# Transform the test data using the fitted scaler
X_test_scaled = scaler.transform(X_test)
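
When scaled data is later passed to cross-validation, the scaler has already seen every row of the training set, including the rows each fold holds out for validation. One way to avoid that, shown here as a sketch on synthetic data, is to wrap the scaler and the model in a pipeline so the scaler is refit inside each fold. Tree-based models are generally insensitive to feature scaling, so for the two models in this notebook the step mostly matters for consistency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data standing in for the encoded crash features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

# Inside cross_val_score the whole pipeline is refit per fold, so the scaler
# only ever learns its mean and variance from that fold's training rows.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean())
```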

In [13]:
# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

In [14]:
# Cross-validation for Random Forest
cv_scores_rf = cross_val_score(rf_model, X_train_scaled, y_train, cv=5, scoring='accuracy')
print(cv_scores_rf)




[0.85456731 0.86658654 0.84975962 0.85817308 0.86418269]


In [15]:
# Print the mean of the cross-validation scores using np.mean()
print(np.mean(cv_scores_rf))


0.8586538461538462
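
Passing `cv=5` for a classifier already uses stratified folds, but without shuffling, and accuracy alone can hide poor performance on rare classes. The sketch below, on synthetic imbalanced data, shows one way to make the folds explicit with `StratifiedKFold` and to score macro-averaged F1 next to accuracy. The data and the resulting numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (roughly 90% / 10%), used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=42
)

# Shuffled, stratified folds keep the class ratio similar in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Same folds, two metrics: accuracy and macro-averaged F1.
acc = cross_val_score(model, X_demo, y_demo, cv=skf, scoring="accuracy")
f1m = cross_val_score(model, X_demo, y_demo, cv=skf, scoring="f1_macro")
print("accuracy:", acc.mean(), "macro F1:", f1m.mean())
```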


In [16]:
# Train Random Forest model
rf_model.fit(X_train_scaled, y_train)

Prediction Evaluation for Random Forest Model: measures the accuracy, precision, recall, and F1-score of the Random Forest model on the held-out test set.

In [17]:
# Predictions and evaluation for Random Forest
y_pred_rf = rf_model.predict(X_test_scaled)
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))


Random Forest Classification Report:
                          precision    recall  f1-score   support

                   FATAL       0.00      0.00      0.00         1
   INCAPACITATING INJURY       0.00      0.00      0.00        15
 NO INDICATION OF INJURY       0.88      0.99      0.93       910
NONINCAPACITATING INJURY       0.17      0.02      0.04        83
   REPORTED, NOT EVIDENT       0.25      0.03      0.06        32

                accuracy                           0.86      1041
               macro avg       0.26      0.21      0.21      1041
            weighted avg       0.79      0.86      0.82      1041
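
The 0.00 rows for FATAL and INCAPACITATING INJURY arise because the model never predicts those classes. Precision is then undefined, and scikit-learn reports it as 0 and raises an UndefinedMetricWarning. A minimal sketch on toy labels shows how the `zero_division` argument makes that behavior explicit; the labels here are made up for the example.

```python
from sklearn.metrics import classification_report

# Toy labels: the classifier never predicts the rare "FATAL" class.
y_true = ["NONE", "NONE", "NONE", "FATAL"]
y_pred = ["NONE", "NONE", "NONE", "NONE"]

# zero_division=0 reports 0.0 for the undefined precision without a warning.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["FATAL"]["precision"], report["NONE"]["recall"])
```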





In [18]:
print("Accuracy:", accuracy_score(y_test, y_pred_rf))

Accuracy: 0.8645533141210374
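
Accuracy on imbalanced data is easier to interpret next to a majority-class baseline. In the classification report above, 910 of the 1,041 test rows are NO INDICATION OF INJURY, so always predicting that class would score 910/1041, or about 0.874. The sketch below shows how such a baseline could be computed with `DummyClassifier`; the toy target is an assumption chosen to mimic that kind of imbalance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy target with a similar imbalance: 87 majority rows and 13 minority rows.
y_toy = np.array([0] * 87 + [1] * 13)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# strategy="most_frequent" always predicts the most common training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)
```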


Gradient Boosting Model:

After evaluating the Random Forest model, we fit a second machine-learning model, Gradient Boosting. Gradient Boosting builds its trees sequentially, with each new tree fitted to correct the errors of the ensemble so far, which can improve predictive accuracy. Training it on the same 2022 Chicago car crash dataset lets us compare the two models' predictions of the target, MOST_SEVERE_INJURY.
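
The stage-by-stage nature of Gradient Boosting can be inspected with `staged_predict`, which yields the ensemble's predictions after each added tree. The following is a rough sketch on synthetic data, not the crash dataset, and the accuracy values it prints are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only to illustrate the stage-by-stage behavior.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gb = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators boosting stages.
staged_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]
print("after first stage:", staged_acc[0], "after last stage:", staged_acc[-1])
```

Plotting `staged_acc` against the stage number is a simple way to judge whether more trees are still helping on held-out data.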

In [19]:
# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier(random_state=42)

In [20]:
# Cross-validation for Gradient Boosting
cv_scores_gb = cross_val_score(gb_model, X_train_scaled, y_train, cv=5, scoring='accuracy')
print(cv_scores_gb)



[0.87259615 0.875      0.86418269 0.87139423 0.86658654]


In [21]:
# Print the mean of the cross-validation scores using np.mean()
print(np.mean(cv_scores_gb))

0.869951923076923


In [22]:
# Train Gradient Boosting model
gb_model.fit(X_train_scaled, y_train)

Prediction Evaluation for Gradient Boosting Model: measures the accuracy, precision, recall, and F1-score of the Gradient Boosting model on the held-out test set.

In [23]:
# Predictions and evaluation for Gradient Boosting
y_pred_gb = gb_model.predict(X_test_scaled)
print("\nGradient Boosting Classification Report:")
print(classification_report(y_test, y_pred_gb))


Gradient Boosting Classification Report:
                          precision    recall  f1-score   support

                   FATAL       0.00      0.00      0.00         1
   INCAPACITATING INJURY       0.00      0.00      0.00        15
 NO INDICATION OF INJURY       0.88      0.99      0.93       910
NONINCAPACITATING INJURY       0.25      0.01      0.02        83
   REPORTED, NOT EVIDENT       0.00      0.00      0.00        32

                accuracy                           0.87      1041
               macro avg       0.23      0.20      0.19      1041
            weighted avg       0.79      0.87      0.82      1041



In [24]:
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))

Gradient Boosting Accuracy: 0.8683957732949087


Conclusion:

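Gradient Boosting reaches a test accuracy of 86.84%, compared with 86.46% for Random Forest, so Gradient Boosting is slightly more accurate, by about 0.4 percentage points. Its weighted-average recall is correspondingly higher (0.87 versus 0.86), while both models have a weighted-average precision of 0.79. These figures are dominated by the NO INDICATION OF INJURY class, which makes up 910 of the 1,041 test rows. Neither model correctly identifies any FATAL or INCAPACITATING INJURY case, and the macro-averaged F1-scores are low (0.21 for Random Forest and 0.19 for Gradient Boosting). In conclusion, Gradient Boosting is the better choice for maximizing overall accuracy, but neither model yet identifies high-severity crashes reliably.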
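
The precision figure quoted here is the weighted average, which weights each class by its support and so mostly reflects the majority class. A small worked example on toy labels, not the crash data, shows how far weighted and macro averages can diverge when a rare class is never predicted.

```python
from sklearn.metrics import precision_score

# Toy labels: nine majority rows, one minority row, and a model that
# always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

# Per-class precision is 0.9 for class 0 and 0.0 for class 1 (never predicted).
weighted = precision_score(y_true, y_pred, average="weighted", zero_division=0)
macro = precision_score(y_true, y_pred, average="macro", zero_division=0)
print("weighted:", weighted, "macro:", macro)
```

The weighted average works out to 0.9 × 0.9 + 0.1 × 0.0 = 0.81, while the macro average is (0.9 + 0.0) / 2 = 0.45.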