Modeling:

The modeling process trains and evaluates two machine learning models, Random Forest and Gradient Boosting, to predict crash severity from features such as weather conditions, speed limits, and road characteristics. The models are compared on accuracy, precision, recall, and F1-score to identify which one best predicts crash severity in Chicago in 2022.

Import Libraries: The imports below provide the modules, classes, and functions used in the rest of the notebook.

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score

Data Collection: Data collection is the process of gathering information from sources in various formats. Here, the data is stored in a CSV file.

In [3]:
#loading dataset in a new dataframe 'Car_Crash'
Car_Crash = pd.read_csv('Crash_analyzed.csv')

In [4]:
#To see the sample of five rows, use .head() method.
Car_Crash.head()

Unnamed: 0,CRASH_DATE,POSTED_SPEED_LIMIT,WEATHER_CONDITION,TRAFFICWAY_TYPE,ROADWAY_SURFACE_COND,STREET_DIRECTION,MOST_SEVERE_INJURY,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH
0,2022-01-31,25,CLEAR,ONE-WAY,DRY,W,NO INDICATION OF INJURY,19,2,1
1,2022-01-01,10,SNOW,PARKING LOT,SNOW OR SLUSH,W,NO INDICATION OF INJURY,16,7,1
2,2022-01-30,25,CLEAR,ONE-WAY,SNOW OR SLUSH,W,NO INDICATION OF INJURY,8,1,1
3,2022-05-28,25,CLEAR,ONE-WAY,DRY,W,NO INDICATION OF INJURY,17,7,5
4,2022-04-16,10,CLEAR,PARKING LOT,DRY,W,NO INDICATION OF INJURY,11,7,4


In [5]:
#To see columns names and dtype in Car_Crash
Car_Crash.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5201 entries, 0 to 5200
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   CRASH_DATE            5201 non-null   object
 1   POSTED_SPEED_LIMIT    5201 non-null   int64 
 2   WEATHER_CONDITION     5201 non-null   object
 3   TRAFFICWAY_TYPE       5201 non-null   object
 4   ROADWAY_SURFACE_COND  5201 non-null   object
 5   STREET_DIRECTION      5201 non-null   object
 6   MOST_SEVERE_INJURY    5201 non-null   object
 7   CRASH_HOUR            5201 non-null   int64 
 8   CRASH_DAY_OF_WEEK     5201 non-null   int64 
 9   CRASH_MONTH           5201 non-null   int64 
dtypes: int64(4), object(6)
memory usage: 406.5+ KB


Remove the CRASH_DATE column with the .drop() method; the crash hour, day of week, and month are already available as separate columns.

In [6]:
# Drop unnecessary columns
Crash_data = Car_Crash.drop(columns=['CRASH_DATE'])

In [7]:
#To see number of rows and columns in Crash_data 
Crash_data.shape

(5201, 9)

Before splitting the dataset, encode the categorical columns (including the target, MOST_SEVERE_INJURY) as integers.

In [8]:
# Encode categorical features
label_encoders = {}
for column in ['WEATHER_CONDITION', 'TRAFFICWAY_TYPE', 'ROADWAY_SURFACE_COND', 'STREET_DIRECTION', 'MOST_SEVERE_INJURY']:
    le = LabelEncoder()
    Crash_data[column] = le.fit_transform(Crash_data[column])
    label_encoders[column] = le
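
The integer codes produced by `LabelEncoder` follow the sorted (alphabetical) order of the original labels, so it can help to print the mapping before reading the classification reports further down. The following is a minimal sketch on a toy series, not on `Crash_analyzed.csv`: only `NO INDICATION OF INJURY` appears in the notebook output, while the labels `FATAL` and `INCAPACITATING INJURY` and the names `toy_labels`, `toy_encoder`, and `code_to_label` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target column; only 'NO INDICATION OF INJURY' is taken from the notebook
# output, the other two labels are assumptions for illustration.
toy_labels = pd.Series([
    "NO INDICATION OF INJURY",
    "FATAL",
    "NO INDICATION OF INJURY",
    "INCAPACITATING INJURY",
])

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted, so position i holds the label encoded as integer i
code_to_label = {code: label for code, label in enumerate(toy_encoder.classes_)}
print(code_to_label)

# inverse_transform turns integer codes back into the original labels
decoded = toy_encoder.inverse_transform(toy_codes)
print(list(decoded))
```

In the notebook itself, the equivalent lookup would be `label_encoders['MOST_SEVERE_INJURY'].classes_`.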


In [9]:
# Split dataset into features and target
X = Crash_data.drop(columns=['MOST_SEVERE_INJURY'])
y = Crash_data['MOST_SEVERE_INJURY']

In [10]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5201 entries, 0 to 5200
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   POSTED_SPEED_LIMIT    5201 non-null   int64
 1   WEATHER_CONDITION     5201 non-null   int32
 2   TRAFFICWAY_TYPE       5201 non-null   int32
 3   ROADWAY_SURFACE_COND  5201 non-null   int32
 4   STREET_DIRECTION      5201 non-null   int32
 5   CRASH_HOUR            5201 non-null   int64
 6   CRASH_DAY_OF_WEEK     5201 non-null   int64
 7   CRASH_MONTH           5201 non-null   int64
dtypes: int32(4), int64(4)
memory usage: 243.9 KB


In [11]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
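
With a strongly imbalanced target, a purely random split can leave rare classes almost absent from one partition. One possible alternative, shown here as a sketch on synthetic data rather than as the split used for the results in this notebook, is to pass `stratify` to `train_test_split`. The arrays `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 90 samples of class 0, 10 of class 1 (illustrative only)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 90/10 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train class counts:", np.bincount(y_tr))
print("test class counts:", np.bincount(y_te))
```

Stratified splitting requires at least two samples of every class, so whether it applies to the crash data as-is would need to be checked against the rarest severity class.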

Cross-validation: 
A technique to assess model performance by splitting the data into multiple training and testing subsets.

In [12]:
# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
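
To illustrate what `StratifiedKFold` does, the sketch below counts the classes in each validation fold of a synthetic 90/10 target. The data and the names `X_demo`, `y_demo`, `cv_demo`, and `fold_counts` are assumptions for illustration; the fold settings mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced target: 90 samples of class 0, 10 of class 1 (illustrative only)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # feature values do not affect how folds are formed

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count how many samples of each class land in every validation fold
fold_counts = [
    np.bincount(y_demo[val_idx], minlength=2).tolist()
    for _, val_idx in cv_demo.split(X_demo, y_demo)
]
print(fold_counts)
```

Each validation fold keeps the same class ratio as the full target, which is why stratified folds are a common choice for imbalanced classification.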

In [13]:
# Random Forest Model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

In [15]:
# Cross-validation for Random Forest
rf_cv_scores = cross_val_score(rf_model, X_train, y_train, cv=cv, scoring='accuracy')
print("Random Forest Cross-Validation Accuracy Scores:", rf_cv_scores)



Random Forest Cross-Validation Accuracy Scores: [0.86658654 0.85576923 0.86538462 0.86177885 0.86418269]


In [16]:
#To check mean value for the CV Accuracy
print("Mean CV Accuracy:", np.mean(rf_cv_scores))

Mean CV Accuracy: 0.8627403846153847


Prediction Evaluation for Random Forest Model:
Measures the accuracy, precision, recall, and F1-score of the Random Forest model.

In [17]:
# Predictions and evaluation for Random Forest
y_pred_rf = rf_model.predict(X_test)
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))



Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.00      0.00      0.00        15
           2       0.88      0.99      0.93       910
           3       0.20      0.02      0.04        83
           4       0.17      0.03      0.05        32

    accuracy                           0.87      1041
   macro avg       0.25      0.21      0.21      1041
weighted avg       0.79      0.87      0.82      1041
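
The gap between the `macro avg` and `weighted avg` rows comes from how the per-class values are combined: the macro average treats every class equally, while the weighted average weights each class by its support. As a worked check, the sketch below recomputes both precision averages from the rounded per-class values in the report above (small rounding differences from the exact scores are expected).

```python
# Per-class precision and support copied from the Random Forest report
# (precision values are rounded to two decimals in the report)
precision = [0.00, 0.00, 0.88, 0.20, 0.17]
support = [1, 15, 910, 83, 32]

total = sum(support)

# Macro average: plain mean over classes, regardless of class size
macro_avg = sum(precision) / len(precision)

# Weighted average: each class contributes in proportion to its support
weighted_avg = sum(p * n for p, n in zip(precision, support)) / total

print(f"macro avg precision:    {macro_avg:.2f}")
print(f"weighted avg precision: {weighted_avg:.2f}")
```

Because class 2 accounts for 910 of the 1,041 test rows, the weighted average stays close to that class's precision even though the other classes score near zero.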





In [18]:
#Accuracy Score
print("Accuracy:", accuracy_score(y_test, y_pred_rf))

Accuracy: 0.8655139289145053


In [19]:
#Precision Score
print("Precision:", precision_score(y_test, y_pred_rf, average='weighted'))

Precision: 0.789918844828565




In [20]:
#Recall Score
print("Recall:", recall_score(y_test, y_pred_rf, average='weighted'))

Recall: 0.8655139289145053


In [21]:
# F1 Score
print("F1 Score:", f1_score(y_test, y_pred_rf, average='weighted'))

F1 Score: 0.8180924274063786


Gradient Boosting Model:

After evaluating the Random Forest model, we proceed with a second machine learning model, Gradient Boosting. Gradient Boosting builds its trees sequentially, with each new tree correcting the errors of the ensemble built so far, which can improve predictive accuracy. Training the second model on the same 2022 Chicago car crash dataset lets us compare how well each model predicts the target (MOST_SEVERE_INJURY).

In [22]:
# Gradient Boosting Model
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
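
The stage-by-stage nature of Gradient Boosting can be observed with the `staged_predict` method, which yields the ensemble's predictions after each added tree. The sketch below uses synthetic data from `make_classification` and a reduced `n_estimators=20`; both are assumptions chosen to keep the example small, not settings used for the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

gb_demo = GradientBoostingClassifier(n_estimators=20, random_state=42)
gb_demo.fit(X_demo, y_demo)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
staged_acc = [
    accuracy_score(y_demo, stage_pred)
    for stage_pred in gb_demo.staged_predict(X_demo)
]

print("training accuracy after first tree:", staged_acc[0])
print("training accuracy after last tree: ", staged_acc[-1])
```

The accuracies here are measured on the training data, so they show how the fit evolves across stages rather than how well the model generalizes.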

Cross-validation: Here, we assess the performance and stability of the Gradient Boosting model using the same cross-validation setup as for Random Forest.

In [23]:
# Cross-validation for Gradient Boosting
gb_cv_scores = cross_val_score(gb_model, X_train, y_train, cv=cv, scoring='accuracy')
print("\nGradient Boosting Cross-Validation Accuracy Scores:", gb_cv_scores)





Gradient Boosting Cross-Validation Accuracy Scores: [0.87139423 0.86778846 0.87139423 0.87019231 0.87019231]


In [24]:
#To check the mean value for the CV accuracy
print("Mean CV Accuracy:", np.mean(gb_cv_scores))

Mean CV Accuracy: 0.8701923076923077


Prediction Evaluation for Gradient Boosting Model:
Measures the accuracy, precision, recall, and F1-score of the Gradient Boosting model.

In [26]:
# Predictions and evaluation for Gradient Boosting
y_pred_gb = gb_model.predict(X_test)
print("\nGradient Boosting Classification Report:")
print(classification_report(y_test, y_pred_gb))



Gradient Boosting Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.00      0.00      0.00        15
           2       0.87      1.00      0.93       910
           3       0.00      0.00      0.00        83
           4       0.00      0.00      0.00        32

    accuracy                           0.87      1041
   macro avg       0.17      0.20      0.19      1041
weighted avg       0.76      0.87      0.81      1041





In [27]:
#Accuracy Score
print("Accuracy:", accuracy_score(y_test, y_pred_gb))

Accuracy: 0.8712776176753122


In [28]:
#Precision Score
print("Precision:", precision_score(y_test, y_pred_gb, average='weighted'))

Precision: 0.7645734157035043




In [29]:
#Recall Score
print("Recall:", recall_score(y_test, y_pred_gb, average='weighted'))

Recall: 0.8712776176753122


In [30]:
#F1 Score
print("F1 Score:", f1_score(y_test, y_pred_gb, average='weighted'))

F1 Score: 0.8144454361423052


Conclusion: 

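Gradient Boosting reaches a test accuracy of 87.13%, compared with 86.55% for Random Forest, so Gradient Boosting is slightly more accurate. Because weighted recall equals accuracy, its weighted recall is higher by the same margin. Random Forest, however, has the higher weighted precision (78.99% versus 76.46%) and a marginally higher weighted F1-score (81.81% versus 81.44%). Both results are dominated by the majority class (label 2, 910 of the 1,041 test rows): the classification reports show 0.00 precision and recall for Gradient Boosting on every other class, and Random Forest recovers only a small share of classes 3 and 4. In conclusion, Gradient Boosting is better for maximizing overall accuracy, while Random Forest is preferable if precision is more critical.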
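
As a sanity check on the accuracy comparison, it can be useful to compare both models against a hypothetical baseline that always predicts the most frequent class. The sketch below uses only the test-set supports and accuracy values printed earlier in this notebook; the names `baseline_accuracy` and `reported` are introduced here for illustration.

```python
# Test-set class supports copied from the classification reports
support = [1, 15, 910, 83, 32]

# Accuracy of a hypothetical model that always predicts the most frequent class
baseline_accuracy = max(support) / sum(support)

# Test accuracies printed by the accuracy_score cells
reported = {
    "Random Forest": 0.8655139289145053,
    "Gradient Boosting": 0.8712776176753122,
}

for name, acc in reported.items():
    diff = acc - baseline_accuracy
    print(f"{name}: {acc:.4f} vs baseline {baseline_accuracy:.4f} ({diff:+.4f})")
```

Both reported accuracies fall slightly below this majority-class baseline of about 87.42%, which suggests that accuracy alone says little about how well the less frequent severity classes are predicted; the per-class rows and macro averages are more informative for that purpose.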