Pre-Processing and Training Data

The 2022 Chicago car crash dataset is used here to train the machine learning model. Training data is the data the model learns from: it is where the model picks up the relationships between the features and the target variable.

Import Libraries: the imports below provide the modules, functions, and classes used in this section.

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, accuracy_score


Load Data or Data Collection: Data collection is the process of gathering information from various sources and formats. Here, the data is in a CSV file.

In [2]:
#loading dataset in a new dataframe 'Car_Crash'
Car_Crash = pd.read_csv('Crash_analyzed.csv')

In [3]:
# Show the first five rows with the .head() method
Car_Crash.head()

Unnamed: 0,CRASH_DATE,POSTED_SPEED_LIMIT,WEATHER_CONDITION,TRAFFICWAY_TYPE,ROADWAY_SURFACE_COND,STREET_DIRECTION,MOST_SEVERE_INJURY,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH
0,2022-01-31,25,CLEAR,ONE-WAY,DRY,W,NO INDICATION OF INJURY,19,2,1
1,2022-01-01,10,SNOW,PARKING LOT,SNOW OR SLUSH,W,NO INDICATION OF INJURY,16,7,1
2,2022-01-30,25,CLEAR,ONE-WAY,SNOW OR SLUSH,W,NO INDICATION OF INJURY,8,1,1
3,2022-05-28,25,CLEAR,ONE-WAY,DRY,W,NO INDICATION OF INJURY,17,7,5
4,2022-04-16,10,CLEAR,PARKING LOT,DRY,W,NO INDICATION OF INJURY,11,7,4


In [4]:
#To see columns names and dtype in Car_Crash
Car_Crash.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5201 entries, 0 to 5200
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   CRASH_DATE            5201 non-null   object
 1   POSTED_SPEED_LIMIT    5201 non-null   int64 
 2   WEATHER_CONDITION     5201 non-null   object
 3   TRAFFICWAY_TYPE       5201 non-null   object
 4   ROADWAY_SURFACE_COND  5201 non-null   object
 5   STREET_DIRECTION      5201 non-null   object
 6   MOST_SEVERE_INJURY    5201 non-null   object
 7   CRASH_HOUR            5201 non-null   int64 
 8   CRASH_DAY_OF_WEEK     5201 non-null   int64 
 9   CRASH_MONTH           5201 non-null   int64 
dtypes: int64(4), object(6)
memory usage: 406.5+ KB


The 'CRASH_DATE' column is stored as text and is not useful as a training feature in that form; the month, day of week, and hour of each crash are already available in separate numeric columns. So, drop this column from the dataset.

In [5]:
# Drop unnecessary columns
Crash_data = Car_Crash.drop(columns=['CRASH_DATE'])


In [6]:
#To see number of rows and columns in Crash_data 
Crash_data.shape

(5201, 9)
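
Before dropping a column, it can be worth checking that its information really is available elsewhere. The sketch below is a minimal check on the five rows shown in the Car_Crash.head() output above; the name sample and the Sunday=1 weekday convention are assumptions inferred from those five rows, not something stated by the dataset.

```python
import pandas as pd

# Five sample rows copied from the Car_Crash.head() output above.
sample = pd.DataFrame({
    "CRASH_DATE": ["2022-01-31", "2022-01-01", "2022-01-30", "2022-05-28", "2022-04-16"],
    "CRASH_DAY_OF_WEEK": [2, 7, 1, 7, 7],
    "CRASH_MONTH": [1, 1, 1, 5, 4],
})

dates = pd.to_datetime(sample["CRASH_DATE"])

# pandas numbers weekdays Monday=0 ... Sunday=6; the sample rows are
# consistent with a Sunday=1 ... Saturday=7 convention, so shift accordingly.
derived_day = (dates.dt.dayofweek + 1) % 7 + 1
derived_month = dates.dt.month

day_matches = bool((derived_day == sample["CRASH_DAY_OF_WEEK"]).all())
month_matches = bool((derived_month == sample["CRASH_MONTH"]).all())
print(day_matches, month_matches)
```

If both checks hold on the full dataset as well, the only information lost by dropping CRASH_DATE is the day of the month.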

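Machine learning models need numeric input, so the categorical columns must be converted to numbers. We encode each categorical column with the .fit_transform method of LabelEncoder and keep the fitted encoders in a dictionary so the codes can be mapped back to their labels later.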

In [7]:
# Encoding categorical features
label_encoders = {}
for column in ['WEATHER_CONDITION', 'TRAFFICWAY_TYPE', 'ROADWAY_SURFACE_COND', 'STREET_DIRECTION', 'MOST_SEVERE_INJURY']:
    le = LabelEncoder()
    Crash_data[column] = le.fit_transform(Crash_data[column])
    label_encoders[column] = le
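
To read the encoded values later, it helps to know which integer stands for which label. The sketch below shows how LabelEncoder assigns codes and how to reverse them. The label list is illustrative only: "NO INDICATION OF INJURY" appears in the sample rows above, while the other two labels are hypothetical stand-ins. In the notebook itself, label_encoders['MOST_SEVERE_INJURY'].classes_ would give the real mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only: the first appears in the sample rows above,
# the others are hypothetical stand-ins for the remaining categories.
labels = ["NO INDICATION OF INJURY", "FATAL", "NONINCAPACITATING INJURY",
          "NO INDICATION OF INJURY"]

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ is sorted, so the integer code is the position in this array.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns integer codes back into readable labels.
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Note that scikit-learn documents LabelEncoder as a tool for target labels. For feature columns, the integer codes imply an ordering that the categories do not have, so OrdinalEncoder or OneHotEncoder may be worth considering as alternatives.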


Split Dataset:
Split the dataset into features and target, stored in the X and y variables. 'MOST_SEVERE_INJURY' is the target variable; the remaining columns are the features.

In [8]:
# Split dataset into features and target
X = Crash_data.drop(columns=['MOST_SEVERE_INJURY'])
y = Crash_data['MOST_SEVERE_INJURY']


In [9]:
#To see the X variables columns and Dtype
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5201 entries, 0 to 5200
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   POSTED_SPEED_LIMIT    5201 non-null   int64
 1   WEATHER_CONDITION     5201 non-null   int32
 2   TRAFFICWAY_TYPE       5201 non-null   int32
 3   ROADWAY_SURFACE_COND  5201 non-null   int32
 4   STREET_DIRECTION      5201 non-null   int32
 5   CRASH_HOUR            5201 non-null   int64
 6   CRASH_DAY_OF_WEEK     5201 non-null   int64
 7   CRASH_MONTH           5201 non-null   int64
dtypes: int32(4), int64(4)
memory usage: 243.9 KB


In [10]:
print(y)

0       2
1       2
2       2
3       2
4       2
       ..
5196    3
5197    2
5198    2
5199    2
5200    2
Name: MOST_SEVERE_INJURY, Length: 5201, dtype: int32


Train-test split:
Here, we apply train_test_split from sklearn. We reserve 20% of the data for testing so the model's performance can be evaluated on data it has not seen during training. X_train and X_test contain the feature columns POSTED_SPEED_LIMIT, WEATHER_CONDITION, TRAFFICWAY_TYPE, ROADWAY_SURFACE_COND, STREET_DIRECTION, CRASH_HOUR, CRASH_DAY_OF_WEEK, and CRASH_MONTH; y_train and y_test contain the target variable.

In [11]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [12]:
#To see number of rows and columns in X_train and X_test
X_train.shape, X_test.shape

((4160, 8), (1041, 8))

In [13]:
#To see number of rows and columns in y_train and y_test
y_train.shape, y_test.shape

((4160,), (1041,))
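
When the target classes are very unevenly represented, a plain random split can leave a rare class almost absent from one side. One option is the stratify argument of train_test_split, sketched below on synthetic data. The names X_demo and y_demo are hypothetical stand-ins, not the crash data, and stratification requires every class to have at least two samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for X and y (80 samples of class 0, 20 of class 1).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, each split keeps the 80/20 class ratio.
print(np.bincount(y_tr), np.bincount(y_te))
```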

In [14]:
# Initialize the scaler
scaler = StandardScaler()

In [15]:
# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

In [17]:
print(X_train_scaled)

[[ 0.2937593   2.38717428  0.11770509 ... -1.06106744 -1.10636596
  -0.05447764]
 [ 0.2937593  -0.22701328  0.68247231 ...  0.00499718  0.41830798
   0.6449883 ]
 [ 0.2937593   1.0800805   0.11770509 ...  0.36035205 -0.08991667
   0.6449883 ]
 ...
 [-2.0617509   1.0800805   0.11770509 ...  0.89338436  0.92653262
   0.6449883 ]
 [ 1.07892937 -0.55378673  0.11770509 ... -0.70571257 -0.08991667
   0.6449883 ]
 [ 0.2937593  -0.55378673  0.68247231 ...  0.89338436  1.43475727
  -1.4534095 ]]


In [16]:
# Transform the test data using the fitted scaler
X_test_scaled = scaler.transform(X_test)

In [18]:
print(X_test_scaled)

[[ 0.2937593   1.0800805  -1.57659656 ...  0.71570692 -0.08991667
  -0.75394357]
 [ 0.2937593  -0.55378673  0.11770509 ...  0.18267461 -1.10636596
   1.34445423]
 [ 0.2937593  -0.55378673  0.11770509 ... -0.17268026  0.92653262
  -0.05447764]
 ...
 [ 1.07892937 -0.55378673 -1.29421296 ...  0.53802949  1.43475727
   0.6449883 ]
 [ 0.2937593  -0.55378673  0.11770509 ... -0.17268026 -0.08991667
  -1.4534095 ]
 [ 1.07892937  1.0800805   0.11770509 ...  0.36035205  1.43475727
   0.6449883 ]]
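
The scaler is fitted on the training data only and then reused on the test data, so no information from the test set leaks into preprocessing. The sketch below illustrates the effect on synthetic numbers; train_demo, test_demo, and scaler_demo are hypothetical names, not the crash data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for X_train and X_test (not the crash data).
train_demo = rng.normal(loc=30.0, scale=5.0, size=(200, 3))
test_demo = rng.normal(loc=30.0, scale=5.0, size=(50, 3))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean_ and scale_
test_scaled = scaler_demo.transform(test_demo)        # reuses them unchanged

# Training columns are centred at 0 with unit variance by construction.
train_centred = bool(np.allclose(train_scaled.mean(axis=0), 0.0))
train_unit_std = bool(np.allclose(train_scaled.std(axis=0), 1.0))
print(train_centred, train_unit_std)

# Test columns are only approximately standardized.
print(test_scaled.mean(axis=0).round(2))
```

Tree-based models such as Random Forest split on thresholds and are generally insensitive to this kind of rescaling, so the step matters mainly if other model types are tried on the same data.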


In [19]:
# Random Forest Model
rf_model = RandomForestClassifier(random_state=42)

Here, we use cross-validation to estimate the model's performance and to check how stable that estimate is across different subsets of the training data.

In [21]:
# Perform cross-validation to evaluate model stability
cv_scores = cross_val_score(rf_model, X_train_scaled, y_train, cv=5, scoring='accuracy')  

print(cv_scores)

[0.85576923 0.87139423 0.85576923 0.86057692 0.86658654]


In [22]:
#See the Cross-Validation mean score using .mean() function
print("Mean Cross-Validation Score:", cv_scores.mean())

Mean Cross-Validation Score: 0.8620192307692307
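
The cross-validation above runs on data that was scaled once using the whole training set. A possible refinement, sketched below on synthetic data, is to wrap the scaler and the model in a pipeline so the scaler is refitted inside each fold, and to report balanced accuracy next to plain accuracy. The data from make_classification and the names pipe, folds, acc, and bal are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for X_train and y_train.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)

# The scaler is refitted inside every fold, so validation folds never influence it.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

acc = cross_val_score(pipe, X_demo, y_demo, cv=folds, scoring="accuracy")
bal = cross_val_score(pipe, X_demo, y_demo, cv=folds, scoring="balanced_accuracy")
print(acc.mean(), bal.mean())
```

Balanced accuracy averages recall over the classes, so it tends to drop well below plain accuracy when a model mostly predicts the majority class.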


In [23]:
# Train the Random Forest Model with the scaled data
rf_model.fit(X_train_scaled, y_train)


In [24]:
# Predictions and evaluation for Random Forest
y_pred_rf = rf_model.predict(X_test_scaled)


In [25]:
print(y_pred_rf)

[2 2 2 ... 2 2 2]


In [26]:
#To see the Classification Report
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.00      0.00      0.00        15
           2       0.88      0.99      0.93       910
           3       0.20      0.02      0.04        83
           4       0.17      0.03      0.05        32

    accuracy                           0.87      1041
   macro avg       0.25      0.21      0.21      1041
weighted avg       0.79      0.87      0.82      1041
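
The report above shows high recall for class 2 and recall close to zero for every other class, which is typical of imbalanced data. One option worth trying is the class_weight="balanced" setting of RandomForestClassifier, sketched below on synthetic data. The data and the names weighted_rf and report are hypothetical, and whether this helps on the crash data would need to be tested.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 90% class 0), not the crash data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# "balanced" weights each class inversely to its frequency in y_tr.
weighted_rf = RandomForestClassifier(class_weight="balanced", random_state=42)
weighted_rf.fit(X_tr, y_tr)
y_pred = weighted_rf.predict(X_te)

# zero_division=0 reports 0.00 without a warning when a class is never predicted.
report = classification_report(y_te, y_pred, zero_division=0)
print(report)
```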





In [27]:
#To see the Accuracy 
print("Accuracy:", accuracy_score(y_test, y_pred_rf))

Accuracy: 0.8655139289145053
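
To put this accuracy in context, it can be compared with a baseline that always predicts the most frequent class. The sketch below is a small calculation using only the support and recall values printed in the classification report above; the variable names are hypothetical.

```python
# Per-class support and recall copied from the classification report above.
support = {0: 1, 1: 15, 2: 910, 3: 83, 4: 32}
recall = {0: 0.00, 1: 0.00, 2: 0.99, 3: 0.02, 4: 0.03}
model_accuracy = 0.8655139289145053

total = sum(support.values())
# Accuracy of always predicting the most frequent class.
baseline_accuracy = max(support.values()) / total
# Unweighted mean of per-class recall (the "macro avg" recall).
macro_recall = sum(recall.values()) / len(recall)

print(total, round(baseline_accuracy, 4), round(macro_recall, 2))
```

On this test split, the majority-class baseline comes out slightly above the model's accuracy, which suggests that accuracy alone says little about how well the minority classes are predicted.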


Conclusion:

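The Random Forest model predicts the severity of traffic crashes in Chicago. With 5-fold cross-validation on the training data, the mean accuracy is 0.8620, and on the held-out test set the accuracy is 0.8655. The classification report shows that this accuracy comes almost entirely from class 2, which accounts for 910 of the 1,041 test samples: recall is 0.99 for class 2 but between 0.00 and 0.03 for the other four classes. In its current form, the model therefore gives limited insight into the conditions behind injury crashes. Once the class imbalance is addressed, a Random Forest model could help identify high-risk conditions for traffic crashes and support targeted road safety interventions.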
