## Introduction

In this notebook, we apply data science techniques to the Data-Collisions dataset to find insights. The ultimate goal is to build a machine learning model that predicts accident severity from the available features. As usual, we start with exploratory data analysis (EDA) and feature engineering, then proceed to data modelling and model evaluation.

In [46]:
import pandas as pd
import numpy as np

In [47]:
df = pd.read_csv('Data-Collisions.csv')


## Data Exploration & Feature Engineering

Below, we take a first look at the dataset. It has 194673 rows and 38 columns. Per the capstone project instructions, we are expected to predict the severity of an accident, so the column 'SEVERITYCODE' is our target. Severity is represented by the numeric value of the code. With that, we have a basic understanding of our goal.

In [48]:
df.shape

(194673, 38)

In [49]:
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


To confirm that the severity codes are what we expect, we can run the following check. The result shows that only two severity codes occur in the data: 1 and 2.

In [50]:
df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
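
The counts also show that the two classes are not balanced. As a small sketch, using only the counts printed above (the variable names are my own), we can work out each class's share and the accuracy that a trivial "always predict the most common code" model would reach. This gives a reference point for judging the models later.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
severity_counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

# Share of each class; on the full DataFrame this would be
# df['SEVERITYCODE'].value_counts(normalize=True).
class_share = severity_counts / severity_counts.sum()

# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = class_share.max()

print(class_share.round(4))
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Roughly 70% of the records carry code 1, so any model should be compared against that baseline rather than against 50%.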

The next step towards building a model is to look at the features. Namely, which features are most valuable for predicting accident severity? This is the feature engineering part!

In [51]:
# Check columns in this dataset. 
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

After carefully looking through the whole dataset, I decided to keep the columns in the features list as the final dataset for data modelling. I believe these features are the most relevant to whether an accident occurs and how severe it is. However, the models we will build cannot be trained on string-typed columns, so we need to convert the non-numeric columns into numeric ones. We can use dummy variables for this conversion. The columns that need to be converted are listed in the dummy list.

In [52]:
features = ['SEVERITYCODE','ADDRTYPE','SEVERITYDESC','COLLISIONTYPE','PERSONCOUNT','PEDCOUNT','PEDCYLCOUNT','VEHCOUNT','JUNCTIONTYPE','SDOT_COLDESC','WEATHER','ROADCOND','LIGHTCOND','ST_COLDESC']
dummy = ['ADDRTYPE','SEVERITYDESC','COLLISIONTYPE','JUNCTIONTYPE','SDOT_COLDESC','WEATHER','ROADCOND','LIGHTCOND','ST_COLDESC']


In [53]:
df = pd.DataFrame(df[features])
df.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,JUNCTIONTYPE,SDOT_COLDESC,WEATHER,ROADCOND,LIGHTCOND,ST_COLDESC
0,2,Intersection,Injury Collision,Angles,2,0,0,2,At Intersection (intersection related),"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",Overcast,Wet,Daylight,Entering at angle
1,1,Block,Property Damage Only Collision,Sideswipe,2,0,0,2,Mid-Block (not related to intersection),"MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE ...",Raining,Wet,Dark - Street Lights On,From same direction - both going straight - bo...
2,1,Block,Property Damage Only Collision,Parked Car,4,0,0,3,Mid-Block (not related to intersection),"MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",Overcast,Dry,Daylight,One parked--one moving
3,1,Block,Property Damage Only Collision,Other,3,0,0,3,Mid-Block (not related to intersection),"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",Clear,Dry,Daylight,From same direction - all others
4,2,Intersection,Injury Collision,Angles,2,0,0,2,At Intersection (intersection related),"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",Raining,Wet,Daylight,Entering at angle


We can use a for loop to perform the conversion and obtain a new dataset that includes the dummy variables.

In [54]:
for name in dummy: 
    dummies = pd.get_dummies(df[name])
    df = pd.concat([df,dummies],axis=1)
    df = df.drop([name],axis=1)
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,Alley,Block,Intersection,Injury Collision,Property Damage Only Collision,...,Vehicle Strikes Pedalcyclist,Vehicle Strikes Railway Vehicle,Vehicle Struck by City Road or Construction Machinery,Vehicle Struck by Other Road or Construction Machinery,Vehicle backing hits pedestrian,Vehicle going straight hits pedestrian,Vehicle hits Pedestrian - All Other Actions,Vehicle overturned,Vehicle turning left hits pedestrian,Vehicle turning right hits pedestrian
0,2,2,0,0,2,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,1,2,0,0,2,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,4,0,0,3,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,3,0,0,3,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,2,2,0,0,2,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
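
One thing to watch for with the loop above: the new columns are named after the category values only, so two source columns that share a label would produce two dummy columns with the same name. The sketch below uses a hypothetical toy frame in which both WEATHER and ROADCOND contain an 'Unknown' value (an assumption for illustration, not something shown in the output above) and compares the loop-style encoding with a single `pd.get_dummies` call that uses the `columns` argument, which prefixes each new column with its source column name.

```python
import pandas as pd

# Toy frame (hypothetical rows) with two categorical columns that share a label.
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1],
    "WEATHER": ["Raining", "Unknown", "Clear"],
    "ROADCOND": ["Wet", "Unknown", "Dry"],
})

# Encoding each column separately without a prefix, as in the loop above,
# yields two columns that are both named 'Unknown'.
unprefixed = pd.concat(
    [toy[["SEVERITYCODE"]]]
    + [pd.get_dummies(toy[col]) for col in ["WEATHER", "ROADCOND"]],
    axis=1,
)

# Passing columns= encodes them in one call and prefixes every new column
# with the name of the column it came from.
prefixed = pd.get_dummies(toy, columns=["WEATHER", "ROADCOND"])

print(list(unprefixed.columns))
print(list(prefixed.columns))
```

With prefixes, each dummy column stays traceable to its original feature, which also makes the fitted models easier to interpret.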


## Data Modelling

After careful selection, I decided to use four classic machine learning models to give predictions: Logistic Regression, KNN, Decision Tree, and SVM. I chose them because they are suited to classification rather than regression. After the typical modelling preparation steps, we are good to go!

In [55]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

In [56]:
# Feature and target split. 
y = df['SEVERITYCODE']
X = df.loc[:, df.columns != 'SEVERITYCODE']

In [57]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [58]:
# Standardize the features (zero mean, unit variance); the scaler is fit on the training set only
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
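
The scaling step above fits the scaler on the training set and reuses it on the test set. As an alternative sketch, scikit-learn's `make_pipeline` can bundle the scaler and a model so the same fit-on-train, transform-on-test pattern is applied automatically. The data here is synthetic and the names `X_demo`, `y_demo`, and `pipe` are placeholders of my own, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real X and y come from the cells above.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# The pipeline fits the scaler on the training data only and reuses
# those statistics whenever it transforms new data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

print(f"Test accuracy on synthetic data: {pipe.score(X_te, y_te):.3f}")
```

The accuracy printed here describes the synthetic data only and says nothing about the collision dataset.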

### Logistic Regression

In [59]:
#Fit a logistic regression model
logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)

# Score it (accuracy_score expects the true labels first, then the predictions)
train_prediction = logreg_model.predict(X_train)
test_prediction = logreg_model.predict(X_test)
accuracy_train = accuracy_score(y_train, train_prediction)
accuracy_test = accuracy_score(y_test, test_prediction)

print(f"Score on training set: {accuracy_train}")
print(f"Score on test set: {accuracy_test}")

Score on training set: 1.0
Score on test set: 1.0
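
Accuracy is a single number, so it does not show how a model treats each severity code separately. As a hedged illustration, the sketch below builds a confusion matrix with `sklearn.metrics.confusion_matrix` from a handful of hypothetical labels and predictions (they are made up for the example and are not results from this dataset). In the notebook, `y_test` and `test_prediction` could be passed in their place.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
y_true_demo = [1, 1, 1, 1, 2, 2]
y_pred_demo = [1, 1, 1, 2, 2, 1]

# Rows are the true codes, columns are the predicted codes, both ordered [1, 2].
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2])
print(cm)

# Per-class recall: correct predictions on the diagonal over each row total.
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(dict(zip([1, 2], recall_per_class.round(2))))
```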


### KNN Classifier

In [38]:
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the model & fit it to our data (the split from above is reused)
KNN_model = KNeighborsClassifier()
KNN_model.fit(X_train, y_train)

# Score the model on the test set
test_predictions = KNN_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test set accuracy: {test_accuracy}")

Test set accuracy: 0.9975343521253371


### Decision Tree 

In [60]:
from sklearn.tree import DecisionTreeClassifier

#Fit to the training data
DT_model = DecisionTreeClassifier()
DT_model.fit(X_train, y_train)

print(f"The TRAIN classification accuracy is:  {DT_model.score(X_train,y_train)}")
print(f"The TEST classification accuracy is:  {DT_model.score(X_test,y_test)}")

The TRAIN classification accuracy is:  1.0
The TEST classification accuracy is:  1.0


### SVM Classifier

In [61]:
from sklearn.svm import LinearSVC

SVM_model = LinearSVC()
SVM_model.fit(X_train, y_train)

print(f"The TRAIN classification accuracy is: {SVM_model.score(X_train,y_train)}")
print(f"The TEST classification accuracy is: {SVM_model.score(X_test,y_test)}")

The TRAIN classification accuracy is: 1.0
The TEST classification accuracy is: 1.0




The modelling results are surprisingly good: almost all models reach 100% accuracy on the test set. Considering that there are only two severity codes, 1 and 2, the outcome might be relatively easy to predict from the detailed features encoded as dummy variables. If I had to choose only one of the models above, I would choose the decision tree or logistic regression, because they were the two fastest to compute.
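
One possible explanation for the perfect scores is worth checking before trusting them. In the five rows displayed earlier, every record with code 2 has the SEVERITYDESC 'Injury Collision' and every record with code 1 has 'Property Damage Only Collision'. If that one-to-one pattern holds across the full dataset (not verified here), the dummy columns derived from SEVERITYDESC would reveal the target directly. The sketch below runs the check on those five rows only; the same two lines could be run on the full DataFrame before the dummy conversion.

```python
import pandas as pd

# The five rows shown in the df.head() output after column selection.
sample = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 1, 2],
    "SEVERITYDESC": [
        "Injury Collision",
        "Property Damage Only Collision",
        "Property Damage Only Collision",
        "Property Damage Only Collision",
        "Injury Collision",
    ],
})

# Cross-tabulate the target against the candidate feature.
table = pd.crosstab(sample["SEVERITYCODE"], sample["SEVERITYDESC"])
print(table)

# If every description maps to exactly one code, the feature encodes the target.
codes_per_desc = sample.groupby("SEVERITYDESC")["SEVERITYCODE"].nunique()
print((codes_per_desc == 1).all())
```

If the check comes out the same way on the full data, removing 'SEVERITYDESC' from the features list and re-running the models would likely give a more realistic picture of how well severity can be predicted.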