# Car Accident Severity Model

*This project serves as the IBM Applied Data Science Capstone for finishing the **IBM Data Science Professional Certificate**.*

## Business Understanding

The model described in this document aims to predict the likelihood of a collision and its severity, based on the traffic records provided by the Seattle Police Department (SPD). These records include weather, road and light conditions, as well as the incident location and the types of vehicles involved.

## Data Understanding

The dataset, `Data-Collisions.csv`, contains all collisions provided by SPD and recorded by Traffic Records. It specifies the type of each collision and describes it with 37 different attributes.

Many of the columns in the dataset are identifiers of some kind, holding a unique ID or a characteristic of the record itself. These would not provide useful information to the model, so they can be dropped: OBJECTID, INCKEY, COLDETKEY, REPORTNO, STATUS, ADDRTYPE, INTKEY, LOCATION, EXCEPTRSNDESC, SDOTCOLNUM.

On the other hand, EXCEPTRSNCODE could help to flag outliers, as it marks the cases in which not enough information was recorded.

SEVERITYCODE will be our dependent variable $y$, as it is the one we want the model to predict.

The date and time of the incident can be used for feature engineering, as the day of the week and time of the accident could be important factors.

The following variables are likely to be closely related to the outcome, as they describe the conditions at the time of the collision: JUNCTIONTYPE, WEATHER, ROADCOND, LIGHTCOND. In addition, SEGLANEKEY, CROSSWALKKEY and the X and Y coordinates could indicate the locations where accidents are most likely to happen.

In [1]:
import pandas as pd
import numpy as np

In [2]:
!wget -O Data-Collisions.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

Data-Collisions.csv: Permission denied


In [3]:
df = pd.read_csv('Data-Collisions.csv')
df.head()



Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [4]:
df.describe()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,INTKEY,SEVERITYCODE.1,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,SDOTCOLNUM,SEGLANEKEY,CROSSWALKKEY
count,194673.0,189339.0,189339.0,194673.0,194673.0,194673.0,65070.0,194673.0,194673.0,194673.0,194673.0,194673.0,194673.0,114936.0,194673.0,194673.0
mean,1.298901,-122.330518,47.619543,108479.36493,141091.45635,141298.811381,37558.450576,1.298901,2.444427,0.037139,0.028391,1.92078,13.867768,7972521.0,269.401114,9782.452
std,0.457778,0.029976,0.056157,62649.722558,86634.402737,86986.54211,51745.990273,0.457778,1.345929,0.19815,0.167413,0.631047,6.868755,2553533.0,3315.776055,72269.26
min,1.0,-122.419091,47.495573,1.0,1001.0,1001.0,23807.0,1.0,0.0,0.0,0.0,0.0,0.0,1007024.0,0.0,0.0
25%,1.0,-122.348673,47.575956,54267.0,70383.0,70383.0,28667.0,1.0,2.0,0.0,0.0,2.0,11.0,6040015.0,0.0,0.0
50%,1.0,-122.330224,47.615369,106912.0,123363.0,123363.0,29973.0,1.0,2.0,0.0,0.0,2.0,13.0,8023022.0,0.0,0.0
75%,2.0,-122.311937,47.663664,162272.0,203319.0,203459.0,33973.0,2.0,3.0,0.0,0.0,2.0,14.0,10155010.0,0.0,0.0
max,2.0,-122.238949,47.734142,219547.0,331454.0,332954.0,757580.0,2.0,81.0,6.0,2.0,12.0,69.0,13072020.0,525241.0,5239700.0


## Data Preparation

First, the columns related to data classification or IDs will be dropped:

In [5]:
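# Note: drop() returns a new DataFrame; the result is only displayed here, so df itself keeps these columns
df.drop(['OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNDESC', 'SDOTCOLNUM'], axis=1)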

Unnamed: 0,SEVERITYCODE,X,Y,EXCEPTRSNCODE,SEVERITYCODE.1,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,...,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.703140,,2,Injury Collision,Angles,2,0,0,...,Overcast,Wet,Daylight,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,,1,Property Damage Only Collision,Sideswipe,2,0,0,...,Raining,Wet,Dark - Street Lights On,,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.334540,47.607871,,1,Property Damage Only Collision,Parked Car,4,0,0,...,Overcast,Dry,Daylight,,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,,1,Property Damage Only Collision,Other,3,0,0,...,Clear,Dry,Daylight,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,,2,Injury Collision,Angles,2,0,0,...,Raining,Wet,Daylight,,,10,Entering at angle,0,0,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194668,2,-122.290826,47.565408,,2,Injury Collision,Head On,3,0,0,...,Clear,Dry,Daylight,,,24,From opposite direction - both moving - head-on,0,0,N
194669,1,-122.344526,47.690924,,1,Property Damage Only Collision,Rear Ended,2,0,0,...,Raining,Wet,Daylight,,,13,From same direction - both going straight - bo...,0,0,N
194670,2,-122.306689,47.683047,,2,Injury Collision,Left Turn,3,0,0,...,Clear,Dry,Daylight,,,28,From opposite direction - one left turn - one ...,0,0,N
194671,2,-122.355317,47.678734,,2,Injury Collision,Cycles,2,0,1,...,Clear,Dry,Dusk,,,5,Vehicle Strikes Pedalcyclist,4308,0,N
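
`DataFrame.drop` returns a new DataFrame by default, which is why the `df.head()` output further down still lists OBJECTID and the other identifier columns. The sketch below illustrates this behaviour on a tiny made-up frame (the column names mirror the dataset, the values are invented); it is not part of the original run.

```python
import pandas as pd

# Tiny stand-in frame; the values are made up for illustration.
demo = pd.DataFrame({
    "OBJECTID": [1, 2, 3],
    "SEVERITYCODE": [2, 1, 1],
    "WEATHER": ["Overcast", "Raining", "Clear"],
})

# drop() returns a new DataFrame and leaves `demo` untouched.
dropped = demo.drop(columns=["OBJECTID"])
print(list(demo.columns))     # still includes OBJECTID
print(list(dropped.columns))  # OBJECTID is gone

# To make the change persist, assign the result back.
demo = demo.drop(columns=["OBJECTID"])
```

Since the features are later selected by name, the unassigned drop does not affect the models in this notebook.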


Now, two engineered features will be created from INCDTTM: day of the week and hour.

In [6]:
df['INCDTTM'] = pd.to_datetime(df['INCDTTM'])
df["DayOfWeek"] = df["INCDTTM"].dt.dayofweek
df["Hour"] = df["INCDTTM"].dt.hour
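
As a quick illustration of what these two accessors return, the sketch below applies them to two made-up timestamps (they are not rows from `Data-Collisions.csv`). In pandas, `dt.dayofweek` counts from Monday=0 to Sunday=6 and `dt.hour` ranges from 0 to 23.

```python
import pandas as pd

# Illustrative timestamps only; these are not taken from the dataset.
stamps = pd.to_datetime(pd.Series(["2013-03-27 14:54:00", "2021-01-03 08:15:00"]))

day_of_week = stamps.dt.dayofweek  # Monday=0 ... Sunday=6
hour = stamps.dt.hour              # 0-23
day_name = stamps.dt.day_name()    # readable weekday, handy for checking the codes

print(day_of_week.tolist())
print(hour.tolist())
print(day_name.tolist())
```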

In [7]:
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR,DayOfWeek,Hour
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,,,,10,Entering at angle,0,0,N,2,14
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N,2,18
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,,4323031.0,,32,One parked--one moving,0,0,N,3,10
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,,,,23,From same direction - all others,0,0,N,4,9
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,,4028032.0,,10,Entering at angle,0,0,N,2,8


In [8]:
df.groupby(['DayOfWeek'])['SEVERITYCODE'].count()

DayOfWeek
0    26338
1    28556
2    28778
3    29324
4    32333
5    27389
6    21955
Name: SEVERITYCODE, dtype: int64
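
These counts show how many collisions fall on each weekday (code 4, Friday, has the most), but not how severe they are. One possible follow-up, sketched here on made-up records rather than on `df`, is to compute the share of collisions coded 2 per weekday.

```python
import pandas as pd

# Made-up records; in the notebook the same expression could be applied to df.
toy = pd.DataFrame({
    "DayOfWeek":    [0, 0, 4, 4, 4, 6],
    "SEVERITYCODE": [1, 2, 1, 1, 2, 2],
})

# Fraction of collisions per weekday that were coded 2 (injury collision).
injury_share = (toy["SEVERITYCODE"] == 2).groupby(toy["DayOfWeek"]).mean()
print(injury_share)
```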

## Feature Selection

Now let's select the features used to build the models.

In [9]:
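# Note: 'WEATHER' is listed twice, so the column is duplicated in dff (shown as WEATHER.1 in the output)
Features=['SEVERITYCODE','WEATHER','ROADCOND','LIGHTCOND','WEATHER','COLLISIONTYPE','DayOfWeek','Hour']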

In [10]:
dff=df[Features]

In [11]:
dff.head()

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND,WEATHER.1,COLLISIONTYPE,DayOfWeek,Hour
0,2,Overcast,Wet,Daylight,Overcast,Angles,2,14
1,1,Raining,Wet,Dark - Street Lights On,Raining,Sideswipe,2,18
2,1,Overcast,Dry,Daylight,Overcast,Parked Car,3,10
3,1,Clear,Dry,Daylight,Clear,Other,4,9
4,2,Raining,Wet,Daylight,Raining,Angles,2,8


Many of the selected features require numerical encoding, as the ML algorithms can't work with strings.

In [12]:
df.WEATHER.unique()

array(['Overcast', 'Raining', 'Clear', nan, 'Unknown', 'Other', 'Snowing',
       'Fog/Smog/Smoke', 'Sleet/Hail/Freezing Rain', 'Blowing Sand/Dirt',
       'Severe Crosswind', 'Partly Cloudy'], dtype=object)

In [13]:
encode={
'WEATHER':{'Clear':0,'Raining':1,'Overcast':2,'Unknown':3,'Snowing':4,'Other':5,'Fog/Smog/Smoke':6,'Sleet/Hail/Freezing Rain':7,'Blowing Sand/Dirt':8,'Severe Crosswind':9,'Partly Cloudy':10},
'ROADCOND': {'Dry':0, 'Wet':1, 'Unknown':2,'Ice':3, 'Snow/Slush':4, 'Other':5, 'Standing Water':6, 'Sand/Mud/Dirt':7,'Oil':8},
'LIGHTCOND':{'Daylight':0,'Dark - Street Lights On':1, 'Unknown': 2, 'Dusk':3, 'Dawn':4, 'Dark - No Street Lights':5, 'Dark - Street Lights Off':6, 'Other':7, 'Dark - Unknown Lighting': 8},
'COLLISIONTYPE':{'Parked Car':0,'Angles':1,'Rear Ended':2,'Other':3,'Sideswipe':4,'Left Turn':5, 'Pedestrian':6,'Cycles':7,'Right Turn':8,'Head On':9}}

In [14]:
# Assign the result instead of using inplace=True, which raises SettingWithCopyWarning on a slice of df
dff = dff.replace(encode)


In [15]:
dff.head()

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND,WEATHER.1,COLLISIONTYPE,DayOfWeek,Hour
0,2,2.0,1.0,0.0,2.0,1.0,2,14
1,1,1.0,1.0,1.0,1.0,4.0,2,18
2,1,2.0,0.0,0.0,2.0,0.0,3,10
3,1,0.0,0.0,0.0,0.0,3.0,4,9
4,2,1.0,1.0,0.0,1.0,1.0,2,8
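
Integer codes like these impose an arbitrary order and spacing on the categories, and distance-based models such as KNN treat those gaps as meaningful. A common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a small made-up frame; it is an illustration of that alternative, not the encoding used for the results in this notebook.

```python
import pandas as pd

# Small made-up sample with one categorical and one numeric column.
toy = pd.DataFrame({
    "WEATHER": ["Overcast", "Raining", "Clear", "Raining"],
    "Hour": [14, 18, 10, 8],
})

# One indicator column per category instead of a single integer code.
encoded = pd.get_dummies(toy, columns=["WEATHER"], dtype=int)
print(encoded)
```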


Some rows contain missing values, so let's drop them from the dataset.

In [16]:
dff.shape

(194673, 8)

In [17]:
dff=dff.dropna()

In [18]:
dff.shape

(189316, 8)

In [19]:
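# Note: 'WEATHER' appears twice here and is already duplicated in dff, so X ends up with 9 columns for 6 distinct features
X = dff[['WEATHER','ROADCOND','LIGHTCOND','WEATHER','COLLISIONTYPE','DayOfWeek','Hour']]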

In [20]:
y = dff['SEVERITYCODE']

## Normalize Data

Standardization gives the data zero mean and unit variance. (Strictly speaking, the scaler should be fitted after the train/test split, using the training set only.)

In [21]:
from sklearn import preprocessing
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[ 1.10590652,  1.10590652,  0.70885797, -0.59511633,  1.10590652,
         1.10590652, -0.60448251, -0.49067672,  0.36762741],
       [ 0.20508688,  0.20508688,  0.70885797,  0.34715119,  0.20508688,
         0.20508688,  0.79046466, -0.49067672,  0.94461195],
       [ 1.10590652,  1.10590652, -0.60850212, -0.59511633,  1.10590652,
         1.10590652, -1.06946491,  0.02939984, -0.20935713],
       [-0.69573277, -0.69573277, -0.60850212, -0.59511633, -0.69573277,
        -0.69573277,  0.32548227,  0.5494764 , -0.35360327],
       [ 0.20508688,  0.20508688,  0.70885797, -0.59511633,  0.20508688,
         0.20508688, -0.60448251, -0.49067672, -0.4978494 ]])
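
The caveat about standardizing after the split can be sketched as follows. The example uses synthetic data (`X_demo` and `y_demo` are stand-ins, not the notebook's variables): the scaler is fitted on the training rows only, and the test rows are transformed with the training mean and standard deviation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.integers(1, 3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=4)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

Fitting on the training rows alone keeps information about the test set out of the preprocessing step.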

## Train-Test Split

The data is split into training and test sets, which are used later to evaluate each model.

In addition, a second subset of the training data will be needed to find the best K for the KNN model.

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (132521, 9) (132521,)
Test set: (56795, 9) (56795,)


# K Nearest Neighbor (KNN)

In [23]:
X_train_KNN, X_test_KNN, y_train_KNN, y_test_KNN = train_test_split( X_train, y_train, test_size=0.3, random_state=4)

In [24]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

### Running KNN Models with Different Ks

The following code runs KNN models with K from 1 to 19 to find the one with the highest accuracy.

In [25]:
Ks = 20  # evaluate K = 1 .. Ks-1
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)
for n in range(1, Ks):
    # Train on the KNN training subset and predict on the held-out subset
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train_KNN, y_train_KNN)
    yhat = neigh.predict(X_test_KNN)
    mean_acc[n - 1] = metrics.accuracy_score(y_test_KNN, yhat)
    # Standard error of the accuracy estimate
    std_acc[n - 1] = np.std(yhat == y_test_KNN) / np.sqrt(yhat.shape[0])

mean_acc

array([0.67125286, 0.71846467, 0.69612898, 0.7227658 , 0.70702015,
       0.72643811, 0.71708127, 0.73202203, 0.72163393, 0.73260055,
       0.72701663, 0.73400911, 0.7268154 , 0.73599618, 0.72902885,
       0.73657469, 0.73149382, 0.73629801, 0.73413487])

In [26]:
print( "The best accuracy was with", mean_acc.max(), "with k=", mean_acc.argmax()+1) 

The best accuracy was with 0.7365746912493397 with k= 16
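
A single validation split can make the choice of K somewhat sensitive to how the rows happen to be divided. One alternative, sketched below on synthetic data (`X_demo` and `y_demo` stand in for the training set), is k-fold cross-validation with scikit-learn's `GridSearchCV`. This is a possible variation, not the procedure that produced the result above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=4)

# Try K = 1..19 and score each value with 5-fold cross-validated accuracy.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 20))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print(best_k, search.best_score_)
```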


Now, let's run the model with k=16 using all the training data

In [37]:
k = 16
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh

KNeighborsClassifier(n_neighbors=16)

# Decision Tree

In [28]:
from sklearn.tree import DecisionTreeClassifier
dtree=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_train,y_train)

# Support Vector Machine

Let's use a Radial Basis Function (RBF) kernel.

In [29]:
from sklearn import svm
svm_rbf = svm.SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train) 

SVC()

# Logistic Regression

In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)

# Model Evaluation using Test set

In [31]:
import sklearn
from sklearn.metrics import f1_score

## Predicting Test set using the different models

Using the models fitted above, we're going to make predictions for the X_test data.

In [38]:
neigh_predict = neigh.predict(X_test)
dtree_predict = dtree.predict(X_test)
svm_predict = svm_rbf.predict(X_test)
LR_predict = LR.predict(X_test)

Now, let's create a matrix showing the weighted F1-score of each model.

In [39]:
evaluations = [[f1_score(y_test, neigh_predict, average='weighted')]]
evaluations.append([f1_score(y_test, dtree_predict, average='weighted')])
evaluations.append([f1_score(y_test, svm_predict, average='weighted')])
evaluations.append([f1_score(y_test, LR_predict, average='weighted')])

In [40]:
evaluations

[[0.6943831163278058],
 [0.6841403309887303],
 [0.6792877657727993],
 [0.658271262182948]]

In [41]:
dfeval = pd.DataFrame(evaluations, columns = ["F1-score"] , index=['KNN', 'Decision Tree', 'SVM', 'LogisticRegression']) 

In [42]:
dfeval

Unnamed: 0,F1-score
KNN,0.694383
Decision Tree,0.68414
SVM,0.679288
LogisticRegression,0.658271
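
To put these scores in context, it helps to compare them with a trivial baseline. The `describe()` output earlier reports a mean SEVERITYCODE of about 1.30, which suggests that roughly 30% of the raw records are class 2. The sketch below builds synthetic labels with that approximate 70/30 balance (an assumption for illustration, not the actual test set) and scores a classifier that always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with a 70/30 split between classes 1 and 2.
y_demo = np.array([1] * 70 + [2] * 30)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = baseline.predict(X_demo)

acc = accuracy_score(y_demo, pred)
f1_weighted = f1_score(y_demo, pred, average="weighted", zero_division=0)
print(acc, f1_weighted)
```

Running the same baseline on the real `y_test` would show how much each model improves over always predicting class 1.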


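Based on the weighted F1-score of each model on the test set, the best model was KNN with k=16 (0.694), followed by the Decision Tree (0.684), SVM (0.679) and Logistic Regression (0.658).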