# Introduction to car accident severity analysis and data understanding

## Introduction

This project aims to better understand car accidents with respect to their severity. We divide car accidents into two classes: 'injury collision' and 'property damage collision'. Using data on car accidents in Seattle, we try to build a model that predicts the severity of an accident from its details, such as date and time, number of people involved, location and weather.

A better understanding of the causes of severe car accidents could inform measures to prevent them.

In this project we mainly use classification algorithms to build a model that classifies car accidents according to severity.




## Data Understanding

We use data provided by the Seattle Traffic Management Division (metadata describing the dataset is available at https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf). The dataset contains 194673 entries with 38 attributes; however, not every attribute is useful for our analysis.

First, let us extract columns that could be potentially useful in our project.

In [174]:
import re

import folium
import matplotlib as mtp
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import preprocessing

%matplotlib inline

In [175]:
df=pd.read_csv('Data-Collisions.csv')
df.shape

(194673, 38)

In [176]:
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

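In the first step we drop redundant columns such as 'INTKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'SEVERITYCODE.1', 'SEVERITYDESC' and 'SDOT_COLDESC'. We also drop the 'LOCATION' column, as it does not provide the exact address of the accident, together with 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'SDOTCOLNUM', 'X' and 'Y'. This leaves 23 of the 38 columns.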

In [177]:
df.drop(columns=['INTKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'SEVERITYCODE.1',
                 'SEVERITYDESC', 'SDOT_COLDESC', 'LOCATION', 'EXCEPTRSNCODE',
                 'EXCEPTRSNDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'SDOTCOLNUM',
                 'X', 'Y'], inplace=True)
df.shape

(194673, 23)

Now let us take a look at missing values in our data frame.

In [178]:
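# For every column, count how many entries are present (False) and missing (True)
missing = []
missing_data = df.isnull()
for column in missing_data.columns:
    # rename() keeps the attribute name as the row label in the resulting table
    missing.append(missing_data[column].value_counts().rename(column))

# One row per attribute; attributes without any missing value get 0 instead of NaN
missing_values = pd.DataFrame(missing).fillna(0)

# Keep only the count of missing entries (the column labelled True)
missing_values.drop(columns=[False], inplace=True)

# Display the number of missing values for each attribute
missing_values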


| Attribute | True (missing) |
|---|---|
| SEVERITYCODE | 0.0 |
| OBJECTID | 0.0 |
| INCKEY | 0.0 |
| ADDRTYPE | 1926.0 |
| COLLISIONTYPE | 4904.0 |
| PERSONCOUNT | 0.0 |
| PEDCOUNT | 0.0 |
| PEDCYLCOUNT | 0.0 |
| VEHCOUNT | 0.0 |
| INCDATE | 0.0 |
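
The same per-column count of missing values can be obtained more compactly. The following is a minimal sketch on a made-up toy frame (the values are hypothetical and only stand in for the collision data), assuming that only the number of missing entries per attribute is needed:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the collision data (values are made up).
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2],
    "ADDRTYPE": ["Block", np.nan, "Alley", "Block"],
    "COLLISIONTYPE": [np.nan, np.nan, "Angles", "Other"],
})

# isnull() marks missing cells as True; summing the booleans counts them per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Applied to `df`, the expression `df.isnull().sum()` would give one integer count per attribute.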


We see that the dataset is well-defined in the sense that every accident has a SEVERITYCODE and an OBJECTID assigned. We now consider the attributes one by one to decide whether each is a good candidate for the feature set, dealing with the missing values in each column as we go.

1. ADDRTYPE

Column ADDRTYPE takes three values: 'Block', 'Intersection' and 'Alley'. Most accidents happened at a 'Block' (about twice as many as at an 'Intersection', and far more than in an 'Alley'). As a result, we replace missing values by 'Block'.
 

In [179]:
# number of values for ADDRTYPE
df['ADDRTYPE'].value_counts().to_frame()

| ADDRTYPE | Count |
|---|---|
| Block | 126926 |
| Intersection | 65070 |
| Alley | 751 |


In [180]:
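# Replace missing address types with the most frequent value
df['ADDRTYPE'] = df['ADDRTYPE'].fillna('Block')
# Share of each severity code within every address type
df.groupby(['ADDRTYPE'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()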

| ADDRTYPE | SEVERITYCODE | Share |
|---|---|---|
| Alley | 1 | 0.890812 |
| Alley | 2 | 0.109188 |
| Block | 1 | 0.764947 |
| Block | 2 | 0.235053 |
| Intersection | 1 | 0.572476 |
| Intersection | 2 | 0.427524 |


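Intersections show the highest share of severe accidents: about 43% of accidents at intersections involve injury, compared with about 24% at blocks and 11% in alleys.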
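
The table reports shares within each address type, which is a different quantity from the number of severe accidents per type. A minimal sketch on made-up toy data (not the Seattle data) shows how the two can point to different categories:

```python
import pandas as pd

# Toy data: 8 accidents at blocks (3 severe), 3 at intersections (2 severe), 1 in an alley.
toy = pd.DataFrame({
    "ADDRTYPE": ["Block"] * 8 + ["Intersection"] * 3 + ["Alley"],
    "SEVERITYCODE": [2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1],
})

is_severe = toy["SEVERITYCODE"].eq(2)

# Share of severe accidents within each address type (what the normalized table shows).
severe_share = is_severe.groupby(toy["ADDRTYPE"]).mean()
# Absolute number of severe accidents per address type.
severe_count = is_severe.groupby(toy["ADDRTYPE"]).sum()

print(severe_share)
print(severe_count)
```

In this toy example intersections have the highest share of severe accidents, while blocks have the highest count simply because more accidents happen there.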

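2. COLLISIONTYPE
In the case of COLLISIONTYPE, unlike ADDRTYPE, we do not observe a dominant type. For this reason, we do not impute the most frequent value but assign missing values to a separate category. The intended label is 'Other'; note that the code below uses the label 'Block', so missing entries appear under 'Block' in the output.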



In [181]:
df['COLLISIONTYPE'].value_counts().to_frame()

| COLLISIONTYPE | Count |
|---|---|
| Parked Car | 47987 |
| Angles | 34674 |
| Rear Ended | 34090 |
| Other | 23703 |
| Sideswipe | 18609 |
| Left Turn | 13703 |
| Pedestrian | 6608 |
| Cycles | 5415 |
| Right Turn | 2956 |
| Head On | 2024 |


In [182]:
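# Missing collision types get their own category. The label used is 'Block'
# (the same string as for ADDRTYPE), which is why 'Block' appears in the output below.
df['COLLISIONTYPE'] = df['COLLISIONTYPE'].fillna('Block')

# Share of each severity code within every collision type
df.groupby(['COLLISIONTYPE'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()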

| COLLISIONTYPE | SEVERITYCODE | Share |
|---|---|---|
| Angles | 1 | 0.607083 |
| Angles | 2 | 0.392917 |
| Block | 1 | 0.787724 |
| Block | 2 | 0.212276 |
| Cycles | 2 | 0.876085 |
| Cycles | 1 | 0.123915 |
| Head On | 1 | 0.56917 |
| Head On | 2 | 0.43083 |
| Left Turn | 1 | 0.605123 |
| Left Turn | 2 | 0.394877 |


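We observe large differences in the ratio between severe and non-severe accidents among the different types of collisions: for example, about 88% of collisions of type 'Cycles' involve injury, compared with about 39% of type 'Angles'. This makes COLLISIONTYPE a good attribute for our analysis.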

3. PERSONCOUNT
We group PERSONCOUNT values into two groups, up to two people involved and more than two people involved, since these two categories behave differently with respect to severity: accidents involving more people are more often severe (about 41% compared with 25%). In the analysis we exclude the column PERSONCOUNT and use the column PERSONCOUNT_BINNED instead.

In [183]:
# Creating two categories for PERSONCOUNT: up to two people involved ('Low_num')
# and more than two people involved ('High_num'); pd.cut intervals are closed on the right
bins = np.array([0, 2, df['PERSONCOUNT'].max()])
group_names = ['Low_num', 'High_num']

df['PERSONCOUNT_BINNED'] = pd.cut(df['PERSONCOUNT'], bins, labels=group_names, include_lowest=True)
df.groupby(['PERSONCOUNT_BINNED'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()

| PERSONCOUNT_BINNED | SEVERITYCODE | Share |
|---|---|---|
| Low_num | 1 | 0.752733 |
| Low_num | 2 | 0.247267 |
| High_num | 1 | 0.589936 |
| High_num | 2 | 0.410064 |
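
To see which counts end up in which group, it helps to look at the intervals that `pd.cut` builds from the bin edges. The following is a minimal sketch on a made-up series of person counts (hypothetical values), using the same edge construction as above:

```python
import numpy as np
import pandas as pd

# Made-up person counts, for illustration only.
counts = pd.Series([0, 1, 2, 3, 5])
bins = np.array([0, 2, counts.max()])

# Without labels, pd.cut returns the intervals themselves (closed on the right).
intervals = pd.cut(counts, bins, include_lowest=True)
# With labels, each interval is replaced by its group name.
labels = pd.cut(counts, bins, labels=["Low_num", "High_num"], include_lowest=True)

print(pd.DataFrame({"count": counts, "interval": intervals, "label": labels}))
```

Because the intervals are closed on the right, a count of exactly 2 falls into the first group; `include_lowest=True` only extends the first interval to contain the lowest edge, 0.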


4. PEDCOUNT
We observe a large difference between cases where no pedestrian was involved and cases where at least one pedestrian took part. Therefore, we split the data into two groups: 'zero pedestrians' (value 0) and 'pedestrian involved' (value 1).

5. PEDCYLCOUNT
We treat PEDCYLCOUNT in the same way as PEDCOUNT.

6. VEHCOUNT
VEHCOUNT is binned in a similar way to PERSONCOUNT, here into four groups labelled 'Zero', 'One', 'Two' and 'More'.

In [184]:
df['PEDCOUNT_BINNED'] = df['PEDCOUNT'].apply(lambda x: 1 if (x>0)  else 0)

df.groupby(['PEDCOUNT_BINNED'])['SEVERITYCODE'].value_counts(normalize=True)

PEDCOUNT_BINNED  SEVERITYCODE
0                1               0.723295
                 2               0.276705
1                2               0.899409
                 1               0.100591
Name: SEVERITYCODE, dtype: float64

In [185]:
df['PEDCYLCOUNT_BINNED'] = df['PEDCYLCOUNT'].apply(lambda x: 1 if (x>0)  else 0)

df.groupby(['PEDCYLCOUNT_BINNED'])['SEVERITYCODE'].value_counts(normalize=True)

PEDCYLCOUNT_BINNED  SEVERITYCODE
0                   1               0.717832
                    2               0.282168
1                   2               0.876185
                    1               0.123815
Name: SEVERITYCODE, dtype: float64

In [199]:
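# Creating four categories for VEHCOUNT.
# Note: pd.cut intervals are closed on the right, so with these edges the groups are
# 'Zero' = 0 or 1 vehicles, 'One' = 2, 'Two' = 3 and 'More' = more than 3.
bins = np.array([0, 1, 2, 3, df['VEHCOUNT'].max()])
group_names = ['Zero', 'One', 'Two', 'More']

df['VEHCOUNT_BINNED'] = pd.cut(df['VEHCOUNT'], bins, labels=group_names, include_lowest=True)
df.groupby(['VEHCOUNT_BINNED'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()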



| VEHCOUNT_BINNED | SEVERITYCODE | Share |
|---|---|---|
| Zero | 1 | 0.502741 |
| Zero | 2 | 0.497259 |
| One | 1 | 0.756526 |
| One | 2 | 0.243474 |
| Two | 1 | 0.579554 |
| Two | 2 | 0.420446 |
| More | 1 | 0.548113 |
| More | 2 | 0.451887 |
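
Because `pd.cut` intervals are closed on the right, the edges `[0, 1, 2, 3, max]` place counts of 0 and 1 together in the first bin, so the group names are shifted by one relative to the actual vehicle counts. The following is a minimal sketch of edges that would make each name match its count; the edge values and the toy series are assumptions for illustration, and the shares in the table above were computed with the original edges, not with these:

```python
import numpy as np
import pandas as pd

# Made-up vehicle counts, for illustration only.
veh = pd.Series([0, 1, 2, 3, 7])

# Right-closed intervals (-1, 0], (0, 1], (1, 2], (2, inf) as an alternative edge choice.
edges = [-1, 0, 1, 2, np.inf]
names = ["Zero", "One", "Two", "More"]

aligned = pd.cut(veh, bins=edges, labels=names)
print(pd.DataFrame({"VEHCOUNT": veh, "group": aligned}))
```

With these edges a count of 0 maps to 'Zero', 1 to 'One', 2 to 'Two', and anything above 2 to 'More'.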


7. INCDATE
Let us take a look at the date of the incident. Is there a significant difference between weekend and weekday accidents?


In [187]:
df['INCDATE'] = pd.to_datetime(df['INCDATE'])

# day of the week (Monday=0, ..., Sunday=6)
df['DAYOFWEEK'] = df['INCDATE'].dt.dayofweek

# flag accidents that happened from Friday to Sunday (dayofweek 4, 5 or 6)
df['WEEKEND'] = df['DAYOFWEEK'].apply(lambda x: 1 if (x>3)  else 0)
# severity shares for weekend vs. other days
print(df.groupby(['WEEKEND'])['SEVERITYCODE'].value_counts(normalize=True).to_frame())

# determine the month of the accident
df['MONTH']=df['INCDATE'].dt.month

# flag accidents that happened from April to September ('summer'); other months count as 'winter'
df['SUMMER'] = df['MONTH'].apply(lambda x: 1 if (x>3 and x<10)  else 0)
# severity counts for summer vs. winter
print(df.groupby(['SUMMER'])['SEVERITYCODE'].value_counts())

                      SEVERITYCODE
WEEKEND SEVERITYCODE              
0       1                 0.694865
        2                 0.305135
1       1                 0.709722
        2                 0.290278
SUMMER  SEVERITYCODE
0       1               68570
        2               28272
1       1               67915
        2               29916
Name: SEVERITYCODE, dtype: int64


Unfortunately, INCDATE did not provide useful information: the share of injury collisions differs by only about 1.5 percentage points, both between weekend and weekday accidents (29.0% vs. 30.5%) and between summer and winter accidents (30.6% vs. 29.2%).
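
The weekend flag depends on the convention that `dt.dayofweek` numbers Monday as 0 and Sunday as 6, so the condition `x>3` used above also counts Friday as weekend. The following is a minimal sketch on four made-up dates comparing that flag with a Saturday/Sunday-only variant; the variant and its name `sat_sun` are assumptions for illustration, not part of the analysis above:

```python
import pandas as pd

# Friday 3 Jan 2020 through Monday 6 Jan 2020 (made-up sample dates).
dates = pd.to_datetime(pd.Series(["2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06"]))

dow = dates.dt.dayofweek  # Monday=0, ..., Sunday=6

# Flag as used in the analysis: Friday, Saturday and Sunday.
fri_to_sun = (dow > 3).astype(int)
# Hypothetical stricter variant: Saturday and Sunday only.
sat_sun = (dow >= 5).astype(int)

print(pd.DataFrame({"date": dates, "dayofweek": dow,
                    "fri_to_sun": fri_to_sun, "sat_sun": sat_sun}))
```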

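8. JUNCTIONTYPE

For JUNCTIONTYPE, and likewise for the attributes WEATHER, ROADCOND and LIGHTCOND that follow, we replace missing values by 'Unknown' and compare the severity shares within each category.
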
In [188]:
# Missing junction types become their own 'Unknown' category
df["JUNCTIONTYPE"] = df["JUNCTIONTYPE"].fillna('Unknown')
df.groupby(['JUNCTIONTYPE'])['SEVERITYCODE'].value_counts(normalize=True)



JUNCTIONTYPE                                       SEVERITYCODE
At Intersection (but not related to intersection)  1               0.703051
                                                   2               0.296949
At Intersection (intersection related)             1               0.567362
                                                   2               0.432638
Driveway Junction                                  1               0.696936
                                                   2               0.303064
Mid-Block (but intersection related)               1               0.679816
                                                   2               0.320184
Mid-Block (not related to intersection)            1               0.783920
                                                   2               0.216080
Ramp Junction                                      1               0.674699
                                                   2               0.325301
Unknown                 

In [189]:
# Missing weather values become their own 'Unknown' category
df["WEATHER"] = df["WEATHER"].fillna('Unknown')
df.groupby(['WEATHER'])['SEVERITYCODE'].value_counts(normalize=True)


WEATHER                   SEVERITYCODE
Blowing Sand/Dirt         1               0.732143
                          2               0.267857
Clear                     1               0.677509
                          2               0.322491
Fog/Smog/Smoke            1               0.671353
                          2               0.328647
Other                     1               0.860577
                          2               0.139423
Overcast                  1               0.684456
                          2               0.315544
Partly Cloudy             2               0.600000
                          1               0.400000
Raining                   1               0.662815
                          2               0.337185
Severe Crosswind          1               0.720000
                          2               0.280000
Sleet/Hail/Freezing Rain  1               0.752212
                          2               0.247788
Snowing                   1               0

In [190]:
# Missing road conditions become their own 'Unknown' category
df["ROADCOND"] = df["ROADCOND"].fillna('Unknown')
df.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts(normalize=True)


ROADCOND        SEVERITYCODE
Dry             1               0.678227
                2               0.321773
Ice             1               0.774194
                2               0.225806
Oil             1               0.625000
                2               0.375000
Other           1               0.674242
                2               0.325758
Sand/Mud/Dirt   1               0.693333
                2               0.306667
Snow/Slush      1               0.833665
                2               0.166335
Standing Water  1               0.739130
                2               0.260870
Unknown         1               0.909955
                2               0.090045
Wet             1               0.668134
                2               0.331866
Name: SEVERITYCODE, dtype: float64

In [191]:
# Missing light conditions become their own 'Unknown' category
df["LIGHTCOND"] = df["LIGHTCOND"].fillna('Unknown')
df.groupby(['LIGHTCOND'])['SEVERITYCODE'].value_counts(normalize=True)


LIGHTCOND                 SEVERITYCODE
Dark - No Street Lights   1               0.782694
                          2               0.217306
Dark - Street Lights Off  1               0.736447
                          2               0.263553
Dark - Street Lights On   1               0.701589
                          2               0.298411
Dark - Unknown Lighting   1               0.636364
                          2               0.363636
Dawn                      1               0.670663
                          2               0.329337
Daylight                  1               0.668116
                          2               0.331884
Dusk                      1               0.670620
                          2               0.329380
Other                     1               0.778723
                          2               0.221277
Unknown                   1               0.909081
                          2               0.090919
Name: SEVERITYCODE, dtype: float64
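
The last four cells repeat the same two steps: fill missing values with 'Unknown', then compute severity shares per category. The following is a minimal sketch of a helper that wraps both steps; the function name `severity_share` and the toy frame are hypothetical, and the helper returns the shares without modifying the frame it is given:

```python
import numpy as np
import pandas as pd

def severity_share(frame, column, fill_value="Unknown"):
    """Return the share of each severity code within every category of `column`.

    Missing values in `column` are treated as the category `fill_value`.
    """
    filled = frame[column].fillna(fill_value)
    return frame["SEVERITYCODE"].groupby(filled).value_counts(normalize=True)

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1, 2, 1],
    "WEATHER": ["Clear", "Clear", np.nan, "Raining", "Raining", np.nan],
    "ROADCOND": ["Dry", "Dry", "Wet", np.nan, "Wet", "Dry"],
})

shares = {col: severity_share(toy, col) for col in ["WEATHER", "ROADCOND"]}
print(shares["WEATHER"])
```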

In [192]:
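# Severity shares by HITPARKEDCAR (computed before the column is recoded)
df.groupby(['HITPARKEDCAR'])['SEVERITYCODE'].value_counts(normalize=True)
# Encode as numbers: 'Y' becomes 0 and 'N' becomes 1, so 1 means that no parked car was hit
df['HITPARKEDCAR'] = df['HITPARKEDCAR'].replace(to_replace=['Y','N'], value=[0,1])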

## Feature selection and preparation

Now that we have decided which attributes might be of use, we have to build a new data frame in a format suitable for classification algorithms. First, we select the desired columns; then we replace the columns of object type by dummy (one-hot) columns.

In [193]:
index=['HITPARKEDCAR']
Feature=df[index]

In [200]:
Feature = pd.concat([Feature,pd.get_dummies(df['ADDRTYPE'])], axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['COLLISIONTYPE'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['PERSONCOUNT_BINNED'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['PEDCOUNT_BINNED'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['VEHCOUNT_BINNED'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['WEATHER'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['ROADCOND'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['LIGHTCOND'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['JUNCTIONTYPE'])],axis=1)


Feature.head()

Unnamed: 0,HITPARKEDCAR,Alley,Block,Intersection,Angles,Block.1,Cycles,Head On,Left Turn,Other,...,Dusk,Other.1,Unknown,At Intersection (but not related to intersection),At Intersection (intersection related),Driveway Junction,Mid-Block (but intersection related),Mid-Block (not related to intersection),Ramp Junction,Unknown.1
0,1,0,0,1,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,1,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,1,0,0,1,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
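
Several attributes share category names ('Block', 'Other', 'Unknown'), so concatenating unprefixed dummy columns yields repeated column names, shown above as 'Block.1', 'Other.1' and 'Unknown.1'. The following is a minimal sketch on a made-up frame of one way to keep the names unique, by letting `pd.get_dummies` prefix each dummy with its source column; this is an alternative for illustration, not the method used above:

```python
import pandas as pd

# Made-up rows whose categories overlap across columns, for illustration only.
toy = pd.DataFrame({
    "ADDRTYPE": ["Block", "Intersection", "Block"],
    "COLLISIONTYPE": ["Other", "Angles", "Block"],
    "WEATHER": ["Clear", "Other", "Unknown"],
})

# Unprefixed dummies concatenated side by side: 'Block' and 'Other' appear twice.
naive = pd.concat([pd.get_dummies(toy[c]) for c in toy], axis=1)

# Passing `columns` makes get_dummies prefix each dummy with its source column name.
features = pd.get_dummies(toy, columns=["ADDRTYPE", "COLLISIONTYPE", "WEATHER"], dtype=int)

print(naive.columns.is_unique)
print(list(features.columns))
```

Unique names make it possible to select a single feature by name later, for example when inspecting which attributes a classifier relies on.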


## Data preparation summary

We selected 10 features of the data for analysis. We replaced missing values either by the most frequent value or by a separate category, and transformed the data into a format readable by classification algorithms.