# Car accident severity analysis for Coursera Capstone project

## Introduction

This project aims at better understanding of car accidents with respect to their severity. We divide car accidents into two classes 'injury collision' and 'property damage collision'. Using data obtained from car accidents in Seattle we try to build a model that could predict severity of a car accident based on accident's details such as date and time, number of people involved, location, weather...

Better understanding of causes that lead to severe car accidents could be utilized to adopt measures that could prevent severe car accidents. 

In this project we will mainly use classification algorithms to build a model that could classify car accidents according to severity.




## Data Understanding

We use data provided by Seattle Traffic Management Division (metadata describing our dataset are available at https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf). Dataset contains 194673 entries with 38 attributes, however, not every attribute will be useful for our analysis.

First, let us extract columns that could be potentially useful in our project.

In [1]:
import pandas as pd
import numpy as np
import folium

import matplotlib as mtp
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn import preprocessing
import re

In [2]:
df=pd.read_csv('Data-Collisions.csv')
df.shape

(194673, 38)

In [3]:
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

## Data cleaning

In the first step we drop redundant columns such as 'INCKEY', 'COLDETKEY', 'REPORTNO','STATUS', 'SEVERITYCODE.1', 'SEVERITYDESC', 'SDOT_COLDESC'. We also drop 'LOCATION' column, as it does not provide the accurate address of the accident.

In [4]:
df.drop(columns=['INTKEY', 'INCKEY','COLDETKEY', 'REPORTNO','STATUS', 'SEVERITYCODE.1', 'SEVERITYDESC', 'SDOT_COLDESC','LOCATION','EXCEPTRSNCODE','EXCEPTRSNDESC','SEGLANEKEY','CROSSWALKKEY','SDOTCOLNUM','X','Y','OBJECTID','ST_COLDESC', 'ST_COLCODE','SDOT_COLCODE'],inplace=True)
df.shape

(194673, 18)

Now let us take a look at missing values in our data frame.

In [5]:
columns=[]
missing=[]
missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    columns.append(column)
    missing.append(missing_data[column].value_counts())
    
missing_values=pd.DataFrame(missing)    
missing_values.replace(np.nan, 0, inplace=True)

missing_values.drop([0],axis=1,inplace=True)

#display number of missing values for each of the attributes
missing_values


Unnamed: 0,True
SEVERITYCODE,0.0
ADDRTYPE,1926.0
COLLISIONTYPE,4904.0
PERSONCOUNT,0.0
PEDCOUNT,0.0
PEDCYLCOUNT,0.0
VEHCOUNT,0.0
INCDATE,0.0
INCDTTM,0.0
JUNCTIONTYPE,6329.0


In [6]:
#dropping columns with significant number of missing values
df.drop(columns=['INATTENTIONIND','PEDROWNOTGRNT','SPEEDING'],inplace=True)
print(df.shape)
df.head()

(194673, 15)


Unnamed: 0,SEVERITYCODE,ADDRTYPE,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,INCDTTM,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,HITPARKEDCAR
0,2,Intersection,Angles,2,0,0,2,2013/03/27 00:00:00+00,3/27/2013 2:54:00 PM,At Intersection (intersection related),N,Overcast,Wet,Daylight,N
1,1,Block,Sideswipe,2,0,0,2,2006/12/20 00:00:00+00,12/20/2006 6:55:00 PM,Mid-Block (not related to intersection),0,Raining,Wet,Dark - Street Lights On,N
2,1,Block,Parked Car,4,0,0,3,2004/11/18 00:00:00+00,11/18/2004 10:20:00 AM,Mid-Block (not related to intersection),0,Overcast,Dry,Daylight,N
3,1,Block,Other,3,0,0,3,2013/03/29 00:00:00+00,3/29/2013 9:26:00 AM,Mid-Block (not related to intersection),N,Clear,Dry,Daylight,N
4,2,Intersection,Angles,2,0,0,2,2004/01/28 00:00:00+00,1/28/2004 8:04:00 AM,At Intersection (intersection related),0,Raining,Wet,Daylight,N


Now that we are left with only 15 attributes, we see that the dataset is well-defined in a sense that every accident has a severitycode and objectid assigned. Now we consider one attribute after the other to decide if it is a good candidate for the feature set. At the same time we deal with missing values in each column.

1. ADDRTYPE

Column ADDRTYPE takes three values 'Block', 'Intersection', 'Alley'. Most accidents happened at 'Block' (significantly more than at the other two places). As a result, we decided to replace missing values by 'Block'.
 

In [7]:
# number of values for ADDRTYPE
df['ADDRTYPE'].value_counts().to_frame()

Unnamed: 0,ADDRTYPE
Block,126926
Intersection,65070
Alley,751


In [8]:
df['ADDRTYPE'].replace(np.nan, 'Block', inplace=True)
# 
df.groupby(['ADDRTYPE'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
ADDRTYPE,SEVERITYCODE,Unnamed: 2_level_1
Alley,1,0.890812
Alley,2,0.109188
Block,1,0.764947
Block,2,0.235053
Intersection,1,0.572476
Intersection,2,0.427524


Clearly, most severe accidents happen at intersections.

2. COLLISIONTYPE
In the case of COLLISIONTYPE we do not observe any leading type unlike in the case of ADDRTYPE. For this reason, we replace missing values by 'Other'.



In [9]:
df['COLLISIONTYPE'].replace(np.nan, 'Other', inplace=True)

df.groupby(['COLLISIONTYPE'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
COLLISIONTYPE,SEVERITYCODE,Unnamed: 2_level_1
Angles,1,0.607083
Angles,2,0.392917
Cycles,2,0.876085
Cycles,1,0.123915
Head On,1,0.56917
Head On,2,0.43083
Left Turn,1,0.605123
Left Turn,2,0.394877
Other,1,0.749956
Other,2,0.250044


We observe significant differences in the ratio between severe and not severe accidents among different types of collisions. This makes COLLISIONTYPE a good attribute for our analysis.

3. PERSONCOUNT
We decided to group PERSONCOUNT values to two groups - less than two people and two or more people involved, since these two categories show different properties with respect to severity. High number of people involved shows more severe cases than low number. In the analysis we are going to exclude column PERSONCOUNT and use columns PERSONCOUNT_BINNED instead.

In [10]:
#creating two categories for PERSONCOUNT - less than two people involved ('Low_num') and two or more people involved ('High_num')
bins=np.array([0,2,max(df['PERSONCOUNT'])])
group_names = ['Low_num','High_num']


df['PERSONCOUNT_BINNED'] = pd.cut(df['PERSONCOUNT'], bins, labels=group_names, include_lowest=True )
df.groupby(['PERSONCOUNT_BINNED'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
PERSONCOUNT_BINNED,SEVERITYCODE,Unnamed: 2_level_1
Low_num,1,0.752733
Low_num,2,0.247267
High_num,1,0.589936
High_num,2,0.410064


4. PEDCOUNT
We observe a signigicant difference between cases where no pedestrian was involved and where a pedestrian took a part. Therefore, we group data into two groups 'zero pedestrians' (takes value 0) and 'pedestrian involved' (takes value 1).

5. PEDCYLCOUNT 
The same applies as for PEDCOUNT data.

6. VEHCOUNT
Similar to PERSONCOUNT

In [11]:
df['PEDCOUNT_BINNED'] = df['PEDCOUNT'].apply(lambda x: 1 if (x>0)  else 0)

df.groupby(['PEDCOUNT_BINNED'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
PEDCOUNT_BINNED,SEVERITYCODE,Unnamed: 2_level_1
0,1,0.723295
0,2,0.276705
1,2,0.899409
1,1,0.100591


In [12]:
df['PEDCYLCOUNT_BINNED'] = df['PEDCYLCOUNT'].apply(lambda x: 1 if (x>0)  else 0)

df.groupby(['PEDCYLCOUNT_BINNED'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
PEDCYLCOUNT_BINNED,SEVERITYCODE,Unnamed: 2_level_1
0,1,0.717832
0,2,0.282168
1,2,0.876185
1,1,0.123815


In [13]:
#creating two categories for VEHCOUNT 
bins=np.array([0,1,2,3,max(df['VEHCOUNT'])])
group_names = ['Zero','One','Two','More']


df['VEHCOUNT_BINNED'] = pd.cut(df['VEHCOUNT'], bins, labels=group_names, include_lowest=True )
df.groupby(['VEHCOUNT_BINNED'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()



Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
VEHCOUNT_BINNED,SEVERITYCODE,Unnamed: 2_level_1
Zero,1,0.502741
Zero,2,0.497259
One,1,0.756526
One,2,0.243474
Two,1,0.579554
Two,2,0.420446
More,1,0.548113
More,2,0.451887


7. INCDATE
Let us take a look at the date of the incident. Is there a signigicant difference between weekend and weekday accidents?


In [14]:
df['INCDATE'] = pd.to_datetime(df['INCDATE'])
df['INCDATE'].head()

#show day of week
df['DAYOFWEEK'] = df['INCDATE'].dt.dayofweek

#decide if accident happend on weekend or not
df['WEEKEND'] = df['DAYOFWEEK'].apply(lambda x: 1 if (x>3)  else 0)
#weekend severity score
print(df.groupby(['WEEKEND'])['SEVERITYCODE'].value_counts(normalize=True).to_frame())

#determine month of the accident
df['MONTH']=df['INCDATE'].dt.month

#accident happened in summer/winter
df['SUMMER'] = df['MONTH'].apply(lambda x: 1 if (x>3 and x<10)  else 0)
#summer severity score
print(df.groupby(['SUMMER'])['SEVERITYCODE'].value_counts(normalize=True).to_frame())

SEVERITYCODE
WEEKEND SEVERITYCODE              
0       1                 0.694865
        2                 0.305135
1       1                 0.709722
        2                 0.290278
                     SEVERITYCODE
SUMMER SEVERITYCODE              
0      1                 0.708061
       2                 0.291939
1      1                 0.694207
       2                 0.305793


In [15]:
df.drop(columns=['SUMMER','MONTH','DAYOFWEEK','WEEKEND'],inplace=True)

Unfortunately, INCDATE did not provide any useful information, as we do not see significant differences between weekday/weekend accident severity and summer/winter accident severity.

8. JUNCTIONTYPE



In [16]:
df["JUNCTIONTYPE"].replace(np.nan, 'Unknown', inplace=True)
df.groupby(['JUNCTIONTYPE'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()



Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
JUNCTIONTYPE,SEVERITYCODE,Unnamed: 2_level_1
At Intersection (but not related to intersection),1,0.703051
At Intersection (but not related to intersection),2,0.296949
At Intersection (intersection related),1,0.567362
At Intersection (intersection related),2,0.432638
Driveway Junction,1,0.696936
Driveway Junction,2,0.303064
Mid-Block (but intersection related),1,0.679816
Mid-Block (but intersection related),2,0.320184
Mid-Block (not related to intersection),1,0.78392
Mid-Block (not related to intersection),2,0.21608


9. WEATHER

In [17]:
df["WEATHER"].replace(np.nan, 'Unknown', inplace=True)
df.groupby(['WEATHER'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()


Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
WEATHER,SEVERITYCODE,Unnamed: 2_level_1
Blowing Sand/Dirt,1,0.732143
Blowing Sand/Dirt,2,0.267857
Clear,1,0.677509
Clear,2,0.322491
Fog/Smog/Smoke,1,0.671353
Fog/Smog/Smoke,2,0.328647
Other,1,0.860577
Other,2,0.139423
Overcast,1,0.684456
Overcast,2,0.315544


10. ROADCOND

In [18]:
df["ROADCOND"].replace(np.nan, 'Unknown', inplace=True)
df.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()


Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
ROADCOND,SEVERITYCODE,Unnamed: 2_level_1
Dry,1,0.678227
Dry,2,0.321773
Ice,1,0.774194
Ice,2,0.225806
Oil,1,0.625
Oil,2,0.375
Other,1,0.674242
Other,2,0.325758
Sand/Mud/Dirt,1,0.693333
Sand/Mud/Dirt,2,0.306667


11. LIGHTCOND

In [19]:
df["LIGHTCOND"].replace(np.nan, 'Unknown', inplace=True)
df.groupby(['LIGHTCOND'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()


Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
LIGHTCOND,SEVERITYCODE,Unnamed: 2_level_1
Dark - No Street Lights,1,0.782694
Dark - No Street Lights,2,0.217306
Dark - Street Lights Off,1,0.736447
Dark - Street Lights Off,2,0.263553
Dark - Street Lights On,1,0.701589
Dark - Street Lights On,2,0.298411
Dark - Unknown Lighting,1,0.636364
Dark - Unknown Lighting,2,0.363636
Dawn,1,0.670663
Dawn,2,0.329337


12. HITPARKED

In [20]:
df["HITPARKEDCAR"].replace('N', 0, inplace=True)
df["HITPARKEDCAR"].replace('Y', 1, inplace=True)
df["HITPARKEDCAR"]=df["HITPARKEDCAR"].astype("int")
df.groupby(['HITPARKEDCAR'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
HITPARKEDCAR,SEVERITYCODE,Unnamed: 2_level_1
0,1,0.691983
0,2,0.308017
1,1,0.937916
1,2,0.062084


13. UNDERINFL

For this feature, new problem arises - we have to group values 0 and N (for "not under influence") and 1 and Y (for "under influence"). At the same time we replace missing values by 0, since the majority of people involved in the accident were not under influence.

In [21]:
df["UNDERINFL"].replace(np.nan, 0, inplace=True)
df["UNDERINFL"].replace('N', 0, inplace=True)
df["UNDERINFL"].replace('Y', 1, inplace=True)
#changing the data type to integer

df["UNDERINFL"]=df["UNDERINFL"].astype("int")
df.groupby(['UNDERINFL'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,SEVERITYCODE
UNDERINFL,SEVERITYCODE,Unnamed: 2_level_1
0,1,0.705603
0,2,0.294397
1,1,0.609473
1,2,0.390527


In [22]:
#size of the dataset after cleaning
df.shape
df.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,INCDTTM,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,HITPARKEDCAR,PERSONCOUNT_BINNED,PEDCOUNT_BINNED,PEDCYLCOUNT_BINNED,VEHCOUNT_BINNED
0,2,Intersection,Angles,2,0,0,2,2013-03-27 00:00:00+00:00,3/27/2013 2:54:00 PM,At Intersection (intersection related),0,Overcast,Wet,Daylight,0,Low_num,0,0,One
1,1,Block,Sideswipe,2,0,0,2,2006-12-20 00:00:00+00:00,12/20/2006 6:55:00 PM,Mid-Block (not related to intersection),0,Raining,Wet,Dark - Street Lights On,0,Low_num,0,0,One
2,1,Block,Parked Car,4,0,0,3,2004-11-18 00:00:00+00:00,11/18/2004 10:20:00 AM,Mid-Block (not related to intersection),0,Overcast,Dry,Daylight,0,High_num,0,0,Two
3,1,Block,Other,3,0,0,3,2013-03-29 00:00:00+00:00,3/29/2013 9:26:00 AM,Mid-Block (not related to intersection),0,Clear,Dry,Daylight,0,High_num,0,0,Two
4,2,Intersection,Angles,2,0,0,2,2004-01-28 00:00:00+00:00,1/28/2004 8:04:00 AM,At Intersection (intersection related),0,Raining,Wet,Daylight,0,Low_num,0,0,One


## Feature selection and preparation

Now that we have decided which attributes might be of use, we have to make a new data frame in a format suitable for classification algoritms. First, we have to select desired columns and then we need to replace columns with object type by dummy columns.

In [23]:
index=['HITPARKEDCAR','UNDERINFL']
Feature=df[index]

In [24]:
Feature = pd.concat([Feature,pd.get_dummies(df['ADDRTYPE'])], axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['COLLISIONTYPE'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['PERSONCOUNT_BINNED'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['PEDCOUNT_BINNED'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['PEDCYLCOUNT_BINNED'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['VEHCOUNT_BINNED'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['WEATHER'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['ROADCOND'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['LIGHTCOND'])],axis=1)
Feature=pd.concat([Feature,pd.get_dummies(df['JUNCTIONTYPE'])],axis=1)


Feature.head()

Unnamed: 0,HITPARKEDCAR,UNDERINFL,Alley,Block,Intersection,Angles,Cycles,Head On,Left Turn,Other,...,Dusk,Other.1,Unknown,At Intersection (but not related to intersection),At Intersection (intersection related),Driveway Junction,Mid-Block (but intersection related),Mid-Block (not related to intersection),Ramp Junction,Unknown.1
0,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


## Data visualization

Let us visualize the accidents on map of Seattle. Firsty, we display the accidents together with their severity. Due to memory demands, we plot only first 10000 incidents.

In [25]:
from folium import plugins

seattle=folium.Map(location=[47.6178622,-122.3164431],zoom_start=11)
seattle
# instantiate a feature group for the incidents in the dataframe
incidents = folium.map.FeatureGroup()
# read the file once again
df_map=pd.read_csv('Data-Collisions.csv')
df_map=df_map[['X','Y','SEVERITYCODE']].head(10000)
df_map=df_map.dropna()

# instantiate a mark cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(seattle)

# loop through the dataframe and add each data point to the mark cluster
for lat, lng, label, in zip(df_map.Y, df_map.X, df_map.SEVERITYCODE):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(incidents)

# display map
seattle

It might be interesting to see the distribution of severe/not severe incidents on a map. To this end we will employ chlorophlet map with the color spectrum showing the number of accidents in a zip-code area.

First, we have to extract zip code from the data. Since column LOCATION did not include this information, we will use Google GeoData API. This API returns results about position given by coordinates which we have at our disposal.


#getting formatted address from Google API
google_api_key='removed_from_script'

import requests
address=[]
def get_coordinates(api_key, coordinates, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}&key={}'.format(coordinates,api_key)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        return results[0]['formatted_address']
    except:         
        return None

for item in range(0,df.shape[0]):    
    coordinates_x = str(df.iloc[item,2])
    coordinates_y=str(df.iloc[item,1])
    coordinates=coordinates_x+','+coordinates_y
    
    address.append(get_coordinates(google_api_key,coordinates))

address=pd.DataFrame(address)
address.to_csv('address.csv')

In [26]:
#transform data from address.csv and append it to our existing data frame
positions=[]
f = open("addresses.csv", "r")

positions=(f.readlines())

#extracting postal code using regular expressions
import re
zip_code=[]
for item in positions:
    y=re.findall('WA ([0-9.]+)', item)
    y=str(y).strip("'[]'")
    zip_code.append(y)



In [27]:
#geojson data for Seattle
seattle_geo = r'seattle.json'

In [28]:
#new column called zip_code
df['zip_code']=zip_code



In [29]:
#data frame for severity code=1
df1 = df[df['SEVERITYCODE'] ==1]
#only columns severity code and zip_code are of interest
df1=df1[['SEVERITYCODE','zip_code']]

#the same for severity code=2
df2=df[df['SEVERITYCODE'] ==2]
df2=df2[['SEVERITYCODE','zip_code']]

In [30]:
#count accidents with the same zip_code for df1 and df2
df1_map_data=df1['zip_code'].value_counts().to_frame()
df1_map=df1_map_data.reset_index()

df2_map_data=df2['zip_code'].value_counts().to_frame()
df2_map=df2_map_data.reset_index()


In [31]:
world_map = folium.Map(location=[47.6178622,-122.3164431],zoom_start=11)

# generate choropleth map using the number of cases in zip areas for severity=1
world_map.choropleth(
    geo_data=seattle_geo,
    data=df1_map,
    columns=[ 'index','zip_code'],
    key_on='feature.properties.ZCTA5CE10',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Car accidents without injury in Seattle'
)

# display map
world_map

In [32]:
world_map = folium.Map(location=[47.6178622,-122.3164431],zoom_start=11)

# generate choropleth map for accident with severity=2
world_map.choropleth(
    geo_data=seattle_geo,
    data=df2_map,
    columns=[ 'index','zip_code'],
    key_on='feature.properties.ZCTA5CE10',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Severe car accicents in Seattle'
)

# display map
world_map

We want to utilize zip_code data in analysis. Lets group areas by similar percentages with respect to severity.

In [33]:
#lets see severity percentage for areas
df_zip=df.groupby(['zip_code'])['SEVERITYCODE'].value_counts(normalize=True).to_frame()


In [34]:
#lets cut data into three categories according to ratio of severity 1. Write down zip codes for corresponding categories
#betweenn 0.68 and 0.72 for severity 1
group1=[ '98103', '98101',  '98108', '98117', 
       '98122', '98136',   '98104', '98105', '98134', 
       '98118', '98144', '98115',   '98146', 
        '98119', '98109', '98126', '98107', 
        '98178'  ,'dummy' ]
#less than 0.68 for severity 1
group2=['98106','98125','98133','98155','98164','98177','98181','98195']       
      
#more than 0.72 for severity 1
group3=['98102','98111','98112','98116','98121','98124','98154','98174','98199']       

#replace nan values
df["zip_code"].replace(np.nan, 'dummy', inplace=True)
df.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,INCDTTM,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,HITPARKEDCAR,PERSONCOUNT_BINNED,PEDCOUNT_BINNED,PEDCYLCOUNT_BINNED,VEHCOUNT_BINNED,zip_code
0,2,Intersection,Angles,2,0,0,2,2013-03-27 00:00:00+00:00,3/27/2013 2:54:00 PM,At Intersection (intersection related),0,Overcast,Wet,Daylight,0,Low_num,0,0,One,98125
1,1,Block,Sideswipe,2,0,0,2,2006-12-20 00:00:00+00:00,12/20/2006 6:55:00 PM,Mid-Block (not related to intersection),0,Raining,Wet,Dark - Street Lights On,0,Low_num,0,0,One,98103
2,1,Block,Parked Car,4,0,0,3,2004-11-18 00:00:00+00:00,11/18/2004 10:20:00 AM,Mid-Block (not related to intersection),0,Overcast,Dry,Daylight,0,High_num,0,0,Two,98101
3,1,Block,Other,3,0,0,3,2013-03-29 00:00:00+00:00,3/29/2013 9:26:00 AM,Mid-Block (not related to intersection),0,Clear,Dry,Daylight,0,High_num,0,0,Two,98174
4,2,Intersection,Angles,2,0,0,2,2004-01-28 00:00:00+00:00,1/28/2004 8:04:00 AM,At Intersection (intersection related),0,Raining,Wet,Daylight,0,Low_num,0,0,One,98108


In [35]:
#defining three areas
df['Area1'] = df['zip_code'].apply(lambda x: 1 if (x in group1)  else 0)
df['Area2'] = df['zip_code'].apply(lambda x: 1 if (x in group2)  else 0)
df['Area3'] = df['zip_code'].apply(lambda x: 1 if (x in group3)  else 0)


In [36]:
#extending the feature set
Feature=pd.concat([Feature,df['Area1']],axis=1)
Feature=pd.concat([Feature,df['Area2']],axis=1)
Feature=pd.concat([Feature,df['Area3']],axis=1)

Feature.head()


Unnamed: 0,HITPARKEDCAR,UNDERINFL,Alley,Block,Intersection,Angles,Cycles,Head On,Left Turn,Other,...,At Intersection (but not related to intersection),At Intersection (intersection related),Driveway Junction,Mid-Block (but intersection related),Mid-Block (not related to intersection),Ramp Junction,Unknown,Area1,Area2,Area3
0,0,0,0,0,1,1,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
3,0,0,0,1,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,1
4,0,0,0,0,1,1,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0


## Data preparation summary

We selected 15 significant features of the data for analysis. We replaced missing values meaningfully and transformed the data in a format readable by classification algorithms. The data are split into two groups - target data and Feature data.

# Data modelling

We consider two algorithms - decision tree and logistic regression. To find the best parameters, we use grid search algorithm. This method makes the task of finding the best parameters feasible as we do not have to manually split the dataset into train/test set and perform cross-validation to get the best results.

In [37]:
#defining the feature set and target set
df_features=Feature
df_target=df['SEVERITYCODE']

#replacing severity code =2 by 1 and 1 by 0, so that for f1 score true positives are true severity cases
df_target.replace(1,0,inplace=True)
df_target.replace(2,1,inplace=True)

df_target=df_target.values



Decision Tree

In [46]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# defining different scoring methods to be considered
scoring = { 'Accuracy': make_scorer(accuracy_score),'F1': make_scorer(f1_score),'recall': make_scorer(recall_score),'precision': make_scorer(precision_score)}

#setting parameters for grid search
parameters = {'criterion': ('gini', 'entropy'), 'splitter':('best', 'random'),'max_depth':[6,7,8,9]}

model = DecisionTreeClassifier()
clf = GridSearchCV(model,
                  parameters,
                  scoring=scoring, refit='recall')
clf.fit(df_features, df_target)
results = clf.cv_results_

Results for desicion tree

The best parameters for model is desicion tree with criterion: etrophy, number of levels 8, splitter set to random. Mean accuracy for this model is 0.76. In the results report of grid search we can find f1 score for these parameters which is equal to 0.41. Recall score=0.28 and presicion score=0.75.

In [47]:
print(clf.best_params_)
print("best accuracy score", clf.best_score_)
results


{'criterion': 'entropy', 'max_depth': 8, 'splitter': 'best'}
best accuracy score 0.277600584018156


{'mean_fit_time': array([1.29079919, 1.23423538, 1.25961967, 1.30539041, 1.40552764,
        1.40962934, 1.57742171, 1.58021927, 1.13249793, 1.14508905,
        1.73632345, 1.27341042, 1.46888938, 1.38933916, 1.49987001,
        1.46469073]),
 'std_fit_time': array([0.23006863, 0.24433241, 0.01902024, 0.02235897, 0.03724595,
        0.04411534, 0.03105164, 0.03378596, 0.03862535, 0.02511821,
        0.63027424, 0.08070541, 0.11018492, 0.06068196, 0.01847604,
        0.01166466]),
 'mean_score_time': array([0.18508496, 0.11792712, 0.11832685, 0.12452326, 0.1185277 ,
        0.12132397, 0.12232499, 0.13012013, 0.12312431, 0.12112541,
        0.1630991 , 0.12192397, 0.11672735, 0.11892633, 0.11672826,
        0.11752791]),
 'std_score_time': array([0.12335632, 0.00296479, 0.0030701 , 0.0069982 , 0.00135523,
        0.00466879, 0.00553209, 0.00172002, 0.00751758, 0.00248072,
        0.07543726, 0.0034614 , 0.00116622, 0.00493548, 0.0007487 ,
        0.0018535 ]),
 'param_criterion': masked

Logistic regression

In [48]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

#parameters for grid search
parameters = {'solver':('newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'), 'C':[0.01,0.1,1]}

model = LogisticRegression()
clf = GridSearchCV(model,
                  parameters,
                  scoring=scoring,
                  refit='recall')
clf.fit(df_features, df_target)
results = clf.cv_results_

In [49]:
print(clf.best_params_)
results

{'C': 1, 'solver': 'newton-cg'}


{'mean_fit_time': array([12.18444595,  5.65389533,  1.49407468,  3.39989185,  5.1943809 ,
        18.84751601,  5.80640264,  1.89322686,  5.50618701, 10.36837382,
        33.89448681,  6.62069626,  2.27159038, 12.27520509, 24.10935593]),
 'std_fit_time': array([0.54816385, 0.36400149, 0.0931729 , 0.14706305, 0.11208778,
        1.24467119, 0.23635983, 0.03442679, 1.02930542, 2.68635199,
        4.34500794, 1.65622161, 0.19605418, 0.22458928, 0.55973267]),
 'mean_score_time': array([0.15870252, 0.13191895, 0.14431071, 0.13571739, 0.13471599,
        0.14610896, 0.12772012, 0.13092179, 0.14970765, 0.22805772,
        0.24444809, 0.12951927, 0.12352433, 0.12632251, 0.12192492]),
 'std_score_time': array([0.02301053, 0.00540048, 0.00996575, 0.00810373, 0.00548898,
        0.00748907, 0.00386528, 0.00424092, 0.01000256, 0.0740516 ,
        0.07290198, 0.01627064, 0.00360851, 0.00694105, 0.00178795]),
 'param_C': masked_array(data=[0.01, 0.01, 0.01, 0.01, 0.01, 0.1, 0.1, 0.1, 0.1, 0.1,
     

Results for logistic regression

The best parameters for model is logistic regression with criterion: C: 1, solver: 'newton-cg'. Mean accuracy for this model is 0.76. In the results report of grid search we can find f1 score for these parameters which is equal to 0.44. Recall score=0.32 and presicion score=0.71.