# **Capstone Project - Car accident severity**

### 1) Introduction/Business Problem

#####  Task: Clearly define a problem or an idea of your choice. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem.

The aim of this study is to predict the severity of a car accident. The result would be a tool that, given the weather, the road conditions and other variables, warns people about the possibility of getting into a car accident and how severe it would be, so that they can drive more carefully or even change their travel plans if they are able to.

### 2) Data

##### Task: Describe the data that you will be using to solve the problem or execute your idea. So make sure that you provide adequate explanation and discussion, with examples, of the data that you will be using.

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('Data-Collisions.csv')


Let's take a look at the database.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
SEVERITYCODE      194673 non-null int64
X                 189339 non-null float64
Y                 189339 non-null float64
OBJECTID          194673 non-null int64
INCKEY            194673 non-null int64
COLDETKEY         194673 non-null int64
REPORTNO          194673 non-null object
STATUS            194673 non-null object
ADDRTYPE          192747 non-null object
INTKEY            65070 non-null float64
LOCATION          191996 non-null object
EXCEPTRSNCODE     84811 non-null object
EXCEPTRSNDESC     5638 non-null object
SEVERITYCODE.1    194673 non-null int64
SEVERITYDESC      194673 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDATE           194673 non-null object
INCDTTM           194673 non-null object

'SEVERITYCODE' is our target label. Let's analyse it.

In [4]:
df['SEVERITYCODE'].head()

0    2
1    1
2    1
3    1
4    2
Name: SEVERITYCODE, dtype: int64

In [5]:
df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

We have just two classes, which will simplify our job with the classifier later. The metadata explains the severity codes:

1) property damage

2) injury
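The class counts above can also be read as proportions, which is worth keeping in mind when a classifier is evaluated later. The snippet below is a minimal sketch that rebuilds the two counts by hand; on the full frame, `df['SEVERITYCODE'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

# Share of each class in the whole dataset.
shares = counts / counts.sum()
print(shares.round(3))
```

Roughly 70% of the accidents are property damage only, so the two classes are not balanced.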


Now let's analyse the features and select the ones that are appropriate for the problem. To be suitable for a classification problem, features should be categorical variables. If a useful feature is not categorical, we can still derive dummy variables from it, so we do not exclude non-categorical variables a priori.

Let's start with the variables that refer to the location of the accident.
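As an illustration of what "getting dummy variables" means, here is a minimal sketch with `pd.get_dummies` on a toy frame. The column name `ADDRTYPE` comes from the listing above; the four rows are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy frame: the column name is from the dataset, the rows are illustrative.
toy = pd.DataFrame({"ADDRTYPE": ["Block", "Intersection", "Alley", "Block"]})

# One 0/1 indicator column per distinct category.
dummies = pd.get_dummies(toy["ADDRTYPE"], prefix="ADDRTYPE", dtype=int)
print(dummies)
```

Each row has exactly one indicator set to 1, so the original category can always be recovered from the dummy columns.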

In [6]:
df[['X','Y','LOCATION']]

Unnamed: 0,X,Y,LOCATION
0,-122.323148,47.703140,5TH AVE NE AND NE 103RD ST
1,-122.347294,47.647172,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N
2,-122.334540,47.607871,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST
3,-122.334803,47.604803,2ND AVE BETWEEN MARION ST AND MADISON ST
4,-122.306426,47.545739,SWIFT AVE S AND SWIFT AV OFF RP
...,...,...,...
194668,-122.290826,47.565408,34TH AVE S BETWEEN S DAKOTA ST AND S GENESEE ST
194669,-122.344526,47.690924,AURORA AVE N BETWEEN N 85TH ST AND N 86TH ST
194670,-122.306689,47.683047,20TH AVE NE AND NE 75TH ST
194671,-122.355317,47.678734,GREENWOOD AVE N AND N 68TH ST


'LOCATION' is a text description of where the accident happened. It is not suitable as a categorical variable, and since it usually names two streets, it is not suitable for conversion into dummy variables either. 'X' and 'Y' are coordinates (longitude and latitude). We could retrieve the district of the city with geopy and assign a dummy variable to each district. Let's try:

In [7]:
!pip install geopy



In [8]:
from geopy.geocoders import Nominatim
locator = Nominatim(user_agent='myGeocoder')
a = locator.reverse('47.703140,-122.323148')
a

Location(Thornton Creek Apartments, 514, Northeast 103rd Street, Licton Springs, Maple Leaf, Seattle, King County, Washington, 98125, United States of America, (47.7033299, -122.3220510871665, 0.0))

The problem, again, is that an accident can fall between two districts, as in this example: Licton Springs and Maple Leaf. This is not a good variable to transform into a dummy one, since it would be a combination of two or more variables (i.e. the districts). We could still split the city into north and south (or east and west) using the coordinates. If I have time left, I will try it. For now, we drop 'X', 'Y' and 'LOCATION'.
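The north/south split mentioned above could look like the following. This is only a sketch: the five latitudes are the first 'Y' values shown earlier, and using the median latitude as the cut-off is an assumption made for the example, not a choice made in this study.

```python
import pandas as pd

# Latitudes ('Y') of the first five rows shown above.
toy = pd.DataFrame({"Y": [47.703140, 47.647172, 47.607871, 47.604803, 47.545739]})

# Assumption: split at the median latitude; any other cut-off could be used.
threshold = toy["Y"].median()

# 1 = north of the cut-off, 0 = south of it (rows with a missing Y would also get 0).
toy["NORTH"] = (toy["Y"] > threshold).astype(int)
print(toy)
```

The same idea applied to 'X' would give an east/west indicator.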

In [9]:
df.drop(['X', 'Y', 'LOCATION'], axis=1, inplace=True)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 35 columns):
SEVERITYCODE      194673 non-null int64
OBJECTID          194673 non-null int64
INCKEY            194673 non-null int64
COLDETKEY         194673 non-null int64
REPORTNO          194673 non-null object
STATUS            194673 non-null object
ADDRTYPE          192747 non-null object
INTKEY            65070 non-null float64
EXCEPTRSNCODE     84811 non-null object
EXCEPTRSNDESC     5638 non-null object
SEVERITYCODE.1    194673 non-null int64
SEVERITYDESC      194673 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDATE           194673 non-null object
INCDTTM           194673 non-null object
JUNCTIONTYPE      188344 non-null object
SDOT_COLCODE      194673 non-null int64
SDOT_COLDESC      194673 non-null object


The next variables, 'OBJECTID', 'INCKEY', 'COLDETKEY' and 'REPORTNO', are just identifier codes, so we discard them.

In [11]:
df.drop(['OBJECTID', 'INCKEY', 'COLDETKEY','REPORTNO'], axis=1, inplace=True)

In [12]:
df['STATUS'].value_counts()

Matched      189786
Unmatched      4887
Name: STATUS, dtype: int64

'STATUS' is not described in the metadata and I do not know what it refers to, so I discard it too.

In [13]:
df.drop(['STATUS'], axis=1, inplace=True)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 30 columns):
SEVERITYCODE      194673 non-null int64
ADDRTYPE          192747 non-null object
INTKEY            65070 non-null float64
EXCEPTRSNCODE     84811 non-null object
EXCEPTRSNDESC     5638 non-null object
SEVERITYCODE.1    194673 non-null int64
SEVERITYDESC      194673 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDATE           194673 non-null object
INCDTTM           194673 non-null object
JUNCTIONTYPE      188344 non-null object
SDOT_COLCODE      194673 non-null int64
SDOT_COLDESC      194673 non-null object
INATTENTIONIND    29805 non-null object
UNDERINFL         189789 non-null object
WEATHER           189592 non-null object
ROADCOND          189661 non-null object
LIGHTCOND         189503 non-null object

In [15]:
df['ADDRTYPE'].value_counts()

Block           126926
Intersection     65070
Alley              751
Name: ADDRTYPE, dtype: int64

'ADDRTYPE' looks like a good variable that we can transform into a dummy one, so we keep it. The next ones, 'INTKEY', 'EXCEPTRSNCODE' and 'EXCEPTRSNDESC', are codes or carry no valuable information. 'SEVERITYCODE.1' looks like a copy of 'SEVERITYCODE', and 'SEVERITYDESC' contains the same information in text format. We discard all these columns.
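Whether one column really is a copy of another can be checked before dropping it. The sketch below uses a toy frame with the two column names from the dataset and illustrative values; on the real data the same comparison would be run on `df`.

```python
import pandas as pd

# Toy frame: column names from the dataset, values are illustrative.
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 1, 2],
    "SEVERITYCODE.1": [2, 1, 1, 1, 2],
})

# True only if the two columns agree in every row.
is_copy = bool((toy["SEVERITYCODE"] == toy["SEVERITYCODE.1"]).all())
print(is_copy)
```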

In [16]:
df.drop(['INTKEY','EXCEPTRSNCODE','EXCEPTRSNDESC','SEVERITYCODE.1','SEVERITYDESC'], axis=1, inplace=True)

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 25 columns):
SEVERITYCODE      194673 non-null int64
ADDRTYPE          192747 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDATE           194673 non-null object
INCDTTM           194673 non-null object
JUNCTIONTYPE      188344 non-null object
SDOT_COLCODE      194673 non-null int64
SDOT_COLDESC      194673 non-null object
INATTENTIONIND    29805 non-null object
UNDERINFL         189789 non-null object
WEATHER           189592 non-null object
ROADCOND          189661 non-null object
LIGHTCOND         189503 non-null object
PEDROWNOTGRNT     4667 non-null object
SDOTCOLNUM        114936 non-null float64
SPEEDING          9333 non-null object
ST_COLCODE        194655 non-null object
ST_COLDESC        189769 non-null object

In [18]:
df['COLLISIONTYPE'].value_counts()

Parked Car    47987
Angles        34674
Rear Ended    34090
Other         23703
Sideswipe     18609
Left Turn     13703
Pedestrian     6608
Cycles         5415
Right Turn     2956
Head On        2024
Name: COLLISIONTYPE, dtype: int64

We keep 'COLLISIONTYPE' since it will make a good dummy variable. We also keep 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT' and 'VEHCOUNT', which are good categorical variables.

In [19]:
df[['INCDATE','INCDTTM']]

Unnamed: 0,INCDATE,INCDTTM
0,2013/03/27 00:00:00+00,3/27/2013 2:54:00 PM
1,2006/12/20 00:00:00+00,12/20/2006 6:55:00 PM
2,2004/11/18 00:00:00+00,11/18/2004 10:20:00 AM
3,2013/03/29 00:00:00+00,3/29/2013 9:26:00 AM
4,2004/01/28 00:00:00+00,1/28/2004 8:04:00 AM
...,...,...
194668,2018/11/12 00:00:00+00,11/12/2018 8:12:00 AM
194669,2018/12/18 00:00:00+00,12/18/2018 9:14:00 AM
194670,2019/01/19 00:00:00+00,1/19/2019 9:25:00 AM
194671,2019/01/15 00:00:00+00,1/15/2019 4:48:00 PM


'INCDATE' and 'INCDTTM' report the date and time of the accident. We keep 'INCDTTM' since it is more complete and discard 'INCDATE'.
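Since 'INCDTTM' is stored as text, it has to be parsed before the hour or weekday can be used. The following is a minimal sketch on three strings copied from the table above; the format string is inferred from those samples, and the column names `HOUR` and `WEEKDAY` are hypothetical.

```python
import pandas as pd

# Sample strings copied from the 'INCDTTM' column shown above.
incdttm = pd.Series([
    "3/27/2013 2:54:00 PM",
    "12/20/2006 6:55:00 PM",
    "11/18/2004 10:20:00 AM",
])

# errors='coerce' turns any string that does not match the format into NaT
# instead of raising, which is safer if some rows are formatted differently.
parsed = pd.to_datetime(incdttm, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Hypothetical derived features: hour of day and day of week (Monday = 0).
features = pd.DataFrame({"HOUR": parsed.dt.hour, "WEEKDAY": parsed.dt.dayofweek})
print(features)
```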

In [20]:
df.drop(['INCDATE'], axis=1, inplace=True)

In [21]:
df['JUNCTIONTYPE'].value_counts()

Mid-Block (not related to intersection)              89800
At Intersection (intersection related)               62810
Mid-Block (but intersection related)                 22790
Driveway Junction                                    10671
At Intersection (but not related to intersection)     2098
Ramp Junction                                          166
Unknown                                                  9
Name: JUNCTIONTYPE, dtype: int64

We keep 'JUNCTIONTYPE'. We also keep 'SDOT_COLCODE', which describes the accident with a code, and discard 'SDOT_COLDESC', its text description.

In [22]:
df['SDOT_COLDESC'].value_counts()

MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE          85209
MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END                    54299
MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE SIDESWIPE          9928
NOT ENOUGH INFORMATION / NOT APPLICABLE                          9787
MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT                    8856
MOTOR VEHCILE STRUCK PEDESTRIAN                                  6518
MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE AT ANGLE           5852
MOTOR VEHICLE STRUCK OBJECT IN ROAD                              4741
MOTOR VEHICLE STRUCK PEDALCYCLIST, FRONT END AT ANGLE            3104
MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE         1604
MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE AT ANGLE          1440
PEDALCYCLIST STRUCK MOTOR VEHICLE FRONT END AT ANGLE             1312
MOTOR VEHICLE OVERTURNED IN ROAD                                  479
MOTOR VEHICLE STRUCK PEDALCYCLIST, REAR END                       181
PEDALCYCLIST STRUCK 

In [23]:
df.drop(['SDOT_COLDESC'], axis=1, inplace=True)

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 23 columns):
SEVERITYCODE      194673 non-null int64
ADDRTYPE          192747 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDTTM           194673 non-null object
JUNCTIONTYPE      188344 non-null object
SDOT_COLCODE      194673 non-null int64
INATTENTIONIND    29805 non-null object
UNDERINFL         189789 non-null object
WEATHER           189592 non-null object
ROADCOND          189661 non-null object
LIGHTCOND         189503 non-null object
PEDROWNOTGRNT     4667 non-null object
SDOTCOLNUM        114936 non-null float64
SPEEDING          9333 non-null object
ST_COLCODE        194655 non-null object
ST_COLDESC        189769 non-null object
SEGLANEKEY        194673 non-null int64
CROSSWALKKEY      194673 non-null int64

We keep 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SPEEDING' and 'HITPARKEDCAR'. We discard 'SEGLANEKEY', 'CROSSWALKKEY' and 'SDOTCOLNUM', and also 'ST_COLCODE' and 'ST_COLDESC', since they carry information similar to 'SDOT_COLCODE'.

In [25]:
df.drop(['SEGLANEKEY','CROSSWALKKEY','SDOTCOLNUM','ST_COLCODE','ST_COLDESC'], axis=1, inplace=True)

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 18 columns):
SEVERITYCODE      194673 non-null int64
ADDRTYPE          192747 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDTTM           194673 non-null object
JUNCTIONTYPE      188344 non-null object
SDOT_COLCODE      194673 non-null int64
INATTENTIONIND    29805 non-null object
UNDERINFL         189789 non-null object
WEATHER           189592 non-null object
ROADCOND          189661 non-null object
LIGHTCOND         189503 non-null object
PEDROWNOTGRNT     4667 non-null object
SPEEDING          9333 non-null object
HITPARKEDCAR      194673 non-null object
dtypes: int64(6), object(12)
memory usage: 26.7+ MB


We convert the object-type columns into categorical variables.

In [27]:
# 'Y' becomes 1; missing values are treated as 0 (not flagged).
# Assigning the result back avoids relying on inplace replace on a column selection.
df['INATTENTIONIND'] = df['INATTENTIONIND'].map({'Y': 1}).fillna(0).astype(int)

In [28]:
df['INATTENTIONIND'].value_counts()

0    164868
1     29805
Name: INATTENTIONIND, dtype: int64

In [29]:
df['UNDERINFL'].unique()

array(['N', '0', nan, '1', 'Y'], dtype=object)

In [30]:
# The column mixes 'Y'/'N' with '1'/'0'; map both encodings to integers.
# Missing values are treated as 0.
underinfl_map = {'Y': 1, 'N': 0, '1': 1, '0': 0}
df['UNDERINFL'] = df['UNDERINFL'].map(underinfl_map).fillna(0).astype(int)

In [31]:
df['UNDERINFL'].value_counts()

0    185552
1      9121
Name: UNDERINFL, dtype: int64

In [32]:
df['PEDROWNOTGRNT'].value_counts()

Y    4667
Name: PEDROWNOTGRNT, dtype: int64

In [33]:
# Only 'Y' is recorded; missing values are treated as 0.
df['PEDROWNOTGRNT'] = df['PEDROWNOTGRNT'].map({'Y': 1}).fillna(0).astype(int)

In [34]:
df['PEDROWNOTGRNT'].value_counts()

0    190006
1      4667
Name: PEDROWNOTGRNT, dtype: int64

In [35]:
df['SPEEDING'].value_counts()

Y    9333
Name: SPEEDING, dtype: int64

In [36]:
# Only 'Y' is recorded; missing values are treated as 0.
df['SPEEDING'] = df['SPEEDING'].map({'Y': 1}).fillna(0).astype(int)

In [37]:
df['SPEEDING'].head()

0    0
1    0
2    0
3    0
4    0
Name: SPEEDING, dtype: int64

In [38]:
df['HITPARKEDCAR'].value_counts()

N    187457
Y      7216
Name: HITPARKEDCAR, dtype: int64

In [39]:
# The column has no missing values, so a direct mapping is enough.
df['HITPARKEDCAR'] = df['HITPARKEDCAR'].map({'Y': 1, 'N': 0})
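The flag conversions in this section all follow the same pattern, so they could be collected into one helper. This is a sketch only: `flag_to_int` is a hypothetical name, the toy frame is made up, and treating missing values as 0 is the same assumption used in the cells above.

```python
import pandas as pd

def flag_to_int(series: pd.Series) -> pd.Series:
    """Map 'Y'/'1' to 1 and everything else (including missing values) to 0."""
    mapping = {"Y": 1, "1": 1, "N": 0, "0": 0}
    # Values not in the mapping become NaN, which fillna then turns into 0.
    return series.map(mapping).fillna(0).astype(int)

# Toy frame: column names from the dataset, rows are illustrative.
toy = pd.DataFrame({
    "SPEEDING": ["Y", None, None],
    "UNDERINFL": ["N", "1", None],
})

for col in ["SPEEDING", "UNDERINFL"]:
    toy[col] = flag_to_int(toy[col])
print(toy)
```

Using one helper keeps the encoding of every indicator column consistent and makes the assumption about missing values explicit in a single place.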

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 18 columns):
SEVERITYCODE      194673 non-null int64
ADDRTYPE          192747 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDTTM           194673 non-null object
JUNCTIONTYPE      188344 non-null object
SDOT_COLCODE      194673 non-null int64
INATTENTIONIND    194673 non-null int64
UNDERINFL         194673 non-null int64
WEATHER           189592 non-null object
ROADCOND          189661 non-null object
LIGHTCOND         189503 non-null object
PEDROWNOTGRNT     194673 non-null int64
SPEEDING          194673 non-null int64
HITPARKEDCAR      194673 non-null int64
dtypes: int64(11), object(7)
memory usage: 26.7+ MB


I will deal with the non-categorical variables in the EDA section.