# **Car Accident Severity in Seattle**
## Applied Data Science Capstone - Coursera

NOTE: This notebook shows the process of building a machine learning model for accident severity prediction. It is part of the final capstone project on Coursera for the IBM Professional Certificate in Data Science.

## **Introduction**

Car accidents happen every day for a variety of reasons and carry significant socioeconomic costs. Campaigns promoting mindful driving have run across the USA, and the authorities provide infrastructure (e.g. road signs, traffic lights, traffic information, radars) to reduce the probability of accidents.
Today we have the data and the modeling capacity to better understand the conditions that lead to severe accidents. This project builds a machine learning model on available data to inform decision-makers in the city of Seattle, so that the authorities can take appropriate measures to reduce accident severity and improve traffic safety.


## **Data**

In [18]:
import numpy as np
import pandas as pd

In [19]:
df = pd.read_csv(r'C:\Users\marco\Desktop\Data Science\IBM Coursera\Capstone project\Data-Collisions.csv')
df.head()


Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


The data was provided by the Seattle Police Department and covers collisions registered between 2004 and 2020. It is stored in a CSV file with 38 columns and 194673 rows, and describes the details of each accident, including weather conditions, collision type, date/time of the accident and location.
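The row and column counts quoted above can be checked right after loading. The sketch below is only illustrative: it reads a tiny in-memory stand-in (`sample_csv`, a hypothetical name) instead of the real file, and it passes `low_memory=False`, an assumption on my part rather than something the notebook does, so that pandas infers each column type from the whole file.

```python
import io

import pandas as pd

# Tiny stand-in for Data-Collisions.csv so the sketch runs on its own;
# the real file has 38 columns and 194673 rows.
sample_csv = io.StringIO(
    "SEVERITYCODE,X,Y,ST_COLCODE\n"
    "2,-122.323148,47.70314,10\n"
    "1,-122.347294,47.647172,11\n"
    "1,-122.33454,47.607871,\n"
)

# low_memory=False (assumption, not in the notebook) makes pandas infer
# each column's dtype from the whole file rather than chunk by chunk.
sample_df = pd.read_csv(sample_csv, low_memory=False)

n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```

On the real dataset the same `shape` call should report the 194673 rows and 38 columns mentioned above.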

The dataset contains three types of variables: integers (12), floats (4) and objects (22), as we can see below:

In [20]:
df.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object
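The per-type totals (12 integers, 4 floats, 22 objects) do not have to be counted by hand. A minimal sketch, using a hypothetical four-column `toy` frame in place of the full dataset, tallies the dtypes directly. Note that newer pandas versions may report text columns under a name other than `object`.

```python
import pandas as pd

# Hypothetical miniature of the collisions table: one integer column,
# one float column and two text columns.
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1],
    "X": [-122.323148, -122.347294],
    "WEATHER": ["Overcast", "Raining"],
    "ROADCOND": ["Wet", "Wet"],
})

# Convert each dtype to its name, then count how often each name occurs.
type_counts = toy.dtypes.astype(str).value_counts()
print(type_counts)
```

Applied to `df`, the same expression gives the three totals in one call.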

The variable SEVERITYCODE encodes the Seattle Department of Transportation (SDOT) accident severity metric and will be our 'dependent variable' (the variable we want to predict). The codes and their meanings are as follows:

* 0: Unknown
* 1: Property damage
* 2: Injury
* 2b: Serious injury
* 3: Fatality
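For readable tables and plots, the codes listed above can be translated into their labels. The sketch below is one possible way to do it; `SEVERITY_LABELS` and the sample `codes` series are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Codes and labels copied from the list above. "2b" is not numeric,
# so every key is kept as a string.
SEVERITY_LABELS = {
    "0": "Unknown",
    "1": "Property damage",
    "2": "Injury",
    "2b": "Serious injury",
    "3": "Fatality",
}

# Hypothetical sample of SEVERITYCODE values.
codes = pd.Series([2, 1, 1, 2], name="SEVERITYCODE")

# Cast to string first so integer codes match the string keys.
labels = codes.astype(str).map(SEVERITY_LABELS)
print(labels.tolist())
```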

In [21]:
df['SEVERITYCODE'].value_counts().to_frame()

Unnamed: 0,SEVERITYCODE
1,136485
2,58188


The counts show that only two of the five severity levels appear in the dataset:
- 1: 136485 records
- 2: 58188 records

The data is unbalanced, since we have many more instances of 'severity 1' than of 'severity 2'. Data must be balanced and normalized in the data processing step.
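To put numbers on the imbalance, the sketch below computes the class shares and the majority-to-minority ratio from the two counts reported above (about 70% of records are 'severity 1'). It then shows random downsampling on a hypothetical `toy` frame. Downsampling is only one possible balancing strategy and is an assumption here; the text does not say which method is used, and `downsample` is a hypothetical helper.

```python
import pandas as pd

# Class counts reported by value_counts() above.
counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

shares = counts / counts.sum()                  # fraction of records per class
imbalance_ratio = counts.max() / counts.min()   # majority size / minority size
print(shares.round(3).to_dict())
print(round(imbalance_ratio, 2))


def downsample(frame, target, random_state=0):
    """Hypothetical helper: sample every class down to the smallest class size."""
    smallest = frame[target].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    return pd.concat(parts).sort_index()


# Hypothetical toy data: four 'severity 1' rows and two 'severity 2' rows.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 1, 1, 2, 2],
    "VEHCOUNT": [2, 2, 3, 1, 2, 4],
})
balanced = downsample(toy, "SEVERITYCODE")
print(balanced["SEVERITYCODE"].value_counts().to_dict())
```

Downsampling discards majority-class rows, so it trades data volume for balance; oversampling the minority class would be the opposite trade-off.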

We have 37 attributes (columns) that can be used for building the model, but not all are useful.

At this stage, the following columns were dropped from the dataset as they were deemed not useful for the model: 'OBJECTID', 'INCKEY', 'REPORTNO', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'JUNCTIONTYPE', 'STATUS', 'COLDETKEY', 'LOCATION', 'INTKEY', 'INCDATE', 'INCDTTM', 'SDOT_COLDESC', 'SDOTCOLNUM', 'ST_COLDESC', 'SEGLANEKEY' and 'CROSSWALKKEY'.

In [22]:
df.drop(
    columns=[
        'OBJECTID', 'INCKEY', 'REPORTNO', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC',
        'SEVERITYCODE.1', 'SEVERITYDESC', 'JUNCTIONTYPE', 'STATUS', 'COLDETKEY',
        'LOCATION', 'INTKEY', 'INCDATE', 'INCDTTM', 'SDOT_COLDESC', 'SDOTCOLNUM',
        'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY',
    ],
    inplace=True,
)
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,ADDRTYPE,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SPEEDING,ST_COLCODE,HITPARKEDCAR
0,2,-122.323148,47.70314,Intersection,Angles,2,0,0,2,11,,N,Overcast,Wet,Daylight,,,10,N
1,1,-122.347294,47.647172,Block,Sideswipe,2,0,0,2,16,,0,Raining,Wet,Dark - Street Lights On,,,11,N
2,1,-122.33454,47.607871,Block,Parked Car,4,0,0,3,14,,0,Overcast,Dry,Daylight,,,32,N
3,1,-122.334803,47.604803,Block,Other,3,0,0,3,11,,N,Clear,Dry,Daylight,,,23,N
4,2,-122.306426,47.545739,Intersection,Angles,2,0,0,2,11,,0,Raining,Wet,Daylight,,,10,N
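Dropping columns in place means that re-running the cell raises a `KeyError`, because the columns are already gone. A possible alternative, sketched below on a hypothetical `toy` frame, keeps the original frame untouched and assigns the reduced result to a new name (`slim`, also hypothetical). The `DROPPED` list repeats the 19 column names from the cell above, which leaves 19 of the original 38 columns.

```python
import pandas as pd

# The 19 columns removed in the notebook cell above.
DROPPED = [
    "OBJECTID", "INCKEY", "REPORTNO", "EXCEPTRSNCODE", "EXCEPTRSNDESC",
    "SEVERITYCODE.1", "SEVERITYDESC", "JUNCTIONTYPE", "STATUS", "COLDETKEY",
    "LOCATION", "INTKEY", "INCDATE", "INCDTTM", "SDOT_COLDESC", "SDOTCOLNUM",
    "ST_COLDESC", "SEGLANEKEY", "CROSSWALKKEY",
]

# Hypothetical one-row frame holding a few kept columns plus every dropped one.
kept = ["SEVERITYCODE", "X", "Y", "WEATHER"]
toy = pd.DataFrame({name: [0] for name in kept + DROPPED})

# Without inplace=True, drop() returns a new frame and leaves `toy` unchanged,
# so this line can be executed repeatedly without error.
slim = toy.drop(columns=DROPPED)
print(slim.columns.tolist())
```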
