# Coursera Capstone

This notebook analyses car crash severity data and builds an ML model that can help prevent such accidents by warning users when road conditions are accident-prone.

In [61]:
import numpy as np
import pandas as pd

In [62]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


### Business Understanding

The idea behind this project is to develop a system that warns road users of the likelihood and severity of an accident. The four factors that determine an accident are weather conditions, road conditions, vehicle characteristics and human error. Weather and road conditions can be expected to have a meaningful impact on both the likelihood and the severity of an accident. Vehicle characteristics matter too: a motorbike, for example, offers less protection than a car, so its rider is more likely to suffer a severe accident. All of these factors influence the final result and must be taken into consideration. Human error is difficult to quantify unless the distraction or error made by the driver has been recorded, and it is also difficult to predict (that would have to be solved with another model using a different set of attributes). We must look for the aforementioned attributes in our dataset and use them to estimate the likelihood and severity of accidents under varying road, weather and vehicle conditions.

### Data Understanding

Extracting the data and displaying it.

In [63]:
df_collisions = pd.read_csv('Data-Collisions.csv', low_memory=False)
df_collisions

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.703140,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.334540,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194668,2,-122.290826,47.565408,219543,309534,310814,E871089,Matched,Block,,...,Dry,Daylight,,,,24,From opposite direction - both moving - head-on,0,0,N
194669,1,-122.344526,47.690924,219544,309085,310365,E876731,Matched,Block,,...,Wet,Daylight,,,,13,From same direction - both going straight - bo...,0,0,N
194670,2,-122.306689,47.683047,219545,311280,312640,3809984,Matched,Intersection,24760.0,...,Dry,Daylight,,,,28,From opposite direction - one left turn - one ...,0,0,N
194671,2,-122.355317,47.678734,219546,309514,310794,3810083,Matched,Intersection,24349.0,...,Dry,Dusk,,,,5,Vehicle Strikes Pedalcyclist,4308,0,N


The dataset covers all car accidents in Seattle. It is provided by SPD and recorded by Traffic Records.

Only attributes that are not a direct result of the accident should be considered. A predictive model can only use attributes that were known before the accident happened; we cannot predict accidents with data that exists only because the accident took place.
As previously mentioned, road, weather and vehicle conditions are determining factors of an accident, as are the location and the location type. Human error is another factor, but it is almost impossible to determine the chances of the driver being distracted with this dataset. Only once the accident has happened can we find out whether human error was involved.

Initially, just by looking at the metadata, the attributes I would use are:

ADDRTYPE: Alley, Block, Intersection
LOCATION: General location of collision
JUNCTIONTYPE: Category of junction at which the collision took place
WEATHER: Weather description at the location and during the time of the collision
ROADCOND: Road condition at the location and during the time of the collision
LIGHTCOND: Light conditions at the location and during the time of the collision
INCDTTM: Date and time of incident. Accidents at night are more likely to be severe, accidents are more likely to occur during rush hour, etc.

PERSONCOUNT, PEDCOUNT, PEDCYLCOUNT, VEHCOUNT, INJURIES, SERIOUSINJURIES, FATALITIES and HITPARKEDCAR are all attributes that make it easy to deduce accident severity. But because this is a predictive model, and these attributes cannot be known before the accident happens, they are meaningless when trying to predict accident severity and probability. For example, instead of VEHCOUNT, traffic density per location would be an attribute that we could know beforehand and therefore a meaningful attribute for the model. The same applies to the other attributes. The number of people travelling in the car that had the accident would be useful as well, since you know how many people you are taking in your car before the trip starts. I will not go through the explanation of every single attribute; the main idea is to select attributes that are known and can be quantified before the accident happens and that directly contribute to the final result (whether an accident occurs and how severe it is). Data that is recorded as a result of the accident is not useful for predictions.
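
The selection rule described above can be sketched in code. This is only an illustration: the `toy` frame holds made-up values, and `POST_ACCIDENT_COLS` and `drop_post_accident_columns` are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

# Columns that are only known once the collision has happened
# (the list comes from the paragraph above).
POST_ACCIDENT_COLS = [
    "PERSONCOUNT", "PEDCOUNT", "PEDCYLCOUNT", "VEHCOUNT",
    "INJURIES", "SERIOUSINJURIES", "FATALITIES", "HITPARKEDCAR",
]


def drop_post_accident_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame without the outcome-derived columns."""
    present = [col for col in POST_ACCIDENT_COLS if col in frame.columns]
    return frame.drop(columns=present)


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Raining"],
    "ROADCOND": ["Dry", "Wet"],
    "VEHCOUNT": [2, 3],
    "HITPARKEDCAR": ["N", "N"],
    "SEVERITYCODE": [1, 2],
})

features = drop_post_accident_columns(toy)
print(list(features.columns))
```

Keeping the excluded names in one list makes the choice easy to review and to extend if further outcome-derived columns turn up.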

Our label, and what we are trying to predict, is SEVERITYCODE, which encodes the severity of each recorded accident.

We must also look at each column (attribute) of the dataset and determine whether data is skewed, biased or missing. We must correct these issues if our model is to be precise. We can also do statistical tests as an initial screening to determine correlations between different attributes and their respective labels.
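
A minimal sketch of such a screening, assuming a small made-up frame (`toy`) in place of the real data. It counts missing values per column and shows how the categories of one column are distributed, keeping missing entries visible.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", np.nan, "Unknown", "Clear"],
    "ROADCOND": ["Dry", "Wet", "Dry", np.nan, np.nan],
    "SEVERITYCODE": [1, 2, 1, 1, 2],
})

# Number of missing (NaN) entries in each column.
missing_per_column = toy.isna().sum()

# Share of each category; dropna=False keeps NaN as its own row.
weather_share = toy["WEATHER"].value_counts(dropna=False, normalize=True)

print(missing_per_column)
print(weather_share)
```

Note that a literal "Unknown" category is not counted as missing by `isna()`, so both need to be checked separately.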

In [64]:
##Dropping useless columns as well as inspecting for missing values and checking for bias

df = df_collisions[['ADDRTYPE','LOCATION','JUNCTIONTYPE','WEATHER','ROADCOND','LIGHTCOND','INCDTTM','SEVERITYCODE']]
df

Unnamed: 0,ADDRTYPE,LOCATION,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,INCDTTM,SEVERITYCODE
0,Intersection,5TH AVE NE AND NE 103RD ST,At Intersection (intersection related),Overcast,Wet,Daylight,3/27/2013 2:54:00 PM,2
1,Block,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On,12/20/2006 6:55:00 PM,1
2,Block,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,Mid-Block (not related to intersection),Overcast,Dry,Daylight,11/18/2004 10:20:00 AM,1
3,Block,2ND AVE BETWEEN MARION ST AND MADISON ST,Mid-Block (not related to intersection),Clear,Dry,Daylight,3/29/2013 9:26:00 AM,1
4,Intersection,SWIFT AVE S AND SWIFT AV OFF RP,At Intersection (intersection related),Raining,Wet,Daylight,1/28/2004 8:04:00 AM,2
...,...,...,...,...,...,...,...,...
194668,Block,34TH AVE S BETWEEN S DAKOTA ST AND S GENESEE ST,Mid-Block (not related to intersection),Clear,Dry,Daylight,11/12/2018 8:12:00 AM,2
194669,Block,AURORA AVE N BETWEEN N 85TH ST AND N 86TH ST,Mid-Block (not related to intersection),Raining,Wet,Daylight,12/18/2018 9:14:00 AM,1
194670,Intersection,20TH AVE NE AND NE 75TH ST,At Intersection (intersection related),Clear,Dry,Daylight,1/19/2019 9:25:00 AM,2
194671,Intersection,GREENWOOD AVE N AND N 68TH ST,At Intersection (intersection related),Clear,Dry,Dusk,1/15/2019 4:48:00 PM,2


In [65]:
df['JUNCTIONTYPE'].value_counts()

Mid-Block (not related to intersection)              89800
At Intersection (intersection related)               62810
Mid-Block (but intersection related)                 22790
Driveway Junction                                    10671
At Intersection (but not related to intersection)     2098
Ramp Junction                                          166
Unknown                                                  9
Name: JUNCTIONTYPE, dtype: int64

In [66]:
df.drop(['JUNCTIONTYPE'], axis=1 ,inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
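
The warning above appears because `df` was created by selecting columns from `df_collisions`, so pandas cannot tell whether an in-place change is meant to affect the original frame. One common way to avoid it, sketched here on a made-up frame (`source` and `subset` are hypothetical names), is to take an explicit copy and reassign the result of `drop` instead of using `inplace=True`.

```python
import pandas as pd

# Made-up rows, for illustration only.
source = pd.DataFrame({
    "ADDRTYPE": ["Block", "Intersection"],
    "JUNCTIONTYPE": [
        "Mid-Block (not related to intersection)",
        "At Intersection (intersection related)",
    ],
    "SEVERITYCODE": [1, 2],
})

# .copy() makes the subset an independent DataFrame, so later edits
# no longer refer back to `source`.
subset = source[["ADDRTYPE", "JUNCTIONTYPE", "SEVERITYCODE"]].copy()

# Reassigning the result of drop() avoids inplace=True altogether.
subset = subset.drop(columns=["JUNCTIONTYPE"])

print(list(subset.columns))
print(list(source.columns))
```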


In [67]:
df

Unnamed: 0,ADDRTYPE,LOCATION,WEATHER,ROADCOND,LIGHTCOND,INCDTTM,SEVERITYCODE
0,Intersection,5TH AVE NE AND NE 103RD ST,Overcast,Wet,Daylight,3/27/2013 2:54:00 PM,2
1,Block,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,Raining,Wet,Dark - Street Lights On,12/20/2006 6:55:00 PM,1
2,Block,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,Overcast,Dry,Daylight,11/18/2004 10:20:00 AM,1
3,Block,2ND AVE BETWEEN MARION ST AND MADISON ST,Clear,Dry,Daylight,3/29/2013 9:26:00 AM,1
4,Intersection,SWIFT AVE S AND SWIFT AV OFF RP,Raining,Wet,Daylight,1/28/2004 8:04:00 AM,2
...,...,...,...,...,...,...,...
194668,Block,34TH AVE S BETWEEN S DAKOTA ST AND S GENESEE ST,Clear,Dry,Daylight,11/12/2018 8:12:00 AM,2
194669,Block,AURORA AVE N BETWEEN N 85TH ST AND N 86TH ST,Raining,Wet,Daylight,12/18/2018 9:14:00 AM,1
194670,Intersection,20TH AVE NE AND NE 75TH ST,Clear,Dry,Daylight,1/19/2019 9:25:00 AM,2
194671,Intersection,GREENWOOD AVE N AND N 68TH ST,Clear,Dry,Dusk,1/15/2019 4:48:00 PM,2


In [68]:
df['WEATHER'].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [69]:
df['LIGHTCOND'].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

In [70]:
df['LOCATION'].value_counts()

BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB AND AURORA AVE N    276
BATTERY ST TUNNEL SB BETWEEN AURORA AVE N AND ALASKAN WY VI SB    271
N NORTHGATE WAY BETWEEN MERIDIAN AVE N AND CORLISS AVE N          265
AURORA AVE N BETWEEN N 117TH PL AND N 125TH ST                    254
6TH AVE AND JAMES ST                                              252
                                                                 ... 
NW 57TH ST BETWEEN 32ND AVE NW AND NW 58TH ST                       1
1ST AVE NE BETWEEN NE 57TH ST AND NE 58TH ST                        1
NE 38TH ST BETWEEN 42ND AVE NE AND 43RD AVE NE                      1
S GENESEE ST BETWEEN 29TH N AVE S AND M L KING JR WR WAY S          1
38TH AVE SW AND SW JUNEAU ST                                        1
Name: LOCATION, Length: 24102, dtype: int64

In [71]:
df['ROADCOND'].value_counts()

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [72]:
df['ADDRTYPE'].value_counts()

Block           126926
Intersection     65070
Alley              751
Name: ADDRTYPE, dtype: int64

In [73]:
df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
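
The counts above show that the label is imbalanced: class 1 makes up roughly 70% of the rows. A short sketch of how the class shares and the imbalance ratio could be computed from these counts (the variable names are introduced here for illustration):

```python
import pandas as pd

# Counts copied from the value_counts() output above.
severity_counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

# Fraction of rows in each class.
class_share = severity_counts / severity_counts.sum()

# How many class-1 rows there are for every class-2 row.
imbalance_ratio = severity_counts.max() / severity_counts.min()

print(class_share.round(3))
print(round(imbalance_ratio, 2))
```

An imbalance of this size matters for evaluation: a model that always predicts class 1 would already be right about 70% of the time, so plain accuracy can be misleading.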

In [74]:
df['INCDTTM'].value_counts()

11/2/2006                96
10/3/2008                91
11/5/2005                83
12/4/2004                74
6/1/2006                 73
                         ..
5/31/2012 8:41:00 AM      1
1/31/2005 1:18:00 PM      1
2/24/2014 6:34:00 AM      1
9/27/2015 6:34:00 AM      1
7/16/2014 11:49:00 AM     1
Name: INCDTTM, Length: 162058, dtype: int64

We can see from the date and time of the incidents that some entries contain the time of day whereas others contain only the day the incident occurred. I will delete the time-of-day data and concentrate on the dates.
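
A minimal sketch of how the two kinds of entries could be told apart and the date part parsed. It assumes the `month/day/year` layout seen in the output above; the `raw` series holds three sample strings in that layout, and the column names of `parsed` are chosen here for illustration.

```python
import pandas as pd

# Sample strings in the two layouts seen in INCDTTM.
raw = pd.Series(
    ["3/27/2013 2:54:00 PM", "11/2/2006", "12/20/2006 6:55:00 PM"],
    name="INCDTTM",
)

# Entries with a time of day contain a colon; date-only entries do not.
has_time = raw.str.contains(":", regex=False)

# Keep the text before the first space, i.e. the date part.
date_part = raw.str.split(" ", n=1).str[0]
dates = pd.to_datetime(date_part, format="%m/%d/%Y")

parsed = pd.DataFrame({
    "has_time": has_time,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "weekday": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(parsed)
```

Parsing into real dates rather than strings keeps the option of deriving weekday or month features later.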

In [75]:
df["INCDTTM"] = df["INCDTTM"].str.split(" ", n=1).str[0]  # Keep only the date part (text before the first space), dropping the time

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [82]:
df['INCDTTM'].value_counts() ###We can see it worked! Now remove the year!

12/20    643
11/2     642
10/28    639
11/1     637
11/15    635
        ... 
12/29    398
7/4      385
12/26    324
12/25    194
2/29     169
Name: INCDTTM, Length: 366, dtype: int64

In [80]:
df["INCDTTM"] = df["INCDTTM"].str.rsplit("/", n=1).str[0]  # Keep only month/day (text before the last slash), dropping the year

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [84]:
df['INCDTTM'].value_counts()
# Ideally I would split the dates into three categories: holidays and festive days, work days, and weekends.
# Days immediately after a big holiday would also be flagged, especially the evening after big national or
# local events (people usually drive home in the evening after the celebration).

# I would also use the time data to determine whether the accident occurred at night, during rush hour,
# or outside of these times, giving three categories. Rush hour means more traffic; night time means less
# visibility and a higher chance of user error due to tiredness or alcohol.
# The remaining category would cover everything outside the night and rush-hour categories.
# Due to time constraints I will only use the dates without the years, and I will not create extra categories
# (although this would be the best practice).

12/20    643
11/2     642
10/28    639
11/1     637
11/15    635
        ... 
12/29    398
7/4      385
12/26    324
12/25    194
2/29     169
Name: INCDTTM, Length: 366, dtype: int64
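
The categories described in the comments above could be derived roughly as follows. This is a sketch under stated assumptions, not the notebook's method: the holiday set, the rush-hour window (07:00-09:59 and 16:00-18:59) and the night window (21:00-04:59) are placeholder choices, and all function names are hypothetical. It works on the original `INCDTTM` strings, before the time and year are removed.

```python
from datetime import datetime

# Assumed values, for illustration only.
HOLIDAYS = {(7, 4), (12, 25)}                        # (month, day) pairs
RUSH_HOURS = set(range(7, 10)) | set(range(16, 19))  # hours 7-9 and 16-18
NIGHT_HOURS = set(range(21, 24)) | set(range(0, 5))  # hours 21-23 and 0-4


def parse_incident(text):
    """Return (datetime, has_time) for strings laid out like INCDTTM."""
    try:
        return datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p"), True
    except ValueError:
        # Date-only entries carry no time of day.
        return datetime.strptime(text, "%m/%d/%Y"), False


def day_type(moment):
    """Classify a date as holiday, weekend or workday."""
    if (moment.month, moment.day) in HOLIDAYS:
        return "holiday"
    return "weekend" if moment.weekday() >= 5 else "workday"


def time_of_day(moment, has_time):
    """Classify the hour as rush_hour, night or other; unknown if no time."""
    if not has_time:
        return "unknown"
    if moment.hour in RUSH_HOURS:
        return "rush_hour"
    if moment.hour in NIGHT_HOURS:
        return "night"
    return "other"


def categorise(text):
    moment, has_time = parse_incident(text)
    return day_type(moment), time_of_day(moment, has_time)


for sample in ["3/27/2013 2:54:00 PM", "11/2/2006", "1/19/2019 9:25:00 AM"]:
    print(sample, "->", categorise(sample))
```

Date-only entries get an explicit `unknown` time category, so they are not silently counted as midnight.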

In [85]:
df

Unnamed: 0,ADDRTYPE,LOCATION,WEATHER,ROADCOND,LIGHTCOND,INCDTTM,SEVERITYCODE
0,Intersection,5TH AVE NE AND NE 103RD ST,Overcast,Wet,Daylight,3/27,2
1,Block,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,Raining,Wet,Dark - Street Lights On,12/20,1
2,Block,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,Overcast,Dry,Daylight,11/18,1
3,Block,2ND AVE BETWEEN MARION ST AND MADISON ST,Clear,Dry,Daylight,3/29,1
4,Intersection,SWIFT AVE S AND SWIFT AV OFF RP,Raining,Wet,Daylight,1/28,2
...,...,...,...,...,...,...,...
194668,Block,34TH AVE S BETWEEN S DAKOTA ST AND S GENESEE ST,Clear,Dry,Daylight,11/12,2
194669,Block,AURORA AVE N BETWEEN N 85TH ST AND N 86TH ST,Raining,Wet,Daylight,12/18,1
194670,Intersection,20TH AVE NE AND NE 75TH ST,Clear,Dry,Daylight,1/19,2
194671,Intersection,GREENWOOD AVE N AND N 68TH ST,Clear,Dry,Dusk,1/15,2


### Data Preprocessing

In [86]:
x = df[['ADDRTYPE','LOCATION','WEATHER','ROADCOND','LIGHTCOND','INCDTTM']]
y = df[['SEVERITYCODE']]