## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

In this project, we seek to predict the **Severity of Car Crash Collisions** using machine learning. The data covers the city of Seattle; it comes from the SPD and was recorded by Traffic Records. The GitHub repo includes the dataset as of 10/8/2020 together with its metadata, and I have also included links to both in the repository description.

Determining the Severity of a Car Crash could be useful for car manufacturers when designing a car's safety features, or for actuaries when assessing insurance risk. For this reason, we will not look at the exact locations where car crashes occurred, but at the general features surrounding each crash, such as Weather, Speeding, Collision Type, and State Codes, to name a few.

## Data <a name="data"></a>

Based on the factor we want to predict, variables that might influence our model could include:
* Collision Address Type
* Collision Type
* Pedestrian Count
* Vehicle Count
* Weather
* Road Conditions
* Speeding

Using a variety of factors could allow us to identify which ones are strongest in determining the Severity of a Car Crash. This could help individuals know when to stay more vigilant to prevent Severe accidents, or help car companies determine the best way to make their cars safer (for example, which areas of the car to reinforce).

#### Selecting Columns

In [93]:
import numpy as np
import pandas as pd
import csv

In [94]:
# Load in dataset
df = pd.read_csv("C:/Users/chanm/Desktop/Coursera_Capstone/dataset/Data-Collisions.csv")


In [95]:
df.head()

|   | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight |  |  |  | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block |  | ... | Wet | Dark - Street Lights On |  | 6354039.0 |  | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block |  | ... | Dry | Daylight |  | 4323031.0 |  | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block |  | ... | Dry | Daylight |  |  |  | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight |  | 4028032.0 |  | 10 | Entering at angle | 0 | 0 | N |


Location doesn't matter to our analysis, so we can drop the X and Y columns as well as columns pertaining to specific locations. We can also drop the columns that indicate report number, status, and keys since they are only there to identify the unique id of the crash.

We can also drop the time columns, since the LIGHTCOND column is already a good indicator of the time of the crash and the visibility conditions. Other columns, such as WEATHER, are also indicative of seasonal conditions.

Finally, we can drop columns that are essentially redundant. JUNCTIONTYPE and ADDRTYPE say basically the same thing, so we will keep just one, ADDRTYPE, for its ease of use.

COLLISIONTYPE indicates the type of collision that occurred. It is less specific than the SDOT_COLCODE or ST_COLCODE columns but carries essentially the same information. Because we are looking to create a general model, COLLISIONTYPE is the better choice: the more specific codes could contain outliers that skew our model. SDOT_COLCODE and ST_COLCODE are also similar to each other, as both are simply codes used to describe the situation. For that reason we drop SDOT_COLCODE and SDOT_COLDESC along with ST_COLCODE and ST_COLDESC.

Funnily enough, ST_COLCODE and ST_COLDESC carry the same information: ST_COLDESC is the description of the categorical variable ST_COLCODE, so we would only need to keep one of them. SEVERITYCODE and SEVERITYDESC are related in the same way.

COLLISIONTYPE is also a categorical variable that covers HITPARKEDCAR, therefore we can eliminate HITPARKEDCAR.
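
The claim that a code column and its description column are interchangeable can be checked before dropping either one. The following is a minimal sketch on a small made-up frame (the `toy` data and the `is_one_to_one` helper are illustrative assumptions, not part of the original notebook): it tests whether every code maps to exactly one description and vice versa.

```python
import pandas as pd

# Toy stand-in for the collision data; the rows are illustrative only.
toy = pd.DataFrame({
    "ST_COLCODE": ["10", "11", "32", "10", "32"],
    "ST_COLDESC": [
        "Entering at angle",
        "From same direction - both going straight",
        "One parked--one moving",
        "Entering at angle",
        "One parked--one moving",
    ],
})

def is_one_to_one(frame, code_col, desc_col):
    """Return True if each code maps to one description and vice versa."""
    # Keep each distinct (code, description) pair once, ignoring missing rows.
    pairs = frame[[code_col, desc_col]].dropna().drop_duplicates()
    # One-to-one means neither side repeats among the distinct pairs.
    return bool(pairs[code_col].is_unique and pairs[desc_col].is_unique)

one_to_one = is_one_to_one(toy, "ST_COLCODE", "ST_COLDESC")
print(one_to_one)
```

The same helper could be pointed at SEVERITYCODE and SEVERITYDESC on the real data to confirm that only one of each pair is needed.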

In [96]:
# Drop columns by position: longitude/latitude coordinates (X, Y) and unique crash identifiers
df.drop(columns=df.columns[1:7], inplace=True)

# Dropping columns with specific Location Data and their descriptions
df.drop(['INTKEY','LOCATION','EXCEPTRSNCODE','EXCEPTRSNDESC','SEGLANEKEY','CROSSWALKKEY'], inplace = True, axis = 1)

# Dropping duplicated column
df.drop(['SEVERITYCODE.1'], inplace = True, axis = 1)

# Drop Date and time columns
df.drop(['INCDATE','INCDTTM'], inplace = True, axis = 1)

# Drop another unique identifier column
df.drop(['SDOTCOLNUM'], inplace = True, axis = 1)

# Drop essentially similar columns
df.drop(['JUNCTIONTYPE','SDOT_COLCODE','SDOT_COLDESC','ST_COLCODE','ST_COLDESC','SEVERITYDESC','HITPARKEDCAR'], inplace = True, axis = 1)

df.head()

|   | SEVERITYCODE | STATUS | ADDRTYPE | COLLISIONTYPE | PERSONCOUNT | PEDCOUNT | PEDCYLCOUNT | VEHCOUNT | INATTENTIONIND | UNDERINFL | WEATHER | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SPEEDING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Matched | Intersection | Angles | 2 | 0 | 0 | 2 |  | N | Overcast | Wet | Daylight |  |  |
| 1 | 1 | Matched | Block | Sideswipe | 2 | 0 | 0 | 2 |  | 0 | Raining | Wet | Dark - Street Lights On |  |  |
| 2 | 1 | Matched | Block | Parked Car | 4 | 0 | 0 | 3 |  | 0 | Overcast | Dry | Daylight |  |  |
| 3 | 1 | Matched | Block | Other | 3 | 0 | 0 | 3 |  | N | Clear | Dry | Daylight |  |  |
| 4 | 2 | Matched | Intersection | Angles | 2 | 0 | 0 | 2 |  | 0 | Raining | Wet | Daylight |  |  |


In [97]:
list(df.columns)

['SEVERITYCODE',
 'STATUS',
 'ADDRTYPE',
 'COLLISIONTYPE',
 'PERSONCOUNT',
 'PEDCOUNT',
 'PEDCYLCOUNT',
 'VEHCOUNT',
 'INATTENTIONIND',
 'UNDERINFL',
 'WEATHER',
 'ROADCOND',
 'LIGHTCOND',
 'PEDROWNOTGRNT',
 'SPEEDING']
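
The first drop in the cell above selects columns by position, which silently removes the wrong columns if the file's column order ever changes. As a hedged alternative, the sketch below drops the same kind of columns by name on a small made-up frame (`toy`, `location_and_id_cols`, and `trimmed` are illustrative names; the column names themselves are taken from the `df.head()` output earlier).

```python
import pandas as pd

# Toy frame with a subset of the real column names; values are illustrative.
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1],
    "X": [-122.32, -122.35],
    "Y": [47.70, 47.65],
    "OBJECTID": [1, 2],
    "ADDRTYPE": ["Intersection", "Block"],
})

# Coordinates and unique identifiers, listed explicitly by name.
location_and_id_cols = ["X", "Y", "OBJECTID", "INCKEY", "COLDETKEY", "REPORTNO"]

# errors="ignore" skips names that are absent instead of raising KeyError.
trimmed = toy.drop(columns=location_and_id_cols, errors="ignore")
print(list(trimmed.columns))
```

Dropping by name also documents which columns are removed, at the cost of a longer list.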

In [98]:
df.count()

SEVERITYCODE      194673
STATUS            194673
ADDRTYPE          192747
COLLISIONTYPE     189769
PERSONCOUNT       194673
PEDCOUNT          194673
PEDCYLCOUNT       194673
VEHCOUNT          194673
INATTENTIONIND     29805
UNDERINFL         189789
WEATHER           189592
ROADCOND          189661
LIGHTCOND         189503
PEDROWNOTGRNT       4667
SPEEDING            9333
dtype: int64

#### Dealing with NaNs

We can see that there are a few columns with NaN values. Let's see what we can do about them.

In [99]:
# Sum of NaN's in each column
df.isnull().sum()

SEVERITYCODE           0
STATUS                 0
ADDRTYPE            1926
COLLISIONTYPE       4904
PERSONCOUNT            0
PEDCOUNT               0
PEDCYLCOUNT            0
VEHCOUNT               0
INATTENTIONIND    164868
UNDERINFL           4884
WEATHER             5081
ROADCOND            5012
LIGHTCOND           5170
PEDROWNOTGRNT     190006
SPEEDING          185340
dtype: int64

The STATUS column indicates whether or not a collision was matched to a party. A STATUS of 'Unmatched' is generally the reason for the NaN values, as not enough information could be gathered for those records.

In [100]:
df['STATUS'].value_counts()

Matched      189786
Unmatched      4887
Name: STATUS, dtype: int64
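
One way to test whether 'Unmatched' records really account for the missing values is to count NaNs separately for each STATUS value. This is a minimal sketch on made-up rows (the `toy` frame and the `nan_by_status` name are assumptions for illustration), not output from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy rows; the pattern of missing values is invented for illustration.
toy = pd.DataFrame({
    "STATUS": ["Matched", "Matched", "Unmatched", "Unmatched", "Matched"],
    "COLLISIONTYPE": ["Angles", "Sideswipe", np.nan, np.nan, np.nan],
    "WEATHER": ["Clear", np.nan, np.nan, np.nan, "Raining"],
})

# Flag missing cells in every other column, then sum the flags per STATUS.
nan_by_status = (
    toy.drop(columns="STATUS")
       .isnull()
       .groupby(toy["STATUS"])
       .sum()
)
print(nan_by_status)
```

On the real data, a table like this would show directly how many of each column's NaNs sit in the 'Unmatched' rows.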

In [101]:
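# Keep only the rows where STATUS is 'Matched'; .copy() gives an independent
# frame so the in-place drop below does not act on a slice of the original
df = df.loc[df['STATUS'] == 'Matched'].copy()

# Now we can drop STATUS as a column
df.drop(['STATUS'], inplace = True, axis = 1)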

In [102]:
df.isnull().sum()

SEVERITYCODE           0
ADDRTYPE            1817
COLLISIONTYPE         21
PERSONCOUNT            0
PEDCOUNT               0
PEDCYLCOUNT            0
VEHCOUNT               0
INATTENTIONIND    159981
UNDERINFL              0
WEATHER              197
ROADCOND             128
LIGHTCOND            286
PEDROWNOTGRNT     185119
SPEEDING          180453
dtype: int64

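Keeping only the matched records removes most of the NaN values: for example, COLLISIONTYPE falls from 4904 missing values to 21, and UNDERINFL from 4884 to 0. It should also help the accuracy of our model, since unmatched records carry little usable information.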

In [103]:
df['ADDRTYPE'].value_counts(dropna = False)

Block           123663
Intersection     63559
NaN               1817
Alley              747
Name: ADDRTYPE, dtype: int64

In [104]:
df['COLLISIONTYPE'].value_counts(dropna = False)

Parked Car    47986
Angles        34674
Rear Ended    34089
Other         23703
Sideswipe     18608
Left Turn     13703
Pedestrian     6607
Cycles         5415
Right Turn     2956
Head On        2024
NaN              21
Name: COLLISIONTYPE, dtype: int64

We will replace the NaN values in ADDRTYPE with the value 'Other' since there are special cases which might not fall into the other categories of ADDRTYPE. We will do the same for COLLISIONTYPE for the same reason.

In [105]:
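# Assign the result back: fillna(inplace=True) on a selected column
# may not update df in newer pandas versions
df['ADDRTYPE'] = df['ADDRTYPE'].fillna('Other')
df['COLLISIONTYPE'] = df['COLLISIONTYPE'].fillna('Other')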

In [106]:
df['INATTENTIONIND'].value_counts(dropna = False)

NaN    159981
Y       29805
Name: INATTENTIONIND, dtype: int64

INATTENTIONIND appears to be an indicator of whether inattention was a factor in the collision: the only recorded value is 'Y'. We therefore treat the NaN values as 'N'.

In [107]:
df['WEATHER'].value_counts(dropna = False)

Clear                       111134
Raining                      33144
Overcast                     27713
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
NaN                            197
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [108]:
df['ROADCOND'].value_counts(dropna = False)

Dry               124508
Wet                47473
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
NaN                  128
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [109]:
df['LIGHTCOND'].value_counts(dropna = False)

Daylight                    116135
Dark - Street Lights On      48506
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
NaN                            286
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

In [110]:
# Assign the result back rather than filling a selected column in place
df['WEATHER'] = df['WEATHER'].fillna('Unknown')
df['ROADCOND'] = df['ROADCOND'].fillna('Unknown')
df['LIGHTCOND'] = df['LIGHTCOND'].fillna('Unknown')

Since each of these columns already has an 'Unknown' category, we can simply group the missing values into it.
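
When several columns share the same fill value, `DataFrame.fillna` also accepts a dictionary that maps column names to fill values, so the step can be written as a single call. A minimal sketch on made-up rows follows (`toy`, `condition_cols`, and `filled` are illustrative names, not from the original notebook).

```python
import numpy as np
import pandas as pd

# Toy rows with one missing value per condition column.
toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, "Unknown"],
    "ROADCOND": [np.nan, "Wet", "Dry"],
    "LIGHTCOND": ["Daylight", "Dusk", np.nan],
})

condition_cols = ["WEATHER", "ROADCOND", "LIGHTCOND"]

# Map each column to its fill value; columns not in the dict are left alone.
filled = toy.fillna({col: "Unknown" for col in condition_cols})
print(filled["WEATHER"].value_counts())
```

This returns a new frame and leaves the original untouched, which avoids in-place modification of a selected column.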

In [111]:
df['PEDROWNOTGRNT'].value_counts(dropna = False)

NaN    185119
Y        4667
Name: PEDROWNOTGRNT, dtype: int64

In [112]:
df['SPEEDING'].value_counts(dropna = False)

NaN    180453
Y        9333
Name: SPEEDING, dtype: int64

In [114]:
# Assign the result back rather than filling a selected column in place
df['PEDROWNOTGRNT'] = df['PEDROWNOTGRNT'].fillna('N')
df['SPEEDING'] = df['SPEEDING'].fillna('N')
df['INATTENTIONIND'] = df['INATTENTIONIND'].fillna('N')

These three columns are indicator columns whose only recorded value is 'Y', so we treat their NaN values as 'N'.
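
Treating NaN as 'N' is only safe if a column really contains nothing but 'Y' and missing values. The sketch below, on made-up rows, checks that assumption first and fills only the columns that pass (`toy`, `indicator_cols`, `is_indicator`, and `safe_cols` are illustrative names).

```python
import numpy as np
import pandas as pd

# Toy rows; the values are invented for illustration.
toy = pd.DataFrame({
    "SPEEDING": ["Y", np.nan, np.nan, "Y"],
    "PEDROWNOTGRNT": [np.nan, np.nan, "Y", np.nan],
    "INATTENTIONIND": ["Y", "Y", np.nan, np.nan],
})

indicator_cols = ["SPEEDING", "PEDROWNOTGRNT", "INATTENTIONIND"]

# A column counts as a Y-only indicator if its non-missing values are all 'Y'.
is_indicator = {
    col: set(toy[col].dropna().unique()) <= {"Y"} for col in indicator_cols
}

# Fill only the columns that pass the check, on a copy of the frame.
safe_cols = [col for col, ok in is_indicator.items() if ok]
filled = toy.copy()
filled[safe_cols] = filled[safe_cols].fillna("N")
print(filled)
```

A column that fails the check (for example one that already contains 'N' or '0') would be left as-is for closer inspection.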

In [115]:
df['UNDERINFL'].value_counts(dropna = False)

N    100274
0     80391
Y      5126
1      3995
Name: UNDERINFL, dtype: int64

In [118]:
df['UNDERINFL'] = df['UNDERINFL'].replace({"0": "N", "1": "Y"})

UNDERINFL mixes two encodings ('N'/'Y' and '0'/'1'), so we map '0' to 'N' and '1' to 'Y' to obtain a single binary encoding.
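
If a column like UNDERINFL is read with mixed types, the codes may appear as integers in some rows and strings in others, and a mapping keyed only on strings would miss the integers. As a hedged sketch (the `toy` series and its mix of types are assumptions, not observed in the real data), casting to string first lets one mapping handle both cases.

```python
import pandas as pd

# Mixed integer and string codes, as can happen with a mixed-type CSV column.
toy = pd.Series(["N", "0", 0, "Y", "1", 1], name="UNDERINFL")

# Cast everything to str so 0 and "0" are treated alike, then map to N/Y.
# Note: astype(str) would turn NaN into the string "nan", so this assumes
# the column has no missing values at this point.
normalized = toy.astype(str).replace({"0": "N", "1": "Y"})
print(normalized.value_counts())
```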

In [120]:
df.isnull().sum()

SEVERITYCODE      0
ADDRTYPE          0
COLLISIONTYPE     0
PERSONCOUNT       0
PEDCOUNT          0
PEDCYLCOUNT       0
VEHCOUNT          0
INATTENTIONIND    0
UNDERINFL         0
WEATHER           0
ROADCOND          0
LIGHTCOND         0
PEDROWNOTGRNT     0
SPEEDING          0
dtype: int64

In [121]:
df.head()

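|   | SEVERITYCODE | ADDRTYPE | COLLISIONTYPE | PERSONCOUNT | PEDCOUNT | PEDCYLCOUNT | VEHCOUNT | INATTENTIONIND | UNDERINFL | WEATHER | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SPEEDING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Intersection | Angles | 2 | 0 | 0 | 2 | N | N | Overcast | Wet | Daylight | N | N |
| 1 | 1 | Block | Sideswipe | 2 | 0 | 0 | 2 | N | N | Raining | Wet | Dark - Street Lights On | N | N |
| 2 | 1 | Block | Parked Car | 4 | 0 | 0 | 3 | N | N | Overcast | Dry | Daylight | N | N |
| 3 | 1 | Block | Other | 3 | 0 | 0 | 3 | N | N | Clear | Dry | Daylight | N | N |
| 4 | 2 | Intersection | Angles | 2 | 0 | 0 | 2 | N | N | Raining | Wet | Daylight | N | N |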


## Methodology <a name="methodology"></a>

## Analysis <a name="analysis"></a>

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>