# Capstone Project: Car Accident Severity
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

On average, about 6 million car accidents occur in the U.S. every year, and about 3 million people are injured in them. Analysing the conditions that contribute to these accidents could help prevent significant loss of life and financial resources.

The project aims at **predicting the severity of a car accident** given the **location, weather, road and visibility conditions**, using a dataset provided by the **Seattle** PD, in order to reduce the frequency of car collisions in a community. This analysis could help drivers exercise more caution while driving, or even choose an alternative route or time for their travel where possible. It could also help the local government, the police department and car insurance providers gain deeper insight into road accidents.

## Data <a name="data"></a>

The detailed dataset of all road collisions (from 2004 to present) is loaded from the URL shown in the Methodology section. The data was provided by the Seattle Police Department and recorded by the Traffic Records Department. The dataset consists of 37 independent fields and 194,673 records, and includes both numerical and categorical data. The dependent field, or label, is SEVERITYCODE, which describes the severity of an accident. Its values are categorised as fatality (3), serious injury (2b), injury (2), property damage (1) and unknown (0). In the version of the dataset used here, only codes 1 and 2 occur.

## Methodology <a name="methodology"></a>

In [105]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from pylab import rcParams
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, jaccard_score, log_loss
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split


In [3]:
url ='https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv'
df = pd.read_csv(url)
df.head()



| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | | | | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | | ... | Dry | Daylight | | | | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | | 4028032.0 | | 10 | Entering at angle | 0 | 0 | N |


In [4]:
df.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object

In [5]:
# A missing SPEEDING value means speeding was not recorded, so treat it as 'N'.
df['SPEEDING'] = df['SPEEDING'].fillna('N')
df.value_counts("SPEEDING")

SPEEDING
N    185340
Y      9333
dtype: int64

In [50]:
# .copy() makes df_acc an independent DataFrame rather than a slice of df.
df_acc = df[['SEVERITYCODE','ADDRTYPE','LOCATION', 'JUNCTIONTYPE','WEATHER','ROADCOND','LIGHTCOND','SPEEDING']].copy()
df_acc.head()

| | SEVERITYCODE | ADDRTYPE | LOCATION | JUNCTIONTYPE | WEATHER | ROADCOND | LIGHTCOND | SPEEDING |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Intersection | 5TH AVE NE AND NE 103RD ST | At Intersection (intersection related) | Overcast | Wet | Daylight | N |
| 1 | 1 | Block | AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N | Mid-Block (not related to intersection) | Raining | Wet | Dark - Street Lights On | N |
| 2 | 1 | Block | 4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST | Mid-Block (not related to intersection) | Overcast | Dry | Daylight | N |
| 3 | 1 | Block | 2ND AVE BETWEEN MARION ST AND MADISON ST | Mid-Block (not related to intersection) | Clear | Dry | Daylight | N |
| 4 | 2 | Intersection | SWIFT AVE S AND SWIFT AV OFF RP | At Intersection (intersection related) | Raining | Wet | Daylight | N |


In [51]:
df_acc.describe(include = "all")

| | SEVERITYCODE | ADDRTYPE | LOCATION | JUNCTIONTYPE | WEATHER | ROADCOND | LIGHTCOND | SPEEDING |
|---|---|---|---|---|---|---|---|---|
| count | 194673.0 | 192747 | 191996 | 188344 | 189592 | 189661 | 189503 | 194673 |
| unique | | 3 | 24102 | 7 | 11 | 9 | 9 | 2 |
| top | | Block | BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB ... | Mid-Block (not related to intersection) | Clear | Dry | Daylight | N |
| freq | | 126926 | 276 | 89800 | 111135 | 124510 | 116137 | 185340 |
| mean | 1.298901 | | | | | | | |
| std | 0.457778 | | | | | | | |
| min | 1.0 | | | | | | | |
| 25% | 1.0 | | | | | | | |
| 50% | 1.0 | | | | | | | |
| 75% | 2.0 | | | | | | | |


In [52]:
missing_data = df_acc.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")

SEVERITYCODE
False    194673
Name: SEVERITYCODE, dtype: int64

ADDRTYPE
False    192747
True       1926
Name: ADDRTYPE, dtype: int64

LOCATION
False    191996
True       2677
Name: LOCATION, dtype: int64

JUNCTIONTYPE
False    188344
True       6329
Name: JUNCTIONTYPE, dtype: int64

WEATHER
False    189592
True       5081
Name: WEATHER, dtype: int64

ROADCOND
False    189661
True       5012
Name: ROADCOND, dtype: int64

LIGHTCOND
False    189503
True       5170
Name: LIGHTCOND, dtype: int64

SPEEDING
False    194673
Name: SPEEDING, dtype: int64



In [53]:
# Fill missing categorical values with an explicit 'Unknown' category.
# fillna returns a new DataFrame, which avoids the chained-assignment warning
# raised by calling replace(..., inplace=True) on a column of a slice.
unknown_cols = ['WEATHER', 'ROADCOND', 'LIGHTCOND', 'JUNCTIONTYPE']
df_acc = df_acc.fillna({col: 'Unknown' for col in unknown_cols})
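Selecting columns from a DataFrame and then modifying the selection can trigger pandas' `SettingWithCopyWarning`. The following is a minimal sketch of one way to avoid it, using a tiny made-up frame (`df_demo` and its values are illustrative assumptions, not rows from the collisions data): take an explicit `.copy()` of the subset, then assign the filled columns back.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the collisions table.
df_demo = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2],
    "WEATHER": ["Clear", np.nan, "Raining", np.nan],
    "ROADCOND": ["Dry", "Wet", np.nan, "Dry"],
    "LOCATION": ["A ST", None, "B ST", "C ST"],
})

# .copy() makes the subset an independent DataFrame, so later assignments
# do not act on a view of df_demo.
subset = df_demo[["SEVERITYCODE", "WEATHER", "ROADCOND"]].copy()

# Fill only the listed columns and assign the result back explicitly
# instead of relying on inplace=True.
fill_cols = ["WEATHER", "ROADCOND"]
subset[fill_cols] = subset[fill_cols].fillna("Unknown")

# Number of missing values left per column.
print(subset.isnull().sum())
```

The original frame is left untouched, which makes it easy to compare counts before and after the fill.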



### UNDERSAMPLING

In [54]:
df_acc.value_counts("SEVERITYCODE")

SEVERITYCODE
1    136485
2     58188
dtype: int64

In [55]:
#to balance label by undersampling

target="SEVERITYCODE"
minority_class_len = len(df_acc[df_acc[target] ==2])
majority_class_indices = df_acc[df_acc[target] ==1].index
random_majority_indices = np.random.choice(majority_class_indices,minority_class_len, replace = False)
minority_class_indices = df_acc[df_acc[target] ==2].index

under_sample_indices = np.concatenate([minority_class_indices, random_majority_indices])
df_acc = df_acc.loc[under_sample_indices]
df_acc.value_counts("SEVERITYCODE")

SEVERITYCODE
2    58188
1    58188
dtype: int64
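Because the cell above draws the majority-class rows with an unseeded `np.random.choice`, each run keeps a different subset. The sketch below shows one possible reproducible variant; the helper name `undersample`, the `seed` argument and the toy data are assumptions made for illustration, and it relies on `GroupBy.sample`, which is available in pandas 1.1 and later.

```python
import pandas as pd


def undersample(frame, target, seed=0):
    """Return a frame in which every class has as many rows as the rarest class."""
    n_min = frame[target].value_counts().min()
    # GroupBy.sample draws n_min rows per class without replacement;
    # random_state makes the draw repeatable.
    return frame.groupby(target).sample(n=n_min, random_state=seed)


# Toy data: 7 rows of class 1 and 3 rows of class 2.
demo = pd.DataFrame({
    "SEVERITYCODE": [1] * 7 + [2] * 3,
    "SPEEDING": list("NNNYNNNYNN"),
})

balanced = undersample(demo, "SEVERITYCODE", seed=42)
print(balanced["SEVERITYCODE"].value_counts())
```

With a fixed seed, the same rows are kept on every run, so downstream scores can be compared across sessions.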


### EXPLORATORY DATA ANALYSIS

In [56]:
df_acc.describe(include="all")

| | SEVERITYCODE | ADDRTYPE | LOCATION | JUNCTIONTYPE | WEATHER | ROADCOND | LIGHTCOND | SPEEDING |
|---|---|---|---|---|---|---|---|---|
| count | 116376.0 | 115419 | 115075 | 116376 | 116376 | 116376 | 116376 | 116376 |
| unique | | 3 | 19769 | 7 | 11 | 9 | 9 | 2 |
| top | | Block | AURORA AVE N BETWEEN N 117TH PL AND N 125TH ST | Mid-Block (not related to intersection) | Clear | Dry | Daylight | N |
| freq | | 71449 | 180 | 49571 | 67872 | 75994 | 71482 | 110360 |
| mean | 1.5 | | | | | | | |
| std | 0.500002 | | | | | | | |
| min | 1.0 | | | | | | | |
| 25% | 1.0 | | | | | | | |
| 50% | 1.5 | | | | | | | |
| 75% | 2.0 | | | | | | | |


In [57]:
df.value_counts("ADDRTYPE")

ADDRTYPE
Block           126926
Intersection     65070
Alley              751
dtype: int64

In [58]:
df.value_counts("LOCATION")

LOCATION
BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB AND AURORA AVE N    276
BATTERY ST TUNNEL SB BETWEEN AURORA AVE N AND ALASKAN WY VI SB    271
N NORTHGATE WAY BETWEEN MERIDIAN AVE N AND CORLISS AVE N          265
AURORA AVE N BETWEEN N 117TH PL AND N 125TH ST                    254
6TH AVE AND JAMES ST                                              252
                                                                 ... 
NE 70TH ST BETWEEN 51ST AVE NE AND 52ND AVE NE                      1
39TH AVE E AND E LEE ST                                             1
39TH AVE E AND HILLSIDE DR E                                        1
39TH AVE E AND MCGILVRA ER BLVD E                                   1
10TH AVE AND E ALDER ST                                             1
Length: 24102, dtype: int64

In [59]:
df.value_counts("WEATHER")

WEATHER
Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
dtype: int64

In [60]:
df.value_counts("ROADCOND")

ROADCOND
Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
dtype: int64

In [61]:
df.value_counts("LIGHTCOND")

LIGHTCOND
Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
dtype: int64

In [62]:
df.value_counts("SPEEDING")

SPEEDING
N    185340
Y      9333
dtype: int64



### DATA PREPROCESSING

In [63]:
# label encoding

from sklearn import preprocessing 

label_encoder = preprocessing.LabelEncoder() 
df_acc['WEATHER']=df_acc['WEATHER'].astype('str')
df_acc['WEATHER_cat']= label_encoder.fit_transform(df_acc['WEATHER']) 

df_acc['ADDRTYPE']=df_acc['ADDRTYPE'].astype('str')
df_acc['ADDRTYPE_cat']= label_encoder.fit_transform(df_acc['ADDRTYPE']) 

df_acc['JUNCTIONTYPE']=df_acc['JUNCTIONTYPE'].astype('str')
df_acc['JUNCTIONTYPE_cat']= label_encoder.fit_transform(df_acc['JUNCTIONTYPE']) 

df_acc['ROADCOND']=df_acc['ROADCOND'].astype('str')
df_acc['ROADCOND_cat']= label_encoder.fit_transform(df_acc['ROADCOND'])

df_acc['LIGHTCOND']=df_acc['LIGHTCOND'].astype('str')
df_acc['LIGHTCOND_cat']= label_encoder.fit_transform(df_acc['LIGHTCOND'])


df_acc['SPEEDING']=df_acc['SPEEDING'].astype('str')
df_acc['SPEEDING_cat']= label_encoder.fit_transform(df_acc['SPEEDING'])

df_acc

| | SEVERITYCODE | ADDRTYPE | LOCATION | JUNCTIONTYPE | WEATHER | ROADCOND | LIGHTCOND | SPEEDING | WEATHER_cat | ADDRTYPE_cat | JUNCTIONTYPE_cat | ROADCOND_cat | LIGHTCOND_cat | SPEEDING_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Intersection | 5TH AVE NE AND NE 103RD ST | At Intersection (intersection related) | Overcast | Wet | Daylight | N | 4 | 2 | 1 | 8 | 5 | 0 |
| 4 | 2 | Intersection | SWIFT AVE S AND SWIFT AV OFF RP | At Intersection (intersection related) | Raining | Wet | Daylight | N | 6 | 2 | 1 | 8 | 5 | 0 |
| 7 | 2 | Intersection | BROADWAY AND E PIKE ST | At Intersection (intersection related) | Clear | Dry | Daylight | N | 1 | 2 | 1 | 0 | 5 | 0 |
| 9 | 2 | Intersection | 41ST AVE SW AND SW THISTLE ST | At Intersection (intersection related) | Clear | Dry | Daylight | N | 1 | 2 | 1 | 0 | 5 | 0 |
| 14 | 2 | Block | ROOSEVELT WAY NE BETWEEN NE 47TH ST AND NE 50T... | Mid-Block (not related to intersection) | Clear | Dry | Dark - Street Lights On | N | 1 | 1 | 4 | 0 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46006 | 1 | Block | YALE AVE N BETWEEN HARRISON ST AND REPUBLICAN ST | Mid-Block (not related to intersection) | Overcast | Wet | Daylight | N | 4 | 1 | 4 | 8 | 5 | 0 |
| 123549 | 1 | Block | M L KING JR WR WAY S BETWEEN S LILAC ST AND S ... | Mid-Block (not related to intersection) | Overcast | Dry | Daylight | N | 4 | 1 | 4 | 0 | 5 | 0 |
| 62642 | 1 | Block | ALASKAN WY VI NB BETWEEN S ROYAL BROUGHAM WAY ... | Mid-Block (not related to intersection) | Clear | Dry | Daylight | N | 1 | 1 | 4 | 0 | 5 | 0 |
| 146494 | 1 | Intersection | 5TH AVE S AND S MAIN ST | At Intersection (intersection related) | Unknown | Unknown | Unknown | N | 10 | 2 | 1 | 7 | 8 | 0 |
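`LabelEncoder` assigns integers in alphabetical order of the category names (in the output above, Clear is 1, Overcast is 4 and Raining is 6), so distance-based models such as KNN treat the codes as if they were ordered magnitudes. One commonly used alternative is one-hot encoding. The sketch below applies `pd.get_dummies` to a small made-up frame (`demo` is an illustrative assumption, not the notebook's data) and is not the encoding used for the results reported here.

```python
import pandas as pd

# Toy frame with two of the categorical features.
demo = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Overcast", "Clear"],
    "ROADCOND": ["Dry", "Wet", "Wet", "Dry"],
})

# One 0/1 indicator column per category, so no artificial ordering
# between "Clear", "Overcast" and "Raining" is introduced.
encoded = pd.get_dummies(demo, columns=["WEATHER", "ROADCOND"], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix: each categorical column expands into as many columns as it has distinct values.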


In [77]:
#initialization
x= df_acc[['WEATHER_cat','ROADCOND_cat','LIGHTCOND_cat','SPEEDING_cat','JUNCTIONTYPE_cat']]
y=df_acc['SEVERITYCODE']


In [73]:
#Normalizing the dataset
x=preprocessing.StandardScaler().fit(x).transform(x)
x

array([[ 0.30420861,  1.46622972,  0.32270634, -0.23347913, -1.14714423],
       [ 0.99642915,  1.46622972,  0.32270634, -0.23347913, -1.14714423],
       [-0.7341222 , -0.7185665 ,  0.32270634, -0.23347913, -1.14714423],
       ...,
       [-0.7341222 , -0.7185665 ,  0.32270634, -0.23347913,  0.89057867],
       [ 2.38087022,  1.19313019,  2.06449439, -0.23347913, -1.14714423],
       [ 0.30420861,  1.46622972,  0.32270634, -0.23347913,  0.2113377 ]])

In [78]:
#Splitting the data as 70 % for training and 30 % for testing

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
print("Train set:", x_train.shape, y_train.shape)
print("Test set:", x_test.shape, y_test.shape)

Train set: (81463, 5) (81463,)
Test set: (34913, 5) (34913,)
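In the cells above, the scaler is fitted on all rows before the split. A common alternative is to split first and let a pipeline fit the scaler on the training rows only, so that no statistics from the test set leak into training. The sketch below assumes synthetic data (`X_demo`, `y_demo`) in place of the encoded collision features, and adds `stratify` to keep the class ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 rows, 5 integer-coded features, balanced labels 1/2.
X_demo = rng.integers(0, 9, size=(200, 5)).astype(float)
y_demo = np.repeat([1, 2], 100)

# 70 % training, 30 % testing, with the class ratio preserved in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only and reuses the
# same means and standard deviations when transforming the test rows.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=25))
model.fit(X_tr, y_tr)
print(X_tr.shape, X_te.shape)
```

Calling `model.predict(X_te)` then scales and classifies the test rows in one step.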



### MODELLING AND PREDICTIONS

#### K-Nearest Neighbors

In [99]:
# Training the Model
from sklearn.neighbors import KNeighborsClassifier
k=25

kneigh = KNeighborsClassifier(n_neighbors = k).fit(x_train, y_train)
k_y_pred = kneigh.predict(x_test)
k_y_pred[0:5]

array([1, 1, 1, 1, 2], dtype=int64)

In [111]:
#Model Evaluation
j1=jaccard_score(y_test,k_y_pred)
f1=f1_score(y_test,k_y_pred, average = 'macro')
print("Jaccard Score: ",j1)
print("F1 Score: ",f1)

Jaccard Score:  0.45526428543606123
F1 Score:  0.597453791995599


#### Decision Tree

In [112]:
# Training the Model
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier (criterion = 'entropy', max_depth = 7)

dt.fit(x_train, y_train)
dt_y_pred = dt.predict(x_test)
dt_y_pred[0:5]

array([2, 1, 1, 1, 2], dtype=int64)

In [113]:
#Model Evaluation
j2=jaccard_score(y_test,dt_y_pred)
f2=f1_score(y_test,dt_y_pred, average = 'macro')
print("Jaccard Score: ",j2)
print("F1 Score: ",f2)

Jaccard Score:  0.42543690154150027
F1 Score:  0.615804885120825


#### Logistic Regression

In [114]:
# Training the Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
lr = LogisticRegression(C = 6, solver = 'liblinear').fit(x_train, y_train)

lr_y_pred = lr.predict(x_test)
lr_y_prob = lr.predict_proba(x_test)
lr_y_prob

array([[0.50009414, 0.49990586],
       [0.65871039, 0.34128961],
       [0.76371336, 0.23628664],
       ...,
       [0.58112228, 0.41887772],
       [0.39855113, 0.60144887],
       [0.5153299 , 0.4846701 ]])

In [115]:
#Model Evaluation
j3=jaccard_score(y_test,lr_y_pred)
f3=f1_score(y_test,lr_y_pred, average = 'macro')
print("Jaccard Score: ",j3)
print("F1 Score: ",f3)
print("Log Loss: ",log_loss(y_test,lr_y_prob))

Jaccard Score:  0.45723828078406425
F1 Score:  0.6080281533732805
Log Loss:  0.6560436338545689
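As a reference point for the log loss, a classifier that always predicts a probability of 0.5 on a two-class problem scores ln 2 ≈ 0.693, so lower values indicate better calibrated probabilities. The sketch below implements the binary log loss by hand; the function name `binary_log_loss` and the toy inputs are assumptions for illustration, and `prob_positive` corresponds to the second column of `predict_proba` (the probability of class 2).

```python
import math


def binary_log_loss(y_true, prob_positive, positive=2, eps=1e-15):
    """Mean negative log-likelihood of the true labels.

    prob_positive[i] is the predicted probability that sample i
    belongs to the `positive` class.
    """
    total = 0.0
    for label, p in zip(y_true, prob_positive):
        # Clip to avoid log(0) for over-confident predictions.
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p) if label == positive else -math.log(1 - p)
    return total / len(y_true)


# A constant 0.5 prediction gives ln 2, whatever the labels are.
print(binary_log_loss([2, 1, 2, 1], [0.5, 0.5, 0.5, 0.5]))
```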


## Results and Discussion <a name="results"></a>

In [125]:
result = {'ML Model':['KNN','Decision Tree', 'Logistic Regression'], 'Jaccard Score':[j1, j2, j3], 'F1 Score':[f1, f2, f3]}
result = pd.DataFrame.from_dict(result)
result

| | ML Model | Jaccard Score | F1 Score |
|---|---|---|---|
| 0 | KNN | 0.455264 | 0.597454 |
| 1 | Decision Tree | 0.425437 | 0.615805 |
| 2 | Logistic Regression | 0.457238 | 0.608028 |


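The models were evaluated with the Jaccard index and the macro-averaged F1 score; log loss was also computed for logistic regression. The values of k (KNN), maximum depth (decision tree) and the regularisation parameter C (logistic regression) were chosen by trying different settings and keeping those that improved these scores.

From this exercise, we see that the severity of an accident can be predicted only to a limited extent from the given features: on the balanced dataset, all three models reach macro F1 scores of about 0.60 to 0.62, where roughly 0.5 would correspond to random guessing.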
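The notebook does not show how the different hyperparameter values were compared. One systematic way to do it is a cross-validated grid search, sketched below for k in KNN. The synthetic data (`X_demo`, `y_demo`), the candidate grid and the 5-fold setting are all assumptions for illustration, not the procedure used to obtain the results above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in: 300 rows, 5 features; the label depends on the
# first feature plus some noise, and takes the values 1 and 2.
X_demo = rng.normal(size=(300, 5))
y_demo = np.where(X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0, 2, 1)

# Try each candidate k with 5-fold cross-validation and score by macro F1.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 15, 25, 35]},
    scoring="f1_macro",
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to `max_depth` for the decision tree and `C` for logistic regression by swapping the estimator and the `param_grid`.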

## Conclusion <a name="conclusion"></a>

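Based on the evaluation of the models above, it can be concluded that Logistic Regression is the most suitable model for this case: it has the highest Jaccard score (0.457) and an F1 score (0.608) close to the best value (0.616, Decision Tree).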