# Capstone Project - Car accident severity
### Shengyu Wu

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)



## Introduction <a name="introduction"></a>

Car accidents have a significant impact on individuals, their families, and the nation, and they remain one of society's most pressing issues. According to the National Safety Council (NSC), an estimated 38,800 people in the US lost their lives in car crashes in 2019, a 2% decline from 2018 (39,404 deaths) and a 4% decline from 2017 (40,231 deaths). About 4.4 million people were injured seriously enough in crashes last year to require medical attention. An algorithm that could predict the severity of a car accident would therefore help police reach the scene faster and provide the right assistance.

This project attempts to classify the various factors that cause accidents and to predict the level of severity by leveraging data collected from different kinds of car accidents.

## Data <a name="data"></a>

The dataset contains 38 attributes, some of which have little relevance to car accident severity. We drop those to focus on the main factors. For data cleaning, missing categorical values are replaced with 'Other', 'Unknown', or 'N', depending on the information in each column, and missing coordinates are replaced with the column mean.

Import the required packages

In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline
from sklearn.utils import resample
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
import matplotlib.image as mpimg
from sklearn import tree
from sklearn.tree import export_graphviz
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
import matplotlib as mpl

Download the dataset

In [2]:
!wget -O Data-Collisions.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

--2020-09-01 16:58:54--  https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv
Resolving s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)... 67.228.254.196
Connecting to s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73917638 (70M) [text/csv]
Saving to: ‘Data-Collisions.csv’


2020-09-01 16:58:56 (40.9 MB/s) - ‘Data-Collisions.csv’ saved [73917638/73917638]



Load the data from the CSV file

In [3]:
df = pd.read_csv('Data-Collisions.csv')
df.head()


| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | | | | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | | ... | Dry | Daylight | | | | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | | 4028032.0 | | 10 | Entering at angle | 0 | 0 | N |


Select the main features

In [4]:
# Drop the attributes considered less relevant to accident severity
df1 = df.drop(columns=[
    "SEVERITYCODE.1", "OBJECTID", "INCKEY", "COLDETKEY", "REPORTNO", "STATUS", "ADDRTYPE", "INTKEY",
    "EXCEPTRSNCODE", "EXCEPTRSNDESC", "PEDCYLCOUNT", "PEDCOUNT", "SDOT_COLCODE", "SDOT_COLDESC", "ST_COLCODE",
    "SEGLANEKEY", "CROSSWALKKEY", "SDOTCOLNUM", "INCDATE", "INCDTTM", "PEDROWNOTGRNT", "UNDERINFL",
    "HITPARKEDCAR", "ST_COLDESC", "SEVERITYDESC",
])
df1.head()

| | SEVERITYCODE | X | Y | LOCATION | COLLISIONTYPE | PERSONCOUNT | VEHCOUNT | JUNCTIONTYPE | INATTENTIONIND | WEATHER | ROADCOND | LIGHTCOND | SPEEDING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 5TH AVE NE AND NE 103RD ST | Angles | 2 | 2 | At Intersection (intersection related) | | Overcast | Wet | Daylight | |
| 1 | 1 | -122.347294 | 47.647172 | AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N | Sideswipe | 2 | 2 | Mid-Block (not related to intersection) | | Raining | Wet | Dark - Street Lights On | |
| 2 | 1 | -122.33454 | 47.607871 | 4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST | Parked Car | 4 | 3 | Mid-Block (not related to intersection) | | Overcast | Dry | Daylight | |
| 3 | 1 | -122.334803 | 47.604803 | 2ND AVE BETWEEN MARION ST AND MADISON ST | Other | 3 | 3 | Mid-Block (not related to intersection) | | Clear | Dry | Daylight | |
| 4 | 2 | -122.306426 | 47.545739 | SWIFT AVE S AND SWIFT AV OFF RP | Angles | 2 | 2 | At Intersection (intersection related) | | Raining | Wet | Daylight | |


In [5]:
df1.columns

Index(['SEVERITYCODE', 'X', 'Y', 'LOCATION', 'COLLISIONTYPE', 'PERSONCOUNT',
       'VEHCOUNT', 'JUNCTIONTYPE', 'INATTENTIONIND', 'WEATHER', 'ROADCOND',
       'LIGHTCOND', 'SPEEDING'],
      dtype='object')
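
The same set of columns can also be reached by naming the attributes to keep rather than the ones to drop, which keeps the selection readable if the raw file gains or loses columns. The sketch below is illustrative only: `keep_cols`, `toy`, and `selected` are hypothetical names, and the one-row toy frame stands in for the real data.

```python
import pandas as pd

# The 13 columns reported by df1.columns above.
keep_cols = [
    "SEVERITYCODE", "X", "Y", "LOCATION", "COLLISIONTYPE", "PERSONCOUNT",
    "VEHCOUNT", "JUNCTIONTYPE", "INATTENTIONIND", "WEATHER", "ROADCOND",
    "LIGHTCOND", "SPEEDING",
]

# Toy stand-in for the raw frame: the kept columns plus two that get discarded.
toy = pd.DataFrame({col: [0] for col in keep_cols + ["OBJECTID", "REPORTNO"]})

# Select by name; .copy() gives an independent frame that is safe to modify.
selected = toy.loc[:, keep_cols].copy()
print(list(selected.columns))
```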

Find counts of unique values for columns

In [6]:
df1['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
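
The counts above put roughly 70% of the records in class 1 and 30% in class 2. If that imbalance turns out to bias a classifier toward class 1, one common remedy is to downsample the majority class. The following is a minimal sketch on a toy frame using pandas' `DataFrame.sample`; the toy data, the variable names, and the seed are assumptions made for illustration and are not part of the project. `sklearn.utils.resample`, imported earlier, offers a comparable route.

```python
import pandas as pd

# Toy stand-in with the same kind of imbalance (7 rows of class 1, 3 of class 2).
toy = pd.DataFrame({
    "SEVERITYCODE": [1] * 7 + [2] * 3,
    "VEHCOUNT": [2, 2, 3, 1, 2, 2, 4, 2, 3, 2],
})

# Class shares before balancing.
shares = toy["SEVERITYCODE"].value_counts(normalize=True)

majority = toy[toy["SEVERITYCODE"] == 1]
minority = toy[toy["SEVERITYCODE"] == 2]

# Draw as many majority rows as there are minority rows, without replacement.
majority_down = majority.sample(n=len(minority), random_state=42)  # seed is arbitrary

balanced = pd.concat([majority_down, minority])
counts = balanced["SEVERITYCODE"].value_counts()
print(shares)
print(counts)
```

Downsampling discards data, so it is worth comparing against a model trained on the full, imbalanced set before committing to it.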

In [7]:
df1['COLLISIONTYPE'].value_counts()

Parked Car    47987
Angles        34674
Rear Ended    34090
Other         23703
Sideswipe     18609
Left Turn     13703
Pedestrian     6608
Cycles         5415
Right Turn     2956
Head On        2024
Name: COLLISIONTYPE, dtype: int64

In [8]:
df1['JUNCTIONTYPE'].value_counts()

Mid-Block (not related to intersection)              89800
At Intersection (intersection related)               62810
Mid-Block (but intersection related)                 22790
Driveway Junction                                    10671
At Intersection (but not related to intersection)     2098
Ramp Junction                                          166
Unknown                                                  9
Name: JUNCTIONTYPE, dtype: int64

In [9]:
df1['WEATHER'].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [10]:
df1['LIGHTCOND'].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

In [11]:
df1['ROADCOND'].value_counts()

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

Handle missing values

In [12]:
# Count the missing values in each column
df1.isna().sum()

SEVERITYCODE           0
X                   5334
Y                   5334
LOCATION            2677
COLLISIONTYPE       4904
PERSONCOUNT            0
VEHCOUNT               0
JUNCTIONTYPE        6329
INATTENTIONIND    164868
WEATHER             5081
ROADCOND            5012
LIGHTCOND           5170
SPEEDING          185340
dtype: int64
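
Raw counts are easier to judge as a share of all rows. With 194,673 records in total (136,485 + 58,188), `SPEEDING` is missing in about 95% of rows and `INATTENTIONIND` in about 85%, while the other columns are missing in under 4%. A minimal sketch of that calculation on a toy frame; `toy` and `missing_share` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in: one sparse flag column, one mostly complete column, one complete column.
toy = pd.DataFrame({
    "SPEEDING": ["Y", np.nan, np.nan, np.nan],
    "WEATHER": ["Clear", "Raining", np.nan, "Clear"],
    "VEHCOUNT": [2, 2, 3, 1],
})

# isna() yields booleans; their mean is the fraction of missing rows per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```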

In [13]:
# Replace missing values with 'Other', 'Unknown', or 'N', depending on the column
df1['COLLISIONTYPE'] = df1['COLLISIONTYPE'].fillna("Other")

In [14]:
df1['JUNCTIONTYPE'] = df1['JUNCTIONTYPE'].fillna("Unknown")

In [15]:
df1['INATTENTIONIND'] = df1['INATTENTIONIND'].fillna("N")

In [16]:
df1['WEATHER'] = df1['WEATHER'].fillna("Unknown")

In [17]:
df1['ROADCOND'] = df1['ROADCOND'].fillna("Unknown")

In [18]:
df1['LIGHTCOND'] = df1['LIGHTCOND'].fillna("Unknown")

In [19]:
df1['SPEEDING'] = df1['SPEEDING'].fillna("N")

In [21]:
# Fill missing coordinates with the column mean
avg_X = df1["X"].astype("float").mean()
df1['X'] = df1['X'].fillna(avg_X)

In [22]:
avg_Y = df1["Y"].astype("float").mean()
df1['Y'] = df1['Y'].fillna(avg_Y)

In [23]:
df1['LOCATION'] = df1['LOCATION'].fillna("Unknown")

In [24]:
df1.isna().sum()

SEVERITYCODE      0
X                 0
Y                 0
LOCATION          0
COLLISIONTYPE     0
PERSONCOUNT       0
VEHCOUNT          0
JUNCTIONTYPE      0
INATTENTIONIND    0
WEATHER           0
ROADCOND          0
LIGHTCOND         0
SPEEDING          0
dtype: int64
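
The cells above fill each column in its own statement. The same rules can also be collected in one mapping and applied in a single call, which makes the chosen placeholder for every column visible at a glance. This is a sketch on a toy frame, not the notebook's actual code: `toy`, `fill_values`, and `cleaned` are hypothetical names, and only a few of the columns are shown.

```python
import numpy as np
import pandas as pd

# Toy stand-in with a missing value in every column.
toy = pd.DataFrame({
    "COLLISIONTYPE": ["Angles", np.nan],
    "WEATHER": [np.nan, "Clear"],
    "SPEEDING": [np.nan, "Y"],
    "X": [-122.3, np.nan],
})

# One fill value per column; the coordinate uses the column mean.
fill_values = {
    "COLLISIONTYPE": "Other",
    "WEATHER": "Unknown",
    "SPEEDING": "N",
    "X": toy["X"].mean(),
}

# fillna with a dict fills each named column with its own value.
cleaned = toy.fillna(value=fill_values)
print(cleaned.isna().sum())
```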

## Methodology <a name="methodology"></a>

## Analysis <a name="analysis"></a>

## Results <a name="results"></a>

## Discussion <a name="discussion"></a>

## Conclusion <a name="conclusion"></a>