# Capstone Project - Car Accident Severity

## Introduction
Car accidents, and traffic accidents in general, are a serious problem in modern society. The World Health Organisation estimates that every year road accidents result in more than 1.3 million deaths and 20 to 50 million non-fatal injuries, and cost economies 3% of their annual gross domestic product through lost resources, productivity and collateral damage. It is thus important to determine the factors leading to accidents in order to develop strategies that eliminate or mitigate them and so reduce the occurrence of traffic accidents.

Traffic accidents lead to a variety of consequences, ranging from altercations and minor property damage to the more severe loss of human life. Having studied the factors causing traffic accidents, an important next step is to determine **what affects the level of severity of accidents**.

### Problem
Aside from understanding the factors causing accidents, it is also imperative to **understand what causes severe accidents** so that targeted, prioritised strategies can be developed to reduce high-severity accidents first, making efficient use of limited resources.

### Interest
With such insights, government agencies can efficiently allocate resources to reduce high-severity accidents by eliminating mitigable factors (e.g. improving lighting conditions at specific junctions) and to ameliorate the consequences of accidents that cannot be mitigated (e.g. deploying more medical/evacuation personnel in regions where, and/or during periods when, high-severity accidents are likely to occur, to increase the survivability of those involved).


## Data Acquisition & Understanding
### Data Source
In order to answer the question on which factors affect the severity of accidents, the data should include information/attributes on the weather conditions, location, number and types of parties involved, other event factors and preferably the labelled data attribute of accident severity.

Luckily, a convenient data source has been kindly provided by the course instructors <u>[here](https://www.coursera.org/learn/applied-data-science-capstone/supplement/Nh5uS/downloading-example-dataset)</u>. Metadata of the dataset can be found <u>[here](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf)</u>.

### Data Description
The provided dataset contains 194,673 entries (rows), with 38 different features (columns). Each entry contains information regarding an accident incident, generally including information on:

* **Severity of the accident**
   * This includes severity class/code, severity description  
* **Location of the accident**
   * This includes (X, Y) coordinates, address, location type, junction type
* **Date-Time of the accident**
   * This includes the date and the time
* **Environment conditions**
   * This includes the weather, road surface conditions, lighting conditions
* **Parties involved**
   * This includes the number of pedestrians, vehicles and cyclists involved
* **Event information**
   * This includes information on the type of collision, the description of the collision and if the vehicle was speeding

A snapshot of the dataset is shown below.

In [1]:
# import relevant libraries
import pandas as pd
import numpy as np

In [2]:
import requests

url = 'https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv'

# download csv dataset onto local directory
with requests.get(url, stream=True) as response:
    response.raise_for_status()

    with open("Data_collisions.csv", "wb") as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
        file.flush()

In [3]:
# load as pandas dataframe
df = pd.read_csv('Data_collisions.csv', index_col=0)
df.head(3)



| SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | LOCATION | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | 5TH AVE NE AND NE 103RD ST | ... | Wet | Daylight | | | | 10 | Entering at angle | 0 | 0 | N |
| 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | | AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | | 4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0 | 0 | N |


In [4]:
df.columns

Index(['X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS',
       'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC',
       'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT',
       'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE', 'INCDTTM',
       'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND',
       'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT',
       'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY',
       'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

### Initial Feature Selection
The main target feature is *SEVERITYCODE*, which represents the two possible classes of accident: (1) property damage only and (2) injury. Since this is a binary classification problem, a Gradient Boosting Classifier (GBC), which often performs well on binary and multi-class classification tasks, will be implemented.
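
For orientation, a minimal sketch of fitting a GBC with scikit-learn is given here. It is not the final model: the data are synthetic (generated with `make_classification`) and stand in for the engineered accident features, and the hyperparameters are arbitrary placeholders rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data; an assumption standing in for the real features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Placeholder hyperparameters, chosen only to keep the example fast.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Mean accuracy on the held-out split.
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```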

The table below shows a list of features that will be dropped and those that will be used in the subsequent Exploratory Data Analysis (EDA) and feature engineering:

| Feature Categories | Features to keep | Features to drop |
| -------------------|------------------|------------------|
| Severity of incident |SEVERITYCODE | SEVERITYDESC|
| Junction & location | ADDRTYPE, JUNCTIONTYPE, CROSSWALKKEY, SEGLANEKEY | INTKEY, X, Y, LOCATION |
| Date Time | INCDATE, INCDTTM | - |
| Environmental Conditions | WEATHER, ROADCOND, LIGHTCOND | - |
| Parties Involved | PERSONCOUNT, PEDCOUNT, PEDCYLCOUNT, VEHCOUNT, HITPARKEDCAR | - |
| Event Information | - |ST_COLCODE, ST_COLDESC, COLLISIONTYPE, SDOT_COLCODE, SDOT_COLDESC, SDOTCOLNUM, SPEEDING, INATTENTIONIND |
| Others | STATUS, EXCEPTRSNCODE, EXCEPTRSNDESC | OBJECTID, INCKEY, COLDETKEY, REPORTNO, UNDERINFL, PEDROWNOTGRNT |

#### Severity
There are several redundant features in the dataset. For example, *SEVERITYDESC* describes the type of severity, either as "injury collision" or "property damage only", which duplicates the information in *SEVERITYCODE*. *SEVERITYDESC* must be removed, or we run the risk of target leakage.
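
One way to confirm such a redundancy is to check that each code maps to exactly one description and vice versa. The sketch below uses a small hand-made frame; the rows and the exact description strings are illustrative assumptions, not values read from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; not drawn from the real dataset.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2, 1],
    "SEVERITYDESC": [
        "Property Damage Only Collision",
        "Injury Collision",
        "Property Damage Only Collision",
        "Injury Collision",
        "Property Damage Only Collision",
    ],
})

# Number of distinct descriptions per code, and distinct codes per description.
desc_per_code = toy.groupby("SEVERITYCODE")["SEVERITYDESC"].nunique()
code_per_desc = toy.groupby("SEVERITYDESC")["SEVERITYCODE"].nunique()

# A one-to-one mapping means the description adds nothing beyond the code.
is_redundant = bool((desc_per_code == 1).all() and (code_per_desc == 1).all())

# Drop the redundant column so it cannot leak the target into the model.
toy_clean = toy.drop(columns=["SEVERITYDESC"])
```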

#### Junction & Location
Further redundancies include *INTKEY*, which corresponds to the specific collision intersection. Using this information may result in overfitting due to its high specificity. Instead, *ADDRTYPE*, which classifies the collision address into 3 general classes, is a much better feature to use.

Another feature, *JUNCTIONTYPE*, also presents information on the category of junction where the accident occurred. A closer inspection of the values reveals that it adds a further dimension to *ADDRTYPE*, as it indicates whether the accident is related to a junction or not, i.e. the address can be a non-junction but still be junction related. As such, *JUNCTIONTYPE* will be retained. *CROSSWALKKEY* could also be used to indicate whether the accident junction involves a crosswalk: encoding could be used to label rows with valid crosswalk keys (non-zero values) and those without (zero values).
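
The crosswalk encoding described above could be sketched as follows. The key values are made up for illustration, and `HAS_CROSSWALK` is a hypothetical column name.

```python
import pandas as pd

# Illustrative keys only: zero means no crosswalk, non-zero means a valid key.
toy = pd.DataFrame({"CROSSWALKKEY": [0, 523609, 0, 520838]})

# HAS_CROSSWALK is a hypothetical name for the binary indicator.
toy["HAS_CROSSWALK"] = (toy["CROSSWALKKEY"] != 0).astype(int)
```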

Regarding the location features in the dataset, which include *X, Y, LOCATION*: while these are useful for identifying specific hotspot locations, they do not present general information about the location, i.e. we are unable to decipher characteristics that may be important, such as the curvature, inclination and speed limit of the road at that location. It is not recommended to use these features without further feature engineering.

*SEGLANEKEY* indicates the key for the lane segment where the accident occurred, i.e. cycling lane or car lane. We could include this feature for further analysis of its relevance.

#### Date-Time
*INCDATE, INCDTTM* contain the date and time information of the incident and are relevant.
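
As one illustration of how the timestamp could feed into feature engineering, the sketch below parses `INCDTTM`-style strings and derives the hour and day of week. The string format and the sample values are assumptions made for illustration, and `HOUR` and `DAYOFWEEK` are hypothetical column names.

```python
import pandas as pd

# Sample timestamps in an assumed month/day/year 12-hour format.
toy = pd.DataFrame({"INCDTTM": ["3/27/2013 2:54:00 PM", "12/20/2006 6:55:00 PM"]})

# An explicit format avoids ambiguous day/month inference.
parsed = pd.to_datetime(toy["INCDTTM"], format="%m/%d/%Y %I:%M:%S %p")

# Hypothetical derived features: hour of day (0-23) and day of week (Monday=0).
toy["HOUR"] = parsed.dt.hour
toy["DAYOFWEEK"] = parsed.dt.dayofweek
```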

#### Environmental Conditions
*WEATHER, ROADCOND, LIGHTCOND*, which represent the environmental conditions surrounding the accident, are relevant.

#### Parties Involved
Other relevant features include *PERSONCOUNT, PEDCOUNT, PEDCYLCOUNT, VEHCOUNT*, which provide information on the number of different parties involved in the accident and are thus highly relevant. They indicate whether cyclists, pedestrians and/or vehicles were involved.

The *HITPARKEDCAR* feature is an interesting feature which presents information on whether a parked vehicle was involved in the accident or not. We could include this feature for further analysis on relevancy.

#### Event Information
The *ST_COLCODE, ST_COLDESC, COLLISIONTYPE, SDOT_COLCODE, SDOT_COLDESC, SDOTCOLNUM* features contain information about the collision event, i.e. how the accident occurred, whether it involved rear-ending, a collision at an angle, etc. While this information is useful, it is generated by SDOT after the accident has occurred and is not readily and reliably available before or during the collision. These features are therefore removed.

The *SPEEDING* feature presents information on whether speeding (i.e. travelling above a stipulated speed limit) was a factor in the collision or not. This feature is surprisingly not very useful, as it indicates neither by how much the speed limit was exceeded nor how the classification was derived. It may have been subjectively derived by SDOT (e.g. speeding happened, but SDOT did not judge it to be a factor). The *INATTENTIONIND* feature has similar attributes and thus will also not be included.

#### Others
Features *OBJECTID, INCKEY, COLDETKEY, REPORTNO* can also be removed, as they contain keys and identification numbers which do not currently provide further information.

Features *STATUS, EXCEPTRSNCODE, EXCEPTRSNDESC* can be used to clean the data.

For *UNDERINFL, PEDROWNOTGRNT*, these features are related to laws that have been broken in the event of the accident. Given that these are traffic laws which are already enforced, these will not be of interest to the current study.


In [5]:
# list of features to keep

features = ['STATUS','ADDRTYPE', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE',
            'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
            'INCDTTM','JUNCTIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
            'SEGLANEKEY','CROSSWALKKEY', 'HITPARKEDCAR']
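
Because the dataset was loaded with `index_col=0`, *SEVERITYCODE* sits in the index rather than among the columns (it is absent from the `df.columns` output above), so selecting the list directly from `df` would raise a `KeyError`. A minimal sketch of one way to handle this, using a toy frame in place of `df` and a shortened, illustrative feature list:

```python
import pandas as pd

# Toy stand-in for df: SEVERITYCODE is the index, as after read_csv(..., index_col=0).
toy = pd.DataFrame(
    {"ADDRTYPE": ["Intersection", "Block"], "OBJECTID": [1, 2]},
    index=pd.Index([2, 1], name="SEVERITYCODE"),
)

# Shortened, illustrative version of the `features` list.
keep = ["SEVERITYCODE", "ADDRTYPE"]

# Move the index back into a regular column, then select the kept features.
selected = toy.reset_index()[keep]
```

The same pattern would apply to the full frame, e.g. `df.reset_index()[features]`.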