<a href="https://www.bigdatauniversity.com"><img src="https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png" width="400" align="center" /></a>

# <div align="center">Coursera Capstone</div>
### <div align="center">Author: <a href="https://www.linkedin.com/in/kaming-yip-22b03a1a0/">Kaming Yip</a>&emsp;&emsp;Date: Aug. 24, 2020</div>

### 1. Introduction/Business Problem

Imagine you are driving to another city for work, to visit friends, or simply to start a vacation. If an accident happens along the way, traffic slows to a crawl and everyone is stuck in the slow lane. Drivers already caught in the jam have no choice but to wait, however long it takes to pass the accident scene, while approaching drivers have no idea what has happened ahead and simply join the queue. It can take a couple of hours to clear the scene and restore normal traffic.

However, if a system could warn you, given the weather and road conditions, about how likely you are to get into a car accident and how severe it would be, you could drive more carefully or even change your travel plans if you are able to. With such warnings, many drivers could avoid unexpected accidents and plan their trips better.

Therefore, this project is a case study in predicting the severity of an accident from attributes such as weather, road, and light conditions, so that we can give drivers suggestions, based on the environment, that help them drive safely and efficiently.


### 2. Data Description

The dataset used in this project is the <a href="https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv">shared data</a> originally provided by the <a href="https://www.seattle.gov/transportation">Seattle Department of Transportation (SDOT)</a> Traffic Management Division, Traffic Records Group, and modified to meet the project criteria.

Let's take a look at the head of the dataset.

In [1]:
import pandas as pd

In [2]:
collision_data = pd.read_csv("Data-Collisions.csv", low_memory = False)
collision_data.head()

|   | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | NaN | NaN | NaN | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | NaN | ... | Wet | Dark - Street Lights On | NaN | 6354039.0 | NaN | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | NaN | ... | Dry | Daylight | NaN | 4323031.0 | NaN | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | NaN | ... | Dry | Daylight | NaN | NaN | NaN | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | NaN | 4028032.0 | NaN | 10 | Entering at angle | 0 | 0 | N |


The entire dataset has 194,673 observations (rows) and 38 attributes (columns). Each row is a collision that happened in Seattle and was recorded by the <a href="https://www.seattle.gov/police/">Seattle Police Department (SPD)</a> between January 2004 and May 2020. The metadata of the dataset can be found <a href="https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf">here</a>.

The first column, named <i>SEVERITYCODE</i>, is the label (i.e. the output variable), which corresponds to the severity of each collision:

|   Severity Code   |   Description   |   Count   |
|-------------------|-----------------|-----------|
|3                  |Fatality         |NA         |
|2b                 |Serious Injury   |NA         |
|2                  |Injury           |58,188     |
|1                  |Prop Damage      |136,485    |
|0                  |Unknown          |NA         |

It is worth noting that the labels are imbalanced, with 136,485 cases of severity 1 and 58,188 cases of severity 2. We will want to balance the data; otherwise, a model trained on the original dataset will be biased toward the majority class (<i>garbage in, garbage out</i>). We will handle this problem in the Methodology section.
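One common way to balance labels is to randomly downsample every class to the size of the smallest one. The sketch below illustrates that idea on toy data. It is an assumption about how balancing could be done, not necessarily the approach taken later in this project; the helper name `downsample_to_smallest` and the toy values are hypothetical.

```python
import pandas as pd

def downsample_to_smallest(df, label_col="SEVERITYCODE", random_state=0):
    """Randomly downsample every class to the size of the smallest class."""
    smallest = df[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so that the classes are not stored in contiguous blocks.
    shuffled = pd.concat(parts).sample(frac=1, random_state=random_state)
    return shuffled.reset_index(drop=True)

# Toy data standing in for the collision records (hypothetical values).
toy = pd.DataFrame({"SEVERITYCODE": [1] * 7 + [2] * 3, "VEHCOUNT": range(10)})
balanced = downsample_to_smallest(toy)
print(balanced["SEVERITYCODE"].value_counts().to_dict())
```

Downsampling discards rows from the larger class, so it trades data volume for balance; oversampling or class weights are alternatives with the opposite trade-off.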

The remaining columns are attributes, some of which will be used to train the predictive model. Among them, we focus on those most strongly associated with the label, for example location, weather condition, road condition, light condition, junction type, speeding, and the number of people and vehicles involved. Feature engineering will also be applied to improve the model's predictive power. We explain how and why the features are chosen in the Methodology section.
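For categorical attributes such as road or weather condition, one simple way to gauge association with the label is to compare the share of injury collisions across the levels of the attribute. The following is a minimal sketch on toy data; the helper `injury_share_by` and the toy rows are hypothetical, and the column names are taken from the dataset header above.

```python
import pandas as pd

def injury_share_by(df, feature, label_col="SEVERITYCODE", injury_code=2):
    """Per level of `feature`: number of rows and share labeled `injury_code`."""
    flagged = df.assign(is_injury=df[label_col].eq(injury_code))
    return (
        flagged.groupby(feature)["is_injury"]
        .agg(count="size", injury_share="mean")
        .sort_values("injury_share", ascending=False)
    )

# Toy rows standing in for the collision records (hypothetical values).
toy = pd.DataFrame({
    "ROADCOND": ["Wet", "Wet", "Dry", "Dry", "Dry", "Wet"],
    "SEVERITYCODE": [2, 1, 1, 1, 1, 2],
})
share = injury_share_by(toy, "ROADCOND")
print(share)
```

Levels with very few rows give noisy shares, so the `count` column is worth reading alongside `injury_share`.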


### 3. Methodology

#### 3.1 Data Overview

Candidate columns for the analysis:

SEVERITYCODE, X, Y, ADDRTYPE, COLLISIONTYPE, PERSONCOUNT, PEDCOUNT, PEDCYLCOUNT,
VEHCOUNT, INCDATE, INCDTTM, JUNCTIONTYPE, INATTENTIONIND, UNDERINFL, WEATHER, ROADCOND,
LIGHTCOND, PEDROWNOTGRNT, SPEEDING, HITPARKEDCAR
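Before using the columns listed above, it can help to check their types and how much of each one is missing, since the head of the dataset already shows empty values in columns such as SPEEDING and PEDROWNOTGRNT. The sketch below shows one way to do that; the helper `missing_summary` and the toy rows are hypothetical.

```python
import pandas as pd

# Column names copied from the list above.
CANDIDATE_COLUMNS = [
    "SEVERITYCODE", "X", "Y", "ADDRTYPE", "COLLISIONTYPE", "PERSONCOUNT",
    "PEDCOUNT", "PEDCYLCOUNT", "VEHCOUNT", "INCDATE", "INCDTTM",
    "JUNCTIONTYPE", "INATTENTIONIND", "UNDERINFL", "WEATHER", "ROADCOND",
    "LIGHTCOND", "PEDROWNOTGRNT", "SPEEDING", "HITPARKEDCAR",
]

def missing_summary(df, columns):
    """Return dtype and share of missing values for the requested columns."""
    present = [c for c in columns if c in df.columns]  # skip absent columns
    subset = df[present]
    summary = pd.DataFrame({
        "dtype": subset.dtypes.astype(str),
        "missing_share": subset.isna().mean(),
    })
    return summary.sort_values("missing_share", ascending=False)

# Toy rows standing in for the collision records (hypothetical values).
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 1],
    "SPEEDING": [None, None, "Y", None],
    "WEATHER": ["Raining", "Clear", None, "Clear"],
})
summary = missing_summary(toy, CANDIDATE_COLUMNS)
print(summary)
```

With the real data, the same call would be `missing_summary(collision_data, CANDIDATE_COLUMNS)`.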



In [23]:
import folium
from IPython.display import display

In [24]:
# X holds longitude and Y holds latitude; keep only rows with both coordinates
located = collision_data.dropna(subset = ["X", "Y"]).head(100)
latitudes = list(located["Y"])
longitudes = list(located["X"])
labels = list(located["SEVERITYCODE"].astype(str))

In [26]:
# Seattle city center; folium expects [latitude, longitude]
seattle_lat = 47.6062
seattle_lng = -122.3321
seattle_map = folium.Map(location = [seattle_lat, seattle_lng], zoom_start = 5.5)

seattle_map

In [11]:
# feature group that holds one circle marker per collision
accidents = folium.map.FeatureGroup()

for lat, lng in zip(latitudes, longitudes):
    accidents.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius = 3,
            color = "yellow",
            fill = True,
            fill_color = "blue",
            fill_opacity = 0.6
        )
    )

# add a pop-up with the severity code to each collision
for lat, lng, label in zip(latitudes, longitudes, labels):
    folium.Marker([lat, lng], popup = str(label)).add_to(seattle_map)

# add the circle markers to the map and display it
seattle_map.add_child(accidents)

In [None]:
from folium import plugins

# start again with a clean copy of the map of Seattle ([latitude, longitude])
seattle_map = folium.Map(location = [47.6062, -122.3321], zoom_start = 12)

# instantiate a marker cluster object for the collisions
incidents = plugins.MarkerCluster().add_to(seattle_map)

# take the first 100 collisions that have coordinates (X = longitude, Y = latitude)
located = collision_data.dropna(subset = ["X", "Y"]).head(100)

# add each collision to the marker cluster, with its severity code as the pop-up
for lat, lng, label in zip(located["Y"], located["X"], located["SEVERITYCODE"].astype(str)):
    folium.Marker(
        location = [lat, lng],
        icon = None,
        popup = label,
    ).add_to(incidents)

# display map
seattle_map

In [None]:
print(collision_data.columns)
collision_data["SEVERITYDESC"].value_counts()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
test = collision_data.copy()
test["Y"].head()

In [None]:
test[test["X"].notnull()]

In [None]:

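# correlation is only defined for numeric columns, so drop the text columns first
corr = collision_data.select_dtypes(include = "number").corr()
print(corr)
sns.heatmap(corr, xticklabels = corr.columns, yticklabels = corr.columns, linewidths = 0.1, cmap = "coolwarm_r")
plt.show()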
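A full heatmap can be hard to read when the interest is in a single target. As a complement, the sketch below ranks the numeric columns by the absolute value of their Pearson correlation with the label. The helper `correlation_with_label` and the toy rows are hypothetical, and Pearson correlation against a severity code is only a rough first indicator, not a feature-selection rule.

```python
import pandas as pd

def correlation_with_label(df, label_col="SEVERITYCODE"):
    """Pearson correlation of each numeric column with the label, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[label_col].drop(label_col)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy rows standing in for the collision records (hypothetical values).
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 2, 2],
    "PERSONCOUNT": [1, 2, 3, 4],
    "VEHCOUNT": [2, 2, 2, 3],
    "ST_COLDESC": ["a", "b", "c", "d"],  # text column, ignored
})
ranked = correlation_with_label(toy)
print(ranked)
```

With the real data, the call would be `correlation_with_label(collision_data)`; identifier columns such as OBJECTID would appear in the ranking too and would need to be excluded by hand.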