# Capstone Project - Predicting Accident Severity factors (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

The city council has recently pledged to reduce the number of injury collisions in the city after a high-profile case in which a child was injured made national news. However, council members currently disagree on what the most effective solution is.

They've pulled up the SDOT accident collision data to see which factors (if any) can predict if an accident will be property damage or an injury collision. They hope to 

From there, they can see which factor has the greatest predictive power, and implement the appropriate solution.

Here are some of the different competing factors (and solutions) that many of the city council members believe are the root cause:

1. Certain locations are dangerous, so if this has the highest correlation, then we need to find the intersections with the highest number of accidents to fix them.

2. Time of day may play a large role, so if this has the highest correlation, then we may need to install more streetlights.

3. The type of address (Block, Intersection, Alley) and the type of collision (Rear-end, Left Turn, etc.) may play a role, which means reviewing how the city approaches designing those types of roadways.

4. Weather or road conditions may play a role.

5. Other miscellaneous factors, such as the number of pedestrians or bikes involved, may play a role.






## Data <a name="data"></a>

The main data set is the existing SDOT collision data for Seattle.
Additional crash statistics could show how Seattle compares with the national average.

Because the city council has different hypotheses on what we need to do, we should explore these before looking at different solutions. To do this, we need to first import relevant libraries such as matplotlib, pandas, numpy, and folium. 

After that, we can read in the relevant collision data.



In [1]:
%matplotlib inline

import matplotlib as mpl
import pandas as pd
import numpy as np

import folium
from folium import plugins

print('Folium installed and imported!')

print("Hello Capstone Project Course!")

Folium installed and imported!
Hello Capstone Project Course!


In [2]:
crash_data = pd.read_csv('https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv')

print('Data downloaded and read into a dataframe!')



Data downloaded and read into a dataframe!


In [3]:
crash_data.isnull().sum()

SEVERITYCODE           0
X                   5334
Y                   5334
OBJECTID               0
INCKEY                 0
COLDETKEY              0
REPORTNO               0
STATUS                 0
ADDRTYPE            1926
INTKEY            129603
LOCATION            2677
EXCEPTRSNCODE     109862
EXCEPTRSNDESC     189035
SEVERITYCODE.1         0
SEVERITYDESC           0
COLLISIONTYPE       4904
PERSONCOUNT            0
PEDCOUNT               0
PEDCYLCOUNT            0
VEHCOUNT               0
INCDATE                0
INCDTTM                0
JUNCTIONTYPE        6329
SDOT_COLCODE           0
SDOT_COLDESC           0
INATTENTIONIND    164868
UNDERINFL           4884
WEATHER             5081
ROADCOND            5012
LIGHTCOND           5170
PEDROWNOTGRNT     190006
SDOTCOLNUM         79737
SPEEDING          185340
ST_COLCODE            18
ST_COLDESC          4904
SEGLANEKEY             0
CROSSWALKKEY           0
HITPARKEDCAR           0
dtype: int64

In [4]:
# optionally remove identifier columns that are not useful as predictors
#crash_data.drop(['INCKEY','OBJECTID','COLDETKEY','SDOTCOLNUM'], axis=1, inplace=True)


# for the sake of consistency, make all column labels of type string
crash_data.columns = list(map(str, crash_data.columns))

# remove rows with missing X or Y values
# (drops every row that has NaN in either coordinate column)
crash_data.dropna(subset=["X","Y"], axis=0, inplace=True)

# reset the index, because rows were dropped
crash_data.reset_index(drop=True, inplace=True)

crash_data.isnull().sum()


SEVERITYCODE           0
X                      0
Y                      0
OBJECTID               0
INCKEY                 0
COLDETKEY              0
REPORTNO               0
STATUS                 0
ADDRTYPE               0
INTKEY            124591
LOCATION               0
EXCEPTRSNCODE     107639
EXCEPTRSNDESC     185653
SEVERITYCODE.1         0
SEVERITYDESC           0
COLLISIONTYPE       4757
PERSONCOUNT            0
PEDCOUNT               0
PEDCYLCOUNT            0
VEHCOUNT               0
INCDATE                0
INCDTTM                0
JUNCTIONTYPE        4193
SDOT_COLCODE           0
SDOT_COLDESC           0
INATTENTIONIND    160163
UNDERINFL           4737
WEATHER             4925
ROADCOND            4858
LIGHTCOND           5012
PEDROWNOTGRNT     184694
SDOTCOLNUM         77621
SPEEDING          180619
ST_COLCODE            18
ST_COLDESC          4757
SEGLANEKEY             0
CROSSWALKKEY           0
HITPARKEDCAR           0
dtype: int64

## Methodology <a name="methodology"></a>

I will first have to do a bit of data wrangling. Several columns hold non-numeric values, which means that I will first have to come up with numerical codes for certain conditions (such as road condition, weather, and type of address).
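As a minimal sketch of this encoding step, pandas categoricals can assign an integer code to each distinct label. The toy rows and the `_CODE` column names below are illustrative assumptions, not part of the data set or of the notebook's final approach.

```python
import pandas as pd

# Toy rows standing in for the collision data (values are illustrative only).
sample = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast"],
    "ROADCOND": ["Dry", "Wet", "Dry", "Wet"],
    "ADDRTYPE": ["Block", "Intersection", "Block", "Alley"],
})

encoded = sample.copy()
code_maps = {}  # per column: integer code -> original label
for col in ["WEATHER", "ROADCOND", "ADDRTYPE"]:
    # Categories are the sorted unique labels, so codes follow alphabetical order.
    cat = encoded[col].astype("category")
    code_maps[col] = dict(enumerate(cat.cat.categories))
    # Hypothetical "<column>_CODE" name; missing values would get code -1.
    encoded[col + "_CODE"] = cat.cat.codes

print(encoded)
print(code_maps["WEATHER"])
```

Integer codes imply an ordering that labels such as weather do not really have, so one-hot columns from `pd.get_dummies` may be a better fit if the codes feed a correlation or a linear model.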

I will also need to deal with missing values in a number of the columns. Several columns are missing thousands of values, so I will use the average of certain columns to fill in their missing values, and I will drop others (or not use them).
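The fill-or-drop decision could be sketched as follows. The 50% cutoff (`MAX_MISSING_FRACTION`) and the use of the most frequent label for text columns are assumptions made for illustration; the toy values are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Toy rows with gaps (values are illustrative only).
sample = pd.DataFrame({
    "PERSONCOUNT": [2.0, np.nan, 4.0, 3.0],
    "WEATHER": ["Clear", None, "Raining", "Clear"],
    "SPEEDING": [np.nan, np.nan, "Y", np.nan],
})

MAX_MISSING_FRACTION = 0.5  # hypothetical cutoff for dropping a column

# Share of missing values per column.
missing_fraction = sample.isnull().mean()

# Drop columns that are mostly empty.
to_drop = missing_fraction[missing_fraction > MAX_MISSING_FRACTION].index
cleaned = sample.drop(columns=to_drop)

# Numeric column: fill gaps with the column average.
cleaned["PERSONCOUNT"] = cleaned["PERSONCOUNT"].fillna(cleaned["PERSONCOUNT"].mean())

# Text column: an average is undefined, so this sketch uses the most frequent label.
cleaned["WEATHER"] = cleaned["WEATHER"].fillna(cleaned["WEATHER"].mode()[0])

print(cleaned)
```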

Lastly, I will need to bin values such as X and Y into category ranges. Given that I will probably use a Folium map to examine the accident data, I will need to come up with districts or other groups to categorize which areas are most dangerous.
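One simple way to approximate districts is an equal-width grid over the coordinates, built with `pd.cut`. This is only a sketch: the grid size (`N_BINS`), the `CELL` label format, and the toy coordinates are assumptions, and real districts would more likely come from boundary data.

```python
import pandas as pd

# Toy coordinates and severity codes (values are illustrative only).
sample = pd.DataFrame({
    "X": [-122.40, -122.38, -122.28, -122.36, -122.26, -122.27],
    "Y": [47.52, 47.54, 47.58, 47.66, 47.72, 47.70],
    "SEVERITYCODE": [1, 2, 1, 1, 2, 2],
})

N_BINS = 2  # hypothetical grid size per axis

# labels=False returns the integer index of the equal-width bin for each value.
sample["X_BIN"] = pd.cut(sample["X"], bins=N_BINS, labels=False)
sample["Y_BIN"] = pd.cut(sample["Y"], bins=N_BINS, labels=False)

# Combine both bins into one grid-cell label, e.g. "1-0" (Y bin, then X bin).
sample["CELL"] = sample["Y_BIN"].astype(str) + "-" + sample["X_BIN"].astype(str)

# Share of injury collisions (severity code 2) in each grid cell.
injury_share = (sample["SEVERITYCODE"] == 2).groupby(sample["CELL"]).mean()

print(injury_share)
```

Comparing the injury share across cells gives a categorical view of location that the raw X and Y values cannot express.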

After this, I will split the revised data into training and test data. I will separate out the training values into two categories: 1 (Property damage) or 2 (Injury collision).
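A minimal sketch of the split, using only pandas. The 80/20 ratio, the seed, and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows standing in for the wrangled data (values are illustrative only).
sample = pd.DataFrame({
    "PEDCOUNT": [0, 1, 0, 0, 2, 0, 1, 0, 0, 1],
    "VEHCOUNT": [2, 1, 2, 3, 1, 2, 1, 2, 2, 1],
    "SEVERITYCODE": [1, 2, 1, 1, 2, 1, 2, 1, 1, 2],
})

TEST_FRACTION = 0.2  # hypothetical share of rows held out for testing
SEED = 42            # fixed seed so the split is reproducible

# Randomly pick the test rows, then keep everything else for training.
test = sample.sample(frac=TEST_FRACTION, random_state=SEED)
train = sample.drop(index=test.index)

# Separate the training rows by severity code.
train_property = train[train["SEVERITYCODE"] == 1]  # property damage
train_injury = train[train["SEVERITYCODE"] == 2]    # injury collision

print(len(train), len(test))
print(len(train_property), len(train_injury))
```

scikit-learn's `train_test_split` is a common alternative and can also stratify the split by SEVERITYCODE so both classes keep their proportions.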

From there, I will start by examining Pearson correlation with the SEVERITYCODE values to see if there is a good predictor for SEVERITYCODE 2. If nothing in particular stands out, I will try out different modeling techniques to see if there is a good fit for the training data.
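The correlation check can be sketched like this, ranking numeric columns by the absolute value of their Pearson correlation with SEVERITYCODE. The toy values are illustrative and are not results from the collision data.

```python
import pandas as pd

# Toy rows (values are illustrative only).
sample = pd.DataFrame({
    "PEDCOUNT": [0, 1, 0, 0, 1, 1],
    "VEHCOUNT": [2, 1, 2, 3, 1, 2],
    "WEATHER": ["Clear", "Raining", "Clear", "Clear", "Overcast", "Clear"],
    "SEVERITYCODE": [1, 2, 1, 1, 2, 1],
})

# Pearson correlation is only defined for numeric columns.
numeric = sample.select_dtypes(include="number")

# Correlation of every numeric column with the target, excluding the target itself.
target_corr = numeric.corr(method="pearson")["SEVERITYCODE"].drop("SEVERITYCODE")

# Rank by absolute value so strong negative correlations are not overlooked.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)

print(ranked)
```

Sorting by absolute value matters here: in the toy rows VEHCOUNT is negatively correlated with the target, yet it is the stronger predictor of the two.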

To make this easier, I will likely visualize the data in a number of different ways depending on what solution I am trying to pursue. Some examples will likely include Folium Maps, but also simple box plot/line graphs and scatter plots.

From there, I will determine which factors I believe to be most relevant to the problem, and test them on the test data. I will examine the R^2 and MSE to determine whether or not the model is a good fit for the data.
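For reference, both metrics can be computed directly with NumPy. The predictions below are made-up numbers standing in for a model's output on the test data.

```python
import numpy as np

# Actual severity codes and hypothetical model predictions (illustrative only).
y_true = np.array([1, 2, 1, 1, 2, 2], dtype=float)
y_pred = np.array([1.2, 1.8, 1.1, 1.4, 1.6, 1.9])

residuals = y_true - y_pred

# Mean squared error: average squared distance between prediction and truth.
mse = np.mean(residuals ** 2)

# R^2: share of the variance in y_true that the predictions explain.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"MSE: {mse:.3f}")
print(f"R^2: {r_squared:.3f}")
```

A lower MSE and an R^2 closer to 1 indicate a better fit. Because SEVERITYCODE takes only two values, a classification measure such as accuracy may also be worth reporting alongside them.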

https://github.com/Kaiwong3006/Coursera_Capstone/tree/master





In [5]:
# Correlations before any data wrangling (numeric columns only)
crash_data.select_dtypes(include='number').corr()

# PERSONCOUNT, PEDCOUNT, PEDCYLCOUNT, and CROSSWALKKEY seem to have the strongest correlation with SEVERITYCODE. But that's without breaking down X and Y


| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | INTKEY | SEVERITYCODE.1 | PERSONCOUNT | PEDCOUNT | PEDCYLCOUNT | VEHCOUNT | SDOT_COLCODE | SDOTCOLNUM | SEGLANEKEY | CROSSWALKKEY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SEVERITYCODE | 1.0 | 0.010309 | 0.017737 | 0.02119 | 0.022581 | 0.022586 | 0.004849 | 1.0 | 0.128866 | 0.246722 | 0.214969 | -0.058067 | 0.185926 | 0.005814 | 0.104878 | 0.176014 |
| X | 0.010309 | 1.0 | -0.160262 | 0.009956 | 0.010309 | 0.0103 | 0.120754 | 0.010309 | 0.012887 | 0.011304 | -0.001752 | -0.012168 | 0.010904 | -0.001016 | -0.001618 | 0.013586 |
| Y | 0.017737 | -0.160262 | 1.0 | -0.023848 | -0.027396 | -0.027415 | -0.114935 | 0.017737 | -0.01385 | 0.010178 | 0.026304 | 0.017058 | -0.019694 | -0.006958 | 0.004618 | 0.009508 |
| OBJECTID | 0.02119 | 0.009956 | -0.023848 | 1.0 | 0.946085 | 0.945539 | 0.045476 | 0.02119 | -0.062879 | 0.025104 | 0.034791 | -0.095751 | -0.034854 | 0.969311 | 0.028291 | 0.05655 |
| INCKEY | 0.022581 | 0.010309 | -0.027396 | 0.946085 | 1.0 | 0.999996 | 0.046684 | 0.022581 | -0.062269 | 0.025094 | 0.031422 | -0.109595 | -0.026313 | 0.990651 | 0.019731 | 0.048362 |
| COLDETKEY | 0.022586 | 0.0103 | -0.027415 | 0.945539 | 0.999996 | 1.0 | 0.046652 | 0.022586 | -0.062174 | 0.025086 | 0.031372 | -0.109669 | -0.026172 | 0.990651 | 0.019615 | 0.048242 |
| INTKEY | 0.004849 | 0.120754 | -0.114935 | 0.045476 | 0.046684 | 0.046652 | 1.0 | 0.004849 | -0.000281 | -0.003988 | 0.000478 | -0.013624 | 0.007741 | 0.033923 | -0.010282 | 0.019296 |
| SEVERITYCODE.1 | 1.0 | 0.010309 | 0.017737 | 0.02119 | 0.022581 | 0.022586 | 0.004849 | 1.0 | 0.128866 | 0.246722 | 0.214969 | -0.058067 | 0.185926 | 0.005814 | 0.104878 | 0.176014 |
| PERSONCOUNT | 0.128866 | 0.012887 | -0.01385 | -0.062879 | -0.062269 | -0.062174 | -0.000281 | 0.128866 | 1.0 | -0.024764 | -0.040317 | 0.37564 | -0.136945 | 0.011847 | -0.022093 | -0.03341 |
| PEDCOUNT | 0.246722 | 0.011304 | 0.010178 | 0.025104 | 0.025094 | 0.025086 | -0.003988 | 0.246722 | -0.024764 | 1.0 | -0.017461 | -0.265337 | 0.267683 | 0.022448 | 0.001577 | 0.567358 |


In [6]:
# But wait, the X and Y correlations are wrong!
# It looks like certain neighborhoods might have a higher density of crashes, but the points are not clustered around one particular spot
# If X and Y were highly correlated, that would suggest a high concentration at a stadium or some other particular place
# So location may play a role, but not X by itself

crash_table_X = crash_data["X"].value_counts(dropna=True)
crash_table_X

crash_table_Y = crash_data["Y"].value_counts(dropna=True)
crash_table_Y

# The X/Y measures indicate only 5 values at the exact same spot. So what else can we look at?

47.708655    265
47.717173    254
47.604161    252
47.725036    239
47.579673    231
            ... 
47.556705      1
47.709101      1
47.513899      1
47.565438      1
47.563521      1
Name: Y, Length: 23839, dtype: int64

In [7]:
# get the first 2000 crashes in the crash_data dataframe
# (.copy() avoids chained-assignment warnings when the subset is modified later)
limit = 2000
crash_data = crash_data.iloc[0:limit, :].copy()

# Seattle latitude and longitude values
latitude = 47.61
longitude = -122.33

# create the map of Seattle
seattle_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# instantiate a feature group for the incidents in the dataframe
incidents = folium.FeatureGroup()

# loop through the 2000 crashes and add each to the incidents feature group
for lat, lng in zip(crash_data.Y, crash_data.X):
    incidents.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5, # define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6
        )
    )

# add incidents to the map and display it
seattle_map.add_child(incidents)

In [8]:
# Looking at a Seattle choropleth map
# download the Seattle boundaries GeoJSON file
!wget http://boundaries-api.seattle.io/boundaries -O seattle.geojson
print('GeoJSON file downloaded!')

--2020-09-28 18:24:17--  http://boundaries-api.seattle.io/boundaries
Resolving boundaries-api.seattle.io (boundaries-api.seattle.io)... 52.88.223.222
Connecting to boundaries-api.seattle.io (boundaries-api.seattle.io)|52.88.223.222|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://boundaries-api.seattle.io/boundaries [following]
--2020-09-28 18:24:17--  https://boundaries-api.seattle.io/boundaries
Connecting to boundaries-api.seattle.io (boundaries-api.seattle.io)|52.88.223.222|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘seattle.geojson’

seattle.geojson         [ <=>                ]     214  --.-KB/s    in 0s      

2020-09-28 18:24:18 (638 KB/s) - ‘seattle.geojson’ saved [214]

GeoJSON file downloaded!


In [9]:
# create a map of Seattle
seattle_map2 = folium.Map(location=[latitude, longitude], zoom_start=12, tiles='OpenStreetMap')

# instantiate a marker cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(seattle_map2)

# loop through the dataframe and add each data point to the marker cluster
for lat, lng, label in zip(crash_data.Y, crash_data.X, crash_data.SEVERITYDESC):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(incidents)

# display map
seattle_map2



In [10]:
# Look at the same map, now showing only the crashes with severity code 2

crash_data.sort_values(by=["SEVERITYCODE"], ascending=True, inplace=True)

Severe_crash = crash_data[crash_data['SEVERITYCODE'] == 2]

# create a map of Seattle
seattle_map3 = folium.Map(location=[latitude, longitude], zoom_start=12, tiles='OpenStreetMap')

# instantiate a marker cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(seattle_map3)

# loop through the dataframe and add each data point to the marker cluster
for lat, lng, label in zip(Severe_crash.Y, Severe_crash.X, Severe_crash.SEVERITYDESC):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(incidents)

# display map
seattle_map3



In [11]:
# So while there is a disproportionate number of crashes in certain areas, that doesn't necessarily equate to crash severity
# To show this, let's look more closely at the data

crash_data[["X","Y","SEVERITYCODE"]].corr()

# Future research: use ArcGIS with the coordinates for a better correlation

| | X | Y | SEVERITYCODE |
|---|---|---|---|
| X | 1.0 | -0.095699 | 0.019089 |
| Y | -0.095699 | 1.0 | 0.023628 |
| SEVERITYCODE | 0.019089 | 0.023628 | 1.0 |
