# Capstone Project - Predicting Accident Severity factors (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

The city council has recently pledged to reduce the number of injury collisions in the city after a high-profile case in which a child was injured made national news. However, council members currently disagree on the most effective way to do this.

They've pulled up the SDOT collision data to see which factors (if any) can predict whether an accident will be a property-damage collision or an injury collision.

From there, they can see which factor has the greatest predictive power, and implement the appropriate solution.

Here are some of the different competing factors (and solutions) that many of the city council members believe are the root cause:

1. Certain locations are dangerous, so if this has the highest correlation, then we need to find the intersections with the highest number of accidents to fix them.

2. Time of day may play a large role, so if this has the highest correlation, then we may need to install more streetlights.

3. The type of address (Block, Intersection, Alley) and Collision (Rear-end, Left Turn, etc.) may play a role, which means reviewing how the city approaches designing those types of roadways.

4. Weather or road conditions may play a role.

5. Other miscellaneous factors, such as the number of pedestrians or bicycles involved, may play a role.






## Data <a name="data"></a>

2.1 Data sources

I used the provided dataset and supplemented it with additional data, such as GeoJSON files for neighborhoods and ZIP codes. Several fields had missing values, so I applied data-cleaning approaches covered in previous courses. I included a dataset with the boundaries of Seattle neighborhoods and ZIP codes to see whether it might help define specific problem areas for injury collisions within Seattle.


In [2]:
%matplotlib inline

import matplotlib as mpl
import pandas as pd
import numpy as np

import folium
from folium import plugins

print('Folium installed and imported!')

print("Hello Capstone Project Course!")

Folium installed and imported!
Hello Capstone Project Course!


In [3]:
crash_data = pd.read_csv('https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv')

print('Data downloaded and read into a dataframe!')



Data downloaded and read into a dataframe!


2.2 Data cleaning

Before using this data, I had to deal with a number of problems within the dataset.
First, several identifiers, such as INATTENTIONIND and UNDERINFL, could not be used because too many of their values were missing. As a result, I removed those columns from the dataset, as they would not be good predictors.

Secondly, some columns, such as OBJECTID and INCKEY, held unique ESRI or other secondary identifiers that could not be correlated without a paid subscription to ArcGIS or access to seattle.gov. I eliminated a number of these columns for that reason.

Lastly, some of the data values were organized in a way that made standardization nearly impossible. For example, LOCATION was not useful because of the non-standard way it was categorized, and SDOT_COLDESC could not be parsed in a meaningful way. Since a number of these columns were also redundant, I did not use them as predictors.
For the remaining columns, I removed rows with missing values, as the other cleaning techniques learned (such as replacing missing entries with the column average) could not be applied reliably.
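
The column-dropping step described above can be sketched as follows. This is a minimal illustration rather than the exact procedure used in this notebook: the helper name `drop_sparse_columns` and the 40% cutoff are assumptions chosen for the example, and the toy values are made up.

```python
import pandas as pd


def drop_sparse_columns(df, max_missing_frac=0.4):
    """Return a copy of df without columns whose share of missing values exceeds the cutoff.

    Both the function name and the default cutoff are hypothetical.
    """
    missing_frac = df.isnull().mean()  # fraction of missing values per column
    keep = missing_frac[missing_frac <= max_missing_frac].index
    return df.loc[:, keep].copy()


# Toy frame standing in for the collision data (values are made up)
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2, 1],
    "WEATHER": ["Clear", "Rain", None, "Clear", "Rain"],  # 20% missing
    "SPEEDING": [None, None, "Y", None, None],            # 80% missing
})

cleaned = drop_sparse_columns(toy)    # SPEEDING is dropped, WEATHER is kept
remaining = cleaned.dropna(axis=0)    # then drop rows that still have gaps
```

Dropping sparse columns first, and incomplete rows second, keeps far more rows than calling `dropna` on the full frame would.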

In [4]:
crash_data.isnull().sum()

SEVERITYCODE           0
X                   5334
Y                   5334
OBJECTID               0
INCKEY                 0
COLDETKEY              0
REPORTNO               0
STATUS                 0
ADDRTYPE            1926
INTKEY            129603
LOCATION            2677
EXCEPTRSNCODE     109862
EXCEPTRSNDESC     189035
SEVERITYCODE.1         0
SEVERITYDESC           0
COLLISIONTYPE       4904
PERSONCOUNT            0
PEDCOUNT               0
PEDCYLCOUNT            0
VEHCOUNT               0
INCDATE                0
INCDTTM                0
JUNCTIONTYPE        6329
SDOT_COLCODE           0
SDOT_COLDESC           0
INATTENTIONIND    164868
UNDERINFL           4884
WEATHER             5081
ROADCOND            5012
LIGHTCOND           5170
PEDROWNOTGRNT     190006
SDOTCOLNUM         79737
SPEEDING          185340
ST_COLCODE            18
ST_COLDESC          4904
SEGLANEKEY             0
CROSSWALKKEY           0
HITPARKEDCAR           0
dtype: int64

In [5]:
# optionally drop identifier columns that are not useful as predictors
#crash_data.drop(['INCKEY','OBJECTID','COLDETKEY','SDOTCOLNUM'], axis=1, inplace=True)


# for the sake of consistency, make all column labels of type string
crash_data.columns = list(map(str, crash_data.columns))

# remove rows with missing X or Y values
crash_data.dropna(subset=["X","Y"], axis=0, inplace=True)

# reset the index, because rows were dropped
crash_data.reset_index(drop=True, inplace=True)

crash_data.isnull().sum()

crash_data.shape


(189339, 38)

After data cleaning, there were 189,339 samples and 38 columns in the data.
However, upon examining the meaning of each feature, a number of them turned out to be redundant. For example, LOCATION, a description of the block where the accident occurred, can also be recovered from the X and Y coordinates. Likewise, SDOT_COLDESC, a description of the incident, duplicates the information in SDOT_COLCODE.

After filtering out redundant features, as well as those missing too many values and those not relevant to the problem at hand, I was left with 13 features to use as predictors, as well as the value that I was predicting, SEVERITYCODE.


## Methodology <a name="methodology"></a>

3. Exploratory Data Analysis

X-Y Co-ordinates
One of the hypotheses discussed was that there may be high-risk areas within the city that need to be addressed. For this to be true, certain areas should have a higher-than-normal number of collisions.
The X-Y coordinates were easier to work with than LOCATION, so I cleaned up the X-Y data and plotted a limited set of coordinates on a Folium map. At first glance, the results looked promising.



In [5]:
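# Correlations without data wrangling
# numeric_only=True skips text columns (required by recent pandas versions)
crash_data.corr(numeric_only=True)


# PEDCOUNT, PEDCYLCOUNT, SDOT_COLCODE, CROSSWALKKEY and PERSONCOUNT are the most strongly correlated with SEVERITYCODE (all below 0.25). But that's without breaking down X and Y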


Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,INTKEY,SEVERITYCODE.1,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,SDOTCOLNUM,SEGLANEKEY,CROSSWALKKEY
SEVERITYCODE,1.0,0.010309,0.017737,0.02119,0.022581,0.022586,0.004849,1.0,0.128866,0.246722,0.214969,-0.058067,0.185926,0.005814,0.104878,0.176014
X,0.010309,1.0,-0.160262,0.009956,0.010309,0.0103,0.120754,0.010309,0.012887,0.011304,-0.001752,-0.012168,0.010904,-0.001016,-0.001618,0.013586
Y,0.017737,-0.160262,1.0,-0.023848,-0.027396,-0.027415,-0.114935,0.017737,-0.01385,0.010178,0.026304,0.017058,-0.019694,-0.006958,0.004618,0.009508
OBJECTID,0.02119,0.009956,-0.023848,1.0,0.946085,0.945539,0.045476,0.02119,-0.062879,0.025104,0.034791,-0.095751,-0.034854,0.969311,0.028291,0.05655
INCKEY,0.022581,0.010309,-0.027396,0.946085,1.0,0.999996,0.046684,0.022581,-0.062269,0.025094,0.031422,-0.109595,-0.026313,0.990651,0.019731,0.048362
COLDETKEY,0.022586,0.0103,-0.027415,0.945539,0.999996,1.0,0.046652,0.022586,-0.062174,0.025086,0.031372,-0.109669,-0.026172,0.990651,0.019615,0.048242
INTKEY,0.004849,0.120754,-0.114935,0.045476,0.046684,0.046652,1.0,0.004849,-0.000281,-0.003988,0.000478,-0.013624,0.007741,0.033923,-0.010282,0.019296
SEVERITYCODE.1,1.0,0.010309,0.017737,0.02119,0.022581,0.022586,0.004849,1.0,0.128866,0.246722,0.214969,-0.058067,0.185926,0.005814,0.104878,0.176014
PERSONCOUNT,0.128866,0.012887,-0.01385,-0.062879,-0.062269,-0.062174,-0.000281,0.128866,1.0,-0.024764,-0.040317,0.37564,-0.136945,0.011847,-0.022093,-0.03341
PEDCOUNT,0.246722,0.011304,0.010178,0.025104,0.025094,0.025086,-0.003988,0.246722,-0.024764,1.0,-0.017461,-0.265337,0.267683,0.022448,0.001577,0.567358
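
To make a table like this easier to read, one option is to pull out only the column for the target and rank the other features by absolute correlation. The sketch below runs on a small made-up frame; the helper name `rank_by_target_correlation` is an assumption for illustration, and `numeric_only=True` assumes pandas 1.5 or later. Pearson correlation against a two-valued severity code is only a rough screening measure, and on the real data the duplicate `SEVERITYCODE.1` column would need to be dropped first.

```python
import pandas as pd


def rank_by_target_correlation(df, target="SEVERITYCODE"):
    """Return correlations with the target, strongest (by absolute value) first.

    The function name is hypothetical; non-numeric columns are ignored.
    """
    corr = df.corr(numeric_only=True)[target].drop(labels=[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data standing in for the collision dataset (values are made up)
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 1, 2, 2, 2],
    "PEDCOUNT":     [0, 0, 0, 1, 1, 0],
    "VEHCOUNT":     [2, 2, 3, 1, 1, 2],
    "ADDRTYPE":     ["Block", "Block", "Alley", "Intersection", "Block", "Block"],
})

ranking = rank_by_target_correlation(toy)
print(ranking)
```

Keeping the sign while sorting by magnitude shows both how strong each relationship is and in which direction it runs.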


In [6]:
# But wait, the X and Y correlations are misleading!
# Certain neighborhoods might have a higher density of accidents, but the points do not cluster around one specific spot
# If X and Y were highly correlated with severity, that would mean a high concentration at, say, a stadium or another particular place
# So location may play a role, but not X

crash_table_X = crash_data["X"].value_counts(dropna=True)
crash_table_X

crash_table_Y = crash_data["Y"].value_counts(dropna=True)
crash_table_Y

#X/Y Measures indicate only 5 values at exact same spot. So what else can we look at? 

47.708655    265
47.717173    254
47.604161    252
47.725036    239
47.579673    231
            ... 
47.556705      1
47.709101      1
47.513899      1
47.565438      1
47.563521      1
Name: Y, Length: 23839, dtype: int64

In [6]:
# keep the first 2000 collisions in the crash_data dataframe
limit = 2000
crash_data = crash_data.iloc[0:limit, :].copy()

# Seattle latitude and longitude values
latitude = 47.61
longitude = -122.33

# create a map centered on Seattle
seattle_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# display the map of Seattle
seattle_map

# instantiate a feature group for the incidents in the dataframe
incidents = folium.FeatureGroup()

# loop through the collisions and add each to the incidents feature group
for lat, lng in zip(crash_data.Y, crash_data.X):
    incidents.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5, # define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6
        )
    )

# add incidents to map
seattle_map.add_child(incidents)

The map showed clusters of collisions, suggesting that certain areas, particularly Downtown Seattle, had a higher-than-normal number of collisions. When another map was created using only injury collisions, it seemed to strengthen this result.

In [7]:
# create a map centered on Seattle
seattle_map2 = folium.Map(location=[latitude, longitude], zoom_start=12, tiles='OpenStreetMap')

# instantiate a mark cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(seattle_map2)

# loop through the dataframe and add each data point to the mark cluster
for lat, lng, label, in zip(crash_data.Y, crash_data.X, crash_data.SEVERITYDESC):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(incidents)

# display map
seattle_map2



In [8]:
# Look at the same map, now restricted to injury collisions (SEVERITYCODE == 2)

crash_data.sort_values(by=["SEVERITYCODE"], ascending=True, inplace=True)

Severe_crash = crash_data[crash_data['SEVERITYCODE'] == 2]

# create a map centered on Seattle
seattle_map3 = folium.Map(location=[latitude, longitude], zoom_start=12, tiles='OpenStreetMap')

# instantiate a mark cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(seattle_map3)

# loop through the dataframe and add each data point to the mark cluster
for lat, lng, label, in zip(Severe_crash.Y, Severe_crash.X, Severe_crash.SEVERITYDESC):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(incidents)

# display map
seattle_map3



While it might be ideal to implement initiatives across all neighborhoods, there seems to be one specific area where changes would be most beneficial. Another version of the map, which shows only locations where more than one accident has occurred, seems to support this conclusion.

In [11]:
# Look at the same map, now only with locations that have more than one accident
# Build a Coordinates value that joins Y and X into one string, then count how many accidents share each coordinate pair

crash_data["Coordinates"] = crash_data["Y"].map(str) + "," + crash_data["X"].map(str)

Accidentcount = crash_data["Coordinates"].value_counts(sort=True)

crash_data["Accidents"] = crash_data.groupby('Coordinates')['Coordinates'].transform('count')



crash_data.sort_values(by=["Accidents"], ascending=True, inplace=True)

Accident = crash_data[crash_data['Accidents'] > 1]

# create a map centered on Seattle
seattle_map4 = folium.Map(location=[latitude, longitude], zoom_start=12, tiles='OpenStreetMap')

# instantiate a mark cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(seattle_map4)

# loop through the dataframe and add each data point to the mark cluster
for lat, lng, label, in zip(Accident.Y, Accident.X, Accident.SEVERITYDESC):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(incidents)

# display map
seattle_map4


However, the location of accidents is not a good predictor by itself. While certain locations have higher numbers of severe accidents, that is simply because there are more accidents in those areas overall. There also seem to be places with a high number of accidents, few of which are severe. Therefore, we need another metric to examine this.

In [11]:
# So while there is a disproportionate number of crashes in certain areas, that doesn't necessarily equate to severity of crash
# To show this, let's look more closely at the data

crash_data[["X","Y","SEVERITYCODE"]].corr()

#Future research: ArcGIS into co-ordinates for better correlation

|  | X | Y | SEVERITYCODE |
|---|---|---|---|
| X | 1.0 | -0.095699 | 0.019089 |
| Y | -0.095699 | 1.0 | 0.023628 |
| SEVERITYCODE | 0.019089 | 0.023628 | 1.0 |
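
One candidate for the additional metric mentioned above is the share of collisions in an area that involve injury, rather than the raw count. The sketch below groups points into coarse grid cells by rounding coordinates. The helper name `injury_rate_by_cell`, the rounding precision, and the minimum-count cutoff are all assumptions for illustration, not choices made in this notebook, and the toy coordinates are made up.

```python
import pandas as pd


def injury_rate_by_cell(df, decimals=2, min_collisions=2):
    """Share of injury collisions (SEVERITYCODE == 2) per rounded coordinate cell.

    The name, rounding precision, and cutoff are hypothetical choices.
    """
    cells = df.assign(
        cell_x=df["X"].round(decimals),
        cell_y=df["Y"].round(decimals),
        is_injury=(df["SEVERITYCODE"] == 2),
    )
    summary = (
        cells.groupby(["cell_y", "cell_x"])["is_injury"]
        .agg(collisions="count", injury_rate="mean")
        .reset_index()
    )
    # Ignore cells with too few collisions for the rate to mean anything
    summary = summary[summary["collisions"] >= min_collisions]
    return summary.sort_values("injury_rate", ascending=False)


# Toy points: a busy cell with mostly property damage, and a quieter cell
# where every collision involved injury
toy = pd.DataFrame({
    "X": [-122.331, -122.332, -122.334, -122.301, -122.302, -122.250],
    "Y": [47.611, 47.612, 47.613, 47.651, 47.652, 47.700],
    "SEVERITYCODE": [1, 1, 2, 2, 2, 1],
})

rates = injury_rate_by_cell(toy)
print(rates)
```

In this toy example the cell with the most collisions is not the one with the highest injury rate, which is the distinction the raw marker maps cannot show.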
