## Cleaning the data in preparation for modeling
Our goal: Figure out what makes an intersection dangerous for pedestrians and cyclists. 

<b>Approach/model 1: </b><br>
Look at the number of collisions at each intersection and quantify how much each feature of the intersection (e.g., pavement, crosswalk) contributes to a high number of collisions.
Implementation: group by intersection in pandas to get the number of collisions, then fit a regression model on the features in the df that correspond to characteristics of that intersection.
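
A minimal sketch of this implementation on toy data. The feature columns `has_crosswalk` and `is_paved` and all values are illustrative assumptions, and ordinary least squares via NumPy stands in for whichever regression model is eventually chosen.

```python
import numpy as np
import pandas as pd

# Toy collision records: one row per collision (hypothetical values).
toy = pd.DataFrame({
    'INTKEY':        [1, 1, 1, 2, 3, 3, 4],
    'has_crosswalk': [0, 0, 0, 1, 0, 0, 1],
    'is_paved':      [1, 1, 1, 1, 0, 0, 1],
})

# One row per intersection: its (constant) features plus the collision count.
grouped = toy.groupby('INTKEY')
per_intersection = grouped[['has_crosswalk', 'is_paved']].first()
per_intersection['n_collisions'] = grouped.size()

# Ordinary least squares: n_collisions ~ intercept + features.
X = np.column_stack([
    np.ones(len(per_intersection)),
    per_intersection[['has_crosswalk', 'is_paved']].to_numpy(dtype=float),
])
y = per_intersection['n_collisions'].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(per_intersection)
print(dict(zip(['intercept', 'has_crosswalk', 'is_paved'], coef.round(2))))
```

The fitted coefficients are only as meaningful as the features; with real data the per-intersection characteristics would come from the cleaned dataframe rather than invented columns.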

Potential issues: the data has no measure of how many people use each intersection, which may be an important characteristic. With such a measure, danger could be defined as #collisions/#people using the intersection; some high-traffic intersections might then have lower collision rates than others.
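
The rate-based definition can be sketched as follows. The dataset has no usage counts, so the `daily_users` values below are invented purely for illustration.

```python
import pandas as pd

# Hypothetical per-intersection totals; `daily_users` is NOT in the collisions data.
exposure = pd.DataFrame(
    {'n_collisions': [30, 12], 'daily_users': [20000, 2000]},
    index=pd.Index(['busy_intersection', 'quiet_intersection'], name='intersection'),
)

# Danger as collisions per 1,000 daily users rather than raw collision count.
exposure['collisions_per_1k_users'] = (
    1000 * exposure['n_collisions'] / exposure['daily_users']
)

print(exposure.sort_values('collisions_per_1k_users', ascending=False))
```

In this toy case the busy intersection has more collisions in total but a lower rate per user, which is the situation described above.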

<b>Approach/model 2:</b><br>
Look at other features that are highly correlated with collisions. These can be independent of the intersection (weather, light levels, etc.). The city cannot change some of these, but it can mitigate others (e.g., low light levels with street lights). We should also group by intersection to see whether some intersections have a disproportionate number of collisions under these conditions.
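
One way to look for intersections with a disproportionate share of collisions under a given condition is a row-normalized cross-tabulation. A minimal sketch on toy data; the values are invented and the `LIGHTCOND` labels are simplified to two categories.

```python
import pandas as pd

toy = pd.DataFrame({
    'INTKEY':    [1, 1, 1, 1, 2, 2, 2, 2],
    'LIGHTCOND': ['Dark', 'Dark', 'Dark', 'Daylight',
                  'Daylight', 'Daylight', 'Daylight', 'Dark'],
})

# Share of each intersection's collisions that occurred under each light condition.
share = pd.crosstab(toy['INTKEY'], toy['LIGHTCOND'], normalize='index')

# Compare each intersection's dark share with the overall dark share.
overall_dark = (toy['LIGHTCOND'] == 'Dark').mean()
excess_dark = share['Dark'] - overall_dark

print(share)
print(excess_dark)
```

A positive `excess_dark` flags an intersection whose collisions skew towards dark conditions relative to the dataset as a whole.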

### Cleaning
The first part of data cleaning is to accurately measure the number of ped-car and cyclist-car collisions. Approximately 9% of the records have no pedestrian, vehicle, OR cyclist count, which limits our ability to categorize collisions accurately.

In [77]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from pandas_profiling import ProfileReport


# Set style and settings
plt.style.use('ggplot')
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 15)

In [78]:
collisions = pd.read_csv('../data/external/Collisions.csv',
                        parse_dates={'Datetime': ['INCDTTM']},
                        infer_datetime_format=True)

collisions = (
    collisions.set_index('Datetime')
    .sort_index()
    .drop(columns=['EXCEPTRSNDESC', 'EXCEPTRSNCODE', 'REPORTNO', 'STATUS'])
)

### Filling in missing data
Approximately 9% of the data is missing the persons/cars involved. We can fill this in by looking at the SDOT descriptions of the accidents.

In [79]:
# How many records involve ZERO people or vehicles (i.e., poor bookkeeping)?
# .copy() lets us overwrite the counts later without a SettingWithCopyWarning
no_people = collisions.loc[(collisions['PEDCOUNT'] == 0) & 
           (collisions['PEDCYLCOUNT'] == 0) & 
           (collisions['PERSONCOUNT'] == 0) & 
            (collisions['VEHCOUNT'] == 0)].copy()

people = collisions.loc[(collisions['PEDCOUNT'] != 0) | 
           (collisions['PEDCYLCOUNT'] != 0) | 
           (collisions['PERSONCOUNT'] != 0) | 
            (collisions['VEHCOUNT'] != 0)]

print('Fraction of data with no people involved: ', no_people.shape[0]/collisions.shape[0])

Fraction of data with no people involved:  0.08778965323268431


In [80]:
# Make dictionary mapping each description to its count
d = no_people['SDOT_COLDESC'].value_counts().to_dict()
d

{'NOT ENOUGH INFORMATION / NOT APPLICABLE': 8060,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE': 4765,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END': 3963,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE SIDESWIPE': 894,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE AT ANGLE': 529,
 'MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT': 306,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE': 184,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE AT ANGLE': 168,
 'MOTOR VEHCILE STRUCK PEDESTRIAN': 159,
 'MOTOR VEHICLE STRUCK OBJECT IN ROAD': 148,
 'MOTOR VEHICLE STRUCK PEDALCYCLIST, FRONT END AT ANGLE': 67,
 'PEDALCYCLIST STRUCK MOTOR VEHICLE FRONT END AT ANGLE': 24,
 'MOTOR VEHICLE OVERTURNED IN ROAD': 22,
 'DRIVERLESS VEHICLE RAN OFF ROAD - HIT FIXED OBJECT': 10,
 'DRIVERLESS VEHICLE STRUCK MOTOR VEHICLE REAR END': 7,
 'PEDALCYCLIST STRUCK MOTOR VEHICLE LEFT SIDE SIDESWIPE': 7,
 'PEDALCYCLIST OVERTURNED IN ROAD': 6,
 'DRIVERLESS VEHICLE STRUCK MOTOR VEHICLE FRONT EN

In [82]:
# Replace the values in no_people df and merge with original collisions set
# Count keyword occurrences in each description; missing descriptions count as 0.
# The data spells the pedestrian description 'MOTOR VEHCILE ...', so match both spellings.
desc = no_people['SDOT_COLDESC'].fillna('')
veh_count = desc.str.count('VEHICLE|VEHCILE')
ped_count = desc.str.count('PEDESTRIAN')
cyclist_count = desc.str.count('PEDALCYCLIST')

no_people.loc[:,'VEHCOUNT'] = veh_count
no_people.loc[:,'PEDCOUNT'] = ped_count
no_people.loc[:, 'PEDCYLCOUNT'] = cyclist_count
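
As a sanity check on the keyword counting, the sketch below applies it to three descriptions taken from the counts above, plus one missing value. It compares a plain `'VEHICLE'` count with a pattern that also matches the misspelling `VEHCILE` found in the pedestrian description; the alternation pattern is one possible way to handle it.

```python
import pandas as pd

desc = pd.Series([
    'MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END',
    'MOTOR VEHCILE STRUCK PEDESTRIAN',  # misspelled in the source data
    'MOTOR VEHICLE STRUCK PEDALCYCLIST, FRONT END AT ANGLE',
    None,                               # missing description
]).fillna('')

counts = pd.DataFrame({
    'VEHCOUNT_plain': desc.str.count('VEHICLE'),
    'VEHCOUNT_fixed': desc.str.count('VEHICLE|VEHCILE'),  # regex alternation
    'PEDCOUNT':       desc.str.count('PEDESTRIAN'),
    'PEDCYLCOUNT':    desc.str.count('PEDALCYCLIST'),
})
print(counts)
```

Note that `'PEDESTRIAN'` does not match inside `'PEDALCYCLIST'`, so the pedestrian and cyclist counts do not interfere with each other.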

In [132]:
# Merge the people and no_people dataframes
df = pd.concat([no_people, people])
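
Because `people` is selected with the logical negation of the `no_people` condition, the two frames should partition the original data. A small check of that idea on toy data (values invented), which also shows that sorting the index restores chronological order after concatenation:

```python
import pandas as pd

toy = pd.DataFrame(
    {'PEDCOUNT': [0, 1, 0, 0], 'VEHCOUNT': [0, 1, 2, 0]},
    index=pd.to_datetime(['2004-01-01', '2004-01-02', '2004-01-03', '2004-01-04']),
)

all_zero = (toy['PEDCOUNT'] == 0) & (toy['VEHCOUNT'] == 0)
no_people_toy = toy.loc[all_zero].copy()
people_toy = toy.loc[~all_zero]

# Concatenating puts the no_people rows first; sort_index restores time order.
merged = pd.concat([no_people_toy, people_toy]).sort_index()

print(len(merged) == len(toy))
print(merged.equals(toy))
```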

### Dealing with NaNs and binary Y/N

In [133]:
# Assigning 0/1 to binary features
df = (
    pd.get_dummies(df, columns=['SPEEDING', 'INATTENTIONIND', 'HITPARKEDCAR', 'PEDROWNOTGRNT'], dtype=int)
    .drop(columns=['HITPARKEDCAR_N'])
)

In [134]:
# Fixing alcohol influence
print(df['UNDERINFL'].value_counts())
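# Labels are mixed (N/0/Y/1) and the digits may be stored as strings,
# so treat 'Y', '1', and 1 as positive; everything else (including NaN) becomes 0
df['UNDERINFL'] = df['UNDERINFL'].isin(['Y', '1', 1]).astype(int)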

N    103000
0     81676
Y      5398
1      4230
Name: UNDERINFL, dtype: int64


### Dealing with weather, road conditions, and light
We group these into fewer categories. If our categorical variables have high cardinality, any model using a random forest (RF) will end up building sparse trees, which can bias the RF towards the continuous variables.
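
To illustrate the cardinality point, the sketch below one-hot encodes a toy weather column before and after grouping and compares the number of resulting columns. The raw labels other than `Clear`, `Overcast`, and `Unknown` are assumed examples.

```python
import pandas as pd

raw = pd.Series(
    ['Clear', 'Overcast', 'Raining', 'Snowing', 'Fog', 'Sleet', 'Unknown', 'Clear'],
    name='WEATHER',
)

# Collapse into three groups: 0 = unknown, 1 = clear/overcast, 2 = everything else.
grouped = raw.map({'Unknown': 0, 'Clear': 1, 'Overcast': 1}).fillna(2).astype(int)

# Number of indicator columns each version would produce.
n_raw = pd.get_dummies(raw).shape[1]
n_grouped = pd.get_dummies(grouped).shape[1]
print(n_raw, n_grouped)
```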

Weather Conditions:<br>
0: Unknown or NaN <br>
1: Clear and Overcast<br>
2: All others (rain/sleet/snow/fog)<br>

Road Conditions:<br>
0: Unknown or NaN<br>
1: Dry<br>
2: All others (wet/sand/mud)<br>


Light Conditions: <br>
0: Unknown or Other or NaN<br>
1: Daylight<br>
2: Dark & no street lights (or lights off)<br>
3: Dark (dawn/dusk/street lights)<br>


### Lastly, group severity codes together 

0: Unknown or NaN<br>
1: property damage<br>
2: injury (minor, serious, fatality)<br>

In [135]:
df_temp = df[['WEATHER', 'ROADCOND', 'LIGHTCOND', 'SEVERITYCODE']]
df_temp.head()
df_temp.isnull().sum()

WEATHER         26340
ROADCOND        26260
LIGHTCOND       26429
SEVERITYCODE        1
dtype: int64

In [136]:
def encode_weather(x):
    if (x == 'Unknown') or (x == 0):
        return 0
    elif (x == 'Clear') or (x == 'Overcast'):
        return 1
    else:
        return 2    

def encode_road(x):    
    if (x == 'Unknown') or (x == 0):
        return 0
    elif x == 'Dry':
        return 1
    else:
        return 2
    
def encode_light(x):
    if (x == 'Unknown') or (x == 'Other') or (x == 0):
        return 0
    elif x == 'Daylight':
        return 1
    elif (x == 'Dark - No Street Lights') or (x == 'Dark - Street Lights Off'):
        return 2
    else:
        return 3
    
def encode_severity(x):
    """Collapse the injury codes ('2', '2b', '3') to 2; all other codes should already be 0 or 1."""
    injury_list = ['2', '2b', '3']
    if x in injury_list:
        return 2
    else:
        return int(x)

In [137]:
# Encode weather, road, light conditions, and severity. NaN is filled with 0 before encoding

for series, function in zip(['WEATHER', 'ROADCOND', 'LIGHTCOND', 'SEVERITYCODE'], 
                            [encode_weather, encode_road, encode_light, encode_severity]):
    df[series] = df_temp[series].fillna(0).apply(function)
df.head()

Datetime,X,Y,OBJECTID,INCKEY,COLDETKEY,ADDRTYPE,INTKEY,LOCATION,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INJURIES,SERIOUSINJURIES,FATALITIES,INCDATE,JUNCTIONTYPE,SDOT_COLCODE,SDOT_COLDESC,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SDOTCOLNUM,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,SPEEDING_Y,INATTENTIONIND_Y,HITPARKEDCAR_Y,PEDROWNOTGRNT_Y
2003-10-06,-122.320755,47.608073,1680,3663,3663,Intersection,29797.0,BROADWAY AND CHERRY ST,0,Unknown,,0,0,0,0,0,0,0,2003/10/06 00:00:00+00,,,,0,0,0,0,3279003.0,,,0,0,0,0,0,1
2004-01-01,,,16515,29248,29248,,,,0,Unknown,,0,0,0,0,0,0,0,2004/01/01 00:00:00+00,,0.0,NOT ENOUGH INFORMATION / NOT APPLICABLE,0,0,0,0,4001030.0,,,0,0,0,0,0,0
2004-01-01,-122.31352,47.601688,9624,22796,22796,Block,,E YESLER WAY BETWEEN 14TH AVE AND 15TH AVE,0,Unknown,,0,0,0,2,0,0,0,2004/01/01 00:00:00+00,Mid-Block (but intersection related),14.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",0,0,0,0,4001029.0,,,0,0,0,0,0,0
2004-01-01,-122.360959,47.571594,11719,25538,25538,Block,,WEST SEATTLE BR WB BETWEEN W SEATTLE BR WB OFF...,0,Unknown,,0,0,0,2,0,0,0,2004/01/01 00:00:00+00,Mid-Block (not related to intersection),14.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",0,0,0,0,4001028.0,,,0,0,0,0,0,0
2004-01-01,-122.337454,47.615057,13533,27177,27177,Intersection,29540.0,7TH AVE AND VIRGINIA ST,0,Unknown,,0,0,0,2,0,0,0,2004/01/01 00:00:00+00,At Intersection (intersection related),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",0,0,0,0,4001027.0,,,0,0,0,0,0,0


### Drop some columns
We drop columns conservatively, in case we want to do more with the data later.



In [138]:
df = df.drop(columns=['SEGLANEKEY', 'SDOTCOLNUM', 'SEVERITYDESC', 
                      'COLLISIONTYPE', 'SDOT_COLDESC', 'COLDETKEY'])

In [139]:
df.columns

Index(['X', 'Y', 'OBJECTID', 'INCKEY', 'ADDRTYPE', 'INTKEY', 'LOCATION',
       'SEVERITYCODE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT',
       'INJURIES', 'SERIOUSINJURIES', 'FATALITIES', 'INCDATE', 'JUNCTIONTYPE',
       'SDOT_COLCODE', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'ST_COLCODE', 'ST_COLDESC', 'CROSSWALKKEY', 'SPEEDING_Y',
       'INATTENTIONIND_Y', 'HITPARKEDCAR_Y', 'PEDROWNOTGRNT_Y'],
      dtype='object')

(59311, 36)

### Save as csv and pkl

In [140]:
# Pickle the dataframe AND save as csv
df.to_pickle('../data/processed/cleaned_data.pkl')
df.to_csv('../data/processed/cleaned_data.csv')
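
When reloading, note that the pickle preserves the `Datetime` index and the dtypes, whereas the CSV stores everything as text and needs the index re-specified. A minimal round-trip sketch on a toy frame, writing to a temporary directory rather than the project paths:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {'SEVERITYCODE': [0, 2], 'LOCATION': ['BROADWAY AND CHERRY ST', None]},
    index=pd.DatetimeIndex(['2003-10-06', '2004-01-01'], name='Datetime'),
)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'toy.pkl')
    csv_path = os.path.join(tmp, 'toy.csv')
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path)

    from_pkl = pd.read_pickle(pkl_path)
    # The CSV stores the index as text, so parse it back explicitly.
    from_csv = pd.read_csv(csv_path, index_col='Datetime', parse_dates=True)

print(from_pkl.equals(toy))
print(isinstance(from_csv.index, pd.DatetimeIndex))
```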