## Cleaning the data in preparation for modeling
Our goal: Figure out what makes an intersection dangerous for pedestrians and cyclists. 

<b>Approach/model 1: </b><br>
Look at the number of collisions at each intersection and quantify how much each feature of the intersection (e.g., pavement, crosswalk) contributes to a high number of collisions.
Implementation: use a pandas group-by on intersection to get the number of collisions, then fit a regression model on the columns of the df that describe that intersection.
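
A minimal sketch of this implementation on toy data. The column names (`INTKEY`, `HAS_CROSSWALK`, `PAVED`) and all values are illustrative assumptions, and ordinary least squares via NumPy stands in for whichever regression model is eventually chosen.

```python
import numpy as np
import pandas as pd

# Toy collision records, one row per collision. All names and values are hypothetical.
toy = pd.DataFrame({
    'INTKEY':        [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    'HAS_CROSSWALK': [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    'PAVED':         [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})
features = ['HAS_CROSSWALK', 'PAVED']

# One row per intersection: its features plus its collision count.
grouped = toy.groupby('INTKEY')
per_intersection = grouped[features].first().join(grouped.size().rename('n_collisions'))

# Ordinary least squares: n_collisions ~ intercept + features.
X = np.column_stack([
    np.ones(len(per_intersection)),
    per_intersection[features].to_numpy(dtype=float),
])
y = per_intersection['n_collisions'].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Label the fitted coefficients so each feature's contribution is readable.
coefficients = pd.Series(coef, index=['intercept'] + features)
print(coefficients)
```

A negative coefficient would suggest that the feature is associated with fewer collisions, although with raw counts this says nothing about how many people use the intersection.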

Potential issues: the data has no measure of how many people use each intersection, which may be an important characteristic. Danger could then be defined as #collisions / #people using the intersection; some high-traffic intersections might have lower collision rates than others.
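
The normalization suggested above, sketched on invented numbers. The `daily_users` column is hypothetical: the collisions data has no such exposure measure, so it would have to come from another source.

```python
import pandas as pd

# Hypothetical per-intersection counts; daily_users is an assumed exposure measure.
exposure = pd.DataFrame(
    {'n_collisions': [12, 3], 'daily_users': [6000, 500]},
    index=pd.Index(['busy', 'quiet'], name='intersection'),
)

# Collisions per 1,000 daily users: a rate rather than a raw count.
exposure['collisions_per_1000_users'] = (
    1000 * exposure['n_collisions'] / exposure['daily_users']
)
print(exposure.sort_values('collisions_per_1000_users', ascending=False))
```

In this toy case the busy intersection has more collisions but the lower rate, which is the effect described above.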

<b>Approach/model 2:</b><br>
Look at other features that are highly correlated with collisions. These can be independent of the intersection (weather, light levels, etc.). While the city cannot improve some of these, it can improve others (e.g., low light levels can be mitigated by street lights). There should be some grouping by intersection to see whether some intersections have a disproportionate number of collisions due to these effects.
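
One way to check for disproportionate intersections, sketched on toy data. The column names `INTKEY` and `LIGHTCOND` and the category labels are assumptions made for illustration.

```python
import pandas as pd

# Toy collision records; column names and values are hypothetical.
toy = pd.DataFrame({
    'INTKEY':    [1, 1, 1, 1, 2, 2, 2, 2],
    'LIGHTCOND': ['Dark', 'Dark', 'Dark', 'Daylight',
                  'Daylight', 'Daylight', 'Daylight', 'Dark'],
})

# Share of each intersection's collisions under each light condition (rows sum to 1).
light_share = pd.crosstab(toy['INTKEY'], toy['LIGHTCOND'], normalize='index')

# Compare against the overall share to spot disproportionate intersections.
overall_share = toy['LIGHTCOND'].value_counts(normalize=True)
excess_dark = light_share['Dark'] - overall_share['Dark']
print(excess_dark)
```

A large positive value in `excess_dark` flags an intersection where collisions in the dark are over-represented relative to the dataset as a whole.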

### Cleaning
The first step in data cleaning is to accurately count the pedestrian-car and cyclist-car collisions. Approximately 9% of the records have no pedestrian, car, or cyclist count, which limits our ability to categorize accidents accurately.

In [127]:
import pandas as pd
import matplotlib.pyplot as plt
import pickle
from pandas_profiling import ProfileReport

# Set style and settings
plt.style.use('ggplot')
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 15)

In [128]:
collisions = pd.read_csv('../data/external/Collisions.csv',
                        parse_dates={'Datetime': ['INCDTTM']},
                        infer_datetime_format=True)

collisions = (
    collisions.set_index('Datetime')
    .sort_index()
    .drop(columns=['EXCEPTRSNDESC', 'EXCEPTRSNCODE', 'REPORTNO', 'STATUS'])
)

### Filling in missing data
Approximately 9% of the data is missing the people and vehicles involved. We can fill this in by looking at the SDOT descriptions of the accidents.

In [129]:
# How many of these involve ZERO people or vehicles (i.e., terrible book-keeping)?
count_cols = ['PEDCOUNT', 'PEDCYLCOUNT', 'PERSONCOUNT', 'VEHCOUNT']
all_zero = (collisions[count_cols] == 0).all(axis=1)

# Rows where every count is zero, and the complementary set of rows
no_people = collisions.loc[all_zero]
people = collisions.loc[~all_zero]

print('Fraction of data with no people involved: ', no_people.shape[0]/collisions.shape[0])

Fraction of data with no people involved:  0.08778965323268431


In [130]:
# Make dictionary with the descriptions and counts for each description
description_series = no_people['SDOT_COLDESC'].value_counts()
d = description_series.to_dict()
d

{'NOT ENOUGH INFORMATION / NOT APPLICABLE': 8060,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE': 4765,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END': 3963,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE SIDESWIPE': 894,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE AT ANGLE': 529,
 'MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT': 306,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE': 184,
 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE AT ANGLE': 168,
 'MOTOR VEHCILE STRUCK PEDESTRIAN': 159,
 'MOTOR VEHICLE STRUCK OBJECT IN ROAD': 148,
 'MOTOR VEHICLE STRUCK PEDALCYCLIST, FRONT END AT ANGLE': 67,
 'PEDALCYCLIST STRUCK MOTOR VEHICLE FRONT END AT ANGLE': 24,
 'MOTOR VEHICLE OVERTURNED IN ROAD': 22,
 'DRIVERLESS VEHICLE RAN OFF ROAD - HIT FIXED OBJECT': 10,
 'PEDALCYCLIST STRUCK MOTOR VEHICLE LEFT SIDE SIDESWIPE': 7,
 'DRIVERLESS VEHICLE STRUCK MOTOR VEHICLE REAR END': 7,
 'PEDALCYCLIST OVERTURNED IN ROAD': 6,
 'DRIVERLESS VEHICLE STRUCK MOTOR VEHICLE FRONT EN

In [131]:
# Replace the values in the no_people df and merge with the rest of the collisions set.
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning.
no_people = no_people.copy()

# Count keyword occurrences in each SDOT description (missing descriptions count as 0).
# Note: the data spells one description 'MOTOR VEHCILE STRUCK PEDESTRIAN', which the
# 'VEHICLE' substring does not match, so those rows keep a vehicle count of 0.
veh_count = no_people['SDOT_COLDESC'].apply(lambda x: x.count('VEHICLE') if isinstance(x, str) else 0)
ped_count = no_people['SDOT_COLDESC'].apply(lambda x: x.count('PEDESTRIAN') if isinstance(x, str) else 0)
cyclist_count = no_people['SDOT_COLDESC'].apply(lambda x: x.count('PEDALCYCLIST') if isinstance(x, str) else 0)

no_people['VEHCOUNT'] = veh_count
no_people['PEDCOUNT'] = ped_count
no_people['PEDCYLCOUNT'] = cyclist_count

# Merge the people and no_people dataframes
df = pd.concat([no_people, people])


### Dealing with NaNs and binary Y/N

In [118]:
# Assigning 0/1 to binary features
df = (
    pd.get_dummies(df, columns=['SPEEDING', 'INATTENTIONIND', 'HITPARKEDCAR', 'PEDROWNOTGRNT'])
    .drop(columns=['HITPARKEDCAR_N'])
)

In [141]:
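# Fixing alcohol influence: the column mixes 'Y'/'N' with 1/0
print(df['UNDERINFL'].value_counts())
# Map 'Y' and 1 to 1, and everything else (including missing values) to 0.
# Both the string '1' and the integer 1 are matched, since a mixed-type CSV column can be read either way.
df['UNDERINFL'] = df['UNDERINFL'].isin(['Y', '1', 1]).astype(int)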

N    103000
0     81676
Y      5398
1      4230
Name: UNDERINFL, dtype: int64


### Dealing with weather, road conditions, and light
Grouping into fewer categories
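
A minimal sketch of this kind of grouping, shown for weather only. The raw labels and the mapping below are assumptions for illustration; the actual categories in the dataset and the right way to bin them may differ.

```python
import pandas as pd

# Hypothetical raw weather labels; the real column's categories may differ.
weather = pd.Series(
    ['Clear', 'Raining', 'Overcast', 'Snowing', 'Unknown', None, 'Fog/Smog/Smoke']
)

# Assumed mapping from detailed labels to a few broad groups.
weather_groups = {
    'Clear': 'Dry',
    'Overcast': 'Dry',
    'Raining': 'Wet',
    'Snowing': 'Wet',
}

# Labels not in the mapping (including missing values) fall into 'Other/Unknown'.
weather_grouped = weather.map(weather_groups).fillna('Other/Unknown')
print(weather_grouped.value_counts())
```

The same pattern would apply to road and light conditions, each with its own mapping. Keeping the mapping in a dictionary makes the binning decision explicit and easy to revise.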

In [146]:
# Pickle the dataframe AND save as csv
df.to_pickle('../data/processed/cleaned_data.pkl')
df.to_csv('../data/processed/cleaned_data.csv')