### Washington State Crash Event Analysis
#### --- by 

In [1]:
import pandas as pd
import numpy as np
import regex as re

import requests
import asyncio
import json as js

import time

import os

pd.set_option('display.max_rows', 6)

#### Load Datasets

In [2]:
dir = os.path.abspath(os.path.dirname(os.getcwd())) + '/data/'   # <parent of the working directory>/data/

# `dir` already ends with '/', so the relative paths below do not start with one
df_data = pd.read_csv(dir + 'output/data_with_zipcode.csv').drop(axis=1, labels='Unnamed: 0')
df_data.event_zipcode = df_data.event_zipcode.astype(str)   # convert the default float type values into str (NaN becomes 'nan')
df_crashtype = pd.read_csv(dir + 'output/crash_type.csv').set_index(keys='type_index')

df_data.shape


(4132, 306)

##### Data Cleaning

- The following blocks drop rows without valid zipcodes (i.e. rows that are missing the driver zipcode or the accident zipcode).

In [3]:
# drop rows which do not have an event zipcode

has_no_zipcode = df_data.event_zipcode == 'nan'   # missing values were turned into the string 'nan' by astype(str)
df_data = df_data[~has_no_zipcode]
df_data.shape

(4132, 306)

In [4]:
# drop rows which do not have a person zipcode

df_data.dzip = df_data.dzip.map(
    lambda n: 0 if pd.isna(n) else int(n)   # missing values become 0 and are filtered out below
)
df_data = df_data[df_data.dzip > 10000]     # keep 5-digit values only; zipcodes with a leading zero are stored as smaller numbers, so they are dropped as well
df_data.dzip = df_data.dzip.astype(str)
df_data.shape

(4100, 306)
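
The numeric filter above discards every zipcode stored as a number below 10000, which includes real zipcodes that begin with 0. The block below is a minimal sketch of an alternative that keeps them by zero-padding. It assumes the raw values may arrive as floats, ints, strings, or missing values; `normalize_zip` is a hypothetical helper and is not part of this notebook.

```python
import pandas as pd


def normalize_zip(value):
    """Return a 5-character zipcode string, or None when the value is unusable.

    Hypothetical helper: the notebook itself filters on `dzip > 10000`.
    """
    if pd.isna(value):
        return None
    try:
        number = int(float(value))   # handles 98101, 98101.0 and '98101.0'
    except (TypeError, ValueError, OverflowError):
        return None
    if number <= 0 or number > 99999:
        return None
    return f"{number:05d}"           # zero-pad, so 2134 becomes '02134'


# Toy values, not taken from the crash dataset
raw = pd.Series([98101.0, 2134, None, 0, "98052", "n/a"])
cleaned = raw.map(normalize_zip)
print(cleaned.tolist())
```

Rows whose cleaned value is missing could then be dropped with `dropna`, and both zipcode columns would share one string format before they are compared.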

- The following block cleans the age column. <br/>
On inspection, we found invalid age values such as 999 or 998. These are replaced with the column mean, which is computed only from the values that remain after the abnormal ones are filtered out.

In [5]:
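# ages strictly between 0 and 100 are treated as valid
age_filter = filter(lambda v: v > 0 and v < 100, df_data.age)
age_mean = round(np.mean(list(age_filter), dtype=float), 0)

# replace everything outside that range (including 0) with the mean of the valid ages
df_data.age = df_data.age.map(
    lambda v : age_mean if v <= 0 or v >= 100 else v
)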
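
The same replacement can also be expressed with a boolean mask instead of `filter` and `map`. This is a sketch on toy ages, assuming, as the filter above does, that values strictly between 0 and 100 are valid; the variable names are illustrative only.

```python
import pandas as pd

# Toy ages, not taken from the crash dataset; 998 and 999 stand in for invalid codes
ages = pd.Series([25, 40, 999, 61, 998])

is_valid = (ages > 0) & (ages < 100)          # same bounds as the filter above
valid_mean = round(ages[is_valid].mean(), 0)  # mean of 25, 40 and 61, i.e. 42

cleaned = ages.where(is_valid, valid_mean)    # keep valid ages, replace the rest with 42
print(cleaned.tolist())
```

Using one mask for both the mean and the replacement keeps the two steps from drifting apart.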

In [6]:
df_data.ptype.value_counts()

1    4100
Name: ptype, dtype: int64

##### Among drivers involved in fatal crashes, what proportion are involved in crashes in communities where they live?
<br/>
- <b>Visualization note</b>: a bar chart or pie chart to show the proportion of resident and non-resident crash cases.

In [7]:
# a driver counts as a resident when the crash zipcode equals the driver's home zipcode
df_data['is_resident'] = df_data.event_zipcode == df_data.dzip

# ptype == 1 marks a driver
df_data['is_driver'] = df_data.ptype == 1

# share of drivers who crashed in their own zipcode
prop = (df_data.is_resident & df_data.is_driver).sum() / float(df_data.is_driver.sum())

print('{prop:.4f}% of the drivers are from the community where the accident occurred'.format(prop = prop * 100))

23.7805% of the drivers are from the community where the accident occurred


Based on our analysis, about 23.78% of the drivers crashed in the community where they live, with "community" measured by matching zipcodes.
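
The bar or pie chart from the visualization note only needs the share of each group. The sketch below computes those shares on toy rows; the column names follow this notebook, while the zipcodes and the `resident` / `non-resident` labels are made up for illustration.

```python
import pandas as pd

# Toy rows, not taken from the crash dataset; column names follow the notebook
toy = pd.DataFrame({
    "event_zipcode": ["98101", "98052", "98101", "98004"],
    "dzip":          ["98101", "98101", "98052", "98052"],
})
toy["is_resident"] = toy.event_zipcode == toy.dzip

# Share of resident vs. non-resident drivers, ready to feed a bar or pie chart
shares = (
    toy.is_resident
    .map({True: "resident", False: "non-resident"})
    .value_counts(normalize=True)
)
print(shares)
```

If matplotlib is installed, `shares.plot(kind="bar")` would be one way to draw the chart from this Series.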

##### Are there differences in the types of crashes and behavior factors in those crashes among “residents” versus those deemed to be not “from” the area?

In [8]:
df_data.crashtype.value_counts()

98    701
13    600
1     338
     ... 
70      1
71      1
26      1
Name: crashtype, Length: 58, dtype: int64

In [9]:
df_crashtype.head()     # this dataframe stores the meta info of the variable crashtype

| type_index | info | category |
| --- | --- | --- |
| 0 | No Impact | NOT CATEGORIZED |
| 1 | Drive Off Road | SINGLE DRIVER |
| 2 | Control/Traction Loss | SINGLE DRIVER |
| 3 | Avoid Collision with Vehicle, Pedestrian, Animal | SINGLE DRIVER |
| 4 | Specifics Other | SINGLE DRIVER |


- <b>Visualization note</b>: a grouped bar chart is needed to show how the crashtype categories are distributed across the resident and non-resident groups.

In [10]:
# maps a crashtype to its category
map_crashtype_category = dict(zip(df_crashtype.index, df_crashtype.category))

# maps a crashtype index to its actual meaning
map_crashtype_eng = dict(zip(df_crashtype.index, df_crashtype['info']))

df_data['crash_category'] = df_data.crashtype.map(map_crashtype_category)
df_data['crashtype_eng'] = df_data.crashtype.map(map_crashtype_eng)

df_data.to_csv(dir + 'output/data_vis.csv')
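
With `crash_category` and `is_resident` in place, the numbers behind a grouped bar chart can be obtained from a cross-tabulation. The sketch below uses toy rows rather than the crash data, and the `resident` / `non-resident` labels are illustrative; only the column names come from this notebook.

```python
import pandas as pd

# Toy rows, not taken from the crash dataset; column names follow the notebook
toy = pd.DataFrame({
    "crash_category": ["SINGLE DRIVER", "SINGLE DRIVER", "NOT CATEGORIZED",
                       "SINGLE DRIVER", "NOT CATEGORIZED"],
    "is_resident":    [True, True, False, False, False],
})
group = toy.is_resident.map({True: "resident", False: "non-resident"})

# Column-normalised counts: each column sums to 1, so the two groups can be
# compared even though they contain different numbers of drivers
table = pd.crosstab(toy.crash_category, group, normalize="columns")
print(table)
```

Normalising by column matters here because non-resident drivers far outnumber resident ones, so raw counts would mostly reflect group size.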

##### Analysis of Behavioral Factors

- The following columns are thought to indicate whether an involved person engaged in risky behavior during the crash event.
- - restraintmisuse: equals 1 when a restraint was misused
- - helmetmisuse: equals 1 when a helmet was misused
- - 
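
One way to compare these flags between residents and non-residents is to average each flag per group. This is only a sketch on toy rows: it assumes the two columns listed above hold 1 for misuse and 0 otherwise, and it is not the analysis carried out in this notebook.

```python
import pandas as pd

# Toy rows, not taken from the crash dataset. The two flag columns are the ones
# listed above; they are assumed to hold 1 for misuse and 0 otherwise.
toy = pd.DataFrame({
    "is_resident":     [True, True, False, False, False, False],
    "restraintmisuse": [1, 0, 0, 0, 1, 0],
    "helmetmisuse":    [0, 0, 0, 1, 1, 0],
})

flag_columns = ["restraintmisuse", "helmetmisuse"]

# The mean of a 0/1 flag is the share of rows in the group with the flag set
rates = toy.groupby("is_resident")[flag_columns].mean()
print(rates)
```

Each row of `rates` is one group (non-residents first, then residents) and each cell is the fraction of that group with the flag set.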

##### Predictive Analysis of Risky Drivers