# Final Project

### Imports and data loading for plots

In [8]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
import numpy as np
import seaborn as sns
from bokeh.plotting import figure, show
from bokeh.io import output_notebook, curdoc
from bokeh.models import (ColumnDataSource, FactorRange, Grid, HBar, LinearAxis, Plot,
                          LabelSet, Legend, HoverTool, Select, Panel, Tabs)
from bokeh.core.properties import value
from bokeh.transform import factor_cmap, dodge
from bokeh.palettes import Spectral10
from bokeh.layouts import column, row
import warnings
warnings.filterwarnings('ignore')

# select a palette
from bokeh.palettes import Spectral3
from bokeh.palettes import Category20b_13 as palette
from bokeh.palettes import Category20b_14 as palette2
# itertools handles the cycling
import itertools  


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
from sklearn import tree

sns.set(style='darkgrid', palette='muted', color_codes=True)



# Magic command useful for jupyter notebook
%matplotlib inline

# Set plot size. 
plt.rcParams['figure.figsize'] = [13, 6]

# Set font size
plt.rcParams.update({'font.size': 22})

In [6]:
df_crash = pd.read_csv('data/Motor_Vehicle_Collisions_-_Crashes.csv')
df_vehicle = pd.read_csv('data/Motor_Vehicle_Collisions_-_Vehicles.csv')
df_people = pd.read_csv('data/Motor_Vehicle_Collisions_-_Person.csv')

## 1 - Motivation

In this project three datasets were used. The main dataset was a detailed table of vehicle crash incidents in New York from 2012 to 2019. Each incident had a unique collision ID, on which the two other datasets were keyed. These datasets recorded the people and vehicles involved in the crashes of the first dataset.

These datasets were chosen because of their large number of variables. They had the core variables that are essential to this sort of analysis, such as time of crash, date of crash and location of crash (GPS coordinates and borough). In addition there were several interesting variables, such as the street name of the crash, the number of persons killed/injured and the contributing factor of the crash.
The challenge with such a dataset is not finding parameters to analyse, but rather carefully selecting a few of the many possibilities it offers.

The main goal was to provide the user with easy-to-understand visualizations and, in some cases, tools for interactively selecting which data to display, to encourage user engagement.


## 2 - Basic stats

### Raw dataset stats

To best explain how the preprocessing and cleaning was done, an overview of the initial raw dataset is given in this section. 

**Crashes:** 
- 1.67 M rows
- 29 columns
- 362 MB

**Vehicles:** 
- 3.35 M rows
- 25 columns
- 551 MB

**People:** 
- 3.91 M rows
- 21 columns
- 624 MB 

### Cleaning and preprocessing

Initially there were many NaN values in the datasets, as seen below. Dropping every row containing a NaN would have discarded a large share of the data, so instead only the columns suitable for analysis were extracted from the three datasets. Many columns had over a million NaN values and were thus not appropriate for analysis. Important parameters such as crash time, crash date and location (latitude/longitude/borough) had most of their values.

In [7]:
df_crash.isnull().sum()

CRASH DATE                             0
CRASH TIME                             0
BOROUGH                           509839
ZIP CODE                          510046
LATITUDE                          201721
LONGITUDE                         201721
LOCATION                          201721
ON STREET NAME                    330899
CROSS STREET NAME                 570911
OFF STREET NAME                  1433844
NUMBER OF PERSONS INJURED             17
NUMBER OF PERSONS KILLED              31
NUMBER OF PEDESTRIANS INJURED          0
NUMBER OF PEDESTRIANS KILLED           0
NUMBER OF CYCLIST INJURED              0
NUMBER OF CYCLIST KILLED               0
NUMBER OF MOTORIST INJURED             0
NUMBER OF MOTORIST KILLED              0
CONTRIBUTING FACTOR VEHICLE 1       4518
CONTRIBUTING FACTOR VEHICLE 2     227813
CONTRIBUTING FACTOR VEHICLE 3    1563865
CONTRIBUTING FACTOR VEHICLE 4    1649657
CONTRIBUTING FACTOR VEHICLE 5    1666553
COLLISION_ID                           0
VEHICLE TYPE COD

In [9]:
df_vehicle.isnull().sum()

UNIQUE_ID                            0
COLLISION_ID                         0
CRASH_DATE                           0
CRASH_TIME                           0
VEHICLE_ID                           0
STATE_REGISTRATION              152152
VEHICLE_TYPE                    132033
VEHICLE_MAKE                   1713629
VEHICLE_MODEL                  3294186
VEHICLE_YEAR                   1720659
TRAVEL_DIRECTION               1607383
VEHICLE_OCCUPANTS              1668167
DRIVER_SEX                     1917770
DRIVER_LICENSE_STATUS          1971646
DRIVER_LICENSE_JURISDICTION    1961964
PRE_CRASH                       850587
POINT_OF_IMPACT                1628802
VEHICLE_DAMAGE                 1640673
VEHICLE_DAMAGE_1               2277825
VEHICLE_DAMAGE_2               2557501
VEHICLE_DAMAGE_3               2745625
PUBLIC_PROPERTY_DAMAGE         1528863
PUBLIC_PROPERTY_DAMAGE_TYPE    3331696
CONTRIBUTING_FACTOR_1            92818
CONTRIBUTING_FACTOR_2          1620959
dtype: int64

In [8]:
df_people.isnull().sum()

UNIQUE_ID                      0
COLLISION_ID                   0
CRASH_DATE                     0
CRASH_TIME                     0
PERSON_ID                     19
PERSON_TYPE                    0
PERSON_INJURY                  0
VEHICLE_ID                151782
PERSON_AGE                294988
EJECTION                 1911145
EMOTIONAL_STATUS         1864739
BODILY_INJURY            1864696
POSITION_IN_VEHICLE      1910875
SAFETY_EQUIPMENT         1910925
PED_LOCATION             3858762
PED_ACTION               3858863
COMPLAINT                1864689
PED_ROLE                  194895
CONTRIBUTING_FACTOR_1    3859973
CONTRIBUTING_FACTOR_2    3860035
PERSON_SEX                468460
dtype: int64
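
The NaN counts above can also be turned into a per-column share of missing values and compared against a cutoff. The following is a minimal sketch, assuming a hypothetical helper `columns_below_nan_share` and an illustrative 50% cutoff; neither comes from the notebook, where the columns were picked by hand.

```python
import numpy as np
import pandas as pd


def columns_below_nan_share(df, max_share=0.5):
    """Return the names of columns whose share of NaN values is at most max_share."""
    nan_share = df.isnull().mean()  # fraction of missing values per column
    return nan_share[nan_share <= max_share].index.tolist()


# Toy frame standing in for one of the collision tables (values are made up)
toy = pd.DataFrame({
    'COLLISION_ID': [1, 2, 3, 4],
    'BOROUGH': ['BRONX', np.nan, 'QUEENS', 'BROOKLYN'],
    'OFF STREET NAME': [np.nan, np.nan, np.nan, 'MAIN ST'],
})
kept = columns_below_nan_share(toy, max_share=0.5)
```

A cutoff like this only suggests candidates; a column such as BOROUGH may still be worth keeping despite its gaps because it is central to the analysis.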

After selecting the columns used for analysis, the datasets had the following columns and NaN values:

In [10]:
df_crash = df_crash[['CRASH DATE','CRASH TIME','BOROUGH', 'LATITUDE', 'LONGITUDE',\
         'ON STREET NAME', 'CROSS STREET NAME', 'NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED',\
          'NUMBER OF PEDESTRIANS INJURED','NUMBER OF PEDESTRIANS KILLED','NUMBER OF CYCLIST INJURED','NUMBER OF CYCLIST KILLED',\
          'NUMBER OF MOTORIST INJURED', 'NUMBER OF MOTORIST KILLED','CONTRIBUTING FACTOR VEHICLE 1',\
         'CONTRIBUTING FACTOR VEHICLE 2','COLLISION_ID','VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2']]
df_vehicle = df_vehicle[['UNIQUE_ID','COLLISION_ID','CRASH_DATE','CRASH_TIME','VEHICLE_ID',\
                        'VEHICLE_TYPE','VEHICLE_YEAR','DRIVER_SEX','PRE_CRASH','POINT_OF_IMPACT','VEHICLE_DAMAGE']]
df_people = df_people.drop(columns = ['PED_LOCATION','PED_ACTION','CONTRIBUTING_FACTOR_1','CONTRIBUTING_FACTOR_2',\
                                     'EJECTION','EMOTIONAL_STATUS','BODILY_INJURY','POSITION_IN_VEHICLE',\
                                     'SAFETY_EQUIPMENT','COMPLAINT'])

In [11]:
df_crash.isnull().sum()

CRASH DATE                            0
CRASH TIME                            0
BOROUGH                          509839
LATITUDE                         201721
LONGITUDE                        201721
ON STREET NAME                   330899
CROSS STREET NAME                570911
NUMBER OF PERSONS INJURED            17
NUMBER OF PERSONS KILLED             31
NUMBER OF PEDESTRIANS INJURED         0
NUMBER OF PEDESTRIANS KILLED          0
NUMBER OF CYCLIST INJURED             0
NUMBER OF CYCLIST KILLED              0
NUMBER OF MOTORIST INJURED            0
NUMBER OF MOTORIST KILLED             0
CONTRIBUTING FACTOR VEHICLE 1      4518
CONTRIBUTING FACTOR VEHICLE 2    227813
COLLISION_ID                          0
VEHICLE TYPE CODE 1                5944
VEHICLE TYPE CODE 2              280627
dtype: int64

In [12]:
df_vehicle.isnull().sum()

UNIQUE_ID                0
COLLISION_ID             0
CRASH_DATE               0
CRASH_TIME               0
VEHICLE_ID               0
VEHICLE_TYPE        132033
VEHICLE_YEAR       1720659
DRIVER_SEX         1917770
PRE_CRASH           850587
POINT_OF_IMPACT    1628802
VEHICLE_DAMAGE     1640673
dtype: int64

In [13]:
df_people.isnull().sum()

UNIQUE_ID             0
COLLISION_ID          0
CRASH_DATE            0
CRASH_TIME            0
PERSON_ID            19
PERSON_TYPE           0
PERSON_INJURY         0
VEHICLE_ID       151782
PERSON_AGE       294988
PED_ROLE         194895
PERSON_SEX       468460
dtype: int64

The next step was to decide what to do with the rows containing NaN values. To remove as little data as possible, it was decided to fill the missing values with 'Unspecified'. This way, a row still counts in any analysis of the columns where it has values. When a column that originally contained NaN values is analyzed, the 'Unspecified' label can be used either to filter those rows out at that point or to include them as their own category. After this step all columns had zero NaN values and could be used for analysis.
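
The fill step is not shown in the notebook. A minimal sketch of how it could look with pandas `fillna`, using a made-up toy frame, is:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of the columns discussed above (values are made up)
toy = pd.DataFrame({
    'BOROUGH': ['BRONX', np.nan, 'QUEENS'],
    'CONTRIBUTING FACTOR VEHICLE 1': [np.nan, 'Unsafe Speed', np.nan],
})

# Replace every missing value with the placeholder label
filled = toy.fillna('Unspecified')

# The placeholder can still be filtered out later, one column at a time
known_borough = filled[filled['BOROUGH'] != 'Unspecified']
```

Filling a numeric column with a string turns it into an object column, which is why a column such as PERSON_AGE has to go through `pd.to_numeric` again before any arithmetic.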

**Other considerations**

After the datasets had been cleaned of NaN values, we had to consider the 'bad' values in the dataset. These include numbers that were entered incorrectly when recorded and invalid strings. An example was a vehicle year recorded as 1100, which was deemed highly unlikely.

The list of bad values included:

- Unlikely person ages.
- Misspellings of vehicle type (creating several groups of the same data in a groupby).
- Unlikely vehicle years.
- Incorrect latitudes/longitudes.

As the number of bad values was high for each column and in some cases difficult to identify, it was decided to handle them when appropriate, as visualizations often revealed the bad values.
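
Handling bad values "when appropriate" usually comes down to a plausibility mask applied just before a plot. The sketch below assumes a hypothetical helper `within` and illustrative limits; only the age limits (0 to 100) match a filter used later in this notebook, while the year limits are invented for the example.

```python
import pandas as pd

# Plausibility limits: the age range mirrors the later driver filter,
# the year range is a hypothetical choice made only for this sketch
AGE_RANGE = (0, 100)
YEAR_RANGE = (1900, 2020)


def within(series, low, high):
    """Boolean mask that is True where a numeric value lies strictly between low and high."""
    values = pd.to_numeric(series, errors='coerce')  # non-numeric entries become NaN
    return (values > low) & (values < high)  # comparisons with NaN are False


toy = pd.DataFrame({'PERSON_AGE': [34, -1, 'Unspecified', 250],
                    'VEHICLE_YEAR': [2015, 1100, 2008, 2012]})
mask = within(toy['PERSON_AGE'], *AGE_RANGE) & within(toy['VEHICLE_YEAR'], *YEAR_RANGE)
plausible = toy[mask]
```

Because the mask is computed on the fly, the underlying table stays untouched and each visualization can choose its own limits.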

### Final dataset stats

After all the cleaning was done the final datasets had the following stats: 

**Crashes:** 
- 1.67 M rows
- 20 columns

**Vehicles:** 
- 3.35 M rows
- 12 columns


**People:** 
- 3.91 M rows
- 11 columns


## 3 - Data analysis

Now that the data has been cleaned, we import the new data frames, convert the crash dates and times to pandas datetime values, and use string replacement to fix some of the wrongly entered values in the 'VEHICLE TYPE CODE 1' column.

In [3]:
df_crash = pd.read_csv('crash_clean.csv')
df_vehicle = pd.read_csv('vehicle_clean.csv')
df_persons = pd.read_csv('people_clean.csv')

df_crash['CRASH DATE'] = pd.to_datetime(df_crash['CRASH DATE'],errors='coerce')
df_crash['CRASH TIME'] = pd.to_datetime(df_crash['CRASH TIME'],errors='coerce')

df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('TAXI','Taxi')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('SPORT UTILITY / STATION WAGON','Station Wagon')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('Station Wagon/Sport Utility Vehicle','Station Wagon')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('4 dr sedan','Sedan')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('2 dr sedan','Sedan')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('VAN','Van')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('van','Van')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('VN','Van')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('MOTORCYCLE','Motorcycle')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('Motorbike','Motorcycle')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('ambul','Ambulance')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('Ambul','Ambulance')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('AMBUL','Ambulance')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('AmbulanceANCE','Ambulance')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('Ambulanceance','Ambulance')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('AM','Ambulance')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('Fire','Firetruck')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('FIRE','Firetruck')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('FIRE TRUCK','Firetruck') 
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('Firetruck TRUCK','Firetruck')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('BUS','Bus')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('BU','Bus')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('BICYCLE','Bicycle')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('Bike','Bicycle')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('PICK-UP TRUCK','Pick-up Truck')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('TK','Pick-up Truck')
df_crash['VEHICLE TYPE CODE 1'] = df_crash['VEHICLE TYPE CODE 1'].str.replace('LIVERY VEHICLE','Livery Vehicle')
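
The chain above replaces substrings, so a short pattern such as 'AM' or 'BU' also rewrites any longer label that happens to contain it, and some replacements need follow-up fixes (for example 'Ambulanceance'). An alternative is to map whole labels through a lookup table. This is a sketch, not the notebook's method; the table `VEHICLE_ALIASES` and the helper `normalize_vehicle_type` are hypothetical and cover only some of the labels.

```python
import pandas as pd

# Hypothetical lookup table: keys are lower-cased raw labels, values are canonical labels
VEHICLE_ALIASES = {
    'taxi': 'Taxi',
    'sport utility / station wagon': 'Station Wagon',
    'station wagon/sport utility vehicle': 'Station Wagon',
    '4 dr sedan': 'Sedan',
    '2 dr sedan': 'Sedan',
    'van': 'Van',
    'vn': 'Van',
    'motorcycle': 'Motorcycle',
    'motorbike': 'Motorcycle',
    'ambul': 'Ambulance',
    'ambulance': 'Ambulance',
    'fire truck': 'Firetruck',
    'bus': 'Bus',
    'bicycle': 'Bicycle',
    'bike': 'Bicycle',
}


def normalize_vehicle_type(series):
    """Map whole labels (case-insensitively) to canonical names; unknown labels are kept."""
    key = series.str.strip().str.lower()
    return key.map(VEHICLE_ALIASES).fillna(series)


raw = pd.Series(['TAXI', 'ambulance', 'AMBUL', 'Tow Truck', '4 dr sedan'])
clean = normalize_vehicle_type(raw)
```

Whole-label matching is stricter than substring replacement, so the two approaches will not produce identical groups on the real data; the table would have to be extended until the remaining labels look right.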

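Now we take the 15 most frequent causes of crashes and the 15 most frequent vehicle types, skipping the single most frequent label in each column (the slice [1:16]) and removing a few generic labels, to create focus sets. These sets are then used in the visualizations.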

In [4]:
focusviolations = set(df_crash.groupby(df_crash['CONTRIBUTING FACTOR VEHICLE 1']).size().sort_values(ascending=False).index.to_list()[1:16])
focusviolations.remove('Other Vehicular')

focusvehicles = set(df_crash.groupby(df_crash['VEHICLE TYPE CODE 1']).size().sort_values(ascending=False).index.to_list()[1:16])
focusvehicles.remove('OTHER')
focusvehicles.remove('UNKNOWN')
focusvehicles.remove('LARGE COM VEH(6 OR MORE TIRES)')
focusvehicles.remove('SMALL COM VEH(4 TIRES) ')
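
The same selection can be wrapped in a small function so that both focus sets are built the same way. This is a sketch under the assumption that a set difference is acceptable in place of `remove`; the helper name `top_labels` is hypothetical.

```python
import pandas as pd


def top_labels(series, n=15, skip_first=1, exclude=()):
    """Return the n most frequent labels after skipping the first `skip_first`, minus `exclude`."""
    ranked = series.value_counts().index.tolist()  # sorted by descending frequency
    selected = set(ranked[skip_first:skip_first + n])
    return selected.difference(exclude)  # no KeyError if a label is absent


# Toy column of contributing factors (counts are made up)
toy = pd.Series(['Unspecified'] * 5 + ['Unsafe Speed'] * 3
                + ['Other Vehicular'] * 2 + ['Alcohol Involvement'])
focus = top_labels(toy, n=15, skip_first=1, exclude={'Other Vehicular', 'OTHER'})
```

Unlike `set.remove`, the set difference does not fail when a generic label such as 'OTHER' is missing from the top of the ranking, which makes the cell robust to changes in the cleaning step.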

### Bokeh plots of contributing factors and vehicle types

Building on the exploratory data analysis, we now use Bokeh to create three plots for each of two views of the crashes (contributing factors and vehicle types), with interactive legends that let the user explore what we found interesting in the focus sets. The plots show crashes by hour of the day, day of the week and month of the year respectively, so that the user can check some of the preconceived notions we have about why people crash. Furthermore, we want to visualize the variables that are used in the machine learning models further down the line. If we can find a correlation between, e.g., the time of day and the number of crashes due to alcohol involvement, then that variable would be important in the given ML model.

In [6]:
#output_file("Crashes_hours_weeks_months.html",mode = 'inline')
# Creating dataframes with relevant variables
df_hourlycrash = df_crash[['CONTRIBUTING FACTOR VEHICLE 1','CRASH TIME','COLLISION_ID']]
df_hourlycrash_vehicle = df_crash[['VEHICLE TYPE CODE 1','CRASH TIME','COLLISION_ID']]
df_hourlycrash['CRASH TIME'] = df_crash['CRASH TIME'].dt.hour
df_hourlycrash_vehicle['CRASH TIME'] = df_crash['CRASH TIME'].dt.hour

# Pivoting the dataframe for both types
df_hourlycrash = pd.pivot_table(df_hourlycrash,values = 'COLLISION_ID',index = ['CRASH TIME'],columns = ['CONTRIBUTING FACTOR VEHICLE 1'],aggfunc = 'count').fillna(0)
df_hourlycrash_vehicle = pd.pivot_table(df_hourlycrash_vehicle,values = 'COLLISION_ID',index = ['CRASH TIME'],columns = ['VEHICLE TYPE CODE 1'],aggfunc = 'count').fillna(0)

# Normalizing the dataframes by total number of crashes
Total = df_crash.groupby('CONTRIBUTING FACTOR VEHICLE 1').size()
Total_vehicle = df_crash.groupby('VEHICLE TYPE CODE 1').size()
df_hourlycrash = df_hourlycrash.div(Total,axis=1)
df_hourlycrash_vehicle = df_hourlycrash_vehicle.div(Total_vehicle,axis=1)

# Creating the data structures for Bokeh 
source1 = ColumnDataSource(df_hourlycrash)
source_vehicle1 = ColumnDataSource(df_hourlycrash_vehicle)
hours = [str(elem+1) for elem in df_hourlycrash.index.to_list()]

# Same procedure for weekly crashes
df_weeklycrash = df_crash[['CONTRIBUTING FACTOR VEHICLE 1','CRASH DATE','COLLISION_ID']]
df_weeklycrash_vehicle = df_crash[['VEHICLE TYPE CODE 1','CRASH DATE','COLLISION_ID']]
df_weeklycrash['CRASH DATE'] = df_crash['CRASH DATE'].dt.dayofweek
df_weeklycrash_vehicle['CRASH DATE'] = df_crash['CRASH DATE'].dt.dayofweek
df_weeklycrash['CRASH DATE'] = df_weeklycrash['CRASH DATE'].map({0:'Monday',1:'Tuesday',2:'Wednesday',3:'Thursday' ,4:'Friday' ,5:'Saturday' ,6:'Sunday'})
df_weeklycrash_vehicle['CRASH DATE'] = df_weeklycrash_vehicle['CRASH DATE'].map({0:'Monday',1:'Tuesday',2:'Wednesday',3:'Thursday' ,4:'Friday' ,5:'Saturday' ,6:'Sunday'})
df_weeklycrash = pd.pivot_table(df_weeklycrash,values = 'COLLISION_ID',index = ['CRASH DATE'],columns = ['CONTRIBUTING FACTOR VEHICLE 1'],aggfunc = 'count').fillna(0)
df_weeklycrash_vehicle = pd.pivot_table(df_weeklycrash_vehicle,values = 'COLLISION_ID',index = ['CRASH DATE'],columns = ['VEHICLE TYPE CODE 1'],aggfunc = 'count').fillna(0)

Total = df_crash.groupby('CONTRIBUTING FACTOR VEHICLE 1').size()
Total_vehicle = df_crash.groupby('VEHICLE TYPE CODE 1').size()
df_weeklycrash = df_weeklycrash.div(Total,axis=1)
df_weeklycrash_vehicle = df_weeklycrash_vehicle.div(Total_vehicle,axis=1)

source2 = ColumnDataSource(df_weeklycrash)
source_vehicle2 = ColumnDataSource(df_weeklycrash_vehicle)
DaysOfWeek = df_weeklycrash.index.tolist()
correct_order = [1,5,6,4,0,2,3]
DaysOfWeek = [DaysOfWeek[i] for i in correct_order]

# Same procedure for monthly
df_monthlycrash = df_crash[['CONTRIBUTING FACTOR VEHICLE 1','CRASH DATE','COLLISION_ID']]
df_monthlycrash_vehicle = df_crash[['VEHICLE TYPE CODE 1','CRASH DATE','COLLISION_ID']]
df_monthlycrash['CRASH DATE'] = df_crash['CRASH DATE'].dt.month
df_monthlycrash_vehicle['CRASH DATE'] = df_crash['CRASH DATE'].dt.month
df_monthlycrash['CRASH DATE'] = df_monthlycrash['CRASH DATE'].map({1:'January',2:'February',3:'March',4:'April' ,5:'May' ,6:'June' ,7:'July',8:'August',9:'September',10:'October',11:'November',12:'December'})
df_monthlycrash_vehicle['CRASH DATE'] = df_monthlycrash_vehicle['CRASH DATE'].map({1:'January',2:'February',3:'March',4:'April' ,5:'May' ,6:'June' ,7:'July',8:'August',9:'September',10:'October',11:'November',12:'December'})


df_monthlycrash = pd.pivot_table(df_monthlycrash,values = 'COLLISION_ID',index = ['CRASH DATE'],columns = ['CONTRIBUTING FACTOR VEHICLE 1'],aggfunc = 'count').fillna(0)
df_monthlycrash_vehicle = pd.pivot_table(df_monthlycrash_vehicle,values = 'COLLISION_ID',index = ['CRASH DATE'],columns = ['VEHICLE TYPE CODE 1'],aggfunc = 'count').fillna(0)

Total = df_crash.groupby('CONTRIBUTING FACTOR VEHICLE 1').size()
Total_vehicle = df_crash.groupby('VEHICLE TYPE CODE 1').size()
df_monthlycrash = df_monthlycrash.div(Total,axis=1)
df_monthlycrash_vehicle = df_monthlycrash_vehicle.div(Total_vehicle,axis=1)

source3 = ColumnDataSource(df_monthlycrash)
source_vehicle3 = ColumnDataSource(df_monthlycrash_vehicle)
Months = df_monthlycrash_vehicle.index.tolist()
correct_order = [4,3,7,0,8,6,5,1,11,10,9,2]
Months = [Months[i] for i in correct_order]
color = itertools.cycle(palette)

output_notebook()
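
The cell above repeats the same pivot and normalization six times. The pattern can be condensed into one helper, shown here as a sketch: `normalized_counts` and `DAY_NAMES` are hypothetical names, and dividing by the column sums assumes that no crash lost its time key during date parsing (otherwise the notebook's totals from `groupby(...).size()` would differ slightly).

```python
import pandas as pd

DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']


def normalized_counts(df, time_key, category_col, order=None):
    """Count rows per (time_key, category) and divide each column by its category total."""
    counts = pd.crosstab(time_key, df[category_col])
    shares = counts.div(counts.sum(axis=0), axis=1)  # every column now sums to 1
    return shares.reindex(order) if order is not None else shares


# Toy crashes: 2019-01-07 is a Monday, 2019-01-08 a Tuesday, 2019-01-13 a Sunday
toy = pd.DataFrame({
    'CRASH DATE': pd.to_datetime(['2019-01-07', '2019-01-07', '2019-01-08', '2019-01-13']),
    'CONTRIBUTING FACTOR VEHICLE 1': ['Unsafe Speed', 'Alcohol Involvement',
                                      'Unsafe Speed', 'Alcohol Involvement'],
})
weekday = toy['CRASH DATE'].dt.day_name()
present_days = [d for d in DAY_NAMES if d in set(weekday)]  # calendar order, not alphabetical
weekly = normalized_counts(toy, weekday, 'CONTRIBUTING FACTOR VEHICLE 1', order=present_days)
```

Reindexing by an explicit list of day names replaces the hand-written `correct_order` index lists, which depend on the alphabetical order of the pivot index.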

In [9]:
# Creating the bokeh figures
p1 = figure(x_range = FactorRange(factors=hours), plot_height=500,plot_width=900, title="Crashes per hour")
bar ={} # to store vbars
items = []
for indx,i in enumerate(zip(focusviolations,color)):
    bar[i[0]] = p1.vbar(x='CRASH TIME',  top=i[0], source= source1,color=i[1], width = 0.5,  muted_alpha=False, muted = True)
    items.append((i[0],[bar[i[0]]]))
p1.xaxis.axis_label = 'Hour of the day'
p1.yaxis.axis_label = 'Normalized values'
legend1 = Legend(items=items,click_policy = 'mute', location=(0, 20))
p1.add_layout(legend1, 'left')
tab1 = Panel(child=p1, title="Violations by hours")


p2 = figure(x_range = FactorRange(factors=hours), plot_height=500,plot_width=900, title="Crashes per hour")
bar ={} # to store vbars
items = []
for indx,i in enumerate(zip(focusvehicles,color)):
    bar[i[0]] = p2.vbar(x='CRASH TIME',  top=i[0], source= source_vehicle1,color=i[1], width = 0.5,  muted_alpha=False, muted = True)
    items.append((i[0],[bar[i[0]]]))
p2.xaxis.axis_label = 'Hour of the day'
p2.yaxis.axis_label = 'Normalized values'
legend2 = Legend(items=items,click_policy = 'mute', location=(0, 20))
p2.add_layout(legend2, 'right')
tab2 = Panel(child=p2, title="Vehicle Types by hours")


p3 = figure(x_range = FactorRange(factors=DaysOfWeek), plot_height=500,plot_width=900, title="Crashes each day")
bar ={} # to store vbars
items = []
for indx,i in enumerate(zip(focusviolations,color)):
    bar[i[0]] = p3.vbar(x='CRASH DATE',  top=i[0], source= source2,color=i[1], width = 0.5,  muted_alpha=False, muted = True)
    items.append((i[0],[bar[i[0]]]))
p3.xaxis.axis_label = 'Day of the week'
p3.yaxis.axis_label = 'Normalized values'
legend3 = Legend(items=items,click_policy = 'mute', location=(0, 20))
p3.add_layout(legend3, 'left')
tab3 = Panel(child=p3, title="Violations by week")


p4 = figure(x_range = FactorRange(factors=DaysOfWeek), plot_height=500,plot_width=900, title="Crashes each day")
bar ={} # to store vbars
items = []
for indx,i in enumerate(zip(focusvehicles,color)):
    bar[i[0]] = p4.vbar(x='CRASH DATE',  top=i[0], source= source_vehicle2,color=i[1], width = 0.5,  muted_alpha=False, muted = True)
    items.append((i[0],[bar[i[0]]]))
p4.xaxis.axis_label = 'Day of the week'
p4.yaxis.axis_label = 'Normalized values'
legend4 = Legend(items=items,click_policy = 'mute', location=(0, 20))
p4.add_layout(legend4, 'right')
tab4 = Panel(child=p4, title="Vehicle Types by week")


p5 = figure(x_range = FactorRange(factors=Months), plot_height=500,plot_width=900, title="Crashes each month")
bar ={} # to store vbars
items = []
for indx,i in enumerate(zip(focusviolations,color)):
    bar[i[0]] = p5.vbar(x='CRASH DATE',  top=i[0], source= source3,color=i[1], width = 0.5,  muted_alpha=False, muted = True)
    items.append((i[0],[bar[i[0]]]))
p5.xaxis.axis_label = 'Months'
p5.yaxis.axis_label = 'Normalized values'
legend5 = Legend(items=items,click_policy = 'mute', location=(0, 20))
p5.add_layout(legend5, 'left')
p5.xaxis.major_label_orientation = np.pi/4
tab5 = Panel(child=p5, title="Violations by month")


p6 = figure(x_range = FactorRange(factors=Months), plot_height=500,plot_width=900, title="Crashes each month")
bar ={} # to store vbars
items = []
for indx,i in enumerate(zip(focusvehicles,color)):
    bar[i[0]] = p6.vbar(x='CRASH DATE',  top=i[0], source= source_vehicle3,color=i[1], width = 0.5,  muted_alpha=False, muted = True)
    items.append((i[0],[bar[i[0]]]))
p6.xaxis.axis_label = 'Months'
p6.yaxis.axis_label = 'Normalized values'
legend6 = Legend(items=items,click_policy = 'mute', location=(0, 20))
p6.add_layout(legend6, 'right')
p6.xaxis.major_label_orientation = np.pi/4
tab6 = Panel(child=p6, title="Vehicle Types by month")

In [10]:
# showing each tab in 1 plot with 2 rows
tabs1 = Tabs(tabs=[ tab1, tab3, tab5 ])
tabs2 = Tabs(tabs=[ tab2, tab4, tab6 ])

show(column(tabs1, tabs2))

As seen in the two plots above, the user has free rein over which contributing factor they want to look at across the three time scales. The same goes for the vehicle types.
Some of the interesting points to note are in the hour-of-day plots for both contributing factors and vehicle types. Here the rush-hour traffic pattern appears: the number of crashes rises at about 8-9 am, then drops until it rises again around 3-4 pm. As we would expect, alcohol involvement inverts this pattern, with most crashes happening at night or in the early hours of the morning.
Another fun observation is that the number of crashes involving a bicyclist rises during the summer months, as we would expect given the better weather.

### Stacked Bokeh plot with crashes as a function of age distributed across males and females

Another important feature found in the data is the age and gender of the drivers and passengers in the crashes. A lot of preprocessing is done to select specifically the drivers in each crash, and then to find their ages and genders from the data. As of now, this is done by grouping by the collision ID of the crash, and finding the minimum and maximum age of the people in the crash, and the first and last appearance of their genders. This is perhaps a somewhat crude way of handling the data preprocessing step, but it gets the job done. It begins, however, to fall apart when there are more than two drivers in a crash (which will realistically happen) or when we look at solo accidents.

In [11]:
df_persons = df_persons[~df_persons['PERSON_AGE'].isin(['Unspecified'])]
df_persons = df_persons[~df_persons['PERSON_SEX'].isin(['Unspecified'])]
df_persons["PERSON_AGE"] = pd.to_numeric(df_persons["PERSON_AGE"], downcast="float")
# keep plausible ages only (0 < age < 100)
df_persons = df_persons.loc[(df_persons['PERSON_AGE'] > 0.0) & (df_persons['PERSON_AGE'] < 100.0)]
df_persons = df_persons.loc[(df_persons['PED_ROLE'] == 'Driver')]

df_sort = df_persons.groupby(['COLLISION_ID','PERSON_TYPE'],as_index=False).agg({'PERSON_AGE': ['min', 'max'],'PERSON_SEX': ['first','last']})

df_sort = df_sort[df_sort['COLLISION_ID'].notnull() == True].set_index('COLLISION_ID')
df_crash2 = df_crash[df_crash['COLLISION_ID'].notnull() == True].set_index('COLLISION_ID')

df_merged = pd.merge(df_crash2,df_sort, how='inner', left_index=True, right_index=True)
df_merged = df_merged.rename(columns = {df_merged.columns[-1] : 'PERSON_SEX2'})
df_merged = df_merged.rename(columns = {df_merged.columns[-2] : 'PERSON_SEX1'})
df_merged = df_merged.rename(columns = {df_merged.columns[-3] : 'PERSON_AGE2'})
df_merged = df_merged.rename(columns = {df_merged.columns[-4] : 'PERSON_AGE1'})
df_merged = df_merged.rename(columns = {df_merged.columns[-5] : 'PERSON_TYPE'})
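
Renaming columns by position is fragile if the column order ever changes. Named aggregation produces flat, explicitly named columns in one step. The sketch below uses toy data and, as a simplification, groups by COLLISION_ID only (the cell above also groups by PERSON_TYPE).

```python
import pandas as pd

# Toy driver records: collision 1 has two drivers, collision 2 has one
persons = pd.DataFrame({
    'COLLISION_ID': [1, 1, 2],
    'PERSON_AGE': [23.0, 51.0, 37.0],
    'PERSON_SEX': ['M', 'F', 'F'],
})

# Named aggregation: output column name = (input column, aggregation)
drivers = persons.groupby('COLLISION_ID').agg(
    PERSON_AGE1=('PERSON_AGE', 'min'),
    PERSON_AGE2=('PERSON_AGE', 'max'),
    PERSON_SEX1=('PERSON_SEX', 'first'),
    PERSON_SEX2=('PERSON_SEX', 'last'),
)

# Inner join on the collision ID keeps only crashes that have driver records
crashes = pd.DataFrame({'COLLISION_ID': [1, 2, 3],
                        'BOROUGH': ['BRONX', 'QUEENS', 'BROOKLYN']}).set_index('COLLISION_ID')
merged = crashes.join(drivers, how='inner')
```

The sketch shares the limitation described in the text: the minimum age and the first recorded sex need not belong to the same person, so PERSON_AGE1 and PERSON_SEX1 are only loosely paired.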

Now that we have found a way to get the age and gender into the equation, we can visualize it! We group the ages into age groups of 5 years, because then we can normalize by the actual number of people in each age group, instead of by the total number of crashes. This gives us a more realistic feel for how many crashes young people are involved in (looking at you 20-24 males!).
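
The grouping and the per-1000 normalization can be sketched with `pd.cut`. The sketch assumes left-closed bins, so that a label such as '5-9' covers ages 5 through 9; the cell below instead uses right-closed intervals and the label '0-5' for the first group. The crash and population figures are made up.

```python
import pandas as pd

# Left-closed 5-year bins: [0, 5), [5, 10), ..., [80, 85), [85, inf)
edges = list(range(0, 90, 5)) + [float('inf')]
labels = [f'{lo}-{lo + 4}' for lo in range(0, 85, 5)] + ['85 and over']

ages = pd.Series([3, 5, 24, 25, 90])
groups = pd.cut(ages, bins=edges, labels=labels, right=False)

# Crashes per 1000 residents for one group, using made-up numbers
crashes_in_group = 1200
residents_in_group = 300_000
rate_per_1000 = crashes_in_group / (residents_in_group / 1000)
```

Dividing by the group's population rather than by the total number of crashes is what makes groups of different sizes comparable.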


In [12]:
df_pre1 = df_merged[['PERSON_AGE1','PERSON_SEX1','BOROUGH']]
df_pre1["PERSON_AGE1"] = pd.to_numeric(df_pre1["PERSON_AGE1"], downcast="integer")
def agefunc(x):
    # Each branch is right-closed: e.g. age 10 is labelled '5-9' and age 25 is labelled '20-24'
    if x <=5:
        x = '0-5'
    elif x > 5 and x <= 10:
        x = '5-9'
    elif x > 10 and x <= 15:
        x = '10-14'
    elif x > 15 and x <= 20:
        x = '15-19'
    elif x > 20 and x <= 25:
        x = '20-24'
    elif x > 25 and x <= 30:
        x = '25-29'
    elif x > 30 and x <=35:
        x = '30-34'    
    elif x > 35 and x <= 40:
        x = '35-39'
    elif x > 40 and x <= 45:
        x = '40-44'    
    elif x > 45 and x <= 50:
        x = '45-49'    
    elif x > 50 and x <= 55:
        x = '50-54'    
    elif x > 55 and x <= 60:
        x = '55-59'
    elif x > 60 and x <= 65:
        x = '60-64'
    elif x > 65 and x <= 70:
        x = '65-69'
    elif x > 70 and x <= 75:
        x = '70-74'
    elif x > 75 and x <= 80:
        x = '75-79'    
    elif x > 80 and x <= 85:
        x = '80-84'
    elif x > 85:
        x = '85 and over'
    return x

df_pre1['GROUPED_AGE'] = df_pre1['PERSON_AGE1'].apply(agefunc)

df_pre1 = pd.pivot_table(df_pre1, values='BOROUGH', index=['GROUPED_AGE'], columns=['PERSON_SEX1'], aggfunc='count').fillna(0)
df_pre1 = df_pre1.iloc[:,[0,1]] # keep the first two columns, dropping persons of unspecified sex


age_tot_male = 4112539
age_tot_female = 4510159
# Share (%) of the population in each age group; the same shares are applied to both sexes
age_labels = ['0-5','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44',
              '45-49','50-54','55-59','60-64','65-69','70-74','75-79','80-84','85 and over']
age_pct = [6.4, 5.5, 5.6, 5.3, 6.7, 9.4, 8.6, 7.3, 6.4, 6.4, 6.4, 6.0, 5.7, 4.6, 3.4, 2.6, 1.8, 2.0]
df_dist = pd.DataFrame({'F_dist': [p*age_tot_female/100 for p in age_pct],
                        'M_dist': [p*age_tot_male/100 for p in age_pct]},
                       index=age_labels)

df_dist = df_dist/1000
df_pre2 = df_pre1.iloc[[0,9,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17]] 
age = df_pre2.index.tolist()
df_pre2 = df_pre2.div(df_dist.values, axis = 1) # crashes per 1000 people
source = ColumnDataSource(df_pre2)
stacks = df_pre2.columns.to_list()
output_notebook()

In [13]:
TOOLTIPS = [("PERSON_SEX1","$name"), ("Fraction","@$name")]
p = figure(x_range = FactorRange(factors = age),width=850,plot_height=500, tooltips=TOOLTIPS, title="Crashes for different age groups from 2012 to 2020")
p.vbar_stack(stackers=stacks,x = 'GROUPED_AGE', color=['chocolate', 'steelblue'],width = 0.9,source= source)
p.xaxis.major_label_orientation = np.pi/4
p.xaxis.axis_label = 'Age Groups'
p.yaxis.axis_label = 'Crashes per 1000 people'
show(p) #displays your plot

Tada! This plot, which is normalized by values from this [website](https://www.baruch.cuny.edu/nycdata/population-geography/age_distribution.htm), shows the number of crashes per 1000 people for males (M) and females (F). One important thing to notice here is that the trend we saw in the exploratory data analysis, in which the number of crashes seemed to flatten out between ages 30 and 55 before dropping, is gone! This is due to the binning procedure employed to produce this figure. So here we get a glimpse of how susceptible the trends in the data are to how we treat it.
Obviously, we see a lot of crashes where young gentlemen are involved, but the same trend is also visible for females. So even if we young guys crash more often, our female counterparts appear to do the same!

## 4 - Genre

## 5 - Visualizations

## 6 - Discussion

Improvements

* Proper feature handling of age and genders (how do we look at the crashes, solo accidents vs 2 cars in a single crash etc.) 


## 7 - Contributions