# Final Project

### Imports and data loading for plots

In [3]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
import numpy as np
import seaborn as sns
from bokeh.plotting import figure, show
from bokeh.io import output_notebook, curdoc
from bokeh.models import (ColumnDataSource, FactorRange, Grid, HBar, HoverTool,
                          LabelSet, Legend, LinearAxis, Plot, Select)
from bokeh.core.properties import value
from bokeh.transform import factor_cmap, dodge
from bokeh.palettes import Spectral10
from bokeh.layouts import column, row
import warnings
warnings.filterwarnings('ignore')

# select a palette
from bokeh.palettes import Spectral3
from bokeh.palettes import Category20b_13 as palette
from bokeh.palettes import Category20b_14 as palette2
# itertools handles the cycling
import itertools  


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
from sklearn import tree

sns.set(style='darkgrid', palette='muted', color_codes=True)



# Magic command useful for jupyter notebook
%matplotlib inline

# Set plot size. 
plt.rcParams['figure.figsize'] = [13, 6]

# Set font size
plt.rcParams.update({'font.size': 22})

In [6]:
df_crash = pd.read_csv('data/Motor_Vehicle_Collisions_-_Crashes.csv')
df_vehicle = pd.read_csv('data/Motor_Vehicle_Collisions_-_Vehicles.csv')
df_people = pd.read_csv('data/Motor_Vehicle_Collisions_-_Person.csv')

## 1 - Motivation

Three datasets were used in this project. The main dataset was a detailed table of vehicle crash incidents in New York from 2012 to 2019. Each incident had a unique collision ID, which linked it to the two other datasets: records of the people and of the vehicles involved in those crashes.

These datasets were chosen because of their large number of variables. They had the core variables that are essential to this sort of analysis, such as the time, date and location of each crash (GPS coordinates and borough). In addition, there were several interesting variables such as the street name of the crash, the number of persons killed or injured, and the contributing factor of the crash.
With such a dataset, the challenge was not finding parameters to analyse, but rather carefully selecting a few from the vast number of possibilities.

The main goal was to provide the user with easy-to-understand visualizations and, in some cases, with tools to interactively select the data they would like to see, in order to encourage user engagement.


## 2 - Basic stats

### Raw dataset stats

To best explain how the preprocessing and cleaning was done, an overview of the initial raw dataset is given in this section. 

**Crashes:** 
- 1.67 M rows
- 29 columns
- 362 MB

**Vehicles:** 
- 3.35 M rows
- 25 columns
- 551 MB

**People:** 
- 3.91 M rows
- 21 columns
- 624 MB 
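
Figures like the ones above can be gathered with a small helper. The following is a minimal sketch, not the notebook's original code: the `summarize` function and the `toy` frame are hypothetical, and the reported size is the in-memory footprint, which may differ from the size of the CSV file on disk.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return row count, column count and in-memory size in MB (hypothetical helper)."""
    n_rows, n_cols = df.shape
    # deep=True also counts the bytes held by string (object) columns
    size_mb = float(df.memory_usage(deep=True).sum()) / 1e6
    return {"rows": n_rows, "columns": n_cols, "size_mb": size_mb}


# Tiny stand-in frame; in the notebook this would be df_crash, df_vehicle or df_people
toy = pd.DataFrame({
    "COLLISION_ID": [1, 2, 3],
    "BOROUGH": ["BRONX", None, "QUEENS"],
})

stats = summarize(toy)
print(stats)
```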

### Cleaning and preprocessing

Initially there were many NaN values in the datasets, as seen below. To avoid losing large amounts of data by dropping every row with a NaN, only the columns appropriate for analysis were extracted from the three datasets. Many columns had over a million NaN values and were thus not suitable for analysis. Important parameters such as crash time, crash date and location (latitude/longitude/borough) were mostly complete.

In [7]:
df_crash.isnull().sum()

CRASH DATE                             0
CRASH TIME                             0
BOROUGH                           509839
ZIP CODE                          510046
LATITUDE                          201721
LONGITUDE                         201721
LOCATION                          201721
ON STREET NAME                    330899
CROSS STREET NAME                 570911
OFF STREET NAME                  1433844
NUMBER OF PERSONS INJURED             17
NUMBER OF PERSONS KILLED              31
NUMBER OF PEDESTRIANS INJURED          0
NUMBER OF PEDESTRIANS KILLED           0
NUMBER OF CYCLIST INJURED              0
NUMBER OF CYCLIST KILLED               0
NUMBER OF MOTORIST INJURED             0
NUMBER OF MOTORIST KILLED              0
CONTRIBUTING FACTOR VEHICLE 1       4518
CONTRIBUTING FACTOR VEHICLE 2     227813
CONTRIBUTING FACTOR VEHICLE 3    1563865
CONTRIBUTING FACTOR VEHICLE 4    1649657
CONTRIBUTING FACTOR VEHICLE 5    1666553
COLLISION_ID                           0
VEHICLE TYPE COD

In [9]:
df_vehicle.isnull().sum()

UNIQUE_ID                            0
COLLISION_ID                         0
CRASH_DATE                           0
CRASH_TIME                           0
VEHICLE_ID                           0
STATE_REGISTRATION              152152
VEHICLE_TYPE                    132033
VEHICLE_MAKE                   1713629
VEHICLE_MODEL                  3294186
VEHICLE_YEAR                   1720659
TRAVEL_DIRECTION               1607383
VEHICLE_OCCUPANTS              1668167
DRIVER_SEX                     1917770
DRIVER_LICENSE_STATUS          1971646
DRIVER_LICENSE_JURISDICTION    1961964
PRE_CRASH                       850587
POINT_OF_IMPACT                1628802
VEHICLE_DAMAGE                 1640673
VEHICLE_DAMAGE_1               2277825
VEHICLE_DAMAGE_2               2557501
VEHICLE_DAMAGE_3               2745625
PUBLIC_PROPERTY_DAMAGE         1528863
PUBLIC_PROPERTY_DAMAGE_TYPE    3331696
CONTRIBUTING_FACTOR_1            92818
CONTRIBUTING_FACTOR_2          1620959
dtype: int64

In [8]:
df_people.isnull().sum()

UNIQUE_ID                      0
COLLISION_ID                   0
CRASH_DATE                     0
CRASH_TIME                     0
PERSON_ID                     19
PERSON_TYPE                    0
PERSON_INJURY                  0
VEHICLE_ID                151782
PERSON_AGE                294988
EJECTION                 1911145
EMOTIONAL_STATUS         1864739
BODILY_INJURY            1864696
POSITION_IN_VEHICLE      1910875
SAFETY_EQUIPMENT         1910925
PED_LOCATION             3858762
PED_ACTION               3858863
COMPLAINT                1864689
PED_ROLE                  194895
CONTRIBUTING_FACTOR_1    3859973
CONTRIBUTING_FACTOR_2    3860035
PERSON_SEX                468460
dtype: int64
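
The raw NaN counts above are easier to compare across datasets when expressed as a share of rows. The sketch below is only one possible aid for choosing columns: the `toy` frame and the `MAX_MISSING` threshold are assumptions for illustration, and the selection in this project was made by hand, keeping some sparse columns that were still of interest.

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this would be e.g. df_vehicle
toy = pd.DataFrame({
    "COLLISION_ID": [1, 2, 3, 4],
    "VEHICLE_TYPE": ["Sedan", None, "Taxi", "Sedan"],
    "VEHICLE_MODEL": [None, None, None, "X"],
})

# Share of missing values per column (the mean of a boolean mask is a fraction)
missing_share = toy.isnull().mean()

# Hypothetical cut-off: keep columns with at most half of their values missing
MAX_MISSING = 0.5
keep = missing_share[missing_share <= MAX_MISSING].index.tolist()
toy_selected = toy[keep]

print(missing_share)
print(keep)
```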

After selecting the columns used for analysis, the datasets had the following columns and NaN values:

In [10]:
# Keep only the crash columns used in the analysis
df_crash = df_crash[['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'LATITUDE', 'LONGITUDE',
                     'ON STREET NAME', 'CROSS STREET NAME', 'NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED',
                     'NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED',
                     'NUMBER OF MOTORIST INJURED', 'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
                     'CONTRIBUTING FACTOR VEHICLE 2', 'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2']]
# Keep only the vehicle columns used in the analysis
df_vehicle = df_vehicle[['UNIQUE_ID', 'COLLISION_ID', 'CRASH_DATE', 'CRASH_TIME', 'VEHICLE_ID',
                         'VEHICLE_TYPE', 'VEHICLE_YEAR', 'DRIVER_SEX', 'PRE_CRASH', 'POINT_OF_IMPACT', 'VEHICLE_DAMAGE']]
# For the people dataset it is shorter to drop the sparse columns instead
df_people = df_people.drop(columns=['PED_LOCATION', 'PED_ACTION', 'CONTRIBUTING_FACTOR_1', 'CONTRIBUTING_FACTOR_2',
                                    'EJECTION', 'EMOTIONAL_STATUS', 'BODILY_INJURY', 'POSITION_IN_VEHICLE',
                                    'SAFETY_EQUIPMENT', 'COMPLAINT'])

In [11]:
df_crash.isnull().sum()

CRASH DATE                            0
CRASH TIME                            0
BOROUGH                          509839
LATITUDE                         201721
LONGITUDE                        201721
ON STREET NAME                   330899
CROSS STREET NAME                570911
NUMBER OF PERSONS INJURED            17
NUMBER OF PERSONS KILLED             31
NUMBER OF PEDESTRIANS INJURED         0
NUMBER OF PEDESTRIANS KILLED          0
NUMBER OF CYCLIST INJURED             0
NUMBER OF CYCLIST KILLED              0
NUMBER OF MOTORIST INJURED            0
NUMBER OF MOTORIST KILLED             0
CONTRIBUTING FACTOR VEHICLE 1      4518
CONTRIBUTING FACTOR VEHICLE 2    227813
COLLISION_ID                          0
VEHICLE TYPE CODE 1                5944
VEHICLE TYPE CODE 2              280627
dtype: int64

In [12]:
df_vehicle.isnull().sum()

UNIQUE_ID                0
COLLISION_ID             0
CRASH_DATE               0
CRASH_TIME               0
VEHICLE_ID               0
VEHICLE_TYPE        132033
VEHICLE_YEAR       1720659
DRIVER_SEX         1917770
PRE_CRASH           850587
POINT_OF_IMPACT    1628802
VEHICLE_DAMAGE     1640673
dtype: int64

In [13]:
df_people.isnull().sum()

UNIQUE_ID             0
COLLISION_ID          0
CRASH_DATE            0
CRASH_TIME            0
PERSON_ID            19
PERSON_TYPE           0
PERSON_INJURY         0
VEHICLE_ID       151782
PERSON_AGE       294988
PED_ROLE         194895
PERSON_SEX       468460
dtype: int64

The next step was to decide what to do with the rows containing NaN values. To avoid removing data as far as possible, it was decided to fill the missing values with 'Unspecified'. This way a row would still count in any analysis of the columns where it had valid values. When a column that originally contained NaN values was analyzed, the 'Unspecified' rows could either be removed at that point or simply included in the analysis. After this step all columns had zero NaN values and could be used for analysis.
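
The fill step is not shown as a code cell, so the following is a minimal sketch of how it could look, assuming `DataFrame.fillna` was used; the `toy` frame stands in for the real datasets.

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this would be df_crash, df_vehicle or df_people
toy = pd.DataFrame({
    "COLLISION_ID": [1, 2, 3],
    "BOROUGH": ["BROOKLYN", None, "QUEENS"],
    "CONTRIBUTING FACTOR VEHICLE 1": [None, "Unsafe Speed", None],
})

# Replace every missing value with the placeholder string
filled = toy.fillna("Unspecified")

# No NaN values remain, and no rows were dropped
remaining_nans = int(filled.isnull().sum().sum())

# When a specific column is analyzed, the placeholder rows can be filtered out
known_borough = filled[filled["BOROUGH"] != "Unspecified"]

print(remaining_nans, len(filled), len(known_borough))
```

Note that filling a numeric column with a string generally turns it into a text (object) column, so numeric columns may need to be filtered and converted back before doing arithmetic on them.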

**Other considerations**

After the datasets had been cleaned of NaN values, we had to consider the 'bad' values in the dataset. These include numbers that were entered incorrectly when recorded and invalid strings. One example was a vehicle year recorded as 1100, which was deemed highly unlikely.

The list of bad values included:

- Unlikely person ages.
- Misspellings of vehicle type (creating several groups for the same data in a groupby).
- Unlikely vehicle years.
- Incorrect latitudes/longitudes.

As the number of bad values was high for each column, and in some cases difficult to identify, it was decided to handle them when appropriate, as visualizations often revealed the bad values.
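
As an illustration of what such case-by-case handling might look like, the sketch below filters the four kinds of bad values listed above. All bounds (ages 0 to 110, vehicle years 1900 to 2020, and a rough bounding box around New York City) are assumptions chosen for the example, not values taken from the project. The columns come from different datasets and are combined in one `toy` frame only for brevity.

```python
import pandas as pd

# Tiny stand-in frame mixing columns from the people, vehicle and crash datasets
toy = pd.DataFrame({
    "PERSON_AGE": [34, -5, 250, 61],
    "VEHICLE_YEAR": [2015, 1100, 2008, 2012],
    "VEHICLE_TYPE": ["Sedan", "SEDAN", " taxi", "Taxi"],
    "LATITUDE": [40.71, 0.0, 40.65, 40.80],
    "LONGITUDE": [-74.00, 0.0, -73.95, -73.93],
})

# Assumed plausibility bounds (hypothetical, tune per analysis)
age_ok = toy["PERSON_AGE"].between(0, 110)
year_ok = toy["VEHICLE_YEAR"].between(1900, 2020)
in_nyc = toy["LATITUDE"].between(40.4, 41.0) & toy["LONGITUDE"].between(-74.3, -73.6)

# Normalize case and whitespace so spelling variants fall into one group
toy["VEHICLE_TYPE"] = toy["VEHICLE_TYPE"].str.strip().str.upper()

# Keep only the rows that pass every check
clean = toy[age_ok & year_ok & in_nyc]

print(clean)
print(toy["VEHICLE_TYPE"].value_counts())
```

Normalizing case and whitespace only merges trivial variants; genuine misspellings would still need a manual mapping.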

### Final dataset stats

After all the cleaning was done the final datasets had the following stats: 

**Crashes:** 
- 1.67 M rows
- 20 columns

**Vehicles:** 
- 3.35 M rows
- 11 columns


**People:** 
- 3.91 M rows
- 11 columns


## 3 - Data analysis

- Describe your data analysis and explain what you've learned about the dataset.

- If relevant, talk about your machine-learning.

## 4 - Genre

## 5 - Visualizations

## 6 - Discussion

## 7 - Contributions