## NYPD Motor Vehicle Collision Data

### Overview

The 'Motor Vehicle Collisions - Crashes' table contains details on crash events, with one row per crash. It covers all police-reported motor vehicle collisions in NYC. The dataset is available at: https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions-Crashes/h9gi-nx95

### High-Level Description

The data run from 2012 to the present and are updated daily. At the time of this writing, there are about 1.6 million rows, one per crash event, and 29 columns: crash date, crash time, borough, zip code, latitude, longitude, location, on and off street name, cross street name, number of persons injured, number of persons killed, number of pedestrians injured, number of pedestrians killed, number of cyclists injured, number of cyclists killed, number of motorists injured, number of motorists killed, contributing factors, vehicle type codes and collision ID.

### Bring in the data

Let's start by bringing in the data! The `$limit` parameter caps the download at 3 million rows, which is more than the dataset currently holds, so we get every row.

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
datanyc = pd.read_csv("https://data.cityofnewyork.us/resource/h9gi-nx95.csv?$limit=3000000", low_memory=False)

Let's look at the first 10 rows to get an idea of what the dataset looks like.

In [2]:
pd.set_option('display.max_columns', None)
datanyc.head(10)

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2019-01-15T00:00:00.000,7:56,,,,,,VERRAZANO BRIDGE LOWER,,,0.0,0.0,0,0,0,0,0,0,Following Too Closely,Unspecified,,,,4068241,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,,,
1,2019-02-05T00:00:00.000,17:25,,,40.822784,-73.9574,POINT (-73.9574 40.822784),HENRY HUDSON PARKWAY,,,0.0,0.0,0,0,0,0,0,0,Following Too Closely,Unspecified,,,,4075461,Sedan,Sedan,,,
2,2019-01-25T00:00:00.000,18:25,BROOKLYN,11236.0,40.647255,-73.89072,POINT (-73.89072 40.647255),EAST 108 STREET,AVENUE J,,1.0,0.0,0,0,1,0,0,0,Failure to Yield Right-of-Way,Unspecified,,,,4072943,Bike,,,,
3,2019-02-08T00:00:00.000,9:30,QUEENS,11004.0,40.751026,-73.71374,POINT (-73.71374 40.751026),263 STREET,74 AVENUE,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4077193,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,,,
4,2019-01-15T00:00:00.000,16:00,,,40.8827,-73.89273,POINT (-73.89273 40.8827),SEDGWICK AVENUE,,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4067853,Sedan,Sedan,,,
5,2019-02-07T00:00:00.000,16:00,,,40.726376,-73.76624,POINT (-73.76624 40.726376),EPSOM COURSE,GRAND CENTRAL PARKWAY,,1.0,0.0,0,0,0,0,1,0,Following Too Closely,Unspecified,,,,4076758,Station Wagon/Sport Utility Vehicle,Sedan,,,
6,2019-02-08T00:00:00.000,11:25,,,,,,FLATBUSH AVENUE,GRAND ARMY PLAZA,,0.0,0.0,0,0,0,0,0,0,Unsafe Lane Changing,Unspecified,,,,4077220,Sedan,Tractor Truck Diesel,,,
7,2019-01-21T00:00:00.000,15:00,MANHATTAN,10033.0,40.84787,-73.93641,POINT (-73.93641 40.84787),WEST 178 STREET,WADSWORTH AVENUE,,0.0,0.0,0,0,0,0,0,0,Unspecified,,,,,4069800,Sedan,,,,
8,2019-01-12T00:00:00.000,6:00,BROOKLYN,11205.0,40.69441,-73.9762,POINT (-73.9762 40.69441),,,39 AUBURN PLACE,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4065633,Station Wagon/Sport Utility Vehicle,,,,
9,2019-01-31T00:00:00.000,9:36,BROOKLYN,11230.0,40.62418,-73.97048,POINT (-73.97048 40.62418),OCEAN PARKWAY,AVENUE J,,1.0,0.0,0,0,0,0,1,0,Unspecified,Unspecified,,,,4072649,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,,,


In [3]:
datanyc.shape

(1624091, 29)

We have 1,624,091 rows and 29 columns. Let's see the data types.

In [4]:
pd.options.display.max_info_rows = 3000000
datanyc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1624091 entries, 0 to 1624090
Data columns (total 29 columns):
crash_date                       1624091 non-null object
crash_time                       1624091 non-null object
borough                          1130767 non-null object
zip_code                         1130566 non-null object
latitude                         1426221 non-null float64
longitude                        1426221 non-null float64
location                         1426221 non-null object
on_street_name                   1305160 non-null object
off_street_name                  1077604 non-null object
cross_street_name                226766 non-null object
number_of_persons_injured        1624074 non-null float64
number_of_persons_killed         1624060 non-null float64
number_of_pedestrians_injured    1624091 non-null int64
number_of_pedestrians_killed     1624091 non-null int64
number_of_cyclist_injured        1624091 non-null int64
number_of_cyclist_killed        
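
The non-null counts above show that several columns (for example `borough` and `zip_code`) are incomplete. As a rough sketch of how to turn that into a share of missing values per column, here is the idea on a small made-up frame; `toy` and its values are assumptions for illustration, not rows from the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for datanyc; the values are invented for illustration.
toy = pd.DataFrame({
    "borough": ["BROOKLYN", None, "QUEENS", None],
    "latitude": [40.64, 40.82, np.nan, 40.88],
    "collision_id": [1, 2, 3, 4],
})

# isna() flags missing cells; the mean of each boolean column is its missing share.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same one-liner should work on `datanyc` itself, where it would list the columns with the most gaps first.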

I will change `crash_date` and `crash_time` to a datetime data type. 

In [5]:
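# Parse the date strings into datetimes.
datanyc['crash_date'] = pd.to_datetime(datanyc['crash_date'])
# crash_time holds only a clock time (e.g. "7:56"), so pandas attaches a placeholder date; only the hour is used later.
datanyc['crash_time'] = pd.to_datetime(datanyc['crash_time'])
datanyc.info()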

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1624091 entries, 0 to 1624090
Data columns (total 29 columns):
crash_date                       1624091 non-null datetime64[ns]
crash_time                       1624091 non-null datetime64[ns]
borough                          1130767 non-null object
zip_code                         1130566 non-null object
latitude                         1426221 non-null float64
longitude                        1426221 non-null float64
location                         1426221 non-null object
on_street_name                   1305160 non-null object
off_street_name                  1077604 non-null object
cross_street_name                226766 non-null object
number_of_persons_injured        1624074 non-null float64
number_of_persons_killed         1624060 non-null float64
number_of_pedestrians_injured    1624091 non-null int64
number_of_pedestrians_killed     1624091 non-null int64
number_of_cyclist_injured        1624091 non-null int64
number_of_cyclis
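
Since `crash_time` is only a clock time, the date part of the parsed value is a placeholder. A minimal sketch, on made-up values in the same `H:MM` style, of parsing with an explicit format and pulling out the hour; passing `format="%H:%M"` is my assumption about how the column is laid out, not something the dataset documents.

```python
import pandas as pd

# Made-up times in the same style as the crash_time column.
times = pd.Series(["7:56", "17:25", "0:05"])

# With an explicit format, pandas fills the missing date part with 1900-01-01.
parsed = pd.to_datetime(times, format="%H:%M")

# The hour is what matters for the model features later on.
hours = parsed.dt.hour
print(hours.tolist())
```

Either way of parsing gives the same hours; the explicit format just makes the placeholder date predictable.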

I'd like to use scikit-learn (sklearn) to fit a decision tree that predicts whether a crash results in 0, 1, or 2 injuries.

In [6]:
# Keep crashes with 0, 1, or 2 injuries and only the columns used by the model.
simple_nyc = datanyc.loc[datanyc['number_of_persons_injured'].isin([0, 1, 2]),
                         ["latitude", "longitude", "number_of_persons_injured", "crash_date", "crash_time"]].dropna()

# Derive the hour of day and the day of the month, then drop the raw timestamps.
simple_nyc['hour'] = simple_nyc['crash_time'].dt.hour
simple_nyc['day'] = simple_nyc['crash_date'].dt.day
simple_nyc.drop(columns=['crash_time', 'crash_date'], inplace=True)
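
To make it easier to see what this step produces, here is the same filter-and-derive logic on a tiny made-up frame. The `toy` rows are invented; only the column names come from the dataset.

```python
import pandas as pd

# Four invented crashes: one with 3 injuries and one with a missing latitude.
toy = pd.DataFrame({
    "latitude": [40.64, 40.82, None, 40.88],
    "longitude": [-73.89, -73.95, -73.71, -73.89],
    "number_of_persons_injured": [0.0, 3.0, 1.0, 2.0],
    "crash_date": pd.to_datetime(["2019-01-15", "2019-02-05", "2019-01-25", "2019-02-08"]),
    "crash_time": pd.to_datetime(["7:56", "17:25", "18:25", "9:30"], format="%H:%M"),
})

# Keep 0, 1, or 2 injuries, drop incomplete rows, then derive hour and day of month.
keep = toy["number_of_persons_injured"].isin([0, 1, 2])
simple = (
    toy.loc[keep]
    .dropna()
    .assign(hour=lambda d: d["crash_time"].dt.hour,
            day=lambda d: d["crash_date"].dt.day)
    .drop(columns=["crash_time", "crash_date"])
)
print(simple)
```

Two rows survive: the crash with 3 injuries is filtered out and the one without a latitude is dropped. Note that `day` is the day of the month, not the day of the week.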

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Separate the predictors from the outcome and hold out a test set.
predictors, outcome = simple_nyc.drop('number_of_persons_injured',axis=1), simple_nyc['number_of_persons_injured']
X_train, X_test, y_train, y_test = train_test_split(predictors, outcome, random_state=1)

# Fit a decision tree with default settings (no depth limit).
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [8]:
predictions = model.predict(X_train)

In [9]:
model_outcome_training = pd.DataFrame({"prediction": predictions, "actual": y_train})

model_outcome_training.head(20)

Unnamed: 0,prediction,actual
910220,2.0,2.0
978840,0.0,0.0
456795,0.0,0.0
878059,0.0,0.0
6965,0.0,0.0
549676,0.0,0.0
854718,0.0,0.0
974173,0.0,0.0
194439,0.0,0.0
681056,0.0,0.0


In [10]:
(model_outcome_training['prediction'] == model_outcome_training['actual']).value_counts()

True     1044432
False       8995
dtype: int64
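
Those two counts boil down to a single accuracy figure. A small sketch of the arithmetic: the toy `outcome` frame is made up to show the `.mean()` trick, while the two counts are copied from the output above.

```python
import pandas as pd

# Toy example: the mean of a boolean Series is the share of True values.
outcome = pd.DataFrame({"prediction": [0.0, 0.0, 1.0, 2.0],
                        "actual":     [0.0, 1.0, 1.0, 2.0]})
is_match = outcome["prediction"] == outcome["actual"]
toy_accuracy = is_match.mean()

# The same ratio using the True/False counts reported above.
n_true, n_false = 1044432, 8995
training_accuracy = n_true / (n_true + n_false)
print(toy_accuracy, round(training_accuracy, 4))
```

On the real frame, `(model_outcome_training['prediction'] == model_outcome_training['actual']).mean()` should give the training accuracy directly.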

Mostly `True`, which looks good. Let's break the numbers down by class.

In [11]:
pd.crosstab(model_outcome_training['prediction'], model_outcome_training['actual'])

prediction (rows) / actual (columns),0.0,1.0,2.0
0.0,861569,7260,1356
1.0,107,154741,261
2.0,8,3,28122


That's not bad: most predictions are accurate. But these are the same rows the tree was trained on, so let's see how it does on the testing data.

In [12]:
model_outcome_testing = pd.DataFrame({"prediction": model.predict(X_test), "actual": y_test})

pd.crosstab(model_outcome_testing['prediction'], model_outcome_testing['actual'])

prediction (rows) / actual (columns),0.0,1.0,2.0
0.0,231772,41841,7611
1.0,46245,10095,1845
2.0,9273,1962,499
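
To put a single number on this table, here is a quick sketch that computes the overall test accuracy from the counts above and compares it with what you would get by always predicting 0 injuries. The counts are copied from the table; the variable names are mine.

```python
import numpy as np

# Rows are predictions, columns are actual classes (0, 1, 2), copied from the table above.
test_counts = np.array([
    [231772, 41841, 7611],
    [46245, 10095, 1845],
    [9273, 1962, 499],
])

total = test_counts.sum()

# Correct predictions sit on the diagonal.
test_accuracy = np.trace(test_counts) / total

# Always predicting 0 is right whenever the actual class is 0 (first column).
always_zero_accuracy = test_counts[:, 0].sum() / total

print(f"tree: {test_accuracy:.3f}, always zero: {always_zero_accuracy:.3f}")
```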


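And let's see the percentages. With `normalize='index'`, each row sums to 1 and shows how the actual outcomes are distributed for a given prediction.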

In [14]:
pd.crosstab(model_outcome_testing['prediction'], model_outcome_testing['actual'], normalize='index').round(2)

prediction (rows) / actual (columns),0.0,1.0,2.0
0.0,0.82,0.15,0.03
1.0,0.79,0.17,0.03
2.0,0.79,0.17,0.04
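
Row percentages answer "given this prediction, what actually happened?". The complementary view, "given what actually happened, how often did the tree catch it?", comes from normalizing by column instead. A sketch using the test counts from the earlier table (rebuilt by hand here, so the frame name `test_counts` is an assumption):

```python
import pandas as pd

# Test crosstab rebuilt from the counts shown earlier: rows = prediction, columns = actual.
test_counts = pd.DataFrame(
    [[231772, 41841, 7611], [46245, 10095, 1845], [9273, 1962, 499]],
    index=pd.Index([0.0, 1.0, 2.0], name="prediction"),
    columns=pd.Index([0.0, 1.0, 2.0], name="actual"),
)

# Divide each column by its total, so every column sums to 1.
by_actual = test_counts.div(test_counts.sum(axis=0), axis=1)

# The diagonal is the share of each actual class that was predicted correctly (recall).
recall = pd.Series({c: by_actual.loc[c, c] for c in test_counts.columns})
print(recall.round(2))
```

Calling `pd.crosstab(..., normalize='columns')` on the prediction frame should give the same column shares without rebuilding the table.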


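It is not perfect, and there is clearly room to improve: the tree is right on about 99% of the training rows but only about 69% of the testing rows, and since roughly 82% of the testing crashes have zero injuries, always predicting 0 would score higher. The nearly identical rows in the percentage table tell the same story: the prediction says little about the actual outcome. Still, it is a starting point.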
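
One possible next step is to compare the tree against a majority-class baseline and to try limiting its depth. The sketch below runs on synthetic data that I generate on the spot (random features and an imbalanced three-class outcome unrelated to them), so the scores it prints say nothing about the crash data; it only shows the mechanics. `DummyClassifier` and `max_depth` are standard scikit-learn; the value `max_depth=3` is an arbitrary choice.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 4 random features, imbalanced outcome unrelated to them.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = rng.choice([0, 1, 2], size=2000, p=[0.8, 0.15, 0.05])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Unrestricted tree (as in this notebook) versus a depth-limited one.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, m in [("baseline", baseline), ("deep tree", deep_tree), ("shallow tree", shallow_tree)]:
    train_acc = m.score(X_train, y_train)
    test_acc = m.score(X_test, y_test)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between the training and testing score of the unrestricted tree is the overfitting signature seen above; swapping in `X_train`, `X_test`, `y_train`, and `y_test` from this notebook would show whether a shallower tree narrows it.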

## Thank you for reading!