## NYPD Motor Vehicle Collision Data

### Overview

The 'Motor Vehicle Collisions - Crashes' table contains details on crash events, with one row per crash. It covers all police-reported motor vehicle collisions in NYC. The dataset can be found at: https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions-Crashes/h9gi-nx95

### High-Level Description

The data runs from 2012 to the present and is updated daily. At the time of this writing, there are 1.6 million rows, each representing a crash event, and 29 columns: crash date, crash time, borough, zip code, latitude, longitude, location, on and off street name, cross street name, number of persons injured, number of persons killed, number of pedestrians injured, number of pedestrians killed, number of cyclists injured, number of cyclists killed, number of motorists injured, number of motorists killed, contributing factors, vehicle type codes, and collision ID.

### Bring in the data

Let's start by bringing in the data! I'm only going to bring in the rows that have 'BROOKLYN' in the `borough` field. I'm also capping the request at 3 million rows with `$limit`, which is well above the size of the full dataset, so nothing gets cut off.

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
datanyc = pd.read_csv("https://data.cityofnewyork.us/resource/h9gi-nx95.csv?borough=BROOKLYN&$limit=3000000", low_memory=False)
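
The query string in that URL can also be assembled from a dictionary, which makes the filter and the row cap easier to change. A minimal sketch using only the standard library; the names `BASE_URL` and `params` are illustrative, while the field names and values are the ones from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.csv"

# Same filter and row cap as the URL in In [1]
params = {"borough": "BROOKLYN", "$limit": 3000000}

# safe="$" keeps the "$" in "$limit" from being percent-encoded
url = BASE_URL + "?" + urlencode(params, safe="$")
print(url)
```

The resulting string can be passed to `pd.read_csv` exactly as before.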

Let's look at the first 10 rows to get an idea of what the dataset looks like.

In [2]:
pd.set_option('display.max_columns', None)
datanyc.head(10)

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2019-04-04T00:00:00.000,22:30,BROOKLYN,11207.0,40.654427,-73.8908,POINT (-73.8908 40.654427),WORTMAN AVENUE,ALABAMA AVENUE,,0.0,0.0,0,0,0,0,0,0,Traffic Control Disregarded,Unspecified,,,,4109395,Bus,Sedan,,,
1,2019-04-27T00:00:00.000,0:05,BROOKLYN,11218.0,40.64881,-73.97749,POINT (-73.97749 40.64881),EAST 4 STREET,FORT HAMILTON PARKWAY,,1.0,0.0,0,0,1,0,0,0,Traffic Control Disregarded,Unspecified,,,,4121270,Sedan,Bike,,,
2,2019-04-18T00:00:00.000,19:29,BROOKLYN,11218.0,40.637745,-73.97334,POINT (-73.97334 40.637745),,,438 OCEAN PARKWAY,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4117453,Sedan,,,,
3,2019-04-25T00:00:00.000,18:30,BROOKLYN,11236.0,40.64856,-73.90535,POINT (-73.90535 40.64856),ROCKAWAY AVENUE,FOSTER AVENUE,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4120847,Station Wagon/Sport Utility Vehicle,Sedan,,,
4,2019-04-18T00:00:00.000,16:12,BROOKLYN,11212.0,40.657803,-73.90868,POINT (-73.90868 40.657803),CHESTER STREET,LOTT AVENUE,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4117114,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,,,
5,2019-04-24T00:00:00.000,22:45,BROOKLYN,11237.0,40.698837,-73.91407,POINT (-73.91407 40.698837),IRVING AVENUE,LINDEN STREET,,0.0,0.0,0,0,0,0,0,0,Failure to Yield Right-of-Way,Unspecified,,,,4120286,Convertible,Sedan,,,
6,2019-04-15T00:00:00.000,10:35,BROOKLYN,11201.0,40.696198,-73.98869,POINT (-73.98869 40.696198),ADAMS STREET,TILLARY STREET,,0.0,0.0,0,0,0,0,0,0,Following Too Closely,Unspecified,,,,4115154,Station Wagon/Sport Utility Vehicle,Sedan,,,
7,2019-04-03T00:00:00.000,22:40,BROOKLYN,11201.0,40.696198,-73.98869,POINT (-73.98869 40.696198),TILLARY STREET,ADAMS STREET,,0.0,0.0,0,0,0,0,0,0,Passing or Lane Usage Improper,Unspecified,,,,4109270,Sedan,Station Wagon/Sport Utility Vehicle,,,
8,2019-04-24T00:00:00.000,16:25,BROOKLYN,11225.0,40.65745,-73.956566,POINT (-73.956566 40.65745),HAWTHORNE STREET,BEDFORD AVENUE,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4121249,Sedan,Station Wagon/Sport Utility Vehicle,,,
9,2019-04-13T00:00:00.000,9:40,BROOKLYN,11203.0,40.654434,-73.92139,POINT (-73.92139 40.654434),REMSEN AVENUE,LINDEN BOULEVARD,,0.0,0.0,0,0,0,0,0,0,Following Too Closely,Unspecified,,,,4114448,Station Wagon/Sport Utility Vehicle,Sedan,,,


In [3]:
datanyc.shape

(350345, 29)

We have 350345 rows and 29 columns. 

## Part 1

I want to use hour, day, and season as my predictors of injuries in crashes. To do so, I will add `hour`, `season`, and `day_of_week` columns derived from the information already in the dataset. Let's begin with an `hour` column that keeps only the hour instead of hours and minutes. Because `crash_time` holds a bare time such as `22:30`, `pd.to_datetime` fills in the date on which the code runs (2019-12-15 in the output below); only the hour is used.

In [4]:
datanyc['crash_time'] = pd.to_datetime(datanyc.crash_time)
datanyc['hour'] = datanyc['crash_time'].dt.hour
datanyc.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5,hour
0,2019-04-04T00:00:00.000,2019-12-15 22:30:00,BROOKLYN,11207.0,40.654427,-73.8908,POINT (-73.8908 40.654427),WORTMAN AVENUE,ALABAMA AVENUE,,0.0,0.0,0,0,0,0,0,0,Traffic Control Disregarded,Unspecified,,,,4109395,Bus,Sedan,,,,22
1,2019-04-27T00:00:00.000,2019-12-15 00:05:00,BROOKLYN,11218.0,40.64881,-73.97749,POINT (-73.97749 40.64881),EAST 4 STREET,FORT HAMILTON PARKWAY,,1.0,0.0,0,0,1,0,0,0,Traffic Control Disregarded,Unspecified,,,,4121270,Sedan,Bike,,,,0
2,2019-04-18T00:00:00.000,2019-12-15 19:29:00,BROOKLYN,11218.0,40.637745,-73.97334,POINT (-73.97334 40.637745),,,438 OCEAN PARKWAY,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4117453,Sedan,,,,,19
3,2019-04-25T00:00:00.000,2019-12-15 18:30:00,BROOKLYN,11236.0,40.64856,-73.90535,POINT (-73.90535 40.64856),ROCKAWAY AVENUE,FOSTER AVENUE,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4120847,Station Wagon/Sport Utility Vehicle,Sedan,,,,18
4,2019-04-18T00:00:00.000,2019-12-15 16:12:00,BROOKLYN,11212.0,40.657803,-73.90868,POINT (-73.90868 40.657803),CHESTER STREET,LOTT AVENUE,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4117114,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,,,,16
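
If only the hour is needed, it can also be read straight from the text without building timestamps. A minimal sketch on a toy series, assuming the raw values always look like `H:MM` or `HH:MM` as in the first rows shown earlier; the name `times` is made up for illustration.

```python
import pandas as pd

# Toy sample in the same "H:MM" / "HH:MM" format as the raw crash_time column
times = pd.Series(["22:30", "0:05", "19:29", "9:40"])

# Take the text before the colon and convert it to an integer hour
hours = times.str.split(":").str[0].astype(int)
print(hours.tolist())
```

This avoids the placeholder date entirely, at the cost of failing on values that do not follow the assumed format.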


Next, let's create a column that shows the season in which a crash occurred. Let's look at the `crash_date` column to check whether it has a datetime data type.

In [5]:
pd.options.display.max_info_rows = 3000000
datanyc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350345 entries, 0 to 350344
Data columns (total 30 columns):
crash_date                       350345 non-null object
crash_time                       350345 non-null datetime64[ns]
borough                          350345 non-null object
zip_code                         350340 non-null float64
latitude                         342487 non-null float64
longitude                        342487 non-null float64
location                         342487 non-null object
on_street_name                   285684 non-null object
off_street_name                  285606 non-null object
cross_street_name                64696 non-null object
number_of_persons_injured        350341 non-null float64
number_of_persons_killed         350340 non-null float64
number_of_pedestrians_injured    350345 non-null int64
number_of_pedestrians_killed     350345 non-null int64
number_of_cyclist_injured        350345 non-null int64
number_of_cyclist_killed         350345 

I will change `crash_date` to a datetime data type to be able to create a season column. 

In [6]:
datanyc['crash_date'] = pd.to_datetime(datanyc['crash_date'])
datanyc.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5,hour
0,2019-04-04,2019-12-15 22:30:00,BROOKLYN,11207.0,40.654427,-73.8908,POINT (-73.8908 40.654427),WORTMAN AVENUE,ALABAMA AVENUE,,0.0,0.0,0,0,0,0,0,0,Traffic Control Disregarded,Unspecified,,,,4109395,Bus,Sedan,,,,22
1,2019-04-27,2019-12-15 00:05:00,BROOKLYN,11218.0,40.64881,-73.97749,POINT (-73.97749 40.64881),EAST 4 STREET,FORT HAMILTON PARKWAY,,1.0,0.0,0,0,1,0,0,0,Traffic Control Disregarded,Unspecified,,,,4121270,Sedan,Bike,,,,0
2,2019-04-18,2019-12-15 19:29:00,BROOKLYN,11218.0,40.637745,-73.97334,POINT (-73.97334 40.637745),,,438 OCEAN PARKWAY,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4117453,Sedan,,,,,19
3,2019-04-25,2019-12-15 18:30:00,BROOKLYN,11236.0,40.64856,-73.90535,POINT (-73.90535 40.64856),ROCKAWAY AVENUE,FOSTER AVENUE,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4120847,Station Wagon/Sport Utility Vehicle,Sedan,,,,18
4,2019-04-18,2019-12-15 16:12:00,BROOKLYN,11212.0,40.657803,-73.90868,POINT (-73.90868 40.657803),CHESTER STREET,LOTT AVENUE,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4117114,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,,,,16


In [7]:
datanyc['crash_date'].dt.month.head(20)

0     4
1     4
2     4
3     4
4     4
5     4
6     4
7     4
8     4
9     4
10    4
11    4
12    4
13    4
14    4
15    4
16    4
17    4
18    4
19    4
Name: crash_date, dtype: int64

In [8]:
def season(crash_date):
    # Group months into meteorological seasons (Northern Hemisphere)
    if crash_date.month in (3, 4, 5):
        val = 'Spring'
    elif crash_date.month in (6, 7, 8):
        val = 'Summer'
    elif crash_date.month in (9, 10, 11):
        val = 'Autumn'
    elif crash_date.month in (12, 1, 2):
        val = 'Winter'
    else:
        # Reached only when the month is missing (e.g. NaT)
        val = "Unspecified"
    return val

datanyc['season'] = datanyc['crash_date'].apply(season)

In [9]:
datanyc['season'].value_counts()

Autumn    93793
Summer    93091
Spring    83863
Winter    79598
Name: season, dtype: int64
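
The same month-to-season grouping can be written as a dictionary lookup, which avoids calling a Python function once per row. A minimal sketch on a few made-up dates; `MONTH_TO_SEASON` is a hypothetical name, and the grouping is copied from the `season` function above.

```python
import pandas as pd

# Month number -> season, same grouping as the season() function above
MONTH_TO_SEASON = {
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
    12: "Winter", 1: "Winter", 2: "Winter",
}

dates = pd.to_datetime(pd.Series(["2019-04-04", "2019-07-15", "2019-10-31", "2019-01-02"]))

# Series.map looks each month up in the dictionary
seasons = dates.dt.month.map(MONTH_TO_SEASON)
print(seasons.tolist())
```

Months missing from the dictionary (for example from a missing date) come back as `NaN` rather than "Unspecified".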

Now, let's create a `day_of_week` column which will show the day on which the crash occurred, along with a `year` column.

In [10]:
datanyc['day_of_week'] = datanyc['crash_date'].dt.day_name()  # .dt.weekday_name was removed in pandas 1.0
datanyc['year'] = datanyc['crash_date'].dt.year
datanyc.head(10)

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5,hour,season,day_of_week,year
0,2019-04-04,2019-12-15 22:30:00,BROOKLYN,11207.0,40.654427,-73.8908,POINT (-73.8908 40.654427),WORTMAN AVENUE,ALABAMA AVENUE,,0.0,0.0,0,0,0,0,0,0,Traffic Control Disregarded,Unspecified,,,,4109395,Bus,Sedan,,,,22,Spring,Thursday,2019
1,2019-04-27,2019-12-15 00:05:00,BROOKLYN,11218.0,40.64881,-73.97749,POINT (-73.97749 40.64881),EAST 4 STREET,FORT HAMILTON PARKWAY,,1.0,0.0,0,0,1,0,0,0,Traffic Control Disregarded,Unspecified,,,,4121270,Sedan,Bike,,,,0,Spring,Saturday,2019
2,2019-04-18,2019-12-15 19:29:00,BROOKLYN,11218.0,40.637745,-73.97334,POINT (-73.97334 40.637745),,,438 OCEAN PARKWAY,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4117453,Sedan,,,,,19,Spring,Thursday,2019
3,2019-04-25,2019-12-15 18:30:00,BROOKLYN,11236.0,40.64856,-73.90535,POINT (-73.90535 40.64856),ROCKAWAY AVENUE,FOSTER AVENUE,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4120847,Station Wagon/Sport Utility Vehicle,Sedan,,,,18,Spring,Thursday,2019
4,2019-04-18,2019-12-15 16:12:00,BROOKLYN,11212.0,40.657803,-73.90868,POINT (-73.90868 40.657803),CHESTER STREET,LOTT AVENUE,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4117114,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,,,,16,Spring,Thursday,2019
5,2019-04-24,2019-12-15 22:45:00,BROOKLYN,11237.0,40.698837,-73.91407,POINT (-73.91407 40.698837),IRVING AVENUE,LINDEN STREET,,0.0,0.0,0,0,0,0,0,0,Failure to Yield Right-of-Way,Unspecified,,,,4120286,Convertible,Sedan,,,,22,Spring,Wednesday,2019
6,2019-04-15,2019-12-15 10:35:00,BROOKLYN,11201.0,40.696198,-73.98869,POINT (-73.98869 40.696198),ADAMS STREET,TILLARY STREET,,0.0,0.0,0,0,0,0,0,0,Following Too Closely,Unspecified,,,,4115154,Station Wagon/Sport Utility Vehicle,Sedan,,,,10,Spring,Monday,2019
7,2019-04-03,2019-12-15 22:40:00,BROOKLYN,11201.0,40.696198,-73.98869,POINT (-73.98869 40.696198),TILLARY STREET,ADAMS STREET,,0.0,0.0,0,0,0,0,0,0,Passing or Lane Usage Improper,Unspecified,,,,4109270,Sedan,Station Wagon/Sport Utility Vehicle,,,,22,Spring,Wednesday,2019
8,2019-04-24,2019-12-15 16:25:00,BROOKLYN,11225.0,40.65745,-73.956566,POINT (-73.956566 40.65745),HAWTHORNE STREET,BEDFORD AVENUE,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4121249,Sedan,Station Wagon/Sport Utility Vehicle,,,,16,Spring,Wednesday,2019
9,2019-04-13,2019-12-15 09:40:00,BROOKLYN,11203.0,40.654434,-73.92139,POINT (-73.92139 40.654434),REMSEN AVENUE,LINDEN BOULEVARD,,0.0,0.0,0,0,0,0,0,0,Following Too Closely,Unspecified,,,,4114448,Station Wagon/Sport Utility Vehicle,Sedan,,,,9,Spring,Saturday,2019
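
The weekday can be taken either as text or as a number. A minimal sketch using two crash dates from the output above; `dt.day_name()` and `dt.dayofweek` are standard pandas accessors, and the variable names are illustrative.

```python
import pandas as pd

# Two crash dates taken from the first rows of the output above
dates = pd.to_datetime(pd.Series(["2019-04-04", "2019-04-27"]))

day_names = dates.dt.day_name()    # full English weekday names
day_numbers = dates.dt.dayofweek   # Monday=0 ... Sunday=6
print(day_names.tolist(), day_numbers.tolist())
```

The numeric form keeps the natural Monday-to-Sunday order, which the text form loses once it is sorted alphabetically.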


Finally, let's change the `number_of_persons_injured` column values from float to integer.

In [11]:
datanyc.dropna(subset = ['number_of_persons_injured'], how='all', inplace=True)
datanyc['number_of_persons_injured'] = datanyc.number_of_persons_injured.astype(int)
datanyc.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5,hour,season,day_of_week,year
0,2019-04-04,2019-12-15 22:30:00,BROOKLYN,11207.0,40.654427,-73.8908,POINT (-73.8908 40.654427),WORTMAN AVENUE,ALABAMA AVENUE,,0,0.0,0,0,0,0,0,0,Traffic Control Disregarded,Unspecified,,,,4109395,Bus,Sedan,,,,22,Spring,Thursday,2019
1,2019-04-27,2019-12-15 00:05:00,BROOKLYN,11218.0,40.64881,-73.97749,POINT (-73.97749 40.64881),EAST 4 STREET,FORT HAMILTON PARKWAY,,1,0.0,0,0,1,0,0,0,Traffic Control Disregarded,Unspecified,,,,4121270,Sedan,Bike,,,,0,Spring,Saturday,2019
2,2019-04-18,2019-12-15 19:29:00,BROOKLYN,11218.0,40.637745,-73.97334,POINT (-73.97334 40.637745),,,438 OCEAN PARKWAY,0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4117453,Sedan,,,,,19,Spring,Thursday,2019
3,2019-04-25,2019-12-15 18:30:00,BROOKLYN,11236.0,40.64856,-73.90535,POINT (-73.90535 40.64856),ROCKAWAY AVENUE,FOSTER AVENUE,,0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4120847,Station Wagon/Sport Utility Vehicle,Sedan,,,,18,Spring,Thursday,2019
4,2019-04-18,2019-12-15 16:12:00,BROOKLYN,11212.0,40.657803,-73.90868,POINT (-73.90868 40.657803),CHESTER STREET,LOTT AVENUE,,0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4117114,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,,,,16,Spring,Thursday,2019
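
Dropping the rows with a missing count is one way to get integers; pandas also has a nullable integer dtype that keeps those rows. A minimal sketch on a toy column, not the approach used in this notebook; the series `injured` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column with one missing count, like number_of_persons_injured
injured = pd.Series([0.0, 1.0, np.nan, 2.0])

# Option used above: drop the missing rows, then cast to a plain int
dropped = injured.dropna().astype(int)

# Alternative: keep every row and use the nullable integer dtype
nullable = injured.astype("Int64")
print(dropped.tolist(), nullable.isna().sum())
```

Keeping the rows leaves the index free of gaps, which matters later when frames are lined up by position.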


### Select Data

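Let's create a new data frame with the columns that we are interested in. Note that the `.loc` call below is given the injury counts themselves rather than a boolean condition, so pandas treats each count as a row label: a crash with 0 injuries comes back as a copy of row 0, a crash with 1 injury as a copy of row 1, and so on. This is almost certainly unintended (Part 2 filters with an explicit condition instead), and it is why the output repeats the same two rows.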

In [12]:
clean_nyc = datanyc.loc[(datanyc['number_of_persons_injured']), 
                             ["number_of_persons_injured", "hour", "season", "day_of_week"]].dropna()
clean_nyc.head(10)

Unnamed: 0,number_of_persons_injured,hour,season,day_of_week
0,0,22,Spring,Thursday
1,1,0,Spring,Saturday
0,0,22,Spring,Thursday
0,0,22,Spring,Thursday
0,0,22,Spring,Thursday
0,0,22,Spring,Thursday
0,0,22,Spring,Thursday
0,0,22,Spring,Thursday
0,0,22,Spring,Thursday
0,0,22,Spring,Thursday
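
The repeated rows in this output come from how `.loc` reads its first argument. A minimal sketch on a toy frame showing the difference between label lookup and a boolean mask; the frame `df` and its values are made up, and the `isin([0, 1])` mask is an assumption about the intended filter, modelled on the condition used in Part 2.

```python
import pandas as pd

# Toy frame with the default integer index 0..3
df = pd.DataFrame({"injured": [0, 1, 0, 1], "hour": [22, 0, 19, 18]})

# Passing the column itself makes .loc treat its values as row labels,
# so the result is row 0 or row 1 repeated
by_label = df.loc[df["injured"], ["injured", "hour"]]

# A boolean mask keeps each matching row exactly once
by_mask = df.loc[df["injured"].isin([0, 1]), ["injured", "hour"]]

print(by_label["hour"].tolist())
print(by_mask["hour"].tolist())
```

With the label lookup, `hour` is 0 exactly when `injured` is 1, so the predictor ends up encoding the outcome.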


Because I have string values (`season` and `day_of_week`), I will use "One Hot Encoding" to transform my strings into numeric 0/1 columns.

Let's convert the categories of `season` (there are four) into four columns. Each column will have a 1 in it if that's the right season, and a 0 in it if that's not the right season.

In [13]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

season_nyc = ohe.fit_transform(datanyc['season'].values.reshape(-1,1)).toarray()
season_nyc[5880:5900,]  # I will choose random rows from the middle to figure out the values

array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.]])

In [14]:
datanyc.season[5880:5900,]

5880    Spring
5881    Spring
5882    Spring
5883    Summer
5884    Summer
5885    Summer
5886    Spring
5887    Summer
5888    Summer
5889    Spring
5890    Spring
5891    Spring
5892    Spring
5893    Summer
5894    Spring
5895    Spring
5896    Spring
5897    Summer
5898    Spring
5899    Spring
Name: season, dtype: object

Okay... So the columns probably go Autumn, Spring, Summer, Winter, but let's check another slice to be sure.

In [15]:
season_nyc[15880:15900,]

array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]])

In [16]:
datanyc.season[15880:15900,]

15881    Autumn
15882    Autumn
15883    Autumn
15884    Autumn
15885    Autumn
15886    Autumn
15887    Autumn
15888    Autumn
15889    Autumn
15890    Autumn
15891    Autumn
15892    Autumn
15893    Autumn
15894    Autumn
15895    Autumn
15896    Autumn
15897    Autumn
15898    Autumn
15899    Autumn
15900    Autumn
Name: season, dtype: object

Yes... Autumn, Spring, Summer, Winter, which is simply alphabetical order.
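
Rather than inferring the column order from slices, the fitted encoder can be asked for it directly through its `categories_` attribute. A minimal sketch on a toy array; the names `encoder` and `seasons` are illustrative, and the toy values are made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy season column; order of appearance is deliberately not alphabetical
seasons = np.array(["Spring", "Winter", "Autumn", "Summer", "Spring"]).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(seasons).toarray()

# categories_ holds one array per input column, in output-column order
print(encoder.categories_[0].tolist())
print(encoded[0])
```

The list printed first can be passed as `columns=` when wrapping the encoded array in a data frame, so the names never have to be typed by hand.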

In [17]:
dfOneHotSeason = pd.DataFrame(season_nyc, columns = ["Autumn", "Spring", "Summer", "Winter"])
dfOneHotSeason.head(10)

Unnamed: 0,Autumn,Spring,Summer,Winter
0,0.0,1.0,0.0,0.0
1,0.0,1.0,0.0,0.0
2,0.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0
4,0.0,1.0,0.0,0.0
5,0.0,1.0,0.0,0.0
6,0.0,1.0,0.0,0.0
7,0.0,1.0,0.0,0.0
8,0.0,1.0,0.0,0.0
9,0.0,1.0,0.0,0.0


Next, let's convert the categories of `day_of_week` into seven columns. 

In [18]:
day_nyc = ohe.fit_transform(datanyc['day_of_week'].values.reshape(-1,1)).toarray()
day_nyc[:30,]

array([[0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1.],
 

In [19]:
datanyc.day_of_week[:30,]

0      Thursday
1      Saturday
2      Thursday
3      Thursday
4      Thursday
5     Wednesday
6        Monday
7     Wednesday
8     Wednesday
9      Saturday
10      Tuesday
11       Friday
12    Wednesday
13    Wednesday
14       Sunday
15     Thursday
16       Friday
17       Monday
18      Tuesday
19      Tuesday
20     Thursday
21      Tuesday
22      Tuesday
23     Thursday
24       Sunday
25       Monday
26    Wednesday
27    Wednesday
28      Tuesday
29       Monday
Name: day_of_week, dtype: object

So the order is alphabetical again: Friday (1st), Monday (2nd), Saturday (3rd), Sunday (4th), Thursday (5th), Tuesday (6th) and Wednesday (7th).

In [20]:
dfOneHotDay = pd.DataFrame(day_nyc, columns = ["Friday", "Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"])
dfOneHotDay.head(10)

Unnamed: 0,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,0.0,0.0,0.0,0.0,0.0,0.0,1.0
9,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [21]:
ohe_df = pd.concat([clean_nyc.reset_index(drop=True),
                    dfOneHotSeason.reset_index(drop=True),
                    dfOneHotDay.reset_index(drop=True)], axis=1)
ohe_df.head(10)

Unnamed: 0,number_of_persons_injured,hour,season,day_of_week,Autumn,Spring,Summer,Winter,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
0,0,22,Spring,Thursday,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,1,0,Spring,Saturday,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0,22,Spring,Thursday,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0,22,Spring,Thursday,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0,22,Spring,Thursday,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,0,22,Spring,Thursday,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,0,22,Spring,Thursday,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,0,22,Spring,Thursday,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,0,22,Spring,Thursday,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
9,0,22,Spring,Thursday,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


The one-hot columns are in place. Let's drop `season` and `day_of_week` since their values are included in our new columns now.

In [22]:
ohe_df.drop(columns = ['season', 'day_of_week'], inplace = True)

ohe_df.head(10)

Unnamed: 0,number_of_persons_injured,hour,Autumn,Spring,Summer,Winter,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
0,0,22,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,1,0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0,22,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0,22,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0,22,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,0,22,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,0,22,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,0,22,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,0,22,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
9,0,22,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
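
For comparison, pandas can do the encoding, the column naming, and the dropping of the original string columns in one call. A minimal sketch with `pd.get_dummies` on a toy frame, offered as an alternative rather than what this notebook does; the frame `df` is made up for illustration.

```python
import pandas as pd

# Toy frame with the same categorical columns as clean_nyc
df = pd.DataFrame({
    "hour": [22, 0, 19],
    "season": ["Spring", "Spring", "Winter"],
    "day_of_week": ["Thursday", "Saturday", "Monday"],
})

# Encodes both columns, names the new ones "<column>_<category>",
# and removes the original string columns
encoded = pd.get_dummies(df, columns=["season", "day_of_week"])
print(encoded.columns.tolist())
```

Because the dummies are built from the same frame, each row's indicator columns always belong to that row, with no positional `concat` needed.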


And now I'd like to use scikit-learn (sklearn) to do a logistic regression to predict "0" versus "1" number of persons injured. We'll use season, day of the week, and hour as our predictors.

In [23]:
from sklearn.linear_model import LogisticRegression

predictors, outcome = ohe_df.drop('number_of_persons_injured',axis=1), ohe_df['number_of_persons_injured']
logisticRegr = LogisticRegression()

logisticRegr.fit(predictors, outcome)
predictions = logisticRegr.predict(predictors)



Let's see how we did.

In [24]:
model_outcome = pd.DataFrame({"prediction": predictions, "actual": outcome})

model_outcome.head()

Unnamed: 0,prediction,actual
0,0,0
1,1,1
2,0,0
3,0,0
4,0,0


And see our stats...

In [25]:
(model_outcome['prediction'] == model_outcome['actual']).value_counts()

True     350339
False         2
dtype: int64

Only 2 mismatches out of 350,341 rows. That looks almost too good.

In [26]:
from sklearn.metrics import accuracy_score

accuracy_score(model_outcome['actual'], model_outcome['prediction'])

0.9999942912762138

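An accuracy of 0.99999 is too good to be true for predictors as weak as hour, day, and season. It is an artifact of the label-based `.loc` selection in In [12]: every crash with one injury was turned into a copy of row 1 (hour 0), so `hour` gives away the outcome. Let's see the distribution of the predictions.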

In [27]:
model_outcome['prediction'].value_counts()

0    291475
1     58866
Name: prediction, dtype: int64

So my model guessed "0" 291,475 times and "1" 58,866 times; at least it uses both classes.
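
The accuracy above is measured on the same rows the model was fitted on. A minimal sketch of scoring on held-out rows instead, using `train_test_split` from scikit-learn; the data here is synthetic random noise standing in for the real predictors, and the 25% test share and the seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: hour of day and a 0/1 outcome
rng = np.random.default_rng(0)
X = rng.integers(0, 24, size=(1000, 1))
y = rng.integers(0, 2, size=1000)

# Hold out 25% of the rows so the score reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(round(test_accuracy, 3))
```

A held-out score would not by itself expose a selection problem upstream, but it does guard against a model that has merely memorised its training rows.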

## Part 2

Now, I will use only the `hour` column to predict whether a crash had 0 or 1 persons injured.

In [28]:
hours_nyc = datanyc.loc[(datanyc['number_of_persons_injured'] == 0) | (datanyc['number_of_persons_injured'] == 1), 
                         ['number_of_persons_injured', 'hour']].dropna()
hours_nyc.head()

Unnamed: 0,number_of_persons_injured,hour
0,0,22
1,1,0
2,0,19
3,0,18
4,0,16


In [29]:
X= hours_nyc.drop("number_of_persons_injured",axis=1)
y= hours_nyc["number_of_persons_injured"]

In [30]:
from sklearn.linear_model import LogisticRegression

predictors, outcome = X, y  # reuse the predictors and outcome defined in In [29]
logisticRegr = LogisticRegression()

logisticRegr.fit(predictors, outcome)
predictions = logisticRegr.predict(predictors)



Let's see how we did.

In [31]:
model_outcome = pd.DataFrame({"prediction": predictions, "actual": outcome})

model_outcome.head()

Unnamed: 0,prediction,actual
0,0,0
1,0,1
2,0,0
3,0,0
4,0,0


And what about our stats...

In [32]:
(model_outcome['prediction'] == model_outcome['actual']).value_counts()

True     275110
False     58866
dtype: int64

Not bad.

In [33]:
from sklearn.metrics import accuracy_score

accuracy_score(model_outcome['actual'], model_outcome['prediction'])

0.8237418257599348

82% accuracy is pretty good! But...

In [34]:
model_outcome['prediction'].value_counts()

0    333976
Name: prediction, dtype: int64

Well... my model simply guessed "0" every single time. That is a good guess, since my sample has many more 0s than 1s: the 82% accuracy is just the share of zero-injury crashes (275,110 of 333,976). This shows a common complication in machine learning: the need to handle imbalanced samples.
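
One common way to handle this is to reweight the classes during fitting. A minimal sketch comparing a plain logistic regression with one that uses scikit-learn's `class_weight="balanced"` option; the data is synthetic and only loosely mimics the imbalance here, and the evening effect built into it is an invented assumption, not a finding from the crash data. The baseline line uses the counts reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Accuracy of always guessing "0", from the counts in In [32]
baseline = 275110 / 333976
print(round(baseline, 4))

# Synthetic, imbalanced data: positives are rare and somewhat more
# likely in the evening (an invented effect, for illustration only)
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=(2000, 1))
y = (rng.random(2000) < 0.10 + 0.15 * (hour[:, 0] >= 18)).astype(int)

plain = LogisticRegression().fit(hour, y)
balanced = LogisticRegression(class_weight="balanced").fit(hour, y)

# Share of rows each model labels as positive
plain_share = plain.predict(hour).mean()
balanced_share = balanced.predict(hour).mean()
print(plain_share, balanced_share)
```

Reweighting usually lowers raw accuracy, so a metric such as recall or a confusion matrix is a better way to judge the result than accuracy alone.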

## Thank you for reading!