# NYPD Motor Vehicle Collision Data Prep<a id='Top'></a>

### Overview

The "Motor Vehicle Collisions - Crashes" dataset available through New York City's Open Data program contains information about reported vehicle crashes in NYC. Each row contains details on a single crash event. 

The data contains records from 2012 to the present and is updated daily. At the time of this writing, there are about 1.62 million rows, each representing a crash event, and 29 columns. 

In this notebook we will analyze this data as follows:

1. [Importing](#Importing)
2. [Understanding](#Understanding)
    - [Column Contents](#column_contents)
    - [Descriptive Statistics](#descriptive_statistics)
    - [Columns Missing Data](#empty)
3. [Transforming](#Transforming)
    - [Dropping Columns](#Drop)
    - [Renaming Data](#Renaming)
    - [Redundant Columns](#Redundant)
    - [Data Types](#data_type)
    - [Categorizing](#categorizing)
4. [Analyzing](#Analyzing)
5. [Statistical Analysis](#statistical_analysis)
    - [ANOVA](#anova)
    - [2 Sample T-Test](#2ttest)
    - [Chi-Square](#chi-sq)
6. [Visualizing](#Visualizations)
    - [Number of Deaths by Borough](#Fataities_by_borough) 
    - [Number of Crashes by Hour](#Crashes_by_hour)
    - [Number of Accidents by Season](#accidents_by_season)
    - [Fatal Car Crash Locations](#car_crash_locations)
    - [Contributing Factor Trends](#contributing_factor_trends)
    - [Contributing Factors to Crash Fatalities](#Factor_Bar_Plot)
    - [Fatalities to Pedestrians vs Cyclists vs Motorist](#Fataity_Grouped_Series)
    - [Fatal Crash Frequency Over Time](#Fatality_Time_Scatterplot)
    - [Crash Factor Percentages in Queens](#queens_crash_causes)
7. [End of Document](#Bottom)


* The dataset can be found by following this link: https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions-Crashes/h9gi-nx95

### Importing the Data<a id='Importing'></a>

Let's begin by importing a few libraries we will use later in the notebook, and then load up to two million rows of the NYPD Motor Vehicle Collision data using pandas. We deliberately over-estimate the row limit to leave room for more data if this notebook is run in the future.
<br><div style="text-align: right">[Beginning of the page](#Top)</div>

In [1]:
import numpy as np
import pandas as pd
import datetime as dt
import sys
from IPython.display import display, HTML

In [2]:
datanyc = pd.read_csv("https://data.cityofnewyork.us/resource/h9gi-nx95.csv?$limit=2000000", low_memory=False)
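
The row limit in the URL above is the only part likely to change between runs. As a minimal sketch, the URL could be assembled from a configurable limit; `build_collisions_url` is a hypothetical helper introduced here for illustration, not part of the original notebook.

```python
from urllib.parse import urlencode

# Endpoint taken from the read_csv call above
BASE_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.csv"

def build_collisions_url(row_limit=2_000_000):
    """Return the CSV endpoint with a $limit query parameter (hypothetical helper)."""
    # safe="$" keeps the dollar sign literal instead of percent-encoding it
    query = urlencode({"$limit": row_limit}, safe="$")
    return f"{BASE_URL}?{query}"

url = build_collisions_url()
print(url)
```

Passing the result to `pd.read_csv(url, low_memory=False)` would then behave like the cell above.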

And let's pull up the data dictionary supplied by the Open Data website for reference.

In [3]:
data_dict = pd.read_excel("https://data.cityofnewyork.us/api/views/h9gi-nx95/files/2e58023a-21a6-4c76-b9e8-0101bf7509ca?download=true&filename=MVCollisionsDataDictionary.xlsx",
                         sheet_name='Column Info',  header=1)
data_dict.head(30)

Unnamed: 0,Table Name,Column Name,Column Description,Primary Key or Foreign Key,"Additional Notes (where applicable, includes the range of possible values, units of measure, how to interpret null/zero values, whether there are specific relationships between columns, and/or information on column source)"
0,MV-Collisions - Crash,UNIQUE_ID,Unique record code generated by system,Primary Key for the crash table,
1,MV-Collisions - Crash,ACCIDENT_DATE,Occurrence date of collision,,
2,MV-Collisions - Crash,ACCIDENT_TIME,Occurrence time of collision,,
3,MV-Collisions - Crash,BOROUGH,Borough where collision occurred,,
4,MV-Collisions - Crash,ZIP CODE,Postal code of incident occurrence,,
5,MV-Collisions - Crash,LATITUDE,Latitude coordinate for Global Coordinate Syst...,,
6,MV-Collisions - Crash,LONGITUDE,Longitude coordinate for Global Coordinate Sys...,,
7,MV-Collisions - Crash,LOCATION,"Latitude , Longitude pair",,
8,MV-Collisions - Crash,ON STREET NAME,Street on which the collision occurred,,
9,MV-Collisions - Crash,CROSS STREET NAME,Nearest cross street to the collision,,


### Understanding the Data <a id='Understanding'></a>
Let's look at the first few rows of the dataset. <br><div style="text-align: right">[Beginning of the page](#Top)</div>

In [4]:
pd.set_option('display.max_columns', None) # This allows us to view all columns in a dataframe when called
pd.set_option('display.max_rows', 200) # This returns 200 rows at max to prevent accidents when writing code
datanyc.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2015-01-15T00:00:00.000,15:20,,,,,,,,,0.0,0.0,0,0,0,0,0,0,Fatigued/Drowsy,Unspecified,,,,3153579,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
1,2019-12-07T00:00:00.000,10:00,,,,,,CROSS ISLAND PARKWAY,,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4253585,Sedan,Sedan,,,
2,2019-12-07T00:00:00.000,19:22,,,,,,VERRAZANO BRIDGE UPPER,,,0.0,0.0,0,0,0,0,0,0,Passing or Lane Usage Improper,Unspecified,,,,4254813,Sedan,,,,
3,2015-01-15T00:00:00.000,4:00,,,,,,,,,1.0,0.0,0,0,0,0,1,0,Unspecified,Unspecified,,,,3153538,TAXI,PASSENGER VEHICLE,,,
4,2015-01-15T00:00:00.000,9:35,,,40.804068,-73.931154,POINT (-73.9311544 40.8040684),,,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,3153217,OTHER,OTHER,,,


... And get overall information about the contents of the data. <a id='column_contents'></a>

In [5]:
pd.options.display.max_info_rows = 2000000
datanyc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1618008 entries, 0 to 1618007
Data columns (total 29 columns):
crash_date                       1618008 non-null object
crash_time                       1618008 non-null object
borough                          1126776 non-null object
zip_code                         1126579 non-null object
latitude                         1420654 non-null float64
longitude                        1420654 non-null float64
location                         1420654 non-null object
on_street_name                   1300439 non-null object
off_street_name                  1074461 non-null object
cross_street_name                225404 non-null object
number_of_persons_injured        1617991 non-null float64
number_of_persons_killed         1617977 non-null float64
number_of_pedestrians_injured    1618008 non-null int64
number_of_pedestrians_killed     1618008 non-null int64
number_of_cyclist_injured        1618008 non-null int64
number_of_cyclist_killed        

Each column should contain approximately 1.6 million values, though some columns have considerably fewer entries. Let's find the percentage of missing values in each column and see which columns are missing the most. To do so, we take the mean of the null mask for each column (the fraction of missing values), round it to four decimal places, and multiply by 100 to get a percentage with two decimals.

In [6]:
pd.set_option('display.max_columns', 29)
datanyc.isnull().mean().round(4) * 100

crash_date                        0.00
crash_time                        0.00
borough                          30.36
zip_code                         30.37
latitude                         12.20
longitude                        12.20
location                         12.20
on_street_name                   19.63
off_street_name                  33.59
cross_street_name                86.07
number_of_persons_injured         0.00
number_of_persons_killed          0.00
number_of_pedestrians_injured     0.00
number_of_pedestrians_killed      0.00
number_of_cyclist_injured         0.00
number_of_cyclist_killed          0.00
number_of_motorist_injured        0.00
number_of_motorist_killed         0.00
contributing_factor_vehicle_1     0.26
contributing_factor_vehicle_2    13.47
contributing_factor_vehicle_3    93.53
contributing_factor_vehicle_4    98.65
contributing_factor_vehicle_5    99.66
collision_id                      0.00
vehicle_type_code1                0.34
vehicle_type_code2       

Wow! Some columns have a lot of missing values. 

For some it makes sense. After looking at the data dictionary, `contributing_factor_vehicle_2` or `contributing_factor_vehicle_3` seem like they may be missing because there were no second or third contributing factors to the collision. 

It looks like `contributing_factor_vehicle_3`, `contributing_factor_vehicle_4`, `contributing_factor_vehicle_5` and `vehicle_type_code_3`, `vehicle_type_code_4`, `vehicle_type_code_5` have very few values compared to the others. We will take a closer look at them when we start transforming our data.
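
The percentages above work because the mean of a boolean column is the fraction of `True` values. A minimal sketch on a toy frame (the values are made up for illustration) shows the same calculation, sorted so the emptiest columns come first:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for datanyc; values are made up for illustration
toy = pd.DataFrame({
    "borough": ["QUEENS", None, "BRONX", None],
    "latitude": [40.7, np.nan, 40.8, 40.6],
    "collision_id": [1, 2, 3, 4],
})

# isnull() marks each missing cell as True; the column mean of that mask
# is the fraction missing, which is rounded and scaled to a percentage
missing_pct = (toy.isnull().mean().round(4) * 100).sort_values(ascending=False)
print(missing_pct)
```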

For now, we will use the `describe` method to generate some descriptive statistics. It can summarize both numeric and object columns, although by default it reports only the numeric ones, and it may point out any glaring holes in the data. <a id='descriptive_statistics'></a>

In [7]:
datanyc.describe()

Unnamed: 0,latitude,longitude,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,collision_id
count,1420654.0,1420654.0,1617991.0,1617977.0,1618008.0,1618008.0,1618008.0,1618008.0,1618008.0,1618008.0,1618008.0
mean,40.69211,-73.8726,0.2627926,0.001167507,0.05062954,0.0006316409,0.02074773,8.405397e-05,0.1915559,0.0004542623,2785430.0
std,1.141132,2.347086,0.6601335,0.03610199,0.2318293,0.02570814,0.1437003,0.009234882,0.6224953,0.02317054,1505219.0
min,0.0,-201.36,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0
25%,40.6688,-73.9772,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1031953.0
50%,40.72257,-73.92975,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3445130.0
75%,40.76797,-73.86688,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3849863.0
max,43.34444,0.0,43.0,8.0,27.0,6.0,4.0,2.0,43.0,5.0,4255234.0
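
Text columns can be summarized as well. The sketch below, on a made-up toy frame, passes `include="object"` to `describe` to get counts, the number of unique values, and the most frequent value; the column is created with an explicit `object` dtype to mirror how the text columns were loaded above.

```python
import pandas as pd

# Toy frame; the text column uses object dtype, like the text columns in datanyc
toy = pd.DataFrame({
    "borough": pd.Series(["QUEENS", "QUEENS", "BRONX", None], dtype=object),
    "number_of_persons_injured": [0, 1, 0, 2],
})

numeric_summary = toy.describe()                # numeric columns only (the default)
text_summary = toy.describe(include="object")   # count, unique, top, freq

print(text_summary)
```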


Averages and standard deviations obviously don't tell us much about latitude and longitude, but why don't those two columns have full counts? The remaining columns don't appear to have obvious problems.
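
The coordinate extremes in the summary (a latitude of 0.0, longitudes of -201.36 and 0.0) also fall far outside New York City. One way to flag such rows is a bounding-box check. This is only a sketch: the box limits below are rough assumptions chosen for illustration, and the toy rows reuse the extremes from the table rather than real records.

```python
import numpy as np
import pandas as pd

# Rough bounding box around NYC; these limits are assumptions for illustration
LAT_MIN, LAT_MAX = 40.4, 41.0
LON_MIN, LON_MAX = -74.3, -73.6

# Toy rows: one plausible point, one (0, 0) point, one missing, one far outside
toy = pd.DataFrame({
    "latitude": [40.804068, 0.0, np.nan, 43.34444],
    "longitude": [-73.931154, 0.0, np.nan, -201.36],
})

in_box = (toy["latitude"].between(LAT_MIN, LAT_MAX)
          & toy["longitude"].between(LON_MIN, LON_MAX))
has_coords = toy["latitude"].notna() & toy["longitude"].notna()

# Rows that have coordinates but fall outside the assumed box
suspect = toy[has_coords & ~in_box]
print(suspect)
```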

Let's take a look at some of the rows with an empty `latitude` value, using the `isnull` function. <a id='empty'></a>

In [8]:
datanyc[datanyc['latitude'].isnull()].head(20)

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2015-01-15T00:00:00.000,15:20,,,,,,,,,0.0,0.0,0,0,0,0,0,0,Fatigued/Drowsy,Unspecified,,,,3153579,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
1,2019-12-07T00:00:00.000,10:00,,,,,,CROSS ISLAND PARKWAY,,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4253585,Sedan,Sedan,,,
2,2019-12-07T00:00:00.000,19:22,,,,,,VERRAZANO BRIDGE UPPER,,,0.0,0.0,0,0,0,0,0,0,Passing or Lane Usage Improper,Unspecified,,,,4254813,Sedan,,,,
3,2015-01-15T00:00:00.000,4:00,,,,,,,,,1.0,0.0,0,0,0,0,1,0,Unspecified,Unspecified,,,,3153538,TAXI,PASSENGER VEHICLE,,,
5,2018-12-27T00:00:00.000,9:00,MANHATTAN,10019.0,,,,12th ave,55th street,,0.0,0.0,0,0,0,0,0,0,Other Vehicular,Other Vehicular,,,,4052890,Van,Sedan,,,
6,2019-12-07T00:00:00.000,5:42,QUEENS,11434.0,,,,Rockaway boulevard,Guy R Brewer Boulevard,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4253491,Station Wagon/Sport Utility Vehicle,Sedan,,,
7,2019-12-06T00:00:00.000,10:35,,,,,,FDR DRIVE,,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4253284,Sedan,Pick-up Truck,,,
8,2018-12-26T00:00:00.000,13:13,,,,,,BRONX WHITESTONE BRIDGE,,,2.0,0.0,0,0,0,0,2,0,Following Too Closely,Unspecified,Unspecified,,,4054589,Station Wagon/Sport Utility Vehicle,Van,Sedan,,
9,2019-12-06T00:00:00.000,13:55,MANHATTAN,10019.0,,,,WEST 57 STREET,BROADWAY,,0.0,0.0,0,0,0,0,0,0,Other Vehicular,View Obstructed/Limited,,,,4253292,Sedan,Box Truck,,,
10,2019-12-06T00:00:00.000,14:35,,,,,,VANWYCK EXPRESSWAY,ROCKAWAY BOULEVARD,,0.0,0.0,0,0,0,0,0,0,Aggressive Driving/Road Rage,Unspecified,,,,4254141,Sedan,Sedan,,,


It seems like those rows otherwise contain valid data. We won't delete them, since the injury and fatality data may be useful, but we will exclude them from any location-based analysis.

If we *really* had some time, we would write or find a program to geocode the `on_street_name` values and fill in the missing location fields.

For now, we'll pivot to take a closer look at vehicle types.

In [9]:
datanyc['vehicle_type_code1'].value_counts().head(20)

PASSENGER VEHICLE                      715236
SPORT UTILITY / STATION WAGON          313500
Sedan                                  161163
Station Wagon/Sport Utility Vehicle    130790
TAXI                                    50670
VAN                                     26540
OTHER                                   23982
PICK-UP TRUCK                           23069
UNKNOWN                                 19929
Taxi                                    16728
SMALL COM VEH(4 TIRES)                  14559
LARGE COM VEH(6 OR MORE TIRES)          14527
BUS                                     14057
Pick-up Truck                           10826
LIVERY VEHICLE                          10481
Box Truck                                8424
Bus                                      6952
MOTORCYCLE                               6536
BICYCLE                                  5568
Bike                                     4027
Name: vehicle_type_code1, dtype: int64

In [10]:
datanyc['vehicle_type_code_3'].value_counts().head(20)

PASSENGER VEHICLE                      63655
SPORT UTILITY / STATION WAGON          33161
Sedan                                  10923
Station Wagon/Sport Utility Vehicle     9053
UNKNOWN                                 3285
TAXI                                    3218
PICK-UP TRUCK                           2292
VAN                                     1489
OTHER                                   1108
Taxi                                     695
Pick-up Truck                            609
BICYCLE                                  533
SMALL COM VEH(4 TIRES)                   479
MOTORCYCLE                               464
LARGE COM VEH(6 OR MORE TIRES)           448
LIVERY VEHICLE                           424
BUS                                      403
Box Truck                                202
Bus                                      127
Motorcycle                               101
Name: vehicle_type_code_3, dtype: int64

It looks like these columns were free-text entry fields rather than select fields. It also appears that codes 2 through 5 have significantly fewer values than code 1. The data dictionary doesn't clarify definitively, but we believe these may represent the additional vehicles involved in a crash. The relative emptiness of codes 2 through 5 will likely lead us to primarily analyze `vehicle_type_code1`.
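
Many of the duplicates in the counts above differ only in capitalization. As a minimal sketch, trimming whitespace and upper-casing collapses those variants before any manual mapping; the sample values are of the kind seen above, and the trailing space in one of them is made up to illustrate the trimming step.

```python
import pandas as pd

# Spellings of the kind seen in the value counts (the trailing space is made up)
vehicle_types = pd.Series(["TAXI", "Taxi", "VAN", "van ", "Bus", "BUS", None])

# Trim whitespace and upper-case so that case variants collapse together
normalized = vehicle_types.str.strip().str.upper()
counts = normalized.value_counts()
print(counts)
```

This does not merge true synonyms such as "Bike" and "BICYCLE", which still need an explicit mapping.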

We believe we have a decent understanding of our data. In the next section we will modify the column names to standardize the dataset, deal with missing values, clean up duplicated values, and generally get our dataset to the point where we can use it confidently.

## Transforming the Data <a id='Transforming'></a>
### What needs attention
####  [Dropping Columns](#Drop)
* Some columns (such as `vehicle_type_code_4`, `contributing_factor_vehicle_5`) are nearly entirely empty. We'll remove those. 
* We will not be using some columns (e.g. `collision_id`, `on_street_name`, `off_street_name`, `cross_street_name`) so we can drop them completely. 


#### [Renaming Data](#Renaming)
* Cleaning and combining duplicate category values
* Renaming some columns
* Correcting misspellings
* Dealing with missing values
* Some dtype changes

#### [Redundant Columns](#Redundant)
* The `latitude` and `longitude` values also appear together in the `location` column. We like keeping the two values separate for now, so we can probably remove `location` later.
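
Before removing `location`, it may be worth confirming that it really carries the same coordinates. The sketch below parses the `POINT (longitude latitude)` text seen in the sample rows and compares it with the separate columns on a toy frame; the tolerance of `1e-5` is an assumption to allow for rounding.

```python
import pandas as pd

# Toy frame using the one populated location from the sample rows above
toy = pd.DataFrame({
    "latitude": [40.804068, None],
    "longitude": [-73.931154, None],
    "location": ["POINT (-73.9311544 40.8040684)", None],
})

# The point text lists longitude first, then latitude
pattern = r"POINT \((?P<lon>-?[\d.]+) (?P<lat>-?[\d.]+)\)"
parsed = toy["location"].str.extract(pattern).astype(float)

# Compare only rows that have a location, allowing for rounding differences
has_loc = toy["location"].notna()
same_lat = (parsed.loc[has_loc, "lat"] - toy.loc[has_loc, "latitude"]).abs() < 1e-5
same_lon = (parsed.loc[has_loc, "lon"] - toy.loc[has_loc, "longitude"]).abs() < 1e-5
location_is_redundant = bool((same_lat & same_lon).all())
print(location_is_redundant)
```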

#### [Data Types](#data_type)
* The values that we expect to be a 'datetime' type are an 'object' type (`crash_date` and `crash_time` columns). We'll fix those.
* We will change the data type of zip code to string.
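
A minimal sketch of both conversions on toy rows follows. The combined `crash_datetime` column name is hypothetical, and the approach (zero-padding the time and parsing date and time together) is one option among several, not necessarily the one used later in this notebook.

```python
import pandas as pd

# Toy rows in the same shape as the raw data
toy = pd.DataFrame({
    "crash_date": ["2015-01-15T00:00:00.000", "2019-12-07T00:00:00.000"],
    "crash_time": ["4:00", "19:22"],
    "zip_code": [10019.0, None],
})

# Keep the date part, zero-pad the time ("4:00" -> "04:00"), then parse together
date_part = toy["crash_date"].str[:10]
time_part = toy["crash_time"].str.zfill(5)
toy["crash_datetime"] = pd.to_datetime(date_part + " " + time_part,
                                       format="%Y-%m-%d %H:%M")

# ZIP codes are labels, not quantities: drop the trailing ".0" and store as text
toy["zip_code"] = (pd.to_numeric(toy["zip_code"], errors="coerce")
                   .astype("Int64")
                   .astype("string"))
print(toy.dtypes)
```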

#### [Categorizing](#categorizing)
* We are curious to see if collisions go up seasonally, so we'll make a new variable that bins the collisions by Spring (March, April, May), Summer (June, July, August), Fall (September, October, November), and Winter (December, January, February).
<br><div style="text-align: right">[Beginning of the page](#Top)</div>
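
The seasonal binning described above can be sketched as a lookup from month number to season. The `SEASON_BY_MONTH` name and the sample dates are illustrative assumptions.

```python
import pandas as pd

# Month number -> season, following the definitions given above
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Sample dates for illustration
dates = pd.to_datetime(pd.Series(["2015-01-15", "2018-04-02", "2019-07-04", "2019-12-07"]))

# Extract the month and look up its season
season = dates.dt.month.map(SEASON_BY_MONTH)
print(season.tolist())
```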

### Dropping Columns  <a id='Drop'></a>
We'll begin by removing the sparsest columns, keeping only those with non-null values in at least 30% of rows (that is, columns missing no more than 70% of their values). We can also drop some columns we know we're not going to use. Those operations are simple enough that we'll do them all before checking in again on the DataFrame.

In [11]:
# thresh is the minimum number of non-null values a column needs in order to be kept
clean_nyc = datanyc.dropna(thresh=int(0.30 * datanyc.shape[0]), axis=1).copy()

In [12]:
clean_nyc.drop(columns=["collision_id", "on_street_name", "off_street_name"], inplace=True)
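
Because `thresh` counts the non-null values required to keep a column, a 30% threshold retains columns that are up to 70% empty. A small sketch on a made-up frame makes this concrete:

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows; values are made up for illustration
toy = pd.DataFrame({
    "mostly_full": [1, 2, 3, np.nan, 5, 6, 7, 8, 9, 10],   # 90% present
    "half_empty": [1, 2, 3, 4, 5] + [np.nan] * 5,          # 50% present
    "sparse": [1] + [np.nan] * 9,                          # 10% present
})

# Keep columns with non-null values in at least 30% of rows
min_present = int(0.30 * len(toy))
kept = toy.dropna(thresh=min_present, axis=1)
print(list(kept.columns))
```

Note that `half_empty` survives even though half of its values are missing.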

Let's take a peek at what `clean_nyc` looks like now, as far as data types and number of columns (and values in those columns):

In [13]:
clean_nyc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1618008 entries, 0 to 1618007
Data columns (total 19 columns):
crash_date                       1618008 non-null object
crash_time                       1618008 non-null object
borough                          1126776 non-null object
zip_code                         1126579 non-null object
latitude                         1420654 non-null float64
longitude                        1420654 non-null float64
location                         1420654 non-null object
number_of_persons_injured        1617991 non-null float64
number_of_persons_killed         1617977 non-null float64
number_of_pedestrians_injured    1618008 non-null int64
number_of_pedestrians_killed     1618008 non-null int64
number_of_cyclist_injured        1618008 non-null int64
number_of_cyclist_killed         1618008 non-null int64
number_of_motorist_injured       1618008 non-null int64
number_of_motorist_killed        1618008 non-null int64
contributing_factor_vehicle_1    1

And what about the percentage of the missing values now?

In [14]:
pd.set_option('display.max_columns', 29)
clean_nyc.isnull().mean().round(4) * 100

crash_date                        0.00
crash_time                        0.00
borough                          30.36
zip_code                         30.37
latitude                         12.20
longitude                        12.20
location                         12.20
number_of_persons_injured         0.00
number_of_persons_killed          0.00
number_of_pedestrians_injured     0.00
number_of_pedestrians_killed      0.00
number_of_cyclist_injured         0.00
number_of_cyclist_killed          0.00
number_of_motorist_injured        0.00
number_of_motorist_killed         0.00
contributing_factor_vehicle_1     0.26
contributing_factor_vehicle_2    13.47
vehicle_type_code1                0.34
vehicle_type_code2               16.53
dtype: float64

So far, so good.

### Correcting Misspellings and Renaming  <a id='Renaming'></a>

We will modify the column names to standardize the dataset using the rename function.

In [15]:
clean_nyc.rename(columns={'vehicle_type_code1':'vehicle_type_code_1',
                        'vehicle_type_code2':'vehicle_type_code_2',
                       }, 
               inplace=True)

Let's take a closer look at `vehicle_type_code_1`.

In [16]:
clean_nyc['vehicle_type_code_1'].value_counts().head(40)

PASSENGER VEHICLE                      715236
SPORT UTILITY / STATION WAGON          313500
Sedan                                  161163
Station Wagon/Sport Utility Vehicle    130790
TAXI                                    50670
VAN                                     26540
OTHER                                   23982
PICK-UP TRUCK                           23069
UNKNOWN                                 19929
Taxi                                    16728
SMALL COM VEH(4 TIRES)                  14559
LARGE COM VEH(6 OR MORE TIRES)          14527
BUS                                     14057
Pick-up Truck                           10826
LIVERY VEHICLE                          10481
Box Truck                                8424
Bus                                      6952
MOTORCYCLE                               6536
BICYCLE                                  5568
Bike                                     4027
Tractor Truck Diesel                     3667
Van                               

It looks like there are misspellings and duplicates. Let's see if we can combine some of the obvious misspellings.

In [17]:
# Map spelling and capitalization variants to a single label per vehicle type
vehicle_type_fixes = {
    'SPORT UTILITY / STATION WAGON': 'SUV',
    'Station Wagon/Sport Utility Vehicle': 'SUV',
    'TAXI': 'Taxi',
    'taxi': 'Taxi',
    'taxy': 'Taxi',
    'CAB': 'Taxi',
    'Cab': 'Taxi',
    'Bike': 'BICYCLE',
    'VAN': 'Van',
    'van': 'Van',
    'VN': 'Van',
    'VAN T': 'Van',
    'VAN/T': 'Van',
    'van t': 'Van',
    'Refrigerated Van': 'Van',
    'Motorscooter': 'SCOOTER',
    'Moped': 'SCOOTER',
    'MOTORCYCLE': 'Motorcycle',
    'Motorbike': 'Motorcycle',
    'AMBULANCE': 'Ambulance',
    'AMBUL': 'Ambulance',
    'Ambul': 'Ambulance',
    'ambul': 'Ambulance',
    'Ambu': 'Ambulance',
    'AMB': 'Ambulance',
    'AM': 'Ambulance',
    'PICK-UP TRUCK': 'Pick-up Truck',
    'Fire': 'FIRE TRUCK',
    'fire': 'FIRE TRUCK',
    'FIRE': 'FIRE TRUCK',
    'FIRET': 'FIRE TRUCK',
    'FDNY': 'FIRE TRUCK',
    'BUS': 'Bus',
    'Box T': 'Box Truck',
    'GARBA': 'Dump',
    'Garbage or Refuse': 'Dump',
    'CONV': 'Convertible',
    'Other': 'Unknown',
    'OTHER': 'UNKNOWN',
}
# Assign the result back rather than calling replace(..., inplace=True) on a selected column
clean_nyc['vehicle_type_code_1'] = clean_nyc['vehicle_type_code_1'].replace(vehicle_type_fixes)
clean_nyc['vehicle_type_code_1'].value_counts().head(50)

PASSENGER VEHICLE                 715236
SUV                               444296
Sedan                             161163
Taxi                               67399
UNKNOWN                            43911
Pick-up Truck                      33895
Van                                30450
Bus                                21009
SMALL COM VEH(4 TIRES)             14559
LARGE COM VEH(6 OR MORE TIRES)     14527
LIVERY VEHICLE                     10481
BICYCLE                             9595
Motorcycle                          8695
Box Truck                           8427
Ambulance                           4046
Tractor Truck Diesel                3667
TK                                  2485
BU                                  2229
Dump                                2018
Convertible                         1780
FIRE TRUCK                          1061
DS                                  1006
4 dr sedan                           907
PK                                   854
Flat Bed        

Even in a single column we can see how much variation there is. We would suggest that the dataset's maintainers replace this free-text field with a select field backed by a predetermined list to improve data fidelity, and we hope that by this point you can see why.

Now let's look at `contributing_factor_vehicle_1` and `contributing_factor_vehicle_2`.

In [18]:
clean_nyc['contributing_factor_vehicle_1'].unique()

array(['Fatigued/Drowsy', 'Unspecified', 'Passing or Lane Usage Improper',
       'Other Vehicular', 'Driver Inattention/Distraction',
       'Following Too Closely', 'Aggressive Driving/Road Rage', nan,
       'Alcohol Involvement', 'Oversized Vehicle', 'Unsafe Lane Changing',
       'Backing Unsafely', 'Animals Action',
       'Failure to Yield Right-of-Way', 'Pavement Slippery',
       'Lost Consciousness', 'Other Electronic Device', 'Unsafe Speed',
       'Turning Improperly', 'Reaction to Uninvolved Vehicle',
       'Failure to Keep Right', 'Traffic Control Disregarded',
       'Drugs (illegal)', 'Outside Car Distraction',
       'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion',
       'View Obstructed/Limited', 'Driver Inexperience',
       'Obstruction/Debris', 'Passing Too Closely', 'Pavement Defective',
       'Glare', 'Shoulders Defective/Improper', 'Eating or Drinking',
       'Passenger Distraction', 'Prescription Medication',
       'Steering Failure', 'Physical Dis

In [19]:
clean_nyc['contributing_factor_vehicle_2'].unique()

array(['Unspecified', 'Other Vehicular', 'View Obstructed/Limited', nan,
       'Driver Inattention/Distraction', 'Unsafe Lane Changing',
       'Fatigued/Drowsy', 'Following Too Closely',
       'Passing or Lane Usage Improper', 'Backing Unsafely',
       'Alcohol Involvement', 'Turning Improperly',
       'Failure to Yield Right-of-Way', 'Lost Consciousness',
       'Oversized Vehicle', 'Passing Too Closely',
       'Traffic Control Disregarded', 'Outside Car Distraction',
       'Driver Inexperience', 'Obstruction/Debris', 'Pavement Slippery',
       'Aggressive Driving/Road Rage', 'Brakes Defective',
       'Physical Disability', 'Unsafe Speed',
       'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion',
       'Traffic Control Device Improper/Non-Working',
       'Reaction to Uninvolved Vehicle', 'Passenger Distraction',
       'Failure to Keep Right', 'Driverless/Runaway Vehicle',
       'Lane Marking Improper/Inadequate', 'Pavement Defective',
       'Tow Hitch Defective', '

The two columns share similar factors, and both have many unique values. In this project we will focus on `contributing_factor_vehicle_1` only.

We'll now group the values in `contributing_factor_vehicle_1` into broader categories to make our analysis a little easier.

In [20]:
clean_nyc['contributing_factor_vehicle_1'].replace({'Backing Unsafely': 'Traffic Recklessness', 
                                                  'Unsafe Speed': 'Traffic Recklessness', 
                                                 'Passing or Lane Usage Improper': 'Traffic Recklessness',
                                                 'Turning Improperly': 'Traffic Recklessness',
                                                 'Following Too Closely': 'Traffic Recklessness',
                                                 'Passing Too Closely' : 'Traffic Recklessness',
                                                 'Outside Car Distraction': 'Traffic Recklessness',
                                                 'Steering Failure': 'Traffic Recklessness',
                                                 'Reaction to Uninvolved Vehicle': 'Traffic Recklessness',
                                                 'Traffic Control Disregarded': 'Traffic Recklessness',
                                                 'Failure to Yield Right-of-Way': 'Traffic Recklessness',
                                                 'Aggressive Driving/Road Rage': 'Traffic Recklessness',
                                                 'Unsafe Lane Changing': 'Traffic Recklessness',
                                                 'Driver Inexperience': 'Traffic Recklessness',
                                                  
                                                 'Passenger Distraction': 'Driver Inattention/Distraction',
                                                 'Failure to Keep Right': 'Driver Inattention/Distraction',
                                                 'Eating or Drinking': 'Driver Inattention/Distraction',
                                                 'Animals Action': 'Driver Inattention/Distraction',
                                                 'Using On Board Navigation Device': 'Driver Inattention/Distraction',
                                                 'Reaction to Other Uninvolved Vehicle': 'Driver Inattention/Distraction',
                                                 'Cell Phone (hands-free)': 'Driver Inattention/Distraction',
                                                 'Cell Phone (hand-Held)': 'Driver Inattention/Distraction',
                                                 'Other Electronic Device': 'Driver Inattention/Distraction',
                                                 'Cell Phone (hand-held)': 'Driver Inattention/Distraction',
                                                 'Texting': 'Driver Inattention/Distraction',
                                                 'Listening/Using Headphones': 'Driver Inattention/Distraction',
                                                 'Fatigued/Drowsy': 'Driver Inattention/Distraction',
                                                 'Fell Asleep': 'Driver Inattention/Distraction',
                                                  
                                                  
                                                 'Brakes Defective': 'Car Defects',
                                                 'Tinted Windows': 'Car Defects',
                                                 'Tire Failure/Inadequate': 'Car Defects',
                                                 'Tow Hitch Defective': 'Car Defects',
                                                 'Headlights Defective': 'Car Defects',
                                                 'Accelerator Defective': 'Car Defects',
                                                 'Windshield Inadequate': 'Car Defects',
                                                 'Driverless/Runaway Vehicle': 'Car Defects',
                                                 'Oversized Vehicle': 'Car Defects',

                                                  
                                                 'Traffic Control Disregarded':'Road Defects',
                                                 'Glare':'Road Defects',
                                                 'Lane Marking Improper/Inadequate': 'Road Defects',
                                                 'View Obstructed/Limited': 'Road Defects',
                                                 'Pavement Defective': 'Road Defects',
                                                 'Other Lighting Defects': 'Road Defects',
                                                 'Obstruction/Debris': 'Road Defects',
                                                 'Traffic Control Device Improper/Non-Working': 'Road Defects',
                                                 'Shoulders Defective/Improper': 'Road Defects',
                                                 'Pavement Slippery': 'Road Defects',
                                                  
                                                 'Illnes': 'Illness',
                                                 'Lost Consciousness': 'Illness',
                                                 'Physical Disability': 'Illness',
                                                 'Prescription Medication': 'Illness',
                                                  
                                                 'Drugs (illegal)': 'Drugs (Illegal)',
                                                 'Alcohol Involvement': 'Drugs (Illegal)',
                                                  
                                                 'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion': 'Outside Error',
                                                 'Vehicle Vandalism': 'Outside Error',
                                                 'Other Vehicular': 'Outside Error',
                                                  
                                                 }, inplace=True)
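
A long literal dict like the one above silently keeps only the last value when a key is listed twice. The sketch below is one way to guard against that: it writes the grouping as category to raw labels and inverts it with a duplicate check. The names `GROUPS`, `invert_groups` and `factor_map` are hypothetical, and the grouping is abbreviated for illustration.

```python
import pandas as pd

# Hypothetical, abbreviated grouping: category -> raw factor labels.
GROUPS = {
    "Car Defects": ["Brakes Defective", "Tire Failure/Inadequate"],
    "Road Defects": ["Glare", "Pavement Slippery"],
    "Illness": ["Illnes", "Lost Consciousness"],
}


def invert_groups(groups):
    """Return a raw-label -> category dict, raising if a label repeats."""
    mapping = {}
    for category, labels in groups.items():
        for label in labels:
            if label in mapping:
                raise ValueError(
                    f"{label!r} is listed under both "
                    f"{mapping[label]!r} and {category!r}"
                )
            mapping[label] = category
    return mapping


factor_map = invert_groups(GROUPS)

# Toy series standing in for contributing_factor_vehicle_1.
factors = pd.Series(["Glare", "Illnes", "Unspecified"])
grouped = factors.replace(factor_map)  # labels not in the map are left as-is
print(grouped.tolist())
```

Writing the groups this way also keeps each category's labels in one place, which makes the list easier to review.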

In [21]:
clean_nyc['contributing_factor_vehicle_1'].unique()

array(['Driver Inattention/Distraction', 'Unspecified',
       'Traffic Recklessness', 'Outside Error', nan, 'Drugs (Illegal)',
       'Car Defects', 'Road Defects', 'Illness', '80', '1'], dtype=object)

We will drop '80' and '1' since we do not know what they represent. We will also drop the `nan` and 'Unspecified' values, since they carry no information. This will also make the visualizations easier.

In [22]:
nyc80 = clean_nyc[clean_nyc['contributing_factor_vehicle_1'] == '80' ].index
clean_nyc.drop(nyc80, inplace=True)

nyc1 = clean_nyc[clean_nyc['contributing_factor_vehicle_1'] == '1' ].index
clean_nyc.drop(nyc1, inplace=True)

dropunspecified = clean_nyc[clean_nyc['contributing_factor_vehicle_1'] == 'Unspecified' ].index
clean_nyc.drop(dropunspecified, inplace=True)

clean_nyc.dropna(subset = ['contributing_factor_vehicle_1'], how='all', inplace=True)

clean_nyc['contributing_factor_vehicle_1'].unique()

array(['Driver Inattention/Distraction', 'Traffic Recklessness',
       'Outside Error', 'Drugs (Illegal)', 'Car Defects', 'Road Defects',
       'Illness'], dtype=object)
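
The same filtering can also be expressed as a single boolean mask. This is a minimal sketch on a toy frame rather than the real data; the names `uninformative` and `keep` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for clean_nyc.
df = pd.DataFrame({
    "contributing_factor_vehicle_1": [
        "Illness", "80", "1", "Unspecified", np.nan, "Car Defects",
    ]
})

uninformative = ["80", "1", "Unspecified"]
col = df["contributing_factor_vehicle_1"]

# Keep rows whose factor is present and not one of the uninformative labels.
keep = col.notna() & ~col.isin(uninformative)
filtered = df[keep]

print(filtered["contributing_factor_vehicle_1"].tolist())
```

A single mask avoids the repeated `drop(..., inplace=True)` calls and makes it easy to add another label to the list later.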

Next, we will rename some of the columns to make things easier while analyzing the data.

In [23]:
clean_nyc.rename(columns={'number_of_persons_injured' : 'persons_injured',
                        'number_of_persons_killed' : 'persons_killed',
                        'number_of_pedestrians_injured' : 'pedestrians_injured',
                        'number_of_pedestrians_killed' : 'pedestrians_killed',
                        'number_of_cyclist_injured' : 'cyclist_injured',
                        'number_of_cyclist_killed' : 'cyclist_killed',
                        'number_of_motorist_injured'  : 'motorist_injured',
                        'number_of_motorist_killed' : 'motorist_killed'},inplace=True)
clean_nyc.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,persons_injured,persons_killed,pedestrians_injured,pedestrians_killed,cyclist_injured,cyclist_killed,motorist_injured,motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,vehicle_type_code_1,vehicle_type_code_2
0,2015-01-15T00:00:00.000,15:20,,,,,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,PASSENGER VEHICLE,PASSENGER VEHICLE
2,2019-12-07T00:00:00.000,19:22,,,,,,0.0,0.0,0,0,0,0,0,0,Traffic Recklessness,Unspecified,Sedan,
5,2018-12-27T00:00:00.000,9:00,MANHATTAN,10019.0,,,,0.0,0.0,0,0,0,0,0,0,Outside Error,Other Vehicular,Van,Sedan
7,2019-12-06T00:00:00.000,10:35,,,,,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,Sedan,Pick-up Truck
8,2018-12-26T00:00:00.000,13:13,,,,,,2.0,0.0,0,0,0,0,2,0,Traffic Recklessness,Unspecified,SUV,Van


Next, we will convert all string values to lowercase. A consistent format is easier to read, and it simplifies the analysis because we can type a value without having to remember how it was capitalized.

In [24]:
clean_nyc1 = clean_nyc.applymap(lambda s: s.lower() if isinstance(s, str) else s)
clean_nyc1.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,persons_injured,persons_killed,pedestrians_injured,pedestrians_killed,cyclist_injured,cyclist_killed,motorist_injured,motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,vehicle_type_code_1,vehicle_type_code_2
0,2015-01-15t00:00:00.000,15:20,,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,passenger vehicle,passenger vehicle
2,2019-12-07t00:00:00.000,19:22,,,,,,0.0,0.0,0,0,0,0,0,0,traffic recklessness,unspecified,sedan,
5,2018-12-27t00:00:00.000,9:00,manhattan,10019.0,,,,0.0,0.0,0,0,0,0,0,0,outside error,other vehicular,van,sedan
7,2019-12-06t00:00:00.000,10:35,,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,sedan,pick-up truck
8,2018-12-26t00:00:00.000,13:13,,,,,,2.0,0.0,0,0,0,0,2,0,traffic recklessness,unspecified,suv,van
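
Lowercasing every cell also touches the timestamp strings (note the `t` in `crash_date` above). If that is not wanted, one option is to lowercase only a chosen list of text columns. A minimal sketch on a toy frame, where the `text_cols` list is an assumption about which columns should be normalized:

```python
import pandas as pd

# Toy frame with a few of the notebook's columns.
df = pd.DataFrame({
    "crash_date": ["2018-12-27T00:00:00.000", "2019-12-07T00:00:00.000"],
    "borough": ["MANHATTAN", None],
    "vehicle_type_code_1": ["Van", "Sedan"],
    "persons_injured": [0.0, 2.0],
})

# Assumed list of free-text columns to normalize.
text_cols = ["borough", "vehicle_type_code_1"]

lowered = df.copy()
# .str.lower() leaves missing values missing.
lowered[text_cols] = lowered[text_cols].apply(lambda s: s.str.lower())

print(lowered)
```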


### Redundant Columns: 'latitude' and 'longitude' <a id='Redundant'></a>

We suspect the `location` column is simply a concatenation of `latitude` and `longitude` columns. Let's check the data dictionary to see whether we can gain some information about it.

In [25]:
data_dict[5:8]

Unnamed: 0,Table Name,Column Name,Column Description,Primary Key or Foreign Key,"Additional Notes (where applicable, includes the range of possible values, units of measure, how to interpret null/zero values, whether there are specific relationships between columns, and/or information on column source)"
5,MV-Collisions - Crash,LATITUDE,Latitude coordinate for Global Coordinate Syst...,,
6,MV-Collisions - Crash,LONGITUDE,Longitude coordinate for Global Coordinate Sys...,,
7,MV-Collisions - Crash,LOCATION,"Latitude , Longitude pair",,


When we look at the `LOCATION` column (row 7), we see that it is described as "Latitude , Longitude pair". We are probably right, but let's verify it further.

In [26]:
clean_nyc1[["latitude", "longitude", "location"]].head(30)

Unnamed: 0,latitude,longitude,location
0,,,
2,,,
5,,,
7,,,
8,,,
9,,,
10,,,
11,,,
17,,,
19,,,


That seems to be right, but before we do something we might regret, let's check whether all the `location` data follows the pattern we expect:

In [27]:
clean_nyc1['location'].str.match(r'POINT \(-7\d\.\d+ \d{2}\.\d+\)', na=False).value_counts()

False    1020354
Name: location, dtype: int64

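No row matches: all 1,020,354 come back `False`. That is more than we expected! Missing values alone should not account for every row, so the pattern itself may also be at fault. One likely reason is that we lowercased every string in the previous step, so the uppercase `POINT` literal can no longer match. Let's check them out.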

In [28]:
clean_nyc1[~clean_nyc1['location'].str.match(r'POINT \(-7\d\.\d+ \d{2}\.\d+\)', na=False)].head(30)

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,persons_injured,persons_killed,pedestrians_injured,pedestrians_killed,cyclist_injured,cyclist_killed,motorist_injured,motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,vehicle_type_code_1,vehicle_type_code_2
0,2015-01-15t00:00:00.000,15:20,,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,passenger vehicle,passenger vehicle
2,2019-12-07t00:00:00.000,19:22,,,,,,0.0,0.0,0,0,0,0,0,0,traffic recklessness,unspecified,sedan,
5,2018-12-27t00:00:00.000,9:00,manhattan,10019.0,,,,0.0,0.0,0,0,0,0,0,0,outside error,other vehicular,van,sedan
7,2019-12-06t00:00:00.000,10:35,,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,sedan,pick-up truck
8,2018-12-26t00:00:00.000,13:13,,,,,,2.0,0.0,0,0,0,0,2,0,traffic recklessness,unspecified,suv,van
9,2019-12-06t00:00:00.000,13:55,manhattan,10019.0,,,,0.0,0.0,0,0,0,0,0,0,outside error,view obstructed/limited,sedan,box truck
10,2019-12-06t00:00:00.000,14:35,,,,,,0.0,0.0,0,0,0,0,0,0,traffic recklessness,unspecified,sedan,sedan
11,2019-12-06t00:00:00.000,16:00,,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,suv,sedan
17,2015-01-14t00:00:00.000,2:45,,,,,,0.0,0.0,0,0,0,0,0,0,outside error,,passenger vehicle,
19,2018-12-25t00:00:00.000,10:50,,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,driver inattention/distraction,sedan,sedan


Ahh those sweet missing values... As we've seen earlier, the `latitude`, `longitude` and `location` columns each have about 12% of their values missing. As far as we can tell, and as the data dictionary points out, the `location` column is simply a concatenation of the `latitude` and `longitude` columns. Let's drop the `location` column.

In [29]:
clean_nyc1.drop(columns="location", inplace = True)
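
A stricter redundancy check, which could be run before a drop like the one above, is to parse the coordinates out of `location` and compare them with `latitude` and `longitude`. The sketch below uses a toy frame, and the `point (lon lat)` text format is an assumption based on the pattern used earlier, not something confirmed against the full data.

```python
import re

import numpy as np
import pandas as pd

# Toy frame; the "point (lon lat)" layout is an assumed format.
df = pd.DataFrame({
    "latitude": [40.7648, None, 40.6782],
    "longitude": [-73.9808, None, -73.9442],
    "location": [
        "point (-73.9808 40.7648)",
        None,
        "point (-73.9442 40.6782)",
    ],
})

# Case-insensitive, so it works before or after lowercasing.
pattern = r"point \((?P<lon>-?\d+\.\d+) (?P<lat>-?\d+\.\d+)\)"
parsed = df["location"].str.extract(pattern, flags=re.IGNORECASE).astype(float)

lat_ok = np.isclose(parsed["lat"].to_numpy(), df["latitude"].to_numpy())
lon_ok = np.isclose(parsed["lon"].to_numpy(), df["longitude"].to_numpy())
has_location = df["location"].notna().to_numpy()

# True only if every non-missing location agrees with both coordinate columns.
redundant = bool((lat_ok & lon_ok)[has_location].all())
print(redundant)
```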

Finally, let's take a peek at our cleaned data.

In [30]:
clean_nyc1.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,persons_injured,persons_killed,pedestrians_injured,pedestrians_killed,cyclist_injured,cyclist_killed,motorist_injured,motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,vehicle_type_code_1,vehicle_type_code_2
0,2015-01-15t00:00:00.000,15:20,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,passenger vehicle,passenger vehicle
2,2019-12-07t00:00:00.000,19:22,,,,,0.0,0.0,0,0,0,0,0,0,traffic recklessness,unspecified,sedan,
5,2018-12-27t00:00:00.000,9:00,manhattan,10019.0,,,0.0,0.0,0,0,0,0,0,0,outside error,other vehicular,van,sedan
7,2019-12-06t00:00:00.000,10:35,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,sedan,pick-up truck
8,2018-12-26t00:00:00.000,13:13,,,,,2.0,0.0,0,0,0,0,2,0,traffic recklessness,unspecified,suv,van


### Data Type <a id='data_type'></a>

Let's look at our dates to make sure they are all in the same format:

In [31]:
clean_nyc1[['crash_date', 'crash_time']].head()

Unnamed: 0,crash_date,crash_time
0,2015-01-15t00:00:00.000,15:20
2,2019-12-07t00:00:00.000,19:22
5,2018-12-27t00:00:00.000,9:00
7,2019-12-06t00:00:00.000,10:35
8,2018-12-26t00:00:00.000,13:13


The `crash_date` column definitely needs some fixing. We will transform the string timestamp for `crash_date` to a true datetime data type.

In [32]:
clean_nyc1['crash_date'] = pd.to_datetime(clean_nyc1['crash_date'])
clean_nyc1.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,persons_injured,persons_killed,pedestrians_injured,pedestrians_killed,cyclist_injured,cyclist_killed,motorist_injured,motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,vehicle_type_code_1,vehicle_type_code_2
0,2015-01-15,15:20,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,passenger vehicle,passenger vehicle
2,2019-12-07,19:22,,,,,0.0,0.0,0,0,0,0,0,0,traffic recklessness,unspecified,sedan,
5,2018-12-27,9:00,manhattan,10019.0,,,0.0,0.0,0,0,0,0,0,0,outside error,other vehicular,van,sedan
7,2019-12-06,10:35,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,sedan,pick-up truck
8,2018-12-26,13:13,,,,,2.0,0.0,0,0,0,0,2,0,traffic recklessness,unspecified,suv,van


We also want a new column, `hour`, that holds only the hour of each crash instead of hours and minutes. We think that can be helpful when grouping the times and visualizing the data. Because `crash_time` contains only a time of day, `pd.to_datetime` fills in the date on which the cell is run, so the date part of the converted `crash_time` values can be ignored.

In [33]:
clean_nyc1['crash_time'] = pd.to_datetime(clean_nyc1.crash_time)
clean_nyc1['hour'] = clean_nyc1['crash_time'].dt.hour
clean_nyc1.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,persons_injured,persons_killed,pedestrians_injured,pedestrians_killed,cyclist_injured,cyclist_killed,motorist_injured,motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,vehicle_type_code_1,vehicle_type_code_2,hour
0,2015-01-15,2019-12-10 15:20:00,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,passenger vehicle,passenger vehicle,15
2,2019-12-07,2019-12-10 19:22:00,,,,,0.0,0.0,0,0,0,0,0,0,traffic recklessness,unspecified,sedan,,19
5,2018-12-27,2019-12-10 09:00:00,manhattan,10019.0,,,0.0,0.0,0,0,0,0,0,0,outside error,other vehicular,van,sedan,9
7,2019-12-06,2019-12-10 10:35:00,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,sedan,pick-up truck,10
8,2018-12-26,2019-12-10 13:13:00,,,,,2.0,0.0,0,0,0,0,2,0,traffic recklessness,unspecified,suv,van,13
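
If only the hour is needed, a lighter alternative is to read it straight from the text, which avoids attaching a placeholder date. A minimal sketch on toy values, assuming every entry has the `H:MM` or `HH:MM` form seen above and none is missing:

```python
import pandas as pd

# Toy series standing in for the original crash_time strings.
times = pd.Series(["15:20", "9:00", "0:05"])

# Take the part before the colon and convert it to an integer hour.
hours = times.str.split(":").str[0].astype(int)

print(hours.tolist())
```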


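We'll also clean up the `zip_code` column, which should hold strings rather than numbers. Note that `astype(str)` also turns missing values into the literal string `'nan'`, so the column reports no nulls afterwards.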

In [34]:
clean_nyc1.loc[:,'zip_code'] = clean_nyc1['zip_code'].astype(str)
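
If the goal is a five-character ZIP string with missing values left missing, a small formatter is one option. This is a sketch on toy values and assumes the column holds floats such as `10019.0`; `format_zip` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy series standing in for a float-typed zip_code column.
zips = pd.Series([10019.0, np.nan, 11201.0, 7302.0])


def format_zip(value):
    """Return a zero-padded 5-digit string, or NaN if the value is missing."""
    if pd.isna(value):
        return np.nan
    return f"{int(value):05d}"


zip_text = zips.map(format_zip)
print(zip_text.tolist())
```

This drops the trailing `.0`, keeps leading zeros, and lets `isna()` still find the rows without a ZIP code.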

### Categorizing: Making a Seasons Variable<a id='categorizing'></a>

We are interested in adding a variable that shows the season in which a collision occurred.

In [35]:
clean_nyc1['crash_date'].dt.month.head(30)

0      1
2     12
5     12
7     12
8     12
9     12
10    12
11    12
17     1
19    12
22     1
23    12
30    12
32     1
35    12
36    12
37    12
39     1
40    10
41     1
42    12
43     1
47    12
51    12
52     9
55    11
57     1
59    12
61    12
63    12
Name: crash_date, dtype: int64

In [36]:
def season(crash_date):
    """Map a timestamp to its season based on the month."""
    if crash_date.month in (3, 4, 5):
        val = 'Spring'
    elif crash_date.month in (6, 7, 8):
        val = 'Summer'
    elif crash_date.month in (9, 10, 11):
        val = 'Autumn'
    elif crash_date.month in (12, 1, 2):
        val = 'Winter'
    else:
        # Reached only when the month is not a valid value (e.g. a missing date).
        val = "Unspecified"
    return val

clean_nyc1['season'] = clean_nyc1['crash_date'].apply(season)

In [37]:
clean_nyc1['season'].value_counts()

Autumn    278181
Summer    273570
Spring    246130
Winter    222473
Name: season, dtype: int64
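
On a table of this size, a row-by-row `apply` can be slow. The same month-to-season rule can be sketched as a dictionary lookup on the month number; `MONTH_TO_SEASON` is a hypothetical name, and the dates below are toy values.

```python
import pandas as pd

# Month number -> season, matching the rule used in season() above.
MONTH_TO_SEASON = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

# Toy dates standing in for clean_nyc1['crash_date'].
dates = pd.to_datetime(
    pd.Series(["2015-01-15", "2019-07-04", "2018-10-31", "2019-04-01"])
)

# Vectorized lookup: extract the month, then map it through the dict.
seasons = dates.dt.month.map(MONTH_TO_SEASON)
print(seasons.tolist())
```

One difference from `season()`: a missing date would come out as a missing value here rather than as "Unspecified".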

Before we move on to the analysis of our dataset, we would like to take a quick look at the difference our work has made so far.

In [38]:
datanyc.shape

(1618008, 29)

In [39]:
clean_nyc1.shape

(1020354, 20)

Compared with the raw data, we now have 9 fewer columns (29 down to 20) and 597,654 fewer rows (1,618,008 down to 1,020,354). Let's take a general look at our data as well as the `info()` summary.

In [40]:
clean_nyc1.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,persons_injured,persons_killed,pedestrians_injured,pedestrians_killed,cyclist_injured,cyclist_killed,motorist_injured,motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,vehicle_type_code_1,vehicle_type_code_2,hour,season
0,2015-01-15,2019-12-10 15:20:00,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,passenger vehicle,passenger vehicle,15,Winter
2,2019-12-07,2019-12-10 19:22:00,,,,,0.0,0.0,0,0,0,0,0,0,traffic recklessness,unspecified,sedan,,19,Winter
5,2018-12-27,2019-12-10 09:00:00,manhattan,10019.0,,,0.0,0.0,0,0,0,0,0,0,outside error,other vehicular,van,sedan,9,Winter
7,2019-12-06,2019-12-10 10:35:00,,,,,0.0,0.0,0,0,0,0,0,0,driver inattention/distraction,unspecified,sedan,pick-up truck,10,Winter
8,2018-12-26,2019-12-10 13:13:00,,,,,2.0,0.0,0,0,0,0,2,0,traffic recklessness,unspecified,suv,van,13,Winter


In [41]:
clean_nyc1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1020354 entries, 0 to 1618006
Data columns (total 20 columns):
crash_date                       1020354 non-null datetime64[ns]
crash_time                       1020354 non-null datetime64[ns]
borough                          669986 non-null object
zip_code                         1020354 non-null object
latitude                         899476 non-null float64
longitude                        899476 non-null float64
persons_injured                  1020351 non-null float64
persons_killed                   1020342 non-null float64
pedestrians_injured              1020354 non-null int64
pedestrians_killed               1020354 non-null int64
cyclist_injured                  1020354 non-null int64
cyclist_killed                   1020354 non-null int64
motorist_injured                 1020354 non-null int64
motorist_killed                  1020354 non-null int64
contributing_factor_vehicle_1    1020354 non-null object
contributing_factor_v

We've done a lot of data cleaning, and this is a great start for our next stage. If you would like to perform your own analyses, you can use the following cell's code to save the cleaned data locally by removing the `#` mark at the beginning of the line.

In [42]:
#clean_nyc1.to_csv("nyc_crash_data.csv", index = False)
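
A CSV file does not store column types, so dates and ZIP codes need to be declared again when the saved file is read back. The sketch below shows the round trip on a toy frame, using an in-memory buffer in place of `nyc_crash_data.csv`; the column selection is illustrative only.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned data.
cleaned = pd.DataFrame({
    "crash_date": pd.to_datetime(["2015-01-15", "2018-12-27"]),
    "zip_code": ["10019", "07302"],
    "hour": [15, 9],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the datetime column and keep ZIP codes as text.
reloaded = pd.read_csv(
    buffer,
    parse_dates=["crash_date"],
    dtype={"zip_code": str},
)
print(reloaded.dtypes)
```

Without the `dtype` argument, a ZIP code such as `07302` would be read as a number and lose its leading zero.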

# <center> <br>[Beginning of the page](#Top)</center> <a id='Bottom'></a>