In [103]:
import os
import pandas as pd

In [104]:
dir = './Data'

In [105]:
collisions = pd.read_csv(dir + '/collisions_2018-2020.csv')
collisions.head(1)

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,2020-09-06,18:05,,,40.771038,-73.83413,"(40.771038, -73.83413)",WHITESTONE EXPRESSWAY,,,...,Following Too Closely,,,,4345591,Station Wagon/Sport Utility Vehicle,Motorcycle,,,


In [106]:
# create a year column from the date column
collisions['YEAR'] = pd.DatetimeIndex(collisions['CRASH DATE']).year
collisions.shape

(115740, 30)

In [107]:
# filter collision data to only include 2018
collisions = collisions[collisions['YEAR'] == 2018]
collisions.shape

(79383, 30)

In [108]:
collisions.head(1)

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5,YEAR
43,2018-08-30,14:00,BROOKLYN,11204.0,40.61662,-73.99972,"(40.61662, -73.99972)",NEW UTRECHT AVENUE,74 STREET,,...,,,,3971526,Sedan,Sedan,,,,2018


In [109]:
interest_cols = ['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE', 'LONGITUDE', 'ON STREET NAME', 'OFF STREET NAME', 'NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED', 'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1', 'COLLISION_ID', 'VEHICLE TYPE CODE 1']

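# keep only the columns of interest; .copy() avoids SettingWithCopyWarning when new columns are added later
collisions = collisions[interest_cols].copy()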

# Number of persons attributes

For visualization purposes, the differentiation between **PERSONS**, **PEDESTRIANS**, **CYCLISTS** and **MOTORISTS** (**INJURED/KILLED**) is irrelevant. A more useful attribute is the total number of persons injured or killed in the collision. This can be obtained by summing the four attributes, under the assumption that the **PERSONS** attribute is not already the sum of the other three.

This assumption had to be checked because the dataset documentation was not precise enough to determine whether **NUMBER OF PERSONS INJURED/KILLED** is an aggregate of the other three columns.

*Note: The metadata available on the dataset's web page for **NUMBER OF PERSONS INJURED/KILLED** was: "Number of persons injured/killed".*

In [110]:
collisions['NUMBER OF PERSONS INJURED'].equals(collisions['NUMBER OF PEDESTRIANS INJURED'] + collisions['NUMBER OF CYCLIST INJURED'] + collisions['NUMBER OF MOTORIST INJURED'])

False

In [111]:
collisions['NUMBER OF PERSONS INJURED'].equals(collisions['NUMBER OF PEDESTRIANS INJURED'])

False

In [112]:
collisions['NUMBER OF PERSONS KILLED'].equals(collisions['NUMBER OF PEDESTRIANS KILLED'] + collisions['NUMBER OF CYCLIST KILLED'] + collisions['NUMBER OF MOTORIST KILLED'])

False

In [113]:
collisions['NUMBER OF PERSONS KILLED'].equals(collisions['NUMBER OF PEDESTRIANS KILLED'])

False

As the checks above show, **NUMBER OF PERSONS INJURED/KILLED** is not the sum of the other three attributes. Furthermore, persons and pedestrians are not the same, although one might have assumed that the term persons referred to pedestrians.
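
`Series.equals` returns a single boolean: it is `False` as soon as one row differs, and also when the two series have different dtypes (for example `float64` against `int64`). A row-wise comparison shows how many rows actually differ. The following is a minimal sketch on toy data; the values are hypothetical and say nothing about the contents of the collisions file.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the collisions file).
toy = pd.DataFrame({
    'NUMBER OF PERSONS INJURED': [1.0, 2.0, None, 0.0],
    'NUMBER OF PEDESTRIANS INJURED': [1, 0, 0, 0],
    'NUMBER OF CYCLIST INJURED': [0, 1, 0, 0],
    'NUMBER OF MOTORIST INJURED': [0, 1, 1, 1],
})

# Row-wise sum of the three detail columns.
parts = toy[['NUMBER OF PEDESTRIANS INJURED',
             'NUMBER OF CYCLIST INJURED',
             'NUMBER OF MOTORIST INJURED']].sum(axis=1)

# Whole-series check: one boolean for the entire column.
whole_series_equal = toy['NUMBER OF PERSONS INJURED'].equals(parts)

# Element-wise check: one boolean per row; a missing value counts as a mismatch.
matches = toy['NUMBER OF PERSONS INJURED'].eq(parts)
n_mismatch = int((~matches).sum())
share_match = float(matches.mean())

# equals() is also dtype-sensitive: same values, different dtypes.
dtype_sensitive = pd.Series([1, 2]).equals(pd.Series([1.0, 2.0]))

print(whole_series_equal, n_mismatch, share_match, dtype_sensitive)
```

Applied to the real columns, `n_mismatch` and `share_match` would indicate whether the mismatch affects a handful of rows or most of them.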

Based on this, the attributes referring to injured people were summed to obtain **TOTAL INJURED**, and the attributes referring to killed people were summed to obtain **TOTAL KILLED**. The original **NUMBER OF ... INJURED/KILLED** attributes were then removed.

In [114]:
# sum, row by row, all columns with 'INJURED' (respectively 'KILLED') in the name
collisions['TOTAL INJURED'] = collisions.filter(regex='INJURED').sum(axis=1)
collisions['TOTAL KILLED'] = collisions.filter(regex='KILLED').sum(axis=1)

# drop the eight original NUMBER OF ... INJURED/KILLED columns (positions 8 to 15)
collisions = collisions.drop(columns=collisions.columns[8:16])

In [115]:
# unify spelling variants of the taxi, fire truck and ambulance labels
collisions['VEHICLE TYPE CODE 1'] = collisions['VEHICLE TYPE CODE 1'].replace(['Taxi', 'TAXI'], 'TAXI')
collisions['VEHICLE TYPE CODE 1'] = collisions['VEHICLE TYPE CODE 1'].replace(['Fire', 'FD tr', 'firet', 'fire', 'FIRE', 'fd tr', 'FD TR', 'FIRET'], 'FIRE')
collisions['VEHICLE TYPE CODE 1'] = collisions['VEHICLE TYPE CODE 1'].replace(['AMBUL', 'Ambulance', 'ambul', 'AMB', 'Ambul', 'AMBULANCE', 'AMBU'], 'AMBULANCE')
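
Listing every capitalization variant by hand is easy to get wrong. One possible alternative, sketched here on a hypothetical sample, is to upper-case the labels first and then map the remaining abbreviations to a canonical label. The `raw` series and the `canonical` mapping are illustrative assumptions. Unlike the cell above, this approach also upper-cases every other label (for example `Sedan` becomes `SEDAN`), so it is not a drop-in replacement.

```python
import pandas as pd

# Hypothetical sample of raw vehicle type labels, including a missing value.
raw = pd.Series(['Taxi', 'TAXI', 'firet', 'FD tr', 'Ambul', 'AMB', 'Sedan', None])

# Upper-cased abbreviation -> canonical label (keys mirror the variants listed above).
canonical = {
    'FD TR': 'FIRE',
    'FIRET': 'FIRE',
    'AMBUL': 'AMBULANCE',
    'AMB': 'AMBULANCE',
    'AMBU': 'AMBULANCE',
}

# Normalize case and surrounding whitespace, then map abbreviations.
# Missing values stay missing throughout.
upper = raw.str.upper().str.strip()
cleaned = upper.replace(canonical)

print(cleaned.tolist())
```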

In [116]:
collisions.to_csv(dir + '/collisions_2018.csv', index=False)

Next, the data is cleaned in OpenRefine.