# Visualization Project

---
**Authors**:
-  *Juan P. Zaldivar E.*
-  *Enrique Millán X.*
---

In [136]:
# nyc = gpd.read_file('Data/new-york-city-boroughs-ny_.geojson')

# alt.Chart(nyc).mark_geoshape(
#     stroke='white',
#     strokeWidth=1
# ).encode(
#     color=alt.Color('name:N'),
#     tooltip=['name:N']
# ).properties(
#     width=500,
#     height=500
# )

## Introduction

This file contains the preprocessing required for the datasets. The datasets are:
- Collisions dataset.
- Weather dataset.
- New York Map.


<!-- explain that a visual exploration is done first and then the preprocessing? -->

### Obtaining the datasets

The [*collision*](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) dataset was provided by the project instructors.

The *weather* dataset was obtained with the following steps:

- Visit the [NOAA Climate Data Online Search](https://www.ncdc.noaa.gov/cdo-web/search) web page.

- Select the following options:
  - `Weather Observation Type/Dataset -> Daily Summaries, Date Range -> 2018-01-01 to 2020-12-31, Search For -> Cities, Search Term -> New York City.`

- Look for "*New York, NY US*" and click ADD TO CART. Then click the cart in the top right corner.

- Select "*Custom GHCN-Daily CSV*" and the date range chosen previously (2018-01-01 to 2020-12-31). We are selecting more information than needed (to avoid disjoint downloads), but we will filter it later with ``Pandas`` and ``OpenRefine``. Click continue.

- Fill the three options, and select "*metric units*".

- Select all the remaining options and click continue. Some of them will probably not be needed, but this is analyzed further when cleaning the datasets.

- Type the email where you want to receive the data so the order can start.

The *map* dataset was obtained following the next steps:

The datasets are located in the folder `Data/` and the results are saved in the folder `Data/Preprocessed/`. The following cells import the required libraries and load each dataset.

### Libraries

In [137]:
import os
import re
import geopy as gp
import numpy as np
import pandas as pd
import altair as alt
import geopandas as gpd
from uszipcode import SearchEngine
from Modules import collision_preprocessing as cp

In [138]:
# !pip install numpy
# note the package versions

## Dataset preprocessing

The files were preprocessed with OpenRefine together with the proposed Python libraries, in order to take advantage of both tools.

In [139]:
dir = './Data'
temp_pre = './Data/tmp_pre'
colission_exists = False

## Collision dataset preprocessing

The original dataset was first loaded into a ``Pandas`` dataframe in order to filter the desired range of dates. Given the size of the original dataset, filtering it in ``Pandas`` is more efficient: this volume of data made processing in OpenRefine very slow and inefficient. After this initial filtering, the dataset was exported to a ``.csv`` file and loaded into OpenRefine.

In [140]:
if os.path.exists(f'{dir}/collisions_2018-2020.csv'):
    collision = pd.read_csv(f'{dir}/collisions_2018-2020.csv')
    colission_exists = True
else:
    collision = pd.read_csv(f'{dir}/collisions.csv')

collision.shape

(115740, 29)

In [141]:
collision.head()

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,2020-09-06,18:05,,,40.771038,-73.83413,"(40.771038, -73.83413)",WHITESTONE EXPRESSWAY,,,...,Following Too Closely,,,,4345591,Station Wagon/Sport Utility Vehicle,Motorcycle,,,
1,2020-09-20,9:14,,,40.722095,-73.77772,"(40.722095, -73.77772)",GRAND CENTRAL PKWY,,,...,Unspecified,,,,4459141,Sedan,Pick-up Truck,,,
2,2020-09-24,22:00,MANHATTAN,10039.0,40.824757,-73.94052,"(40.824757, -73.94052)",8 AVENUE,WEST 148 STREET,,...,Driver Inattention/Distraction,,,,4461437,Sedan,Motorscooter,,,
3,2020-08-06,5:30,,,,,,,,120 Huge Grant Circle,...,Unspecified,,,,4336560,Sedan,Station Wagon/Sport Utility Vehicle,,,
4,2020-08-01,17:16,BRONX,10462.0,40.840508,-73.85515,"(40.840508, -73.85515)",METROPOLITAN AVENUE,LINDEN DRIVE,,...,Unspecified,,,,4334713,Sedan,E-Bike,,,


If the filtered version does not exist, we proceed with the filtering. To restrict the data to the summers of 2018 and 2020, the first step is to change the data type of the **CRASH DATE** attribute.

In [142]:
if not colission_exists:
    collision['CRASH DATE'] = pd.to_datetime(collision['CRASH DATE'])

    collision = collision[((collision['CRASH DATE'] >= '2018-06-01') & (collision['CRASH DATE'] <= '2018-09-30')) | ((collision['CRASH DATE'] >= '2020-06-01') & (collision['CRASH DATE'] <= '2020-09-30'))]
    collision.to_csv(f'{dir}/collisions_2018-2020.csv', index=False)

collision.shape

(115740, 29)
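
The date filter above can also be written with `Series.between`, which is inclusive on both ends by default. The snippet below is a minimal sketch on a toy dataframe (the rows are made up for illustration), not the code used to produce the exported file.

```python
import pandas as pd

# Toy data: two rows fall inside the summer windows, three fall outside
toy = pd.DataFrame({'CRASH DATE': ['2018-05-31', '2018-06-01', '2019-07-15',
                                   '2020-09-30', '2020-10-01']})
toy['CRASH DATE'] = pd.to_datetime(toy['CRASH DATE'])

# One boolean mask per summer, combined with a logical OR
summer_2018 = toy['CRASH DATE'].between('2018-06-01', '2018-09-30')
summer_2020 = toy['CRASH DATE'].between('2020-06-01', '2020-09-30')
filtered = toy[summer_2018 | summer_2020]

print(filtered['CRASH DATE'].dt.strftime('%Y-%m-%d').tolist())
```

Naming each mask keeps the condition readable and makes it easy to add or remove a period later.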

After the filtering, the dataset was exported to a ``.csv`` file and loaded into OpenRefine. The steps taken in OpenRefine, and the reasoning behind them, are explained in this section.

### Data exploration?


The collision table contains information about individual crash events, with each row representing a distinct collision. The dataframe compiles details from all police-reported motor vehicle collisions in NYC.

<!-- ...should a brief description of the exploration go here? -->

### Data selection and transformation

All of the following transformations were applied with OpenRefine, but some checks that justify them are carried out in this section of the notebook.

At first glance, **ON STREET NAME** and **OFF STREET NAME** seem to be the same attribute under different names. The attribute descriptions on the dataset's web site are the following:

- **ON STREET NAME**: *Street on which the collision occurred*.
- **OFF STREET NAME**: *Street address if known*.

This suggests that both attributes contain approximately the same information. Furthermore, there are no rows with both attributes filled, which makes merging them plausible.

In [143]:
collision[(collision['ON STREET NAME'].notnull()) & (collision['OFF STREET NAME'].notnull())].shape

(0, 29)

In [144]:
collision[(collision['ON STREET NAME'].notnull()) | (collision['OFF STREET NAME'].notnull())].shape

(115514, 29)

The resulting attribute after merging both columns is called **STREET NAME** and contains the street name/address where the collision occurred. It is missing in only 226 of the 115,740 rows. Some rows will have a more detailed description of the street, while others will only have the name of the street.
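
The merge itself was done in OpenRefine. As a rough illustration of the same idea in ``Pandas``, the sketch below uses `Series.combine_first` on a toy dataframe; the rows are made up, and this is an assumed equivalent rather than the exact OpenRefine operation.

```python
import pandas as pd

# Toy rows: each one has at most one of the two street attributes filled
toy = pd.DataFrame({
    'ON STREET NAME': ['8 AVENUE', None, None],
    'OFF STREET NAME': [None, '120 HUGE GRANT CIRCLE', None],
})

# Take ON STREET NAME where present, otherwise fall back to OFF STREET NAME
toy['STREET NAME'] = toy['ON STREET NAME'].combine_first(toy['OFF STREET NAME'])

print(toy['STREET NAME'].tolist())
print(toy['STREET NAME'].isnull().sum())
```

Rows where both source attributes are empty stay missing in the merged column.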

<!-- This is not a problem, since the attribute will be used to join the dataset with the weather dataset, and the weather dataset only contains the name of the street. -->

**CROSS STREET NAME**, which is the third attribute related to the street environment, can be dropped since it is not useful for the intended analysis.

Similarly, **LOCATION** seems to contain the tuple (**LATITUDE**, **LONGITUDE**), so we could, a priori, remove the two extra attributes.

In [145]:
collision[(collision['LOCATION'].notnull()) & (collision['LATITUDE'].notnull()) & (collision['LONGITUDE'].notnull())].shape

(108073, 29)

The number of rows where the three attributes are not missing does not cover the total number of rows, but there are no rows where the **LOCATION** attribute is missing and at least one of the other two attributes is not missing.

In [146]:
collision[collision['LOCATION'].isnull() & (collision['LATITUDE'].notnull() | collision['LONGITUDE'].notnull())].shape

(0, 29)

In [147]:
collision[(collision['LOCATION'].isnull())].shape

(7667, 29)

The remaining 7667 rows therefore have missing values in all three attributes. This means that the **LATITUDE** and **LONGITUDE** attributes can be removed, since the **LOCATION** attribute contains the same information. With this transformation, the number of attributes is reduced by two.
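
To illustrate why nothing is lost by dropping the two columns, the sketch below parses **LOCATION** strings back into numeric coordinates. It is a minimal example on toy values copied from the preview above; the regular expression is an assumption about the `(lat, lon)` string format.

```python
import pandas as pd

# Toy LOCATION values in the "(lat, lon)" format, plus one missing entry
toy = pd.DataFrame({'LOCATION': ['(40.771038, -73.83413)', None,
                                 '(40.824757, -73.94052)']})

# Two capture groups: latitude and longitude, each an optionally signed decimal
pattern = r'\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)'
coords = toy['LOCATION'].str.extract(pattern).astype(float)
coords.columns = ['LATITUDE', 'LONGITUDE']

print(coords)
```

Rows with a missing **LOCATION** simply yield `NaN` in both recovered columns.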

Regarding vehicle information, the project statement specifies that only the first vehicle is of interest, so we can drop all columns that refer to other vehicles (*VEHICLE TYPE CODE 2*, *VEHICLE TYPE CODE 3*, etc. and *CONTRIBUTING FACTOR VEHICLE 2*, etc.).

In [148]:
collision['NUMBER OF PERSONS INJURED'].equals(collision['NUMBER OF PEDESTRIANS INJURED'] + collision['NUMBER OF CYCLIST INJURED'] + collision['NUMBER OF MOTORIST INJURED'])

False

In [149]:
collision['NUMBER OF PERSONS INJURED'].equals(collision['NUMBER OF PEDESTRIANS INJURED'] + collision['NUMBER OF CYCLIST INJURED'])

False

In [150]:
collision['NUMBER OF PERSONS INJURED'].equals(collision['NUMBER OF PEDESTRIANS INJURED'])


False
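
`Series.equals` answers with a single boolean: it returns `False` as soon as one row differs, and also when the two series have different dtypes. A row-wise comparison shows how many rows actually disagree. The sketch below uses a toy dataframe with made-up counts; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy counts: the last row has a missing total
toy = pd.DataFrame({
    'NUMBER OF PERSONS INJURED': [1.0, 2.0, np.nan],
    'NUMBER OF PEDESTRIANS INJURED': [1, 0, 0],
    'NUMBER OF CYCLIST INJURED': [0, 1, 0],
    'NUMBER OF MOTORIST INJURED': [0, 1, 1],
})

# Sum of the three partial counts, row by row
parts = toy[['NUMBER OF PEDESTRIANS INJURED',
             'NUMBER OF CYCLIST INJURED',
             'NUMBER OF MOTORIST INJURED']].sum(axis=1)

# Rows where the reported total differs from the sum (a NaN total counts as a mismatch)
mismatch = toy['NUMBER OF PERSONS INJURED'].ne(parts)
n_mismatch = int(mismatch.sum())

print(n_mismatch)
```

Inspecting `toy[mismatch]` then shows whether the differences come from missing totals or from genuinely inconsistent counts.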

### Data type conversion

In [151]:
# wouldn't it be better to lowercase the words for better visualization?

The data conversion was done in OpenRefine. The **CRASH DATE** attribute was converted to a date type. The **COLLISION_ID** and **CRASH TIME** attributes were set as strings for the time being.

The attributes describing the geographic location of the collision were set as strings, with some additional normalization. All values were converted to uppercase and extra spaces were removed (if any). This eases the work of the clustering method used to inspect and edit cells collectively when some values are not written correctly or consistently.

The clustering was done with the *key collision* method and the *fingerprint* keying function. After a couple of iterations, no significant variations were found in the values of the attributes other than the ones mentioned above.

The same clustering, with the same keying function, was applied to the rest of the geographical attributes. It produced no clusters, which means that the values were already consistent. To verify this result, a *nearest neighbor* clustering was run as well, and it did not find any significant variation either.
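
The idea behind key collision clustering with a fingerprint key can be sketched in Python. The function below approximates the usual fingerprint steps (trim, lowercase, strip accents and punctuation, sort unique tokens); it is an illustration written for this notebook, not OpenRefine's actual implementation, and the helper names are hypothetical.

```python
import re
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Approximate fingerprint key: normalized, de-duplicated, sorted tokens."""
    text = value.strip().lower()
    # Strip accents by decomposing characters and dropping non-ASCII marks
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Remove punctuation
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    return ' '.join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a fingerprint; keep groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


print(cluster(['8 AVENUE', 'AVENUE 8', ' 8  Avenue', 'BROADWAY']))
```

Values that collapse to the same key are candidates for merging; an empty result means that no variants collide under this key.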

For the attributes referring to the number of persons involved in the collision, the data type was set to integer, because the values are discrete counts and are not expected to be negative.

The vehicle and factor attributes were left as strings for the time being. Unlike the geographical attributes, the values of these attributes did not have extra spaces or inconsistent spellings, so the clustering found no variations. This suggests that the values were entered in a controlled manner, which is not the case for the geographical attributes.

### Vehicle type consolidation

In [152]:
precollision = pd.read_csv(f'{dir}/collisions_2018-2020_prepro_v1.csv')
precollision.head()

Unnamed: 0,COLLISION_ID,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LOCATION,STREET NAME,NUMBER OF INJURED,NUMBER OF KILLED,CONTRIBUTING FACTOR VEHICLE 1,VEHICLE TYPE CODE 1
0,4345591,2020-09-06T00:00:00Z,18:05,,,"(40.771038, -73.83413)",WHITESTONE EXPRESSWAY,0.0,2.0,Unsafe Lane Changing,VAN
1,4459141,2020-09-20T00:00:00Z,9:14,,,"(40.722095, -73.77772)",GRAND CENTRAL PKWY,0.0,0.0,Driver Inattention/Distraction,CAR
2,4461437,2020-09-24T00:00:00Z,22:00,MANHATTAN,10039.0,"(40.824757, -73.94052)",8 AVENUE,0.0,0.0,Pedestrian/Bicyclist/Other Pedestrian Error/Co...,CAR
3,4336560,2020-08-06T00:00:00Z,5:30,,,,120 HUGE GRANT CIRCLE,2.0,0.0,Unspecified,CAR
4,4334713,2020-08-01T00:00:00Z,17:16,BRONX,10462.0,"(40.840508, -73.85515)",METROPOLITAN AVENUE,0.0,2.0,Failure to Yield Right-of-Way,CAR


<!-- SCOOTER, TRUCK, CAR, VAN, MOTORCYCLE, BICYCLE, BUS, TAXI, AMBULANCE, FDNY, OTHER, UNKNOWN, SCOOTER, MILITAR, TRACTOR,  -->

In [159]:
precollision['VEHICLE TYPE CODE 1'].unique()

array(['VAN', 'CAR', 'BUS', 'TRACTOR', 'TAXI', 'SCOOTER', 'TRUCK',
       'MOTORCYCLE', 'BICYCLE', 'PK', 'TANK', 'AMBULANCE', 'MOPED',
       'FDNY', 'UNKNOWN', 'OTHERS', 'CHASSIS CAB', 'BULK AGRICULTURE',
       'SWT', 'MESSAGE SI', 'BOX', 'PC', 'SLINGSHOT', 'DOT EQUIPM', 'J1',
       'DRILL RIG', 'E REVEL SC', 'UTV', 'JOHN DEERE', 'F150XL PIC', '1C',
       'SANIT', 'RV', 'STAK', '2- TO', 'MOTORIZED HOME', 'PAS', 'PALLET',
       'UHAUL', 'LOADE', 'NEW Y', 'STREE', 'SPRIN', 'RGS', 'POSTO',
       'SGWS', 'OMR', 'HEAVY', 'ACCES', 'DEMA-', 'BK', 'NYPD'],
      dtype=object)

In [158]:
# precollision['VEHICLE TYPE CODE 1'] = precollision['VEHICLE TYPE CODE 1'].replace(
    # ['SUV', 'FLAT', '3-DOOR', 'CHEVY EXPR', 'PC', 'ELEC. UNIC'], 'CAR')

# precollision['VEHICLE TYPE CODE 1'] = precollision['VEHICLE TYPE CODE 1'].replace(
#     ['99999'], 'UNKNOWN')
# precollision['VEHICLE TYPE CODE 1'].fillna('UNKNOWN', inplace=True)


# Map the remaining rare categories onto the consolidated vehicle types
precollision['VEHICLE TYPE CODE 1'] = precollision['VEHICLE TYPE CODE 1'].replace({
    'BULK AGRICULTURE': 'OTHERS',
    'BOX': 'TRUCK',
    'MOTORIZED HOME': 'VAN',
})

### Missing values

The existence of some missing values has already been mentioned. In the previous section, missing values were detected with the ``.isnull()`` method of ``Pandas``. In ``Pandas``, ``.isnull()`` is an alias of ``.isna()``, and both flag ``None`` and ``NaN`` values alike. The following cell simply confirms that the two methods report identical counts for every column.

In [104]:
comp = (precollision.isnull().sum() == precollision.isna().sum())
comp[comp == False]

Series([], dtype: bool)

As the empty result shows, ``.isnull()`` and ``.isna()`` detect exactly the same missing values. The attributes with missing values can be grouped into three separate clusters:
- Geographical attributes
- Injured/Killed attributes
- Vehicle attributes

In [105]:
precollision.isnull().sum()

COLLISION_ID                         0
CRASH DATE                           0
CRASH TIME                           0
BOROUGH                          40671
ZIP CODE                         40686
LOCATION                          7667
STREET NAME                        226
NUMBER OF INJURED                    2
NUMBER OF KILLED                     4
CONTRIBUTING FACTOR VEHICLE 1      351
VEHICLE TYPE CODE 1                730
dtype: int64

#### Derivation of geographic attributes

The first cluster consists of the geographical attributes:
- **BOROUGH**
- **ZIP CODE**
- **LOCATION**
- **STREET NAME**
- **CROSS STREET NAME**

Notice that the attribute with the fewest missing values is **STREET NAME**, with only $0.20\%$ of the entire dataset, partly thanks to the merge of the **ON STREET NAME** and **OFF STREET NAME** attributes in the previous section. We used this information to impute the missing coordinates of the **LOCATION** attribute and then the rest of the attributes sequentially.


In [106]:
# from arcgis.gis import GIS
# from arcgis.geocoding import geocode

In [107]:
# gis = GIS(profile = 'Myaaap')

# locations = []
# for index in range(precollision.shape[0]):
#     res = geocode(precollision['STREET NAME'][index], max_locations=1)
#     if res:
#         locations.append(res[0]['location'])
#     else:
#         locations.append(None)

In [108]:
# (precollision['LOCATION'][0], precollision['STREET NAME'][0] ,res[0]['address'], res[0]['location'])

In [109]:
precollision['LOCATION'].isnull().sum(), precollision['STREET NAME'].isnull().sum()

(7667, 226)

In [110]:
precollision[(precollision['LOCATION'].isnull()) & (precollision['STREET NAME'].isnull())].shape[0]

67

In [111]:
street_unique = precollision[precollision['LOCATION'].isnull()]['STREET NAME'].unique()
len(street_unique)

1897

In [112]:
# from geopy.geocoders import Nominatim

# geolocator = Nominatim(user_agent="GEoNYC")

# locations = []
# for street in street_unique:
#     try:
#         location = geolocator.geocode(street + ', New York City, New York, USA')
#         locations.append(location)
#     except:
#         locations.append(None)

In [113]:
# save locations in a file
# import pickle
# with open(f'{dir}/locations.pkl', 'wb') as f:
#     pickle.dump(locations, f)


In [114]:
precollision['LOCATION'].isnull().sum(), precollision['ZIP CODE'].isnull().sum()

(7667, 40686)

In [115]:
# load the cached geocoding results
import pickle
with open(f'{dir}/locations.pkl', 'rb') as f:
    locations = pickle.load(f)

In [116]:
len(locations)

1897

In [117]:
# count the failed lookups (None entries) in locations
sum(location is None for location in locations)

603

In [118]:
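# Parallel lists, one entry per geocoding result in `locations`
zipcodes = []
boroughs = []
coord = []

for location in locations:
    if location:
        # First 5-digit run in the returned address, taken as the ZIP code
        zipcode = re.findall(r'\d{5}', location.address)
        coord.append('(' + str(location.latitude) + ', ' + str(location.longitude) + ')')
        # Borough extraction is disabled, so `boroughs` is only filled
        # for failed lookups (in the else branch below)
        # boroughs.append(location.raw['display_name'].split(', ')[-6])
        if zipcode:
            zipcodes.append(zipcode[0])
        else:
            zipcodes.append(None)
    else:
        # Geocoding failed for this street: keep placeholders
        zipcodes.append(None)
        coord.append(None)
        boroughs.append(None)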

In [119]:
len(zipcodes), len(coord), len(boroughs)

(1897, 1897, 603)

In [120]:
# Iterate over the rows of the DataFrame
for index, row in precollision.iterrows():
    street_name = row['STREET NAME']
    location = row['LOCATION']
    zip_code = row['ZIP CODE']

    # Check if 'LOCATION' is NaN
    if pd.isnull(location):

        if street_name and street_name in street_unique:
            street_index = np.where(street_unique == street_name)[0][0]
            precollision.at[index, 'LOCATION'] = coord[street_index]

    if pd.isnull(zip_code):

        if street_name and street_name in street_unique:
            street_index = np.where(street_unique == street_name)[0][0]
            precollision.at[index, 'ZIP CODE'] = zipcodes[street_index]


In [121]:
precollision['LOCATION'].isnull().sum(), precollision['ZIP CODE'].isnull().sum()
# why only 2248 and not fewer?

(2248, 16137)
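
The row-by-row loop above can also be expressed as a dictionary lookup followed by `fillna`. The sketch below uses a toy dataframe and a hypothetical `street_to_coord` mapping (standing in for `street_unique` paired with `coord`); it also shows two ways a **LOCATION** can stay missing after the imputation: the geocoder returned nothing for the street, or the row has no street name at all.

```python
import pandas as pd

# Toy rows: made-up streets and coordinates
toy = pd.DataFrame({
    'STREET NAME': ['8 AVENUE', 'BROADWAY', None, 'BROADWAY'],
    'LOCATION': [None, '(40.7, -74.0)', None, None],
})

# Hypothetical geocoding results: BROADWAY could not be resolved
street_to_coord = {'8 AVENUE': '(40.82, -73.94)', 'BROADWAY': None}

# Look up each street and fill only the rows whose LOCATION is missing
toy['LOCATION'] = toy['LOCATION'].fillna(toy['STREET NAME'].map(street_to_coord))

print(toy['LOCATION'].isnull().sum())
```

Existing coordinates are never overwritten, because `fillna` only touches missing entries.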

In [122]:
# precollision['ZIP CODE'] = precollision['ZIP CODE'].fillna(-1)
precollision['LOCATION'] = precollision['LOCATION'].fillna('(-1,-1)')

In [123]:
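if os.path.exists(f'{dir}/collisions_2018-2020_prepro_v2.csv'):
    # Reuse the cached result of the reverse lookup if it exists
    precollision = pd.read_csv(f'{dir}/collisions_2018-2020_prepro_v2.csv')
    zipcodes = precollision['DERIVATED ZIP']
else:
    search = SearchEngine()

    def get_zipcode(df, location):
        """Return the closest uszipcode match for a row's coordinates, or NaN."""
        # '(-1,-1)' is the placeholder for rows without coordinates
        if df[location] == '(-1,-1)':
            return np.nan

        # Parse the "(lat, lon)" string back into two numbers
        lat, lon = df[location].replace('(', '').replace(')', '').split(',')
        result = search.by_coordinates(float(lat), float(lon))
        if len(result) > 0:
            return result[0]
        return np.nan

    # `df` is a single row here because the function is applied with axis=1
    info = precollision.apply(get_zipcode, axis=1, location='LOCATION')
    precollision['DERIVATED ZIP'] = info
    precollision.to_csv(f'{dir}/collisions_2018-2020_prepro_v2.csv', index=False)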

In [124]:
print(precollision['ZIP CODE'].isnull().sum(), precollision['ZIP CODE'].isna().sum())
print(precollision['BOROUGH'].isnull().sum(), precollision['BOROUGH'].isna().sum())

40686 40686
40671 40671


In [125]:
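zipcodes = precollision['DERIVATED ZIP'].str.extract(r"zipcode='(\d+)'", expand=False)
# NOTE: \w+ only captures single-word city names
boroughs = precollision['DERIVATED ZIP'].str.extract(r"major_city='(\w+)'", expand=False)

# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
precollision['ZIP CODE'] = precollision['ZIP CODE'].fillna(zipcodes)
precollision['BOROUGH'] = precollision['BOROUGH'].fillna(boroughs)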


In [126]:
print(precollision['ZIP CODE'].isnull().sum(), precollision['ZIP CODE'].isna().sum())
print(precollision['BOROUGH'].isnull().sum(), precollision['BOROUGH'].isna().sum())

5440 5440
19984 19984
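
Many **BOROUGH** values are still missing after this step. One possible contributor is the pattern used for `major_city`: `\w+` cannot match names that contain a space. The sketch below compares it with a more permissive pattern on hypothetical, shortened strings (the real stored representation has more fields). Note also that `major_city` may be a postal city name rather than a borough, so the extracted values would still need to be mapped to boroughs.

```python
import pandas as pd

# Hypothetical, shortened examples of the stored string representation
derived = pd.Series([
    "SimpleZipcode(zipcode='11221', major_city='Brooklyn')",
    "SimpleZipcode(zipcode='10312', major_city='Staten Island')",
    None,
])

zip_codes = derived.str.extract(r"zipcode='(\d+)'", expand=False)
# \w+ stops at the space, so the closing quote never matches for two-word names
single_word = derived.str.extract(r"major_city='(\w+)'", expand=False)
# [^']+ captures everything up to the closing quote
any_name = derived.str.extract(r"major_city='([^']+)'", expand=False)

print(single_word.tolist())
print(any_name.tolist())
```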


#### Derivation of vehicle attributes

In [127]:
# set factor $i$ to unspecified if vehicle $i$ is not null and factor $i$ is null
# or
# set vehicle $i$ to unknown if vehicle $i$ is null and factor $i$ is not null

# are these operations disjoint?

In some rows of the dataset, the **CONTRIBUTING FACTOR VEHICLE** is missing but the **VEHICLE TYPE CODE** is not. This suggests that the vehicle type is known, but the factor that contributed to the collision is not. To fill these missing values, the factor was set to *Unspecified*. This was done for every row where the **CONTRIBUTING FACTOR VEHICLE** was missing under the above condition.

In [128]:
cp.imputation_with_ref_col(precollision, 'CONTRIBUTING FACTOR VEHICLE', 'VEHICLE TYPE CODE', 'Unspecified', 1, 1)

Likewise, in some rows of the dataset, the **VEHICLE TYPE CODE** is missing but the **CONTRIBUTING FACTOR VEHICLE** is not. This suggests that the factor that contributed to the collision is known, but the vehicle type is not. To fill these missing values, the vehicle type was set to *UNKNOWN*. This was done for every row where the **VEHICLE TYPE CODE** was missing under the above condition.

In [129]:
cp.imputation_with_ref_col(precollision, 'VEHICLE TYPE CODE', 'CONTRIBUTING FACTOR VEHICLE', 'UNKNOWN', 1, 1)
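
The helper `cp.imputation_with_ref_col` lives in the project's `Modules` package and is not shown in this notebook. The sketch below is a hypothetical stand-in, named `impute_with_reference`, that implements the rule described above. The meaning of the arguments (column prefixes, fill value, first and last vehicle index) is assumed from the two calls and may differ from the real helper.

```python
import pandas as pd


def impute_with_reference(df, target_prefix, ref_prefix, fill_value, first, last):
    """Hypothetical stand-in: fill `target i` where it is missing but `ref i` is known."""
    for i in range(first, last + 1):
        target, ref = f'{target_prefix} {i}', f'{ref_prefix} {i}'
        mask = df[target].isnull() & df[ref].notnull()
        df.loc[mask, target] = fill_value
    return df


# Toy rows: known pair, missing factor with known vehicle, both missing
toy = pd.DataFrame({
    'CONTRIBUTING FACTOR VEHICLE 1': ['Unsafe Speed', None, None],
    'VEHICLE TYPE CODE 1': ['CAR', 'TAXI', None],
})
impute_with_reference(toy, 'CONTRIBUTING FACTOR VEHICLE', 'VEHICLE TYPE CODE',
                      'Unspecified', 1, 1)

print(toy)
```

Rows where both attributes are missing are left untouched by either call, so the two attributes end up with the same set of missing rows.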

In [130]:
precollision.isnull().sum()

COLLISION_ID                          0
CRASH DATE                            0
CRASH TIME                            0
BOROUGH                           19984
ZIP CODE                           5440
LOCATION                              0
STREET NAME                         226
CROSS STREET NAME                 60179
NUMBER OF PERSONS INJURED             2
NUMBER OF PERSONS KILLED              4
NUMBER OF PEDESTRIANS INJURED         0
NUMBER OF PEDESTRIANS KILLED          0
NUMBER OF CYCLIST INJURED             0
NUMBER OF CYCLIST KILLED              0
NUMBER OF MOTORIST INJURED            0
NUMBER OF MOTORIST KILLED             0
CONTRIBUTING FACTOR VEHICLE 1       288
CONTRIBUTING FACTOR VEHICLE 2     19593
CONTRIBUTING FACTOR VEHICLE 3    106611
CONTRIBUTING FACTOR VEHICLE 4    113552
CONTRIBUTING FACTOR VEHICLE 5    115126
VEHICLE TYPE CODE 1                 288
VEHICLE TYPE CODE 2               27447
VEHICLE TYPE CODE 3              107096
VEHICLE TYPE CODE 4              113658


<!-- comment on the reduction of missing values -->

Notice that, after the imputation, **CONTRIBUTING FACTOR VEHICLE 1** and **VEHICLE TYPE CODE 1** have the same number of missing values (288). Every row where only one of the two was missing has been filled, so the remaining gaps are in the same rows: those where both the factor and the vehicle type are missing.

In [131]:
# how do we impute the missing values of both fields at once?

# code 1 as unspecified/unknown
# remaining codes as non-existent

#### Derivation of number of person attributes

In [132]:
# how do we impute the missing values of the number of persons? or is it better to drop them?

In [133]:
precollision[precollision['NUMBER OF PERSONS INJURED'].isnull() | precollision['NUMBER OF PERSONS KILLED'].isnull()]

Unnamed: 0,COLLISION_ID,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LOCATION,STREET NAME,CROSS STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5,DERIVATED ZIP
42087,4026403,2018-09-30T00:00:00Z,6:30,QUEENS,11368.0,"(40.749767, -73.86381)",102-21 ROOSEVELT AVENUE,,,,...,,,,,TAXI,,,,,"SimpleZipcode(zipcode='11372', zipcode_type='S..."
62406,4026219,2018-08-08T00:00:00Z,13:45,STATEN ISLAND,10312.0,"(40.562523, -74.19211)",1531 ARTHUR KILL ROAD,,1.0,,...,,,,,SEDAN,,,,,"SimpleZipcode(zipcode='10312', zipcode_type='S..."
69596,4026267,2018-08-23T00:00:00Z,8:45,MANHATTAN,10001.0,"(40.74528, -73.99254)",135 WEST 26 STREET,,0.0,,...,,,,,SEDAN,,,,,"SimpleZipcode(zipcode='10098', zipcode_type='S..."
90894,4026185,2018-07-23T00:00:00Z,12:50,BROOKLYN,11221.0,"(40.68603, -73.93268)",STUYVESANT AVENUE,PUTNAM AVENUE,,0.0,...,,,,,SEDAN,,,,,"SimpleZipcode(zipcode='11221', zipcode_type='S..."
114639,4026046,2018-06-05T00:00:00Z,5:19,BRONX,10459.0,"(40.826992, -73.88984)",1145 WEST FARMS ROAD,,1.0,,...,,,,,STATION WAGON/SPORT UTILITY VEHICLE,,,,,"SimpleZipcode(zipcode='10459', zipcode_type='S..."


Since the rows with missing values in the number of persons injured or killed are a very small fraction of the dataset, we decided to set those values to 0, assuming that nobody was injured or killed in those collisions.

Given the small share of affected rows (5 of 115,740, below $0.01\%$), neither setting the missing values to 0 nor dropping the rows should have a significant impact on the final visualization. We chose to set the missing values to 0 in order to keep the rows and not lose information.

In [134]:
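# Assign the result back instead of using inplace=True on a column selection
precollision['NUMBER OF PERSONS INJURED'] = precollision['NUMBER OF PERSONS INJURED'].fillna(0)
precollision['NUMBER OF PERSONS KILLED'] = precollision['NUMBER OF PERSONS KILLED'].fillna(0)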