# Preprocessing

---
**Authors**:
-  *Juan P. Zaldivar E.*
-  *Enrique Millán X.*
---

## Introduction

This file contains the preprocessing required for the datasets. The datasets are:
- Collisions dataset.
- Weather dataset.
- New York Map.


<!-- explain that a visual exploration is done first and then the preprocessing? -->

### Obtaining the datasets

The [*collision*](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) dataset was provided by the instructors of the project.

The *weather* dataset was obtained by following these steps:

- Visit the [NOAA Climate Data Online Search](https://www.ncdc.noaa.gov/cdo-web/search) web page.

- Select the following options:
  - `Weather Observation Type/Dataset -> Daily Summaries, Date Range -> 2018-01-01 to 2020-12-31, Search For -> Cities, Search Term -> New York City.`

- Look for "*New York, NY US*" and click ADD TO CART. Then click the cart in the top right corner.

- Select "*Custom GHCN-Daily CSV*" and the date range chosen previously (2018-01-01 to 2020-12-31). This selects more data than needed (to avoid several disjoint downloads); it is filtered later with ``Pandas`` and ``OpenRefine``. Click continue.

- Fill in the three options and select "*metric units*".

- Fill in all the remaining options and click continue. Some of these options will probably not be needed, but this is analyzed further when cleaning the datasets.

- Enter the email address where you want to receive the data so that the order can start.

The *map* dataset was obtained following the next steps:

The datasets are located in the folder `Data/` and the results are saved in the folder `Data/Preprocessed/`. The following sections import the required libraries and load each dataset.

### Libraries

In [140]:
import os
import numpy as np
import pandas as pd

In [141]:
# !pip install numpy
# note the package versions

## Dataset preprocessing

The preprocessing of the files was done jointly with OpenRefine and the Python libraries in order to take advantage of both tools.

In [142]:
dir = './Data'
colission_exists = False

### Collisions dataset

#### Collision dataset preprocessing

The original dataset was first loaded into a ``Pandas`` dataframe in order to filter the desired range of dates. Filtering in ``Pandas`` is more efficient because of the size of the original dataset (about 2 million rows); this volume of data made the computational process in OpenRefine very slow and inefficient. After this initial filtering, the dataset was exported to a ``.csv`` file and loaded into OpenRefine.

In [143]:
if os.path.exists(f'{dir}/collisions_2018-2020.csv'):
    collision = pd.read_csv(f'{dir}/collisions_2018-2020.csv')
    colission_exists = True
else:
    collision = pd.read_csv(f'{dir}/collisions.csv')

collision.shape


(2031091, 29)

In [144]:
collision.head()

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,09/11/2021,2:39,,,,,,WHITESTONE EXPRESSWAY,20 AVENUE,,...,Unspecified,,,,4455765,Sedan,Sedan,,,
1,03/26/2022,11:45,,,,,,QUEENSBORO BRIDGE UPPER,,,...,,,,,4513547,Sedan,,,,
2,06/29/2022,6:55,,,,,,THROGS NECK BRIDGE,,,...,Unspecified,,,,4541903,Sedan,Pick-up Truck,,,
3,09/11/2021,9:35,BROOKLYN,11208.0,40.667202,-73.8665,"(40.667202, -73.8665)",,,1211 LORING AVENUE,...,,,,,4456314,Sedan,,,,
4,12/14/2021,8:13,BROOKLYN,11233.0,40.683304,-73.917274,"(40.683304, -73.917274)",SARATOGA AVENUE,DECATUR STREET,,...,,,,,4486609,,,,,


If the filtered version does not exist, we proceed with the filtering. To restrict the data to the summers of 2018 and 2020 (June 1 to September 30 of each year), the first step is to convert the **CRASH DATE** attribute to a datetime type.

In [145]:
if not colission_exists:
    collision['CRASH DATE'] = pd.to_datetime(collision['CRASH DATE'])

    collision = collision[((collision['CRASH DATE'] >= '2018-06-01') & (collision['CRASH DATE'] <= '2018-09-30')) | ((collision['CRASH DATE'] >= '2020-06-01') & (collision['CRASH DATE'] <= '2020-09-30'))]
    collision.to_csv(f'{dir}/collisions_2018-2020.csv', index=False)

collision.shape

(115740, 29)
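As a small illustration of the date filter described above, the following sketch applies the same two summer ranges to made-up rows. The toy values and the explicit `format` argument are assumptions for illustration; the notebook itself lets ``Pandas`` infer the format.

```python
import pandas as pd

# Toy rows standing in for the collisions file (values are made up).
toy = pd.DataFrame({
    'CRASH DATE': ['05/31/2018', '06/01/2018', '09/30/2018', '07/15/2019', '08/20/2020'],
    'COLLISION_ID': [1, 2, 3, 4, 5],
})

# An explicit format avoids ambiguous month/day parsing.
toy['CRASH DATE'] = pd.to_datetime(toy['CRASH DATE'], format='%m/%d/%Y')

# Series.between is inclusive on both ends by default.
summer_2018 = toy['CRASH DATE'].between('2018-06-01', '2018-09-30')
summer_2020 = toy['CRASH DATE'].between('2020-06-01', '2020-09-30')

filtered = toy[summer_2018 | summer_2020]
print(filtered['COLLISION_ID'].tolist())
```

Only the rows dated inside either summer survive; the boundary days (June 1 and September 30) are kept because both ends are inclusive.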

After the first filtering, the dataset was exported to a ``.csv`` file and loaded into OpenRefine. The procedure followed in OpenRefine, and the reasoning behind it, is explained and justified in this section.

*ON STREET NAME* and *OFF STREET NAME* seem to give the same information. Moreover, there appear to be no rows that contain both attributes, which suggests merging the two columns.

We first check whether any row contains both attributes. The result shows that no row satisfies this condition.

In [146]:
collision[(collision['ON STREET NAME'].notnull()) & (collision['OFF STREET NAME'].notnull())].shape

(0, 29)

In [147]:
collision[(collision['ON STREET NAME'].notnull()) | (collision['OFF STREET NAME'].notnull())].shape

(115514, 29)
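Since the two columns never overlap, merging them loses no information. The sketch below shows one way this could be done in ``Pandas`` on made-up rows; it is not necessarily how the merge was actually performed (the text indicates OpenRefine was used), and the target column name `STREET NAME` is taken from the preprocessed file loaded later in this notebook.

```python
import pandas as pd

# Toy rows (made up): each row has at most one of the two street columns.
toy = pd.DataFrame({
    'ON STREET NAME': ['SARATOGA AVENUE', None, None],
    'OFF STREET NAME': [None, '1211 LORING AVENUE', None],
})

# Guard: the merge is only lossless if no row has both values.
both = toy['ON STREET NAME'].notnull() & toy['OFF STREET NAME'].notnull()
assert not both.any()

# Fill the gaps of one column with the values of the other.
toy['STREET NAME'] = toy['ON STREET NAME'].fillna(toy['OFF STREET NAME'])
merged = toy.drop(columns=['ON STREET NAME', 'OFF STREET NAME'])
print(merged)
```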

---

*LOCATION* and *LATITUDE & LONGITUDE* seem to give the exact same information, so let's check whether we can remove the latter.

In [148]:
# check whether every row that has longitude and latitude also has location, to see if they can be removed
collision[(collision['LOCATION'].notnull()) & (collision['LATITUDE'].notnull()) & (collision['LONGITUDE'].notnull())].shape

(108073, 29)

In [149]:
# every row that has location also has longitude and latitude, so we can do without the longitude and latitude columns
collision[(collision['LOCATION'].isnull())].shape

(7667, 29)

In [150]:
collision[(collision['LOCATION'].isnull()) & (collision['LATITUDE'].notnull()) & (collision['LONGITUDE'].notnull())].shape

(0, 29)

In [151]:
collision[(collision['LATITUDE'].isnull()) & (collision['LONGITUDE'].notnull()) | (collision['LATITUDE'].notnull()) & (collision['LONGITUDE'].isnull())].shape

(0, 29)
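The checks above only compare which cells are null. A stricter check, sketched here on made-up rows, would also parse the `LOCATION` string and compare its numbers with `LATITUDE` and `LONGITUDE` before dropping the two columns. The regular expression assumes the `(lat, lon)` format seen in the sample rows.

```python
import numpy as np
import pandas as pd

# Toy rows (made up) in the same shape as the three columns of interest.
toy = pd.DataFrame({
    'LATITUDE': [40.667202, None, 40.683304],
    'LONGITUDE': [-73.8665, None, -73.917274],
    'LOCATION': ['(40.667202, -73.8665)', None, '(40.683304, -73.917274)'],
})

# 1) Presence check: LOCATION is set exactly when both coordinates are set.
has_location = toy['LOCATION'].notnull()
has_coords = toy['LATITUDE'].notnull() & toy['LONGITUDE'].notnull()
same_presence = bool((has_location == has_coords).all())

# 2) Value check: the numbers inside LOCATION match the coordinate columns.
parsed = toy['LOCATION'].str.extract(r'\(\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\)').astype(float)
same_values = bool(
    np.allclose(parsed[0], toy['LATITUDE'], equal_nan=True)
    and np.allclose(parsed[1], toy['LONGITUDE'], equal_nan=True)
)

# Only drop the coordinate columns when both checks pass.
if same_presence and same_values:
    slim = toy.drop(columns=['LATITUDE', 'LONGITUDE'])
else:
    slim = toy
print(list(slim.columns))
```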

---
CROSS STREET could be removed, since it is only the nearest street and adds no value.

---
A clustering was performed on STREET NAME (with the *key collision* method and the *fingerprint* keying function), and the cells with similar names were merged.

Then the extra whitespace was removed from the STREET NAME column.
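To give an idea of what key-collision clustering does, here is a simplified sketch of a fingerprint-style key in Python. It is an approximation for illustration, not OpenRefine's exact implementation; the function name `fingerprint`, the sample street names, and the choice of the first spelling as the canonical one are all assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified fingerprint key: values with the same key form a cluster."""
    # Trim, lowercase, and strip accents down to plain ASCII.
    text = unicodedata.normalize('NFKD', value.strip().lower())
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Remove punctuation, then sort and de-duplicate the tokens.
    text = text.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(sorted(set(text.split())))


# Made-up variants of street names.
names = ['GRAND CENTRAL PKWY', 'grand central pkwy ', 'Pkwy, Grand Central', '8 AVENUE']

# Group names whose keys collide.
clusters = defaultdict(list)
for name in names:
    clusters[fingerprint(name)].append(name)

# Map every variant to the first spelling in its cluster, with extra whitespace collapsed.
canonical = {
    name: ' '.join(members[0].split())
    for members in clusters.values()
    for name in members
}
print(canonical)
```

In OpenRefine the canonical spelling of each cluster is chosen interactively, so the automatic "first spelling wins" rule here is only a stand-in for that manual step.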

---

There are only 6 null values in the injured/killed count columns.

In [152]:
print(collision['NUMBER OF PERSONS INJURED'].isnull().sum(),
collision['NUMBER OF PERSONS KILLED'].isnull().sum(),
collision['NUMBER OF PEDESTRIANS INJURED'].isnull().sum(),
collision['NUMBER OF PEDESTRIANS KILLED'].isnull().sum(),
collision['NUMBER OF CYCLIST INJURED'].isnull().sum(),
collision['NUMBER OF CYCLIST KILLED'].isnull().sum(),
collision['NUMBER OF MOTORIST INJURED'].isnull().sum(),
collision['NUMBER OF MOTORIST KILLED'].isnull().sum())

2 4 0 0 0 0 0 0


In [153]:
print(collision['NUMBER OF PERSONS INJURED'].isna().sum(),
collision['NUMBER OF PERSONS KILLED'].isna().sum(),
collision['NUMBER OF PEDESTRIANS INJURED'].isna().sum(),
collision['NUMBER OF PEDESTRIANS KILLED'].isna().sum(),
collision['NUMBER OF CYCLIST INJURED'].isna().sum(),
collision['NUMBER OF CYCLIST KILLED'].isna().sum(),
collision['NUMBER OF MOTORIST INJURED'].isna().sum(),
collision['NUMBER OF MOTORIST KILLED'].isna().sum())

2 4 0 0 0 0 0 0


In [154]:
collision['COLLISION_ID'].unique().shape[0]

115740

---

In [155]:
precollision = pd.read_csv(f'{dir}/collisions_2018-2020_prepro_v1.csv')
precollision.head()

Unnamed: 0,COLLISION_ID,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LOCATION,STREET NAME,CROSS STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,...,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,4345591,2020-09-06T00:00:00Z,18:05,,,"(40.771038, -73.83413)",WHITESTONE EXPRESSWAY,,0.0,1.0,...,Unsafe Lane Changing,Following Too Closely,,,,STATION WAGON/SPORT UTILITY VEHICLE,MOTORCYCLE,,,
1,4459141,2020-09-20T00:00:00Z,9:14,,,"(40.722095, -73.77772)",GRAND CENTRAL PKWY,,0.0,0.0,...,Driver Inattention/Distraction,Unspecified,,,,SEDAN,PICK-UP TRUCK,,,
2,4461437,2020-09-24T00:00:00Z,22:00,MANHATTAN,10039.0,"(40.824757, -73.94052)",8 AVENUE,WEST 148 STREET,0.0,0.0,...,Pedestrian/Bicyclist/Other Pedestrian Error/Co...,Driver Inattention/Distraction,,,,SEDAN,MOTORSCOOTER,,,
3,4336560,2020-08-06T00:00:00Z,5:30,,,,120 HUGE GRANT CIRCLE,,1.0,0.0,...,Unspecified,Unspecified,,,,SEDAN,STATION WAGON/SPORT UTILITY VEHICLE,,,
4,4334713,2020-08-01T00:00:00Z,17:16,BRONX,10462.0,"(40.840508, -73.85515)",METROPOLITAN AVENUE,LINDEN DRIVE,0.0,1.0,...,Failure to Yield Right-of-Way,Unspecified,,,,SEDAN,E-BIKE,,,


In [156]:
precollision['CONTRIBUTING FACTOR VEHICLE 2'].isnull().sum()

19593

In [157]:
# set factor $i$ to 'Unspecified' if vehicle $i$ is not null and factor $i$ is null
for i in range(1, 6):
    precollision[f'CONTRIBUTING FACTOR VEHICLE {i}'] = precollision.apply(lambda x: 'Unspecified' if not pd.isnull(x[f'VEHICLE TYPE CODE {i}']) and pd.isnull(x[f'CONTRIBUTING FACTOR VEHICLE {i}']) else x[f'CONTRIBUTING FACTOR VEHICLE {i}'], axis=1)

In [158]:
precollision['CONTRIBUTING FACTOR VEHICLE 2'].isnull().sum()

18852
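The row-wise `apply` above works, but it can be slow on larger frames. A possible vectorized alternative, sketched on made-up rows with only two vehicle slots, uses a boolean mask and `.loc`; it is meant to produce the same result, not to replace the cell above.

```python
import pandas as pd

# Toy rows (made up) with two vehicle slots instead of the five in the real data.
toy = pd.DataFrame({
    'VEHICLE TYPE CODE 1': ['SEDAN', 'SEDAN', None],
    'CONTRIBUTING FACTOR VEHICLE 1': ['Unsafe Lane Changing', None, None],
    'VEHICLE TYPE CODE 2': ['E-BIKE', None, None],
    'CONTRIBUTING FACTOR VEHICLE 2': [None, None, None],
})

for i in range(1, 3):
    vehicle = f'VEHICLE TYPE CODE {i}'
    factor = f'CONTRIBUTING FACTOR VEHICLE {i}'
    # Rows where a vehicle is recorded but its contributing factor is missing.
    mask = toy[vehicle].notnull() & toy[factor].isnull()
    toy.loc[mask, factor] = 'Unspecified'

print(toy[['CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2']])
```

Factors stay null wherever the corresponding vehicle is also null, which is why the null count in the real data drops only partially.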

In [159]:
# set factor $i$ to null if vehicle $i$ is null and factor $i$ is not null
# or
# set vehicle $i$ to 'Unknown' if vehicle $i$ is null and factor $i$ is not null