# Preprocessing

---
**Authors**:
-  *Juan P. Zaldivar E.*
-  *Enrique Millán X.*
---

## Introduction

This file contains the preprocessing required for the datasets. The datasets are:
- Collisions dataset.
- Weather dataset.
- New York Map.


<!-- TODO: explain that a visual exploration is done first and then the preprocessing? -->

The datasets are located in the `Data/` folder, and the results are saved in the `Data/Preprocessed/` folder. The cells below import the required libraries and load each dataset.

In [1]:
# !pip install numpy
# TODO: note the package versions

### Import libraries

In [2]:
import os
import numpy as np
import pandas as pd
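
One possible way to record the package versions is sketched below. The `versions` dictionary is a name introduced only for this example; the attributes it reads (`sys.version`, `np.__version__`, `pd.__version__`) are standard.

```python
import sys

import numpy as np
import pandas as pd

# Collect the interpreter and package versions in one place so they can be
# copied into the notebook or into a requirements file.
versions = {
    "python": sys.version.split()[0],
    "numpy": np.__version__,
    "pandas": pd.__version__,
}

for name, version in versions.items():
    print(f"{name}=={version}")
```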

## Dataset acquisition

The *weather* dataset was obtained with the following steps:

- Visit the [NOAA Climate Data Online Search](https://www.ncdc.noaa.gov/cdo-web/search)

- Select the following options: Weather Observation Type/Dataset -> Daily Summaries, Date Range -> 2018-01-01 to 2020-12-31, Search For -> Cities, Search Term -> New York City.

- Look for "New York, NY US" and click *ADD TO CART*. Then click the cart in the top right corner.

- Select "Custom GHCN-Daily CSV" and the previously selected date range (2018-01-01 to 2020-12-31). This selects more information than needed, but it is cleaned later with Pandas and OpenRefine. Click Continue.

- Fill in the three options and select metric units.

- Fill in all the remaining options and click Continue. Some of these options will probably not be needed, but this is analyzed further when cleaning the datasets.

- Type the email where you want to receive the data so the order can start.

The *map* dataset was obtained with the following steps:


## Load datasets

In [3]:
dir = './Data'
colission_exists = False

### Collisions dataset

In [4]:
if os.path.exists(f'{dir}/collisions_2018-2020.csv'):
    collision = pd.read_csv(f'{dir}/collisions_2018-2020.csv')
    colission_exists = True
else:
    collision = pd.read_csv(f'{dir}/collisions.csv')

collision.shape, collision.columns

((115740, 29),
 Index(['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE',
        'LONGITUDE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME',
        'OFF STREET NAME', 'NUMBER OF PERSONS INJURED',
        'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
        'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
        'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
        'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
        'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
        'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
        'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
        'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'],
       dtype='object'))

In [5]:
collision.head()

| | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-09-06 | 18:05 | | | 40.771038 | -73.83413 | (40.771038, -73.83413) | WHITESTONE EXPRESSWAY | | | ... | Following Too Closely | | | | 4345591 | Station Wagon/Sport Utility Vehicle | Motorcycle | | | |
| 1 | 2020-09-20 | 9:14 | | | 40.722095 | -73.77772 | (40.722095, -73.77772) | GRAND CENTRAL PKWY | | | ... | Unspecified | | | | 4459141 | Sedan | Pick-up Truck | | | |
| 2 | 2020-09-24 | 22:00 | MANHATTAN | 10039.0 | 40.824757 | -73.94052 | (40.824757, -73.94052) | 8 AVENUE | WEST 148 STREET | | ... | Driver Inattention/Distraction | | | | 4461437 | Sedan | Motorscooter | | | |
| 3 | 2020-08-06 | 5:30 | | | | | | | | 120 Huge Grant Circle | ... | Unspecified | | | | 4336560 | Sedan | Station Wagon/Sport Utility Vehicle | | | |
| 4 | 2020-08-01 | 17:16 | BRONX | 10462.0 | 40.840508 | -73.85515 | (40.840508, -73.85515) | METROPOLITAN AVENUE | LINDEN DRIVE | | ... | Unspecified | | | | 4334713 | Sedan | E-Bike | | | |


To restrict the data to the summers of 2018 and 2020, the first step is to convert the *CRASH DATE* attribute to a datetime type.

In [6]:
collision['CRASH DATE'] = pd.to_datetime(collision['CRASH DATE'])
collision.head()

| | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-09-06 | 18:05 | | | 40.771038 | -73.83413 | (40.771038, -73.83413) | WHITESTONE EXPRESSWAY | | | ... | Following Too Closely | | | | 4345591 | Station Wagon/Sport Utility Vehicle | Motorcycle | | | |
| 1 | 2020-09-20 | 9:14 | | | 40.722095 | -73.77772 | (40.722095, -73.77772) | GRAND CENTRAL PKWY | | | ... | Unspecified | | | | 4459141 | Sedan | Pick-up Truck | | | |
| 2 | 2020-09-24 | 22:00 | MANHATTAN | 10039.0 | 40.824757 | -73.94052 | (40.824757, -73.94052) | 8 AVENUE | WEST 148 STREET | | ... | Driver Inattention/Distraction | | | | 4461437 | Sedan | Motorscooter | | | |
| 3 | 2020-08-06 | 5:30 | | | | | | | | 120 Huge Grant Circle | ... | Unspecified | | | | 4336560 | Sedan | Station Wagon/Sport Utility Vehicle | | | |
| 4 | 2020-08-01 | 17:16 | BRONX | 10462.0 | 40.840508 | -73.85515 | (40.840508, -73.85515) | METROPOLITAN AVENUE | LINDEN DRIVE | | ... | Unspecified | | | | 4334713 | Sedan | E-Bike | | | |
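
As a small illustration of the conversion, the sketch below parses date strings in the `YYYY-MM-DD` form shown in the table. Passing an explicit `format` and `errors="coerce"` is an assumption about how one might harden the step, not what the notebook does: unparseable values become `NaT` instead of raising. The `toy` frame and its values are made up for the example.

```python
import pandas as pd

# Toy data: two valid dates and one malformed value (hypothetical).
toy = pd.DataFrame({"CRASH DATE": ["2020-09-06", "2018-07-15", "not a date"]})

# Parse with an explicit format; invalid strings are turned into NaT.
toy["CRASH DATE"] = pd.to_datetime(
    toy["CRASH DATE"], format="%Y-%m-%d", errors="coerce"
)

# Count how many values could not be parsed.
bad_rows = int(toy["CRASH DATE"].isna().sum())
print(toy.dtypes)
print("unparseable dates:", bad_rows)
```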


If the filtered file does not exist yet, we filter the data to June 1 through September 30 of 2018 and 2020 and save the result.

In [7]:
if not colission_exists:
    # Keep only June 1 to September 30 of 2018 and of 2020
    summer_2018 = (collision['CRASH DATE'] >= '2018-06-01') & (collision['CRASH DATE'] <= '2018-09-30')
    summer_2020 = (collision['CRASH DATE'] >= '2020-06-01') & (collision['CRASH DATE'] <= '2020-09-30')
    collision = collision[summer_2018 | summer_2020]
    collision.to_csv(f'{dir}/collisions_2018-2020.csv', index=False)  # same name the loading cell checks for

collision.shape

(115740, 29)
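
The summer filter could also be wrapped in a small helper, sketched here with `Series.between`. The function name `filter_summers` and the `toy` frame are hypothetical; the date bounds (June 1 to September 30, inclusive) are the ones used in the cell above. The sketch assumes the date column holds dates without a time component.

```python
import pandas as pd


def filter_summers(df, date_col="CRASH DATE", years=(2018, 2020)):
    """Keep rows dated June 1 to September 30 (inclusive) of the given years."""
    dates = df[date_col]
    mask = pd.Series(False, index=df.index)
    for year in years:
        start = pd.Timestamp(year=year, month=6, day=1)
        end = pd.Timestamp(year=year, month=9, day=30)
        # between() is inclusive on both ends by default.
        mask |= dates.between(start, end)
    return df[mask]


# Toy data covering dates just inside and just outside the two windows.
toy = pd.DataFrame({
    "CRASH DATE": pd.to_datetime(
        ["2018-05-31", "2018-06-01", "2019-07-01", "2020-09-30", "2020-10-01"]
    )
})
summer = filter_summers(toy)
print(summer)
```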

## Collision dataset preprocessing

*ON STREET NAME* and *OFF STREET NAME* apparently give the same information. Moreover, there seem to be no rows that contain both attributes, which is a reason to consider merging the two columns.

We first check whether any row contains both attributes. The result shows that no row satisfies this condition. A second check counts the rows that contain at least one of them.

In [17]:
collision[(collision['ON STREET NAME'].notnull()) & (collision['OFF STREET NAME'].notnull())].shape

(0, 29)

In [9]:
collision[(collision['ON STREET NAME'].notnull()) | (collision['OFF STREET NAME'].notnull())].shape

(115514, 29)
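
If the two columns are merged, one way to do it is with `Series.combine_first`, sketched below on made-up rows. The merged column name `STREET NAME` and the `toy` frame are assumptions for the example.

```python
import pandas as pd

# Toy data: each row has at most one of the two street attributes.
toy = pd.DataFrame({
    "ON STREET NAME": ["8 AVENUE", None, None],
    "OFF STREET NAME": [None, "120 Huge Grant Circle", None],
})

# Guard: merging is only lossless if no row has both values.
both = toy["ON STREET NAME"].notnull() & toy["OFF STREET NAME"].notnull()
assert not both.any()

# Take ON STREET NAME where present, otherwise fall back to OFF STREET NAME.
toy["STREET NAME"] = toy["ON STREET NAME"].combine_first(toy["OFF STREET NAME"])
merged = toy.drop(columns=["ON STREET NAME", "OFF STREET NAME"])
print(merged)
```

Rows with neither attribute stay null in the merged column.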

---

*LOCATION* and *LATITUDE* & *LONGITUDE* seem to give the exact same information, so let's check whether we can remove the latter.

In [10]:
# check whether every row with latitude and longitude also has a location, to see if those columns can be dropped
collision[(collision['LOCATION'].notnull()) & (collision['LATITUDE'].notnull()) & (collision['LONGITUDE'].notnull())].shape

(108073, 29)

In [11]:
# every row with a location also has latitude and longitude, so the latitude and longitude columns can be dropped
collision[(collision['LOCATION'].isnull())].shape

(7667, 29)

In [12]:
collision[(collision['LOCATION'].isnull()) & (collision['LATITUDE'].notnull()) & (collision['LONGITUDE'].notnull())].shape

(0, 29)

In [13]:
# rows where exactly one of LATITUDE / LONGITUDE is missing
collision[((collision['LATITUDE'].isnull()) & (collision['LONGITUDE'].notnull())) | ((collision['LATITUDE'].notnull()) & (collision['LONGITUDE'].isnull()))].shape

(0, 29)
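
The checks above can be condensed into a single comparison of the missing-value patterns, sketched here on made-up rows. Note that this only verifies that the columns are present or absent together, not that the coordinate values inside *LOCATION* match *LATITUDE* and *LONGITUDE*. The names `toy`, `same_rows`, and `reduced` are introduced for the example.

```python
import numpy as np
import pandas as pd

# Toy data: coordinates and location are either all present or all missing.
toy = pd.DataFrame({
    "LATITUDE": [40.771038, np.nan, 40.840508],
    "LONGITUDE": [-73.83413, np.nan, -73.85515],
    "LOCATION": ["(40.771038, -73.83413)", None, "(40.840508, -73.85515)"],
})

has_location = toy["LOCATION"].notnull()
has_coords = toy["LATITUDE"].notnull() & toy["LONGITUDE"].notnull()

# True when LOCATION is present on exactly the rows that have both coordinates.
same_rows = bool((has_location == has_coords).all())

# Drop the coordinate columns only if the patterns agree.
reduced = toy.drop(columns=["LATITUDE", "LONGITUDE"]) if same_rows else toy
print(same_rows, list(reduced.columns))
```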

---
*CROSS STREET NAME* could be dropped, since it is only the nearest cross street and adds no value.

---
*STREET NAME* was clustered (using the *key collision* method with the *fingerprint* keying function), and cells with similar names were merged.

Then the extra whitespace was removed from the *STREET NAME* column.
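
To give an idea of how key collision clustering with a fingerprint key works, here is a simplified Python approximation. It is not the implementation of the tool used for the actual cleaning; the functions `fingerprint` and `cluster_by_key` and the sample names are hypothetical. Values that normalize to the same key end up in the same cluster.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified fingerprint key: normalize, tokenize, dedupe, and sort."""
    # Trim, lowercase, and reduce accented characters to plain ASCII.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, drop duplicate tokens, and sort them.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster_by_key(values):
    """Group values sharing a fingerprint; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


names = ["8 AVENUE", "8 Avenue ", "AVENUE 8",
         "GRAND CENTRAL PKWY", "Grand Central Pkwy."]
clusters = cluster_by_key(names)
for key, members in clusters.items():
    print(key, "->", members)
```

Because tokens are sorted, word order is ignored, so each proposed cluster should still be reviewed before merging.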

---

There are only 6 null values.

In [14]:
print(collision['NUMBER OF PERSONS INJURED'].isnull().sum(),
collision['NUMBER OF PERSONS KILLED'].isnull().sum(),
collision['NUMBER OF PEDESTRIANS INJURED'].isnull().sum(),
collision['NUMBER OF PEDESTRIANS KILLED'].isnull().sum(),
collision['NUMBER OF CYCLIST INJURED'].isnull().sum(),
collision['NUMBER OF CYCLIST KILLED'].isnull().sum(),
collision['NUMBER OF MOTORIST INJURED'].isnull().sum(),
collision['NUMBER OF MOTORIST KILLED'].isnull().sum())

2 4 0 0 0 0 0 0


In [15]:
print(collision['NUMBER OF PERSONS INJURED'].isna().sum(),
collision['NUMBER OF PERSONS KILLED'].isna().sum(),
collision['NUMBER OF PEDESTRIANS INJURED'].isna().sum(),
collision['NUMBER OF PEDESTRIANS KILLED'].isna().sum(),
collision['NUMBER OF CYCLIST INJURED'].isna().sum(),
collision['NUMBER OF CYCLIST KILLED'].isna().sum(),
collision['NUMBER OF MOTORIST INJURED'].isna().sum(),
collision['NUMBER OF MOTORIST KILLED'].isna().sum())

2 4 0 0 0 0 0 0
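
The same counts can be obtained more compactly by selecting the count columns once and calling `isnull().sum()` on the sub-frame. The sketch below uses a made-up `toy` frame with three of the columns; selecting by the `NUMBER OF` prefix is an assumption based on the column names listed earlier.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing counts (hypothetical values).
toy = pd.DataFrame({
    "NUMBER OF PERSONS INJURED": [0.0, np.nan, 2.0],
    "NUMBER OF PERSONS KILLED": [0.0, 0.0, np.nan],
    "NUMBER OF PEDESTRIANS INJURED": [0, 1, 0],
})

# Select every casualty-count column by its shared prefix.
count_cols = [c for c in toy.columns if c.startswith("NUMBER OF")]

# Null count per column, then the overall total.
null_counts = toy[count_cols].isnull().sum()
total_nulls = int(null_counts.sum())
print(null_counts)
print("total nulls:", total_nulls)
```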
