# Visualization Project

---
**Authors**:
-  *Juan P. Zaldivar E.*
-  *Enrique Millán X.*
---

In [161]:
# nyc = gpd.read_file('Data/new-york-city-boroughs-ny_.geojson')

# alt.Chart(nyc).mark_geoshape(
#     stroke='white',
#     strokeWidth=1
# ).encode(
#     color=alt.Color('name:N'),
#     tooltip=['name:N']
# ).properties(
#     width=500,
#     height=500
# )

## Introduction

This file contains the preprocessing required for the datasets. The datasets are:
- Collisions dataset.
- Weather dataset.
- New York Map.


<!-- explain that a visual exploration is done first and the preprocessing afterwards? -->

### Dataset acquisition

The [*collision*](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) dataset was provided by the project instructors.

The *weather* dataset was obtained by following these steps:

- Visit the [NOAA Climate Data Online Search](https://www.ncdc.noaa.gov/cdo-web/search) web page.

- Select the following options:
  - `Weather Observation Type/Dataset -> Daily Summaries, Date Range -> 2018-01-01 to 2020-12-31, Search For -> Cities, Search Term -> New York City.`

- Look for "*New York, NY US*" and click ADD TO CART. Then click the cart in the top right corner.

- Select "*Custom GHCN-Daily CSV*" and the date range chosen earlier (2018-01-01 to 2020-12-31). This range covers more data than needed (to avoid several separate downloads); it is filtered later with ``Pandas`` and ``OpenRefine``. Click continue.

- Fill in the three options, and select "*metric units*".

- Fill in all the remaining options and click continue. Some of them will probably not be needed; this is analyzed further when cleaning the datasets.

- Type the email where you want to receive the data so the order can start.

The *map* dataset was obtained by following these steps:

The datasets are located in the `Data/` folder and the results are saved in the `Data/Preprocessed/` folder. The following cells import the required libraries and load each dataset.

### Libraries

In [162]:
import os
import re
import geopy as gp
import numpy as np
import pandas as pd
import altair as alt
import geopandas as gpd
from uszipcode import SearchEngine
from Modules import collision_preprocessing as cp

In [163]:
# !pip install numpy
# note the package versions

## Dataset preprocessing

The files were preprocessed with OpenRefine together with the Python libraries, in order to take advantage of both tools.

In [164]:
dir = './Data'
temp_pre = './Data/tmp_pre'
colission_exists = False

## Collision dataset preprocessing

The original dataset was first loaded into a ``Pandas`` dataframe in order to filter the desired range of dates. Given the size of the original dataset, filtering in ``Pandas`` is more efficient: this volume of data made processing in OpenRefine very slow and inefficient. After this initial filtering, the dataset was exported to a ``.csv`` file and loaded into OpenRefine.

In [165]:
if os.path.exists(f'{dir}/collisions_2018-2020.csv'):
    collision = pd.read_csv(f'{dir}/collisions_2018-2020.csv')
    colission_exists = True
else:
    collision = pd.read_csv(f'{dir}/collisions.csv')

collision.shape

(115740, 29)

In [166]:
collision.head()

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,2020-09-06,18:05,,,40.771038,-73.83413,"(40.771038, -73.83413)",WHITESTONE EXPRESSWAY,,,...,Following Too Closely,,,,4345591,Station Wagon/Sport Utility Vehicle,Motorcycle,,,
1,2020-09-20,9:14,,,40.722095,-73.77772,"(40.722095, -73.77772)",GRAND CENTRAL PKWY,,,...,Unspecified,,,,4459141,Sedan,Pick-up Truck,,,
2,2020-09-24,22:00,MANHATTAN,10039.0,40.824757,-73.94052,"(40.824757, -73.94052)",8 AVENUE,WEST 148 STREET,,...,Driver Inattention/Distraction,,,,4461437,Sedan,Motorscooter,,,
3,2020-08-06,5:30,,,,,,,,120 Huge Grant Circle,...,Unspecified,,,,4336560,Sedan,Station Wagon/Sport Utility Vehicle,,,
4,2020-08-01,17:16,BRONX,10462.0,40.840508,-73.85515,"(40.840508, -73.85515)",METROPOLITAN AVENUE,LINDEN DRIVE,,...,Unspecified,,,,4334713,Sedan,E-Bike,,,


If the filtered version does not exist, we proceed with the filtering. To restrict the data to the summers of 2018 and 2020 (June 1 to September 30 of each year), the first step is to convert the **CRASH DATE** attribute to a datetime type.

In [167]:
if not colission_exists:
    collision['CRASH DATE'] = pd.to_datetime(collision['CRASH DATE'])

    # Keep only the summer months (June 1 to September 30) of 2018 and 2020
    summer_2018 = collision['CRASH DATE'].between('2018-06-01', '2018-09-30')
    summer_2020 = collision['CRASH DATE'].between('2020-06-01', '2020-09-30')
    collision = collision[summer_2018 | summer_2020]
    collision.to_csv(f'{dir}/collisions_2018-2020.csv', index=False)

collision.shape

(115740, 29)

### Data exploration?

The collision table contains information about individual crash events, with each row representing a distinct collision. The dataframe compiles details from all police-reported motor vehicle collisions in NYC.

<!-- ...add a brief description of the exploration? -->

After the filtering, the dataset was exported to a ``.csv`` file and loaded into OpenRefine. The procedure followed in OpenRefine, and the reasoning behind it, are explained in this section.

### Data selection

At first glance, **ON STREET NAME** and **OFF STREET NAME** seem to be the same attribute under different names. The web site of the dataset describes them as follows:

- **ON STREET NAME**: *Street on which the collision occurred*.
- **OFF STREET NAME**: *Street address if known*.

These descriptions suggest that both attributes contain approximately the same information. Furthermore, no row has both attributes filled, which makes merging them plausible.

In [168]:
collision[(collision['ON STREET NAME'].notnull()) & (collision['OFF STREET NAME'].notnull())].shape

(0, 29)

In [169]:
collision[(collision['ON STREET NAME'].notnull()) | (collision['OFF STREET NAME'].notnull())].shape

(115514, 29)

The attribute that results from merging both columns is called **STREET NAME** and contains the street name or address where the collision occurred. It is missing only in the 226 rows (115740 - 115514) where both source columns are empty. Some rows have a more detailed description of the street, while others have only the street name.
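
The merge itself was carried out in OpenRefine, so the exact steps are not shown here. As a rough equivalent, a minimal ``Pandas`` sketch on a toy frame could look like the following. The toy values and the upper-casing step are assumptions made for illustration (the preprocessed file does show upper-case street names).

```python
import numpy as np
import pandas as pd

# Toy data: each row has at most one of the two street attributes filled.
toy = pd.DataFrame({
    "ON STREET NAME": ["WHITESTONE EXPRESSWAY", np.nan, np.nan],
    "OFF STREET NAME": [np.nan, "120 Huge Grant Circle", np.nan],
})

# Take ON STREET NAME where present, otherwise fall back to OFF STREET NAME.
merged = toy["ON STREET NAME"].combine_first(toy["OFF STREET NAME"])

# Assumed normalization: trim surrounding whitespace and upper-case the text.
toy["STREET NAME"] = merged.str.strip().str.upper()

print(toy["STREET NAME"].tolist())
```

Rows where both source columns are empty stay missing in the merged column.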

<!-- This is not a problem, since the attribute will be used to join the dataset with the weather dataset, and the weather dataset only contains the name of the street. -->

**CROSS STREET NAME**, the third attribute related to the street environment, could be dropped, since at first glance it does not seem useful for the analysis. However, we have decided to keep it for the time being, since it could contribute extra information to the final visualization.

Similarly, **LOCATION** seems to contain the tuple (**LATITUDE**, **LONGITUDE**), so we could, a priori, remove the two extra attributes.

In [170]:
collision[(collision['LOCATION'].notnull()) & (collision['LATITUDE'].notnull()) & (collision['LONGITUDE'].notnull())].shape

(108073, 29)

The rows where none of the three attributes is missing do not cover the whole dataset, but there are no rows where **LOCATION** is missing and at least one of the other two attributes is present.

In [171]:
# Rows where LOCATION is missing but LATITUDE or LONGITUDE is present
collision[collision['LOCATION'].isnull() & (collision['LATITUDE'].notnull() | collision['LONGITUDE'].notnull())].shape

(0, 29)

In [172]:
collision[(collision['LOCATION'].isnull())].shape

(7667, 29)

The remaining 7667 rows are therefore missing all three attributes (108073 + 7667 = 115740). This means that **LATITUDE** and **LONGITUDE** can be removed, since **LOCATION** contains the same information, which reduces the number of attributes by two.
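
Dropping the two columns loses nothing as long as the coordinates can be recovered from **LOCATION** when needed. A minimal sketch of that recovery, assuming the `"(lat, lon)"` string format seen in the rows above (the toy frame is illustrative):

```python
import pandas as pd

# Toy LOCATION column in the "(lat, lon)" string format, with one missing value.
toy = pd.DataFrame({
    "LOCATION": ["(40.771038, -73.83413)", None, "(40.824757, -73.94052)"]
})

# Named groups become the column names of the extracted frame.
pattern = (
    r"\(\s*(?P<LATITUDE>-?\d+(?:\.\d+)?)\s*,"
    r"\s*(?P<LONGITUDE>-?\d+(?:\.\d+)?)\s*\)"
)
coords = toy["LOCATION"].str.extract(pattern).astype(float)

print(coords)
```

Missing locations simply yield missing coordinates.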

### a

In [173]:
precollision = pd.read_csv(f'{dir}/collisions_2018-2020_prepro_v1.csv')
precollision.head()

Unnamed: 0,COLLISION_ID,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LOCATION,STREET NAME,CROSS STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,...,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,4345591,2020-09-06T00:00:00Z,18:05,,,"(40.771038, -73.83413)",WHITESTONE EXPRESSWAY,,0.0,1.0,...,Unsafe Lane Changing,Following Too Closely,,,,STATION WAGON/SPORT UTILITY VEHICLE,MOTORCYCLE,,,
1,4459141,2020-09-20T00:00:00Z,9:14,,,"(40.722095, -73.77772)",GRAND CENTRAL PKWY,,0.0,0.0,...,Driver Inattention/Distraction,Unspecified,,,,SEDAN,PICK-UP TRUCK,,,
2,4461437,2020-09-24T00:00:00Z,22:00,MANHATTAN,10039.0,"(40.824757, -73.94052)",8 AVENUE,WEST 148 STREET,0.0,0.0,...,Pedestrian/Bicyclist/Other Pedestrian Error/Co...,Driver Inattention/Distraction,,,,SEDAN,MOTORSCOOTER,,,
3,4336560,2020-08-06T00:00:00Z,5:30,,,,120 HUGE GRANT CIRCLE,,1.0,0.0,...,Unspecified,Unspecified,,,,SEDAN,STATION WAGON/SPORT UTILITY VEHICLE,,,
4,4334713,2020-08-01T00:00:00Z,17:16,BRONX,10462.0,"(40.840508, -73.85515)",METROPOLITAN AVENUE,LINDEN DRIVE,0.0,1.0,...,Failure to Yield Right-of-Way,Unspecified,,,,SEDAN,E-BIKE,,,


### Data type conversion

A clustering of **STREET NAME** was performed (with the *key collision* method and the *fingerprint* keying function), and cells with similar names were merged.

Then the extra whitespace was removed from the **STREET NAME** column.
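
To give an idea of what key collision clustering does, here is a rough Python approximation of a fingerprint keying function. It is a sketch written for illustration, not OpenRefine's actual implementation, and the function names are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Approximate fingerprint key: normalized, de-duplicated, sorted tokens."""
    value = value.strip().lower()
    # Drop ASCII punctuation.
    value = value.translate(str.maketrans("", "", string.punctuation))
    # Fold accented characters to their ASCII base form.
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Split on whitespace, remove duplicate tokens, sort, and rejoin.
    return " ".join(sorted(set(value.split())))


def cluster(values):
    """Group values that share a key; keep only groups with more than one variant."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]


names = ["8 AVENUE", "8  Avenue", "AVENUE 8", "BROADWAY", "Broadway."]
clusters = cluster(names)
print(clusters)
```

Because tokens are sorted, values that differ only in case, punctuation, spacing or word order fall into the same cluster, which is why each proposed merge still deserves a manual review.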

### Missing values

The existence of some missing values has already been mentioned. In the previous section, missing values were checked with the ``.isnull()`` method of ``Pandas``. In ``Pandas``, ``.isnull()`` is an alias of ``.isna()``, so both flag ``NaN`` and ``None`` values alike. The comparison below confirms that they give identical counts for every column.

In [174]:
comp = (precollision.isnull().sum() == precollision.isna().sum())
comp[~comp]

Series([], dtype: bool)

In [175]:
precollision.isnull().sum()

COLLISION_ID                          0
CRASH DATE                            0
CRASH TIME                            0
BOROUGH                           40671
ZIP CODE                          40686
LOCATION                           7667
STREET NAME                         226
CROSS STREET NAME                 60179
NUMBER OF PERSONS INJURED             2
NUMBER OF PERSONS KILLED              4
NUMBER OF PEDESTRIANS INJURED         0
NUMBER OF PEDESTRIANS KILLED          0
NUMBER OF CYCLIST INJURED             0
NUMBER OF CYCLIST KILLED              0
NUMBER OF MOTORIST INJURED            0
NUMBER OF MOTORIST KILLED             0
CONTRIBUTING FACTOR VEHICLE 1       351
CONTRIBUTING FACTOR VEHICLE 2     19593
CONTRIBUTING FACTOR VEHICLE 3    106611
CONTRIBUTING FACTOR VEHICLE 4    113552
CONTRIBUTING FACTOR VEHICLE 5    115126
VEHICLE TYPE CODE 1                 730
VEHICLE TYPE CODE 2               27447
VEHICLE TYPE CODE 3              107096
VEHICLE TYPE CODE 4              113658


#### Derivation of geographic attributes

In [176]:
# first derive the location from the street name
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent='TestCode')

location = precollision['STREET NAME'].apply(lambda x: geolocator.geocode(x))

GeocoderUnavailable: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=500+WEST+157+STREET&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Read timed out. (read timeout=1)"))
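
A timeout like this one is plausible when every row is sent to a public geocoding service with a 1 second read timeout and no pause between requests. A minimal sketch of a more tolerant approach: geocode each distinct street name once, wait between calls, and retry on failure. The helper names `make_throttled` and `geocode_unique` are hypothetical, and the demonstration uses a fake geocoder so that it runs offline.

```python
import time


def make_throttled(func, min_delay_seconds=1.0, max_retries=2):
    """Space out calls to func and retry on failure; return None if all attempts fail."""
    last_call = None

    def wrapper(query):
        nonlocal last_call
        for _ in range(max_retries + 1):
            if last_call is not None:
                wait = min_delay_seconds - (time.monotonic() - last_call)
                if wait > 0:
                    time.sleep(wait)
            last_call = time.monotonic()
            try:
                return func(query)
            except Exception:
                continue  # e.g. a timeout; try again after the delay
        return None

    return wrapper


def geocode_unique(addresses, geocode):
    """Geocode each distinct, non-missing address once and return {address: result}."""
    cache = {}
    for address in addresses:
        if not isinstance(address, str) or address in cache:
            continue
        cache[address] = geocode(address)
    return cache


# Offline demonstration with a fake geocoder that fails on its first call.
calls = []


def fake_geocode(query):
    calls.append(query)
    if len(calls) == 1:
        raise TimeoutError("simulated read timeout")
    return (40.0, -73.0)


throttled = make_throttled(fake_geocode, min_delay_seconds=0.0)
results = geocode_unique(["8 AVENUE", "8 AVENUE", None, "BROADWAY"], throttled)
print(results)
```

With ``geopy``, the same throttling is available through `geopy.extra.rate_limiter.RateLimiter` wrapped around `geolocator.geocode`, and a longer `timeout` can be passed when creating `Nominatim`. The resulting dictionary could then be mapped back with `precollision['STREET NAME'].map(results)`.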

In [None]:
# precollision['ZIP CODE'] = precollision['ZIP CODE'].fillna(-1)
precollision['LOCATION'] = precollision['LOCATION'].fillna('(-1,-1)')

In [None]:
if os.path.exists(f'{dir}/collisions_2018-2020_prepro_v2.csv'):
    precollision = pd.read_csv(f'{dir}/collisions_2018-2020_prepro_v2.csv')
    zipcodes = precollision['DERIVATED ZIP']
else:
    search = SearchEngine()

    def get_zipcode(df, location):
        if df[location] == '(-1,-1)':
            return np.nan

        lat, lon = df[location].replace('(', '').replace(')', '').split(',')
        result = search.by_coordinates(float(lat), float(lon))
        if len(result) > 0:
            return result[0]
        return np.nan

    info = precollision.apply(get_zipcode, axis=1, location='LOCATION')
    precollision['DERIVATED ZIP'] = info
    precollision.to_csv(f'{dir}/collisions_2018-2020_prepro_v2.csv', index=False)

In [None]:
print(precollision['ZIP CODE'].isnull().sum(), precollision['ZIP CODE'].isna().sum())
print(precollision['BOROUGH'].isnull().sum(), precollision['BOROUGH'].isna().sum())

40686 40686
40671 40671


In [None]:
zipcodes = precollision['DERIVATED ZIP'].str.extract(r"zipcode='(\d+)'", expand=False)
boroughs = precollision['DERIVATED ZIP'].str.extract(r"major_city='(\w+)'", expand=False)

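# Assign the result back instead of using inplace=True on a column selection
precollision['ZIP CODE'] = precollision['ZIP CODE'].fillna(zipcodes)
precollision['BOROUGH'] = precollision['BOROUGH'].fillna(boroughs)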



In [None]:
print(precollision['ZIP CODE'].isnull().sum(), precollision['ZIP CODE'].isna().sum())
print(precollision['BOROUGH'].isnull().sum(), precollision['BOROUGH'].isna().sum())

5440 5440
19984 19984
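
One side effect to keep in mind: the original **ZIP CODE** values are floats (for example `10039.0`), while the derived ones are extracted as strings, so the filled column may end up holding mixed types. A minimal sketch of a normalization step, assuming five-digit string ZIP codes are the desired format (the helper `normalize_zip` is hypothetical):

```python
import numpy as np
import pandas as pd


def normalize_zip(value):
    """Return a 5-character ZIP string, or NaN for missing values."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    # Floats such as 10039.0 are rendered as "10039.0"; drop the decimal part.
    if text.endswith(".0"):
        text = text[:-2]
    return text.zfill(5)


# Toy column mixing floats, strings and a missing value.
zips = pd.Series([10039.0, "11372", np.nan, "10312"], dtype=object)
normalized = zips.map(normalize_zip)
print(normalized.tolist())
```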


#### Derivation of vehicle attributes

In [None]:
# set factor $i$ to unspecified if vehicle $i$ is not null and factor $i$ is null
# or
# set vehicle $i$ to unknown if vehicle $i$ is null and factor $i$ is not null

# are these operations disjoint?

In some rows of the dataset, **CONTRIBUTING FACTOR VEHICLE** is missing but the **VEHICLE TYPE CODE** with the same index is not. This suggests that the vehicle type is known, but the factor that contributed to the collision is not. To fill these missing values, the factor was set to *Unspecified*. This was done for every row and every vehicle index (1 to 5) where the condition above holds.

In [None]:
cp.imputation_with_ref_col(precollision, 'CONTRIBUTING FACTOR VEHICLE', 'VEHICLE TYPE CODE', 'Unspecified', 1, 5)

Likewise, in some rows of the dataset, **VEHICLE TYPE CODE** is missing but the **CONTRIBUTING FACTOR VEHICLE** with the same index is not. This suggests that the factor that contributed to the collision is known, but the vehicle type is not. To fill these missing values, the vehicle type was set to *UNKNOWN*. This was done for every row and every vehicle index (1 to 5) where the condition above holds.

In [None]:
cp.imputation_with_ref_col(precollision, 'VEHICLE TYPE CODE', 'CONTRIBUTING FACTOR VEHICLE', 'UNKNOWN', 1, 5)
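
The source of `cp.imputation_with_ref_col` is not shown in this notebook. The sketch below is a hypothetical stand-in, assuming that the helper fills a target column only where it is missing and the reference column with the same index is present. The function name and the toy data are illustrative.

```python
import numpy as np
import pandas as pd


def impute_with_ref_col(df, target_prefix, ref_prefix, fill_value, first, last):
    """Hypothetical stand-in: fill '<target> i' where it is null and '<ref> i' is not."""
    for i in range(first, last + 1):
        target, ref = f"{target_prefix} {i}", f"{ref_prefix} {i}"
        mask = df[target].isna() & df[ref].notna()
        df.loc[mask, target] = fill_value
    return df


toy = pd.DataFrame({
    "CONTRIBUTING FACTOR VEHICLE 1": [np.nan, "Unspecified", np.nan],
    "VEHICLE TYPE CODE 1": ["SEDAN", np.nan, np.nan],
})

impute_with_ref_col(toy, "CONTRIBUTING FACTOR VEHICLE", "VEHICLE TYPE CODE", "Unspecified", 1, 1)
impute_with_ref_col(toy, "VEHICLE TYPE CODE", "CONTRIBUTING FACTOR VEHICLE", "UNKNOWN", 1, 1)
print(toy)
```

Under this assumption the two operations never touch the same row: the first requires a missing factor with a known vehicle, the second a known factor with a missing vehicle, so the order of the calls does not matter. Rows where both values are missing are left untouched.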

In [None]:
precollision.isnull().sum()

COLLISION_ID                          0
CRASH DATE                            0
CRASH TIME                            0
BOROUGH                           19984
ZIP CODE                           5440
LOCATION                              0
STREET NAME                         226
CROSS STREET NAME                 60179
NUMBER OF PERSONS INJURED             2
NUMBER OF PERSONS KILLED              4
NUMBER OF PEDESTRIANS INJURED         0
NUMBER OF PEDESTRIANS KILLED          0
NUMBER OF CYCLIST INJURED             0
NUMBER OF CYCLIST KILLED              0
NUMBER OF MOTORIST INJURED            0
NUMBER OF MOTORIST KILLED             0
CONTRIBUTING FACTOR VEHICLE 1       288
CONTRIBUTING FACTOR VEHICLE 2     18852
CONTRIBUTING FACTOR VEHICLE 3    106550
CONTRIBUTING FACTOR VEHICLE 4    113540
CONTRIBUTING FACTOR VEHICLE 5    115123
VEHICLE TYPE CODE 1                 288
VEHICLE TYPE CODE 2               18852
VEHICLE TYPE CODE 3              106550
VEHICLE TYPE CODE 4              113540


<!-- comment on the reduction of missing values -->

Notice that, after both imputations, **CONTRIBUTING FACTOR VEHICLE** and **VEHICLE TYPE CODE** are missing only in the same rows: a value remains empty only when its counterpart with the same index is empty too. This is why the number of missing values in both attributes is the same for each vehicle index (for example, 288 for vehicle 1).

In [None]:
# how can the missing values of both fields be imputed at the same time?

#### Derivation of number of person attributes

In [None]:
# how can the missing number of persons be imputed? or is it better to drop those rows?

In [None]:
precollision[precollision['NUMBER OF PERSONS INJURED'].isnull() | precollision['NUMBER OF PERSONS KILLED'].isnull()]

Unnamed: 0,COLLISION_ID,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LOCATION,STREET NAME,CROSS STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5,DERIVATED ZIP
42087,4026403,2018-09-30T00:00:00Z,6:30,QUEENS,11368.0,"(40.749767, -73.86381)",102-21 ROOSEVELT AVENUE,,,,...,,,,,TAXI,,,,,"SimpleZipcode(zipcode='11372', zipcode_type='S..."
62406,4026219,2018-08-08T00:00:00Z,13:45,STATEN ISLAND,10312.0,"(40.562523, -74.19211)",1531 ARTHUR KILL ROAD,,1.0,,...,,,,,SEDAN,,,,,"SimpleZipcode(zipcode='10312', zipcode_type='S..."
69596,4026267,2018-08-23T00:00:00Z,8:45,MANHATTAN,10001.0,"(40.74528, -73.99254)",135 WEST 26 STREET,,0.0,,...,,,,,SEDAN,,,,,"SimpleZipcode(zipcode='10098', zipcode_type='S..."
90894,4026185,2018-07-23T00:00:00Z,12:50,BROOKLYN,11221.0,"(40.68603, -73.93268)",STUYVESANT AVENUE,PUTNAM AVENUE,,0.0,...,,,,,SEDAN,,,,,"SimpleZipcode(zipcode='11221', zipcode_type='S..."
114639,4026046,2018-06-05T00:00:00Z,5:19,BRONX,10459.0,"(40.826992, -73.88984)",1145 WEST FARMS ROAD,,1.0,,...,,,,,STATION WAGON/SPORT UTILITY VEHICLE,,,,,"SimpleZipcode(zipcode='10459', zipcode_type='S..."
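
One possible way to impute these values, offered only as a sketch: assume that the persons total equals the sum of the pedestrian, cyclist and motorist counts, which have no missing values. This relationship is an assumption that should be checked on the complete rows before it is relied on. The toy data is illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NUMBER OF PERSONS INJURED": [np.nan, 1.0],
    "NUMBER OF PEDESTRIANS INJURED": [0, 1],
    "NUMBER OF CYCLIST INJURED": [1, 0],
    "NUMBER OF MOTORIST INJURED": [0, 0],
})

parts = [
    "NUMBER OF PEDESTRIANS INJURED",
    "NUMBER OF CYCLIST INJURED",
    "NUMBER OF MOTORIST INJURED",
]

# Assumption: persons = pedestrians + cyclists + motorists.
total = toy[parts].sum(axis=1)

# Check the assumption on rows where the total is known.
known = toy["NUMBER OF PERSONS INJURED"].notna()
assumption_holds = (toy.loc[known, "NUMBER OF PERSONS INJURED"] == total[known]).all()

# Fill only the missing totals.
toy["NUMBER OF PERSONS INJURED"] = toy["NUMBER OF PERSONS INJURED"].fillna(total)
print(toy["NUMBER OF PERSONS INJURED"].tolist(), assumption_holds)
```

The same pattern would apply to **NUMBER OF PERSONS KILLED**. With so few affected rows, dropping them is also a reasonable alternative.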
