# Visualization Project

---
**Authors**:
-  *Juan P. Zaldivar E.*
-  *Enrique Millán X.*
---

## Introduction

This file contains the preprocessing required for the datasets. The datasets are:
- Collisions dataset.
- Weather dataset.
- New York Map.


<!-- explain that a visual exploration is done first and then the preprocessing? -->

### Dataset acquisition

The [*collision*](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) dataset was provided by the instructors of the project.

The *weather* dataset was obtained with the following steps:

- Visit the [NOAA Climate Data Online Search](https://www.ncdc.noaa.gov/cdo-web/search) web page.

- Select the following options:
  - `Weather Observation Type/Dataset -> Daily Summaries, Date Range -> 2018-01-01 to 2020-12-31, Search For -> Cities, Search Term -> New York City.`

- Look for "*New York, NY US*" and click ADD TO CART. Then click the cart in the top right corner.

- Select "*Custom GHCN-Daily CSV*" and the date range selected previously (2018-01-01 to 2020-12-31). We select more information than needed (to avoid disjoint downloads) and filter it later with ``Pandas`` and OpenRefine. Click continue.

- Fill in the three options, and select "*metric units*".

- Fill in all the remaining options and click continue. Some of these options will probably not be needed, but this is analyzed further when cleaning the datasets.

- Enter the email address where you want to receive the data so the order can start.

The *map* dataset was obtained following the next steps:

The datasets are located in the folder `Data/` and the results are saved in the folder `Data/Preprocessed/`. The following cells import the required libraries and load each dataset.

### Libraries

In [1]:
import os
import re
import pickle
import geopy as gp
import numpy as np
import pandas as pd
from Modules import preprocessing as pre

In [2]:
# !pip install numpy
# note the package versions

## Dataset preprocessing

The files were preprocessed with both OpenRefine and the proposed Python libraries in order to take advantage of both tools.

In [3]:
dir = './Data'
temp_pre = './Data/tmp_pre'
colission_exists = False

## Collision dataset preprocessing

### Data exploration?

The collision table contains information about individual crash events, with each row representing a distinct collision incident. The dataframe compiles details from all police-reported motor vehicle collisions in NYC.

<!-- ...give a brief description of the exploration? -->

The original dataset was first loaded into a ``Pandas`` dataframe in order to filter the desired range of dates. Given the size of the original dataset, this is a more efficient way to filter the data: the volume of data made processing in OpenRefine very slow and inefficient. After this initial filtering, the dataset was exported to a ``.csv`` file and loaded into OpenRefine.

In [4]:
if os.path.exists(f'{dir}/collisions_2018-2020.csv'):
    collision = pd.read_csv(f'{dir}/collisions_2018-2020.csv')
    colission_exists = True
else:
    collision = pd.read_csv(f'{dir}/collisions.csv')

collision.shape

(115740, 29)

In [5]:
collision.head()

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,2020-09-06,18:05,,,40.771038,-73.83413,"(40.771038, -73.83413)",WHITESTONE EXPRESSWAY,,,...,Following Too Closely,,,,4345591,Station Wagon/Sport Utility Vehicle,Motorcycle,,,
1,2020-09-20,9:14,,,40.722095,-73.77772,"(40.722095, -73.77772)",GRAND CENTRAL PKWY,,,...,Unspecified,,,,4459141,Sedan,Pick-up Truck,,,
2,2020-09-24,22:00,MANHATTAN,10039.0,40.824757,-73.94052,"(40.824757, -73.94052)",8 AVENUE,WEST 148 STREET,,...,Driver Inattention/Distraction,,,,4461437,Sedan,Motorscooter,,,
3,2020-08-06,5:30,,,,,,,,120 Huge Grant Circle,...,Unspecified,,,,4336560,Sedan,Station Wagon/Sport Utility Vehicle,,,
4,2020-08-01,17:16,BRONX,10462.0,40.840508,-73.85515,"(40.840508, -73.85515)",METROPOLITAN AVENUE,LINDEN DRIVE,,...,Unspecified,,,,4334713,Sedan,E-Bike,,,


If the filtered version does not exist, we proceed with the filtering. To restrict the data to the summers of 2018 and 2020 (June 1 to September 30), the first step is to convert the **CRASH DATE** attribute to a datetime type.

In [6]:
if not colission_exists:
    collision['CRASH DATE'] = pd.to_datetime(collision['CRASH DATE'])

    collision = collision[((collision['CRASH DATE'] >= '2018-06-01') & (collision['CRASH DATE'] <= '2018-09-30')) | ((collision['CRASH DATE'] >= '2020-06-01') & (collision['CRASH DATE'] <= '2020-09-30'))]
    collision.to_csv(f'{dir}/collisions_2018-2020.csv', index=False)

collision.shape

(115740, 29)
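
The same date filter can be written more compactly with `Series.between`, which is inclusive on both ends by default. The sketch below runs on a small made-up frame (the name `toy` and its values are assumptions for illustration, not part of the project data):

```python
import pandas as pd

# Toy frame standing in for the collision table (values are made up).
toy = pd.DataFrame({
    "CRASH DATE": ["2018-05-31", "2018-06-01", "2018-09-30", "2019-07-15", "2020-08-01"],
})
toy["CRASH DATE"] = pd.to_datetime(toy["CRASH DATE"])

# One boolean mask per summer; between() includes both endpoints.
summer_2018 = toy["CRASH DATE"].between("2018-06-01", "2018-09-30")
summer_2020 = toy["CRASH DATE"].between("2020-06-01", "2020-09-30")

filtered = toy[summer_2018 | summer_2020]
kept_dates = filtered["CRASH DATE"].dt.strftime("%Y-%m-%d").tolist()
print(kept_dates)
```

Note that an upper bound such as `'2018-09-30'` is interpreted as midnight, so this only keeps the whole last day because **CRASH DATE** carries no time component.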

After the filtering, the dataset was exported to a ``.csv`` file and loaded into OpenRefine. The procedure and reasoning taken in OpenRefine will be explained and justified in the present section.

### Data type conversion

The data conversion was done in OpenRefine. The **CRASH DATE** attribute was converted to a date type. The **COLLISION ID** and **CRASH TIME** were set as strings for the time being.

The attributes related to the geographic location of the collision were set as strings, with some special handling: all values were converted to uppercase and extra spaces (if any) were removed. This eases the work of the clustering method used to collectively inspect and edit cells whose values are not written correctly or consistently.
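
This normalization was done in OpenRefine. For reference, a rough pandas equivalent might look like the sketch below; the sample values are invented and the exact OpenRefine transforms used are not reproduced here.

```python
import pandas as pd

# Made-up street names with inconsistent case and spacing.
streets = pd.Series(["  grand  central pkwy ", "8 Avenue", None])

normalized = (
    streets.str.upper()                         # uniform case
    .str.strip()                                # drop leading/trailing spaces
    .str.replace(r"\s+", " ", regex=True)       # collapse repeated whitespace
)
print(normalized.tolist())
```

Missing values stay missing, since the `.str` accessor propagates them instead of raising.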

For the attributes referring to the number of persons involved in the collision, the data type was set to integer, since the values are discrete counts and are not expected to be negative.
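
When such a column is later read with pandas, any missing value forces a float dtype. A possible way to keep integer semantics, shown here only as a sketch on invented values, is the nullable `Int64` dtype:

```python
import pandas as pd

# Counts with one missing entry; pandas stores this as float64.
counts = pd.Series([0.0, 2.0, None])

# Nullable integer dtype keeps whole numbers and represents the gap as <NA>.
as_int = counts.astype("Int64")

# Sanity check matching the expectation stated above: no negative counts.
non_negative = bool((as_int.dropna() >= 0).all())
print(as_int.dtype, non_negative)
```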

The vehicle and factors attributes were left as strings for the time being. 

### Data selection and transformation

All of the following transformations were applied with OpenRefine, but some checks that justify the transformations are carried out in this section of the notebook.

In [7]:
precollision = pd.read_csv(f'{dir}/collisions_2018-2020_prepro_v1.csv')

#### Time attributes

In [8]:
precollision["CRASH DATE"] = pd.to_datetime(precollision["CRASH DATE"])
precollision["DAY NAME"] = precollision["CRASH DATE"].dt.day_name()
# precollision["TYPE OF DAY"] = np.where(precollision["DAY NAME"].isin(["Saturday", "Sunday"]), "Weekend", "Weekday")

# move DAY NAME to the second column
cols = precollision.columns.tolist()
cols = cols[:1] + cols[-1:] + cols[1:-1]
precollision = precollision[cols]

In [9]:
precollision['CRASH TIME'] = pd.to_datetime(precollision['CRASH TIME'], format='%H:%M').dt.time
precollision['CRASH TIME INTERVAL'] = precollision['CRASH TIME'].apply(lambda x: f"{x.hour:02}")
precollision.drop(columns=['CRASH TIME'], inplace=True)

def categorize_moment(hour):
    if 5 <= int(hour) < 12:
        return "Morning"
    elif 12 <= int(hour) < 18:
        return "Afternoon"
    else:
        return "Night"
    
precollision['CRASH MOMENT'] = precollision['CRASH TIME INTERVAL'].apply(categorize_moment)

# move TIME INTERVAL to the fourth column
cols = precollision.columns.tolist()
cols = cols[:3] + cols[-1:] + cols[3:-1]
precollision = precollision[cols]

# move CRASH MOMENT to the fifth column
cols = precollision.columns.tolist()
cols = cols[:4] + cols[-1:] + cols[4:-1]
precollision = precollision[cols]
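
The column reordering above is repeated three times with list slicing. A small helper can express the same idea more directly; `move_column` is a hypothetical name introduced here for illustration and is not part of the project's `Modules` package.

```python
import pandas as pd

def move_column(df, name, position):
    """Return a view of df's data with column `name` placed at index `position`."""
    cols = [c for c in df.columns if c != name]
    cols.insert(position, name)
    return df[cols]

# Toy frame: move the last column to the second position.
toy = pd.DataFrame({"A": [1], "B": [2], "C": [3], "D": [4]})
moved = move_column(toy, "D", 1)
print(list(moved.columns))
```

With such a helper, moving **DAY NAME** to the second column would read `move_column(precollision, "DAY NAME", 1)`.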

#### Geographical attributes

At first glance, **ON STREET NAME** and **OFF STREET NAME** seem to be the same attribute under different names. The description of the attributes on the dataset's website reads:

- **ON STREET NAME**: *Street on which the collision occurred*.
- **OFF STREET NAME**: *Street address if known*.

This suggests that both attributes contain approximately the same information. Furthermore, no row has both attributes filled, which makes merging them plausible.

In [10]:
collision[(collision['ON STREET NAME'].notnull()) & (collision['OFF STREET NAME'].notnull())].shape

(0, 29)

In [11]:
collision[(collision['ON STREET NAME'].notnull()) | (collision['OFF STREET NAME'].notnull())].shape

(115514, 29)

The attribute resulting from merging both columns is called **STREET NAME** and contains the street name or address where the collision occurred. Only the 226 rows in which both original attributes are empty (115740 - 115514) remain missing. Some rows have a more detailed description of the street, while others only have the name of the street.
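
The merge itself was performed in OpenRefine. As a sketch of the equivalent pandas operation, assuming (as checked above) that the two columns are never filled on the same row, `fillna` with the second column is enough. The toy rows below are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "ON STREET NAME": ["8 AVENUE", None, None],
    "OFF STREET NAME": [None, "120 HUGE GRANT CIRCLE", None],
})

# Take ON STREET NAME when present, otherwise fall back to OFF STREET NAME.
toy["STREET NAME"] = toy["ON STREET NAME"].fillna(toy["OFF STREET NAME"])

still_missing = int(toy["STREET NAME"].isnull().sum())
print(toy["STREET NAME"].tolist(), still_missing)
```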

<!-- This is not a problem, since the attribute will be used to join the dataset with the weather dataset, and the weather dataset only contains the name of the street. -->

**CROSS STREET NAME**, the third attribute related to the street environment, can be dropped since it is not useful for the intended analysis.

Similarly, **LOCATION** seems to contain the tuple (**LATITUDE**, **LONGITUDE**), so we could, a priori, remove the two extra attributes.

In [12]:
collision[(collision['LOCATION'].notnull()) & (collision['LATITUDE'].notnull()) & (collision['LONGITUDE'].notnull())].shape

(108073, 29)

The number of rows where the three attributes are not missing does not cover the total number of rows, but there are no rows where the **LOCATION** attribute is missing and at least one of the other two attributes is not missing.

In [13]:
collision[(collision['LOCATION'].isnull()) & (collision['LATITUDE'].isnull()) & (collision['LONGITUDE'].notnull()) | (collision['LATITUDE'].notnull()) & (collision['LONGITUDE'].isnull())].shape

(0, 29)

In [14]:
collision[(collision['LOCATION'].isnull())].shape

(7667, 29)

This leaves the remaining 7667 rows with missing values in all three attributes. The **LATITUDE** and **LONGITUDE** attributes can therefore be removed, since the **LOCATION** attribute contains the same information. With this transformation, the number of attributes is reduced by two.
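
One caveat on the check in cell 13: in Python, `&` binds more tightly than `|`, so a chain that mixes both without parentheses groups differently than it reads. A more explicit version of the intended test ("**LOCATION** is missing but at least one coordinate is present") could look like the sketch below, run here on invented rows rather than on the real data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LATITUDE": [40.77, np.nan, 40.72],
    "LONGITUDE": [-73.83, np.nan, -73.77],
    "LOCATION": ["(40.77, -73.83)", None, None],
})

# Name each condition so the grouping is unambiguous.
loc_missing = toy["LOCATION"].isnull()
has_coord = toy["LATITUDE"].notnull() | toy["LONGITUDE"].notnull()

inconsistent = toy[loc_missing & has_coord]
print(len(inconsistent))
```

On the toy frame the last row is flagged; on the project data the expectation is that no row is.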

The clustering was done with the *key collision* method and the *fingerprint* keying function. After a couple of iterations, no significant variations were found in the values of the attributes, although some misspellings were. The misspellings were corrected and the clustering was run again with the same results, which means that the values were already consistent. To verify the result, a nearest neighbor (NN) clustering was also run, and it did not find any significant variation either.

--> more
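
To make the idea behind key collision clustering concrete, here is a rough Python approximation of a fingerprint key: normalize case, accents and punctuation, then sort and deduplicate the tokens. It is a sketch of the general technique, not OpenRefine's exact implementation, and the function names are introduced here for illustration only.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Approximate fingerprint key: order- and punctuation-insensitive."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, split on whitespace, sort unique tokens.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

def key_collision_clusters(values):
    """Group distinct values that share the same fingerprint key."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]

clusters = key_collision_clusters(
    ["GRAND CENTRAL PKWY", "Grand  Central Pkwy.", "PKWY GRAND CENTRAL", "8 AVENUE"]
)
print(clusters)
```

Values that collide on the same key are candidates for merging; values such as `PKWY` versus `PARKWAY` do not collide and need a distance-based method instead.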

#### Vehicle attributes

Regarding vehicle information, the statement of the project specifies that only **VEHICLE TYPE CODE 1** is of interest to the visualization, so all the other vehicle codes (2-5) can be removed. With this, the contributing factors of the other vehicles can also be removed.

We have already seen that **VEHICLE TYPE CODE 1** has many classes, so we reduced the number of classes by clustering.

In [15]:
len(collision['VEHICLE TYPE CODE 1'].unique())

361

Clustering the **VEHICLE TYPE CODE 1** attribute (key collision method with the fingerprint keying function) revealed many misspellings and inconsistencies. The clustering was done iteratively, correcting the misspellings and inconsistencies found in each iteration. After a couple of iterations, the number of classes was reduced from 361 to ...

In [16]:
precollision['VEHICLE TYPE CODE 1'] = precollision['VEHICLE TYPE CODE 1'].replace(
['SUV', 'FLAT', '3-DOOR', 'CHEVY EXPR', 'PC', 'ELEC. UNIC', 'E REVEL SC','F150XL PIC', '2- TO', 'NEW Y', 'STREE', 'RGS', 'OMR', 'DEMA-', 'BK', 'NYPD'], 'CAR')

precollision['VEHICLE TYPE CODE 1'] = precollision['VEHICLE TYPE CODE 1'].replace(
    ['99999'], 'UNKNOWN')

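# Assign the result back instead of calling fillna(inplace=True) on a column selection
precollision['VEHICLE TYPE CODE 1'] = precollision['VEHICLE TYPE CODE 1'].fillna('UNKNOWN')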

precollision['VEHICLE TYPE CODE 1'] = precollision['VEHICLE TYPE CODE 1'].replace(
    ['BULK AGRICULTURE', 'PK', 'TANK', 'SLINGSHOT', 'UTV', 'JOHN DEERE', '1C', 'STAK', 'PALLET', 'SPRIN', 'ACCES'], 'OTHERS')

precollision['VEHICLE TYPE CODE 1'] = precollision['VEHICLE TYPE CODE 1'].replace(
    ['BOX', 'DOT EQUIPM', 'DRILL RIG', 'PAS', 'LOADE', 'SGWS', 'HEAVY'], 'TRUCK')

precollision['VEHICLE TYPE CODE 1'] = precollision['VEHICLE TYPE CODE 1'].replace(
    ['MOTORIZED HOME', 'CHASSIS CAB', 'SWT', 'MESSAGE SI', 'RV', 'UHAUL', 'POSTO'], 'VAN')

precollision['VEHICLE TYPE CODE 1'] = precollision['VEHICLE TYPE CODE 1'].replace(
    ['MOPED', 'J1'], 'MOTORCYCLE')

precollision['VEHICLE TYPE CODE 1'] = precollision['VEHICLE TYPE CODE 1'].replace(
    ['SANIT'], 'AMBULANCE')
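
The repeated `replace` calls above could also be driven by a single dictionary, which keeps the whole recoding in one place. The sketch below uses only a handful of the pairs from the cell above; `VEHICLE_CLASS` is a name introduced here for illustration.

```python
import pandas as pd

# Abbreviated mapping (raw label -> consolidated class), taken from the cell above.
VEHICLE_CLASS = {
    "SUV": "CAR",
    "PC": "CAR",
    "99999": "UNKNOWN",
    "MOPED": "MOTORCYCLE",
    "RV": "VAN",
}

# Made-up sample of raw labels, including a missing one.
vehicles = pd.Series(["SUV", "MOPED", "TAXI", None, "99999"])

# Labels absent from the mapping are left untouched; gaps become UNKNOWN.
cleaned = vehicles.replace(VEHICLE_CLASS).fillna("UNKNOWN")
print(cleaned.tolist())
```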

The resulting classes of the **VEHICLE TYPE CODE 1** attribute are:

In [17]:
len(precollision['VEHICLE TYPE CODE 1'].unique()), precollision['VEHICLE TYPE CODE 1'].unique()

(13,
 array(['VAN', 'CAR', 'BUS', 'TRACTOR', 'TAXI', 'SCOOTER', 'TRUCK',
        'MOTORCYCLE', 'BICYCLE', 'OTHERS', 'AMBULANCE', 'FDNY', 'UNKNOWN'],
       dtype=object))

In [18]:
precollision['VEHICLE TYPE CODE 1'].value_counts()

VEHICLE TYPE CODE 1
CAR           54690
VAN           41651
TRUCK          6684
TAXI           4806
BICYCLE        2083
BUS            1430
MOTORCYCLE     1275
TRACTOR        1098
UNKNOWN         739
AMBULANCE       470
OTHERS          414
SCOOTER         324
FDNY             76
Name: count, dtype: int64

A similar strategy was applied to the **CONTRIBUTING FACTOR VEHICLE 1** attribute. ...

--> more

In [19]:
len(precollision['CONTRIBUTING FACTOR VEHICLE 1'].unique()), precollision['CONTRIBUTING FACTOR VEHICLE 1'].unique()

(56,
 array(['Unsafe Lane Changing', 'Driver Inattention/Distraction',
        'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion',
        'Unspecified', 'Failure to Yield Right-of-Way',
        'Turning Improperly', 'Traffic Control Disregarded',
        'Other Vehicular', 'Unsafe Speed',
        'Passing or Lane Usage Improper', 'Driver Inexperience',
        'Reaction to Uninvolved Vehicle', 'Following Too Closely',
        'Aggressive Driving/Road Rage', 'Passing Too Closely',
        'Oversized Vehicle', 'Alcohol Involvement', 'Backing Unsafely',
        'Steering Failure', 'View Obstructed/Limited', 'Fell Asleep',
        'Brakes Defective', 'Pavement Defective', 'Pavement Slippery',
        'Failure to Keep Right', 'Cell Phone (hand-Held)', 'Glare',
        'Passenger Distraction', nan, 'Driverless/Runaway Vehicle',
        'Obstruction/Debris', 'Animals Action',
        'Traffic Control Device Improper/Non-Working',
        'Tire Failure/Inadequate', 'Outside Car Distracti

#### Number of persons attributes

For visualization purposes, the differentiation between **PERSONS**, **PEDESTRIANS**, **CYCLISTS** and **MOTORISTS** (**INJURED/KILLED**) is irrelevant. A more useful attribute is the total number of persons involved in the collision. This can be obtained by summing the four attributes, provided that the **PERSONS** attribute is not already the sum of the other three.

This condition had to be checked because the documentation of the dataset was not precise enough for us to determine whether **NUMBER OF PERSONS INJURED/KILLED** was an aggregate of the other three columns.

*Note: The metadata information available in the web of the dataset was: "Number of persons injured/killed" regarding the **NUMBER OF PERSONS INJURED/KILLED**.*

In [20]:
collision['NUMBER OF PERSONS INJURED'].equals(collision['NUMBER OF PEDESTRIANS INJURED'] + collision['NUMBER OF CYCLIST INJURED'] + collision['NUMBER OF MOTORIST INJURED'])

False

In [21]:
collision['NUMBER OF PERSONS INJURED'].equals(collision['NUMBER OF PEDESTRIANS INJURED'])


False

As the checks above show, **NUMBER OF PERSONS INJURED/KILLED** does not match the sum of the other three attributes on every row. Furthermore, the persons and pedestrians columns are not equal, although one could have thought that the term persons was used to refer to pedestrians.

Based on this, the discrete attributes referring to the injured people were summed to obtain **NUMBER OF INJURED** and the discrete attributes referring to the killed people were summed to obtain **NUMBER OF KILLED**. The **NUMBER OF _ INJURED/KILLED** attributes were removed.
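
A caveat about `Series.equals`: it returns `False` when the two series have different dtypes, even if every value matches, and it says nothing about how many rows differ. A row-wise comparison is more informative. The sketch below uses invented values and hypothetical variable names to show the difference.

```python
import pandas as pd

# Float column (as pandas produces when the column has gaps elsewhere)
# versus an integer sum of component columns; the values agree row by row.
persons = pd.Series([1.0, 0.0, 2.0])
parts_sum = pd.Series([1, 0, 2])

# equals() is strict about dtype, so this is False despite matching values.
same_object = persons.equals(parts_sum)

# Element-wise comparison counts the rows that actually differ.
# Rows containing NaN would count as different here.
n_mismatch = int(persons.ne(parts_sum).sum())
print(same_object, n_mismatch)
```

Counting mismatching rows on the real columns would show whether the `False` results above reflect many differing rows or only a few.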

#### OpenRefine results

In [22]:
precollision.head()

Unnamed: 0,COLLISION_ID,DAY NAME,CRASH DATE,CRASH MOMENT,CRASH TIME INTERVAL,BOROUGH,ZIP CODE,LOCATION,STREET NAME,NUMBER OF INJURED,NUMBER OF KILLED,CONTRIBUTING FACTOR VEHICLE 1,VEHICLE TYPE CODE 1
0,4345591,Sunday,2020-09-06 00:00:00+00:00,Night,18,,,"(40.771038, -73.83413)",WHITESTONE EXPRESSWAY,0.0,2.0,Unsafe Lane Changing,VAN
1,4459141,Sunday,2020-09-20 00:00:00+00:00,Morning,9,,,"(40.722095, -73.77772)",GRAND CENTRAL PKWY,0.0,0.0,Driver Inattention/Distraction,CAR
2,4461437,Thursday,2020-09-24 00:00:00+00:00,Night,22,MANHATTAN,10039.0,"(40.824757, -73.94052)",8 AVENUE,0.0,0.0,Pedestrian/Bicyclist/Other Pedestrian Error/Co...,CAR
3,4336560,Thursday,2020-08-06 00:00:00+00:00,Morning,5,,,,120 HUGE GRANT CIRCLE,2.0,0.0,Unspecified,CAR
4,4334713,Saturday,2020-08-01 00:00:00+00:00,Afternoon,17,BRONX,10462.0,"(40.840508, -73.85515)",METROPOLITAN AVENUE,0.0,2.0,Failure to Yield Right-of-Way,CAR


--> summary

### Missing values

The existence of some missing values has already been mentioned. In the previous section, missing values were counted with the ``.isnull()`` method of ``Pandas``. In pandas, ``.isnull()`` is an alias of ``.isna()``, so both detect ``NaN`` and ``None`` alike; the comparison below is a sanity check that the two counts agree.

In [23]:
comp = (precollision.isnull().sum() == precollision.isna().sum())
comp[comp == False]

Series([], dtype: bool)

As expected, every missing value in the dataset is detected by both ``.isnull()`` and ``.isna()``. The attributes with missing values can be grouped into three separate clusters:
- Geographical attributes
- Injured/Killed attributes
- Vehicle attributes

In [24]:
precollision.isnull().sum()

COLLISION_ID                         0
DAY NAME                             0
CRASH DATE                           0
CRASH MOMENT                         0
CRASH TIME INTERVAL                  0
BOROUGH                          40671
ZIP CODE                         40686
LOCATION                          7667
STREET NAME                        226
NUMBER OF INJURED                    2
NUMBER OF KILLED                     4
CONTRIBUTING FACTOR VEHICLE 1      351
VEHICLE TYPE CODE 1                  0
dtype: int64

#### Imputation of geographic attributes

The first cluster consists of the geographical attributes. The attributes in this cluster are:
- **BOROUGH**
- **ZIP CODE**
- **LOCATION**
- **STREET NAME**
- **CROSS STREET NAME**

Notice that the attribute with the fewest missing values is **STREET NAME**, with only $0.20\%$ of the entire dataset, partly thanks to the merge of the **ON STREET NAME** and **OFF STREET NAME** attributes in the previous sections. We used this information to impute the missing coordinates of the **LOCATION** attribute and then the rest of the attributes sequentially.


In [25]:
precollision['LOCATION'].isnull().sum(), precollision['STREET NAME'].isnull().sum()

(7667, 226)

In [26]:
precollision[(precollision['LOCATION'].isnull()) & (precollision['STREET NAME'].isnull())].shape[0]

67

In [27]:
street_unique = precollision[precollision['LOCATION'].isnull()]['STREET NAME'].unique()
len(street_unique)

1897

In [28]:
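from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="GEoNYC")

if os.path.exists(f'{dir}/locations.pkl'):
    # Reuse cached geocoding results to avoid repeating the requests
    with open(f'{dir}/locations.pkl', 'rb') as f:
        locations = pickle.load(f)
else:
    locations = []
    for street in street_unique:
        location = None
        # Missing street names (NaN) are not strings and cannot be geocoded
        if isinstance(street, str) and street:
            try:
                location = geolocator.geocode(street.capitalize() + ', New York City, USA')
            except Exception:
                # Treat any geocoding failure (timeout, service error) as "not found"
                location = None
        # Always append so that locations stays aligned with street_unique
        locations.append(location)
    with open(f'{dir}/locations.pkl', 'wb') as f:
        pickle.dump(locations, f)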

In [None]:
count = 0
for location in locations:
    if location is None:
        count += 1

(count, len(locations))

(603, 1897)

In [None]:
print(locations[990])

West Drive, Brooklyn, Kings County, City of New York, New York, 11215, United States


In [None]:
borough_values = [str(b).capitalize() for b in precollision['BOROUGH'].unique()]
borough_values = borough_values[1:]
borough_values

['Manhattan', 'Bronx', 'Queens', 'Brooklyn', 'Staten island']
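
One detail of the cell above is worth checking: `str.capitalize()` lowercases everything after the first character, which is why the list shows `'Staten island'`. If the geocoder spells the borough as "Staten Island", a substring test against the address would not match that form. A small sketch, using a made-up address in the same format as the geocoder output printed earlier:

```python
# Hypothetical address string, invented for illustration.
address = ("Victory Boulevard, Staten Island, Richmond County, "
           "City of New York, New York, 10301, United States")

capitalized = "STATEN ISLAND".capitalize()   # only the first letter upper-cased
titled = "STATEN ISLAND".title()             # every word capitalized

found_with_capitalize = capitalized in address
found_with_title = titled in address
print(capitalized, found_with_capitalize)
print(titled, found_with_title)
```

Using `title()` when building `borough_values` would avoid this mismatch for two-word borough names.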

In [None]:
zipcodes = []; boroughs = []; coord = []

for location in locations:
    if location:
        zipcode = re.findall(r'\d{5}', location.address)
        if zipcode:
            zipcodes.append(zipcode[0])
        else:
            zipcodes.append(np.nan)
        
        coord.append('(' + str(location.latitude) + ', ' + str(location.longitude) + ')')

        borough = None
        for b in borough_values:
            if b in location.address:
                borough = b
                boroughs.append(borough.upper())
                break
        if borough is None:
            boroughs.append(np.nan)
    else:
        zipcodes.append(np.nan)
        coord.append(np.nan)
        boroughs.append(np.nan)

In [None]:
len(zipcodes), len(coord), len(boroughs)

(1897, 1897, 1897)

In [None]:
# find if this coordinates 43.06842525, -74.32578185 are in coord

for c in coord:
    if c is not np.nan:
        if c == '(43.06842525, -74.32578185)':
            print(c)

(43.06842525, -74.32578185)


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def most_similar_word(query_word, word_list):
    # Create a list of vectors, each representing a word
    word_vectors = []
    for word in word_list:
        word_vector = [1 if c in word else 0 for c in query_word]
        word_vectors.append(word_vector)

    # Convert the list of vectors to a numpy array
    word_vectors = np.array(word_vectors)

    try:
        idx = word_list.index(query_word)
    except:
        return None
    # Compute cosine similarity between the query word and each word in the list
    similarities = cosine_similarity([word_vectors[idx]], word_vectors)[0]

    # Find the index of the most similar word
    most_similar_index = np.argmax(similarities)

    return word_list[most_similar_index]
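
Note that `most_similar_word` as written looks up `query_word` inside `word_list` and returns `None` when it is absent, so it can only ever match a word to itself or to another word with an identical character vector. If approximate matching of street names is the goal, one alternative (swapped in here; it is not the cosine similarity approach above) is `difflib.get_close_matches` from the standard library. The helper name and the cutoff value are assumptions.

```python
import difflib

def closest_street(query, candidates, cutoff=0.8):
    """Return the candidate most similar to `query`, or None if none is close enough."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

streets = ["GRAND CENTRAL PKWY", "WHITESTONE EXPRESSWAY", "8 AVENUE"]

near = closest_street("GRAND CENTRAL PARKWAY", streets)
none_found = closest_street("ZZZ", streets)
print(near, none_found)
```

The cutoff trades recall for precision; a lower value matches more spellings but also risks pairing unrelated streets.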

In [None]:
# # Iterate over the rows of the DataFrame
# for index, row in precollision.iterrows():
#     street_name = row['STREET NAME']
#     location = row['LOCATION']
#     zip_code = row['ZIP CODE']

#     street = most_similar_word(str(street_name), street_unique)
#     if street:
#         street_index = street_unique.index(street)

#         if pd.isnull(location):
#             precollision.at[index, 'LOCATION'] = coord[street_index]
#         if pd.isnull(zip_code):
#             precollision.at[index, 'ZIP CODE'] = zipcodes[street_index]
#         if pd.isnull(row['BOROUGH']):
#             precollision.at[index, 'BOROUGH'] = boroughs[street_index]

In [None]:
# Iterate over the rows of the DataFrame
for index, row in precollision.iterrows():
    street_name = row['STREET NAME']
    location = row['LOCATION']
    zip_code = row['ZIP CODE']

    if street_name and street_name in street_unique:
        street_index = np.where(street_unique == street_name)[0][0]
    
        # Check if 'LOCATION' is NaN
        if pd.isnull(location):
            precollision.at[index, 'LOCATION'] = coord[street_index]

        if pd.isnull(zip_code):
            precollision.at[index, 'ZIP CODE'] = float(zipcodes[street_index])

        if pd.isnull(row['BOROUGH']):
            precollision.at[index, 'BOROUGH'] = boroughs[street_index]
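
The row-by-row loop above searches `street_unique` once per row, which is slow on more than 100,000 rows. A vectorized variant, offered as a sketch under the assumption that `street_unique`, `coord`, `zipcodes` and `boroughs` are aligned lists as built earlier, is to turn them into a lookup table and use `map` with `fillna`. All values below are invented stand-ins.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's objects (all values are made up).
street_unique = np.array(["WEST DRIVE", "8 AVENUE"], dtype=object)
coord = ["(40.66, -73.97)", np.nan]
zipcodes = ["11215", np.nan]
boroughs = ["BROOKLYN", np.nan]

toy = pd.DataFrame({
    "STREET NAME": ["WEST DRIVE", "8 AVENUE", "WEST DRIVE"],
    "LOCATION": [None, None, "(40.70, -73.90)"],
    "ZIP CODE": [np.nan, np.nan, 11201.0],
    "BOROUGH": [None, None, "BROOKLYN"],
})

# One row per geocoded street, indexed by street name.
lookup = pd.DataFrame(
    {"LOCATION": coord, "ZIP CODE": pd.to_numeric(zipcodes), "BOROUGH": boroughs},
    index=street_unique,
)

# Fill only the gaps; existing values are never overwritten.
for column in ["LOCATION", "ZIP CODE", "BOROUGH"]:
    toy[column] = toy[column].fillna(toy["STREET NAME"].map(lookup[column]))

print(toy)
```

Streets that the geocoder could not resolve stay missing, exactly as in the loop version.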

In [None]:
precollision['LOCATION'].isnull().sum(), precollision['ZIP CODE'].isnull().sum(), precollision['BOROUGH'].isnull().sum()

(2248, 16137, 17658)

#### Imputation of vehicle attributes

In [None]:
# set factor $i$ to unspecified if vehicle $i$ is not null and factor $i$ is null
# or
# set vehicle $i$ to unknown if vehicle $i$ is null and factor $i$ is not null

# are these operations disjoint?

In some rows of the dataset, the **CONTRIBUTING FACTOR VEHICLE** is missing but the **VEHICLE TYPE CODE** is not. This suggests that the vehicle type is known, but the factor that contributed to the collision is not. To fill these missing values, the factor was set to *Unspecified*. This was done for every row where the **CONTRIBUTING FACTOR VEHICLE** was missing under the above condition.

In [None]:
pre.imputation_with_ref_col(precollision, 'CONTRIBUTING FACTOR VEHICLE 1', 'VEHICLE TYPE CODE 1', 'Unspecified')

Likewise, in some rows of the dataset, the **VEHICLE TYPE CODE** is missing but the **CONTRIBUTING FACTOR VEHICLE** is not. This suggests that the factor that contributed to the collision is known, but the vehicle type is not. To fill these missing values, the vehicle type was set to *UNKNOWN*. This was done for every row where the **VEHICLE TYPE CODE** was missing under the above condition.

In [None]:
pre.imputation_with_ref_col(precollision, 'VEHICLE TYPE CODE 1', 'CONTRIBUTING FACTOR VEHICLE 1', 'UNKNOWN')
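
The helper `pre.imputation_with_ref_col` lives in the project's `Modules` package and its source is not shown in this notebook. The sketch below is only a guess at the behaviour described in the text (fill the target column where it is missing and the reference column is present); the function name `impute_with_ref_col_sketch` and its return value are assumptions.

```python
import pandas as pd

def impute_with_ref_col_sketch(df, target, ref, value):
    """Hypothetical stand-in: fill `target` with `value` where `target` is
    missing and `ref` is present. Modifies df in place, returns rows filled."""
    mask = df[target].isnull() & df[ref].notnull()
    df.loc[mask, target] = value
    return int(mask.sum())

# Invented rows: factor missing, vehicle missing, both missing.
toy = pd.DataFrame({
    "CONTRIBUTING FACTOR VEHICLE 1": [None, "Unsafe Speed", None],
    "VEHICLE TYPE CODE 1": ["CAR", None, None],
})

n_factor = impute_with_ref_col_sketch(
    toy, "CONTRIBUTING FACTOR VEHICLE 1", "VEHICLE TYPE CODE 1", "Unspecified")
n_vehicle = impute_with_ref_col_sketch(
    toy, "VEHICLE TYPE CODE 1", "CONTRIBUTING FACTOR VEHICLE 1", "UNKNOWN")
print(n_factor, n_vehicle)
print(toy)
```

Under this reading the two operations touch disjoint rows, since the first requires the factor to be missing and the second requires it to be present; rows where both columns are missing are left untouched by both.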

In [None]:
precollision.isnull().sum()

COLLISION_ID                         0
DAY NAME                             0
CRASH DATE                           0
CRASH MOMENT                         0
CRASH TIME INTERVAL                  0
BOROUGH                          17658
ZIP CODE                         16137
LOCATION                          2248
STREET NAME                        226
NUMBER OF INJURED                    2
NUMBER OF KILLED                     4
CONTRIBUTING FACTOR VEHICLE 1        0
VEHICLE TYPE CODE 1                  0
dtype: int64

<!-- comment on the reduction of missing values -->

After both imputations, the only rows that could still have missing values in **CONTRIBUTING FACTOR VEHICLE** or **VEHICLE TYPE CODE** are those where both attributes are missing, so the number of missing values in the two attributes is the same. For vehicle 1, both counts are 0.

In [None]:
# how can the missing values of both fields be imputed at once?

# code 1 as unspecified/unknown
# remaining codes as non-existent

#### Imputation of number of person attributes

In [None]:
# how should the missing values of the number of persons be imputed? or is it better to drop them?

In [None]:
precollision.columns

Index(['COLLISION_ID', 'DAY NAME', 'CRASH DATE', 'CRASH MOMENT',
       'CRASH TIME INTERVAL', 'BOROUGH', 'ZIP CODE', 'LOCATION', 'STREET NAME',
       'NUMBER OF INJURED', 'NUMBER OF KILLED',
       'CONTRIBUTING FACTOR VEHICLE 1', 'VEHICLE TYPE CODE 1'],
      dtype='object')

In [None]:
precollision[precollision['NUMBER OF INJURED'].isnull() | precollision['NUMBER OF KILLED'].isnull()]

Unnamed: 0,COLLISION_ID,DAY NAME,CRASH DATE,CRASH MOMENT,CRASH TIME INTERVAL,BOROUGH,ZIP CODE,LOCATION,STREET NAME,NUMBER OF INJURED,NUMBER OF KILLED,CONTRIBUTING FACTOR VEHICLE 1,VEHICLE TYPE CODE 1
42087,4026403,Sunday,2018-09-30 00:00:00+00:00,Morning,6,QUEENS,11368.0,"(40.749767, -73.86381)",102-21 ROOSEVELT AVENUE,,,Unspecified,TAXI
62406,4026219,Wednesday,2018-08-08 00:00:00+00:00,Afternoon,13,STATEN ISLAND,10312.0,"(40.562523, -74.19211)",1531 ARTHUR KILL ROAD,2.0,,Lost Consciousness,CAR
69596,4026267,Thursday,2018-08-23 00:00:00+00:00,Morning,8,MANHATTAN,10001.0,"(40.74528, -73.99254)",135 WEST 26 STREET,0.0,,Driver Inattention/Distraction,CAR
90894,4026185,Monday,2018-07-23 00:00:00+00:00,Afternoon,12,BROOKLYN,11221.0,"(40.68603, -73.93268)",STUYVESANT AVENUE,,0.0,Unspecified,CAR
114639,4026046,Tuesday,2018-06-05 00:00:00+00:00,Morning,5,BRONX,10459.0,"(40.826992, -73.88984)",1145 WEST FARMS ROAD,2.0,,Unspecified,VAN


Since the rows with a missing number of injured or killed persons are a tiny fraction of the dataset, we decided to set those values to 0, assuming that nobody was injured or killed in those collisions.

Given how few rows are affected (5 of 115740, under $0.01\%$), the final visualization should not change noticeably whether the missing values are set to 0 or the rows are dropped. We chose to set them to 0 in order to keep the rows and not lose information.

In [None]:
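# Assign the result back instead of calling fillna(inplace=True) on a column selection
precollision['NUMBER OF INJURED'] = precollision['NUMBER OF INJURED'].fillna(0)
precollision['NUMBER OF KILLED'] = precollision['NUMBER OF KILLED'].fillna(0)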

In [None]:
precollision.isnull().sum()

COLLISION_ID                         0
DAY NAME                             0
CRASH DATE                           0
CRASH MOMENT                         0
CRASH TIME INTERVAL                  0
BOROUGH                          17658
ZIP CODE                         16137
LOCATION                          2248
STREET NAME                        226
NUMBER OF INJURED                    0
NUMBER OF KILLED                     0
CONTRIBUTING FACTOR VEHICLE 1        0
VEHICLE TYPE CODE 1                  0
dtype: int64

### Save the results

In [None]:
precollision.to_csv(f'{dir}/collisions_clean.csv', index=False)

In [None]:
precollision.columns

Index(['COLLISION_ID', 'DAY NAME', 'CRASH DATE', 'CRASH MOMENT',
       'CRASH TIME INTERVAL', 'BOROUGH', 'ZIP CODE', 'LOCATION', 'STREET NAME',
       'NUMBER OF INJURED', 'NUMBER OF KILLED',
       'CONTRIBUTING FACTOR VEHICLE 1', 'VEHICLE TYPE CODE 1'],
      dtype='object')