# Pre-Processing and Feature Engineering<a id='Pre-Processing_and_Feature_Engineering'></a>

### 1 Table of Contents<a id='Contents'></a>
* [Pre-Processing and Feature Engineering](#Pre-Processing_and_Feature_Engineering)
  * [1 Contents](#Contents)
  * [2 Introduction](#2_Introduction)
  * [3 Imports](#3_Imports)
  * [4 Load Data](#4_Load_Data)
  * [5 Creating the Tensor](#5_Creating_the_Tensor)
  * [6 Split the Data](#6_Split_the_Data)
  * [7 Saving as an H5PY File](#7_Saving_as_an_H5PY_File)
  * [8 Baseline Performance](#8_Baseline_Performance)
  * [9 Conclusion](#9_Conclusion)

### 2 Introduction<a id='2_Introduction'></a>

In the last notebook, we settled on the spatial and temporal resolution of the tensors. For spatial resolution, a 50x30 grid will be used. Traffic volume data from the Los Angeles Department of Transportation will serve as one of the tensor layers. The other layers will be made up of weather variables such as temperature, visibility, and cloud cover. The target is a 1 or 0 for each grid cell, representing whether an accident occurred or not.

In this notebook, we will build the complete tensor of weather, traffic, and collision data. The tensor will then be split into test and train sets and a baseline performance for the model will be determined.

### 3 Imports<a id='3_Imports'></a>

In [1]:
import warnings
warnings.simplefilter('ignore')
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
import requests
import h5py
import torch
import googlemaps
from library.sb_utils import save_file
import os
import math
import csv

### 4 Load Data<a id='4_Load_Data'></a>

In [2]:
LA_collisions = pd.read_csv('../Data/LA_collisions.csv', index_col = 'Unnamed: 0')
LA_weather = pd.read_csv('../Data/LA_weather_cleaned.csv', index_col = 'Unnamed: 0')
LA_traffic = pd.read_csv('../Data/LA_traffic.csv', index_col = 'Unnamed: 0')
openWeather_api_key = pd.read_json('../credentials.json', typ='series')['openWeather_api_key']
maps_api_key = pd.read_json('../credentials.json', typ='series')['maps_api_key']

### 5 Creating the Tensor <a id='5_Creating_the_Tensor'></a>

The unmerged dataframes will be used to fill the tensor. The unmerged weather dataframe, which contains most of the input features, has hourly records of weather features dating back to 2006. The other input feature is the count data in the traffic dataframe, and the output feature is the presence of a collision in the collisions dataframe. First, let's look at the columns in each of our dataframes.

In [3]:
LA_collisions.columns

Index(['primary_road', 'secondary_road', 'intersection', 'side_of_highway',
       'severity', 'type', 'pedestrian', 'bicycle', 'motorcycle', 'truck',
       'same_day_crashes', 'same_road_crashes', 'latitude', 'longitude',
       'datetime'],
      dtype='object')

In [4]:
LA_weather.columns

Index(['dt', 'temp', 'visibility', 'dew_point', 'temp_min', 'temp_max',
       'pressure', 'humidity', 'wind_speed', 'wind_deg', 'wind_gust',
       'rain_1h', 'rain_3h', 'clouds_all', 'datetime'],
      dtype='object')

In [5]:
LA_traffic.columns

Index(['primary_road', 'secondary_road', 'lat', 'lon', 'Total'], dtype='object')

Before we use the collision coordinates to create our grid boundaries, we need to make sure there are no erroneous locations (which we've already seen in previous notebooks).

In [6]:
LA_collisions[LA_collisions['latitude'] > 35]

Unnamed: 0,primary_road,secondary_road,intersection,side_of_highway,severity,type,pedestrian,bicycle,motorcycle,truck,same_day_crashes,same_road_crashes,latitude,longitude,datetime


In [7]:
LA_collisions[LA_collisions['latitude'] < 33].head(5)

| Unnamed: 0 | primary_road | secondary_road | intersection | side_of_highway | severity | type | pedestrian | bicycle | motorcycle | truck | same_day_crashes | same_road_crashes | latitude | longitude | datetime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300034 | HOOVER | JEFFERSON | 0.0 | Not Available | Not Available | Not Available | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 2021-02-28 18:00:00 |
| 302523 | RIVERSIDE | FULTON | 0.0 | Not Available | Not Available | Not Available | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 2020-04-16 18:00:00 |
| 302541 | VENTURA BL | LA MAIDA ST | 0.0 | Not Available | Not Available | Not Available | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 2020-04-14 15:00:00 |
| 302566 | VENTURA BL | DONNA AV | 0.0 | Not Available | Not Available | Not Available | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 2020-04-16 13:00:00 |
| 302671 | VAN NUYS BL | HUSTON ST | 0.0 | Not Available | Not Available | Not Available | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 2020-04-16 15:00:00 |


In [8]:
LA_collisions[LA_collisions['longitude'] < -119]

Unnamed: 0,primary_road,secondary_road,intersection,side_of_highway,severity,type,pedestrian,bicycle,motorcycle,truck,same_day_crashes,same_road_crashes,latitude,longitude,datetime


In [9]:
LA_collisions[LA_collisions['longitude'] > -117].head(5)

| Unnamed: 0 | primary_road | secondary_road | intersection | side_of_highway | severity | type | pedestrian | bicycle | motorcycle | truck | same_day_crashes | same_road_crashes | latitude | longitude | datetime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300034 | HOOVER | JEFFERSON | 0.0 | Not Available | Not Available | Not Available | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 2021-02-28 18:00:00 |
| 302523 | RIVERSIDE | FULTON | 0.0 | Not Available | Not Available | Not Available | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 2020-04-16 18:00:00 |
| 302541 | VENTURA BL | LA MAIDA ST | 0.0 | Not Available | Not Available | Not Available | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 2020-04-14 15:00:00 |
| 302566 | VENTURA BL | DONNA AV | 0.0 | Not Available | Not Available | Not Available | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 2020-04-16 13:00:00 |
| 302671 | VAN NUYS BL | HUSTON ST | 0.0 | Not Available | Not Available | Not Available | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 2020-04-16 15:00:00 |


It looks like the erroneous locations are all at (0, 0). We used the geocoding API quite a bit in the last notebook, so, to limit cost, these rows will be dropped rather than re-geocoded.

In [10]:
LA_collisions = LA_collisions[LA_collisions['latitude'] > 0]
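
The four checks above can also be combined into a single bounding-box mask. This is only a minimal sketch: it reuses the rough thresholds from the cells above (latitude 33 to 35, longitude -119 to -117), which are not an official city boundary, and `demo`, `in_box`, and `demo_clean` are hypothetical names on toy data.

```python
import pandas as pd

# Hypothetical toy data: two plausible LA points and one (0, 0) placeholder.
demo = pd.DataFrame({
    'latitude': [34.05, 0.0, 33.95],
    'longitude': [-118.25, 0.0, -118.40],
})

# Rough bounding box taken from the thresholds checked above (an assumption).
in_box = (
    demo['latitude'].between(33, 35)
    & demo['longitude'].between(-119, -117)
)

# .copy() keeps later column assignments from writing into a slice of demo.
demo_clean = demo[in_box].copy()
print(demo_clean)
```

A mask like this would also catch bad coordinates that are not exactly (0, 0), which the `latitude > 0` filter would let through.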

The weather factors that are important to the model are temp, visibility, dew point, pressure, humidity, wind speed, wind gust, rain 1h, rain 3h, and clouds all. That's ten weather variables, plus the traffic flow data and the target variable (collision or no collision), over time. To start, with a temporal resolution of 1 hour, how many frames will our data have?

In [11]:
LA_weather['datetime'] = pd.to_datetime(LA_weather['datetime'])
LA_collisions['datetime'] = pd.to_datetime(LA_collisions['datetime'])

In [12]:
LA_collisions.datetime.min()

Timestamp('2010-01-01 00:00:00')

In [13]:
LA_collisions.datetime.max()

Timestamp('2023-02-04 08:00:00')

In [14]:
LA_weather.datetime.min()

Timestamp('2006-12-31 16:00:00')

In [15]:
LA_weather.datetime.max()

Timestamp('2021-12-31 15:00:00')

The weather data starts earlier but also ends sooner. Let's drop the extra rows from the two dataframes to make the time ranges match up.

In [16]:
LA_weather = LA_weather[LA_weather['datetime'] >= pd.to_datetime('2010-01-01 00:00:00')]
LA_collisions = LA_collisions[LA_collisions['datetime'] <= pd.to_datetime('2021-12-31 15:00:00')]
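
Rather than typing the two cutoffs by hand, the shared window can be derived from the data. The sketch below assumes two toy dataframes (`weather` and `collisions` here are hypothetical stand-ins, not the real data) that each have a `datetime` column.

```python
import pandas as pd

# Hypothetical stand-ins with the same first and last timestamps as above.
weather = pd.DataFrame({'datetime': pd.to_datetime(
    ['2006-12-31 16:00', '2010-01-01 00:00', '2021-12-31 15:00'])})
collisions = pd.DataFrame({'datetime': pd.to_datetime(
    ['2010-01-01 00:00', '2015-06-01 12:00', '2023-02-04 08:00'])})

# The shared window runs from the later start to the earlier end.
start = max(weather['datetime'].min(), collisions['datetime'].min())
end = min(weather['datetime'].max(), collisions['datetime'].max())

# between() is inclusive on both ends by default.
weather_trim = weather[weather['datetime'].between(start, end)].copy()
collisions_trim = collisions[collisions['datetime'].between(start, end)].copy()
print(start, end)
```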

In [17]:
LA_weather.sort_values('datetime', ignore_index = True, inplace = True)
LA_collisions.sort_values('datetime', ignore_index = True, inplace = True)

In [18]:
LA_weather.datetime.max() - LA_weather.datetime.min()

Timedelta('4382 days 15:00:00')

In [19]:
(4382 * 24) + 15

105183

In [20]:
LA_weather.shape[0]

103273
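
The frame arithmetic above can be checked directly with pandas. This minimal sketch hard-codes the start and end timestamps and the row count reported above; the variable names are illustrative.

```python
import pandas as pd

start = pd.Timestamp('2010-01-01 00:00:00')
end = pd.Timestamp('2021-12-31 15:00:00')
one_hour = pd.Timedelta(hours=1)

# Whole hours elapsed between the first and last timestamp.
elapsed_hours = (end - start) // one_hour

# Both endpoints are frames, so the frame count is one more than the gap.
expected_frames = elapsed_hours + 1

observed_rows = 103273  # LA_weather.shape[0] as reported above
missing_rows = expected_frames - observed_rows
print(elapsed_hours, expected_frames, missing_rows)
```

The extra one accounts for the starting hour itself, which is why the final row count after filling the gaps is one higher than the elapsed-hours figure.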

There are going to be over 100,000 frames of data. However, the weather data is clearly missing some hours: 105,184 hourly frames are expected (105,183 elapsed hours plus the starting hour), but there are only 103,273 rows, so almost 2,000 are missing. Let's insert rows for those missing times. The timestamps will be handled first, and the weather features can be filled in afterwards with a forward fill.

In [21]:
for idx in range(1, 105184):
    # A gap exists when consecutive timestamps are not exactly one hour apart
    if ((LA_weather['datetime'][idx] - LA_weather['datetime'][idx - 1]) != pd.to_timedelta('0 days 01:00:00')):
        # Shift every label from idx onward up by one to free label idx
        idxs = np.split(LA_weather.index, [idx])
        LA_weather.set_index(idxs[0].union(idxs[1] + 1), inplace = True)
        # Insert an all-NaN row, then stamp it one hour after the previous row
        LA_weather.loc[idx] = [np.nan] * LA_weather.shape[1]
        LA_weather.loc[idx, 'datetime'] = LA_weather['datetime'][idx - 1] + pd.to_timedelta('01:00:00')
        LA_weather.sort_values('datetime', ignore_index = True, inplace = True)
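
The loop above re-sorts the whole dataframe every time it finds a gap, which can be slow. An alternative worth considering is to reindex against a complete hourly range in one pass. This is a sketch on a hypothetical toy frame (`demo`), and it assumes the timestamps are unique, since `reindex` raises an error on duplicate index labels.

```python
import pandas as pd

# Hypothetical hourly frame with the 02:00 and 03:00 rows missing.
demo = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2010-01-01 00:00', '2010-01-01 01:00', '2010-01-01 04:00',
    ]),
    'temp': [10.0, 11.0, 14.0],
})

# Complete hourly index between the first and last timestamp.
full_index = pd.date_range(
    demo['datetime'].min(), demo['datetime'].max(), freq=pd.Timedelta(hours=1)
)

# Reindexing adds one all-NaN row per missing hour.
filled = (
    demo.set_index('datetime')
        .reindex(full_index)
        .rename_axis('datetime')
        .reset_index()
)
print(filled)
```

The result has the same layout as the loop's output: every hour is present, and the inserted rows hold NaN until they are filled.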

Now that the missing timestamps have been inserted, we can sort by datetime and fill in the missing weather variables using a forward fill (pad).

In [22]:
LA_weather.isna().sum()

dt            1911
temp          1911
visibility    1911
dew_point     1911
temp_min      1911
temp_max      1911
pressure      1911
humidity      1911
wind_speed    1911
wind_deg      1911
wind_gust     1911
rain_1h       1911
rain_3h       1911
clouds_all    1911
datetime         0
dtype: int64

In [23]:
LA_weather.sort_values('datetime', ignore_index = True, inplace = True)
LA_weather.ffill(inplace = True)

In [24]:
LA_weather.isna().sum()

dt            0
temp          0
visibility    0
dew_point     0
temp_min      0
temp_max      0
pressure      0
humidity      0
wind_speed    0
wind_deg      0
wind_gust     0
rain_1h       0
rain_3h       0
clouds_all    0
datetime      0
dtype: int64

In [25]:
LA_weather.shape[0]

105184

Now, just to make sure that the weather dataframe is sorted by datetime:

In [26]:
LA_weather.sort_values('datetime', ignore_index = True, inplace = True)

We can now start creating the tensor. While the weather dataframe has a temporal component, it has no spatial component. To limit cost, the historical weather data was retrieved for only a single coordinate in the center of Los Angeles, so each weather value applies to every grid cell in a frame. That makes the nd-array creation much easier for those features. <br>
The collisions dataframe and the weather dataframe now cover the same datetime range, and each time index is one hour after the previous one. So, to find the temporal index of a row in the collisions dataframe, we need only subtract the minimum datetime from the row's datetime and count the number of hours. The latitude and longitude indices can be found in the same way as in the previous notebook.

In [27]:
LA_grid = np.zeros((50,30,12,LA_weather.shape[0]))
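
One thing worth checking before allocating this array is its memory footprint, since `np.zeros` defaults to float64. The rough estimate below uses the shape from the cell above; `footprint_gb` is just an illustrative name.

```python
import math

import numpy as np

shape = (50, 30, 12, 105184)  # rows, columns, layers, hourly frames
n_elements = math.prod(shape)

# Size in gigabytes (10**9 bytes) for two candidate dtypes.
footprint_gb = {}
for dtype in (np.float64, np.float32):
    n_bytes = n_elements * np.dtype(dtype).itemsize
    footprint_gb[np.dtype(dtype).name] = n_bytes / 1e9
print(footprint_gb)
```

At float64 this comes to roughly 15 GB. Passing `dtype=np.float32` to `np.zeros` would halve that, though whether single precision is sufficient for these features is an assumption we have not tested.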

In [28]:
lats, lats_step = np.linspace(
    LA_traffic['lat'].max(), LA_collisions['latitude'].min(), num = 50, retstep = True)
lons, lons_step = np.linspace(
    LA_traffic['lon'].min(), LA_collisions['longitude'].max(), num = 30, retstep = True)

In [29]:
for idx in range(LA_weather.shape[0]):
    LA_grid[:,:,2,idx] = LA_weather['temp'][idx]
    LA_grid[:,:,3,idx] = LA_weather['visibility'][idx]
    LA_grid[:,:,4,idx] = LA_weather['humidity'][idx]
    LA_grid[:,:,5,idx] = LA_weather['rain_1h'][idx]
    LA_grid[:,:,6,idx] = LA_weather['rain_3h'][idx]
    LA_grid[:,:,7,idx] = LA_weather['clouds_all'][idx]
    LA_grid[:,:,8,idx] = LA_weather['pressure'][idx]
    LA_grid[:,:,9,idx] = LA_weather['wind_speed'][idx]
    LA_grid[:,:,10,idx] = LA_weather['wind_gust'][idx]
    LA_grid[:,:,11,idx] = LA_weather['dew_point'][idx]
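
Because each weather value is the same for every cell in a frame, NumPy broadcasting can fill a whole layer at once instead of looping over frames. Here is a miniature sketch of that idea; the 2x3 grid, the `weather` frame, and the `layer_columns` mapping are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version: a 2x3 grid, 3 layers, 4 hourly frames.
weather = pd.DataFrame({
    'temp': [10.0, 11.0, 12.0, 13.0],
    'visibility': [9.0, 8.0, 7.0, 6.0],
})
grid = np.zeros((2, 3, 3, len(weather)))

# Layer 0 is left for the target; layers 1 and 2 take the weather columns.
layer_columns = {1: 'temp', 2: 'visibility'}
for layer, column in layer_columns.items():
    # A length-T vector broadcasts across both spatial axes.
    grid[:, :, layer, :] = weather[column].to_numpy()

print(grid[0, 0, 1, :])
```

The loop now runs once per layer rather than once per frame, so it scales with the number of features instead of the number of hours.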

In [30]:
lat_idxs = (LA_traffic['lat'] - LA_traffic['lat'].max()) / lats_step
lon_idxs = (LA_traffic['lon'] - LA_traffic['lon'].min()) / lons_step
for idx in range(lat_idxs.shape[0]):
    lat_idx = abs(math.floor(lat_idxs.iloc[idx]))
    lon_idx = abs(math.floor(lon_idxs.iloc[idx]))
    # Use positional access throughout; the index labels come from 'Unnamed: 0'
    LA_grid[lat_idx, lon_idx, 1, :] = LA_traffic['Total'].iloc[idx]

In [31]:
LA_collisions['lat_idxs'] = (LA_collisions['latitude'] - LA_traffic['lat'].max()) / lats_step
LA_collisions['lon_idxs'] = (LA_collisions['longitude'] - LA_traffic['lon'].min()) / lons_step
LA_collisions['time_idxs'] = (LA_collisions['datetime'] - LA_collisions['datetime'].min())
LA_collisions['time_idxs'] = LA_collisions['time_idxs'].dt.components['days'] * 24 + LA_collisions['time_idxs'].dt.components['hours']
for idx in range(LA_collisions.shape[0]):
    lat_idx = abs(math.floor(LA_collisions['lat_idxs'][idx]))
    lon_idx = abs(math.floor(LA_collisions['lon_idxs'][idx]))
    time_idx = math.floor(LA_collisions['time_idxs'][idx])
    LA_grid[lat_idx, lon_idx,0,time_idx] = 1
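
The index arithmetic in the cell above can also be done for all rows at once. The sketch below uses a hypothetical 5x4 grid and two made-up collisions, and it assumes every point lies inside the grid bounds, so no `abs()` is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical grid edges: 5 latitude points (north to south), 4 longitude points.
lats, lat_step = np.linspace(34.4, 33.6, num=5, retstep=True)    # lat_step < 0
lons, lon_step = np.linspace(-118.7, -118.1, num=4, retstep=True)

collisions = pd.DataFrame({
    'latitude': [34.3, 33.7],
    'longitude': [-118.6, -118.2],
    'datetime': pd.to_datetime(['2010-01-01 00:00', '2010-01-01 05:00']),
})

# Offset from the grid origin divided by the step gives the cell index.
lat_idx = np.floor((collisions['latitude'] - lats[0]) / lat_step).astype(int).to_numpy()
lon_idx = np.floor((collisions['longitude'] - lons[0]) / lon_step).astype(int).to_numpy()

# Whole hours since the first collision give the frame index.
time_idx = ((collisions['datetime'] - collisions['datetime'].min())
            // pd.Timedelta(hours=1)).to_numpy()

grid = np.zeros((len(lats), len(lons), 1, 6))
grid[lat_idx, lon_idx, 0, time_idx] = 1  # one assignment marks every collision
print(lat_idx, lon_idx, time_idx)
```

Dividing a timedelta column by `pd.Timedelta(hours=1)` with `//` yields the hour count directly, without going through `.dt.components`.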

In [32]:
LA_grid.shape

(50, 30, 12, 105184)

Now that all of the grids have been created, we can split the data into test and train data.

### 6 Split the Data<a id='6_Split_the_Data'></a>

To split this nd-array, we will create a list containing the indices of the 4th dimension (the time dimension). We can then split that list of indices and use that to retrieve test and train data from the tensor that we just created.

In [33]:
all_indices = list(range(LA_weather.shape[0]))
train_ind, test_ind = train_test_split(all_indices, test_size=0.25)

In [34]:
LA_train = LA_grid[:,:,:,train_ind]
LA_test = LA_grid[:,:,:, test_ind]

In [35]:
LA_train.shape, LA_test.shape

((50, 30, 12, 78888), (50, 30, 12, 26296))
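
To make the mechanics of the split concrete, here is a NumPy-only sketch on a toy tensor. It swaps in a seeded `np.random.default_rng` permutation in place of `train_test_split` purely for illustration; the seed and the toy shape are assumptions.

```python
import numpy as np

# Toy tensor: 2x3 grid, 1 layer, 8 frames.
grid = np.arange(2 * 3 * 1 * 8, dtype=float).reshape(2, 3, 1, 8)

rng = np.random.default_rng(0)  # fixed seed so the split is repeatable
shuffled = rng.permutation(grid.shape[3])
n_test = int(round(0.25 * grid.shape[3]))
test_ind, train_ind = shuffled[:n_test], shuffled[n_test:]

# Indexing the last axis with a list of frame indices copies those frames.
train = grid[:, :, :, train_ind]
test = grid[:, :, :, test_ind]
print(train.shape, test.shape)
```

With `train_test_split`, passing `random_state` gives the same kind of repeatability. Note that a random split mixes neighbouring hours between train and test; if that turns out to matter for this data, a chronological cut is one alternative to consider.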

### 7 Saving as an H5PY File <a id='7_Saving_as_an_H5PY_File'></a>

We can now store those two arrays as datasets in an HDF5 file using h5py. This will allow for easy retrieval later when modeling. We will create the file first and then the datasets.

In [37]:
LA_data_split = h5py.File('../Data/LA_data_split.hdf5','w-')

In [38]:
train = LA_data_split.create_dataset('train', data = LA_train, chunks = (50,30,12,3287), compression="gzip");
test = LA_data_split.create_dataset('test', data = LA_test, chunks = (50,30,12,3287), compression="gzip");
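
For reference, this is roughly how the datasets could be read back later. The sketch writes a small toy array to a temporary file (the path and the dataset contents are hypothetical) and then reads only a slice of frames.

```python
import os
import tempfile

import h5py
import numpy as np

data = np.zeros((2, 3, 1, 8))
data[0, 1, 0, 5] = 1

path = os.path.join(tempfile.mkdtemp(), 'demo_split.hdf5')  # hypothetical file

# The context manager closes the file even if an error occurs.
with h5py.File(path, 'w') as f:
    f.create_dataset('train', data=data, chunks=(2, 3, 1, 4), compression='gzip')

with h5py.File(path, 'r') as f:
    # Slicing a dataset reads only the requested frames from disk.
    frames = f['train'][:, :, :, 4:8]
    stored_chunks = f['train'].chunks

print(frames.shape, stored_chunks)
```

Chunking along the time axis, as in the cell above, means a batch of consecutive frames can be loaded without decompressing the whole dataset.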

### 8 Baseline Performance <a id='8_Baseline_Performance'></a>

To get a baseline performance for the model, we'll look at the Zero Rate Classifier (always predicting the majority class) and the Random Rate Classifier (guessing each class in proportion to its share of the data).

To get the Zero Rate, we only need to count the total number of target cells in our tensor and the number of cells in each class (0 and 1). We can do this with the train dataset that was just created.

In [39]:
totalCells = 50 * 30 * LA_train.shape[3]
totalZeros = 0
for chunk in train.iter_chunks():
    arr = train[chunk]
    # Layer 0 holds the target; count its zero cells for this chunk
    totalZeros += int(np.count_nonzero(arr[:, :, 0, :] == 0))
totalZeros

117795429

In [40]:
ZeroR = totalZeros/totalCells * 100
ZeroR

99.54655460906602

For the Random Rate, we can use the class distribution to set the guessing rates. The zero class makes up 99.546% of the data, so the one class makes up about 0.454%. If we predict a zero 99.546% of the time and a one 0.454% of the time, a prediction is correct when the guess and the true class are both zero or both one, so the expected accuracy is the sum of the squared class proportions.

In [43]:
Acc = (0.00454**2 + .99546**2) * 100
Acc

99.09612232

Clearly, our classes are quite imbalanced.
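
Both baselines can be written as small functions of the class counts. This is a sketch with hypothetical function names; the counts are the ones reported above, and the one-class count is derived by subtraction.

```python
def zero_rate_accuracy(class_counts):
    """Accuracy from always predicting the most common class."""
    return max(class_counts) / sum(class_counts)


def random_rate_accuracy(class_counts):
    """Expected accuracy when guessing each class at its observed frequency."""
    total = sum(class_counts)
    return sum((count / total) ** 2 for count in class_counts)


total_cells = 50 * 30 * 78888   # training cells, from the shapes above
zeros = 117795429               # zero count reported above
ones = total_cells - zeros

zero_rate = zero_rate_accuracy([zeros, ones])
random_rate = random_rate_accuracy([zeros, ones])
print(f"{zero_rate * 100:.4f}% {random_rate * 100:.4f}%")
```

Because this uses the unrounded proportions, the Random Rate may differ from the figure above in the later decimal places, since that calculation rounds the proportions to 0.99546 and 0.00454 first.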

In [42]:
LA_data_split.close()

### 9 Conclusion<a id='9_Conclusion'></a>

In this notebook, we created the tensor that will be used for modeling, split the data 75/25 into train and test sets, and created an HDF5 file to hold the split data. The classes in our target matrix are quite imbalanced, so we will consider applying SMOTE or some other technique to account for this, so that our model is not negatively affected.

We are now going to move into the modeling phase of the project in the following notebook.