# Pre-Processing and Feature Engineering<a id='Pre-Processing_and_Feature_Engineering'></a>

### 1 Table of Contents<a id='Contents'></a>
* [Pre-Processing and Feature Engineering](#Pre-Processing_and_Feature_Engineering)
  * [1 Contents](#Contents)
  * [2 Introduction](#2_Introduction)
  * [3 Imports](#3_Imports)
  * [4 Load Data](#4_Load_Data)
  * [5 Creating the Tensor](#5_Creating_the_Tensor)
  * [6 Split the Data](#6_Split_the_Data)
  * [7 Saving as an H5PY File](#7_Saving_as_an_H5PY_File)
  * [8 Creating Windowed Datasets](#8_Creating_Windowed_Datasets)
  * [9 Conclusion](#9_Conclusion)

### 2 Introduction<a id='2_Introduction'></a>

In the last notebook, we settled on the spatial and temporal resolution for the tensors. For spatial resolution, a 50x30 grid will be used. Traffic volume data from the Los Angeles Department of Transportation will serve as one of the tensor layers. The other layers will be made up of weather variables such as temperature, visibility, and cloud cover. The target is a 1 or 0 for each grid cell, representing whether or not an accident occurred.

In this notebook, we will build the complete tensor of weather, traffic, and collision data. The tensor will then be split into test and train sets and a baseline performance for the model will be determined.

### 3 Imports<a id='3_Imports'></a>

In [1]:
import warnings
warnings.simplefilter('ignore')
import pandas as pd
import numpy as np
import h5py
import torch as t
import torch.utils.data as dt
from sklearn.model_selection import train_test_split
import gc
from library.sb_utils import save_file
import os
import math
import csv

### 4 Load Data<a id='4_Load_Data'></a>

In [2]:
LA_collisions = pd.read_csv('../Data/LA_collisions.csv', index_col = 'Unnamed: 0')
LA_weather = pd.read_csv('../Data/LA_weather_cleaned.csv', index_col = 'Unnamed: 0')
LA_traffic = pd.read_csv('../Data/LA_traffic.csv', index_col = 'Unnamed: 0')

### 5 Creating the Tensor <a id='5_Creating_the_Tensor'></a>

The unmerged dataframes will be used to fill the tensor. The weather dataframe, which contains most of the input features, has hourly weather records dating back to 2006. The other input feature is the traffic count in the traffic dataframe, and the output feature is the presence of a collision, taken from the collisions dataframe. First, let's look at the columns in each of our dataframes.

In [3]:
LA_collisions.columns

Index(['primary_road', 'secondary_road', 'intersection', 'side_of_highway',
       'severity', 'type', 'pedestrian', 'bicycle', 'motorcycle', 'truck',
       'same_day_crashes', 'same_road_crashes', 'latitude', 'longitude',
       'datetime'],
      dtype='object')

In [4]:
LA_weather.columns

Index(['dt', 'temp', 'visibility', 'dew_point', 'temp_min', 'temp_max',
       'pressure', 'humidity', 'wind_speed', 'wind_deg', 'wind_gust',
       'rain_1h', 'rain_3h', 'clouds_all', 'datetime'],
      dtype='object')

In [5]:
LA_traffic.columns

Index(['primary_road', 'secondary_road', 'lat', 'lon', 'Total'], dtype='object')

Before we use the collision coordinates to set our grid boundaries, we need to make sure there are no erroneous locations (which we have already seen in previous notebooks).

In [6]:
LA_collisions[LA_collisions['latitude'] > 35]

Unnamed: 0,primary_road,secondary_road,intersection,side_of_highway,severity,type,pedestrian,bicycle,motorcycle,truck,same_day_crashes,same_road_crashes,latitude,longitude,datetime


In [7]:
LA_collisions[LA_collisions['latitude'] < 33].head(5)

Unnamed: 0,primary_road,secondary_road,intersection,side_of_highway,severity,type,pedestrian,bicycle,motorcycle,truck,same_day_crashes,same_road_crashes,latitude,longitude,datetime
300034,HOOVER,JEFFERSON,0.0,Not Available,Not Available,Not Available,0,0,0,0,0,0,0.0,0.0,2021-02-28 18:00:00
302523,RIVERSIDE,FULTON,0.0,Not Available,Not Available,Not Available,0,0,0,0,0,0,0.0,0.0,2020-04-16 18:00:00
302541,VENTURA BL,LA MAIDA ST,0.0,Not Available,Not Available,Not Available,0,0,0,0,0,0,0.0,0.0,2020-04-14 15:00:00
302566,VENTURA BL,DONNA AV,0.0,Not Available,Not Available,Not Available,0,0,0,0,0,0,0.0,0.0,2020-04-16 13:00:00
302671,VAN NUYS BL,HUSTON ST,0.0,Not Available,Not Available,Not Available,0,0,0,0,0,0,0.0,0.0,2020-04-16 15:00:00


In [8]:
LA_collisions[LA_collisions['longitude'] < -119]

Unnamed: 0,primary_road,secondary_road,intersection,side_of_highway,severity,type,pedestrian,bicycle,motorcycle,truck,same_day_crashes,same_road_crashes,latitude,longitude,datetime


In [9]:
LA_collisions[LA_collisions['longitude'] > -117].head(5)

Unnamed: 0,primary_road,secondary_road,intersection,side_of_highway,severity,type,pedestrian,bicycle,motorcycle,truck,same_day_crashes,same_road_crashes,latitude,longitude,datetime
300034,HOOVER,JEFFERSON,0.0,Not Available,Not Available,Not Available,0,0,0,0,0,0,0.0,0.0,2021-02-28 18:00:00
302523,RIVERSIDE,FULTON,0.0,Not Available,Not Available,Not Available,0,0,0,0,0,0,0.0,0.0,2020-04-16 18:00:00
302541,VENTURA BL,LA MAIDA ST,0.0,Not Available,Not Available,Not Available,0,0,0,0,0,0,0.0,0.0,2020-04-14 15:00:00
302566,VENTURA BL,DONNA AV,0.0,Not Available,Not Available,Not Available,0,0,0,0,0,0,0.0,0.0,2020-04-16 13:00:00
302671,VAN NUYS BL,HUSTON ST,0.0,Not Available,Not Available,Not Available,0,0,0,0,0,0,0.0,0.0,2020-04-16 15:00:00


It looks like the erroneous locations are all at (0, 0). We used the geocoding API quite a bit in the last notebook, so, to keep costs down, these rows will be dropped rather than geocoded again.

In [10]:
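# .copy() makes the filtered frame independent, so later column assignments act on it directly
LA_collisions = LA_collisions[LA_collisions['latitude'] > 0].copy()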

The weather factors that are important to the model are temp, visibility, dew_point, pressure, humidity, wind_speed, wind_gust, rain_1h, rain_3h, and clouds_all. That's ten weather variables, plus the traffic flow data and the target variable (collision or no collision), for twelve channels over time. To start, with a temporal resolution of 1 hour, how many frames will our data have?

In [11]:
LA_weather['datetime'] = pd.to_datetime(LA_weather['datetime'])
LA_collisions['datetime'] = pd.to_datetime(LA_collisions['datetime'])

In [12]:
LA_collisions.datetime.min()

Timestamp('2010-01-01 00:00:00')

In [13]:
LA_collisions.datetime.max()

Timestamp('2023-02-04 08:00:00')

In [14]:
LA_weather.datetime.min()

Timestamp('2006-12-31 16:00:00')

In [15]:
LA_weather.datetime.max()

Timestamp('2021-12-31 15:00:00')

The weather data starts earlier, but also ends sooner. Let's drop the extra rows from the two dataframes to make the time ranges match up.

In [16]:
LA_weather = LA_weather[LA_weather['datetime'] >= pd.to_datetime('2010-01-01 00:00:00')]
LA_collisions = LA_collisions[LA_collisions['datetime'] <= pd.to_datetime('2021-12-31 15:00:00')]

In [17]:
LA_weather.sort_values('datetime', ignore_index = True, inplace = True)
LA_collisions.sort_values('datetime', ignore_index = True, inplace = True)

In [18]:
LA_weather.datetime.max() - LA_weather.datetime.min()

Timedelta('4382 days 15:00:00')

In [19]:
(4382 * 24) + 15

105183

In [20]:
LA_weather.shape[0]

103273

A span of 105,183 hours means 105,184 hourly frames, so there will be over 100,000 frames of data. However, the weather dataframe has only 103,273 rows, so 1,911 hours are missing. Let's insert rows for those missing times. The timestamps will be handled first, and the weather features can be filled in afterwards with a forward fill.

In [21]:
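# Walk the rows in time order and insert an empty row wherever an hour is missing
one_hour = pd.Timedelta(hours=1)
for idx in range(1, 105184):  # 105184 = expected number of hourly rows
    if (LA_weather.loc[idx, 'datetime'] - LA_weather.loc[idx - 1, 'datetime']) != one_hour:
        # Shift the labels from idx onward up by one to free label idx
        idxs = np.split(LA_weather.index, [idx])
        LA_weather.set_index(idxs[0].union(idxs[1] + 1), inplace = True)
        # np.nan replaces the np.NAN alias, which was removed in NumPy 2.0
        LA_weather.loc[idx] = [np.nan] * LA_weather.shape[1]
        LA_weather.loc[idx, 'datetime'] = LA_weather.loc[idx - 1, 'datetime'] + one_hour
        LA_weather.sort_values('datetime', ignore_index = True, inplace = True)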
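
The gap-filling loop above re-sorts the dataframe every time it inserts a row. As an alternative, here is a minimal sketch of the same idea using a reindex against a complete hourly range, followed by a forward fill. The toy `weather` frame and its values are hypothetical, and the approach assumes the timestamps are unique.

```python
import pandas as pd

# Toy hourly series with 02:00 and 03:00 missing (hypothetical values).
weather = pd.DataFrame({
    'datetime': pd.to_datetime(['2010-01-01 00:00',
                                '2010-01-01 01:00',
                                '2010-01-01 04:00']),
    'temp': [10.0, 11.0, 14.0],
})

# Complete hourly index between the first and last timestamp.
full_hours = pd.date_range(weather['datetime'].min(),
                           weather['datetime'].max(),
                           freq=pd.Timedelta(hours=1))

filled = (weather.set_index('datetime')
                 .reindex(full_hours)        # adds NaN rows for the missing hours
                 .ffill()                    # forward fill, the same idea as 'pad'
                 .rename_axis('datetime')
                 .reset_index())

print(filled)
```

Note that `reset_index` places `datetime` first in the column order, so the column layout differs from the loop-based version.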

Now that rows for the missing times have been inserted, we can sort by datetime and fill in the missing weather variables with a forward fill (pad).

In [22]:
LA_weather.isna().sum()

dt            1911
temp          1911
visibility    1911
dew_point     1911
temp_min      1911
temp_max      1911
pressure      1911
humidity      1911
wind_speed    1911
wind_deg      1911
wind_gust     1911
rain_1h       1911
rain_3h       1911
clouds_all    1911
datetime         0
dtype: int64

In [23]:
LA_weather.sort_values('datetime', ignore_index = True, inplace = True)
LA_weather.ffill(inplace = True)  # forward fill; fillna(method='pad') is deprecated in recent pandas

In [24]:
LA_weather.isna().sum()

dt            0
temp          0
visibility    0
dew_point     0
temp_min      0
temp_max      0
pressure      0
humidity      0
wind_speed    0
wind_deg      0
wind_gust     0
rain_1h       0
rain_3h       0
clouds_all    0
datetime      0
dtype: int64

In [25]:
LA_weather.shape[0]

105184

Now, just to make sure that the weather dataframe is sorted by datetime:

In [26]:
LA_weather.sort_values('datetime', ignore_index = True, inplace = True)

We can now start creating the tensor. While the weather dataframe has a temporal component, it has no spatial component: to keep costs down, the historical weather data was retrieved only for a single coordinate in the center of Los Angeles. That makes filling the n-dimensional array much easier for those features, since each hourly value applies to every grid cell. <br>
The collisions dataframe and the weather dataframe have the same start and end datetime. Because each time index is one hour after the previous one, the temporal index of a row in the collisions dataframe is simply the number of hours between its datetime and the minimum datetime. The latitude and longitude indices can be found in the same way as in the previous notebook.
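
As a small illustration of the index arithmetic described above, here is a sketch using only the standard library. The helper names `time_index` and `grid_index`, the origin of 34.0 degrees, and the step of -0.01 degrees are hypothetical. Clipping out-of-range values to the grid edge is an assumption of this sketch, not what the cells below do (they take the absolute value of the floored index).

```python
import math
from datetime import datetime

def time_index(ts, start):
    """Whole hours elapsed between start and ts."""
    return int((ts - start).total_seconds() // 3600)

def grid_index(value, origin, step, size):
    """Cell index along one axis, clipped to the valid range [0, size - 1]."""
    idx = math.floor((value - origin) / step)
    return min(max(idx, 0), size - 1)

start = datetime(2010, 1, 1, 0)

# The last weather timestamp maps to the last frame index.
print(time_index(datetime(2021, 12, 31, 15), start))

# Hypothetical latitude axis: starts at 34.0 and decreases by 0.01 per cell.
print(grid_index(33.955, origin=34.0, step=-0.01, size=50))
print(grid_index(34.5, origin=34.0, step=-0.01, size=50))   # north of the grid, clipped
```

The first call reproduces the 4382 days and 15 hours computed earlier, that is, frame index 105183.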

In [27]:
LA_grid = np.zeros((LA_weather.shape[0],50,30,12))
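
For a sense of scale, the array allocated above can be sized with a little arithmetic. This sketch assumes NumPy's default float64 dtype (8 bytes per element), which is what `np.zeros` uses when no dtype is given. The float32 figure is included only for comparison.

```python
import math

shape = (105184, 50, 30, 12)      # frames, rows, columns, channels
n_elements = math.prod(shape)

bytes_float64 = n_elements * 8    # np.zeros defaults to float64
bytes_float32 = n_elements * 4    # what a float32 array would need instead

print(f"{n_elements:,} elements")
print(f"float64: {bytes_float64 / 1e9:.1f} GB")
print(f"float32: {bytes_float32 / 1e9:.1f} GB")
```

That is roughly 15 GB in float64, so passing `dtype=np.float32` to `np.zeros` may be worth considering on machines with less memory.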

In [28]:
lats, lats_step = np.linspace(
    LA_traffic['lat'].max(), LA_collisions['latitude'].min(), num = 50, retstep = True)
lons, lons_step = np.linspace(
    LA_traffic['lon'].min(), LA_collisions['longitude'].max(), num = 30, retstep = True)

In [29]:
for idx in range(LA_weather.shape[0]):
    LA_grid[idx,:,:,2] = LA_weather['temp'][idx]
    LA_grid[idx,:,:,3] = LA_weather['visibility'][idx]
    LA_grid[idx,:,:,4] = LA_weather['humidity'][idx]
    LA_grid[idx,:,:,5] = LA_weather['rain_1h'][idx]
    LA_grid[idx,:,:,6] = LA_weather['rain_3h'][idx]
    LA_grid[idx,:,:,7] = LA_weather['clouds_all'][idx]
    LA_grid[idx,:,:,8] = LA_weather['pressure'][idx]
    LA_grid[idx,:,:,9] = LA_weather['wind_speed'][idx]
    LA_grid[idx,:,:,10] = LA_weather['wind_gust'][idx]
    LA_grid[idx,:,:,11] = LA_weather['dew_point'][idx]
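
Because each weather value applies to every cell of its frame, the per-row loop above could also be written with NumPy broadcasting. The following is a minimal sketch on a toy array. The sizes, channel positions, and values are hypothetical.

```python
import numpy as np

# Toy sizes (hypothetical): 4 hours, a 3x2 grid, 3 channels.
n_hours, n_rows, n_cols = 4, 3, 2
grid = np.zeros((n_hours, n_rows, n_cols, 3))

temp = np.array([10.0, 11.0, 12.0, 13.0])       # one value per hour
visibility = np.array([5.0, 6.0, 7.0, 8.0])

# Reshape (n_hours,) -> (n_hours, 1, 1) so that each hourly value
# broadcasts across every grid cell of that frame.
grid[:, :, :, 1] = temp[:, None, None]
grid[:, :, :, 2] = visibility[:, None, None]

print(grid[2, :, :, 1])   # every cell of frame 2 holds that hour's temperature
```

With the real data, a column could be passed in the same way, for example `LA_weather['temp'].to_numpy()[:, None, None]`.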

In [30]:
lat_idxs = (LA_traffic['lat'] - LA_traffic['lat'].max()) / lats_step
lon_idxs = (LA_traffic['lon'] - LA_traffic['lon'].min()) / lons_step
for idx in range(lat_idxs.shape[0]):
    lat_idx = abs(math.floor(lat_idxs.iloc[idx]))
    lon_idx = abs(math.floor(lon_idxs.iloc[idx]))
    # Positional lookup, consistent with the .iloc indexing of lat_idxs and lon_idxs above
    LA_grid[:,lat_idx,lon_idx,1]  = LA_traffic['Total'].iloc[idx]

In [31]:
LA_collisions['lat_idxs'] = (LA_collisions['latitude'] - LA_traffic['lat'].max()) / lats_step
LA_collisions['lon_idxs'] = (LA_collisions['longitude'] - LA_traffic['lon'].min()) / lons_step
LA_collisions['time_idxs'] = (LA_collisions['datetime'] - LA_collisions['datetime'].min())
LA_collisions['time_idxs'] = LA_collisions['time_idxs'].dt.components['days'] * 24 + LA_collisions['time_idxs'].dt.components['hours']
for idx in range(LA_collisions.shape[0]):
    lat_idx = abs(math.floor(LA_collisions['lat_idxs'][idx]))
    lon_idx = abs(math.floor(LA_collisions['lon_idxs'][idx]))
    time_idx = math.floor(LA_collisions['time_idxs'][idx])
    LA_grid[time_idx,lat_idx,lon_idx,0] = 1

In [32]:
LA_grid.shape

(105184, 50, 30, 12)

It would be useful now to determine which, if any, grid cells never see a collision. Cells that never see a collision could be dropped from our graph. (That can be done in the next notebook, when the grid is formatted for a GNN.)

In [33]:
LA_grid = np.asarray(LA_grid)  # no copy for an existing ndarray; np.array would duplicate it

In [34]:
collision_idxs = np.zeros(1500)

In [35]:
for idx0 in range(LA_grid.shape[0]):
    for idx1 in range(LA_grid.shape[1]):
        for idx2 in range(LA_grid.shape[2]):
            if (LA_grid[idx0, idx1, idx2, 0] == 1):
                collision_idxs[((idx1 * 30) + idx2)] += 1 
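
The triple loop above visits every cell of every frame in Python. Since the collision channel only holds 0s and 1s, the same per-cell counts can be obtained by summing over the time axis. Here is a minimal sketch on a toy grid with hypothetical sizes and values.

```python
import numpy as np

# Toy grid (hypothetical): 3 hours, 2x3 cells, channel 0 is the collision flag.
grid = np.zeros((3, 2, 3, 2))
grid[0, 0, 1, 0] = 1
grid[2, 0, 1, 0] = 1
grid[1, 1, 2, 0] = 1

# Sum the flag over time, then flatten row-major so that
# flat index = row * n_cols + col, matching (idx1 * 30) + idx2 above.
counts = grid[:, :, :, 0].sum(axis=0).ravel()

# Flat indices of the cells that never see a collision.
no_collision_cells = np.flatnonzero(counts == 0)

print(counts)
print(no_collision_cells)
```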

In [36]:
np.unique(collision_idxs, return_counts = True)

(array([0.0000e+00, 1.0000e+00, 2.0000e+00, 3.0000e+00, 4.0000e+00,
        5.0000e+00, 6.0000e+00, 7.0000e+00, 1.0000e+01, 1.1000e+01,
        1.2000e+01, 1.3000e+01, 1.6000e+01, 1.7000e+01, 1.9000e+01,
        3.2000e+01, 1.0500e+02, 1.1200e+02, 1.5100e+02, 1.6900e+02,
        2.0300e+02, 2.1900e+02, 2.4400e+02, 3.3100e+02, 3.7000e+02,
        4.6200e+02, 5.0400e+02, 5.8200e+02, 6.6100e+02, 7.5200e+02,
        8.4200e+02, 9.5700e+02, 1.4680e+03, 1.4990e+03, 1.5780e+03,
        1.5960e+03, 1.6670e+03, 2.0330e+03, 2.2480e+03, 2.2530e+03,
        2.4440e+03, 2.6100e+03, 3.2350e+03, 3.5400e+03, 3.6040e+03,
        3.6820e+03, 4.0280e+03, 4.1990e+03, 4.4740e+03, 4.5890e+03,
        4.8960e+03, 4.9480e+03, 5.0400e+03, 5.8920e+03, 6.5490e+03,
        6.5680e+03, 6.6160e+03, 6.7190e+03, 6.7710e+03, 6.7750e+03,
        6.7890e+03, 7.1010e+03, 7.3100e+03, 7.6050e+03, 7.7690e+03,
        7.7750e+03, 8.2040e+03, 8.3130e+03, 8.3430e+03, 8.4710e+03,
        8.6880e+03, 9.1730e+03, 9.9550e+03, 1.04

Looks like 1384 of our grid cells never see a collision. This is expected since our grid is a rectangle, while the streets of LA are a much more organic shape. Let's get the indexes that never see an accident.

In [37]:
no_collisions = []
for idx in range(collision_idxs.shape[0]):
    if (collision_idxs[idx] == 0):
        no_collisions.append(idx)

In [38]:
datapath = '../Data'
save_file(pd.DataFrame(no_collisions), 'no_collision_indexes.csv', datapath)

Writing file.  "../Data\no_collision_indexes.csv"


Now that all of the grids have been created, we can split the data into test and train data.

### 6 Split the Data<a id='6_Split_the_Data'></a>

It is expected that the temporal aspect of the data will be important to model performance. Specifically, we want the model to be able to predict the accidents occurring in the next hour based on the weather and traffic data for that hour and the weather, traffic, and accident data at the current time. A random split of the data would lose the temporal ordering that our model needs to make accurate predictions about the future. So, a single chronological split will be made: the first 75% of the data, from the start date to the split date, will make up the training data, and the remaining 25% of the data, from the split date to the end date, will be the test data. However, I also plan to run a spatial model without the temporal component, so I will also perform a random split.

In [39]:
split_ind = math.floor(LA_grid.shape[0] * 0.75)
LA_train_temporal = LA_grid[:split_ind,:,:,:]
LA_test_temporal = LA_grid[split_ind:,:,:,:]

In [40]:
LA_train_temporal.shape, LA_test_temporal.shape

((78888, 50, 30, 12), (26296, 50, 30, 12))

In [41]:
all_indices = list(range(LA_grid.shape[0]))
train_ind, test_ind = train_test_split(all_indices, test_size=0.25)
LA_train_random = LA_grid[train_ind,:,:,:]
LA_test_random = LA_grid[test_ind,:,:,:]

In [42]:
LA_train_random.shape, LA_test_random.shape

((78888, 50, 30, 12), (26296, 50, 30, 12))

### 7 Saving as an H5PY File <a id='7_Saving_as_an_H5PY_File'></a>

We can now store those arrays as datasets in HDF5 files using h5py. This will allow us to clear those arrays from memory, which will make the next step possible: creating new datasets from windows of the old ones.

In [43]:
LA_data_split_temporal = h5py.File('../Data/LA_data_split_temporal.hdf5','w-')

In [44]:
LA_data_split_temporal.create_dataset('train', data = LA_train_temporal, chunks = (32,50,30,12), compression="gzip");
LA_data_split_temporal.create_dataset('test', data = LA_test_temporal, chunks = (32,50,30,12), compression="gzip");

In [45]:
LA_data_split_temporal.close()

In [46]:
del LA_train_temporal
del LA_test_temporal
gc.collect()

18

In [47]:
LA_data_split_random = h5py.File('../Data/LA_data_split_random.hdf5','w-')

In [48]:
LA_data_split_random.create_dataset('train', data = LA_train_random, chunks = (32,50,30,12), compression="gzip");
LA_data_split_random.create_dataset('test', data = LA_test_random, chunks = (32,50,30,12), compression="gzip");

In [49]:
LA_data_split_random.close()

In [50]:
del LA_train_random
del LA_test_random
del LA_grid
gc.collect()

0

### 8 Creating Windowed Datasets <a id='8_Creating_Windowed_Datasets'></a>

I am now going to create a windowed dataset from the temporal-split H5PY file just created. Each window will contain 6 consecutive frames (a lookback of 5 plus the current frame) and will include only the collision information, no traffic or weather. The windowed datasets will be written to a new H5PY file, which will be used in the modeling notebook.

In [51]:
LA_windowed_split = h5py.File('../Data/LA_windowed_split.hdf5','w-')

In [52]:
class H5dataset(dt.Dataset):
    def __init__(self, h5_path, set_name):
        self.h5 = h5py.File(h5_path, 'r')
        self.set = self.h5[set_name]
        self.chunk_dict = {}
        for ch, sl in enumerate(self.set.iter_chunks()):
            self.chunk_dict[ch] = sl
        
    def __len__(self):
        return self.set.shape[0]
    
    def __getitem__(self, chunk, chunk_idx):
        sample = self.set[self.chunk_dict[chunk]][chunk_idx,:,:,0]
        return t.tensor(sample)
    
    def CLOSE(self):
        self.h5.close()

In [53]:
train_data = H5dataset('../Data/LA_data_split_temporal.hdf5', 'train')
test_data = H5dataset('../Data/LA_data_split_temporal.hdf5', 'test')

In [54]:
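def get_window(dataset, start, end):
    """Stack frames start..end (inclusive) into one tensor, reading chunk by chunk."""
    chunk = start // 32       # chunks hold 32 frames each (see create_dataset)
    chunk_idx = start % 32    # position of the frame inside its chunk
    windows = []
    for idx in range(start, end+1):
        window = dataset.__getitem__(chunk, chunk_idx)
        chunk_idx += 1
        if (chunk_idx == 32):
            # Reached the end of this chunk, so move on to the next one
            chunk += 1
            chunk_idx = 0
        windows.append(window)
    window_tensor = t.stack(windows)
    return window_tensor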

In [55]:
def create_windowed_dataset(dataset, lookback):
    windows = []
    for i in range(len(dataset) - lookback):
        window = get_window(dataset, i, i+lookback)
        windows.append(window)
    windowed_dataset = t.stack(windows)
    return windowed_dataset
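
For comparison, when the collision channel fits in memory as a NumPy array, the same overlapping windows can be built with `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later). This is a minimal sketch on a toy array, not the approach used in this notebook. The sizes and values are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy collision channel (hypothetical): 8 frames on a 2x3 grid.
frames = np.arange(8 * 2 * 3, dtype=np.float32).reshape(8, 2, 3)
lookback = 5

# Windows of lookback + 1 consecutive frames along the time axis.
# The window dimension is appended last: shape (3, 2, 3, 6).
windows = sliding_window_view(frames, window_shape=lookback + 1, axis=0)

# Move the window dimension next to the sample dimension: shape (3, 6, 2, 3).
windows = np.moveaxis(windows, -1, 1)

print(windows.shape)
```

The result is a read-only view, so it would need to be materialized (for example with `np.ascontiguousarray`) before being modified. The number of windows, `len(frames) - lookback`, matches what `create_windowed_dataset` produces.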

In [56]:
train_windowed = create_windowed_dataset(train_data, 5)

In [57]:
train_windowed.shape

torch.Size([78883, 6, 50, 30])

In [58]:
LA_windowed_split.create_dataset('train_windowed', data = train_windowed, chunks = (32,6,50,30), compression="gzip");

In [59]:
train_data.CLOSE()

In [60]:
del train_windowed
del train_data
gc.collect()

0

In [61]:
test_windowed = create_windowed_dataset(test_data, 5)

In [62]:
test_windowed.shape

torch.Size([26291, 6, 50, 30])

In [63]:
LA_windowed_split.create_dataset('test_windowed', data = test_windowed, chunks = (32,6,50,30), compression="gzip");

In [64]:
test_data.CLOSE()

In [65]:
del test_windowed
del test_data
gc.collect()

0

In [66]:
LA_windowed_split.close()

### 9 Conclusion<a id='9_Conclusion'></a>

In this notebook, we created the tensor that will be used for modeling, split the data, and created H5PY files to hold that split data. The data was split 75/25. We then created windowed datasets for our train and test sets. Finally, another H5PY file was created to hold the windowed datasets for use in modeling.

We are now going to move into the modeling phase of the project in the following notebook.