# Pre-Processing and Feature Engineering<a id='Pre-Processing_and_Feature_Engineering'></a>

### 1 Table of Contents<a id='Contents'></a>
* [Pre-Processing and Feature Engineering](#Pre-Processing_and_Feature_Engineering)
  * [1 Contents](#Contents)
  * [2 Introduction](#2_Introduction)
  * [3 Imports](#3_Imports)
  * [4 Load Data](#4_Load_Data)
  * [5 Creating the Tensors](#5_Creating_the_Tensors)
  * [Save Data](#Save_Data)
  * [Conclusion](#Conclusion)

### 2 Introduction<a id='2_Introduction'></a>

In the last notebook, the spatial and temporal resolutions for the tensors were decided. For spatial resolution, a 30x30 grid will be used; the temporal resolution is one hour. Traffic volume data from the Los Angeles Department of Transportation will be used as one of the tensor layers. The other layers will be made up of weather variables such as temperature, visibility, and cloud cover. The target is a 1 or 0 for each grid cell, representing whether or not an accident occurred there.

In this notebook, the complete set of feature and target tensors will be built using PyTorch. The tensors will then be split into train and test sets, and a baseline performance for the model will be determined.
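
As a rough sketch of the layout described above, one hourly frame might look like the following. The count of ten weather layers anticipates the selection made later in this notebook. The random values, the channel ordering (weather first, traffic last), and the accident cell are placeholders assumed for illustration, not the actual data or the final implementation.

```python
import torch

GRID = 30        # 30x30 spatial grid, from the text
N_WEATHER = 10   # assumption: ten weather layers, as selected later in this notebook

# Placeholder values; the real layers would come from LA_data and LA_traffic.
weather_layers = torch.rand(N_WEATHER, GRID, GRID)  # one layer per weather variable
traffic_layer = torch.rand(GRID, GRID)              # static, reused in every frame

# Stack weather and traffic into one feature frame of shape (11, 30, 30).
frame = torch.cat([weather_layers, traffic_layer.unsqueeze(0)], dim=0)

# Target: 1 where an accident occurred in that cell during the hour, else 0.
target = torch.zeros(GRID, GRID)
target[12, 7] = 1.0  # hypothetical accident cell

print(tuple(frame.shape), tuple(target.shape))
```

Keeping the traffic layer as the last channel is only a convention chosen here; any fixed ordering works as long as every frame uses the same one.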

### 3 Imports<a id='3_Imports'></a>

In [2]:
import warnings
warnings.simplefilter('ignore')
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
import requests
import torch
import googlemaps
from library.sb_utils import save_file
import os
import math
import csv

### 4 Load Data<a id='4_Load_Data'></a>

In [9]:
LA_data = pd.read_csv('../Data/LA_data.csv', index_col = 'Unnamed: 0')
LA_traffic = pd.read_csv('../Data/LA_traffic.csv', index_col = 'Unnamed: 0')
# Read the credentials file once and pull both API keys from it.
credentials = pd.read_json('../credentials.json', typ='series')
openWeather_api_key = credentials['openWeather_api_key']
maps_api_key = credentials['maps_api_key']

### 5 Creating the Tensors<a id='5_Creating_the_Tensors'></a>

The tensors for the weather variables and the traffic flow will be created in the same way that the grids were created in the last notebook. This time, however, each tensor only needs to describe one hour of weather variables. The traffic data is static, so it will be a single layer that is carried into every tensor. The first step is to determine how many weather variables there are.

In [10]:
LA_data.columns

Index(['level_0', 'index', 'primary_road', 'secondary_road', 'intersection',
       'side_of_highway', 'severity', 'type', 'pedestrian', 'bicycle',
       'motorcycle', 'truck', 'same_day_crashes', 'same_road_crashes',
       'latitude', 'longitude', 'datetime', 'temp', 'visibility', 'dew_point',
       'temp_min', 'temp_max', 'pressure', 'humidity', 'wind_speed',
       'wind_deg', 'wind_gust', 'rain_1h', 'rain_3h', 'clouds_all'],
      dtype='object')

The factors that are important to the model are `temp`, `visibility`, `dew_point`, `pressure`, `humidity`, `wind_speed`, `wind_gust`, `rain_1h`, `rain_3h`, and `clouds_all`. That's ten weather variables, plus the traffic flow data and the target variable (collision or no collision) over time. To start, with a temporal resolution of 1 hour, how many frames will the data have?

In [11]:
LA_data.datetime.min()

'2010-01-01 00:00:00'

In [12]:
LA_data.datetime.max()

'2021-12-31 15:00:00'

In [14]:
LA_data['datetime'] = pd.to_datetime(LA_data['datetime'])

In [15]:
LA_data.datetime.max() - LA_data.datetime.min()

Timedelta('4382 days 15:00:00')

In [16]:
(4382 * 24) + 15

105183
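
The same arithmetic can be done without hard-coding the day count. The sketch below uses only the standard library and the two timestamps shown in the outputs above. `frame_index` is a hypothetical helper, not something defined in this notebook, and treating both endpoints as frames is an assumption about how the frames will be counted.

```python
from datetime import datetime, timedelta

# First and last timestamps, copied from the outputs above.
start = datetime(2010, 1, 1, 0, 0, 0)
end = datetime(2021, 12, 31, 15, 0, 0)
hour = timedelta(hours=1)

# Whole hours between the two timestamps.
n_hours = (end - start) // hour

# Assumption: if both the first and last hour get a frame, there is one more frame than the gap.
n_frames = n_hours + 1


def frame_index(ts):
    """Hypothetical helper: index of the hourly frame a timestamp falls into."""
    return (ts - start) // hour


print(n_hours, n_frames)                         # 105183 105184
print(frame_index(datetime(2010, 1, 1, 5, 30)))  # 5
```

Flooring to the hour means a collision at 05:30 lands in the same frame as one at 05:00.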

There will be over 100,000 frames of data. To aid in creating all of these frames, a function can be written to fill the PyTorch tensors that will later be used for modeling.

In [None]:
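def fillGrid(cols):
    # Work in progress: `cols` is not used yet. This relies on lat_idxs, lon_idxs
    # (assumed to be pandas Series of fractional grid indices) and lat_lon_grid
    # (assumed to be a DataFrame of counts) already existing, as in the last notebook.
    for idx0 in range(lat_idxs.shape[0]):
        lat_idx = math.floor(lat_idxs.iloc[idx0])
        lon_idx = math.floor(lon_idxs.iloc[idx0])
        # Add one to the count for the grid cell this record falls in.
        lat_lon_grid.iloc[lat_idx, lon_idx] += 1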
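
With over 100,000 frames, a row-by-row loop may become slow. The sketch below shows one possible vectorized alternative using NumPy. `fill_grid_vectorized` and the sample indices are hypothetical, and it assumes the fractional grid indices have already been computed as in the last notebook.

```python
import numpy as np

GRID = 30  # 30x30 grid, from the text


def fill_grid_vectorized(lat_idxs, lon_idxs, grid_size=GRID):
    """Hypothetical helper: count records per cell from fractional grid indices."""
    rows = np.floor(np.asarray(lat_idxs)).astype(int)
    cols = np.floor(np.asarray(lon_idxs)).astype(int)
    counts = np.zeros((grid_size, grid_size), dtype=np.int64)
    # np.add.at accumulates correctly when the same cell appears more than once.
    np.add.at(counts, (rows, cols), 1)
    return counts


# Made-up fractional indices: two records share cell (3, 4), one is in (10, 20).
lat = [3.2, 3.9, 10.5]
lon = [4.1, 4.7, 20.0]
counts = fill_grid_vectorized(lat, lon)

# Binary target: 1 where at least one record fell in the cell, else 0.
target = (counts > 0).astype(np.float32)
print(counts[3, 4], counts[10, 20], int(target.sum()))  # 2 1 2
```

The resulting array can be handed to PyTorch with `torch.from_numpy(target)` once a frame is ready.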
        

### Save Data<a id='Save_Data'></a>

### Conclusion<a id='Conclusion'></a>