# Introduction to Large-Scale Forecasting with Deep Learning

- Goal: to predict the hourly westbound traffic on I-94 between Minneapolis and St. Paul in Minnesota
- Dataset:  
    - Total: 6 features and 17,551 rows 
    - Time horizon: from September 29, 2016, at 5 p.m. to September 30, 2018, at 11 p.m.

| Feature | Description |
|:--------|:------------|
| `date_time` | Date and time of the data, recorded in the CST time zone. The format is YYYY-MM-DD HH:MM:SS.|
| `temp` | Average temperature recorded in the hour, expressed in Kelvin. |
| `rain_1h` | Amount of rain that occurred in the hour, expressed in millimeters. |
| `snow_1h` | Amount of snow that occurred in the hour, expressed in millimeters. |
| `clouds_all`| Percentage of cloud cover during the hour. |
| `traffic_volume`| Volume of traffic reported westbound on I-94 during the hour.|

In [39]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import MinMaxScaler

import plotly.express as px
import datetime

In [1]:
import tensorflow as tf


from tensorflow.keras import Model, Sequential

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import MeanAbsoluteError

from tensorflow.keras.layers import Dense, Conv1D, LSTM, Lambda, Reshape, RNN, LSTMCell


## EDA

In [2]:
df = pd.read_csv("../../data/book-time-series-forecasting-in-python/metro_interstate_traffic_volume_preprocessed.csv", parse_dates=[0])

In [3]:
df.head()

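| | date_time | temp | rain_1h | snow_1h | clouds_all | traffic_volume |
|:--|:----------|:-----|:--------|:--------|:-----------|:---------------|
| 0 | 2016-09-29 17:00:00 | 291.75 | 0.0 | 0 | 0 | 5551.0 |
| 1 | 2016-09-29 18:00:00 | 290.36 | 0.0 | 0 | 0 | 4132.0 |
| 2 | 2016-09-29 19:00:00 | 287.86 | 0.0 | 0 | 0 | 3435.0 |
| 3 | 2016-09-29 20:00:00 | 285.91 | 0.0 | 0 | 0 | 2765.0 |
| 4 | 2016-09-29 21:00:00 | 284.31 | 0.0 | 0 | 0 | 2443.0 |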


In [4]:
df.loc[79, 'date_time'].day_of_week # 0 is monday

0

In [29]:
date_list = pd.date_range(start="2016-10-03 00:00:00",end="2016-10-16 00:00:00").tolist()
date_list[:7]

[Timestamp('2016-10-03 00:00:00'),
 Timestamp('2016-10-04 00:00:00'),
 Timestamp('2016-10-05 00:00:00'),
 Timestamp('2016-10-06 00:00:00'),
 Timestamp('2016-10-07 00:00:00'),
 Timestamp('2016-10-08 00:00:00'),
 Timestamp('2016-10-09 00:00:00')]

In [32]:
def plot_feature(title: str, feature: str = 'traffic_volume'):
    # Plot two weeks of hourly data, starting at row 79 (Monday, 2016-10-03 00:00).
    fig = px.line(df[79:79 + 14 * 24], x='date_time', y=feature)
    # Mark each midnight with a dotted vertical line.
    for date in date_list:
        fig.add_vline(date, line=dict(dash='dot'))
    # Label the midnight ticks with day names (two consecutive weeks).
    fig.update_layout(
        xaxis=dict(
            tickmode='array',
            tickvals=date_list,
            ticktext=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'] * 2
        ),
        yaxis_title=feature,
        xaxis_title='Time',
        title=title
    )
    fig.show()

plot_feature('Westbound traffic volume on I-94 between Minneapolis and St. Paul in Minnesota', 'traffic_volume')

- There is clear daily seasonality: traffic volume is lowest at the start and end of each day and peaks around 7 a.m. and 4 p.m. on weekdays.
- Traffic volume is also lower on weekends, which points to a weekly pattern as well.
- As for the trend, two weeks of data are likely insufficient to draw a reasonable conclusion, but the volume appears to be neither increasing nor decreasing over this window.
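
The observations above come from reading the plot, and they can also be checked numerically. The following is a minimal sketch on synthetic data: the `demo` frame and its rush-hour shape are assumptions made up for illustration, not the real dataset. It averages the volume per hour of day on weekdays and compares weekday and weekend means.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the traffic data (an assumption for illustration only):
# two weeks of hourly values with weekday rush-hour peaks and quieter weekends.
idx = pd.date_range("2016-10-03 00:00:00", periods=14 * 24, freq="h")
hour = idx.hour.to_numpy()
is_weekend = idx.dayofweek.to_numpy() >= 5

rush = np.exp(-((hour - 7) ** 2) / 8) + np.exp(-((hour - 16) ** 2) / 8)
volume = 1000 + 4000 * rush * np.where(is_weekend, 0.4, 1.0)
demo = pd.DataFrame({"date_time": idx, "traffic_volume": volume})

# Average volume per hour of day, weekdays only, to expose the daily pattern.
weekday_rows = demo[demo["date_time"].dt.dayofweek < 5]
hourly_profile = weekday_rows.groupby(weekday_rows["date_time"].dt.hour)["traffic_volume"].mean()

# Average volume on weekdays versus weekends.
demo["day_type"] = np.where(demo["date_time"].dt.dayofweek >= 5, "weekend", "weekday")
day_type_means = demo.groupby("day_type")["traffic_volume"].mean()

print(hourly_profile.nlargest(2))  # the two busiest hours of the day
print(day_type_means)
```

Running the same two `groupby` calls on the notebook's `df` (before `date_time` is dropped) would give the corresponding figures for the real data.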

In [33]:
plot_feature('Hourly Temperature (in Kelvin)','temp')

- The temperature is lower at the start and end of each day and peaks toward the middle of each day.
- This suggests daily seasonality in the temperature as well.
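
One way to back up a visual impression of daily seasonality is to look at the autocorrelation at a lag of 24 hours. The sketch below uses a synthetic temperature series (a sine cycle plus noise, both assumptions chosen for illustration) and `pandas.Series.autocorr`; a value close to 1 at lag 24 and a negative value at lag 12 are what a daily cycle would produce.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperatures in Kelvin (an assumption, not the real data):
# a 24-hour sine cycle plus Gaussian noise.
rng = np.random.default_rng(0)
hours = np.arange(14 * 24)
temp = 285 + 5 * np.sin(2 * np.pi * (hours - 9) / 24) + rng.normal(0, 1, hours.size)
series = pd.Series(temp)

# Correlation of the series with itself shifted by one day and by half a day.
autocorr_24 = series.autocorr(lag=24)
autocorr_12 = series.autocorr(lag=12)

print(round(autocorr_24, 2), round(autocorr_12, 2))
```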

In [34]:
df.describe().transpose()

| | count | mean | min | 25% | 50% | 75% | max | std |
|:--|:------|:-----|:----|:----|:----|:----|:----|:----|
| date_time | 17551.0 | 2017-09-30 08:00:00 | 2016-09-29 17:00:00 | 2017-03-31 12:30:00 | 2017-09-30 08:00:00 | 2018-04-01 03:30:00 | 2018-09-30 23:00:00 | |
| temp | 17551.0 | 281.416203 | 243.39 | 272.22 | 282.41 | 291.89 | 310.07 | 12.688262 |
| rain_1h | 17551.0 | 0.025523 | 0.0 | 0.0 | 0.0 | 0.0 | 10.6 | 0.259794 |
| snow_1h | 17551.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| clouds_all | 17551.0 | 42.034129 | 0.0 | 1.0 | 40.0 | 90.0 | 100.0 | 39.06596 |
| traffic_volume | 17551.0 | 3321.484588 | 113.0 | 1298.0 | 3518.0 | 4943.0 | 7280.0 | 1969.223949 |


- `rain_1h` is mostly 0 throughout the dataset: its third quartile is still 0, so at least 75% of its values are 0.
    - This feature will therefore be removed, as it is unlikely to be a strong predictor of traffic volume.
- `snow_1h` is 0 throughout the entire dataset.
    - Being constant, it carries no information, so it will also be removed.
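
The same conclusions can be reached without reading quartiles off the summary table. As a minimal sketch, the made-up `demo` frame below (column names mirror the dataset, the values are assumptions) computes the share of zeros per column and lists the columns that hold a single distinct value.

```python
import pandas as pd

# Small made-up frame (assumption): column names mirror the dataset, values do not.
demo = pd.DataFrame({
    "temp": [291.75, 290.36, 287.86, 285.91],
    "rain_1h": [0.0, 0.0, 0.0, 0.25],
    "snow_1h": [0.0, 0.0, 0.0, 0.0],
})

# Share of rows equal to zero, per column.
zero_share = (demo == 0).mean()

# Columns with a single distinct value carry no information for a model.
constant_cols = [col for col in demo.columns if demo[col].nunique() == 1]

print(zero_share)
print(constant_cols)
```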

In [35]:
cols_to_drop = ['rain_1h', 'snow_1h']
df = df.drop(cols_to_drop, axis=1)

## Feature Engineering

In [37]:
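# Convert each timestamp to seconds since the Unix epoch.
# Note: naive timestamps are interpreted in the machine's local time zone here.
timestamp_s = pd.to_datetime(df['date_time']).map(datetime.datetime.timestamp)
day = 24 * 60 * 60  # number of seconds in a day

# Encode the time of day as a point on the unit circle, so that 23:00 and 00:00 end up close together.
df['day_sin'] = (np.sin(timestamp_s * (2*np.pi/day))).values
df['day_cos'] = (np.cos(timestamp_s * (2*np.pi/day))).values
df = df.drop(['date_time'], axis=1)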
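
The reason for using a sine and cosine pair rather than the raw hour is that time of day is cyclical: hour 23 and hour 0 are one hour apart, yet their raw values differ by 23. The sketch below illustrates this with hour numbers directly instead of epoch seconds (a simplification made for illustration); `encoded_distance` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# One (sin, cos) point on the unit circle per hour of the day.
hours = np.arange(24)
angle = 2 * np.pi * hours / 24
encoded = np.column_stack([np.sin(angle), np.cos(angle)])


def encoded_distance(h1: int, h2: int) -> float:
    """Euclidean distance between two hours in the (sin, cos) encoding."""
    return float(np.linalg.norm(encoded[h1] - encoded[h2]))


d_23_0 = encoded_distance(23, 0)   # across midnight
d_0_1 = encoded_distance(0, 1)     # one ordinary hour apart
d_0_12 = encoded_distance(0, 12)   # opposite sides of the day

print(d_23_0, d_0_1, d_0_12)
```

In this encoding the step across midnight is the same size as any other one-hour step, while midnight and noon sit at opposite points of the circle. A single sine column would not be enough on its own, because two different hours can share the same sine value; the cosine column resolves that ambiguity.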

- 70:20:10 split for the train, validation, and test sets

In [38]:
n = len(df)
 
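# Split 70:20:10 (train:validation:test), in chronological order (no shuffling).
# .copy() makes each split an independent frame, so scaling it in place later
# does not trigger pandas' SettingWithCopyWarning.
train_df = df[0:int(n*0.7)].copy()
val_df = df[int(n*0.7):int(n*0.9)].copy()
test_df = df[int(n*0.9):].copy()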
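
As a quick arithmetic check of what this split yields, the sketch below applies the same index computation to the 17,551 rows stated in the dataset summary. The variable names are illustrative only.

```python
n_rows = 17_551  # row count stated in the dataset summary

# Same boundaries as the slicing above: int() truncates toward zero.
train_end = int(n_rows * 0.7)
val_end = int(n_rows * 0.9)

n_train = train_end
n_val = val_end - train_end
n_test = n_rows - val_end

print(n_train, n_val, n_test)  # 12285 3510 1756
```

Because the rows are in time order, the validation and test sets are the most recent 20% and 10% of the data, which matches how a forecasting model is used in practice: it is evaluated on data that comes after what it was trained on.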

In [43]:
scaler = MinMaxScaler()
scaler.fit(train_df)  # fit on the training set only, so no validation/test information leaks into the scaling

# Assigning whole columns lets pandas replace any integer columns with the scaled float values.
train_df[train_df.columns] = scaler.transform(train_df)
val_df[val_df.columns] = scaler.transform(val_df)
test_df[test_df.columns] = scaler.transform(test_df)
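
Two consequences of fitting the scaler on the training set alone are worth keeping in mind. First, validation and test values can fall outside the [0, 1] range when they exceed the training minimum or maximum. Second, scaled predictions can be mapped back to original units with `inverse_transform`. The sketch below shows both on tiny made-up frames (the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values (assumption): the validation set exceeds the training range on both sides.
train = pd.DataFrame({"traffic_volume": [100.0, 2000.0, 5000.0]})
val = pd.DataFrame({"traffic_volume": [50.0, 6000.0]})

scaler = MinMaxScaler()
scaler.fit(train)  # min and max are learned from the training data only

train_scaled = scaler.transform(train)
val_scaled = scaler.transform(val)

# Map scaled values back to the original units.
val_restored = scaler.inverse_transform(val_scaled)

print(train_scaled.ravel())
print(val_scaled.ravel())
print(val_restored.ravel())
```

Values slightly outside [0, 1] are expected and harmless here; refitting the scaler on the validation or test set to avoid them would leak information about those sets into the preprocessing.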