# Sequence Processing using Multivariate Time-Series

_**Analyze a simple time-series dataset and perform multivariate time-series forecasting with a simple Recurrent Neural Network (RNN).**_

The following experiment uses the Chicago Transit Authority (CTA) daily ridership dataset. This dataset shows system-wide boardings for both bus and rail services provided by the CTA and is available at https://data.cityofchicago.org/. The version updated through August 1, 2024 is used in this experiment. Note that the values **W**, **A** and **U** of the attribute **day_type** represent **Weekday**, **Saturday** and **Sunday/Holiday**, respectively.

In [2]:
# Imports required packages

import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

## Retrieving & Analysing Dataset

In [4]:
# Load the ridership dataset
ridership = pd.read_csv("./data/CTA-Ridership-Daily_Boarding_Totals_20240829.csv", parse_dates=["service_date"])

In [None]:
# Print the dataset

<code here>

In [9]:
# Set column "service_date" as index to make date/time related operations easier
ridership = ridership.sort_values("service_date").set_index("service_date")

In [None]:
# Print the dataset to check for new index

<code here>

In [13]:
# Drop the calculated column "total_rides", as it is just the element-wise sum of columns "bus" and "rail_boardings"

ridership = ridership.drop("total_rides", axis=1)

In [15]:
# Remove duplicate observations, if any
ridership = ridership.drop_duplicates()

In [None]:
# Print shape of the dataset

<code here>

In [None]:
# Look at the ridership for March, April and May of 2019

ridership["2019-03":"2019-05"].plot(grid=True, marker=".", figsize=(8, 3.5))

plt.show()

The figure above shows a clear weekly seasonality: the same pattern repeats every seven days.
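One way to back up a visual impression of weekly seasonality is to compare the autocorrelation at a 7-day lag with the autocorrelation at an unrelated lag. The sketch below uses a synthetic series (an assumption made so the example is self-contained, not the CTA data); the same calls could be applied to a column such as `ridership["rail_boardings"]`.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (NOT the CTA data): high on weekdays, low on weekends
dates = pd.date_range("2019-03-01", "2019-05-31", freq="D")
rng = np.random.default_rng(0)
values = np.where(dates.dayofweek < 5, 1.0, 0.4) + rng.normal(0, 0.02, len(dates))
series = pd.Series(values, index=dates)

# A weekly pattern gives a high autocorrelation at lag 7 and a much lower one elsewhere
autocorr_7 = series.autocorr(lag=7)
autocorr_3 = series.autocorr(lag=3)
print(f"lag 7: {autocorr_7:.2f}, lag 3: {autocorr_3:.2f}")

# Differencing by 7 days removes most of the weekly pattern, leaving mainly noise
diff_7 = series.diff(7).dropna()
```

On real ridership data the lag-7 autocorrelation would likely be lower than in this idealized case, because holidays and trends disturb the pattern.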

## Multivariate Forecasting using Simple RNN

The goal is to forecast tomorrow's rail ridership based on both the bus and rail ridership (multiple variables as input) of the past 8 weeks (56 days). The next day's type (weekday, Saturday, or Sunday/holiday) is also taken into consideration.

**Prepare Datasets for Modeling**

In [None]:
# Prepare dataset with multiple features as input for modeling
# Ridership values are once again scaled down by a factor of one million, 
# to ensure the values are near the 0–1 range

ridership_multivar = ridership[["bus", "rail_boardings"]] / 1e6
ridership_multivar["next_day_type"] = ridership["day_type"].shift(-1)  # we know tomorrow's type

ridership_multivar = pd.get_dummies(ridership_multivar)  # one-hot encode the day type

# Change the datatype of the day type columns from bool to float so that a TensorFlow dataset can be created
ridership_multivar["next_day_type_A"] = ridership_multivar["next_day_type_A"].astype(float)
ridership_multivar["next_day_type_U"] = ridership_multivar["next_day_type_U"].astype(float)
ridership_multivar["next_day_type_W"] = ridership_multivar["next_day_type_W"].astype(float)

# Show the encoded multivariate dataset
display(ridership_multivar)

The above dataset is a DataFrame with five columns: bus, rail_boardings, plus three columns containing the one-hot encoding of the next day's type.
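The effect of `shift(-1)` followed by one-hot encoding is easiest to see on a few rows. The following is a minimal sketch on a toy frame with hypothetical values; only the column names `bus` and `day_type` are taken from the dataset. Passing `dtype=float` to `pd.get_dummies` is an alternative to casting the columns afterwards.

```python
import pandas as pd

# Toy frame with hypothetical values, indexed by consecutive dates
toy = pd.DataFrame(
    {"bus": [0.30, 0.25, 0.60, 0.62], "day_type": ["A", "U", "W", "W"]},
    index=pd.date_range("2019-03-02", periods=4, freq="D"),
)

# shift(-1) moves each value up one row, so every row holds the NEXT day's type
toy["next_day_type"] = toy["day_type"].shift(-1)

# One-hot encode the shifted column; dtype=float yields floats instead of bools
encoded = pd.get_dummies(toy[["bus", "next_day_type"]], dtype=float)
print(encoded)
```

Note that the last row has no following day, so its shifted value is missing and all of its one-hot columns are zero. The same holds for the final row of the full dataset.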

In [128]:
# Split the time-series into three periods, for training, validation and testing

multivar_train = ridership_multivar["2016-01":"2018-12"]     # 3 years
multivar_val = ridership_multivar["2019-01":"2019-05"]       # 5 months
multivar_test = ridership_multivar["2019-06":]               # remaining period from 2019-06
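Partial date strings select whole months on a `DatetimeIndex`, and both ends of the slice are inclusive, so the three periods do not overlap. The sketch below checks this on a small hypothetical frame; it uses `.loc`, the explicit form of the same label-based slice.

```python
import pandas as pd

# Hypothetical frame spanning the validation/test boundary
idx = pd.date_range("2018-12-30", "2019-06-02", freq="D")
frame = pd.DataFrame({"value": range(len(idx))}, index=idx)

val = frame.loc["2019-01":"2019-05"]   # 1 January through 31 May, inclusive
test = frame.loc["2019-06":]           # 1 June onward

print(val.index.min(), val.index.max(), test.index.min())
```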

In [140]:
# Prepare TensorFlow specific datasets

seq_length = 56  # input window length: 8 weeks of daily observations

tf.random.set_seed(42)

multivar_train_ds = tf.keras.utils.timeseries_dataset_from_array(
    multivar_train.to_numpy(),                              # use all 5 columns as input
    targets=multivar_train["rail_boardings"][seq_length:],  # forecast only the rail series
    sequence_length=seq_length,
    batch_size=32,
    shuffle=True,
    seed=42
)

multivar_val_ds = tf.keras.utils.timeseries_dataset_from_array(
    multivar_val.to_numpy(),
    targets=multivar_val["rail_boardings"][seq_length:],
    sequence_length=seq_length,
    batch_size=32
)
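To see how input windows and targets line up, the pairing can be reproduced in plain NumPy. This is an illustrative sketch of the alignment only, not the implementation of `timeseries_dataset_from_array`; the tiny array and the window length of 3 are assumptions chosen for readability.

```python
import numpy as np

seq_length = 3  # short window for readability; the experiment uses 56
data = np.arange(8 * 2, dtype=float).reshape(8, 2)  # 8 time steps, 2 features
target_series = data[:, 1]                           # forecast the second feature

# All windows of seq_length consecutive rows, arranged as [window, time step, feature]
windows = np.lib.stride_tricks.sliding_window_view(data, seq_length, axis=0)
windows = windows.transpose(0, 2, 1)

# Targets start at index seq_length, so window i is paired with the value right after it
targets = target_series[seq_length:]
windows = windows[: len(targets)]  # the last window has no following value, so drop it

print(windows.shape, targets.shape)
```

Each sample therefore consists of `seq_length` consecutive rows of all features, and its target is the rail value of the day immediately after the window.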

In [143]:
# Reset all Keras state
tf.keras.backend.clear_session()

tf.random.set_seed(42)

# Creates an RNN with 32 recurrent neurons followed by a dense output layer with one output neuron
# The same model was used before for univariate forecasting, but it is now being used for multivariate forecasting
multivar_simple_rnn = tf.keras.Sequential([
    # Instantiate a "SimpleRNN" layer with 32 units as output and input shape as [None, 5]
    <code here>,
    
    # Instantiate a "Dense" layer with 1 unit as output
    <code here>
])
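As background for the architecture described above, the following NumPy sketch shows the computation a simple recurrent layer followed by a dense layer performs on one input window. It is not the Keras code for the exercise: the weights are random stand-ins for trained ones, and all names (`W_x`, `W_h`, `simple_rnn_forecast`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_features, n_units, seq_length = 5, 32, 56

# Randomly initialised weights stand in for trained ones (illustration only)
W_x = rng.normal(scale=0.1, size=(n_features, n_units))  # input-to-hidden
W_h = rng.normal(scale=0.1, size=(n_units, n_units))     # hidden-to-hidden
b = np.zeros(n_units)
w_out = rng.normal(scale=0.1, size=(n_units, 1))         # dense output layer
b_out = np.zeros(1)

def simple_rnn_forecast(window):
    """Run one [seq_length, n_features] window through the recurrence."""
    h = np.zeros(n_units)
    for x_t in window:                        # one time step at a time
        h = np.tanh(x_t @ W_x + h @ W_h + b)  # new state depends on input and old state
    return h @ w_out + b_out                  # only the final state feeds the output

window = rng.random((seq_length, n_features))
forecast = simple_rnn_forecast(window)
n_params = W_x.size + W_h.size + b.size + w_out.size + b_out.size
print(forecast.shape, n_params)
```

Because the same weights are reused at every time step, the parameter count does not depend on the window length, which is why the input shape can leave the time dimension as `None`.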

In [None]:
# Set a callback to stop training when the model does not improve after a certain number of epochs
early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor="val_mae", patience=50, restore_best_weights=True)

# Instantiate SGD optimizer with 0.05 as learning rate and 0.9 as momentum
optimizer = <code here>

# Compile the model with "huber" as loss function, already created optimizer and "mae" as metric
<code here>

# Fit the model on the already created training dataset for up to 500 epochs,
# passing the validation dataset and the early stopping callback
history = <code here>
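For reference, the Huber loss named above is quadratic for small errors and linear for large ones. The sketch below is a NumPy illustration of the formula, assuming a threshold `delta` of 1.0; the function name `huber` is hypothetical and this is not the Keras implementation.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: 0.5 * e^2 if |e| <= delta, else delta * (|e| - 0.5 * delta)."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return np.where(abs_error <= delta, quadratic, linear).mean()

small = huber([0.5], [0.0])   # quadratic branch
large = huber([3.0], [0.0])   # linear branch
print(small, large)
```

With ridership scaled to roughly the 0 to 1 range, errors typically stay below this threshold, so the loss mostly behaves like half the squared error while remaining less sensitive to occasional large outliers.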

In [None]:
# After training, evaluate the model against the validation data
# Multiply the metric by 1e6 to undo the scaling applied earlier

val_loss, val_mae = <code here>
print("Validation MAE of the Multivariate Simple RNN:", val_mae * 1e6)

**Note down the multivariate model's performance.**

**Observations:**

Note down all your observations in the green/blue book.