<a href="https://colab.research.google.com/github/JordanDCunha/Hands-On-Machine-Learning-with-Scikit-Learn-and-PyTorch/blob/main/Chapter13.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recurrent Neurons and Layers

Up to now we have focused on feedforward neural networks, where the activations flow only in one direction, from the input layer to the output layer. A recurrent neural network (RNN) looks very much like a feedforward neural network, except it also has connections pointing backward.

A recurrent neuron receives:
- The input at time step *t*, denoted **x(t)**
- Its own output (or hidden state) from the previous time step, **ŷ(t−1)**

At the first time step, the previous output is typically initialized to zero. Representing the same neuron across multiple time steps is called **unrolling the network through time**.


## Recurrent Layers

A recurrent layer contains multiple recurrent neurons. At each time step *t*, every neuron receives:
- The input vector **x(t)**
- The output vector from the previous time step **ŷ(t−1)**

Each neuron has two sets of weights:
- One for the inputs
- One for the recurrent (previous output) connections

For the whole layer, these weights are stored in matrices:
- **Wₓ** for inputs
- **Wᵧ̂** for recurrent connections


## Mathematical Formulation

For a single instance, the output of a recurrent layer at time step *t* is:

ŷ(t) = φ(Wₓᵀ x(t) + Wᵧ̂ᵀ ŷ(t−1) + b)

Where:
- φ(·) is an activation function (e.g., ReLU or tanh)
- b is a bias vector

For a mini-batch, all instances are processed at once by stacking them as the rows of X(t):

Ŷ(t) = φ(X(t)Wₓ + Ŷ(t−1)Wᵧ̂ + b) = φ([X(t) Ŷ(t−1)]W + b)

The notation [X(t) Ŷ(t−1)] represents the horizontal concatenation of the input and previous output, and W is the vertical concatenation of Wₓ and Wᵧ̂ (Wₓ stacked on top of Wᵧ̂).
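
The two ways of writing the mini-batch computation can be checked numerically. The following is a minimal sketch rather than code from the chapter: the sizes and the names `W_x`, `W_y`, and `W` are chosen here for illustration, and tanh stands in for φ.

```python
import torch

torch.manual_seed(0)
batch_size, n_inputs, n_neurons = 4, 3, 5  # illustrative sizes

X_t = torch.randn(batch_size, n_inputs)      # inputs at time step t
Y_prev = torch.randn(batch_size, n_neurons)  # outputs at time step t-1
W_x = torch.randn(n_inputs, n_neurons)       # input weights
W_y = torch.randn(n_neurons, n_neurons)      # recurrent weights
b = torch.zeros(n_neurons)                   # bias vector

# Form 1: two separate matrix products
Y_t = torch.tanh(X_t @ W_x + Y_prev @ W_y + b)

# Form 2: concatenate the inputs horizontally and stack the weights vertically
XY = torch.cat([X_t, Y_prev], dim=1)  # [batch_size, n_inputs + n_neurons]
W = torch.cat([W_x, W_y], dim=0)      # [n_inputs + n_neurons, n_neurons]
Y_t_concat = torch.tanh(XY @ W + b)

same = torch.allclose(Y_t, Y_t_concat, atol=1e-5)
print(same)
```

Both forms give the same result up to floating-point rounding; the concatenated form simply needs a single matrix multiplication per time step.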


## Temporal Dependency

Because Ŷ(t) depends on Ŷ(t−1), which depends on Ŷ(t−2), and so on, the output at time *t* is a function of **all inputs since time 0**:

X(0), X(1), …, X(t)

This is what gives RNNs their ability to model sequences.


## Memory Cells

A component that preserves information across time steps is called a **memory cell**.

The hidden state at time *t* is denoted **h(t)** and defined as:

h(t) = f(x(t), h(t−1))

The output at time *t* is often simply equal to the hidden state, but in more advanced architectures this may differ.


## Output Feedback vs State Feedback

- **Jordan RNN (1986):** feeds back the output
- **Elman RNN (1990):** feeds back the hidden state (most common today)

Modern RNNs almost always use **state feedback**.


## Input and Output Sequences

RNNs can be structured in several ways:

1. **Sequence → Sequence**  
   Example: time series forecasting

2. **Sequence → Vector**  
   Example: sentiment analysis

3. **Vector → Sequence**  
   Example: image captioning

4. **Encoder–Decoder (Sequence → Vector → Sequence)**  
   Example: machine translation

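For tasks such as translation, encoder–decoder models work better than translating on the fly with a direct sequence-to-sequence model: the last words of a sentence can affect the first words of the translation, so the full input sequence must be processed before any output is generated.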
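
In code, the first two configurations differ only in which outputs are kept. The sketch below uses PyTorch's built-in `nn.RNN` layer with arbitrary illustrative sizes; it is not taken from the chapter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=2, hidden_size=8, batch_first=True)
X = torch.randn(4, 10, 2)  # [batch_size, sequence_length, features]

outputs, h_n = rnn(X)
seq_to_seq = outputs         # one output per time step: [4, 10, 8]
seq_to_vec = outputs[:, -1]  # keep only the last time step: [4, 8]

print(seq_to_seq.shape, seq_to_vec.shape)
```

For a single-layer, unidirectional `nn.RNN`, the last output is also the final hidden state returned as `h_n[0]`.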


In [None]:
# Simple example of a basic RNN cell in PyTorch

import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.Wx = nn.Linear(input_size, hidden_size)
        self.Wh = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, x_t, h_prev):
        h_t = self.activation(self.Wx(x_t) + self.Wh(h_prev))
        return h_t


This basic RNN cell:
- Takes the current input x(t)
- Combines it with the previous hidden state h(t−1)
- Produces a new hidden state h(t)
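
To process a whole sequence, the cell is applied once per time step, carrying the hidden state forward. The loop below is a minimal sketch of that unrolling; the cell definition is repeated so the example runs on its own, and the batch and layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):  # same cell as above, repeated for self-containment
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.Wx = nn.Linear(input_size, hidden_size)
        self.Wh = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, x_t, h_prev):
        return self.activation(self.Wx(x_t) + self.Wh(h_prev))

torch.manual_seed(0)
cell = SimpleRNNCell(input_size=2, hidden_size=4)
X = torch.randn(3, 5, 2)  # [batch_size, sequence_length, features]
h = torch.zeros(3, 4)     # initial hidden state set to zeros

hidden_states = []
for x_t in X.transpose(0, 1):  # iterate over the time dimension
    h = cell(x_t, h)           # the same weights are reused at every step
    hidden_states.append(h)

H = torch.stack(hidden_states, dim=1)  # [batch_size, sequence_length, hidden_size]
print(H.shape)
```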

Training such models relies on **backpropagation through time (BPTT)**, which can suffer from vanishing and exploding gradients. These difficulties motivate more advanced cells such as **LSTMs** and **GRUs**.


## Training RNNs

To train an RNN, the trick is to unroll it through time and then use regular backpropagation.  
This strategy is called **backpropagation through time (BPTT)**.

Just like in regular backpropagation, there is a forward pass through the unrolled network, followed by computing a loss over the output sequence:

ℒ(Y(0), Y(1), …, Y(T); Ŷ(0), Ŷ(1), …, Ŷ(T))

The gradients of that loss are then propagated backward through the unrolled network.  
If some outputs are ignored by the loss (e.g., sequence-to-vector models), gradients only flow through the outputs that contribute to the loss.

Since the same parameters are reused at each time step, their gradients are accumulated multiple times.  
Once all gradients are computed, a gradient descent step updates the parameters.

Fortunately, modern frameworks like PyTorch handle all of this automatically.
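
As a rough illustration, here is what a single BPTT training step might look like for a sequence-to-vector model. This is a sketch under stated assumptions: the data is random, and the layer sizes, learning rate, and variable names are made up for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

X = torch.randn(16, 20, 1)  # toy batch: 16 sequences of 20 time steps
y = torch.randn(16, 1)      # toy targets (random, for illustration only)

outputs, _ = rnn(X)            # forward pass through the unrolled network
y_pred = head(outputs[:, -1])  # sequence-to-vector: only the last output is used
loss = nn.functional.mse_loss(y_pred, y)

optimizer.zero_grad()
loss.backward()  # autograd backpropagates through all 20 time steps
# The recurrent weights are shared across time steps, so this single gradient
# tensor sums the contributions from every step.
grad_norm = rnn.weight_hh_l0.grad.norm().item()
optimizer.step()  # one gradient descent update
print(loss.item(), grad_norm)
```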


## Forecasting a Time Series

Suppose you are tasked with forecasting daily bus and rail ridership for Chicago’s Transit Authority.  
You have daily ridership data since 2001, and your goal is to predict tomorrow’s ridership.

We begin by loading and cleaning the data.


In [None]:
import pandas as pd
from pathlib import Path

path = Path("datasets/ridership/CTA_-_Ridership_-_Daily_Boarding_Totals.csv")
df = pd.read_csv(path, parse_dates=["service_date"])
df.columns = ["date", "day_type", "bus", "rail", "total"]
df = df.sort_values("date").set_index("date")
df = df.drop("total", axis=1)
df = df.drop_duplicates()


Let’s inspect the first few rows of the dataset.


In [None]:
df.head()


The `day_type` column encodes:
- `W`: weekday
- `A`: Saturday
- `U`: Sunday or holiday

Next, let’s visualize bus and rail ridership over a few months in 2019.


In [None]:
import matplotlib.pyplot as plt

df["2019-03":"2019-05"].plot(grid=True, marker=".", figsize=(8, 3.5))
plt.show()


This is a **multivariate time series** since multiple values exist per time step.
A strong **weekly seasonality** is clearly visible.

A simple baseline is **naive forecasting**: copying the value from one week earlier.


### Naive Forecasting and Differencing

To visualize naive forecasts, we compare the original series with a 7-day lagged version and compute their difference.


In [None]:
# 7-day difference: each value minus the value one week earlier
diff_7 = df[["bus", "rail"]].diff(7)["2019-03":"2019-05"]

fig, axs = plt.subplots(2, 1, sharex=True, figsize=(8, 5))
df.plot(ax=axs[0], legend=False, marker=".")  # original time series
df.shift(7).plot(ax=axs[0], grid=True, legend=False, linestyle=":")  # lagged by 7 days
diff_7.plot(ax=axs[1], grid=True, marker=".")  # 7-day difference time series
plt.show()


The lagged series tracks the original closely, indicating **autocorrelation**.
Large deviations correspond to holidays, such as Memorial Day.


In [None]:
list(df.loc["2019-05-25":"2019-05-27"]["day_type"])


### Error Metrics

We compute the **mean absolute error (MAE)** and **mean absolute percentage error (MAPE)**.


In [None]:
diff_7.abs().mean()


In [None]:
targets = df[["bus", "rail"]]["2019-03":"2019-05"]
(diff_7 / targets).abs().mean()


MAPE is approximately:
- 8.3% for bus
- 9.0% for rail

These provide a strong baseline.
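
The same two metrics can be reproduced on a tiny hand-made series, which makes the arithmetic easy to follow. The values below are invented for illustration and have nothing to do with the ridership data.

```python
import pandas as pd

dates = pd.date_range("2019-03-01", periods=14, freq="D")
week1 = [100, 110, 120, 130, 140, 150, 160]
week2 = [v + 10 for v in week1]  # second week is 10 higher on every day
toy = pd.Series(week1 + week2, index=dates, dtype="float64")

naive_forecast = toy.shift(7)       # forecast = value observed 7 days earlier
errors = toy - naive_forecast       # identical to toy.diff(7)
mae = errors.abs().mean()           # NaNs from the first week are skipped
mape = (errors / toy).abs().mean()  # a fraction; multiply by 100 for percent

print(mae, mape)
```

Every naive forecast is off by exactly 10, so the MAE is 10, while the MAPE divides each error by the corresponding target before averaging.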


## Yearly Seasonality and Trends

We now examine yearly patterns using monthly averages and rolling means.


In [None]:
period = slice("2001", "2019")
df_monthly = df.select_dtypes(include="number").resample("ME").mean()
rolling_average_12_months = df_monthly.loc[period].rolling(window=12).mean()

fig, ax = plt.subplots(figsize=(8, 4))
df_monthly[period].plot(ax=ax, marker=".")
rolling_average_12_months.plot(ax=ax, grid=True, legend=False)
plt.show()


Yearly seasonality and long-term trends are visible, especially for rail ridership.

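Differencing over 12 months (subtracting the value from the same month one year earlier) removes both the yearly seasonality and the long-term trend, producing a more stationary series.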
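
The effect of 12-month differencing is easiest to see on synthetic data. The sketch below builds a made-up monthly series with a linear trend and a yearly cycle; on the ridership data, the equivalent call would presumably be `df_monthly.diff(12)`, which is an assumption since that cell is not shown here.

```python
import numpy as np
import pandas as pd

months = pd.date_range("2001-01-01", periods=60, freq="MS")  # 5 years, monthly
t = np.arange(60)
trend = 2.0 * t                               # linear upward trend
seasonal = 10.0 * np.sin(2 * np.pi * t / 12)  # yearly cycle
series = pd.Series(100 + trend + seasonal, index=months)

# Each value minus the value 12 months earlier: the yearly cycle cancels out
# and the linear trend collapses to a constant (2.0 per month * 12 months).
diff_12 = series.diff(12).dropna()
print(diff_12.min(), diff_12.max())
```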



## The ARMA / ARIMA / SARIMA Family

- **ARMA**: weighted sum of past values and past forecast errors
- **ARIMA**: adds differencing to handle non-stationarity
- **SARIMA**: models seasonal patterns explicitly

ARMA assumes that the series is stationary. ARIMA and SARIMA relax this assumption by differencing the series first, which removes trends and seasonal patterns.


### Fitting a SARIMA Model

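We forecast rail ridership using a SARIMA model with weekly seasonality. The `ARIMA` class from statsmodels supports this through its `seasonal_order=(P, D, Q, s)` argument, used here with a seasonal period of s = 7 days.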


In [None]:
from statsmodels.tsa.arima.model import ARIMA

origin, today = "2019-01-01", "2019-05-31"
rail_series = df.loc[origin:today]["rail"].asfreq("D")

model = ARIMA(
    rail_series,
    order=(1, 0, 0),
    seasonal_order=(0, 1, 1, 7)
)
model = model.fit()
model.forecast()


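A single forecast may be poor, and it says little about the model's overall quality. A better evaluation refits the model for every day from March to May 2019, forecasts the next day each time, and averages the absolute errors over the whole period, which yields much better results.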


In [None]:
origin, start_date, end_date = "2019-01-01", "2019-03-01", "2019-05-31"
time_period = pd.date_range(start_date, end_date)
rail_series = df.loc[origin:end_date]["rail"].asfreq("D")

y_preds = []
for today in time_period.shift(-1):
    model = ARIMA(
        rail_series[origin:today],
        order=(1, 0, 0),
        seasonal_order=(0, 1, 1, 7)
    ).fit()
    y_preds.append(model.forecast().iloc[0])

y_preds = pd.Series(y_preds, index=time_period)
(y_preds - rail_series[time_period]).abs().mean()


## Preparing Data for Machine Learning Models

We now prepare sliding windows of historical data to train ML models.
Each input window contains 56 days, and the target is the following day.


In [None]:
import torch

class TimeSeriesDataset(torch.utils.data.Dataset):
    def __init__(self, series, window_length):
        self.series = series
        self.window_length = window_length

    def __len__(self):
        return len(self.series) - self.window_length

    def __getitem__(self, idx):
        end = idx + self.window_length
        window = self.series[idx:end]
        target = self.series[end]
        return window, target


Testing the dataset on a toy example confirms that each window of three consecutive values is paired with the value that immediately follows it.


In [None]:
my_series = torch.tensor([[0], [1], [2], [3], [4], [5]])
my_dataset = TimeSeriesDataset(my_series, window_length=3)

for window, target in my_dataset:
    print("Window:", window, "Target:", target)
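
In practice the dataset is usually wrapped in a `DataLoader`, which stacks windows and targets into batches. The following sketch repeats the dataset class so that it runs on its own; the batch size is an arbitrary choice.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class TimeSeriesDataset(Dataset):  # same logic as the class above
    def __init__(self, series, window_length):
        self.series = series
        self.window_length = window_length

    def __len__(self):
        return len(self.series) - self.window_length

    def __getitem__(self, idx):
        end = idx + self.window_length
        return self.series[idx:end], self.series[end]

series = torch.arange(6, dtype=torch.float32).unsqueeze(1)  # shape [6, 1]
dataset = TimeSeriesDataset(series, window_length=3)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

first_windows, first_targets = next(iter(loader))
print(first_windows.shape)  # [batch_size, window_length, features]
print(first_targets.shape)  # [batch_size, features]
```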


## From Here On

- Linear regression baselines
- Simple RNNs
- Deep RNNs
- Multivariate time series
- Multi-step forecasting
- Sequence-to-sequence RNNs

All follow the same **windowed dataset + PyTorch training loop** pattern.


## Handling Long Sequences

To train an RNN on long sequences, we must run it over many time steps, making the unrolled RNN a very deep network. Like any deep neural network, it may suffer from the unstable gradients problem, and it may also gradually forget the first inputs in the sequence.

We will look at:
1. The unstable gradients problem
2. The short-term memory problem


## Fighting the Unstable Gradients Problem

Many techniques used for deep feedforward networks also apply to RNNs:
- Good parameter initialization
- Faster optimizers
- Dropout

However, nonsaturating activation functions like ReLU may actually make RNNs more unstable. Since the same weights are reused at each time step, small increases in outputs can compound over time and explode.

Using a smaller learning rate or a saturating activation function like `tanh` helps reduce this risk.


### Gradient Explosion and Clipping

Gradients can also explode during backpropagation through time.  
If training is unstable, monitor gradient norms and consider **gradient clipping**.
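
One common way to do both is `torch.nn.utils.clip_grad_norm_`, which rescales the gradients in place and returns the total norm measured before clipping. The sketch below uses a toy RNN and a deliberately inflated loss; the threshold of 1.0 is an arbitrary choice, not a recommendation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
X = torch.randn(4, 50, 1)

outputs, _ = rnn(X)
loss = (outputs ** 2).sum() * 100.0  # deliberately large loss to get large gradients
loss.backward()

max_norm = 1.0
# Returns the global gradient norm before clipping (useful for monitoring)
norm_before = torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=max_norm)
# Recompute the global norm to confirm that the gradients were rescaled
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in rnn.parameters()))

print(norm_before.item(), norm_after.item())
```

In a training loop, the clipping call goes between `loss.backward()` and `optimizer.step()`.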


### Batch Normalization and Layer Normalization in RNNs

Batch normalization cannot be used efficiently across time steps in RNNs. It can only be applied between recurrent layers, not within them.

Layer normalization tends to work better inside recurrent layers. It is usually applied just before the activation function at each time step.


In [None]:
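# Fragment from a model's __init__ method: a memory cell with layer normalization
self.memory_cell = nn.Sequential(
    nn.Linear(input_size + hidden_size, hidden_size),  # takes [x(t), h(t-1)] concatenated
    nn.LayerNorm(hidden_size),                         # normalize before the activation
    nn.Tanh()
)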
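
For context, here is how such a memory cell might sit inside a complete sequence-to-vector model. This is a sketch: the class name `LayerNormRnn`, the output layer, and the sizes used at the end are assumptions made for the example.

```python
import torch
import torch.nn as nn

class LayerNormRnn(nn.Module):  # hypothetical name
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.memory_cell = nn.Sequential(
            nn.Linear(input_size + hidden_size, hidden_size),
            nn.LayerNorm(hidden_size),  # normalize before the activation
            nn.Tanh()
        )
        self.output = nn.Linear(hidden_size, output_size)

    def forward(self, X):
        batch_size = X.shape[0]
        H = torch.zeros(batch_size, self.hidden_size, device=X.device)
        for X_t in X.transpose(0, 1):        # iterate over time steps
            XH = torch.cat((X_t, H), dim=1)  # concatenate x(t) and h(t-1)
            H = self.memory_cell(XH)
        return self.output(H)                # predict from the last hidden state

model = LayerNormRnn(input_size=1, hidden_size=16, output_size=1)
y_pred = model(torch.randn(8, 56, 1))  # 8 windows of 56 time steps
print(y_pred.shape)
```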


Layer normalization does not always help, but it is more effective in gated RNNs such as LSTM and GRU, especially when seasonality or trends have been removed from the data.


## Tackling the Short-Term Memory Problem

As information passes through many time steps, an RNN gradually forgets earlier inputs.  
This makes learning long-term dependencies difficult.

To solve this, **long-term memory cells** were introduced.


## LSTM Cells

The Long Short-Term Memory (LSTM) cell was introduced in 1997.  
It splits the state into:
- Short-term state: h(t)
- Long-term state: c(t)

This allows the model to decide what to store, forget, and retrieve.


### Gates in an LSTM Cell

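An LSTM cell contains four fully connected layers, each taking x(t) and h(t−1) as inputs:
- g(t): candidate memory
- i(t): input gate
- f(t): forget gate
- o(t): output gate

The three gates i(t), f(t), and o(t) use a sigmoid activation, so their outputs lie between 0 and 1 and control how much information flows through. The candidate layer g(t) is not a gate; it typically uses tanh.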


At each time step:
- The forget gate decides what to erase from long-term memory
- The input gate decides what to store
- The output gate decides what to reveal as output

This mechanism allows LSTMs to capture long-term patterns.
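
The steps above can be written out by hand and compared against `nn.LSTMCell`. The sketch below reuses the cell's own weights, relying on PyTorch's documented layout in which the four layers are stacked in the order i, f, g, o; the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, batch_size = 3, 4, 2
cell = nn.LSTMCell(input_size, hidden_size)
x = torch.randn(batch_size, input_size)
h_prev = torch.randn(batch_size, hidden_size)
c_prev = torch.randn(batch_size, hidden_size)

# Pre-activations of all four layers at once, stacked as i, f, g, o
pre = (x @ cell.weight_ih.T + cell.bias_ih
       + h_prev @ cell.weight_hh.T + cell.bias_hh)
i, f, g, o = pre.chunk(4, dim=1)
i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates
g = torch.tanh(g)                                               # candidate memory

c = f * c_prev + i * g  # erase part of the long-term state, then store new memory
h = o * torch.tanh(c)   # reveal a filtered view of the long-term state

h_ref, c_ref = cell(x, (h_prev, c_prev))
matches = (torch.allclose(h, h_ref, atol=1e-5)
           and torch.allclose(c, c_ref, atol=1e-5))
print(matches)
```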


## Implementing an LSTM Manually with LSTMCell


In [None]:
class LstmModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.memory_cell = nn.LSTMCell(input_size, hidden_size)
        self.output = nn.Linear(hidden_size, output_size)

    def forward(self, X):
        batch_size, window_length, dimensionality = X.shape
        X_time_first = X.transpose(0, 1)
        H = torch.zeros(batch_size, self.hidden_size, device=X.device)
        C = torch.zeros(batch_size, self.hidden_size, device=X.device)
        for X_t in X_time_first:
            H, C = self.memory_cell(X_t, (H, C))
        return self.output(H)


This is similar to a simple RNN, except:
- The hidden state is split into H (short-term) and C (long-term)
- An `nn.LSTMCell` is used instead of a simple linear layer


## GRU Cells

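The Gated Recurrent Unit (GRU) is a simplified version of the LSTM:
- Uses a single state vector h(t)
- Merges forget and input gates into a single update gate z(t)
- Removes the output gate: the full state vector is output at every time step
- Adds a reset gate r(t) that controls how much of the previous state is used to compute the candidate state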


GRUs often perform as well as LSTMs while being faster and simpler.
PyTorch provides:
- `nn.GRU`
- `nn.GRUCell`
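
To make the gates concrete, one GRU step can be computed by hand and compared with `nn.GRUCell`. This sketch follows the formulation in the PyTorch documentation, where the cell's weights are stacked in the order reset, update, candidate; the variable names and sizes are chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, batch_size = 3, 4, 2
cell = nn.GRUCell(input_size, hidden_size)
x = torch.randn(batch_size, input_size)
h_prev = torch.randn(batch_size, hidden_size)

gi = x @ cell.weight_ih.T + cell.bias_ih       # input contributions
gh = h_prev @ cell.weight_hh.T + cell.bias_hh  # recurrent contributions
i_r, i_z, i_n = gi.chunk(3, dim=1)
h_r, h_z, h_n = gh.chunk(3, dim=1)

r = torch.sigmoid(i_r + h_r)      # reset gate
z = torch.sigmoid(i_z + h_z)      # update gate
n = torch.tanh(i_n + r * h_n)     # candidate state, using the reset previous state
h = (1 - z) * n + z * h_prev      # z near 1 keeps the old state, near 0 overwrites it

matches = torch.allclose(h, cell(x, h_prev), atol=1e-5)
print(matches)
```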


## Using 1D Convolutional Layers with Sequences

1D convolutional layers slide kernels across sequences instead of images.
They can:
- Detect short-term patterns
- Reduce sequence length
- Help RNNs capture longer dependencies


⚠️ nn.Conv1d expects input shape:
[batch_size, features, sequence_length]

So we must permute dimensions before and after the convolution.
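
A quick shape check makes the permutation and the resulting sequence length explicit. The sizes below are arbitrary examples; without padding, the output length is floor((L_in − kernel_size) / stride) + 1.

```python
import torch
import torch.nn as nn

X = torch.randn(8, 56, 5)  # [batch_size, sequence_length, features]
conv = nn.Conv1d(in_channels=5, out_channels=32, kernel_size=4, stride=2)

Z = conv(X.permute(0, 2, 1))  # Conv1d expects [batch_size, features, sequence_length]
Z = Z.permute(0, 2, 1)        # back to [batch_size, sequence_length, features]

expected_length = (56 - 4) // 2 + 1  # output length without padding
print(Z.shape, expected_length)
```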


In [None]:
class DownsamplingModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.conv = nn.Conv1d(input_size, hidden_size, kernel_size=4, stride=2)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, X):
        Z = X.permute(0, 2, 1)
        Z = self.conv(Z)
        Z = Z.permute(0, 2, 1)
        Z = torch.relu(Z)
        Z, _ = self.gru(Z)
        return self.linear(Z)


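The convolution downsamples the sequence: with `kernel_size=4` and `stride=2`, the GRU receives roughly half as many time steps, which lets it model longer patterns more efficiently. Because the output sequence is shorter than the input, the targets must be cropped and downsampled to match.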


In [None]:
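# Seq2SeqDataset is not defined in this section; it is assumed to return
# (window, target) pairs with one target per input time step.
class DownsampledDataset(Seq2SeqDataset):
    def __getitem__(self, idx):
        window, target = super().__getitem__(idx)
        # Conv1d(kernel_size=4, stride=2): the first output covers time steps 0-3,
        # and each later output advances by 2 steps, so keep targets 3, 5, 7, ...
        return window, target[3::2]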


## WaveNet

WaveNet uses stacked 1D convolutions with exponentially increasing dilation rates:
1, 2, 4, 8, …

This allows the network to capture extremely long-term dependencies efficiently.
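
How far back such a stack can see follows from simple arithmetic: each layer with kernel size k and dilation d extends the receptive field by (k − 1) · d time steps. The helper below is a small illustration written for this section (the function name is made up), assuming a stride of 1 throughout.

```python
def receptive_field(kernel_size, dilations):
    """Number of input time steps that influence one output of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

one_block = receptive_field(2, (1, 2, 4, 8))       # 1 + (1 + 2 + 4 + 8) = 16
two_blocks = receptive_field(2, (1, 2, 4, 8) * 2)  # 1 + 2 * 15 = 31
print(one_block, two_blocks)
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth, whereas undilated layers with the same kernel size would only add one time step each.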


### Causal Convolutions

Causal convolutions pad inputs on the left only, ensuring the model never looks into the future.


In [None]:
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    def forward(self, X):
        padding = (self.kernel_size[0] - 1) * self.dilation[0]
        X = F.pad(X, (padding, 0))
        return super().forward(X)
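
Causality can be verified empirically: changing the input from some time step onward should leave all earlier outputs untouched. The check below is a sketch that repeats the class so it runs on its own; the sequence length, dilation, and cut-off step are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):  # same as above, repeated for self-containment
    def forward(self, X):
        padding = (self.kernel_size[0] - 1) * self.dilation[0]
        X = F.pad(X, (padding, 0))  # pad on the left (past) side only
        return super().forward(X)

torch.manual_seed(0)
conv = CausalConv1d(1, 1, kernel_size=2, dilation=4)
X = torch.randn(1, 1, 20)   # [batch_size, features, sequence_length]
X_changed = X.clone()
X_changed[..., 10:] += 1.0  # modify only time steps 10 and later

with torch.no_grad():
    Y, Y_changed = conv(X), conv(X_changed)

same_length = Y.shape[-1] == X.shape[-1]
past_unchanged = torch.allclose(Y[..., :10], Y_changed[..., :10])
print(same_length, past_unchanged)
```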


## WaveNet Model Implementation


In [None]:
class WavenetModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        layers = []
        for dilation in (1, 2, 4, 8) * 2:
            conv = CausalConv1d(
                input_size, hidden_size, kernel_size=2, dilation=dilation
            )
            layers += [conv, nn.ReLU()]
            input_size = hidden_size
        self.convs = nn.Sequential(*layers)
        self.output = nn.Linear(hidden_size, output_size)

    def forward(self, X):
        Z = X.permute(0, 2, 1)
        Z = self.convs(Z)
        Z = Z.permute(0, 2, 1)
        return self.output(Z)


Thanks to causal padding, the output sequence length matches the input length.
No cropping or downsampling of targets is required.


## Final Notes

- LSTM, GRU, CNN-RNN hybrids, and WaveNet can all model long sequences
- Performance depends heavily on data and task
- Models trained on past data may fail if patterns change (e.g., COVID-19)

Always validate on recent data and monitor production performance.
