# Chapter 15: Processing Sequences Using RNNs and CNNs

## 1. Chapter Overview
**Goal:** Predicting the future is one of the most exciting applications of Machine Learning. Whether it is stock prices, weather, or sentences (predicting the next word), the data is sequential. In this chapter, we learn how to handle data where order matters using **Recurrent Neural Networks (RNNs)**, **LSTMs**, **GRUs**, and even **1D CNNs** (WaveNet).

**Key Concepts:**
* **Recurrent Neurons:** Neurons that feed their output back into themselves (Memory).
* **Unrolling through time:** How RNNs are trained using Backpropagation Through Time (BPTT).
* **Sequence-to-Sequence vs Sequence-to-Vector:** Different topologies for different tasks.
* **The Memory Problem:** Why simple RNNs forget long-term patterns.
* **LSTM (Long Short-Term Memory):** The most widely used recurrent cell for long sequences. Introduces Forget, Input, and Output gates.
* **GRU (Gated Recurrent Unit):** A simplified, faster version of LSTM.
* **1D CNNs (WaveNet):** Processing sequences using convolution instead of recurrence.

**Practical Skills:**
* Generating synthetic time series data.
* Building a **SimpleRNN** to predict the next step in a series.
* Building a **Deep RNN** (stacked layers).
* Implementing **LSTM** and **GRU** layers.
* Forecasting 10 steps ahead (Multi-step forecasting).

In [None]:
# Setup
import sys
assert sys.version_info >= (3, 5)

import sklearn
assert sklearn.__version__ >= "0.20"

import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
tf.random.set_seed(42)

print("TensorFlow version:", tf.__version__)

## 2. Theoretical Explanation (In-Depth)

### 1. Recurrent Neurons
A Feed-Forward network (like an MLP or CNN) has no memory; it processes each input independently. An RNN processes a sequence by iterating through its elements while maintaining a **state** that summarizes what it has seen so far.

**Math:** $y_{(t)} = \phi(W_x x_{(t)} + W_y y_{(t-1)} + b)$
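At time step $t$, the neuron receives the input $x_{(t)}$ AND its own output from the previous time step $y_{(t-1)}$. Here $W_x$ and $W_y$ are the weights applied to the input and to the previous output, $b$ is the bias, and $\phi$ is the activation function (tanh by default in Keras recurrent layers).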
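To make the recurrence concrete, here is a minimal NumPy sketch of a layer of recurrent neurons applied to a batch of sequences. It assumes tanh as the activation and uses the row-vector convention (`x @ W_x` rather than $W_x x$); the function name `simple_rnn_forward` and the demo weights are illustrative, not part of Keras.

```python
import numpy as np

def simple_rnn_forward(X, W_x, W_y, b):
    """Run a layer of recurrent neurons over a batch of sequences.

    X:   [batch_size, n_steps, n_inputs]
    W_x: [n_inputs, n_units]   (input weights)
    W_y: [n_units, n_units]    (recurrent weights)
    b:   [n_units]
    Returns the outputs at every time step: [batch_size, n_steps, n_units].
    """
    batch_size, n_steps, _ = X.shape
    n_units = W_y.shape[0]
    y_prev = np.zeros((batch_size, n_units))  # y_(t-1) starts at zero
    outputs = []
    for t in range(n_steps):
        # y_(t) = tanh(x_(t) W_x + y_(t-1) W_y + b)
        y_prev = np.tanh(X[:, t, :] @ W_x + y_prev @ W_y + b)
        outputs.append(y_prev)
    return np.stack(outputs, axis=1)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 50, 1))          # 4 sequences, 50 steps, 1 feature
W_x = rng.normal(scale=0.5, size=(1, 3))
W_y = rng.normal(scale=0.5, size=(3, 3))
b = np.zeros(3)
outputs = simple_rnn_forward(X_demo, W_x, W_y, b)
print(outputs.shape)  # (4, 50, 3)
```

The same weights are reused at every time step; only `y_prev` changes, which is what gives the layer its memory.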

### 2. The Problem of Long-Term Dependencies
Simple RNNs suffer badly from the vanishing gradient problem. On a long sequence (e.g., 100 steps), gradients shrink as they are propagated back through time, so the network struggles to learn how early inputs affect the final output. It's like trying to recall the first word of a book after reading the whole thing.
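The effect can be illustrated with a toy scalar RNN, $y_{(t)} = \tanh(w\,y_{(t-1)} + x_{(t)})$. This is a simplified sketch, not the chapter's model: the weight `w = 0.9` and the random inputs are arbitrary choices. By the chain rule, the sensitivity of $y_{(t)}$ to the initial state picks up one extra factor $w\,(1 - y_{(t)}^2)$ per step, and each factor is smaller than 1 in magnitude.

```python
import numpy as np

def gradient_wrt_initial_state(w, x):
    """For the scalar RNN y_t = tanh(w * y_(t-1) + x_t), return d y_t / d y_0 for each t."""
    y, grad, grads = 0.0, 1.0, []
    for x_t in x:
        y = np.tanh(w * y + x_t)
        grad *= w * (1.0 - y ** 2)  # chain rule: one extra factor per time step
        grads.append(grad)
    return np.array(grads)

rng = np.random.default_rng(42)
x = rng.uniform(-0.5, 0.5, size=100)
grads = gradient_wrt_initial_state(w=0.9, x=x)
print("after 10 steps: ", abs(grads[9]))
print("after 100 steps:", abs(grads[99]))
```

After 100 steps the gradient is bounded by $0.9^{100} \approx 2.7 \times 10^{-5}$, so the start of the sequence has almost no influence on the weight updates.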

### 3. LSTM (Long Short-Term Memory)
Invented in 1997 by Hochreiter and Schmidhuber. It maintains two state vectors:
1.  **$h_{(t)}$ (Short-term state):** The output.
2.  **$c_{(t)}$ (Long-term state / Cell state):** The conveyor belt that runs straight down the entire chain.

It uses **Gates** to regulate flow:
* **Forget Gate:** Decides what to throw away from the long-term state.
* **Input Gate:** Decides what new information to store.
* **Output Gate:** Decides what to output based on the state.
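The three gates can be sketched as a single LSTM time step in NumPy. This follows the standard LSTM equations rather than Keras's internal code; the packing of the four weight blocks into `W`, `U`, and `b`, and their order (forget, input, candidate, output), are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x_t: [batch, n_inputs]; h_prev, c_prev: [batch, n_units]
    W: [n_inputs, 4 * n_units], U: [n_units, 4 * n_units], b: [4 * n_units]
    Column blocks are ordered (by assumption): forget, input, candidate, output.
    """
    n = h_prev.shape[1]
    z = x_t @ W + h_prev @ U + b
    f = sigmoid(z[:, 0 * n:1 * n])  # forget gate: what to drop from c_prev
    i = sigmoid(z[:, 1 * n:2 * n])  # input gate: how much new information to store
    g = np.tanh(z[:, 2 * n:3 * n])  # candidate values to store
    o = sigmoid(z[:, 3 * n:4 * n])  # output gate: what to expose as output
    c_t = f * c_prev + i * g        # updated long-term state
    h_t = o * np.tanh(c_t)          # updated short-term state (the output)
    return h_t, c_t

rng = np.random.default_rng(0)
n_inputs, n_units, batch = 1, 5, 4
W = rng.normal(scale=0.1, size=(n_inputs, 4 * n_units))
U = rng.normal(scale=0.1, size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)
h, c = np.zeros((batch, n_units)), np.zeros((batch, n_units))
h, c = lstm_step(rng.normal(size=(batch, n_inputs)), h, c, W, U, b)
print(h.shape, c.shape)  # (4, 5) (4, 5)
```

Note that `c_t` is only touched by one multiplication and one addition, which is the "conveyor belt" behaviour described above.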

### 4. GRU (Gated Recurrent Unit)
A simplified LSTM (2014). It merges the cell state and hidden state into one. It has fewer parameters and is faster to train, often matching LSTM performance.
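The "fewer parameters" claim can be checked with a quick count. The helper functions below are hypothetical and use the textbook formulation: an LSTM has four weight blocks (three gates plus the candidate), a GRU has three (update gate, reset gate, candidate), and each block has input weights, recurrent weights, and one bias vector.

```python
def lstm_param_count(n_inputs, n_units):
    # 4 blocks: forget gate, input gate, candidate, output gate
    return 4 * (n_units * (n_inputs + n_units) + n_units)

def gru_param_count(n_inputs, n_units):
    # 3 blocks: update gate, reset gate, candidate
    return 3 * (n_units * (n_inputs + n_units) + n_units)

# A 20-unit layer fed a univariate series, as in this chapter's models
print(lstm_param_count(1, 20))  # 1760
print(gru_param_count(1, 20))   # 1320
```

With this formulation a GRU layer has three quarters of the parameters of an LSTM layer of the same size. Keras's default GRU keeps a second bias vector per block, so `model.summary()` may report a slightly higher figure than this textbook count.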

### 5. 1D CNNs (WaveNet)
Surprisingly, we can use Convolution for sequences. A 1D filter slides over the time axis. By stacking many 1D conv layers with **dilation** (skipping steps), the network can see a very long history (receptive field) without the slowness of sequential processing.
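As a rough illustration of how dilation widens the receptive field, the sketch below implements a causal dilated 1D convolution in plain NumPy and stacks four layers with dilations 1, 2, 4, and 8. The function names, the kernel size of 2, and the all-ones kernel are choices made for this example, not WaveNet's actual weights.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation):
    """Causal 1D convolution over a single series x of shape [n_steps].

    Output step t depends only on x[t], x[t - dilation], x[t - 2 * dilation], ...
    """
    k = len(kernel)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])  # left-pad so the length is preserved
    out = np.zeros(len(x))
    for j, w in enumerate(kernel):
        # kernel[-1] multiplies the current step, kernel[0] the oldest one
        start = j * dilation
        out += w * x_padded[start:start + len(x)]
    return out

def receptive_field(kernel_size, dilations):
    # Each layer extends the visible history by (kernel_size - 1) * dilation steps
    return 1 + sum((kernel_size - 1) * d for d in dilations)

x = np.zeros(32)
x[0] = 1.0  # single impulse at the first step
out = x
for d in [1, 2, 4, 8]:
    out = causal_dilated_conv1d(out, kernel=np.array([1.0, 1.0]), dilation=d)

# The impulse reaches exactly as many output steps as the receptive field
print(np.count_nonzero(out), receptive_field(2, [1, 2, 4, 8]))  # 16 16
```

Doubling the dilation at each layer doubles the receptive field, so the history a stack can see grows exponentially with depth while every layer can still be computed in parallel across time steps.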

## 3. Code Reproduction

### 3.1 Generating Synthetic Time Series
We generate a sum of two sine waves plus noise to simulate a time series.

In [None]:
def generate_time_series(batch_size, n_steps):
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))  # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20)) # wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)   # noise
    return series[..., np.newaxis].astype(np.float32)

n_steps = 50
series = generate_time_series(10000, n_steps + 1)
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

def plot_series(series, y=None, y_pred=None, x_label="$t$", y_label="$x(t)$"):
    plt.plot(series, ".-")
    if y is not None:
        plt.plot(n_steps, y, "bx", markersize=10)
    if y_pred is not None:
        plt.plot(n_steps, y_pred, "ro")
    plt.grid(True)
    plt.xlabel(x_label)
    plt.ylabel(y_label)

plt.figure(figsize=(10, 6))
plot_series(X_valid[0, :, 0], y_valid[0, 0])
plt.title("Time Series Example")
plt.show()

### 3.2 Baseline Metrics
Before building complex RNNs, let's establish a baseline.
1.  **Naive Forecasting:** Predict the last observed value.
2.  **Linear Regression:** A simple Dense layer.

In [None]:
# 1. Naive Forecasting
y_pred = X_valid[:, -1]
print("Naive MSE:", np.mean(keras.losses.mean_squared_error(y_valid, y_pred)))

# 2. Linear Regression (Simple Dense)
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[50, 1]),
    keras.layers.Dense(1)
])
model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, epochs=20, verbose=0)
print("Linear MSE:", model.evaluate(X_valid, y_valid, verbose=0))

### 3.3 Simple RNN
Using the simplest RNN layer: a single recurrent neuron whose output is fed back to itself at the next time step.

In [None]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1]) # None allows variable length sequences
])

model.compile(loss="mse", optimizer="adam")
history = model.fit(X_train, y_train, epochs=20, verbose=0, validation_data=(X_valid, y_valid))
print("SimpleRNN MSE:", model.evaluate(X_valid, y_valid, verbose=0))

### 3.4 Deep RNN
Stacking multiple RNN layers. We must set `return_sequences=True` for all layers except the last one, so the next layer receives a 3D sequence, not a 2D vector.

In [None]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(1)
])

model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, epochs=20, verbose=0)
print("Deep RNN MSE:", model.evaluate(X_valid, y_valid, verbose=0))

### 3.5 LSTM and GRU
Replacing SimpleRNN with LSTM to handle longer patterns better. `keras.layers.GRU` is a drop-in alternative with the same arguments.

In [None]:
model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20),
    keras.layers.Dense(1)
])

model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, epochs=20, verbose=0)
print("LSTM MSE:", model.evaluate(X_valid, y_valid, verbose=0))

### 3.6 Predicting 10 Steps Ahead (Sequence-to-Vector)
Instead of predicting just the next value ($t+1$), we predict the next 10 values ($t+1$ to $t+10$). We need to regenerate the target data $Y$ to be a vector of 10 values.

In [None]:
# Generate new data where Y contains 10 future steps
series = generate_time_series(10000, n_steps + 10)
X_train, Y_train = series[:7000, :n_steps], series[:7000, -10:, 0]
X_valid, Y_valid = series[7000:9000, :n_steps], series[7000:9000, -10:, 0]
X_test, Y_test = series[9000:, :n_steps], series[9000:, -10:, 0]

model = keras.models.Sequential([
    keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.GRU(20),
    keras.layers.Dense(10) # Output layer has 10 neurons now
])

model.compile(loss="mse", optimizer="adam")
model.fit(X_train, Y_train, epochs=20, verbose=0)
print("Multi-step GRU MSE:", model.evaluate(X_valid, Y_valid, verbose=0))

# Visualization of 10-step forecast
Y_pred = model.predict(X_test)  # forecast the 10 future steps for each test series
plt.figure(figsize=(10, 6))
plt.plot(np.arange(n_steps), X_test[0, :, 0], "b.-", label="History")
plt.plot(np.arange(n_steps, n_steps + 10), Y_test[0], "gx", label="Actual Future")
plt.plot(np.arange(n_steps, n_steps + 10), Y_pred[0], "ro", label="Forecast")
plt.legend()
plt.title("10-Step Forecast")
plt.show()

## 4. Step-by-Step Explanation

### 1. Shape of Data
RNNs in Keras expect 3D input: `[batch_size, time_steps, dimensionality]`.
* `batch_size`: Number of samples (e.g., 32).
* `time_steps`: Length of the sequence (e.g., 50).
* `dimensionality`: Number of features per step. For univariate time series (just value), it is 1. For multivariate (e.g., Price + Temperature), it could be 2+.
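A short sketch of getting data into this 3D layout with NumPy. The arrays `prices` and `temperatures` are made-up placeholders for this example.

```python
import numpy as np

# 32 univariate series of 50 steps each, stored as a 2D array [batch_size, time_steps]
prices = np.random.rand(32, 50).astype(np.float32)

# Add a trailing feature axis to get the 3D shape an RNN layer expects
X_univariate = prices[..., np.newaxis]
print(X_univariate.shape)  # (32, 50, 1)

# Multivariate: stack a second (hypothetical) feature along the last axis
temperatures = np.random.rand(32, 50).astype(np.float32)
X_multivariate = np.stack([prices, temperatures], axis=-1)
print(X_multivariate.shape)  # (32, 50, 2)
```

This is the same trick `generate_time_series` uses when it returns `series[..., np.newaxis]`.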

### 2. Return Sequences
This is a common point of confusion.
* `return_sequences=False` (Default): The layer outputs a 2D array `[batch_size, units]`. It only returns the output of the *last* time step. Used for the final layer or before a Dense layer.
* `return_sequences=True`: The layer outputs a 3D array `[batch_size, time_steps, units]`. It outputs the hidden state for *every* time step. This is required when stacking RNN layers, so the next RNN layer has a sequence to process.
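The relationship between the two settings can be shown on a plain array. This is only a shape illustration, assuming a random stand-in for the per-step outputs a recurrent layer computes internally; it does not call Keras.

```python
import numpy as np

batch_size, time_steps, units = 32, 50, 20
# Stand-in for the outputs a recurrent layer computes at every time step
all_outputs = np.random.rand(batch_size, time_steps, units)

seq_output = all_outputs             # what return_sequences=True passes on
last_output = all_outputs[:, -1, :]  # what return_sequences=False passes on

print(seq_output.shape)   # (32, 50, 20)
print(last_output.shape)  # (32, 20)
```

In other words, the default setting discards everything except the last time step, which is why a following RNN layer would have no sequence left to iterate over.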

### 3. LSTM Internals

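The LSTM cell has a "highway" for the cell state $c_{(t)}$: at each step it is changed only by an element-wise multiplication with the forget gate (values between 0 and 1) and an addition of the gated new information. This lets gradients flow back across many steps without vanishing as quickly, which greatly mitigates the memory problem, although very long sequences remain difficult.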
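The highway argument can be written out with the standard LSTM cell-state update. The symbols $f_{(t)}$ (forget gate), $i_{(t)}$ (input gate), and $g_{(t)}$ (candidate values) are notation assumed here; they are not defined elsewhere in these notes.

$$c_{(t)} = f_{(t)} \otimes c_{(t-1)} + i_{(t)} \otimes g_{(t)}$$

Considering only the direct path through the cell state (and ignoring the indirect dependence of the gates on $h_{(t-1)}$), the gradient over $k$ steps is a product of forget-gate values:

$$\frac{\partial c_{(t)}}{\partial c_{(t-k)}} \approx \prod_{j=0}^{k-1} \operatorname{diag}\left(f_{(t-j)}\right)$$

When the network learns to keep a forget gate close to 1, this product stays close to 1 as well, whereas in a simple RNN every step multiplies the gradient by the recurrent weights and an activation derivative.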

### 4. Sequence-to-Vector vs Sequence-to-Sequence
* **Seq-to-Vec:** Input is a sequence, output is a vector (e.g., Sentiment Analysis, predicting the next value). We ignore intermediate outputs.
* **Seq-to-Seq:** Input is a sequence, output is a sequence (e.g., Translation, Frame-by-frame video classification). We use `return_sequences=True` and `TimeDistributed(Dense(...))` to apply a Dense layer to every time step.
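For forecasting, a sequence-to-sequence model needs a target at every input time step, not just the last one. A possible way to build such targets, sketched here with random placeholder data laid out like the chapter's `series` array (the names `horizon` and `step_ahead` are illustrative): at each input step, the target is the next 10 values.

```python
import numpy as np

n_steps, horizon = 50, 10
# Placeholder data with the same layout as the chapter's series: [batch, n_steps + horizon, 1]
series = np.random.rand(100, n_steps + horizon, 1).astype(np.float32)

X = series[:, :n_steps]  # inputs: [batch, 50, 1]
Y = np.empty((len(series), n_steps, horizon), dtype=np.float32)
for step_ahead in range(1, horizon + 1):
    # Column step_ahead - 1 holds the value `step_ahead` steps after each input step
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]

print(X.shape, Y.shape)  # (100, 50, 1) (100, 50, 10)
```

The targets at the final input step, `Y[:, -1]`, are exactly the sequence-to-vector targets `series[:, -10:, 0]` used in section 3.6; the other time steps simply provide extra training signal.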

## 5. Chapter Summary

* **RNNs** are designed for sequential data.
* **SimpleRNN** struggles with long sequences because of vanishing gradients, so it is rarely enough for real tasks.
* **LSTM** and **GRU** are the standard solutions. They use gates to learn what to remember and what to forget.
* **Stacking:** Deep (stacked) RNNs can model more complex patterns than a single layer, but they train more slowly. Every layer except the last needs `return_sequences=True`.
* **Forecasting:** You can predict 1 step ahead or N steps ahead.
* **1D CNNs (WaveNet):** A powerful alternative to RNNs. They can handle very long sequences efficiently using dilated convolutions.