# Chapter 15: Processing Sequences Using RNNs and CNNs

## Comprehensive Study Guide with Theory, Implementation, and Exercises

### Based on "Hands-On Machine Learning" by Aurélien Géron

---

## Table of Contents

1. **Introduction to Sequence Processing**
2. **Recurrent Neurons and Layers**
3. **Training RNNs - Backpropagation Through Time**
4. **Time Series Forecasting**
5. **Handling Long Sequences**
6. **LSTM and GRU Cells**
7. **1D Convolutional Layers for Sequences**
8. **WaveNet Architecture**
9. **Exercises and Solutions**
10. **Mathematical Foundations**

---

## Setup and Imports

First, let's set up our environment with all necessary imports and configurations for Google Colab.

In [None]:
# Google Colab setup
!pip install -q tensorflow-datasets

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds

# Configuration
tf.random.set_seed(42)
np.random.seed(42)

# Plotting configuration
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {tf.config.list_physical_devices('GPU')}")

## 1. Introduction to Sequence Processing

### Theoretical Foundation

Sequences are everywhere in machine learning:
- **Time series**: Stock prices, weather data, sensor readings
- **Natural language**: Sentences, documents, speech
- **Biological sequences**: DNA, protein sequences
- **Video**: Sequences of frames

**Key Challenge**: Traditional neural networks expect fixed-size inputs, but sequences have variable lengths and temporal dependencies.

**Solution**: Recurrent Neural Networks (RNNs) can:
1. Process sequences of arbitrary length
2. Maintain memory of previous inputs
3. Share parameters across time steps

### Mathematical Motivation

For a sequence $\mathbf{x} = (x_1, x_2, ..., x_T)$, we want to model:

$$P(x_1, x_2, ..., x_T) = \prod_{t=1}^{T} P(x_t | x_1, ..., x_{t-1})$$

This requires a model that can capture dependencies between distant time steps.
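
As a small illustration of this factorization, the sketch below scores a binary sequence by summing conditional log-probabilities. It assumes a first-order Markov model, so each conditional depends only on the previous symbol. The `initial` and `transition` tables are made-up values for illustration, not taken from the book.

```python
import numpy as np

# Hypothetical first-order Markov model over two symbols (0 and 1).
initial = np.array([0.6, 0.4])            # P(x_1)
transition = np.array([[0.7, 0.3],        # P(x_t | x_{t-1} = 0)
                       [0.2, 0.8]])       # P(x_t | x_{t-1} = 1)

def sequence_log_prob(seq):
    """Return log P(x_1, ..., x_T) using the chain rule.

    Under the Markov assumption, P(x_t | x_1..x_{t-1}) = P(x_t | x_{t-1}).
    """
    log_p = np.log(initial[seq[0]])
    for prev, curr in zip(seq[:-1], seq[1:]):
        log_p += np.log(transition[prev, curr])
    return log_p

seq = [0, 0, 1, 1]
joint = np.exp(sequence_log_prob(seq))
print(f"P({seq}) = {joint:.4f}")  # 0.6 * 0.7 * 0.3 * 0.8 = 0.1008
```

An RNN drops the Markov restriction: its hidden state can, in principle, summarize the whole prefix $x_1, ..., x_{t-1}$.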

## 2. Recurrent Neurons and Layers

### Basic RNN Architecture

A recurrent neuron at time step $t$ receives:
- Input vector $\mathbf{x}^{(t)}$
- Previous output $\mathbf{y}^{(t-1)}$ (in a basic RNN, this output also serves as the hidden state)

### Mathematical Formulation

For a single recurrent neuron:
$$y^{(t)} = \phi(\mathbf{w}_x^T \mathbf{x}^{(t)} + \mathbf{w}_y^T \mathbf{y}^{(t-1)} + b)$$

For a layer of recurrent neurons:
$$\mathbf{y}^{(t)} = \phi(\mathbf{W}_x^T \mathbf{x}^{(t)} + \mathbf{W}_y^T \mathbf{y}^{(t-1)} + \mathbf{b})$$

Where:
- $\mathbf{W}_x$: weight matrix for inputs (shape: $n_{inputs} \times n_{neurons}$)
- $\mathbf{W}_y$: weight matrix for previous outputs (shape: $n_{neurons} \times n_{neurons}$)
- $\mathbf{b}$: bias vector
- $\phi$: activation function (typically tanh)
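
The layer equation can be sketched in NumPy as a loop over time steps. This is a minimal illustration, not the Keras implementation: with the weight shapes listed above ($n_{inputs} \times n_{neurons}$ and $n_{neurons} \times n_{neurons}$), the matrices are applied transposed to column vectors. The function name `rnn_layer_forward` and the zero initial output are assumptions made for this example.

```python
import numpy as np

def rnn_layer_forward(X, W_x, W_y, b, phi=np.tanh):
    """Run one recurrent layer over a single sequence.

    X:   (n_steps, n_inputs) input sequence
    W_x: (n_inputs, n_neurons) input weights
    W_y: (n_neurons, n_neurons) recurrent weights
    b:   (n_neurons,) bias
    Returns the outputs at every step, shape (n_steps, n_neurons).
    """
    n_steps = X.shape[0]
    n_neurons = W_x.shape[1]
    y_prev = np.zeros(n_neurons)          # y^(0) is assumed to be all zeros
    outputs = np.empty((n_steps, n_neurons))
    for t in range(n_steps):
        # y^(t) = phi(W_x^T x^(t) + W_y^T y^(t-1) + b)
        y_prev = phi(W_x.T @ X[t] + W_y.T @ y_prev + b)
        outputs[t] = y_prev
    return outputs

rng = np.random.default_rng(0)
n_steps, n_inputs, n_neurons = 4, 3, 5
X = rng.normal(size=(n_steps, n_inputs))
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

Y = rnn_layer_forward(X, W_x, W_y, b)
print(Y.shape)  # (4, 5)
```

The same `W_x`, `W_y`, and `b` are reused at every step, which is what "sharing parameters across time steps" means in practice.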

### Memory Cells

A memory cell preserves state across time steps:
- **Hidden state** $\mathbf{h}^{(t)}$: internal state of the cell
- **Output** $\mathbf{y}^{(t)}$: what the cell outputs (may equal hidden state)

General form:
$$\mathbf{h}^{(t)} = f(\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)})$$
$$\mathbf{y}^{(t)} = g(\mathbf{h}^{(t)})$$
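
To make the distinction between hidden state and output concrete, here is a toy sketch in plain Python. The helper `run_cell` and the running-mean cell are hypothetical examples chosen for clarity: the hidden state is a `(count, total)` pair, while the output is their ratio, so the two are visibly different objects.

```python
def run_cell(f, g, h0, xs):
    """Apply h(t) = f(h(t-1), x(t)) and y(t) = g(h(t)) over a sequence."""
    h = h0
    ys = []
    for x in xs:
        h = f(h, x)
        ys.append(g(h))
    return ys, h

# Hypothetical cell: the hidden state is (count, total), the output is the mean.
def f(h, x):
    count, total = h
    return (count + 1, total + x)

def g(h):
    count, total = h
    return total / count

ys, h_final = run_cell(f, g, h0=(0, 0.0), xs=[2.0, 4.0, 9.0])
print(ys)       # [2.0, 3.0, 5.0]
print(h_final)  # (3, 15.0)
```

In a basic RNN, `g` is the identity, so the output equals the hidden state. LSTM cells are an example where the two differ.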

In [None]:
# Demonstration of basic RNN concepts

def visualize_rnn_unrolling():
    """
    This function creates a visualization showing how an RNN is 'unrolled' through time.
    It demonstrates the concept that the same RNN cell is applied at each time step.
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # Folded RNN (left)
    ax1.set_title("Folded RNN", fontsize=14, fontweight='bold')
    ax1.text(0.5, 0.7, "RNN\nCell", ha='center', va='center',
             bbox=dict(boxstyle="round,pad=0.3", facecolor='lightblue'),
             fontsize=12)
    ax1.arrow(0.3, 0.4, 0, 0.2, head_width=0.02, head_length=0.02, fc='black')
    ax1.text(0.3, 0.3, "x(t)", ha='center', fontsize=10)
    ax1.arrow(0.5, 0.9, 0, 0.08, head_width=0.02, head_length=0.02, fc='black')
    ax1.text(0.5, 1.0, "y(t)", ha='center', fontsize=10)
    # Self-loop for recurrence
    circle = plt.Circle((0.7, 0.7), 0.1, fill=False, linestyle='--')
    ax1.add_patch(circle)
    ax1.arrow(0.8, 0.7, 0.1, 0, head_width=0.02, head_length=0.02, fc='red')
    ax1.text(0.95, 0.7, "y(t-1)", ha='left', fontsize=10, color='red')
    ax1.set_xlim(0, 1.2)
    ax1.set_ylim(0, 1.2)
    ax1.axis('off')

    # Unrolled RNN (right)
    ax2.set_title("Unrolled RNN Through Time", fontsize=14, fontweight='bold')
    time_steps = 4
    for t in range(time_steps):
        x_pos = 0.2 + t * 0.2
        # RNN cell
        ax2.text(x_pos, 0.5, f"RNN\nCell", ha='center', va='center',
                bbox=dict(boxstyle="round,pad=0.15", facecolor='lightblue'),
                fontsize=8)
        # Input
        ax2.arrow(x_pos, 0.3, 0, 0.1, head_width=0.01, head_length=0.01, fc='black')
        ax2.text(x_pos, 0.25, f"x({t})", ha='center', fontsize=8)
        # Output
        ax2.arrow(x_pos, 0.65, 0, 0.1, head_width=0.01, head_length=0.01, fc='black')
        ax2.text(x_pos, 0.8, f"y({t})", ha='center', fontsize=8)
        # Connection to next time step
        if t < time_steps - 1:
            ax2.arrow(x_pos + 0.05, 0.5, 0.1, 0, head_width=0.02, head_length=0.01, fc='red')

    ax2.text(0.5, 0.1, "Time →", ha='center', fontsize=12, fontweight='bold')
    ax2.set_xlim(0, 1)
    ax2.set_ylim(0, 1)
    ax2.axis('off')

    plt.tight_layout()
    plt.show()

visualize_rnn_unrolling()

print("Key Concepts Illustrated:")
print("1. Same RNN cell is reused at each time step")
print("2. Hidden state flows from one time step to the next")
print("3. Unrolling reveals the deep architecture through time")

### Input and Output Sequences

RNNs can handle different sequence patterns:

1. **Sequence-to-Sequence**: Input sequence → Output sequence
   - Example: Time series forecasting
   - Use case: Predict next N values given previous M values

2. **Sequence-to-Vector**: Input sequence → Single output
   - Example: Sentiment analysis
   - Use case: Classify entire sequence

3. **Vector-to-Sequence**: Single input → Output sequence
   - Example: Image captioning
   - Use case: Generate sequence from single input

4. **Encoder-Decoder**: Sequence → Vector → Sequence
   - Example: Machine translation
   - Use case: Transform one sequence type to another

### Practical Implementation

In [None]:
# Demonstration of different sequence types with simple examples

def create_sample_data():
    """
    Creates sample data for different sequence processing tasks.
    This helps understand the input/output shapes for different RNN architectures.
    """
    # Sample time series data
    time_steps = 50
    n_samples = 1000

    # Generate synthetic time series
    np.random.seed(42)
    t = np.linspace(0, 4*np.pi, time_steps)
    series = []

    for i in range(n_samples):
        # Different frequencies and phases for variety
        freq1 = np.random.uniform(0.5, 2.0)
        freq2 = np.random.uniform(0.1, 0.5)
        phase1 = np.random.uniform(0, 2*np.pi)
        phase2 = np.random.uniform(0, 2*np.pi)

        # Combine sine waves with noise
        signal = (0.5 * np.sin(freq1 * t + phase1) +
                 0.3 * np.sin(freq2 * t + phase2) +
                 0.1 * np.random.randn(time_steps))
        series.append(signal)

    return np.array(series)

def demonstrate_sequence_types():
    """
    Shows the data shapes and use cases for different sequence processing tasks.
    """
    series_data = create_sample_data()

    print("Original time series shape:", series_data.shape)
    print("Format: (samples, time_steps)\n")

    # 1. Sequence-to-Sequence: Forecast next 10 steps
    window_size = 30
    forecast_steps = 10

    X_seq2seq = []
    y_seq2seq = []

    for series in series_data:
        for i in range(len(series) - window_size - forecast_steps + 1):
            X_seq2seq.append(series[i:i+window_size])
            y_seq2seq.append(series[i+window_size:i+window_size+forecast_steps])

    X_seq2seq = np.array(X_seq2seq).reshape(-1, window_size, 1)
    y_seq2seq = np.array(y_seq2seq)

    print("1. SEQUENCE-TO-SEQUENCE (Time Series Forecasting)")
    print(f"   Input shape: {X_seq2seq.shape} (samples, time_steps, features)")
    print(f"   Output shape: {y_seq2seq.shape} (samples, forecast_steps)")
    print(f"   Use case: Given {window_size} past values, predict next {forecast_steps} values\n")

    # 2. Sequence-to-Vector: Classify trend
    X_seq2vec = X_seq2seq[:1000]  # Use subset
    # Create binary classification: upward vs downward trend
    y_seq2vec = (X_seq2vec[:, -1, 0] > X_seq2vec[:, 0, 0]).astype(int)

    print("2. SEQUENCE-TO-VECTOR (Trend Classification)")
    print(f"   Input shape: {X_seq2vec.shape} (samples, time_steps, features)")
    print(f"   Output shape: {y_seq2vec.shape} (samples,)")
    print(f"   Use case: Classify if sequence shows upward (1) or downward (0) trend\n")

    # 3. Vector-to-Sequence: Generate sequence from initial condition
    X_vec2seq = series_data[:100, 0].reshape(-1, 1)  # Just first value
    y_vec2seq = series_data[:100, 1:21]  # Next 20 values

    print("3. VECTOR-TO-SEQUENCE (Sequence Generation)")
    print(f"   Input shape: {X_vec2seq.shape} (samples, features)")
    print(f"   Output shape: {y_vec2seq.shape} (samples, sequence_length)")
    print(f"   Use case: Generate sequence given initial condition\n")

    # Visualize examples
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # Original series
    axes[0, 0].plot(series_data[0])
    axes[0, 0].set_title("Original Time Series")
    axes[0, 0].set_xlabel("Time Step")
    axes[0, 0].set_ylabel("Value")

    # Sequence-to-sequence example
    example_idx = 5
    input_seq = X_seq2seq[example_idx, :, 0]
    target_seq = y_seq2seq[example_idx]

    axes[0, 1].plot(range(len(input_seq)), input_seq, 'b-', label='Input', linewidth=2)
    axes[0, 1].plot(range(len(input_seq), len(input_seq) + len(target_seq)),
                   target_seq, 'r-', label='Target', linewidth=2)
    axes[0, 1].axvline(x=len(input_seq)-1, color='gray', linestyle='--', alpha=0.7)
    axes[0, 1].set_title("Sequence-to-Sequence Example")
    axes[0, 1].set_xlabel("Time Step")
    axes[0, 1].set_ylabel("Value")
    axes[0, 1].legend()

    # Trend distribution
    axes[1, 0].hist(y_seq2vec, bins=2, alpha=0.7, edgecolor='black')
    axes[1, 0].set_title("Trend Classification Distribution")
    axes[1, 0].set_xlabel("Trend (0=Down, 1=Up)")
    axes[1, 0].set_ylabel("Count")
    axes[1, 0].set_xticks([0, 1])

    # Vector-to-sequence example
    vec_example = 5
    initial_val = X_vec2seq[vec_example, 0]
    generated_seq = y_vec2seq[vec_example]

    axes[1, 1].scatter(0, initial_val, color='red', s=100, label='Initial Value', zorder=5)
    axes[1, 1].plot(range(1, len(generated_seq)+1), generated_seq, 'b-',
                   label='Generated Sequence', linewidth=2)
    axes[1, 1].set_title("Vector-to-Sequence Example")
    axes[1, 1].set_xlabel("Time Step")
    axes[1, 1].set_ylabel("Value")
    axes[1, 1].legend()

    plt.tight_layout()
    plt.show()

    return X_seq2seq, y_seq2seq, X_seq2vec, y_seq2vec, X_vec2seq, y_vec2seq

# Run the demonstration
X_seq2seq, y_seq2seq, X_seq2vec, y_seq2vec, X_vec2seq, y_vec2seq = demonstrate_sequence_types()

## 3. Training RNNs - Backpropagation Through Time (BPTT)

### Theoretical Foundation

Training RNNs requires a modified version of backpropagation called **Backpropagation Through Time (BPTT)**.

### The Process

1. **Forward Pass**: Unroll the RNN and compute outputs for all time steps
2. **Compute Loss**: Calculate loss using all relevant outputs
3. **Backward Pass**: Propagate gradients backward through time
4. **Update Parameters**: Use accumulated gradients to update weights

### Mathematical Formulation

For an RNN with hidden state $\mathbf{h}^{(t)} = f(\mathbf{W}\mathbf{x}^{(t)} + \mathbf{U}\mathbf{h}^{(t-1)} + \mathbf{b})$:

**Forward Pass:**
$$\mathbf{h}^{(t)} = f(\mathbf{W}\mathbf{x}^{(t)} + \mathbf{U}\mathbf{h}^{(t-1)} + \mathbf{b})$$
$$\mathbf{y}^{(t)} = g(\mathbf{V}\mathbf{h}^{(t)} + \mathbf{c})$$

**Loss Function:**
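$$L = \sum_{t=1}^{T} L^{(t)}(\mathbf{y}^{(t)}, \hat{\mathbf{y}}^{(t)})$$

Here $\mathbf{y}^{(t)}$ is the network output from the forward pass, and $\hat{\mathbf{y}}^{(t)}$ denotes the corresponding target.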

**Gradient Computation:**
The gradient of the loss with respect to $\mathbf{U}$ involves the chain rule across time:

$$\frac{\partial L}{\partial \mathbf{U}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial L^{(t)}}{\partial \mathbf{h}^{(t)}} \frac{\partial \mathbf{h}^{(t)}}{\partial \mathbf{h}^{(k)}} \frac{\partial \mathbf{h}^{(k)}}{\partial \mathbf{U}}$$

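Where $\mathbf{a}^{(i)} = \mathbf{W}\mathbf{x}^{(i)} + \mathbf{U}\mathbf{h}^{(i-1)} + \mathbf{b}$ is the pre-activation at step $i$, and:
$$\frac{\partial \mathbf{h}^{(t)}}{\partial \mathbf{h}^{(k)}} = \prod_{i=k+1}^{t} \frac{\partial \mathbf{h}^{(i)}}{\partial \mathbf{h}^{(i-1)}} = \prod_{i=k+1}^{t} \mathbf{U}^T \text{diag}(f'(\mathbf{a}^{(i)}))$$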
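
For a scalar RNN with tanh activation, the Jacobian product above reduces to $\prod_{i=k+1}^{t} u \, (1 - (h^{(i)})^2)$. The sketch below checks that expression against a finite-difference estimate. The parameter values and inputs are arbitrary, chosen only for illustration.

```python
import numpy as np

def run(h_start, xs, w, u, b):
    """Iterate h(i) = tanh(w*x(i) + u*h(i-1) + b) and return all states."""
    hs = [h_start]
    for x in xs:
        hs.append(np.tanh(w * x + u * hs[-1] + b))
    return hs

w, u, b = 0.8, 0.9, 0.1            # arbitrary illustrative parameters
xs = [0.5, -0.3, 0.2, 0.7, -0.1]   # x(1)..x(5)
hs = run(0.0, xs, w, u, b)         # hs[i] is h(i), with h(0) = 0

k, t = 1, 5
# Analytic Jacobian: product over i = k+1..t of u * f'(a(i)),
# where f'(a(i)) = 1 - h(i)^2 for tanh.
analytic = np.prod([u * (1.0 - hs[i] ** 2) for i in range(k + 1, t + 1)])

# Finite-difference check: perturb h(k) and re-run steps k+1..t.
eps = 1e-6
h_plus = run(hs[k] + eps, xs[k:t], w, u, b)[-1]
h_minus = run(hs[k] - eps, xs[k:t], w, u, b)[-1]
numeric = (h_plus - h_minus) / (2 * eps)

print(f"analytic = {analytic:.8f}, numeric = {numeric:.8f}")
```

Each factor is at most $|u|$ in magnitude because $1 - h^2 \le 1$, so here the product is bounded by $u^{t-k}$.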

### Challenges

1. **Vanishing Gradients**: $\prod_{i=k+1}^{t} \mathbf{U}^T \text{diag}(f'(\mathbf{a}^{(i)}))$ can become very small
2. **Exploding Gradients**: The product can become very large
3. **Computational Cost**: memory and computation both grow as $O(T)$, because every intermediate hidden state must be stored for the backward pass
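
The first two challenges can be seen directly in the matrix case. The sketch below is a simplified illustration under two stated assumptions: the activation is linear, so $\text{diag}(f'(\mathbf{a}^{(i)}))$ is the identity, and $\mathbf{U}$ is a scaled orthogonal matrix, so all its singular values equal `scale`. The helper name `jacobian_product_norm` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 8
# Random orthogonal matrix Q (all singular values equal to 1).
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

def jacobian_product_norm(scale, steps):
    """Spectral norm of (U^T)^steps for U = scale * Q.

    Assumes a linear activation, so diag(f'(a)) is the identity.
    """
    U = scale * Q
    product = np.linalg.matrix_power(U.T, steps)
    return np.linalg.norm(product, ord=2)

for scale in (0.5, 1.0, 2.0):
    norm = jacobian_product_norm(scale, 20)
    print(f"scale={scale}: norm after 20 steps = {norm:.3e}")
```

Under these assumptions the norm is exactly `scale ** steps`: it collapses for `scale < 1` and blows up for `scale > 1`. With tanh, every entry of $\text{diag}(f'(\mathbf{a}))$ is at most 1, so the activation can shrink the product but never enlarge it.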

In [None]:
# Demonstration of BPTT concepts and gradient flow

def demonstrate_gradient_flow():
    """
    This function demonstrates how gradients flow through time in RNNs
    and illustrates the vanishing gradient problem.
    """
    # Simulate gradient magnitudes through time
    time_steps = 20

    # Different scenarios
    scenarios = {
        'Vanishing (tanh, small weights)': {
            'weight_scale': 0.5,
            'activation': 'tanh'
        },
        'Exploding (tanh, large weights)': {
            'weight_scale': 2.0,
            'activation': 'tanh'
        },
        'Stable (proper initialization)': {
            'weight_scale': 1.0,
            'activation': 'tanh'
        }
    }

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    for idx, (scenario_name, params) in enumerate(scenarios.items()):
        gradients = [1.0]  # Start with gradient of 1.0 at the end

        # Simulate backward propagation
        for t in range(time_steps - 1):
            # Simplified gradient computation
            # For tanh the derivative is exactly 1 - tanh²(x), which is at most 1 (at x = 0)
            if params['activation'] == 'tanh':
                # Use the upper bound so the weight scale alone decides the regime
                activation_grad = 1.0

            # Gradient through recurrent connection
            weight_contribution = params['weight_scale']
            new_gradient = gradients[-1] * weight_contribution * activation_grad
            gradients.append(new_gradient)

        # Plot gradient magnitude against the number of steps propagated backward
        time_points = list(range(time_steps))
        axes[idx].semilogy(time_points, gradients, 'o-', linewidth=2, markersize=6)
        axes[idx].set_title(f'{scenario_name}', fontweight='bold')
        axes[idx].set_xlabel('Time Steps Back')
        axes[idx].set_ylabel('Gradient Magnitude (log scale)')
        axes[idx].grid(True, alpha=0.3)
        axes[idx].axhline(y=1.0, color='red', linestyle='--', alpha=0.7, label='Initial gradient')

        # Add interpretation
        final_gradient = gradients[-1]
        if final_gradient < 0.01:
            interpretation = "Vanishing!"
            color = 'red'
        elif final_gradient > 100:
            interpretation = "Exploding!"
            color = 'orange'
        else:
            interpretation = "Stable"
            color = 'green'

        axes[idx].text(0.7, 0.1, f'Final: {final_gradient:.6f}\n{interpretation}',
                      transform=axes[idx].transAxes,
                      bbox=dict(boxstyle="round,pad=0.3", facecolor=color, alpha=0.3),
                      fontweight='bold')

        if idx == 0:
            axes[idx].legend()

    plt.tight_layout()
    plt.show()

    print("Gradient Flow Analysis:")
    print("• Vanishing: Gradients decrease exponentially → early time steps contribute almost nothing to learning")
    print("• Exploding: Gradients increase exponentially → unstable training")
    print("• Stable: Gradients remain in reasonable range → effective learning")

def implement_simple_bptt():
    """
    Demonstrates the BPTT algorithm with a simple RNN implementation.
    This is educational - in practice, use TensorFlow's automatic differentiation.
    """
    print("\n" + "="*50)
    print("SIMPLE BPTT IMPLEMENTATION (Educational)")
    print("="*50)

    # Simple RNN for sequence regression (trained below on a summation task)
    class SimpleRNN:
        def __init__(self, input_size, hidden_size, output_size):
            # Initialize weights (small values to avoid exploding gradients)
            self.Wxh = np.random.randn(input_size, hidden_size) * 0.1
            self.Whh = np.random.randn(hidden_size, hidden_size) * 0.1
            self.Why = np.random.randn(hidden_size, output_size) * 0.1
            self.bh = np.zeros((1, hidden_size))
            self.by = np.zeros((1, output_size))

        def forward(self, X):
            """
            Forward pass through the RNN
            X: input sequence (seq_len, input_size)
            """
            seq_len, _ = X.shape

            # Store activations for backward pass
            self.h = np.zeros((seq_len + 1, self.Whh.shape[0]))
            self.y = np.zeros((seq_len, self.Why.shape[1]))

            # Forward through time
            for t in range(seq_len):
                self.h[t+1] = np.tanh(X[t:t+1] @ self.Wxh + self.h[t:t+1] @ self.Whh + self.bh)
                self.y[t] = self.h[t+1] @ self.Why + self.by

            return self.y

        def backward(self, X, y_true, learning_rate=0.01):
            """
            Backward pass (BPTT)
            """
            seq_len = X.shape[0]

            # Initialize gradients
            dWxh = np.zeros_like(self.Wxh)
            dWhh = np.zeros_like(self.Whh)
            dWhy = np.zeros_like(self.Why)
            dbh = np.zeros_like(self.bh)
            dby = np.zeros_like(self.by)

            # Gradient w.r.t. output
            dy = self.y - y_true

            # Backward through time
            dh_next = np.zeros_like(self.h[0:1])

            for t in reversed(range(seq_len)):
                # Output layer gradients (slice h to keep it 2-D so the outer product has shape (hidden, output))
                dWhy += self.h[t+1:t+2].T @ dy[t:t+1]
                dby += dy[t:t+1]

                # Hidden layer gradients
                dh = dy[t:t+1] @ self.Why.T + dh_next
                dh_raw = (1 - self.h[t+1] ** 2) * dh  # tanh derivative

                # Weight gradients
                dWxh += X[t:t+1].T @ dh_raw
                dWhh += self.h[t:t+1].T @ dh_raw
                dbh += dh_raw

                # Gradient for next iteration
                dh_next = dh_raw @ self.Whh.T

            # Update weights
            self.Wxh -= learning_rate * dWxh
            self.Whh -= learning_rate * dWhh
            self.Why -= learning_rate * dWhy
            self.bh -= learning_rate * dbh
            self.by -= learning_rate * dby

            return np.mean(dy ** 2)  # MSE loss

    # Create sample data
    seq_len = 10
    input_size = 1
    hidden_size = 5
    output_size = 1

    # Simple task: sum of sequence
    X = np.random.randn(seq_len, input_size)
    y_true = np.array([[np.sum(X)]] * seq_len)  # Same target (the total sum) at every step, shape (seq_len, 1)

    # Initialize RNN
    rnn = SimpleRNN(input_size, hidden_size, output_size)

    print(f"Training simple RNN on sequence summation task...")
    print(f"Input sequence length: {seq_len}")
    print(f"Hidden size: {hidden_size}")
    print(f"Target: sum of sequence = {np.sum(X):.3f}")

    # Training loop
    losses = []
    for epoch in range(100):
        y_pred = rnn.forward(X)
        loss = rnn.backward(X, y_true, learning_rate=0.01)
        losses.append(loss)

        if epoch % 20 == 0:
            print(f"Epoch {epoch:3d}: Loss = {loss:.6f}, Final prediction = {y_pred[-1, 0]:.3f}")

    # Plot training progress
    plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)
    plt.plot(losses)
    plt.title('Training Loss')
    plt.xlabel('Epoch')
    plt.ylabel('MSE Loss')
    plt.yscale('log')

    plt.subplot(1, 2, 2)
    final_pred = rnn.forward(X)
    plt.plot(range(seq_len), y_true[:, 0], 'b-', label='Target', linewidth=2)
    plt.plot(range(seq_len), final_pred[:, 0], 'r--', label='Prediction', linewidth=2)
    plt.title('Final Predictions vs Targets')
    plt.xlabel('Time Step')
    plt.ylabel('Value')
    plt.legend()

    plt.tight_layout()
    plt.show()

# Run demonstrations
demonstrate_gradient_flow()
implement_simple_bptt()

## 4. Time Series Forecasting

### Theoretical Background

Time series forecasting involves predicting future values based on past observations. The goal is to model:

$$x_{t+h} = f(x_t, x_{t-1}, ..., x_{t-n+1}) + \epsilon_{t+h}$$

Where:
- $h$ is the forecast horizon
- $n$ is the lookback window
- $\epsilon$ is the error term
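
The roles of the lookback window $n$ and the horizon $h$ are easiest to see when building training pairs. The helper `make_windows` below is a hypothetical sketch for a single 1-D series, not a function from the book.

```python
import numpy as np

def make_windows(series, n, h):
    """Build (X, y) pairs: X holds n past values, y the value h steps ahead.

    series: 1-D array of observations x_0, x_1, ...
    n: lookback window length
    h: forecast horizon (h = 1 predicts the next value)
    """
    series = np.asarray(series)
    X, y = [], []
    # t is the index of the most recent value in each window.
    for t in range(n - 1, len(series) - h):
        X.append(series[t - n + 1 : t + 1])
        y.append(series[t + h])
    return np.array(X), np.array(y)

series = np.arange(10)  # toy series: 0, 1, ..., 9
X, y = make_windows(series, n=3, h=2)
print(X[0], y[0])        # [0 1 2] 4
print(X.shape, y.shape)  # (6, 3) (6,)
```

Before feeding `X` to a Keras RNN layer, a trailing feature axis would be added, for example with `X[..., np.newaxis]`, to get the `(samples, time_steps, features)` layout.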

### Types of Time Series

1. **Univariate**: Single variable (e.g., stock price)
2. **Multivariate**: Multiple variables (e.g., weather data with temperature, humidity, pressure)

### Key Challenges

1. **Trend**: Long-term increase or decrease
2. **Seasonality**: Regular patterns (daily, weekly, yearly)
3. **Non-stationarity**: Statistical properties (mean, variance) change over time
4. **Noise**: Random fluctuations

### Baseline Methods

Before using complex models, establish baselines:

1. **Naive Forecast**: $\hat{x}_{t+1} = x_t$
2. **Moving Average**: $\hat{x}_{t+1} = \frac{1}{k}\sum_{i=0}^{k-1} x_{t-i}$
3. **Linear Regression**: $\hat{x}_{t+1} = \alpha + \sum_{i=0}^{n-1} \beta_i x_{t-i}$ (a linear combination of the last $n$ values)
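
As a quick worked example of the naive and moving-average formulas, consider a toy series of five values. The numbers are invented for illustration.

```python
import numpy as np

history = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # toy series x_{t-4}..x_t

naive = history[-1]                 # x_hat_{t+1} = x_t
k = 3
moving_avg = history[-k:].mean()    # mean of the last k values

print(naive)       # 10.0
print(moving_avg)  # 8.0
```

If the upward trend continued, the next value would be 12.0, so both baselines undershoot here, and the moving average lags more than the naive forecast.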

In [None]:
# Time series generation function (as described in the book)

def generate_time_series(batch_size, n_steps):
    """
    Generates synthetic time series as described in the book.
    Each series is a combination of two sine waves with different frequencies
    and phases, plus some noise.

    Args:
        batch_size: Number of time series to generate
        n_steps: Number of time steps in each series

    Returns:
        NumPy array of shape [batch_size, n_steps, 1]
    """
    # Generate random frequencies and offsets for each series
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)

    # Create time axis
    time = np.linspace(0, 1, n_steps)

    # Generate series as sum of two sine waves plus noise
    series = (0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10)) +     # wave 1
              0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20)) +     # wave 2
              0.1 * (np.random.rand(batch_size, n_steps) - 0.5))        # noise

    return series[..., np.newaxis].astype(np.float32)

def demonstrate_time_series_concepts():
    """
    Demonstrates key time series concepts with visualizations.
    """
    print("DEMONSTRATING TIME SERIES CONCEPTS")
    print("="*50)

    # Generate sample data
    n_steps = 50
    series = generate_time_series(10000, n_steps + 1)

    # Split into train/validation/test
    X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
    X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
    X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

    print(f"Dataset shapes:")
    print(f"Training: X={X_train.shape}, y={y_train.shape}")
    print(f"Validation: X={X_valid.shape}, y={y_valid.shape}")
    print(f"Test: X={X_test.shape}, y={y_test.shape}")

    # Visualize examples
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))

    # Show different series
    for i in range(6):
        row, col = i // 3, i % 3
        # Plot the input steps, then the single target value that follows them
        axes[row, col].plot(range(len(X_train[i, :, 0])), X_train[i, :, 0], 'b-', linewidth=2, label='Input')
        axes[row, col].plot(len(X_train[i, :, 0]), y_train[i], 'ro', markersize=8, label='Target')
        axes[row, col].set_title(f'Time Series {i+1}')
        axes[row, col].set_xlabel('Time Step')
        axes[row, col].set_ylabel('Value')
        axes[row, col].legend()
        axes[row, col].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return X_train, y_train, X_valid, y_valid, X_test, y_test

def baseline_methods(X_valid, y_valid):
    """
    Implements baseline forecasting methods as described in the book.
    """
    print("\nBASELINE FORECASTING METHODS")
    print("="*40)

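    # Flatten targets to shape (n_samples,) so they align with the 1-D predictions
    # below; subtracting (n,) from (n, 1) would silently broadcast to (n, n).
    y_valid = y_valid.ravel()

    # 1. Naive forecast (last value)
    naive_pred = X_valid[:, -1, 0]  # Last value in sequence
    naive_mse = np.mean((y_valid - naive_pred) ** 2)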

    print(f"1. Naive Forecast (last value): MSE = {naive_mse:.6f}")

    # 2. Moving average
    window_sizes = [3, 5, 10]
    for window in window_sizes:
        ma_pred = np.mean(X_valid[:, -window:, 0], axis=1)
        ma_mse = np.mean((y_valid - ma_pred) ** 2)
        print(f"2. Moving Average (window={window}): MSE = {ma_mse:.6f}")

    # 3. Linear regression baseline
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Flatten sequences for linear regression
    X_flat = X_valid.reshape(X_valid.shape[0], -1)

    lr_model = LinearRegression()
    # Fit on the training data (X_train and y_train are read from the notebook's global scope)
    X_train_flat = X_train.reshape(X_train.shape[0], -1)
    lr_model.fit(X_train_flat, y_train)

    lr_pred = lr_model.predict(X_flat)
    lr_mse = mean_squared_error(y_valid, lr_pred)

    print(f"3. Linear Regression: MSE = {lr_mse:.6f}")

    # Visualize predictions
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Sample indices for visualization
    sample_indices = np.random.choice(len(y_valid), 100, replace=False)

    methods = [
        ('Naive', naive_pred[sample_indices], naive_mse),
        ('Moving Avg (5)', np.mean(X_valid[sample_indices, -5:, 0], axis=1),
         np.mean((y_valid[sample_indices] - np.mean(X_valid[sample_indices, -5:, 0], axis=1)) ** 2)),
        ('Linear Regression', lr_pred[sample_indices], lr_mse)
    ]

    for i, (method_name, predictions, mse) in enumerate(methods):
        axes[i].scatter(y_valid[sample_indices], predictions, alpha=0.6, s=20)
        axes[i].plot([y_valid.min(), y_valid.max()], [y_valid.min(), y_valid.max()],
                    'r--', linewidth=2, label='Perfect prediction')
        axes[i].set_xlabel('True Values')
        axes[i].set_ylabel('Predicted Values')
        axes[i].set_title(f'{method_name}\nMSE = {mse:.6f}')
        axes[i].legend()
        axes[i].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return {
        'naive': naive_mse,
        'moving_avg_5': np.mean((y_valid - np.mean(X_valid[:, -5:, 0], axis=1)) ** 2),
        'linear_regression': lr_mse
    }

# Generate data and run baseline analysis
X_train, y_train, X_valid, y_valid, X_test, y_test = demonstrate_time_series_concepts()
baseline_results = baseline_methods(X_valid, y_valid)

### Simple RNN Implementation

Now let's implement a simple RNN for time series forecasting, following the book's approach.
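
Before training, it can help to estimate how large each model is. A `SimpleRNN` layer has one input weight per input feature, one recurrent weight per unit, and one bias, all per unit. The helpers below are a hypothetical sketch that applies this count to the three architectures compared in this section (a single unit, a 20-20-1 stack, and two 20-unit layers followed by `Dense(1)`).

```python
def simple_rnn_params(n_inputs, n_units):
    """Trainable parameters of one SimpleRNN layer.

    Each unit has n_inputs input weights, n_units recurrent weights, and 1 bias.
    """
    return n_units * (n_inputs + n_units + 1)

def dense_params(n_inputs, n_units):
    """Trainable parameters of one Dense layer (weights plus biases)."""
    return n_units * (n_inputs + 1)

single_unit = simple_rnn_params(1, 1)
deep_rnn = (simple_rnn_params(1, 20) + simple_rnn_params(20, 20)
            + simple_rnn_params(20, 1))
rnn_plus_dense = (simple_rnn_params(1, 20) + simple_rnn_params(20, 20)
                  + dense_params(20, 1))

print(single_unit)     # 3
print(deep_rnn)        # 1282
print(rnn_plus_dense)  # 1281
```

These totals should match what `model.count_params()` reports for the corresponding Keras models.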

In [None]:
# Simple RNN for time series forecasting

def build_simple_rnn(input_shape, units=1):
    """
    Builds a simple RNN model as described in the book.

    Args:
        input_shape: Shape of input sequences (time_steps, features)
        units: Number of RNN units

    Returns:
        Compiled Keras model
    """
    model = keras.Sequential([
        keras.layers.SimpleRNN(units, input_shape=input_shape)
    ])

    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model

def build_deep_rnn(input_shape, units=(20, 20, 1)):
    """
    Builds a deep RNN model as described in the book.

    Args:
        input_shape: Shape of input sequences
        units: List of units for each layer

    Returns:
        Compiled Keras model
    """
    model = keras.Sequential()

    # First layer
    model.add(keras.layers.SimpleRNN(
        units[0],
        return_sequences=True,
        input_shape=input_shape
    ))

    # Hidden layers
    for i in range(1, len(units)-1):
        model.add(keras.layers.SimpleRNN(
            units[i],
            return_sequences=True
        ))

    # Output layer
    if len(units) > 1:
        model.add(keras.layers.SimpleRNN(units[-1]))

    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model

def build_dense_output_rnn(input_shape, rnn_units=(20, 20), dense_units=1):
    """
    Builds RNN with Dense output layer (as recommended in the book).
    """
    model = keras.Sequential()

    # RNN layers
    for i, units in enumerate(rnn_units):
        return_seq = i < len(rnn_units) - 1  # Return sequences except for last layer
        if i == 0:
            model.add(keras.layers.SimpleRNN(
                units,
                return_sequences=return_seq,
                input_shape=input_shape
            ))
        else:
            model.add(keras.layers.SimpleRNN(
                units,
                return_sequences=return_seq
            ))

    # Dense output layer
    model.add(keras.layers.Dense(dense_units))

    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model

def train_and_evaluate_models():
    """
    Trains and evaluates different RNN architectures.
    """
    print("\nTRAINING RNN MODELS")
    print("="*30)

    input_shape = [None, 1]  # Variable length sequences with 1 feature

    # Build models
    models = {
        'Simple RNN (1 unit)': build_simple_rnn(input_shape, units=1),
        'Deep RNN (20-20-1)': build_deep_rnn(input_shape, units=[20, 20, 1]),
        'RNN + Dense': build_dense_output_rnn(input_shape, rnn_units=[20, 20], dense_units=1)
    }

    results = {}
    histories = {}

    # Training parameters
    epochs = 20
    batch_size = 32

    for name, model in models.items():
        print(f"\nTraining {name}...")
        print(f"Model parameters: {model.count_params():,}")

        # Train model
        history = model.fit(
            X_train, y_train,
            epochs=epochs,
            batch_size=batch_size,
            validation_data=(X_valid, y_valid),
            verbose=0
        )

        # Evaluate
        train_loss = model.evaluate(X_train, y_train, verbose=0)[0]
        val_loss = model.evaluate(X_valid, y_valid, verbose=0)[0]
        test_loss = model.evaluate(X_test, y_test, verbose=0)[0]

        results[name] = {
            'train_mse': train_loss,
            'val_mse': val_loss,
            'test_mse': test_loss,
            'params': model.count_params()
        }

        histories[name] = history

        print(f"Results: Train MSE = {train_loss:.6f}, Val MSE = {val_loss:.6f}, Test MSE = {test_loss:.6f}")

    # Compare with baselines
    print(f"\nCOMPARISON WITH BASELINES:")
    print(f"Naive forecast MSE: {baseline_results['naive']:.6f}")
    print(f"Moving average MSE: {baseline_results['moving_avg_5']:.6f}")
    print(f"Linear regression MSE: {baseline_results['linear_regression']:.6f}")

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # Training curves
    ax1 = axes[0, 0]
    for name, history in histories.items():
        ax1.plot(history.history['loss'], label=f'{name} (train)', alpha=0.7)
        ax1.plot(history.history['val_loss'], label=f'{name} (val)', linestyle='--', alpha=0.7)
    ax1.set_title('Training Curves')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('MSE Loss')
    ax1.legend()
    ax1.set_yscale('log')
    ax1.grid(True, alpha=0.3)

    # Performance comparison
    ax2 = axes[0, 1]
    model_names = list(results.keys()) + ['Naive', 'Moving Avg', 'Linear Reg']
    mse_values = ([results[name]['test_mse'] for name in results.keys()] +
                 [baseline_results['naive'], baseline_results['moving_avg_5'], baseline_results['linear_regression']])

    colors = ['skyblue'] * len(results) + ['orange'] * 3
    bars = ax2.bar(range(len(model_names)), mse_values, color=colors, alpha=0.7)
    ax2.set_title('Test MSE Comparison')
    ax2.set_xlabel('Model')
    ax2.set_ylabel('MSE')
    ax2.set_xticks(range(len(model_names)))
    ax2.set_xticklabels(model_names, rotation=45, ha='right')
    ax2.grid(True, alpha=0.3)

    # Add value labels on bars
    for bar, value in zip(bars, mse_values):
        ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.0001,
                f'{value:.4f}', ha='center', va='bottom', fontsize=8)

    # Model complexity vs performance
    ax3 = axes[1, 0]
    param_counts = [results[name]['params'] for name in results.keys()]
    test_mses = [results[name]['test_mse'] for name in results.keys()]

    ax3.scatter(param_counts, test_mses, s=100, alpha=0.7)
    for i, name in enumerate(results.keys()):
        ax3.annotate(name, (param_counts[i], test_mses[i]),
                    xytext=(5, 5), textcoords='offset points', fontsize=9)
    ax3.set_title('Model Complexity vs Performance')
    ax3.set_xlabel('Number of Parameters')
    ax3.set_ylabel('Test MSE')
    ax3.grid(True, alpha=0.3)

    # Prediction examples
    ax4 = axes[1, 1]
    best_model_name = min(results.keys(), key=lambda x: results[x]['test_mse'])
    best_model = models[best_model_name]

    # Get predictions for a few test samples
    sample_indices = [0, 1, 2]
    for i, idx in enumerate(sample_indices):
        # Plot input sequence
        input_seq = X_test[idx, :, 0]
        true_next = y_test[idx]
        pred_next = best_model.predict(X_test[idx:idx+1], verbose=0)[0, 0]

        time_steps = range(len(input_seq))
        ax4.plot(time_steps, input_seq, 'b-', alpha=0.7, linewidth=1)
        ax4.scatter(len(input_seq), true_next, color='red', s=60, alpha=0.8, label='True' if i == 0 else '')
        ax4.scatter(len(input_seq), pred_next, color='orange', s=60, alpha=0.8, marker='x',
                   label='Predicted' if i == 0 else '')

    ax4.set_title(f'Predictions: {best_model_name}')
    ax4.set_xlabel('Time Step')
    ax4.set_ylabel('Value')
    ax4.legend()
    ax4.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return results, models

# Train and evaluate models
model_results, trained_models = train_and_evaluate_models()

## 5. LSTM and GRU Cells

### Long Short-Term Memory (LSTM)

LSTM cells mitigate the vanishing gradient problem of simple RNNs by carrying long-term information in a separate cell state that is updated additively and regulated by three gates.

#### Mathematical Formulation

LSTM maintains two states:
- **Cell state** $\mathbf{c}^{(t)}$: Long-term memory
- **Hidden state** $\mathbf{h}^{(t)}$: Short-term memory/output

**Gate Equations:**
$$\mathbf{i}^{(t)} = \sigma(\mathbf{W}_{xi}\mathbf{x}^{(t)} + \mathbf{W}_{hi}\mathbf{h}^{(t-1)} + \mathbf{b}_i)$$ (Input gate)
$$\mathbf{f}^{(t)} = \sigma(\mathbf{W}_{xf}\mathbf{x}^{(t)} + \mathbf{W}_{hf}\mathbf{h}^{(t-1)} + \mathbf{b}_f)$$ (Forget gate)
$$\mathbf{o}^{(t)} = \sigma(\mathbf{W}_{xo}\mathbf{x}^{(t)} + \mathbf{W}_{ho}\mathbf{h}^{(t-1)} + \mathbf{b}_o)$$ (Output gate)

**Candidate Values:**
$$\mathbf{g}^{(t)} = \tanh(\mathbf{W}_{xg}\mathbf{x}^{(t)} + \mathbf{W}_{hg}\mathbf{h}^{(t-1)} + \mathbf{b}_g)$$

**State Updates:**
$$\mathbf{c}^{(t)} = \mathbf{f}^{(t)} \odot \mathbf{c}^{(t-1)} + \mathbf{i}^{(t)} \odot \mathbf{g}^{(t)}$$
$$\mathbf{h}^{(t)} = \mathbf{o}^{(t)} \odot \tanh(\mathbf{c}^{(t)})$$

Where $\odot$ denotes element-wise multiplication and $\sigma$ is the sigmoid function.
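
The equations above can be traced step by step in plain NumPy. The following is a minimal sketch, not the Keras implementation: the `params` dictionary, its key names, and the toy dimensions are assumptions made for illustration, and inputs are treated as row vectors (so `x @ W` replaces $\mathbf{W}\mathbf{x}$).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate equations above.

    `params` is a hypothetical dict holding, for each block k in
    {"i", "f", "o", "g"}, the matrices W_x{k}, W_h{k} and the bias b_{k}.
    """
    def affine(k):
        return x_t @ params[f"W_x{k}"] + h_prev @ params[f"W_h{k}"] + params[f"b_{k}"]

    i_t = sigmoid(affine("i"))   # input gate
    f_t = sigmoid(affine("f"))   # forget gate
    o_t = sigmoid(affine("o"))   # output gate
    g_t = np.tanh(affine("g"))   # candidate values

    c_t = f_t * c_prev + i_t * g_t   # cell state: keep part of the old, add part of the new
    h_t = o_t * np.tanh(c_t)         # hidden state / output
    return h_t, c_t

# Toy dimensions (illustrative only): 3 input features, 4 units
rng = np.random.default_rng(42)
n_inputs, n_units = 3, 4
params = {}
for k in "ifog":
    params[f"W_x{k}"] = rng.normal(scale=0.1, size=(n_inputs, n_units))
    params[f"W_h{k}"] = rng.normal(scale=0.1, size=(n_units, n_units))
    params[f"b_{k}"] = np.zeros(n_units)

h = np.zeros((1, n_units))
c = np.zeros((1, n_units))
for x_t in rng.normal(size=(5, 1, n_inputs)):   # 5 time steps, batch of 1
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

Note that the cell state is only touched by element-wise operations, which is what lets information (and gradients) pass through many steps when the forget gate stays close to 1.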

### Gated Recurrent Unit (GRU)

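GRU is a simplified variant of the LSTM: it merges the cell state and hidden state into a single vector and uses two gates instead of three, so it has fewer parameters for the same number of units: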

**Gate Equations:**
$$\mathbf{z}^{(t)} = \sigma(\mathbf{W}_{xz}\mathbf{x}^{(t)} + \mathbf{W}_{hz}\mathbf{h}^{(t-1)} + \mathbf{b}_z)$$ (Update gate)
$$\mathbf{r}^{(t)} = \sigma(\mathbf{W}_{xr}\mathbf{x}^{(t)} + \mathbf{W}_{hr}\mathbf{h}^{(t-1)} + \mathbf{b}_r)$$ (Reset gate)

**Candidate State:**
$$\mathbf{g}^{(t)} = \tanh(\mathbf{W}_{xg}\mathbf{x}^{(t)} + \mathbf{W}_{hg}(\mathbf{r}^{(t)} \odot \mathbf{h}^{(t-1)}) + \mathbf{b}_g)$$

**State Update:**
$$\mathbf{h}^{(t)} = \mathbf{z}^{(t)} \odot \mathbf{h}^{(t-1)} + (1 - \mathbf{z}^{(t)}) \odot \mathbf{g}^{(t)}$$
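
As a rough illustration of the GRU equations above, here is a single time step in NumPy. This is a sketch under assumptions: the parameter dictionary `p` and its key names are hypothetical, inputs are row vectors, and it follows the convention used in this section where $\mathbf{z}^{(t)}$ close to 1 keeps the previous state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step following the equations above (row-vector convention)."""
    z_t = sigmoid(x_t @ p["W_xz"] + h_prev @ p["W_hz"] + p["b_z"])          # update gate
    r_t = sigmoid(x_t @ p["W_xr"] + h_prev @ p["W_hr"] + p["b_r"])          # reset gate
    g_t = np.tanh(x_t @ p["W_xg"] + (r_t * h_prev) @ p["W_hg"] + p["b_g"])  # candidate state
    return z_t * h_prev + (1.0 - z_t) * g_t                                 # blend old and new

# Toy dimensions (illustrative only): 3 input features, 4 units
rng = np.random.default_rng(0)
n_inputs, n_units = 3, 4
p = {}
for k in "zrg":
    p[f"W_x{k}"] = rng.normal(scale=0.1, size=(n_inputs, n_units))
    p[f"W_h{k}"] = rng.normal(scale=0.1, size=(n_units, n_units))
    p[f"b_{k}"] = np.zeros(n_units)

h_prev = rng.uniform(-1, 1, size=(1, n_units))
x_t = rng.normal(size=(1, n_inputs))
h_new = gru_step(x_t, h_prev, p)

# With a very large update-gate bias, z saturates at 1 and the old state is copied
p_keep = dict(p, b_z=np.full(n_units, 50.0))
h_kept = gru_step(x_t, h_prev, p_keep)
print(np.allclose(h_kept, h_prev))
```

The last lines show the role of the update gate: when it saturates, the new input is ignored and the previous state passes through unchanged.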

In [None]:
# LSTM and GRU implementation and comparison

def visualize_lstm_architecture():
    """
    Creates a visual representation of LSTM cell architecture.
    """
    fig, ax = plt.subplots(1, 1, figsize=(14, 8))

    # LSTM cell components
    ax.text(0.5, 0.9, 'LSTM Cell Architecture', ha='center', fontsize=16, fontweight='bold')

    # Cell state flow (top)
    ax.arrow(0.1, 0.7, 0.8, 0, head_width=0.02, head_length=0.02, fc='blue', linewidth=3)
    ax.text(0.5, 0.75, 'Cell State C(t-1) → C(t)', ha='center', fontsize=12, color='blue', fontweight='bold')

    # Gates
    gates = [
        (0.2, 0.5, 'Forget\nGate', 'lightcoral'),
        (0.4, 0.5, 'Input\nGate', 'lightgreen'),
        (0.6, 0.5, 'Candidate\nValues', 'lightyellow'),
        (0.8, 0.5, 'Output\nGate', 'lightblue')
    ]

    for x, y, label, color in gates:
        rect = plt.Rectangle((x-0.05, y-0.08), 0.1, 0.16,
                           facecolor=color, edgecolor='black', linewidth=1)
        ax.add_patch(rect)
        ax.text(x, y, label, ha='center', va='center', fontsize=10, fontweight='bold')

    # Input connections
    ax.arrow(0.2, 0.2, 0, 0.22, head_width=0.01, head_length=0.01, fc='black')
    ax.arrow(0.4, 0.2, 0, 0.22, head_width=0.01, head_length=0.01, fc='black')
    ax.arrow(0.6, 0.2, 0, 0.22, head_width=0.01, head_length=0.01, fc='black')
    ax.arrow(0.8, 0.2, 0, 0.22, head_width=0.01, head_length=0.01, fc='black')

    ax.text(0.5, 0.15, 'x(t) and h(t-1)', ha='center', fontsize=12, fontweight='bold')

    # Mathematical operations
    ax.text(0.25, 0.35, '×', ha='center', fontsize=20, fontweight='bold')  # Forget
    ax.text(0.45, 0.35, '×', ha='center', fontsize=20, fontweight='bold')  # Input
    ax.text(0.5, 0.7, '+', ha='center', fontsize=20, fontweight='bold', color='blue')  # Add
    ax.text(0.75, 0.35, '×', ha='center', fontsize=20, fontweight='bold')  # Output

    # Output
    ax.arrow(0.8, 0.58, 0, 0.1, head_width=0.01, head_length=0.01, fc='red', linewidth=2)
    ax.text(0.85, 0.65, 'h(t)', ha='left', fontsize=12, color='red', fontweight='bold')

    # Key equations
    equations = [
        'f(t) = σ(Wf·[h(t-1), x(t)] + bf)',
        'i(t) = σ(Wi·[h(t-1), x(t)] + bi)',
        'g(t) = tanh(Wg·[h(t-1), x(t)] + bg)',
        'o(t) = σ(Wo·[h(t-1), x(t)] + bo)',
        'C(t) = f(t)⊙C(t-1) + i(t)⊙g(t)',
        'h(t) = o(t)⊙tanh(C(t))'
    ]

    for i, eq in enumerate(equations):
        ax.text(0.02, 0.02 + i*0.04, eq, fontsize=9, fontfamily='monospace')

    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis('off')

    plt.tight_layout()
    plt.show()

def build_lstm_models():
    """
    Builds LSTM models with different configurations.
    """
    input_shape = [None, 1]

    models = {}

    # Simple LSTM
    models['Simple LSTM'] = keras.Sequential([
        keras.layers.LSTM(20, input_shape=input_shape),
        keras.layers.Dense(1)
    ])

    # Deep LSTM
    models['Deep LSTM'] = keras.Sequential([
        keras.layers.LSTM(20, return_sequences=True, input_shape=input_shape),
        keras.layers.LSTM(20, return_sequences=True),
        keras.layers.LSTM(20),
        keras.layers.Dense(1)
    ])

    # Simple GRU
    models['Simple GRU'] = keras.Sequential([
        keras.layers.GRU(20, input_shape=input_shape),
        keras.layers.Dense(1)
    ])

    # Deep GRU
    models['Deep GRU'] = keras.Sequential([
        keras.layers.GRU(20, return_sequences=True, input_shape=input_shape),
        keras.layers.GRU(20, return_sequences=True),
        keras.layers.GRU(20),
        keras.layers.Dense(1)
    ])

    # LSTM with dropout
    models['LSTM + Dropout'] = keras.Sequential([
        keras.layers.LSTM(20, return_sequences=True, input_shape=input_shape,
                         dropout=0.2, recurrent_dropout=0.2),
        keras.layers.LSTM(20, dropout=0.2, recurrent_dropout=0.2),
        keras.layers.Dense(1)
    ])

    # Compile all models
    for model in models.values():
        model.compile(optimizer='adam', loss='mse', metrics=['mae'])

    return models

def compare_architectures():
    """
    Compares the LSTM and GRU variants returned by build_lstm_models().
    """
    print("\nCOMPARING RNN ARCHITECTURES")
    print("="*40)

    models = build_lstm_models()

    results = {}
    training_times = {}

    epochs = 15
    batch_size = 32

    import time  # imported once, before the training loop

    for name, model in models.items():
        print(f"\nTraining {name}...")
        print(f"Parameters: {model.count_params():,}")

        # Time the training
        start_time = time.time()

        history = model.fit(
            X_train, y_train,
            epochs=epochs,
            batch_size=batch_size,
            validation_data=(X_valid, y_valid),
            verbose=0
        )

        training_time = time.time() - start_time
        training_times[name] = training_time

        # Evaluate
        test_loss = model.evaluate(X_test, y_test, verbose=0)[0]
        val_loss = min(history.history['val_loss'])

        results[name] = {
            'test_mse': test_loss,
            'best_val_mse': val_loss,
            'params': model.count_params(),
            'training_time': training_time,
            'history': history
        }

        print(f"Test MSE: {test_loss:.6f}")
        print(f"Best Val MSE: {val_loss:.6f}")
        print(f"Training time: {training_time:.1f}s")

    # Visualization
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))

    # Training curves
    ax1 = axes[0, 0]
    for name, result in results.items():
        history = result['history']
        ax1.plot(history.history['val_loss'], label=name, linewidth=2)
    ax1.set_title('Validation Loss Curves')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('MSE Loss')
    ax1.legend()
    ax1.set_yscale('log')
    ax1.grid(True, alpha=0.3)

    # Performance vs Parameters
    ax2 = axes[0, 1]
    params = [result['params'] for result in results.values()]
    test_mses = [result['test_mse'] for result in results.values()]
    names = list(results.keys())

    ax2.scatter(params, test_mses, s=100, alpha=0.7)
    for i, name in enumerate(names):
        ax2.annotate(name, (params[i], test_mses[i]),
                    xytext=(5, 5), textcoords='offset points', fontsize=9)
    ax2.set_title('Performance vs Model Complexity')
    ax2.set_xlabel('Number of Parameters')
    ax2.set_ylabel('Test MSE')
    ax2.grid(True, alpha=0.3)

    # Training time comparison
    ax3 = axes[0, 2]
    times = [result['training_time'] for result in results.values()]
    bars = ax3.bar(range(len(names)), times, alpha=0.7)
    ax3.set_title('Training Time Comparison')
    ax3.set_xlabel('Model')
    ax3.set_ylabel('Training Time (seconds)')
    ax3.set_xticks(range(len(names)))
    ax3.set_xticklabels(names, rotation=45, ha='right')

    # Performance comparison bar chart
    ax4 = axes[1, 0]
    bars = ax4.bar(range(len(names)), test_mses, alpha=0.7)
    ax4.set_title('Test MSE Comparison')
    ax4.set_xlabel('Model')
    ax4.set_ylabel('Test MSE')
    ax4.set_xticks(range(len(names)))
    ax4.set_xticklabels(names, rotation=45, ha='right')

    # Add value labels
    for bar, value in zip(bars, test_mses):
        ax4.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.0001,
                f'{value:.4f}', ha='center', va='bottom', fontsize=8)

    # Memory usage visualization (parameter count breakdown)
    ax5 = axes[1, 1]
    # Group by architecture type
    lstm_models = [name for name in names if 'LSTM' in name]
    gru_models = [name for name in names if 'GRU' in name]

    lstm_params = [results[name]['params'] for name in lstm_models]
    gru_params = [results[name]['params'] for name in gru_models]

    x_pos = np.arange(max(len(lstm_models), len(gru_models)))
    width = 0.35

    if lstm_params:
        ax5.bar(x_pos[:len(lstm_params)] - width/2, lstm_params, width,
               label='LSTM variants', alpha=0.7)
    if gru_params:
        ax5.bar(x_pos[:len(gru_params)] + width/2, gru_params, width,
               label='GRU variants', alpha=0.7)

    ax5.set_title('Parameter Count by Architecture')
    ax5.set_xlabel('Model Variant')
    ax5.set_ylabel('Number of Parameters')
    ax5.legend()

    # Efficiency plot (performance vs time)
    ax6 = axes[1, 2]
    ax6.scatter(times, test_mses, s=100, alpha=0.7)
    for i, name in enumerate(names):
        ax6.annotate(name, (times[i], test_mses[i]),
                    xytext=(5, 5), textcoords='offset points', fontsize=9)
    ax6.set_title('Efficiency: Performance vs Training Time')
    ax6.set_xlabel('Training Time (seconds)')
    ax6.set_ylabel('Test MSE')
    ax6.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Summary table
    print("\nSUMMARY TABLE:")
    print("=" * 80)
    print(f"{'Model':<20} {'Test MSE':<12} {'Parameters':<12} {'Time (s)':<10} {'Efficiency':<10}")
    print("-" * 80)

    for name, result in results.items():
        # Crude cost score: training time multiplied by test error (lower is better)
        efficiency = result['training_time'] * result['test_mse']
        print(f"{name:<20} {result['test_mse']:<12.6f} {result['params']:<12,} {result['training_time']:<10.1f} {efficiency:<10.2f}")

    return results, models

# Visualize LSTM architecture and compare models
visualize_lstm_architecture()
lstm_results, lstm_models = compare_architectures()

### Multi-Step Forecasting

The book discusses three approaches for forecasting multiple time steps:

1. **Iterative Approach**: Predict one step, add to input, repeat
2. **Direct Approach**: Train model to output multiple steps at once
3. **Sequence-to-Sequence**: Use all time steps for training

#### Mathematical Framework

For forecasting $h$ steps ahead:

**Iterative:**
$$\hat{x}_{t+1} = f(x_t, x_{t-1}, ..., x_{t-n+1})$$
$$\hat{x}_{t+2} = f(\hat{x}_{t+1}, x_t, ..., x_{t-n+2})$$
$$\vdots$$

**Direct:**
$$[\hat{x}_{t+1}, \hat{x}_{t+2}, ..., \hat{x}_{t+h}] = f(x_t, x_{t-1}, ..., x_{t-n+1})$$
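
The iterative and direct formulations above differ only in whether predictions are fed back into the input window. The sketch below makes that concrete with stand-in predictors: `predict_next` and `direct_forecast` are hypothetical linear extrapolations used purely for illustration, not trained models.

```python
def predict_next(window):
    """Hypothetical one-step model: linear extrapolation from the last two values."""
    return 2 * window[-1] - window[-2]

def iterative_forecast(history, horizon, n):
    """Forecast `horizon` steps by feeding each prediction back into the window."""
    window = list(history[-n:])
    forecasts = []
    for _ in range(horizon):
        x_hat = predict_next(window)
        forecasts.append(x_hat)
        window = window[1:] + [x_hat]   # drop the oldest value, append the prediction
    return forecasts

def direct_forecast(history, horizon, n):
    """Hypothetical direct model: emit all `horizon` values from a single window."""
    window = history[-n:]
    slope = window[-1] - window[-2]
    return [window[-1] + slope * (k + 1) for k in range(horizon)]

history = [1.0, 2.0, 3.0, 4.0, 5.0]
print(iterative_forecast(history, horizon=3, n=4))  # [6.0, 7.0, 8.0]
print(direct_forecast(history, horizon=3, n=4))     # [6.0, 7.0, 8.0]
```

On this noise-free toy series both approaches agree. With a learned model, the iterative variant can accumulate errors because later predictions are conditioned on earlier ones.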

**Sequence-to-Sequence:**
Each time step predicts the next $h$ values, providing more training signal.
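
To make the sequence-to-sequence targets concrete, the following sketch builds them for a tiny synthetic batch. The array names and sizes are assumptions chosen for readability; the only requirement is that each series holds `n_steps + horizon` values so that every input step has `horizon` future values available.

```python
import numpy as np

n_steps, horizon = 5, 3
# Toy batch of 2 series, each with n_steps + horizon values (illustrative data)
series = np.arange(2 * (n_steps + horizon), dtype=np.float32).reshape(2, n_steps + horizon)

X = series[:, :n_steps, np.newaxis]   # inputs: [batch, n_steps, 1]
Y = np.empty((len(series), n_steps, horizon), dtype=np.float32)
for step_ahead in range(1, horizon + 1):
    # Column step_ahead - 1 holds the value `step_ahead` steps after each input step
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps]

print(X.shape, Y.shape)
print(Y[0, 0])    # targets attached to the first input step of series 0
print(Y[0, -1])   # targets attached to the last input step of series 0
```

The targets at the last input step are exactly the values a direct multi-output model would be trained on; the earlier steps supply the additional training signal.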

In [None]:
# Multi-step forecasting implementation

def prepare_multistep_data(series, n_steps, forecast_horizon):
    """
    Prepares data for multi-step forecasting.

    Args:
        series: Time series data
        n_steps: Number of input time steps
        forecast_horizon: Number of steps to forecast

    Returns:
        X, y for multi-step forecasting
    """
    X, y = [], []

    for i in range(len(series) - n_steps - forecast_horizon + 1):
        X.append(series[i:i + n_steps])
        y.append(series[i + n_steps:i + n_steps + forecast_horizon])

    return np.array(X), np.array(y)

def iterative_forecasting(model, initial_sequence, forecast_steps, window_size=50):
    """
    Implements iterative forecasting approach.

    Each prediction is appended to the sequence, and the most recent
    `window_size` values are fed back to the model for the next step.
    """
    sequence = initial_sequence.copy()
    predictions = []

    for step in range(forecast_steps):
        # Predict next value from the most recent window
        next_pred = model.predict(sequence[-window_size:].reshape(1, -1, 1), verbose=0)[0, 0]
        predictions.append(next_pred)

        # Add prediction to sequence for next iteration
        sequence = np.append(sequence, next_pred)

    return np.array(predictions)

def build_multistep_models(input_shape, forecast_horizon):
    """
    Builds models for different multi-step forecasting approaches.
    """
    models = {}

    # 1. Direct multi-output model
    models['Direct Multi-Output'] = keras.Sequential([
        keras.layers.LSTM(20, return_sequences=True, input_shape=input_shape),
        keras.layers.LSTM(20),
        keras.layers.Dense(forecast_horizon)  # Output all future steps at once
    ])

    # 2. Sequence-to-sequence model
    models['Sequence-to-Sequence'] = keras.Sequential([
        keras.layers.LSTM(20, return_sequences=True, input_shape=input_shape),
        keras.layers.LSTM(20, return_sequences=True),
        keras.layers.TimeDistributed(keras.layers.Dense(forecast_horizon))
    ])

    # Compile models
    for model in models.values():
        model.compile(optimizer='adam', loss='mse', metrics=['mae'])

    return models

def demonstrate_multistep_forecasting():
    """
    Demonstrates different approaches to multi-step forecasting.
    """
    print("\nMULTI-STEP FORECASTING DEMONSTRATION")
    print("=" * 50)

    # Parameters
    n_steps = 50
    forecast_horizon = 10

    # Generate extended series for multi-step forecasting
    series = generate_time_series(1000, n_steps + forecast_horizon)

    # Prepare data for different approaches

    # 1. For iterative approach (single-step model)
    X_single, y_single = series[:, :n_steps], series[:, n_steps]

    # 2. For direct multi-output approach
    X_multi, y_multi = prepare_multistep_data(series[0, :, 0], n_steps, forecast_horizon)

    # Expand to all series
    X_multi_all, y_multi_all = [], []
    for i in range(series.shape[0]):
        X_temp, y_temp = prepare_multistep_data(series[i, :, 0], n_steps, forecast_horizon)
        if len(X_temp) > 0:
            X_multi_all.extend(X_temp)
            y_multi_all.extend(y_temp)

    X_multi_all = np.array(X_multi_all).reshape(-1, n_steps, 1)
    y_multi_all = np.array(y_multi_all)

    # 3. For sequence-to-sequence approach
    # At every time step t the target is the next forecast_horizon values.
    # These are read from the full window (inputs followed by the true future),
    # so the last time step's target equals y_multi_all rather than zeros.
    full_windows = np.concatenate([X_multi_all[:, :, 0], y_multi_all], axis=1)
    Y_seq2seq = np.empty((len(X_multi_all), n_steps, forecast_horizon))
    for step_ahead in range(1, forecast_horizon + 1):
        Y_seq2seq[:, :, step_ahead - 1] = full_windows[:, step_ahead:step_ahead + n_steps]

    print(f"Data shapes:")
    print(f"Single-step: X={X_single.shape}, y={y_single.shape}")
    print(f"Multi-output: X={X_multi_all.shape}, y={y_multi_all.shape}")
    print(f"Seq2Seq: X={X_multi_all.shape}, Y={Y_seq2seq.shape}")

    # Split data
    train_size = int(0.7 * len(X_multi_all))
    val_size = int(0.2 * len(X_multi_all))

    X_train_multi = X_multi_all[:train_size]
    y_train_multi = y_multi_all[:train_size]
    Y_train_seq2seq = Y_seq2seq[:train_size]

    X_val_multi = X_multi_all[train_size:train_size + val_size]
    y_val_multi = y_multi_all[train_size:train_size + val_size]
    Y_val_seq2seq = Y_seq2seq[train_size:train_size + val_size]

    X_test_multi = X_multi_all[train_size + val_size:]
    y_test_multi = y_multi_all[train_size + val_size:]
    Y_test_seq2seq = Y_seq2seq[train_size + val_size:]

    # Train models
    print("\nTraining models...")

    # 1. Train single-step model for iterative approach
    single_model = keras.Sequential([
        keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
        keras.layers.LSTM(20),
        keras.layers.Dense(1)
    ])
    single_model.compile(optimizer='adam', loss='mse')

    # Use original single-step data
    train_idx = int(0.7 * len(X_single))
    val_idx = int(0.9 * len(X_single))

    single_model.fit(
        X_single[:train_idx], y_single[:train_idx],
        epochs=10, batch_size=32,
        validation_data=(X_single[train_idx:val_idx], y_single[train_idx:val_idx]),
        verbose=0
    )

    # 2. Train multi-output model
    multi_model = keras.Sequential([
        keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
        keras.layers.LSTM(20),
        keras.layers.Dense(forecast_horizon)
    ])
    multi_model.compile(optimizer='adam', loss='mse')

    multi_model.fit(
        X_train_multi, y_train_multi,
        epochs=10, batch_size=32,
        validation_data=(X_val_multi, y_val_multi),
        verbose=0
    )

    # 3. Custom metric for sequence-to-sequence (only last time step matters for evaluation)
    def last_time_step_mse(y_true, y_pred):
        return keras.metrics.mean_squared_error(y_true[:, -1], y_pred[:, -1])

    seq2seq_model = keras.Sequential([
        keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
        keras.layers.LSTM(20, return_sequences=True),
        keras.layers.TimeDistributed(keras.layers.Dense(forecast_horizon))
    ])
    seq2seq_model.compile(optimizer='adam', loss='mse', metrics=[last_time_step_mse])

    seq2seq_model.fit(
        X_train_multi, Y_train_seq2seq,
        epochs=10, batch_size=32,
        validation_data=(X_val_multi, Y_val_seq2seq),
        verbose=0
    )

    # Evaluate approaches
    print("\nEvaluating approaches...")

    # Test on a few examples
    test_examples = 5
    results = {}

    for i in range(test_examples):
        input_seq = X_test_multi[i, :, 0]
        true_future = y_test_multi[i]

        # 1. Iterative approach
        iter_pred = iterative_forecasting(single_model, input_seq, forecast_horizon)
        iter_mse = np.mean((true_future - iter_pred) ** 2)

        # 2. Direct multi-output
        multi_pred = multi_model.predict(X_test_multi[i:i+1], verbose=0)[0]
        multi_mse = np.mean((true_future - multi_pred) ** 2)

        # 3. Sequence-to-sequence (use last time step output)
        seq2seq_pred = seq2seq_model.predict(X_test_multi[i:i+1], verbose=0)[0, -1, :]
        seq2seq_mse = np.mean((true_future - seq2seq_pred) ** 2)

        results[f'Example_{i+1}'] = {
            'iterative_mse': iter_mse,
            'multi_output_mse': multi_mse,
            'seq2seq_mse': seq2seq_mse,
            'predictions': {
                'iterative': iter_pred,
                'multi_output': multi_pred,
                'seq2seq': seq2seq_pred,
                'true': true_future
            }
        }

    # Visualize results
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))

    # Plot examples
    for i in range(min(3, test_examples)):
        ax = axes[0, i]
        preds = results[f'Example_{i+1}']['predictions']

        # Input sequence
        input_seq = X_test_multi[i, :, 0]
        ax.plot(range(len(input_seq)), input_seq, 'b-', linewidth=2, label='Input')

        # Predictions
        future_range = range(len(input_seq), len(input_seq) + forecast_horizon)
        ax.plot(future_range, preds['true'], 'k-', linewidth=3, label='True', alpha=0.8)
        ax.plot(future_range, preds['iterative'], 'r--', linewidth=2, label='Iterative')
        ax.plot(future_range, preds['multi_output'], 'g--', linewidth=2, label='Multi-output')
        ax.plot(future_range, preds['seq2seq'], 'm--', linewidth=2, label='Seq2Seq')

        ax.axvline(x=len(input_seq)-0.5, color='gray', linestyle=':', alpha=0.7)
        ax.set_title(f'Example {i+1}')
        ax.set_xlabel('Time Step')
        ax.set_ylabel('Value')
        if i == 0:
            ax.legend()
        ax.grid(True, alpha=0.3)

    # Performance comparison
    ax = axes[1, 0]
    methods = ['Iterative', 'Multi-output', 'Seq2Seq']
    avg_mses = []

    for method_key in ['iterative_mse', 'multi_output_mse', 'seq2seq_mse']:
        mses = [results[f'Example_{i+1}'][method_key] for i in range(test_examples)]
        avg_mses.append(np.mean(mses))

    bars = ax.bar(methods, avg_mses, alpha=0.7, color=['red', 'green', 'magenta'])
    ax.set_title('Average MSE Comparison')
    ax.set_ylabel('MSE')

    # Add value labels
    for bar, value in zip(bars, avg_mses):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
                f'{value:.4f}', ha='center', va='bottom')

    # MSE distribution
    ax = axes[1, 1]
    all_mses = {'Iterative': [], 'Multi-output': [], 'Seq2Seq': []}

    for i in range(test_examples):
        all_mses['Iterative'].append(results[f'Example_{i+1}']['iterative_mse'])
        all_mses['Multi-output'].append(results[f'Example_{i+1}']['multi_output_mse'])
        all_mses['Seq2Seq'].append(results[f'Example_{i+1}']['seq2seq_mse'])

    ax.boxplot(list(all_mses.values()))
    ax.set_xticklabels(list(all_mses.keys()))  # one tick label per box
    ax.set_title('MSE Distribution')
    ax.set_ylabel('MSE')
    ax.grid(True, alpha=0.3)

    # Error by forecast step
    ax = axes[1, 2]
    step_errors = {'Iterative': [], 'Multi-output': [], 'Seq2Seq': []}

    for step in range(forecast_horizon):
        iter_errors = []
        multi_errors = []
        seq2seq_errors = []

        for i in range(test_examples):
            preds = results[f'Example_{i+1}']['predictions']
            true_val = preds['true'][step]

            iter_errors.append((preds['iterative'][step] - true_val) ** 2)
            multi_errors.append((preds['multi_output'][step] - true_val) ** 2)
            seq2seq_errors.append((preds['seq2seq'][step] - true_val) ** 2)

        step_errors['Iterative'].append(np.mean(iter_errors))
        step_errors['Multi-output'].append(np.mean(multi_errors))
        step_errors['Seq2Seq'].append(np.mean(seq2seq_errors))

    steps = range(1, forecast_horizon + 1)
    ax.plot(steps, step_errors['Iterative'], 'r-o', label='Iterative', linewidth=2)
    ax.plot(steps, step_errors['Multi-output'], 'g-s', label='Multi-output', linewidth=2)
    ax.plot(steps, step_errors['Seq2Seq'], 'm-^', label='Seq2Seq', linewidth=2)

    ax.set_title('MSE by Forecast Step')
    ax.set_xlabel('Forecast Step')
    ax.set_ylabel('MSE')
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Print summary
    print("\nSUMMARY:")
    print(f"Average MSE - Iterative: {avg_mses[0]:.6f}")
    print(f"Average MSE - Multi-output: {avg_mses[1]:.6f}")
    print(f"Average MSE - Seq2Seq: {avg_mses[2]:.6f}")

    return results

# Run multi-step forecasting demonstration
multistep_results = demonstrate_multistep_forecasting()

## 6. Handling Long Sequences

### Challenges with Long Sequences

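1. **Vanishing Gradients**: Gradients propagated back through many time steps can shrink exponentially, so early inputs barely influence learning
2. **Limited Memory**: Simple RNNs gradually overwrite early information in long sequences
3. **Computational Complexity**: Backpropagation through time keeps activations for every step, so memory and computation grow with sequence length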

### Mathematical Analysis of Gradient Flow

For an RNN with hidden state $\mathbf{h}^{(t)} = f(\mathbf{W}\mathbf{h}^{(t-1)} + \mathbf{U}\mathbf{x}^{(t)})$:

The gradient of the loss with respect to the shared recurrent weights $\mathbf{W}$ sums contributions from every pair of time steps $k \le t$:
$$\frac{\partial L}{\partial \mathbf{W}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial L^{(t)}}{\partial \mathbf{h}^{(t)}} \frac{\partial \mathbf{h}^{(t)}}{\partial \mathbf{h}^{(k)}} \frac{\partial \mathbf{h}^{(k)}}{\partial \mathbf{W}}$$

The problematic term is the Jacobian between distant hidden states. Writing $\mathbf{a}^{(i)} = \mathbf{W}\mathbf{h}^{(i-1)} + \mathbf{U}\mathbf{x}^{(i)}$ for the pre-activation at step $i$:
$$\frac{\partial \mathbf{h}^{(t)}}{\partial \mathbf{h}^{(k)}} = \prod_{i=k+1}^{t} \frac{\partial \mathbf{h}^{(i)}}{\partial \mathbf{h}^{(i-1)}} = \prod_{i=k+1}^{t} \text{diag}(f'(\mathbf{a}^{(i)}))\,\mathbf{W}$$

As a rule of thumb, if the largest singular value of each factor $\text{diag}(f'(\mathbf{a}^{(i)}))\,\mathbf{W}$ is:
- **< 1**: Gradients vanish exponentially as $t - k$ grows
- **> 1**: Gradients can explode exponentially
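
A small numerical experiment makes the exponential behaviour visible. This is a simplified sketch, not an RNN: every Jacobian factor is taken to be the same matrix `scale * Q` with `Q` orthogonal, and the activation derivative is set to 1 (an assumption; for tanh it is at most 1, which only shrinks the product further). The function name and defaults are illustrative.

```python
import numpy as np

def jacobian_product_norm(scale, n_units=10, n_steps=50, seed=0):
    """Spectral norm of a product of `n_steps` identical factors `scale * Q`.

    Q is a random orthogonal matrix, so every factor has all singular
    values equal to `scale` and the product has norm scale ** n_steps.
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(n_units, n_units)))  # orthogonal matrix
    W = scale * Q
    product = np.eye(n_units)
    for _ in range(n_steps):
        product = W @ product   # accumulate one more step of backpropagation
    return np.linalg.norm(product, ord=2)

for scale in (0.9, 1.0, 1.1):
    print(f"scale={scale}: norm after 50 steps = {jacobian_product_norm(scale):.4g}")
```

A factor only 10% below or above 1 changes the gradient magnitude by several orders of magnitude over 50 steps, which is why sequences of 100 or more steps are hard for simple RNNs.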

### Solutions

1. **LSTM/GRU**: Gating mechanisms control information flow
2. **Gradient Clipping**: Prevent exploding gradients
3. **Layer Normalization**: Normalize activations
4. **Residual Connections**: Skip connections for gradient flow
5. **1D Convolutions**: Process sequences more efficiently
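
Of these solutions, gradient clipping is the simplest to illustrate in isolation. The sketch below rescales a list of gradient arrays so that their joint L2 norm does not exceed a threshold. The helper name `clip_by_global_norm` and the toy gradients are assumptions for illustration. Note also that, as far as the Keras optimizer options go, `clipnorm` clips each gradient tensor separately, while `global_clipnorm` corresponds to the joint-norm variant sketched here.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient arrays so their joint L2 norm is at most `max_norm`."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))   # never scale gradients up
    return [g * scale for g in grads], global_norm

# Toy gradients for two parameter tensors (illustrative values)
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Because every array is multiplied by the same factor, the direction of the update is preserved and only its length is capped.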

In [None]:
# Advanced techniques for handling long sequences

class LayerNormSimpleRNNCell(keras.layers.Layer):
    """
    Custom RNN cell with Layer Normalization as described in the book.
    This demonstrates how to create custom cells with advanced techniques.
    """
    def __init__(self, units, activation='tanh', **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units, activation=None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)

    def call(self, inputs, states):
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        return norm_outputs, [norm_outputs]

def demonstrate_gradient_techniques():
    """
    Demonstrates gradient clipping and layer normalization.
    """
    print("\nDEMONSTRATING GRADIENT TECHNIQUES")
    print("=" * 50)

    # Create longer sequences to demonstrate the problem
    n_steps = 100  # Longer sequences
    series_long = generate_time_series(1000, n_steps + 1)

    X_long = series_long[:, :n_steps]
    y_long = series_long[:, -1]

    # Split data
    train_size = int(0.7 * len(X_long))
    val_size = int(0.2 * len(X_long))

    X_train_long = X_long[:train_size]
    y_train_long = y_long[:train_size]
    X_val_long = X_long[train_size:train_size + val_size]
    y_val_long = y_long[train_size:train_size + val_size]
    X_test_long = X_long[train_size + val_size:]
    y_test_long = y_long[train_size + val_size:]

    print(f"Long sequence data shapes:")
    print(f"X_train: {X_train_long.shape}, y_train: {y_train_long.shape}")

    # Build models with different techniques
    models = {}

    # 1. Standard LSTM
    models['Standard LSTM'] = keras.Sequential([
        keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
        keras.layers.LSTM(20),
        keras.layers.Dense(1)
    ])

    # 2. LSTM with gradient clipping
    models['LSTM + Grad Clip'] = keras.Sequential([
        keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
        keras.layers.LSTM(20),
        keras.layers.Dense(1)
    ])

    # 3. LSTM with dropout
    models['LSTM + Dropout'] = keras.Sequential([
        keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1],
                         dropout=0.2, recurrent_dropout=0.2),
        keras.layers.LSTM(20, dropout=0.2, recurrent_dropout=0.2),
        keras.layers.Dense(1)
    ])

    # 4. Custom cell with Layer Normalization
    models['Layer Norm RNN'] = keras.Sequential([
        keras.layers.RNN(LayerNormSimpleRNNCell(20), return_sequences=True, input_shape=[None, 1]),
        keras.layers.RNN(LayerNormSimpleRNNCell(20)),
        keras.layers.Dense(1)
    ])

    # Compile models with different optimizers
    optimizers = {
        'Standard LSTM': keras.optimizers.Adam(learning_rate=0.001),
        'LSTM + Grad Clip': keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0),  # Gradient clipping
        'LSTM + Dropout': keras.optimizers.Adam(learning_rate=0.001),
        'Layer Norm RNN': keras.optimizers.Adam(learning_rate=0.001)
    }

    for name, model in models.items():
        model.compile(optimizer=optimizers[name], loss='mse', metrics=['mae'])

    # Custom training loop to track gradients
    def train_with_gradient_monitoring(model, name, X_train, y_train, X_val, y_val, epochs=15):
        """
        Trains model while monitoring gradient norms.
        """
        history = {'loss': [], 'val_loss': [], 'gradient_norm': []}

        # Prepare datasets
        train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(32)

        for epoch in range(epochs):
            epoch_losses = []
            epoch_grad_norms = []

            for batch_x, batch_y in train_dataset:
                with tf.GradientTape() as tape:
                    predictions = model(batch_x, training=True)
                    loss = keras.losses.mse(batch_y, predictions)
                    loss = tf.reduce_mean(loss)

                # Compute gradients
                gradients = tape.gradient(loss, model.trainable_variables)

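                # Calculate the global gradient norm (measured on the raw gradients,
                # i.e. before any clipping the optimizer applies in apply_gradients)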
                grad_norm = tf.sqrt(sum([tf.reduce_sum(tf.square(g)) for g in gradients if g is not None]))
                epoch_grad_norms.append(float(grad_norm))

                # Apply gradients
                model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
                epoch_losses.append(float(loss))

            # Validation loss
            val_predictions = model(X_val, training=False)
            val_loss = tf.reduce_mean(keras.losses.mse(y_val, val_predictions))

            history['loss'].append(np.mean(epoch_losses))
            history['val_loss'].append(float(val_loss))
            history['gradient_norm'].append(np.mean(epoch_grad_norms))

            if epoch % 5 == 0:
                print(f"{name} - Epoch {epoch}: Loss={history['loss'][-1]:.6f}, Val Loss={history['val_loss'][-1]:.6f}, Grad Norm={history['gradient_norm'][-1]:.4f}")

        return history

    # Train models and collect results
    results = {}
    histories = {}

    for name, model in models.items():
        print(f"\nTraining {name}...")
        history = train_with_gradient_monitoring(
            model, name, X_train_long, y_train_long, X_val_long, y_val_long
        )
        histories[name] = history

        # Evaluate final performance
        test_pred = model(X_test_long, training=False)
        test_mse = float(tf.reduce_mean(keras.losses.mse(y_test_long, test_pred)))

        results[name] = {
            'test_mse': test_mse,
            'final_grad_norm': history['gradient_norm'][-1],
            'gradient_stability': np.std(history['gradient_norm'])
        }

    # Visualize results
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))

    # Training loss curves
    ax = axes[0, 0]
    for name, history in histories.items():
        ax.plot(history['loss'], label=f'{name} (train)', alpha=0.7)
        ax.plot(history['val_loss'], label=f'{name} (val)', linestyle='--', alpha=0.7)
    ax.set_title('Training Curves')
    ax.set_xlabel('Epoch')
    ax.set_ylabel('MSE Loss')
    ax.set_yscale('log')
    ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    ax.grid(True, alpha=0.3)

    # Gradient norm evolution
    ax = axes[0, 1]
    for name, history in histories.items():
        ax.plot(history['gradient_norm'], label=name, linewidth=2)
    ax.set_title('Gradient Norm Evolution')
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Gradient Norm')
    ax.set_yscale('log')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Final performance comparison
    ax = axes[0, 2]
    names = list(results.keys())
    test_mses = [results[name]['test_mse'] for name in names]

    bars = ax.bar(range(len(names)), test_mses, alpha=0.7)
    ax.set_title('Test MSE Comparison')
    ax.set_xlabel('Model')
    ax.set_ylabel('Test MSE')
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names, rotation=45, ha='right')

    # Add value labels
    for bar, value in zip(bars, test_mses):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.0001,
                f'{value:.4f}', ha='center', va='bottom', fontsize=9)

    # Gradient stability analysis
    ax = axes[1, 0]
    grad_stabilities = [results[name]['gradient_stability'] for name in names]

    bars = ax.bar(range(len(names)), grad_stabilities, alpha=0.7, color='orange')
    ax.set_title('Gradient Stability (Lower = Better)')
    ax.set_xlabel('Model')
    ax.set_ylabel('Gradient Norm Std Dev')
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names, rotation=45, ha='right')

    # Performance vs Gradient Stability
    ax = axes[1, 1]
    ax.scatter(grad_stabilities, test_mses, s=100, alpha=0.7)
    for i, name in enumerate(names):
        ax.annotate(name, (grad_stabilities[i], test_mses[i]),
                   xytext=(5, 5), textcoords='offset points', fontsize=9)
    ax.set_xlabel('Gradient Stability (Std Dev)')
    ax.set_ylabel('Test MSE')
    ax.set_title('Performance vs Gradient Stability')
    ax.grid(True, alpha=0.3)

    # Convergence speed analysis
    ax = axes[1, 2]
    for name, history in histories.items():
        # Find epoch where validation loss stabilizes (minimum reached)
        min_val_loss_epoch = np.argmin(history['val_loss'])
        ax.scatter(min_val_loss_epoch, min(history['val_loss']),
                  label=name, s=100, alpha=0.7)

    ax.set_xlabel('Epochs to Best Val Loss')
    ax.set_ylabel('Best Val Loss')
    ax.set_title('Convergence Speed vs Final Performance')
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Print detailed analysis
    print("\nDETAILED ANALYSIS:")
    print("=" * 60)
    print(f"{'Model':<20} {'Test MSE':<12} {'Grad Stability':<15} {'Final Grad Norm':<15}")
    print("-" * 60)

    for name in names:
        print(f"{name:<20} {results[name]['test_mse']:<12.6f} {results[name]['gradient_stability']:<15.4f} {results[name]['final_grad_norm']:<15.4f}")

    return results, histories

# Run gradient techniques demonstration
gradient_results, gradient_histories = demonstrate_gradient_techniques()

## 7. 1D Convolutional Layers for Sequence Processing

### Theoretical Foundation

1D Convolutional layers can effectively process sequences by:
1. **Local Pattern Detection**: Each filter detects specific patterns
2. **Computational Efficiency**: All time steps are processed in parallel, unlike the step-by-step computation of RNNs
3. **Long-Range Dependencies**: Dilated convolutions extend the receptive field

### Mathematical Formulation

For a 1D convolution with input $\mathbf{x} \in \mathbb{R}^{T \times d}$ and filter $\mathbf{w} \in \mathbb{R}^{k \times d}$:

$$y_t = \sum_{i=0}^{k-1} \sum_{j=0}^{d-1} w_{i,j} \cdot x_{t+i-\lfloor k/2 \rfloor, j} + b$$

With stride $s$ and dilation $r$ (indexing from the start of the window, i.e. without the centering offset $\lfloor k/2 \rfloor$):
$$y_t = \sum_{i=0}^{k-1} \sum_{j=0}^{d-1} w_{i,j} \cdot x_{t \cdot s + i \cdot r, j} + b$$
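The strided, dilated formula can be traced by hand with a small pure-Python sketch. The function name `conv1d_valid` and the toy inputs are illustrative assumptions, not part of the notebook. Only positions where the whole window fits inside the input are computed (no padding).

```python
def conv1d_valid(x, w, b=0.0, stride=1, dilation=1):
    """1D convolution following y_t = sum_i sum_j w[i][j] * x[t*s + i*r][j] + b.

    x: list of T time steps, each a list of d features
    w: list of k taps, each a list of d weights
    """
    T, k, d = len(x), len(w), len(w[0])
    span = (k - 1) * dilation + 1            # input steps covered by one output
    n_out = (T - span) // stride + 1 if T >= span else 0
    y = []
    for t in range(n_out):
        acc = b
        for i in range(k):
            for j in range(d):
                acc += w[i][j] * x[t * stride + i * dilation][j]
        y.append(acc)
    return y

x = [[float(v)] for v in range(8)]   # T=8, d=1: values 0, 1, ..., 7
w = [[1.0], [1.0]]                   # k=2: each output sums two taps

print(conv1d_valid(x, w))                        # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
print(conv1d_valid(x, w, dilation=2))            # [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
print(conv1d_valid(x, w, stride=2, dilation=2))  # [2.0, 6.0, 10.0]
```

Dilation spreads the two taps further apart without adding weights, while stride reduces the number of outputs.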

### Receptive Field Analysis

For a stack of $L$ layers with kernel size $k$:
- **Without dilation**: Receptive field = $1 + L(k-1)$
- **With dilation** $r_l$ at layer $l$ (stride 1): Receptive field = $1 + \sum_{l=1}^{L} (k-1) \cdot r_l$
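As a sanity check on receptive field sizes, the sketch below compares a closed-form count against a brute-force enumeration of the input offsets that can reach one output. It assumes stride-1 layers, where each layer widens the field by $(k-1)$ times its dilation rate. The helper names are hypothetical.

```python
def receptive_field(layers):
    """Closed form for stride-1 layers: 1 + sum of (kernel_size - 1) * dilation."""
    return 1 + sum((k - 1) * r for k, r in layers)

def receptive_field_brute_force(layers):
    """Collect every input offset (steps back from the output) that can reach one output."""
    offsets = {0}
    for k, r in layers:
        offsets = {o + i * r for o in offsets for i in range(k)}
    return max(offsets) + 1

plain = [(3, 1)] * 4                         # L=4, k=3, no dilation
dilated = [(2, 1), (2, 2), (2, 4), (2, 8)]   # k=2, dilation doubles per layer

for name, layers in [("plain", plain), ("dilated", dilated)]:
    print(name, receptive_field(layers), receptive_field_brute_force(layers))
# plain 9 9
# dilated 16 16
```

Four dilated layers with two taps each see more context than four undilated layers with three taps each.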

### Advantages over RNNs

1. **Parallelization**: All outputs computed simultaneously
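2. **Stable Gradients**: Gradients do not pass through a long chain of time steps, which avoids the vanishing/exploding gradients of backpropagation through time (very deep stacks can still benefit from residual connections or normalization)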
3. **Flexible Receptive Fields**: Controlled via kernel size and dilation
4. **Hierarchical Features**: Lower layers capture local, higher layers capture global patterns

In [None]:
# 1D Convolutional Networks for Sequence Processing

def visualize_1d_convolution():
    """
    Visualizes how 1D convolution works on sequences.
    """
    fig, axes = plt.subplots(3, 2, figsize=(15, 12))

    # Create sample sequence
    sequence_length = 20
    sequence = np.sin(np.linspace(0, 4*np.pi, sequence_length)) + 0.1 * np.random.randn(sequence_length)

    # Different kernel sizes
    kernel_sizes = [3, 5, 7]

    for i, kernel_size in enumerate(kernel_sizes):
        # Create different types of kernels
        kernels = {
            'Edge Detector': np.array([-1, 0, 1] + [0] * (kernel_size - 3)),
            'Smoother': np.ones(kernel_size) / kernel_size
        }

        for j, (kernel_name, kernel) in enumerate(kernels.items()):
            ax = axes[i, j]

            # Apply convolution (simplified): np.convolve flips the kernel,
            # whereas Conv1D layers compute a cross-correlation (no flip)
            output = np.convolve(sequence, kernel, mode='same')

            # Plot
            ax.plot(sequence, 'b-', linewidth=2, label='Input', alpha=0.7)
            ax.plot(output, 'r-', linewidth=2, label='Output')

            # Show kernel
            kernel_pos = len(sequence) // 4
            kernel_x = np.arange(kernel_pos, kernel_pos + len(kernel))
            ax.stem(kernel_x, kernel * 2 + np.mean(sequence),
                   basefmt=' ', linefmt='g-', markerfmt='go', label='Kernel')

            ax.set_title(f'{kernel_name} (kernel size {kernel_size})')
            ax.set_xlabel('Time Step')
            ax.set_ylabel('Value')
            ax.legend()
            ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

def build_1d_cnn_models():
    """
    Builds various 1D CNN architectures for sequence processing.
    """
    input_shape = [None, 1]
    models = {}

    # 1. Simple 1D CNN
    models['Simple 1D CNN'] = keras.Sequential([
        keras.layers.Conv1D(filters=32, kernel_size=3, activation='relu', input_shape=input_shape),
        keras.layers.Conv1D(filters=32, kernel_size=3, activation='relu'),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dense(50, activation='relu'),
        keras.layers.Dense(1)
    ])

    # 2. Deep 1D CNN with pooling
    models['Deep 1D CNN'] = keras.Sequential([
        keras.layers.Conv1D(filters=32, kernel_size=5, activation='relu', input_shape=input_shape),
        keras.layers.Conv1D(filters=32, kernel_size=5, activation='relu'),
        keras.layers.MaxPooling1D(pool_size=2),
        keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu'),
        keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu'),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dense(50, activation='relu'),
        keras.layers.Dense(1)
    ])

    # 3. 1D CNN + RNN hybrid
    models['CNN-RNN Hybrid'] = keras.Sequential([
        keras.layers.Conv1D(filters=32, kernel_size=5, activation='relu', input_shape=input_shape),
        keras.layers.Conv1D(filters=32, kernel_size=5, activation='relu'),
        keras.layers.MaxPooling1D(pool_size=2),
        keras.layers.LSTM(50, return_sequences=True),
        keras.layers.LSTM(50),
        keras.layers.Dense(1)
    ])

    # 4. Dilated 1D CNN (simplified WaveNet-style)
    models['Dilated CNN'] = keras.Sequential([
        keras.layers.Conv1D(filters=32, kernel_size=2, dilation_rate=1,
                           activation='relu', padding='causal', input_shape=input_shape),
        keras.layers.Conv1D(filters=32, kernel_size=2, dilation_rate=2,
                           activation='relu', padding='causal'),
        keras.layers.Conv1D(filters=32, kernel_size=2, dilation_rate=4,
                           activation='relu', padding='causal'),
        keras.layers.Conv1D(filters=32, kernel_size=2, dilation_rate=8,
                           activation='relu', padding='causal'),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dense(50, activation='relu'),
        keras.layers.Dense(1)
    ])

    # Compile all models
    for model in models.values():
        model.compile(optimizer='adam', loss='mse', metrics=['mae'])

    return models

def analyze_receptive_fields():
    """
    Analyzes and visualizes receptive fields for different architectures.
    """
    print("\nRECEPTIVE FIELD ANALYSIS")
    print("=" * 40)

    # Calculate receptive fields for different architectures
    architectures = {
        'Single Conv (k=3)': [(3, 1, 1)],  # (kernel_size, dilation, stride)
        'Two Conv (k=3)': [(3, 1, 1), (3, 1, 1)],
        'Conv + Pool': [(3, 1, 1), (3, 1, 2)],  # second layer downsamples with stride 2; a stride only widens layers that follow it
        'Dilated Conv': [(2, 1, 1), (2, 2, 1), (2, 4, 1), (2, 8, 1)]
    }

    def calculate_receptive_field(layers):
        """Calculate receptive field for a sequence of layers."""
        rf = 1
        jump = 1

        for kernel_size, dilation, stride in layers:
            rf += (kernel_size - 1) * jump * dilation
            jump *= stride

        return rf

    rf_results = {}
    for name, layers in architectures.items():
        rf = calculate_receptive_field(layers)
        rf_results[name] = rf
        print(f"{name:<20}: Receptive Field = {rf}")

    # Visualize receptive fields
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    axes = axes.flatten()

    sequence_length = 50
    center_pos = sequence_length // 2

    for i, (name, rf) in enumerate(rf_results.items()):
        ax = axes[i]

        # Create sequence
        sequence = np.zeros(sequence_length)
        sequence[center_pos] = 1  # Impulse at center

        # Show receptive field
        rf_start = max(0, center_pos - rf // 2)
        rf_end = min(sequence_length, center_pos + rf // 2 + 1)

        ax.plot(sequence, 'k-', linewidth=2, label='Input')
        ax.axvspan(rf_start, rf_end, alpha=0.3, color='red', label=f'Receptive Field (size {rf})')
        ax.axvline(center_pos, color='blue', linestyle='--', label='Output position')

        ax.set_title(f'{name}')
        ax.set_xlabel('Input Position')
        ax.set_ylabel('Value')
        ax.legend()
        ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return rf_results

def compare_cnn_rnn_performance():
    """
    Compares 1D CNN vs RNN performance on time series forecasting.
    """
    print("\nCOMPARING 1D CNN vs RNN PERFORMANCE")
    print("=" * 50)

    # Build comparison models
    cnn_models = build_1d_cnn_models()

    # Add RNN models for comparison
    rnn_models = {
        'Simple LSTM': keras.Sequential([
            keras.layers.LSTM(32, return_sequences=True, input_shape=[None, 1]),
            keras.layers.LSTM(32),
            keras.layers.Dense(1)
        ]),
        'Deep LSTM': keras.Sequential([
            keras.layers.LSTM(32, return_sequences=True, input_shape=[None, 1]),
            keras.layers.LSTM(32, return_sequences=True),
            keras.layers.LSTM(32),
            keras.layers.Dense(1)
        ])
    }

    for model in rnn_models.values():
        model.compile(optimizer='adam', loss='mse', metrics=['mae'])

    # Combine all models
    all_models = {**cnn_models, **rnn_models}

    # Training and evaluation
    results = {}
    training_times = {}

    epochs = 10
    batch_size = 32

    for name, model in all_models.items():
        print(f"\nTraining {name}...")
        print(f"Parameters: {model.count_params():,}")

        # Measure training time
        import time
        start_time = time.time()

        history = model.fit(
            X_train, y_train,
            epochs=epochs,
            batch_size=batch_size,
            validation_data=(X_valid, y_valid),
            verbose=0
        )

        training_time = time.time() - start_time
        training_times[name] = training_time

        # Evaluate
        test_loss = model.evaluate(X_test, y_test, verbose=0)[0]

        results[name] = {
            'test_mse': test_loss,
            'params': model.count_params(),
            'training_time': training_time,
            'final_val_loss': history.history['val_loss'][-1],
            'history': history
        }

        print(f"Test MSE: {test_loss:.6f}, Training time: {training_time:.1f}s")

    # Detailed comparison visualization
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))

    # Separate CNN and RNN results
    cnn_names = list(cnn_models.keys())
    rnn_names = list(rnn_models.keys())

    # Performance comparison
    ax = axes[0, 0]
    cnn_mses = [results[name]['test_mse'] for name in cnn_names]
    rnn_mses = [results[name]['test_mse'] for name in rnn_names]

    x_cnn = np.arange(len(cnn_names))
    x_rnn = np.arange(len(rnn_names)) + len(cnn_names) + 0.5

    ax.bar(x_cnn, cnn_mses, alpha=0.7, label='CNN Models', color='blue')
    ax.bar(x_rnn, rnn_mses, alpha=0.7, label='RNN Models', color='red')

    all_names = cnn_names + rnn_names
    ax.set_xticks(list(x_cnn) + list(x_rnn))
    ax.set_xticklabels(all_names, rotation=45, ha='right')
    ax.set_title('Test MSE Comparison')
    ax.set_ylabel('MSE')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Training time comparison
    ax = axes[0, 1]
    cnn_times = [results[name]['training_time'] for name in cnn_names]
    rnn_times = [results[name]['training_time'] for name in rnn_names]

    ax.bar(x_cnn, cnn_times, alpha=0.7, label='CNN Models', color='blue')
    ax.bar(x_rnn, rnn_times, alpha=0.7, label='RNN Models', color='red')

    ax.set_xticks(list(x_cnn) + list(x_rnn))
    ax.set_xticklabels(all_names, rotation=45, ha='right')
    ax.set_title('Training Time Comparison')
    ax.set_ylabel('Time (seconds)')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Parameter count comparison
    ax = axes[0, 2]
    cnn_params = [results[name]['params'] for name in cnn_names]
    rnn_params = [results[name]['params'] for name in rnn_names]

    ax.bar(x_cnn, cnn_params, alpha=0.7, label='CNN Models', color='blue')
    ax.bar(x_rnn, rnn_params, alpha=0.7, label='RNN Models', color='red')

    ax.set_xticks(list(x_cnn) + list(x_rnn))
    ax.set_xticklabels(all_names, rotation=45, ha='right')
    ax.set_title('Parameter Count Comparison')
    ax.set_ylabel('Number of Parameters')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Efficiency plot (performance vs time)
    ax = axes[1, 0]
    all_times = [results[name]['training_time'] for name in all_names]
    all_mses = [results[name]['test_mse'] for name in all_names]
    colors = ['blue'] * len(cnn_names) + ['red'] * len(rnn_names)

    scatter = ax.scatter(all_times, all_mses, c=colors, s=100, alpha=0.7)
    for i, name in enumerate(all_names):
        ax.annotate(name, (all_times[i], all_mses[i]),
                   xytext=(5, 5), textcoords='offset points', fontsize=9)

    ax.set_xlabel('Training Time (seconds)')
    ax.set_ylabel('Test MSE')
    ax.set_title('Efficiency: Performance vs Training Time')
    ax.grid(True, alpha=0.3)

    # Training curves for best models
    ax = axes[1, 1]
    best_cnn = min(cnn_names, key=lambda x: results[x]['test_mse'])
    best_rnn = min(rnn_names, key=lambda x: results[x]['test_mse'])

    ax.plot(results[best_cnn]['history'].history['val_loss'],
           label=f'{best_cnn} (best CNN)', linewidth=2, color='blue')
    ax.plot(results[best_rnn]['history'].history['val_loss'],
           label=f'{best_rnn} (best RNN)', linewidth=2, color='red')

    ax.set_title('Best Model Training Curves')
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Validation Loss')
    ax.legend()
    ax.set_yscale('log')
    ax.grid(True, alpha=0.3)

    # Model complexity vs performance
    ax = axes[1, 2]
    all_params = [results[name]['params'] for name in all_names]

    scatter = ax.scatter(all_params, all_mses, c=colors, s=100, alpha=0.7)
    for i, name in enumerate(all_names):
        ax.annotate(name, (all_params[i], all_mses[i]),
                   xytext=(5, 5), textcoords='offset points', fontsize=9)

    ax.set_xlabel('Number of Parameters')
    ax.set_ylabel('Test MSE')
    ax.set_title('Model Complexity vs Performance')
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Summary analysis
    print("\nSUMMARY ANALYSIS:")
    print("=" * 60)
    print(f"{'Model':<20} {'Type':<8} {'Test MSE':<12} {'Time (s)':<10} {'Parameters':<12} {'Efficiency':<10}")
    print("-" * 60)

    for name in all_names:
        model_type = 'CNN' if name in cnn_names else 'RNN'
        # Cost score: training time x test MSE (lower is better)
        efficiency = results[name]['training_time'] * results[name]['test_mse']
        print(f"{name:<20} {model_type:<8} {results[name]['test_mse']:<12.6f} {results[name]['training_time']:<10.1f} {results[name]['params']:<12,} {efficiency:<10.2f}")

    return results

# Run 1D CNN demonstrations
print("1D CONVOLUTION VISUALIZATION:")
visualize_1d_convolution()

print("\nRECEPTIVE FIELD ANALYSIS:")
rf_analysis = analyze_receptive_fields()

print("\nCNN vs RNN PERFORMANCE COMPARISON:")
cnn_rnn_results = compare_cnn_rnn_performance()

## 8. WaveNet Architecture

### Theoretical Foundation

WaveNet, introduced by DeepMind in 2016, stacks **dilated causal convolutions** so that the receptive field grows exponentially with depth while the number of parameters grows only linearly.

### Key Innovations

1. **Dilated Convolutions**: Exponentially increasing dilation rates
2. **Causal Convolutions**: No future information leakage
3. **Residual and Skip Connections**: Residual connections ease gradient flow through the stack; skip connections feed every block's output to the final layers
4. **Gated Activation Units**: Similar to LSTM gates

### Mathematical Formulation

**Dilated Convolution:**
$$y_t = \sum_{i=0}^{k-1} w_i \cdot x_{t - i \cdot d}$$

Where $d$ is the dilation rate.
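The dilated causal formula can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the function name is hypothetical, and inputs before the start of the sequence are treated as zero, which mirrors the left zero-padding implied by `padding='causal'`.

```python
def causal_dilated_conv(x, w, dilation=1):
    """y[t] = sum_i w[i] * x[t - i*dilation], with x taken as 0 before index 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w_i in enumerate(w):
            src = t - i * dilation
            if src >= 0:             # implicit left zero-padding
                acc += w_i * x[src]
        y.append(acc)
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [1.0, 10.0]   # w[0] weights the current step, w[1] the step `dilation` back

y = causal_dilated_conv(x, w, dilation=2)
print(y)  # [1.0, 2.0, 13.0, 24.0, 35.0, 46.0]

# Causality check: changing the last input must not change earlier outputs
x_changed = x[:-1] + [100.0]
y_changed = causal_dilated_conv(x_changed, w, dilation=2)
print(y_changed[:-1] == y[:-1])  # True
```

The output has the same length as the input, and each output depends only on the current and past time steps.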

**Gated Activation Unit:**
$$z = \tanh(W_{f,k} * x) \odot \sigma(W_{g,k} * x)$$

Where $W_{f,k}$ and $W_{g,k}$ are the filter and gate convolution kernels of layer $k$ (here the subscript $k$ is the layer index, not the kernel size), $*$ denotes convolution, and $\odot$ is element-wise multiplication.
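To see how the gate modulates the signal, here is a small sketch that applies the gated activation to precomputed convolution outputs. The helper names and the toy values are assumptions for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_activation(filter_out, gate_out):
    """Element-wise tanh(filter) * sigmoid(gate) over two equal-length sequences."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]

filter_out = [1.0, 1.0, 1.0]    # same candidate value at every step
gate_out = [-20.0, 0.0, 20.0]   # gate closed, half open, fully open

z = gated_activation(filter_out, gate_out)
print([round(v, 4) for v in z])
```

A strongly negative gate input suppresses the output, a gate input of zero passes half of `tanh(filter)`, and a strongly positive gate input passes it almost unchanged.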

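**Residual Connection:**
$$y = x + W_{r} * z$$

Where $W_{r}$ is a $1 \times 1$ convolution applied to the gated output $z$ before it is added back to the block input $x$.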

### Receptive Field Growth

With $L$ layers, kernel size $k$, and dilation rates $d_l = 2^l$ for $l = 0, \dots, L-1$:
$$\text{Receptive Field} = 1 + \sum_{l=0}^{L-1} 2^l (k-1) = 1 + (k-1)(2^L - 1)$$

For $k=2$ and $L=10$: RF = $1 + (2-1)(2^{10} - 1) = 1024$
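The closed form can be checked against the explicit layer-by-layer sum. The sketch below uses hypothetical helper names and also covers repeated stacks, assuming each stack restarts the dilation pattern at 1.

```python
def wavenet_receptive_field(num_layers, kernel_size=2, num_stacks=1):
    """Sum the (k - 1) * 2**l contribution of every layer, repeated for each stack."""
    per_stack = sum((kernel_size - 1) * 2 ** l for l in range(num_layers))
    return 1 + num_stacks * per_stack

def closed_form(num_layers, kernel_size=2):
    """Single-stack closed form: 1 + (k - 1) * (2**L - 1)."""
    return 1 + (kernel_size - 1) * (2 ** num_layers - 1)

print(wavenet_receptive_field(10), closed_form(10))   # 1024 1024
print(wavenet_receptive_field(4, num_stacks=2))       # 31
```

Repeating a stack adds its contribution linearly, so two stacks of dilations (1, 2, 4, 8) cover 31 time steps rather than 256.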

In [None]:
# WaveNet Implementation

def build_wavenet_block(inputs, filters, kernel_size, dilation_rate, name):
    """
    Builds a WaveNet block with dilated convolution, gated activation, and residual connection.

    Args:
        inputs: Input tensor
        filters: Number of filters
        kernel_size: Convolution kernel size
        dilation_rate: Dilation rate for convolution
        name: Block name prefix

    Returns:
        Output tensor and skip connection
    """
    # Dilated causal convolution for filter
    conv_filter = keras.layers.Conv1D(
        filters=filters,
        kernel_size=kernel_size,
        dilation_rate=dilation_rate,
        padding='causal',
        name=f'{name}_conv_filter'
    )(inputs)

    # Dilated causal convolution for gate
    conv_gate = keras.layers.Conv1D(
        filters=filters,
        kernel_size=kernel_size,
        dilation_rate=dilation_rate,
        padding='causal',
        name=f'{name}_conv_gate'
    )(inputs)

    # Gated activation unit
    tanh_out = keras.layers.Activation('tanh', name=f'{name}_tanh')(conv_filter)
    sigm_out = keras.layers.Activation('sigmoid', name=f'{name}_sigmoid')(conv_gate)
    gated = keras.layers.Multiply(name=f'{name}_gated')([tanh_out, sigm_out])

    # 1x1 convolution for residual connection
    residual = keras.layers.Conv1D(
        filters=filters,
        kernel_size=1,
        name=f'{name}_residual_conv'
    )(gated)

    # Skip connection (for final output)
    skip = keras.layers.Conv1D(
        filters=filters,
        kernel_size=1,
        name=f'{name}_skip_conv'
    )(gated)

    # Residual connection
    # Note: Need to match dimensions for residual connection
    if inputs.shape[-1] != filters:
        inputs_proj = keras.layers.Conv1D(
            filters=filters,
            kernel_size=1,
            name=f'{name}_input_proj'
        )(inputs)
    else:
        inputs_proj = inputs

    output = keras.layers.Add(name=f'{name}_residual_add')([inputs_proj, residual])

    return output, skip

def build_wavenet_model(input_shape, num_blocks=8, num_stacks=3, filters=32, kernel_size=2):
    """
    Builds a complete WaveNet model.

    Args:
        input_shape: Shape of input sequences
        num_blocks: Number of blocks per stack
        num_stacks: Number of stacks
        filters: Number of filters per layer
        kernel_size: Convolution kernel size

    Returns:
        Compiled Keras model
    """
    inputs = keras.layers.Input(shape=input_shape, name='wavenet_input')

    # Initial causal convolution
    x = keras.layers.Conv1D(
        filters=filters,
        kernel_size=kernel_size,
        padding='causal',
        name='initial_conv'
    )(inputs)

    # Collect skip connections
    skip_connections = []

    # Build stacks of dilated convolution blocks
    for stack in range(num_stacks):
        for block in range(num_blocks):
            dilation_rate = 2 ** block
            block_name = f'stack_{stack}_block_{block}_dil_{dilation_rate}'

            x, skip = build_wavenet_block(
                x, filters, kernel_size, dilation_rate, block_name
            )
            skip_connections.append(skip)

    # Sum all skip connections
    skip_sum = keras.layers.Add(name='skip_sum')(skip_connections)

    # Final processing
    x = keras.layers.Activation('relu', name='final_relu1')(skip_sum)
    x = keras.layers.Conv1D(filters=filters, kernel_size=1, name='final_conv1')(x)
    x = keras.layers.Activation('relu', name='final_relu2')(x)
    x = keras.layers.Conv1D(filters=1, kernel_size=1, name='final_conv2')(x)

    # Global pooling for sequence-to-vector output
    output = keras.layers.GlobalAveragePooling1D(name='global_pool')(x)

    model = keras.Model(inputs=inputs, outputs=output, name='WaveNet')
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])

    return model

def build_simplified_wavenet():
    """
    Builds a simplified WaveNet adapted from the book: a stack of dilated causal
    convolutions, reduced here to a single output by global average pooling.
    """
    model = keras.Sequential(name='Simplified_WaveNet')
    model.add(keras.layers.InputLayer(input_shape=[None, 1]))

    # Add dilated convolution layers as in the book
    for rate in (1, 2, 4, 8) * 2:  # Repeat the pattern twice
        model.add(keras.layers.Conv1D(
            filters=20,
            kernel_size=2,
            padding="causal",
            activation="relu",
            dilation_rate=rate
        ))

    # Output layer
    model.add(keras.layers.Conv1D(filters=1, kernel_size=1))

    # Global pooling to get single output
    model.add(keras.layers.GlobalAveragePooling1D())

    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model

def analyze_wavenet_receptive_field():
    """
    Analyzes and visualizes WaveNet's receptive field growth.
    """
    print("\nWAVENET RECEPTIVE FIELD ANALYSIS")
    print("=" * 50)

    # Calculate receptive field for different WaveNet configurations
    configs = {
        'Book Example (1,2,4,8)×2': (4, 2, 2),  # (blocks, kernel_size, stacks)
        'Small WaveNet (1,2,4)×3': (3, 2, 3),
        'Standard WaveNet (1-512)×3': (10, 2, 3),
        'Large WaveNet (1-1024)×4': (11, 2, 4)
    }

    def calculate_wavenet_rf(num_blocks, kernel_size, num_stacks):
        """Calculate WaveNet receptive field."""
        # Each stack has dilation rates: 1, 2, 4, ..., 2^(num_blocks-1)
        rf = 1  # Initial point
        for stack in range(num_stacks):
            for block in range(num_blocks):
                dilation = 2 ** block
                rf += (kernel_size - 1) * dilation
        return rf

    rf_results = {}
    for name, (blocks, kernel_size, stacks) in configs.items():
        rf = calculate_wavenet_rf(blocks, kernel_size, stacks)
        rf_results[name] = rf
        print(f"{name:<25}: RF = {rf:>6} (blocks={blocks}, stacks={stacks})")

    # Visualize receptive field growth
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # 1. Receptive field by layer for book example
    ax = axes[0, 0]
    layers = []
    rfs = [1]  # Start with RF of 1

    # Book example: (1,2,4,8) × 2
    current_rf = 1
    layer_count = 0
    for stack in range(2):
        for dilation in [1, 2, 4, 8]:
            current_rf += (2 - 1) * dilation  # kernel_size = 2
            layer_count += 1
            layers.append(layer_count)
            rfs.append(current_rf)

    ax.plot(layers, rfs[1:], 'o-', linewidth=2, markersize=8)
    ax.set_title('Receptive Field Growth (Book Example)')
    ax.set_xlabel('Layer Number')
    ax.set_ylabel('Receptive Field Size')
    ax.grid(True, alpha=0.3)

    # Add dilation rate annotations
    dilations = [1, 2, 4, 8] * 2
    for i, (layer, rf, dil) in enumerate(zip(layers, rfs[1:], dilations)):
        ax.annotate(f'd={dil}', (layer, rf), xytext=(0, 10),
                   textcoords='offset points', ha='center', fontsize=9)

    # 2. Comparison of different configurations
    ax = axes[0, 1]
    names = list(rf_results.keys())
    rfs_values = list(rf_results.values())

    bars = ax.bar(range(len(names)), rfs_values, alpha=0.7, color='skyblue')
    ax.set_title('Receptive Field Comparison')
    ax.set_xlabel('Configuration')
    ax.set_ylabel('Receptive Field Size')
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names, rotation=45, ha='right')
    ax.set_yscale('log')

    # Add value labels
    for bar, value in zip(bars, rfs_values):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height(),
                f'{value}', ha='center', va='bottom', fontsize=9)

    # 3. Exponential growth visualization
    ax = axes[1, 0]

    # Show how dilation creates exponential growth
    dilations = [2**i for i in range(8)]
    cumulative_rf = np.cumsum([1] + dilations)

    ax.semilogy(range(len(cumulative_rf)), cumulative_rf, 'ro-',
               linewidth=2, markersize=8, label='WaveNet (exponential)')

    # Compare with linear growth
    linear_rf = 1 + 2 * np.arange(len(cumulative_rf))  # 1 + L*(k-1) with kernel_size = 3
    ax.semilogy(range(len(linear_rf)), linear_rf, 'bs-',
               linewidth=2, markersize=6, label='Standard CNN (linear)')

    ax.set_title('Receptive Field Growth: Exponential vs Linear')
    ax.set_xlabel('Layer Number')
    ax.set_ylabel('Receptive Field Size (log scale)')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # 4. Memory efficiency comparison
    ax = axes[1, 1]

    # Rough weight counts (biases ignored) for reaching RF ~ 1024 with 20 channels per layer;
    # a conv layer has about kernel_size * in_channels * out_channels weights
    target_rf = 1024

    approaches = {
        'WaveNet\n(dilation)': 10 * 2 * 20 * 20,  # 10 layers, kernel=2, dilations 1..512
        'Standard CNN\n(large kernels)': 3 * 342 * 20 * 20,  # 3 layers, kernel=342: RF = 1 + 3*341
        'Deep CNN\n(small kernels)': 512 * 3 * 20 * 20,  # 512 layers, kernel=3: RF = 1 + 512*2
        'LSTM\n(sequential)': 4 * (20 * 20 + 20 * 1)  # 20 units, 1 input feature; independent of sequence length
    }

    approach_names = list(approaches.keys())
    param_counts = list(approaches.values())

    bars = ax.bar(range(len(approach_names)), param_counts, alpha=0.7,
                 color=['green', 'red', 'orange', 'blue'])
    ax.set_title('Parameter Efficiency for RF=1024')
    ax.set_xlabel('Approach')
    ax.set_ylabel('Number of Parameters')
    ax.set_xticks(range(len(approach_names)))
    ax.set_xticklabels(approach_names)
    ax.set_yscale('log')

    # Add value labels
    for bar, value in zip(bars, param_counts):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height(),
                f'{value:,}', ha='center', va='bottom', fontsize=9, rotation=90)

    plt.tight_layout()
    plt.show()

    return rf_results

def demonstrate_wavenet_performance():
    """
    Demonstrates WaveNet performance on time series forecasting.
    """
    print("\nWAVENET PERFORMANCE DEMONSTRATION")
    print("=" * 50)

    # Build different WaveNet variants
    models = {
        'Simplified WaveNet (Book)': build_simplified_wavenet(),
        'Small WaveNet': build_wavenet_model(
            input_shape=[None, 1],
            num_blocks=4,
            num_stacks=2,
            filters=16
        ),
        'Medium WaveNet': build_wavenet_model(
            input_shape=[None, 1],
            num_blocks=6,
            num_stacks=2,
            filters=32
        )
    }

    # Add baseline models for comparison
    models['LSTM Baseline'] = keras.Sequential([
        keras.layers.LSTM(32, return_sequences=True, input_shape=[None, 1]),
        keras.layers.LSTM(32),
        keras.layers.Dense(1)
    ])

    models['CNN Baseline'] = keras.Sequential([
        keras.layers.Conv1D(32, 5, activation='relu', input_shape=[None, 1]),
        keras.layers.Conv1D(32, 5, activation='relu'),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dense(50, activation='relu'),
        keras.layers.Dense(1)
    ])

    # Compile baseline models
    models['LSTM Baseline'].compile(optimizer='adam', loss='mse', metrics=['mae'])
    models['CNN Baseline'].compile(optimizer='adam', loss='mse', metrics=['mae'])

    # Train and evaluate models
    results = {}
    histories = {}

    epochs = 15
    batch_size = 32

    for name, model in models.items():
        print(f"\nTraining {name}...")
        print(f"Parameters: {model.count_params():,}")

        # Display model architecture for WaveNet models
        if 'WaveNet' in name:
            print(f"Model summary for {name}:")
            model.summary()

        import time
        start_time = time.time()

        history = model.fit(
            X_train, y_train,
            epochs=epochs,
            batch_size=batch_size,
            validation_data=(X_valid, y_valid),
            verbose=0
        )

        training_time = time.time() - start_time

        # Evaluate
        test_loss = model.evaluate(X_test, y_test, verbose=0)[0]

        results[name] = {
            'test_mse': test_loss,
            'params': model.count_params(),
            'training_time': training_time,
            'best_val_loss': min(history.history['val_loss'])
        }

        histories[name] = history

        print(f"Test MSE: {test_loss:.6f}, Training time: {training_time:.1f}s")

    # Comprehensive visualization
    fig, axes = plt.subplots(3, 2, figsize=(15, 15))

    # 1. Performance comparison
    ax = axes[0, 0]
    names = list(results.keys())
    test_mses = [results[name]['test_mse'] for name in names]
    colors = ['green' if 'WaveNet' in name else 'blue' if 'LSTM' in name else 'red' for name in names]

    bars = ax.bar(range(len(names)), test_mses, alpha=0.7, color=colors)
    ax.set_title('Test MSE Comparison')
    ax.set_xlabel('Model')
    ax.set_ylabel('Test MSE')
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names, rotation=45, ha='right')

    # Add value labels
    for bar, value in zip(bars, test_mses):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.0001,
                f'{value:.4f}', ha='center', va='bottom', fontsize=9)

    # 2. Training curves
    ax = axes[0, 1]
    for name, history in histories.items():
        color = 'green' if 'WaveNet' in name else 'blue' if 'LSTM' in name else 'red'
        ax.plot(history.history['val_loss'], label=name, linewidth=2, color=color, alpha=0.8)

    ax.set_title('Validation Loss Curves')
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Validation Loss')
    ax.legend()
    ax.set_yscale('log')
    ax.grid(True, alpha=0.3)

    # 3. Parameter efficiency
    ax = axes[1, 0]
    param_counts = [results[name]['params'] for name in names]

    scatter = ax.scatter(param_counts, test_mses, c=colors, s=100, alpha=0.7)
    for i, name in enumerate(names):
        ax.annotate(name, (param_counts[i], test_mses[i]),
                   xytext=(5, 5), textcoords='offset points', fontsize=9)

    ax.set_xlabel('Number of Parameters')
    ax.set_ylabel('Test MSE')
    ax.set_title('Parameter Efficiency')
    ax.grid(True, alpha=0.3)

    # 4. Training time efficiency
    ax = axes[1, 1]
    training_times = [results[name]['training_time'] for name in names]

    scatter = ax.scatter(training_times, test_mses, c=colors, s=100, alpha=0.7)
    for i, name in enumerate(names):
        ax.annotate(name, (training_times[i], test_mses[i]),
                   xytext=(5, 5), textcoords='offset points', fontsize=9)

    ax.set_xlabel('Training Time (seconds)')
    ax.set_ylabel('Test MSE')
    ax.set_title('Training Time Efficiency')
    ax.grid(True, alpha=0.3)

    # 5. Prediction examples
    ax = axes[2, 0]
    best_model_name = min(names, key=lambda x: results[x]['test_mse'])
    best_model = models[best_model_name]

    # Show predictions for a few test examples
    n_examples = 3
    for i in range(n_examples):
        input_seq = X_test[i, :, 0]
        true_val = y_test[i]
        pred_val = best_model.predict(X_test[i:i+1], verbose=0)[0]

        # Plot input sequence
        ax.plot(range(len(input_seq)), input_seq, 'b-', alpha=0.6, linewidth=1)

        # Plot true and predicted next values
        ax.scatter(len(input_seq), true_val, color='red', s=50, alpha=0.8,
                  label='True' if i == 0 else '')
        ax.scatter(len(input_seq), pred_val, color='green', s=50, marker='x', alpha=0.8,
                  label='Predicted' if i == 0 else '')

    ax.set_title(f'Predictions: {best_model_name}')
    ax.set_xlabel('Time Step')
    ax.set_ylabel('Value')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # 6. Architecture comparison summary
    ax = axes[2, 1]
    ax.axis('off')

    # Create a summary table
    table_data = []
    for name in names:
        table_data.append([
            name,
            f"{results[name]['test_mse']:.6f}",
            f"{results[name]['params']:,}",
            f"{results[name]['training_time']:.1f}s"
        ])

    table = ax.table(
        cellText=table_data,
        colLabels=['Model', 'Test MSE', 'Parameters', 'Training Time'],
        cellLoc='center',
        loc='center'
    )
    table.auto_set_font_size(False)
    table.set_fontsize(9)
    table.scale(1.2, 1.5)

    ax.set_title('Performance Summary', fontsize=14, fontweight='bold', pad=20)

    plt.tight_layout()
    plt.show()

    return results, models

# Run WaveNet demonstrations
print("WAVENET RECEPTIVE FIELD ANALYSIS:")
wavenet_rf_results = analyze_wavenet_receptive_field()

print("\nWAVENET PERFORMANCE DEMONSTRATION:")
wavenet_results, wavenet_models = demonstrate_wavenet_performance()

## 9. Comprehensive Exercises and Solutions

This section provides detailed solutions to all exercises from Chapter 15, with theoretical explanations and practical implementations.

### Exercise Framework

Each exercise solution includes:
1. **Theoretical Analysis**: Mathematical foundations and concepts
2. **Implementation**: Working code with detailed explanations
3. **Experimental Results**: Performance analysis and visualizations
4. **Extensions**: Additional considerations and improvements

---

### Exercise 1: Applications for Different RNN Architectures

**Question**: Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN, and a vector-to-sequence RNN?

#### Theoretical Analysis

Different RNN architectures are suited for different types of sequential problems based on their input-output mapping:

**Sequence-to-Sequence RNNs:**
- **Mathematical Framework**: $f: \mathbb{R}^{T_{in} \times d_{in}} \rightarrow \mathbb{R}^{T_{out} \times d_{out}}$
- **Characteristics**: Both input and output are sequences, potentially of different lengths

**Sequence-to-Vector RNNs:**
- **Mathematical Framework**: $f: \mathbb{R}^{T \times d_{in}} \rightarrow \mathbb{R}^{d_{out}}$
- **Characteristics**: Input is a sequence, output is a fixed-size vector

**Vector-to-Sequence RNNs:**
- **Mathematical Framework**: $f: \mathbb{R}^{d_{in}} \rightarrow \mathbb{R}^{T \times d_{out}}$
- **Characteristics**: Input is a fixed-size vector, output is a sequence
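
The three mappings above can be sketched with a tiny NumPy RNN. This is an illustrative sketch rather than the Keras implementation: the helper `simple_rnn`, the random weights, and the sizes `T`, `d_in`, and `d_out` are all assumptions chosen for the example, and the vector-to-sequence case assumes the input vector is repeated at every time step.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b):
    """Run a tanh RNN over inputs of shape [T, d_in]; return all outputs, shape [T, d_out]."""
    h = np.zeros(W_h.shape[0])          # initial hidden state
    outputs = []
    for x_t in inputs:                  # one step per time step
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        outputs.append(h)
    return np.stack(outputs)

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(42)
T, d_in, d_out = 6, 3, 4
W_x = rng.normal(size=(d_in, d_out))
W_h = rng.normal(size=(d_out, d_out))
b = np.zeros(d_out)

sequence = rng.normal(size=(T, d_in))   # a sequence input
vector = rng.normal(size=d_in)          # a single fixed-size input

# Sequence-to-sequence: keep the output at every time step
seq_to_seq = simple_rnn(sequence, W_x, W_h, b)

# Sequence-to-vector: keep only the output at the last time step
seq_to_vec = simple_rnn(sequence, W_x, W_h, b)[-1]

# Vector-to-sequence: feed the same vector at every time step
vec_to_seq = simple_rnn(np.tile(vector, (T, 1)), W_x, W_h, b)

print(seq_to_seq.shape, seq_to_vec.shape, vec_to_seq.shape)  # (6, 4) (4,) (6, 4)
```

The only difference between the first two cases is which outputs are kept, which is what `return_sequences` controls in Keras.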

In [None]:
# Exercise 1: RNN Architecture Applications

def demonstrate_rnn_applications():
    """
    Demonstrates different RNN architectures with practical examples.
    """
    print("EXERCISE 1: RNN ARCHITECTURE APPLICATIONS")
    print("=" * 60)

    applications = {
        'Sequence-to-Sequence': {
            'applications': [
                'Machine Translation (English → French)',
                'Speech Recognition (Audio → Text)',
                'Time Series Forecasting (Past → Future)',
                'Video Captioning (Video Frames → Description)',
                'Code Generation (Comments → Code)',
                'Chatbot Responses (Question → Answer)',
                'Music Generation (Melody → Harmony)',
                'DNA Sequence Annotation (Bases → Per-Base Labels)'
            ],
            'mathematical_form': 'f: R^(T_in × d_in) → R^(T_out × d_out)',
            'key_features': [
                'Both input and output are sequences',
                'Can handle variable-length inputs and outputs',
                'Often uses encoder-decoder architecture',
                'Attention mechanisms commonly used'
            ]
        },

        'Sequence-to-Vector': {
            'applications': [
                'Sentiment Analysis (Text → Sentiment Score)',
                'Document Classification (Document → Category)',
                'Intent Recognition (Speech → Intent)',
                'Anomaly Detection (Time Series → Anomaly Score)',
                'Feature Extraction (Sequence → Embedding)',
                'Health Monitoring (Sensor Data → Health Status)',
                'Stock Market Prediction (Price History → Buy/Sell)',
                'Protein Function Prediction (Sequence → Function)'
            ],
            'mathematical_form': 'f: R^(T × d_in) → R^d_out',
            'key_features': [
                'Processes entire sequence to single output',
                'Global pooling or final hidden state used',
                'Good for classification/regression tasks',
                'Summarizes sequential information'
            ]
        },

        'Vector-to-Sequence': {
            'applications': [
                'Image Captioning (Image → Caption)',
                'Music Generation (Style Vector → Melody)',
                'Text Generation (Topic → Article)',
                'Data Augmentation (Seed → Synthetic Sequence)',
                'Story Generation (Theme → Story)',
                'Code Generation (Specification → Implementation)',
                'Weather Forecasting (Current State → Forecast)',
                'Molecule Generation (Target Properties → Molecular String)'
            ],
            'mathematical_form': 'f: R^d_in → R^(T × d_out)',
            'key_features': [
                'Single input generates sequence output',
                'Input often repeated at each time step',
                'Useful for generation tasks',
                'Often combined with attention mechanisms'
            ]
        }
    }

    # Create comprehensive visualization
    fig, axes = plt.subplots(3, 2, figsize=(16, 12))

    for idx, (arch_type, info) in enumerate(applications.items()):
        # Architecture diagram
        ax1 = axes[idx, 0]
        ax1.set_title(f'{arch_type} Architecture', fontweight='bold', fontsize=12)

        if arch_type == 'Sequence-to-Sequence':
            # Draw seq2seq diagram
            # Input sequence
            for i in range(4):
                ax1.add_patch(plt.Rectangle((i*0.2, 0.3), 0.15, 0.2,
                                          facecolor='lightblue', edgecolor='black'))
                ax1.text(i*0.2 + 0.075, 0.4, f'x{i+1}', ha='center', va='center', fontsize=8)
                # Arrow to RNN
                ax1.arrow(i*0.2 + 0.075, 0.5, 0, 0.1, head_width=0.02, head_length=0.02, fc='black')
                # RNN cell
                ax1.add_patch(plt.Rectangle((i*0.2, 0.6), 0.15, 0.15,
                                          facecolor='yellow', edgecolor='black'))
                ax1.text(i*0.2 + 0.075, 0.675, 'RNN', ha='center', va='center', fontsize=7)
                # Arrow to output
                ax1.arrow(i*0.2 + 0.075, 0.75, 0, 0.1, head_width=0.02, head_length=0.02, fc='black')
                # Output
                ax1.add_patch(plt.Rectangle((i*0.2, 0.85), 0.15, 0.1,
                                          facecolor='lightgreen', edgecolor='black'))
                ax1.text(i*0.2 + 0.075, 0.9, f'y{i+1}', ha='center', va='center', fontsize=8)

            # Horizontal connections
            for i in range(3):
                ax1.arrow((i+1)*0.2 - 0.025, 0.675, 0.05, 0, head_width=0.02, head_length=0.02, fc='red')

        elif arch_type == 'Sequence-to-Vector':
            # Draw seq2vec diagram
            for i in range(4):
                ax1.add_patch(plt.Rectangle((i*0.2, 0.3), 0.15, 0.2,
                                          facecolor='lightblue', edgecolor='black'))
                ax1.text(i*0.2 + 0.075, 0.4, f'x{i+1}', ha='center', va='center', fontsize=8)
                ax1.arrow(i*0.2 + 0.075, 0.5, 0, 0.1, head_width=0.02, head_length=0.02, fc='black')
                ax1.add_patch(plt.Rectangle((i*0.2, 0.6), 0.15, 0.15,
                                          facecolor='yellow', edgecolor='black'))
                ax1.text(i*0.2 + 0.075, 0.675, 'RNN', ha='center', va='center', fontsize=7)
                if i < 3:
                    ax1.arrow((i+1)*0.2 - 0.025, 0.675, 0.05, 0, head_width=0.02, head_length=0.02, fc='red')

            # Final output
            ax1.arrow(0.6, 0.75, 0, 0.1, head_width=0.02, head_length=0.02, fc='black')
            ax1.add_patch(plt.Rectangle((0.5, 0.85), 0.2, 0.1,
                                      facecolor='lightgreen', edgecolor='black'))
            ax1.text(0.6, 0.9, 'y', ha='center', va='center', fontsize=10)

        else:  # Vector-to-Sequence
            # Draw vec2seq diagram
            # Single input
            ax1.add_patch(plt.Rectangle((0.4, 0.2), 0.2, 0.15,
                                      facecolor='lightblue', edgecolor='black'))
            ax1.text(0.5, 0.275, 'x', ha='center', va='center', fontsize=10)

            # Multiple outputs
            for i in range(4):
                ax1.arrow(0.5, 0.35, (i-1.5)*0.15, 0.25, head_width=0.02, head_length=0.02, fc='black')
                ax1.add_patch(plt.Rectangle((i*0.2, 0.6), 0.15, 0.15,
                                          facecolor='yellow', edgecolor='black'))
                ax1.text(i*0.2 + 0.075, 0.675, 'RNN', ha='center', va='center', fontsize=7)
                ax1.arrow(i*0.2 + 0.075, 0.75, 0, 0.1, head_width=0.02, head_length=0.02, fc='black')
                ax1.add_patch(plt.Rectangle((i*0.2, 0.85), 0.15, 0.1,
                                          facecolor='lightgreen', edgecolor='black'))
                ax1.text(i*0.2 + 0.075, 0.9, f'y{i+1}', ha='center', va='center', fontsize=8)
                if i < 3:
                    ax1.arrow((i+1)*0.2 - 0.025, 0.675, 0.05, 0, head_width=0.02, head_length=0.02, fc='red')

        ax1.set_xlim(-0.1, 1.0)
        ax1.set_ylim(0, 1)
        ax1.axis('off')

        # Applications list
        ax2 = axes[idx, 1]
        ax2.set_title(f'{arch_type} Applications', fontweight='bold', fontsize=12)

        # Create text summary
        text_content = f"Mathematical Form:\n{info['mathematical_form']}\n\n"
        text_content += "Key Features:\n"
        for feature in info['key_features']:
            text_content += f"• {feature}\n"

        text_content += "\nExample Applications:\n"
        for i, app in enumerate(info['applications'][:6]):  # Show first 6
            text_content += f"{i+1}. {app}\n"

        ax2.text(0.05, 0.95, text_content, transform=ax2.transAxes,
                fontsize=9, verticalalignment='top', fontfamily='monospace')
        ax2.axis('off')

    plt.tight_layout()
    plt.show()

    # Print detailed analysis
    print("\nDETAILED ANALYSIS:")
    print("=" * 50)

    for arch_type, info in applications.items():
        print(f"\n{arch_type.upper()}:")
        print(f"Mathematical Form: {info['mathematical_form']}")
        print("\nKey Characteristics:")
        for feature in info['key_features']:
            print(f"  • {feature}")
        print("\nPractical Applications:")
        for i, app in enumerate(info['applications'], 1):
            print(f"  {i:2d}. {app}")

    return applications

# Run Exercise 1
exercise_1_results = demonstrate_rnn_applications()

### Exercise 2: RNN Input and Output Dimensions

**Question**: How many dimensions must the inputs of an RNN layer have? What does each dimension represent? What about its outputs?

#### Theoretical Analysis

RNN layers in Keras expect 3D input tensors, with one axis each for the batch, the time steps, and the features observed at each step:

**Input Tensor Dimensions**: `[batch_size, time_steps, features]`
- **batch_size**: Number of sequences processed simultaneously
- **time_steps**: Length of each sequence (variable-length sequences are usually padded to a common length within a batch)
- **features**: Number of features at each time step

**Mathematical Representation**:
$$\mathbf{X} \in \mathbb{R}^{B \times T \times D}$$

Where:
- $B$ = batch size
- $T$ = time steps
- $D$ = input feature dimension

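**Output Tensor Dimensions** (`units` is the number of neurons in the layer):
- **`return_sequences=False` (the default)**: `[batch_size, units]`, the output at the last time step only
- **`return_sequences=True`**: `[batch_size, time_steps, units]`, one output per time step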
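
As a quick check of these rules, the shape arithmetic can be written as a small pure-Python function. `rnn_output_shape` is a hypothetical helper written for this guide, not a Keras API. Its `bidirectional` flag assumes the forward and backward outputs are concatenated, which is the default merge mode of `keras.layers.Bidirectional`.

```python
def rnn_output_shape(input_shape, units, return_sequences=False, bidirectional=False):
    """Return the output shape of an RNN layer for a [batch, time, features] input.

    Hypothetical helper for illustration; it mirrors the Keras shape rules
    but does not build or call any layer.
    """
    if len(input_shape) != 3:
        raise ValueError(f"Expected [batch, time, features], got {input_shape}")
    batch_size, time_steps, _ = input_shape
    # A bidirectional wrapper concatenates forward and backward outputs
    out_units = 2 * units if bidirectional else units
    if return_sequences:
        return (batch_size, time_steps, out_units)
    return (batch_size, out_units)

print(rnn_output_shape((32, 50, 1), units=64))                         # (32, 64)
print(rnn_output_shape((32, 50, 1), units=64, return_sequences=True))  # (32, 50, 64)
print(rnn_output_shape((32, 50, 1), units=32, return_sequences=True,
                       bidirectional=True))                            # (32, 50, 64)
```

The number of input features does not appear in the output shape. It affects only the size of the layer's input weight matrix.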

In [None]:
# Exercise 2: RNN Input/Output Dimensions

def analyze_rnn_dimensions():
    """
    Comprehensive analysis of RNN input and output dimensions.
    """
    print("EXERCISE 2: RNN INPUT/OUTPUT DIMENSIONS")
    print("=" * 60)

    # Create sample data with different configurations
    configs = {
        'Univariate Time Series': {
            'batch_size': 32,
            'time_steps': 50,
            'features': 1,
            'description': 'Single feature per time step (e.g., stock price)'
        },
        'Multivariate Time Series': {
            'batch_size': 16,
            'time_steps': 100,
            'features': 5,
            'description': 'Multiple features per time step (e.g., weather data)'
        },
        'Text Sequences (Word Embeddings)': {
            'batch_size': 64,
            'time_steps': 20,
            'features': 300,
            'description': 'Word vectors in sentences'
        },
        'Audio Features': {
            'batch_size': 8,
            'time_steps': 1000,
            'features': 13,
            'description': 'MFCC features for speech recognition'
        }
    }

    # Demonstrate different RNN configurations
    print("\nRNN LAYER CONFIGURATIONS:")
    print("=" * 40)

    results = {}

    for config_name, config in configs.items():
        print(f"\n{config_name}:")
        print(f"Description: {config['description']}")

        batch_size = config['batch_size']
        time_steps = config['time_steps']
        features = config['features']

        # Create sample input data
        X = np.random.randn(batch_size, time_steps, features)
        print(f"Input shape: {X.shape}")
        print(f"  • Batch size: {batch_size} (number of sequences)")
        print(f"  • Time steps: {time_steps} (sequence length)")
        print(f"  • Features: {features} (features per time step)")

        # Test different RNN configurations
        rnn_configs = {
            'Basic RNN (last output only)': {
                'return_sequences': False,
                'units': 64
            },
            'RNN with all outputs': {
                'return_sequences': True,
                'units': 64
            },
            'Bidirectional RNN': {
                'return_sequences': True,
                'units': 32,
                'bidirectional': True
            }
        }

        config_results = {}

        for rnn_name, rnn_config in rnn_configs.items():
            print(f"\n  {rnn_name}:")

            # Build model
            if rnn_config.get('bidirectional', False):
                layer = keras.layers.Bidirectional(
                    keras.layers.LSTM(rnn_config['units'],
                                     return_sequences=rnn_config['return_sequences']),
                    input_shape=(time_steps, features)
                )
            else:
                layer = keras.layers.LSTM(
                    rnn_config['units'],
                    return_sequences=rnn_config['return_sequences'],
                    input_shape=(time_steps, features)
                )

            model = keras.Sequential([layer])

            # Get output shape
            output = model(X)
            print(f"    Output shape: {output.shape}")

            # Explain dimensions
            if len(output.shape) == 2:
                print(f"    • Batch size: {output.shape[0]}")
                print(f"    • Units: {output.shape[1]}")
                print(f"    • Returns: Final hidden state only")
            else:
                print(f"    • Batch size: {output.shape[0]}")
                print(f"    • Time steps: {output.shape[1]}")
                print(f"    • Units: {output.shape[2]}")
                print(f"    • Returns: Hidden state at each time step")

            config_results[rnn_name] = {
                'input_shape': X.shape,
                'output_shape': output.shape,
                'parameters': model.count_params()
            }

        results[config_name] = {
            'config': config,
            'rnn_results': config_results
        }

    # Create comprehensive visualization
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))

    # 1. Input dimension analysis
    ax = axes[0, 0]
    config_names = list(configs.keys())
    batch_sizes = [configs[name]['batch_size'] for name in config_names]
    time_steps = [configs[name]['time_steps'] for name in config_names]
    features = [configs[name]['features'] for name in config_names]

    x_pos = np.arange(len(config_names))
    width = 0.25

    ax.bar(x_pos - width, batch_sizes, width, label='Batch Size', alpha=0.7)
    ax.bar(x_pos, time_steps, width, label='Time Steps', alpha=0.7)
    ax.bar(x_pos + width, features, width, label='Features', alpha=0.7)

    ax.set_title('Input Dimension Analysis')
    ax.set_xlabel('Configuration')
    ax.set_ylabel('Dimension Size')
    ax.set_xticks(x_pos)
    ax.set_xticklabels(config_names, rotation=45, ha='right')
    ax.legend()
    ax.set_yscale('log')
    ax.grid(True, alpha=0.3)

    # 2. Memory usage analysis
    ax = axes[0, 1]
    memory_usage = []
    for config in configs.values():
        # Calculate memory usage (in MB, assuming float32)
        memory = config['batch_size'] * config['time_steps'] * config['features'] * 4 / (1024**2)
        memory_usage.append(memory)

    bars = ax.bar(config_names, memory_usage, alpha=0.7, color='orange')
    ax.set_title('Input Tensor Memory Usage')
    ax.set_xlabel('Configuration')
    ax.set_ylabel('Memory (MB)')
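    # Fix the tick positions before relabelling to avoid a Matplotlib FixedFormatter warning
    ax.set_xticks(range(len(config_names)))
    ax.set_xticklabels(config_names, rotation=45, ha='right')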

    # Add value labels
    for bar, value in zip(bars, memory_usage):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                f'{value:.2f}', ha='center', va='bottom')

    # 3. Output shape comparison
    ax = axes[1, 0]

    # Create a visual representation of different output shapes
    output_types = ['Last Output Only', 'All Outputs', 'Bidirectional']
    y_positions = [2, 1, 0]

    for i, output_type in enumerate(output_types):
        y = y_positions[i]

        if output_type == 'Last Output Only':
            # Draw 2D tensor
            rect = plt.Rectangle((1, y), 2, 0.5, facecolor='lightblue',
                               edgecolor='black', alpha=0.7)
            ax.add_patch(rect)
            ax.text(2, y+0.25, '[batch, units]', ha='center', va='center', fontsize=10)

        elif output_type == 'All Outputs':
            # Draw 3D tensor representation
            for j in range(3):
                rect = plt.Rectangle((1+j*0.1, y+j*0.1), 2, 0.5,
                                   facecolor='lightgreen', edgecolor='black', alpha=0.7)
                ax.add_patch(rect)
            ax.text(2.1, y+0.35, '[batch, time, units]', ha='center', va='center', fontsize=10)

        else:  # Bidirectional
            # Draw wider 3D tensor
            for j in range(3):
                rect = plt.Rectangle((1+j*0.1, y+j*0.1), 2.5, 0.5,
                                   facecolor='lightcoral', edgecolor='black', alpha=0.7)
                ax.add_patch(rect)
            ax.text(2.35, y+0.35, '[batch, time, 2×units]', ha='center', va='center', fontsize=10)

        ax.text(0.5, y+0.25, output_type, ha='right', va='center', fontsize=11, fontweight='bold')

    ax.set_xlim(0, 4.5)
    ax.set_ylim(-0.5, 3)
    ax.set_title('RNN Output Shape Types')
    ax.axis('off')

    # 4. Dimension explanation table
    ax = axes[1, 1]
    ax.axis('off')

    table_data = [
        ['Dimension', 'Symbol', 'Description', 'Example'],
        ['Batch Size', 'B', 'Number of sequences processed together', '32'],
        ['Time Steps', 'T', 'Length of each sequence', '50'],
        ['Input Features', 'D_in', 'Features per time step', '1-300'],
        ['RNN Units', 'D_out', 'Size of hidden state', '64-512'],
        ['Output (no seq)', '[B, D_out]', 'Final hidden state only', '[32, 64]'],
        ['Output (with seq)', '[B, T, D_out]', 'All hidden states', '[32, 50, 64]']
    ]

    table = ax.table(
        cellText=table_data[1:],
        colLabels=table_data[0],
        cellLoc='center',
        loc='center'
    )
    table.auto_set_font_size(False)
    table.set_fontsize(9)
    table.scale(1.2, 2)

    # Style the header
    for i in range(len(table_data[0])):
        table[(0, i)].set_facecolor('#40466e')
        table[(0, i)].set_text_props(weight='bold', color='white')

    ax.set_title('RNN Dimension Reference', fontsize=14, fontweight='bold', pad=20)

    plt.tight_layout()
    plt.show()

    # Mathematical formulation
    print("\nMATHEMATICAL FORMULATION:")
    print("=" * 40)
    print("Input Tensor: X ∈ ℝ^(B×T×D_in)")
    print("Where:")
    print("  • B = batch_size (number of sequences)")
    print("  • T = time_steps (sequence length)")
    print("  • D_in = input_features (features per time step)")
    print("\nOutput Tensor:")
    print("  • return_sequences=False: Y ∈ ℝ^(B×D_out)")
    print("  • return_sequences=True:  Y ∈ ℝ^(B×T×D_out)")
    print("Where:")
    print("  • D_out = units (RNN hidden size)")

    return results

# Run Exercise 2
exercise_2_results = analyze_rnn_dimensions()

### Exercise 3: Deep Sequence-to-Sequence RNN Configuration

**Question**: If you want to build a deep sequence-to-sequence RNN, which RNN layers should have `return_sequences=True`? What about a sequence-to-vector RNN?

#### Theoretical Analysis

The `return_sequences` parameter controls whether an RNN layer outputs:
- **`False`**: Only the final hidden state (2D tensor)
- **`True`**: Hidden states for all time steps (3D tensor)

**Deep Sequence-to-Sequence RNN:**
- Every RNN layer, including the last, must have `return_sequences=True`
- The output layer (for example `TimeDistributed(Dense)`) is then applied at every time step

**Sequence-to-Vector RNN:**
- All RNN layers except the last must have `return_sequences=True`
- The last RNN layer uses `return_sequences=False` (the default), so only its final output is passed on

**Mathematical Justification:**
For layer $l$ to receive input from layer $l-1$:
$$\mathbf{h}^{(l)}_{t} = f^{(l)}(\mathbf{h}^{(l-1)}_{t}, \mathbf{h}^{(l)}_{t-1})$$

This requires $\mathbf{h}^{(l-1)}_{t}$ for all $t$, hence `return_sequences=True` for layer $l-1$.
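
The stacking rule can be checked with a small NumPy sketch. It makes several assumptions for illustration: `rnn_layer` is a hypothetical stand-in for a Keras recurrent layer, it uses a plain tanh cell rather than an LSTM, its weights are random, and biases are omitted.

```python
import numpy as np

def rnn_layer(inputs, units, return_sequences, rng):
    """Tanh RNN layer over inputs of shape [B, T, D] with random weights (illustrative only)."""
    if inputs.ndim != 3:
        raise ValueError(f"RNN layer expects 3D input [B, T, D], got {inputs.ndim}D")
    batch_size, time_steps, features = inputs.shape
    W_x = rng.normal(scale=0.1, size=(features, units))
    W_h = rng.normal(scale=0.1, size=(units, units))
    h = np.zeros((batch_size, units))
    states = []
    for t in range(time_steps):
        # h_t depends on the input at step t and on the previous state
        h = np.tanh(inputs[:, t] @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states, axis=1) if return_sequences else h

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 20, 10))       # [B, T, D_in]

# Layer 1 returns the full sequence so that layer 2 has an input at every step
h1 = rnn_layer(X, 64, return_sequences=True, rng=rng)

# Sequence-to-sequence: the last RNN layer also returns the full sequence
seq2seq_out = rnn_layer(h1, 32, return_sequences=True, rng=rng)
print(seq2seq_out.shape)                # (32, 20, 32)

# Sequence-to-vector: only the last RNN layer drops the time axis
seq2vec_out = rnn_layer(h1, 32, return_sequences=False, rng=rng)
print(seq2vec_out.shape)                # (32, 32)

# Misconfiguration: the first layer drops the time axis, so stacking fails
stacking_error = None
try:
    collapsed = rnn_layer(X, 64, return_sequences=False, rng=rng)
    rnn_layer(collapsed, 32, return_sequences=True, rng=rng)
except ValueError as err:
    stacking_error = err
    print("Stacking failed:", err)
```

The last case fails because the second layer receives a 2D tensor. It has no per-step inputs $\mathbf{h}^{(l-1)}_{t}$ to iterate over.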

In [None]:
# Exercise 3: Deep RNN Configuration

def demonstrate_deep_rnn_configurations():
    """
    Demonstrates proper configuration of deep RNNs for different tasks.
    """
    print("EXERCISE 3: DEEP RNN CONFIGURATIONS")
    print("=" * 60)

    # Create sample data
    batch_size = 32
    input_time_steps = 20
    output_time_steps = input_time_steps  # stacked RNN layers emit one output per input time step
    input_features = 10

    X_seq2seq = np.random.randn(batch_size, input_time_steps, input_features)
    X_seq2vec = np.random.randn(batch_size, input_time_steps, input_features)

    # Target data
    y_seq2seq = np.random.randn(batch_size, output_time_steps, 1)  # Sequence output
    y_seq2vec = np.random.randn(batch_size, 5)  # Vector output

    print(f"Sample data shapes:")
    print(f"  Sequence-to-Sequence: X={X_seq2seq.shape}, y={y_seq2seq.shape}")
    print(f"  Sequence-to-Vector: X={X_seq2vec.shape}, y={y_seq2vec.shape}")

    # Define different architectures
    architectures = {
        'Deep Seq2Seq (Correct)': {
            'type': 'seq2seq',
            'layers': [
                {'units': 64, 'return_sequences': True, 'layer_num': 1},
                {'units': 64, 'return_sequences': True, 'layer_num': 2},
                {'units': 32, 'return_sequences': True, 'layer_num': 3},
                {'type': 'TimeDistributed', 'units': 1}
            ],
            'description': 'All RNN layers return sequences, TimeDistributed for output'
        },

        'Deep Seq2Seq (Incorrect)': {
            'type': 'seq2seq_wrong',
            'layers': [
                {'units': 64, 'return_sequences': False, 'layer_num': 1},  # Wrong!
                {'units': 64, 'return_sequences': True, 'layer_num': 2},
                {'units': 32, 'return_sequences': True, 'layer_num': 3}
            ],
            'description': 'WRONG: First layer does not return sequences'
        },

        'Deep Seq2Vec (Correct)': {
            'type': 'seq2vec',
            'layers': [
                {'units': 64, 'return_sequences': True, 'layer_num': 1},
                {'units': 64, 'return_sequences': True, 'layer_num': 2},
                {'units': 32, 'return_sequences': False, 'layer_num': 3},  # Last layer only
                {'type': 'Dense', 'units': 5}
            ],
            'description': 'Only last RNN layer returns final state, Dense for output'
        },

        'Deep Seq2Vec (Alternative)': {
            'type': 'seq2vec_alt',
            'layers': [
                {'units': 64, 'return_sequences': True, 'layer_num': 1},
                {'units': 64, 'return_sequences': True, 'layer_num': 2},
                {'units': 32, 'return_sequences': True, 'layer_num': 3},
                {'type': 'GlobalMaxPooling1D'},
                {'type': 'Dense', 'units': 5}
            ],
            'description': 'All layers return sequences, global pooling for aggregation'
        }
    }

    # Build and test models
    model_results = {}

    for arch_name, arch_config in architectures.items():
        print(f"\n{arch_name}:")
        print(f"Description: {arch_config['description']}")

        try:
            # Build model
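            # Strip spaces and parentheses, which are not valid in TensorFlow scope names
            safe_name = arch_name.replace(' ', '_').replace('(', '').replace(')', '')
            model = keras.Sequential(name=safe_name)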

            for i, layer_config in enumerate(arch_config['layers']):
                if layer_config.get('type') == 'TimeDistributed':
                    model.add(keras.layers.TimeDistributed(
                        keras.layers.Dense(layer_config['units'])
                    ))
                elif layer_config.get('type') == 'Dense':
                    model.add(keras.layers.Dense(layer_config['units']))
                elif layer_config.get('type') == 'GlobalMaxPooling1D':
                    model.add(keras.layers.GlobalMaxPooling1D())
                else:
                    # RNN layer
                    if i == 0:  # First layer needs input_shape
                        model.add(keras.layers.LSTM(
                            layer_config['units'],
                            return_sequences=layer_config['return_sequences'],
                            input_shape=(None, input_features)
                        ))
                    else:
                        model.add(keras.layers.LSTM(
                            layer_config['units'],
                            return_sequences=layer_config['return_sequences']
                        ))

            # Test with appropriate data
            if 'seq2seq' in arch_config['type']:
                if arch_config['type'] != 'seq2seq_wrong':
                    test_output = model(X_seq2seq)
                    print(f"  ✓ Model built successfully")
                    print(f"  Input shape: {X_seq2seq.shape}")
                    print(f"  Output shape: {test_output.shape}")
                    print(f"  Expected output shape: {y_seq2seq.shape}")

                    model_results[arch_name] = {
                        'success': True,
                        'input_shape': X_seq2seq.shape,
                        'output_shape': test_output.shape,
                        'parameters': model.count_params()
                    }
                else:
                    # This should fail
                    print(f"  ✗ This configuration will cause errors!")
                    print(f"  Reason: Second layer expects 3D input but gets 2D")
                    model_results[arch_name] = {
                        'success': False,
                        'error': 'Dimension mismatch'
                    }

            else:  # seq2vec
                test_output = model(X_seq2vec)
                print(f"  ✓ Model built successfully")
                print(f"  Input shape: {X_seq2vec.shape}")
                print(f"  Output shape: {test_output.shape}")
                print(f"  Expected output shape: {y_seq2vec.shape}")

                model_results[arch_name] = {
                    'success': True,
                    'input_shape': X_seq2vec.shape,
                    'output_shape': test_output.shape,
                    'parameters': model.count_params()
                }

            # Print layer-by-layer analysis
            print(f"  Layer-by-layer analysis:")
            for j, layer in enumerate(model.layers):
                if hasattr(layer, 'return_sequences'):
                    print(f"    Layer {j+1} ({layer.__class__.__name__}): "
                          f"units={layer.units}, return_sequences={layer.return_sequences}")
                else:
                    print(f"    Layer {j+1} ({layer.__class__.__name__})")

        except Exception as e:
            print(f"  ✗ Error: {str(e)}")
            model_results[arch_name] = {
                'success': False,
                'error': str(e)
            }

    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))

    # 1. Correct Seq2Seq Architecture
    ax = axes[0, 0]
    ax.set_title('Correct Deep Seq2Seq Architecture', fontweight='bold')

    # Draw the architecture
    layer_positions = [0.2, 0.4, 0.6, 0.8]
    layer_names = ['LSTM\n(return_seq=True)', 'LSTM\n(return_seq=True)',
                   'LSTM\n(return_seq=True)', 'TimeDistributed\nDense']
    colors = ['lightblue', 'lightblue', 'lightblue', 'lightgreen']

    for i, (pos, name, color) in enumerate(zip(layer_positions, layer_names, colors)):
        # Layer box
        rect = plt.Rectangle((0.1, pos-0.05), 0.8, 0.1,
                           facecolor=color, edgecolor='black', alpha=0.7)
        ax.add_patch(rect)
        ax.text(0.5, pos, name, ha='center', va='center', fontsize=9, fontweight='bold')

        # Arrow to next layer
        if i < len(layer_positions) - 1:
            ax.arrow(0.5, pos + 0.05, 0, 0.08, head_width=0.02, head_length=0.01,
                    fc='black', ec='black')
            # Show tensor shape
            ax.text(0.92, pos + 0.09, '[B,T,D]', ha='left', va='center', fontsize=8,
                   bbox=dict(boxstyle="round,pad=0.2", facecolor='yellow', alpha=0.7))

    # Input and output labels
    ax.text(0.5, 0.05, 'Input: [B, T_in, D_in]', ha='center', va='center',
           fontsize=10, fontweight='bold')
    ax.text(0.5, 0.95, 'Output: [B, T_out, D_out]', ha='center', va='center',
           fontsize=10, fontweight='bold')

    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis('off')

    # 2. Incorrect Seq2Seq Architecture
    ax = axes[0, 1]
    ax.set_title('INCORRECT Deep Seq2Seq Architecture', fontweight='bold', color='red')

    layer_names_wrong = ['LSTM\n(return_seq=FALSE)', 'LSTM\n(return_seq=True)',
                         'LSTM\n(return_seq=True)', 'Dense']
    colors_wrong = ['lightcoral', 'lightblue', 'lightblue', 'lightgreen']

    for i, (pos, name, color) in enumerate(zip(layer_positions, layer_names_wrong, colors_wrong)):
        rect = plt.Rectangle((0.1, pos-0.05), 0.8, 0.1,
                           facecolor=color, edgecolor='black', alpha=0.7)
        ax.add_patch(rect)
        ax.text(0.5, pos, name, ha='center', va='center', fontsize=9, fontweight='bold')

        if i < len(layer_positions) - 1:
            if i == 0:  # First arrow (problematic)
                ax.arrow(0.5, pos + 0.05, 0, 0.08, head_width=0.02, head_length=0.01,
                        fc='red', ec='red', linewidth=3)
                ax.text(0.92, pos + 0.09, '[B,D]', ha='left', va='center', fontsize=8,
                       bbox=dict(boxstyle="round,pad=0.2", facecolor='red', alpha=0.7))
                ax.text(0.5, pos + 0.09, '✗ ERROR!', ha='center', va='center',
                       fontsize=8, color='red', fontweight='bold')
            else:
                ax.arrow(0.5, pos + 0.05, 0, 0.08, head_width=0.02, head_length=0.01,
                        fc='black', ec='black')

    ax.text(0.5, 0.05, 'Input: [B, T_in, D_in]', ha='center', va='center',
           fontsize=10, fontweight='bold')
    ax.text(0.5, 0.95, 'Cannot produce sequence output!', ha='center', va='center',
           fontsize=10, fontweight='bold', color='red')

    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis('off')

    # 3. Correct Seq2Vec Architecture
    ax = axes[1, 0]
    ax.set_title('Correct Deep Seq2Vec Architecture', fontweight='bold')

    layer_names_vec = ['LSTM\n(return_seq=True)', 'LSTM\n(return_seq=True)',
                       'LSTM\n(return_seq=FALSE)', 'Dense']
    colors_vec = ['lightblue', 'lightblue', 'orange', 'lightgreen']

    for i, (pos, name, color) in enumerate(zip(layer_positions, layer_names_vec, colors_vec)):
        rect = plt.Rectangle((0.1, pos-0.05), 0.8, 0.1,
                           facecolor=color, edgecolor='black', alpha=0.7)
        ax.add_patch(rect)
        ax.text(0.5, pos, name, ha='center', va='center', fontsize=9, fontweight='bold')

        if i < len(layer_positions) - 1:
            ax.arrow(0.5, pos + 0.05, 0, 0.08, head_width=0.02, head_length=0.01,
                    fc='black', ec='black')
            if i < 2:
                ax.text(0.92, pos + 0.09, '[B,T,D]', ha='left', va='center', fontsize=8,
                       bbox=dict(boxstyle="round,pad=0.2", facecolor='yellow', alpha=0.7))
            else:
                ax.text(0.92, pos + 0.09, '[B,D]', ha='left', va='center', fontsize=8,
                       bbox=dict(boxstyle="round,pad=0.2", facecolor='orange', alpha=0.7))

    ax.text(0.5, 0.05, 'Input: [B, T, D_in]', ha='center', va='center',
           fontsize=10, fontweight='bold')
    ax.text(0.5, 0.95, 'Output: [B, D_out]', ha='center', va='center',
           fontsize=10, fontweight='bold')

    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis('off')

    # 4. Summary table
    ax = axes[1, 1]
    ax.axis('off')

    # Create summary table
    summary_data = [
        ['Architecture', 'Hidden Layers', 'Output Layer', 'Use Case'],
        ['Deep Seq2Seq', 'return_sequences=True', 'TimeDistributed', 'Translation, Forecasting'],
        ['Deep Seq2Vec', 'return_sequences=True\n(except last RNN)', 'Dense', 'Classification, Regression'],
        ['Encoder-Decoder', 'Encoder: True\nDecoder: varies', 'Decoder dependent', 'Complex transformations']
    ]

    table = ax.table(
        cellText=summary_data[1:],
        colLabels=summary_data[0],
        cellLoc='center',
        loc='center'
    )
    table.auto_set_font_size(False)
    table.set_fontsize(9)
    table.scale(1.2, 2.5)

    # Style the header
    for i in range(len(summary_data[0])):
        table[(0, i)].set_facecolor('#40466e')
        table[(0, i)].set_text_props(weight='bold', color='white')

    ax.set_title('Configuration Summary', fontsize=14, fontweight='bold', pad=20)

    plt.tight_layout()
    plt.show()

    # Print rules
    print("\nCONFIGURATION RULES:")
    print("=" * 30)
    print("\nDEEP SEQUENCE-TO-SEQUENCE RNN:")
    print("• ALL hidden RNN layers: return_sequences=True")
    print("• Output layer: TimeDistributed(Dense), or an RNN layer with return_sequences=True")
    print("• Reason: Each layer needs full sequence from previous layer")

    print("\nDEEP SEQUENCE-TO-VECTOR RNN:")
    print("• Hidden RNN layers: return_sequences=True")
    print("• LAST RNN layer: return_sequences=False")
    print("• Output layer: Dense")
    print("• Alternative: All layers True + GlobalPooling")

    print("\nCOMMON MISTAKES:")
    print("• Setting return_sequences=False in hidden layers")
    print("• Forgetting TimeDistributed for sequence outputs")
    print("• Mismatching input/output dimensions")

    return model_results

# Run Exercise 3
exercise_3_results = demonstrate_deep_rnn_configurations()
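
The shape rules above can be checked without Keras. The following is a minimal NumPy sketch of a tanh RNN layer, not the Keras implementation; the helper names `simple_rnn_forward` and `make_params` are hypothetical and exist only for this illustration. It shows that `return_sequences=True` yields `[B, T, D]`, `return_sequences=False` yields `[B, D]`, and that a stacked layer needs the 3D output of the layer below it.

```python
import numpy as np

def simple_rnn_forward(x, w_x, w_h, b, return_sequences):
    """Forward pass of a basic tanh RNN layer.

    x has shape [batch, time, features]. The result has shape
    [batch, time, units] if return_sequences, else [batch, units].
    """
    batch, steps, _ = x.shape  # fails on 2D input, like a stacked RNN layer would
    units = w_h.shape[0]
    h = np.zeros((batch, units))
    outputs = []
    for t in range(steps):
        h = np.tanh(x[:, t, :] @ w_x + h @ w_h + b)
        outputs.append(h)
    return np.stack(outputs, axis=1) if return_sequences else h

rng = np.random.default_rng(0)

def make_params(n_in, n_units):
    # Hypothetical helper: small random weights for one layer
    return (rng.normal(scale=0.1, size=(n_in, n_units)),
            rng.normal(scale=0.1, size=(n_units, n_units)),
            np.zeros(n_units))

x = rng.normal(size=(4, 10, 1))            # [B=4, T=10, D_in=1]
p1, p2 = make_params(1, 8), make_params(8, 8)

seq_out = simple_rnn_forward(x, *p1, return_sequences=True)         # hidden layer
vec_out = simple_rnn_forward(x, *p1, return_sequences=False)        # last state only
stacked = simple_rnn_forward(seq_out, *p2, return_sequences=False)  # seq2vec stack

print(seq_out.shape, vec_out.shape, stacked.shape)

try:
    simple_rnn_forward(vec_out, *p2, return_sequences=False)  # 2D input to an RNN layer
except ValueError as err:
    print("2D input rejected:", err)
```

The last call mirrors the incorrect configuration in the figure: once a hidden layer drops the time axis, the next recurrent layer has no sequence left to iterate over.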

### Exercise 4: Time Series Forecasting Architecture

**Question**: Suppose you have a daily univariate time series, and you want to forecast the next seven days. Which RNN architecture should you use?

#### Theoretical Analysis

To forecast the next 7 days from a daily univariate time series, there are several architectural choices:

**Option 1: Sequence-to-Vector + Dense Output**
- Input: Historical sequence `[batch, time_steps, 1]`
- RNN: Process sequence, output final state
- Dense: Map to 7-dimensional output
- Mathematical form: $f: \mathbb{R}^{T \times 1} \rightarrow \mathbb{R}^7$

**Option 2: Sequence-to-Sequence**
- Input: Historical sequence
- Output: Sequence of 7 future values
- Each time step predicts multiple future steps
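
One reading of Option 2 is that the target at every input time step is the block of 7 values that follow it, so targets have shape `[batch, time_steps, 7]`. The sketch below builds such targets with NumPy under that assumption; `make_seq2seq_windows` is a hypothetical helper name, not part of any library.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_seq2seq_windows(series, lookback, horizon):
    """Build inputs [n, lookback, 1] and per-step targets [n, lookback, horizon].

    Y[i, t, :] holds the `horizon` values that follow input step t of window i.
    """
    series = np.asarray(series, dtype=float)
    windows = sliding_window_view(series, lookback + horizon)   # [n, lookback + horizon]
    X = windows[:, :lookback, np.newaxis]                       # [n, lookback, 1]
    Y = sliding_window_view(windows[:, 1:], horizon, axis=1)    # [n, lookback, horizon]
    return X.copy(), Y.copy()

series = np.arange(50.0)   # toy series 0, 1, ..., 49
X, Y = make_seq2seq_windows(series, lookback=30, horizon=7)

print(X.shape, Y.shape)
print(Y[0, 0])    # targets for the first input step of the first window
print(Y[0, -1])   # targets for the last input step: the actual 7-day forecast
```

Only the targets at the last time step are needed at prediction time; the earlier steps give the model extra training signal.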

**Option 3: Autoregressive (Iterative)**
- Predict one step ahead, add to input, repeat
- More prone to error accumulation
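
The iterative loop of Option 3 can be sketched as follows. This is a minimal illustration in which `predict_next` stands in for any trained one-step-ahead model; here it is a seasonal-naive rule (an assumption made for the example) so that the code runs without training anything.

```python
import numpy as np

def iterative_forecast(predict_next, history, horizon):
    """Roll a one-step-ahead predictor forward `horizon` steps.

    predict_next: callable mapping a 1D window to the next value
                  (hypothetical stand-in for a trained model).
    """
    window = np.asarray(history, dtype=float).copy()
    forecasts = []
    for _ in range(horizon):
        next_value = float(predict_next(window))
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the prediction.
        # From here on the model consumes its own outputs, so errors can compound.
        window = np.append(window[1:], next_value)
    return np.array(forecasts)

def seasonal_naive(window, period=7):
    # Stand-in predictor: repeat the value observed one week earlier
    return window[-period]

history = np.arange(1.0, 31.0)   # 30 days with values 1..30
forecast = iterative_forecast(seasonal_naive, history, horizon=7)
print(forecast)
```

After the first step every prediction depends on earlier predictions, which is why this approach accumulates error more readily than predicting all 7 days at once.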

**Option 4: Encoder-Decoder**
- Encoder: Compress historical data
- Decoder: Generate 7-day forecast

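**Recommended**: Sequence-to-Vector with a Dense output layer of 7 units. It is the simplest option, and it predicts all seven days in a single pass, which avoids the error accumulation of iterative forecasting.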

In [None]:
# Exercise 4: Time Series Forecasting Architecture

def design_forecasting_architectures():
    """
    Designs and compares different architectures for 7-day forecasting.
    """
    print("EXERCISE 4: TIME SERIES FORECASTING ARCHITECTURES")
    print("=" * 70)

    # Generate realistic daily time series data
    np.random.seed(42)

    def generate_realistic_daily_series(n_series=1000, n_days=100):
        """
        Generates realistic daily time series with trend, seasonality, and noise.
        """
        series_list = []

        for i in range(n_series):
            # Time axis
            t = np.arange(n_days)

            # Trend component
            trend = np.random.uniform(-0.01, 0.01) * t

            # Weekly seasonality (7-day cycle)
            weekly_pattern = np.random.uniform(0.5, 2.0) * np.sin(2 * np.pi * t / 7 + np.random.uniform(0, 2*np.pi))

            # Monthly seasonality (30-day cycle)
            monthly_pattern = np.random.uniform(0.2, 1.0) * np.sin(2 * np.pi * t / 30 + np.random.uniform(0, 2*np.pi))

            # Random walk component
            random_walk = np.cumsum(np.random.normal(0, 0.1, n_days))

            # Noise
            noise = np.random.normal(0, 0.2, n_days)

            # Combine components
            series = trend + weekly_pattern + monthly_pattern + random_walk + noise

            # Add a base level
            series += np.random.uniform(10, 50)

            series_list.append(series)

        return np.array(series_list)

    # Generate data
    all_series = generate_realistic_daily_series(1000, 100)

    # Prepare data for 7-day forecasting
    lookback_days = 30  # Use 30 days to predict next 7
    forecast_days = 7

    X, y = [], []
    for series in all_series:
        for i in range(len(series) - lookback_days - forecast_days + 1):
            X.append(series[i:i + lookback_days])
            y.append(series[i + lookback_days:i + lookback_days + forecast_days])

    X = np.array(X).reshape(-1, lookback_days, 1)
    y = np.array(y)

    print(f"Dataset prepared:")
    print(f"  Input shape: {X.shape} (samples, lookback_days, features)")
    print(f"  Output shape: {y.shape} (samples, forecast_days)")

    # Split data
    train_size = int(0.7 * len(X))
    val_size = int(0.2 * len(X))

    X_train, y_train = X[:train_size], y[:train_size]
    X_val, y_val = X[train_size:train_size + val_size], y[train_size:train_size + val_size]
    X_test, y_test = X[train_size + val_size:], y[train_size + val_size:]

    # Define different architectures
    architectures = {
        'Sequence-to-Vector + Dense (Recommended)': {
            'model': keras.Sequential([
                keras.layers.LSTM(64, return_sequences=True, input_shape=(lookback_days, 1)),
                keras.layers.LSTM(32, return_sequences=False),
                keras.layers.Dense(64, activation='relu'),
                keras.layers.Dropout(0.2),
                keras.layers.Dense(forecast_days)  # Direct 7-day output
            ]),
            'description': 'Process sequence, output vector of 7 forecasts',
            'pros': ['Simple', 'Direct optimization', 'No error accumulation'],
            'cons': ['Assumes independence of forecast days']
        },

        'Sequence-to-Sequence': {
            'model': keras.Sequential([
                keras.layers.LSTM(64, return_sequences=True, input_shape=(lookback_days, 1)),
                keras.layers.LSTM(32, return_sequences=True),
                keras.layers.TimeDistributed(keras.layers.Dense(32, activation='relu')),
                keras.layers.TimeDistributed(keras.layers.Dense(1)),
                # Keep only the last forecast_days steps so the output shape
                # [batch, forecast_days, 1] matches the reshaped targets
                keras.layers.Cropping1D(cropping=(lookback_days - forecast_days, 0))
            ]),
            'description': 'Each of the last 7 time steps predicts one future day',
            'pros': ['Models sequential dependencies', 'Rich training signal'],
            'cons': ['More complex', 'Requires reshaping']
        },

        'Encoder-Decoder': {
            'model': None,  # Will build custom model
            'description': 'Encode past, decode future',
            'pros': ['Handles variable lengths', 'Attention can be added'],
            'cons': ['Most complex', 'More parameters']
        },

        'CNN-LSTM Hybrid': {
            'model': keras.Sequential([
                keras.layers.Conv1D(32, 3, activation='relu', input_shape=(lookback_days, 1)),
                keras.layers.Conv1D(32, 3, activation='relu'),
                keras.layers.LSTM(64, return_sequences=False),
                keras.layers.Dense(32, activation='relu'),
                keras.layers.Dense(forecast_days)
            ]),
            'description': 'CNN for local patterns, LSTM for long-term dependencies',
            'pros': ['Captures both local and long-term patterns', 'Efficient'],
            'cons': ['More hyperparameters to tune']
        }
    }

    # Build Encoder-Decoder model
    # Encoder
    encoder_inputs = keras.layers.Input(shape=(lookback_days, 1))
    encoder_lstm = keras.layers.LSTM(64, return_state=True)
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
    encoder_states = [state_h, state_c]

    # Decoder
    decoder_inputs = keras.layers.Input(shape=(forecast_days, 1))
    decoder_lstm = keras.layers.LSTM(64, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = keras.layers.TimeDistributed(keras.layers.Dense(1))
    decoder_outputs = decoder_dense(decoder_outputs)

    encoder_decoder_model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
    architectures['Encoder-Decoder']['model'] = encoder_decoder_model

    # Train and evaluate models
    results = {}

    for arch_name, arch_info in architectures.items():
        print(f"\nTraining {arch_name}...")
        model = arch_info['model']
        model.compile(optimizer='adam', loss='mse', metrics=['mae'])

        print(f"Parameters: {model.count_params():,}")

        # Prepare training data based on architecture
        if arch_name == 'Sequence-to-Sequence':
            # Reshape y for sequence output: [samples, forecast_days, 1]
            y_train_arch = y_train.reshape(-1, forecast_days, 1)
            y_val_arch = y_val.reshape(-1, forecast_days, 1)
            y_test_arch = y_test.reshape(-1, forecast_days, 1)
            # Set the inputs explicitly instead of relying on a previous iteration
            X_train_arch, X_val_arch, X_test_arch = X_train, X_val, X_test
        elif arch_name == 'Encoder-Decoder':
            # Decoder inputs are all-zero placeholders (no teacher forcing),
            # so the decoder relies only on the encoder's final state
            decoder_input_train = np.zeros((len(y_train), forecast_days, 1))
            decoder_input_val = np.zeros((len(y_val), forecast_days, 1))
            decoder_input_test = np.zeros((len(y_test), forecast_days, 1))

            y_train_arch = y_train.reshape(-1, forecast_days, 1)
            y_val_arch = y_val.reshape(-1, forecast_days, 1)
            y_test_arch = y_test.reshape(-1, forecast_days, 1)

            X_train_arch = [X_train, decoder_input_train]
            X_val_arch = [X_val, decoder_input_val]
            X_test_arch = [X_test, decoder_input_test]
        else:
            y_train_arch = y_train
            y_val_arch = y_val
            y_test_arch = y_test
            X_train_arch = X_train
            X_val_arch = X_val
            X_test_arch = X_test

        # Train model
        import time
        start_time = time.time()

        history = model.fit(
            X_train_arch, y_train_arch,
            epochs=15,
            batch_size=32,
            validation_data=(X_val_arch, y_val_arch),
            verbose=0
        )

        training_time = time.time() - start_time

        # Evaluate
        test_loss = model.evaluate(X_test_arch, y_test_arch, verbose=0)[0]

        # Get predictions
        predictions = model.predict(X_test_arch, verbose=0)
        if len(predictions.shape) == 3:  # Sequence output
            predictions = predictions.reshape(-1, forecast_days)

        # Calculate metrics for each forecast day
        day_mses = []
        for day in range(forecast_days):
            day_mse = np.mean((y_test[:, day] - predictions[:, day]) ** 2)
            day_mses.append(day_mse)

        results[arch_name] = {
            'test_mse': test_loss,
            'day_mses': day_mses,
            'parameters': model.count_params(),
            'training_time': training_time,
            'predictions': predictions[:100],  # Save first 100 predictions
            'history': history,
            'pros': arch_info['pros'],
            'cons': arch_info['cons']
        }

        print(f"Test MSE: {test_loss:.6f}, Training time: {training_time:.1f}s")
        print(f"MSE by day: {[f'{mse:.4f}' for mse in day_mses]}")

    # Comprehensive visualization
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))

    # 1. Sample time series
    ax = axes[0, 0]
    sample_series = all_series[:3]
    for i, series in enumerate(sample_series):
        ax.plot(series, alpha=0.7, linewidth=2, label=f'Series {i+1}')

    ax.set_title('Sample Daily Time Series')
    ax.set_xlabel('Day')
    ax.set_ylabel('Value')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Add annotations for patterns
    ax.annotate('Weekly Pattern', xy=(7, sample_series[0][7]), xytext=(15, sample_series[0][7]+5),
                arrowprops=dict(arrowstyle='->', color='red', alpha=0.7),
                fontsize=10, color='red')

    # 2. Architecture comparison
    ax = axes[0, 1]
    arch_names = list(results.keys())
    test_mses = [results[name]['test_mse'] for name in arch_names]

    bars = ax.bar(range(len(arch_names)), test_mses, alpha=0.7,
                  color=['skyblue', 'lightgreen', 'orange', 'lightcoral'])
    ax.set_title('7-Day Forecasting Performance')
    ax.set_xlabel('Architecture')
    ax.set_ylabel('Test MSE')
    ax.set_xticks(range(len(arch_names)))
    ax.set_xticklabels(arch_names, rotation=45, ha='right')

    # Add value labels
    for bar, value in zip(bars, test_mses):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.0001,
                f'{value:.4f}', ha='center', va='bottom', fontsize=9)

    # 3. Forecast examples
    ax = axes[1, 0]
    best_arch = min(arch_names, key=lambda x: results[x]['test_mse'])
    best_model = architectures[best_arch]['model']

    # Show forecast examples
    for i in range(3):
        if best_arch == 'Encoder-Decoder':
            # The encoder-decoder needs both the history and the zero decoder inputs
            sample_input = [X_test[i:i+1], decoder_input_test[i:i+1]]
        else:
            sample_input = X_test[i:i+1]

        # Flatten [1, 7] or [1, 7, 1] outputs to a 1D forecast of length 7
        sample_pred = best_model.predict(sample_input, verbose=0).reshape(-1)

        input_seq = X_test[i, :, 0]
        true_forecast = y_test[i]

        # Plot input sequence
        days = range(len(input_seq))
        ax.plot(days, input_seq, 'b-', linewidth=2, alpha=0.7)

        # Plot forecasts
        forecast_days_range = range(len(input_seq), len(input_seq) + forecast_days)
        ax.plot(forecast_days_range, true_forecast, 'r-', linewidth=3,
                label='True Future' if i == 0 else '', alpha=0.8)
        ax.plot(forecast_days_range, sample_pred, 'g--', linewidth=2,
                label='Forecast' if i == 0 else '', alpha=0.8)

        # Add vertical line separating past and future
        ax.axvline(x=len(input_seq)-0.5, color='gray', linestyle=':', alpha=0.7)

    ax.set_title(f'7-Day Forecasts: {best_arch}')
    ax.set_xlabel('Day')
    ax.set_ylabel('Value')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # 4. Error by forecast day
    ax = axes[1, 1]

    # Reuse the per-day MSEs computed during evaluation
    day_errors = {name: results[name]['day_mses'] for name in arch_names}

    # Plot error by day
    for name, errors in day_errors.items():
        ax.plot(range(1, forecast_days + 1), errors, 'o-',
                linewidth=2, label=name, alpha=0.8)

    ax.set_title('Forecast Error by Day')
    ax.set_xlabel('Forecast Day')
    ax.set_ylabel('MSE')
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Print detailed recommendations
    print("\nRECOMMENDATION ANALYSIS:")
    print("=" * 50)

    # Rank architectures
    ranked_archs = sorted(arch_names, key=lambda x: results[x]['test_mse'])

    print(f"\nRanking by Performance:")
    for i, name in enumerate(ranked_archs, 1):
        print(f"{i}. {name}: MSE = {results[name]['test_mse']:.6f}")
        print(f"   Pros: {', '.join(results[name]['pros'])}")
        print(f"   Cons: {', '.join(results[name]['cons'])}")
        print(f"   Parameters: {results[name]['parameters']:,}")
        print(f"   Training time: {results[name]['training_time']:.1f}s\n")

    print("FINAL RECOMMENDATION:")
    print(f"For 7-day univariate time series forecasting, use: {ranked_archs[0]}")
    print(f"Reasoning:")
    print(f"• Best performance: {results[ranked_archs[0]]['test_mse']:.6f} MSE")
    print(f"• {', '.join(results[ranked_archs[0]]['pros'])}")
    print(f"• Manageable complexity: {results[ranked_archs[0]]['parameters']:,} parameters")

    return results

# Run Exercise 4
exercise_4_results = design_forecasting_architectures()

### Exercise 5: Main Difficulties Training RNNs

**Question**: What are the main difficulties when training RNNs? How can you handle them?

#### Theoretical Analysis

Training RNNs presents several fundamental challenges:

**1. Vanishing Gradient Problem**
- **Mathematical Cause**: Gradients are computed via chain rule across time steps
- **Formula**: $\frac{\partial L}{\partial W} = \sum_{t=1}^T \sum_{k=1}^t \frac{\partial L^{(t)}}{\partial h^{(t)}} \prod_{i=k+1}^t \frac{\partial h^{(i)}}{\partial h^{(i-1)}} \frac{\partial h^{(k)}}{\partial W}$
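- **Problem**: The product term $\prod_{i=k+1}^t \frac{\partial h^{(i)}}{\partial h^{(i-1)}}$ shrinks exponentially in $t-k$ when the Jacobian norms stay below 1, so distant time steps contribute almost nothing to the gradient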

**2. Exploding Gradient Problem**
- **Cause**: The same product of Jacobians grows exponentially in $t-k$ when their norms exceed 1
- **Effect**: Unstable training, parameter updates too large
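
Both regimes can be reproduced with a deliberately simplified model. The sketch below assumes a linear recurrence $h^{(t)} = W h^{(t-1)}$ with $W$ equal to a random orthogonal matrix times a scalar, so every Jacobian has all singular values equal to that scalar; `jacobian_product_norms` is a hypothetical helper name. Real RNNs have state-dependent Jacobians, so this illustrates the mechanism rather than any particular network.

```python
import numpy as np

def jacobian_product_norms(recurrent_scale, n_units=16, n_steps=50, seed=0):
    """Spectral norm of the product of 1..n_steps identical Jacobians.

    Assumes a linear recurrence h_t = W h_{t-1}, where W is a random
    orthogonal matrix scaled by recurrent_scale.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(n_units, n_units)))  # orthogonal matrix
    w = recurrent_scale * q
    product = np.eye(n_units)
    norms = []
    for _ in range(n_steps):
        product = w @ product                     # one more step back in time
        norms.append(np.linalg.norm(product, ord=2))
    return np.array(norms)

vanishing = jacobian_product_norms(0.9)
exploding = jacobian_product_norms(1.1)

print(f"scale 0.9, 50 steps back: {vanishing[-1]:.2e}")
print(f"scale 1.1, 50 steps back: {exploding[-1]:.2e}")
```

With these settings the norm after 50 steps equals `0.9**50` (roughly 5e-3) in the first case and `1.1**50` (roughly 1.2e2) in the second: a 10% change in scale on either side of 1 separates a vanishing gradient from an exploding one.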

**3. Long-term Dependencies**
- **Issue**: Information from early time steps gets lost
- **Mathematical form**: $h^{(t)} = f(W h^{(t-1)} + U x^{(t)})$. Every step transforms the state again, so information from early inputs is progressively diluted

**4. Computational Complexity**
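- **Memory**: $O(T \times H)$ activations must be stored for backpropagation, for sequence length $T$ and $H$ hidden units
- **Time**: $O(T)$ sequential steps. Step $t$ depends on step $t-1$, so time steps cannot be processed in parallel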
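
One simple way to bound both costs is to train on shorter windows, which limits how far backpropagation unrolls. The sketch below splits long sequences into fixed-length chunks with NumPy; `split_into_chunks` is a hypothetical helper, and the approach assumes the task still makes sense on shorter windows. Dependencies longer than a chunk are lost unless the hidden state is carried across chunks (for example with a stateful RNN).

```python
import numpy as np

def split_into_chunks(X, chunk_len):
    """Split [batch, steps, features] into [batch * n_chunks, chunk_len, features].

    The oldest leftover steps are dropped so that all chunks have equal length.
    """
    batch, steps, features = X.shape
    n_chunks = steps // chunk_len
    usable = n_chunks * chunk_len
    trimmed = X[:, steps - usable:, :]          # keep the most recent steps
    return trimmed.reshape(batch * n_chunks, chunk_len, features)

# Toy data: 2 sequences of 10 steps, values 0..9 and 10..19
X = np.arange(20, dtype=np.float32).reshape(2, 10, 1)
chunks = split_into_chunks(X, chunk_len=4)

print(chunks.shape)
print(chunks[:, :, 0])
```

Each chunk is unrolled over only `chunk_len` steps, so the activations stored per sequence drop from $T$ steps to `chunk_len` steps.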

#### Solutions and Techniques
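
Gradient clipping is the most direct remedy for exploding gradients. The sketch below shows clipping by global norm in plain NumPy so the arithmetic is visible; `clip_by_global_norm` here is an illustrative helper written for this guide. In Keras the usual route is to pass `clipnorm` or `clipvalue` to the optimizer rather than clipping by hand.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return [g.copy() for g in grads], global_norm
    scale = max_norm / global_norm               # same factor for every array
    return [g * scale for g in grads], global_norm

# Toy gradients with joint norm sqrt(3^2 + 4^2 + 12^2) = 13
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(f"norm before: {norm_before:.1f}, norm after: {norm_after:.1f}")
```

Because all arrays are scaled by the same factor, the update direction is preserved and only its length changes. Clipping each value independently does not have this property.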

In [None]:
# Exercise 5: RNN Training Difficulties and Solutions

def demonstrate_rnn_training_challenges():
    """
    Comprehensive demonstration of RNN training challenges and solutions.
    """
    print("EXERCISE 5: RNN TRAINING DIFFICULTIES AND SOLUTIONS")
    print("=" * 70)

    # Define the main challenges and their solutions
    challenges = {
        '1. Vanishing Gradients': {
            'description': 'Gradients become exponentially smaller through time',
            'mathematical_cause': 'Jacobian norms below 1, so their product shrinks across time steps',
            'formula': '∏(∂h^(i)/∂h^(i-1)) → 0 as sequence length increases',
            'effects': [
                'Inputs from early time steps barely influence weight updates',
                'Long-term dependencies not captured',
                'Training stagnation'
            ],
            'solutions': [
                'LSTM/GRU cells with gating mechanisms',
                'Better weight initialization (Xavier/He)',
                'Layer normalization',
                # Gradient clipping is listed under exploding gradients only:
                # it caps large gradients and does not restore vanished ones
                'Residual connections'
            ]
        },

        '2. Exploding Gradients': {
            'description': 'Gradients become exponentially larger through time',
            'mathematical_cause': 'Jacobian norms above 1, so their product grows across time steps',
            'formula': '∏(∂h^(i)/∂h^(i-1)) → ∞ causing numerical instability',
            'effects': [
                'Training becomes unstable',
                'Parameters change too rapidly',
                'Loss oscillates or diverges',
                'NaN values in computations'
            ],
            'solutions': [
                'Gradient clipping (norm or value clipping)',
                'Lower learning rates',
                'Better weight initialization',
                'Layer normalization',
                'LSTM/GRU architectures'
            ]
        },

        '3. Long-term Dependencies': {
            'description': 'Difficulty learning patterns across long sequences',
            'mathematical_cause': 'Information decay through recurrent transformations',
            'formula': 'h^(t) = f(Wh^(t-1) + Ux^(t)) - early info gets diluted',
            'effects': [
                'Cannot learn long-range patterns',
                'Poor performance on tasks requiring long memory',
                'Information bottleneck at hidden state'
            ],
            'solutions': [
                'LSTM/GRU with memory cells',
                'Attention mechanisms',
                'Transformer architectures',
                'Residual connections',
                'Memory networks'
            ]
        },

        '4. Computational Complexity': {
            'description': 'High memory and time complexity for long sequences',
            'mathematical_cause': 'Sequential nature prevents parallelization',
            'formula': 'Time: O(T), Memory: O(T×H) for T steps, H hidden units',
            'effects': [
                'Slow training and inference',
                'Memory bottlenecks for long sequences',
                'Difficulty scaling to large datasets'
            ],
            'solutions': [
                'Truncated backpropagation through time',
                '1D CNNs for parallelizable processing',
                'Attention mechanisms (Transformers)',
                'Model parallelism',
                'Mixed precision training'
            ]
        },

        '5. Overfitting': {
            'description': 'RNNs can easily overfit to training sequences',
            'mathematical_cause': 'High model capacity and sequential dependencies',
            'formula': 'Model learns specific patterns rather than generalizable features',
            'effects': [
                'Poor generalization to new sequences',
                'High variance in predictions',
                'Memorization instead of learning'
            ],
            'solutions': [
                'Dropout (regular and recurrent)',
                'Weight regularization (L1/L2)',
                'Early stopping',
                'Data augmentation',
                'Batch normalization/Layer normalization'
            ]
        }
    }

    # Create practical demonstrations

    # 1. Gradient flow analysis
    def analyze_gradient_flow_detailed():
        """
        Detailed analysis of how gradients flow through RNN layers.
        """
        print("\nGRADIENT FLOW ANALYSIS:")
        print("-" * 30)

        # Simulate different scenarios
        sequence_lengths = [10, 50, 100, 200]
        weight_scales = [0.5, 1.0, 1.5, 2.0]

        fig, axes = plt.subplots(2, 2, figsize=(15, 10))

        # Different weight scales
        ax = axes[0, 0]
        for weight_scale in weight_scales:
            gradients = [1.0]  # Initial gradient
            for step in range(50):
                # Simplified gradient computation
                activation_grad = 0.25  # Assumed per-step activation derivative (sigmoid's maximum; tanh's can reach 1.0)
                new_grad = gradients[-1] * weight_scale * activation_grad
                gradients.append(new_grad)

            ax.semilogy(gradients, label=f'Weight scale = {weight_scale}', linewidth=2)

        ax.set_title('Gradient Magnitude vs Weight Scale')
        ax.set_xlabel('Steps Back in Time')
        ax.set_ylabel('Gradient Magnitude (log scale)')
        ax.legend()
        ax.grid(True, alpha=0.3)

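        # Different activation functions
        # The per-step derivative factors below are illustrative assumptions, not measurements:
        # sigmoid'(x) is at most 0.25; tanh'(x) is at most 1.0 but drops quickly as units saturate
        ax = axes[0, 1]
        activations = {
            'tanh': 0.25,
            'sigmoid': 0.25,
            'ReLU': 1.0,
            'LeakyReLU': 0.95
        }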

        weight_scale = 1.0
        for act_name, act_grad in activations.items():
            gradients = [1.0]
            for step in range(50):
                new_grad = gradients[-1] * weight_scale * act_grad
                gradients.append(new_grad)

            ax.semilogy(gradients, label=act_name, linewidth=2)

        ax.set_title('Gradient Flow by Activation Function')
        ax.set_xlabel('Steps Back in Time')
        ax.set_ylabel('Gradient Magnitude (log scale)')
        ax.legend()
        ax.grid(True, alpha=0.3)

        # Sequence length effect
        ax = axes[1, 0]
        for seq_len in sequence_lengths:
            gradients = [1.0]
            weight_scale = 0.9  # Slightly vanishing
            for step in range(seq_len):
                new_grad = gradients[-1] * weight_scale * 0.25
                gradients.append(new_grad)

            ax.semilogy(range(len(gradients)), gradients,
                       label=f'Seq length = {seq_len}', linewidth=2)

        ax.set_title('Vanishing Gradients vs Sequence Length')
        ax.set_xlabel('Steps Back in Time')
        ax.set_ylabel('Gradient Magnitude (log scale)')
        ax.legend()
        ax.grid(True, alpha=0.3)

        # Solutions effectiveness
        ax = axes[1, 1]
        solutions = {
            'Standard RNN': {'weight_scale': 0.9, 'clip': False, 'norm': False},
            'Gradient Clipping': {'weight_scale': 1.5, 'clip': True, 'norm': False},
            'Layer Norm': {'weight_scale': 1.2, 'clip': False, 'norm': True},
            'Both Solutions': {'weight_scale': 1.2, 'clip': True, 'norm': True}
        }

        for sol_name, params in solutions.items():
            gradients = [1.0]
            for step in range(30):
                new_grad = gradients[-1] * params['weight_scale'] * 0.25

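                # Apply gradient clipping (never triggers in this toy setup:
                # every per-step factor is below 1, so gradients stay under 5.0)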
                if params['clip'] and abs(new_grad) > 5.0:
                    new_grad = 5.0 * np.sign(new_grad)

                # Apply layer normalization (simplified)
                if params['norm']:
                    new_grad = new_grad * 0.8  # Stabilizing effect

                gradients.append(new_grad)

            ax.plot(gradients, label=sol_name, linewidth=2)

        ax.set_title('Solution Effectiveness')
        ax.set_xlabel('Steps Back in Time')
        ax.set_ylabel('Gradient Magnitude')
        ax.legend()
        ax.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

    # 2. Practical solution implementations
    def implement_solutions():
        """
        Implements and compares different solutions to RNN training problems.
        """
        print("\nPRACTICAL SOLUTION IMPLEMENTATIONS:")
        print("-" * 40)

        # Generate challenging sequence data (long sequences with long-term dependencies)
        def generate_long_term_dependency_data(n_samples=1000, seq_len=100):
            """
            Generates data where the target depends on information from early in the sequence.
            """
            X = np.random.randn(n_samples, seq_len, 1)
            # Target depends on sum of first 10 values and pattern in middle
            y = np.sum(X[:, :10, 0], axis=1) + np.mean(X[:, 40:50, 0], axis=1)
            y = y.reshape(-1, 1)
            return X.astype(np.float32), y.astype(np.float32)

        X_long, y_long = generate_long_term_dependency_data(1000, 100)

        # Split data
        train_size = int(0.7 * len(X_long))
        val_size = int(0.2 * len(X_long))

        X_train_long = X_long[:train_size]
        y_train_long = y_long[:train_size]
        X_val_long = X_long[train_size:train_size + val_size]
        y_val_long = y_long[train_size:train_size + val_size]
        X_test_long = X_long[train_size + val_size:]
        y_test_long = y_long[train_size + val_size:]

        print(f"Long-term dependency data: {X_long.shape}, {y_long.shape}")

        # Define models with different solutions
        models = {}

        # 1. Problematic RNN (baseline)
        models['Problematic RNN'] = keras.Sequential([
            keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 1]),
            keras.layers.SimpleRNN(32),
            keras.layers.Dense(1)
        ])

        # 2. With gradient clipping
        models['RNN + Grad Clipping'] = keras.Sequential([
            keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 1]),
            keras.layers.SimpleRNN(32),
            keras.layers.Dense(1)
        ])

        # 3. With dropout
        models['RNN + Dropout'] = keras.Sequential([
            keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 1],
                                   dropout=0.2, recurrent_dropout=0.2),
            keras.layers.SimpleRNN(32, dropout=0.2, recurrent_dropout=0.2),
            keras.layers.Dense(1)
        ])

        # 4. LSTM solution
        models['LSTM Solution'] = keras.Sequential([
            keras.layers.LSTM(32, return_sequences=True, input_shape=[None, 1]),
            keras.layers.LSTM(32),
            keras.layers.Dense(1)
        ])

        # 5. GRU solution
        models['GRU Solution'] = keras.Sequential([
            keras.layers.GRU(32, return_sequences=True, input_shape=[None, 1]),
            keras.layers.GRU(32),
            keras.layers.Dense(1)
        ])

        # Compile with different optimizers
        optimizers = {
            'Problematic RNN': keras.optimizers.Adam(learning_rate=0.001),
            'RNN + Grad Clipping': keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0),
            'RNN + Dropout': keras.optimizers.Adam(learning_rate=0.001),
            'LSTM Solution': keras.optimizers.Adam(learning_rate=0.001),
            'GRU Solution': keras.optimizers.Adam(learning_rate=0.001)
        }

        for name, model in models.items():
            model.compile(optimizer=optimizers[name], loss='mse', metrics=['mae'])

        # Train and evaluate
        results = {}
        training_histories = {}

        epochs = 20

        for name, model in models.items():
            print(f"\nTraining {name}...")

            try:
                history = model.fit(
                    X_train_long, y_train_long,
                    epochs=epochs,
                    batch_size=32,
                    validation_data=(X_val_long, y_val_long),
                    verbose=0
                )

                # Evaluate
                test_loss = model.evaluate(X_test_long, y_test_long, verbose=0)[0]
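                # Best (lowest) validation loss seen during training
                final_val_loss = min(history.history['val_loss'])
                # np.argmin is 0-based; add 1 to report a 1-based epoch number
                convergence_epoch = int(np.argmin(history.history['val_loss'])) + 1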

                results[name] = {
                    'test_mse': test_loss,
                    'best_val_mse': final_val_loss,
                    'convergence_epoch': convergence_epoch,
                    'training_stable': True,
                    'parameters': model.count_params()
                }

                training_histories[name] = history

                print(f"  Test MSE: {test_loss:.6f}")
                print(f"  Best Val MSE: {final_val_loss:.6f}")
                print(f"  Converged at epoch: {convergence_epoch}")

            except Exception as e:
                print(f"  Training failed: {str(e)}")
                results[name] = {
                    'test_mse': float('inf'),
                    'best_val_mse': float('inf'),
                    'convergence_epoch': -1,
                    'training_stable': False,
                    'error': str(e)
                }

        return results, training_histories

    # Run analyses
    analyze_gradient_flow_detailed()
    solution_results, solution_histories = implement_solutions()

    # Comprehensive summary
    print("\nCOMPREHENSIVE SUMMARY:")
    print("=" * 50)

    # Create summary visualization
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))

    # 1. Challenge overview
    ax = axes[0, 0]
    challenge_names = list(challenges.keys())
    severity_scores = [5, 4, 5, 3, 3]  # Relative severity (subjective)

    bars = ax.barh(range(len(challenge_names)), severity_scores,
                   color=['red', 'orange', 'darkred', 'blue', 'green'], alpha=0.7)
    ax.set_yticks(range(len(challenge_names)))
    ax.set_yticklabels([name.split('.')[1].strip() for name in challenge_names])
    ax.set_xlabel('Relative Difficulty/Impact')
    ax.set_title('RNN Training Challenges by Severity')
    ax.grid(True, alpha=0.3)

    # 2. Solution effectiveness
    ax = axes[0, 1]
    if solution_results:
        stable_models = {name: res for name, res in solution_results.items()
                        if res['training_stable']}

        if stable_models:
            model_names = list(stable_models.keys())
            test_mses = [stable_models[name]['test_mse'] for name in model_names]

            bars = ax.bar(range(len(model_names)), test_mses, alpha=0.7,
                         color=['red', 'orange', 'yellow', 'lightgreen', 'green'])
            ax.set_xticks(range(len(model_names)))
            ax.set_xticklabels(model_names, rotation=45, ha='right')
            ax.set_ylabel('Test MSE')
            ax.set_title('Solution Effectiveness')

            # Add value labels
            for bar, value in zip(bars, test_mses):
                ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
                        f'{value:.3f}', ha='center', va='bottom', fontsize=9)

    # 3. Training curves comparison
    ax = axes[1, 0]
    if solution_histories:
        for name, history in solution_histories.items():
            if 'val_loss' in history.history:
                ax.plot(history.history['val_loss'], label=name, linewidth=2)

        ax.set_title('Training Curves Comparison')
        ax.set_xlabel('Epoch')
        ax.set_ylabel('Validation Loss')
        ax.set_yscale('log')
        ax.legend()
        ax.grid(True, alpha=0.3)

    # 4. Solution summary table
    ax = axes[1, 1]
    ax.axis('off')

    # Create summary of solutions
    solution_summary = [
        ['Problem', 'Primary Solution', 'Secondary Solutions'],
        ['Vanishing Gradients', 'LSTM/GRU', 'Layer Norm, Residual Conn.'],
        ['Exploding Gradients', 'Gradient Clipping', 'Lower LR, Better Init'],
        ['Long-term Deps', 'LSTM/GRU', 'Attention, Memory Networks'],
        ['Complexity', '1D CNNs', 'Truncated BPTT, Parallelism'],
        ['Overfitting', 'Dropout', 'Regularization, Early Stop']
    ]

    table = ax.table(
        cellText=solution_summary[1:],
        colLabels=solution_summary[0],
        cellLoc='center',
        loc='center'
    )
    table.auto_set_font_size(False)
    table.set_fontsize(9)
    table.scale(1.2, 2)

    # Style the header
    for i in range(len(solution_summary[0])):
        table[(0, i)].set_facecolor('#40466e')
        table[(0, i)].set_text_props(weight='bold', color='white')

    ax.set_title('Solution Summary', fontsize=14, fontweight='bold', pad=20)

    plt.tight_layout()
    plt.show()

    # Print detailed explanation for each challenge
    for challenge_name, challenge_info in challenges.items():
        print(f"\n{challenge_name.upper()}")
        print("=" * len(challenge_name))
        print(f"Description: {challenge_info['description']}")
        print(f"Mathematical Cause: {challenge_info['mathematical_cause']}")
        print(f"Formula: {challenge_info['formula']}")
        print("\nEffects:")
        for effect in challenge_info['effects']:
            print(f"  • {effect}")
        print("\nSolutions:")
        for solution in challenge_info['solutions']:
            print(f"  • {solution}")

    return challenges, solution_results

# Run Exercise 5
exercise_5_results = demonstrate_rnn_training_challenges()
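
The `clipnorm=1.0` setting used for the 'RNN + Grad Clipping' model rescales a gradient whose L2 norm exceeds the threshold while keeping its direction. The sketch below illustrates that rule on a single gradient vector with plain NumPy. The helper name `clip_by_norm` is hypothetical, and this is an illustration of the idea rather than the Keras implementation.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm (direction is preserved)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

exploding = np.array([30.0, -40.0])   # L2 norm = 50
small = np.array([0.3, -0.4])         # L2 norm = 0.5

clipped = clip_by_norm(exploding, max_norm=1.0)    # rescaled to norm 1
untouched = clip_by_norm(small, max_norm=1.0)      # already below the threshold

print(clipped, np.linalg.norm(clipped))
print(untouched)
```

Because the vector is rescaled rather than truncated component by component, the update still points in the original gradient direction, only with a bounded step size.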

### Exercise 6: LSTM Cell Architecture

**Question**: Can you sketch the LSTM cell's architecture?

#### Theoretical Foundation

The LSTM (Long Short-Term Memory) cell is designed to mitigate the vanishing gradient problem: gates regulate what is written to, kept in, and read from a separate cell state.

#### Mathematical Components

**Gate Equations:**
$$\mathbf{f}^{(t)} = \sigma(\mathbf{W}_f \cdot [\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)}] + \mathbf{b}_f) \quad \text{(Forget Gate)}$$
$$\mathbf{i}^{(t)} = \sigma(\mathbf{W}_i \cdot [\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)}] + \mathbf{b}_i) \quad \text{(Input Gate)}$$
$$\mathbf{o}^{(t)} = \sigma(\mathbf{W}_o \cdot [\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)}] + \mathbf{b}_o) \quad \text{(Output Gate)}$$

**Candidate Values:**
$$\tilde{\mathbf{C}}^{(t)} = \tanh(\mathbf{W}_C \cdot [\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)}] + \mathbf{b}_C)$$

**State Updates:**
$$\mathbf{C}^{(t)} = \mathbf{f}^{(t)} * \mathbf{C}^{(t-1)} + \mathbf{i}^{(t)} * \tilde{\mathbf{C}}^{(t)}$$
$$\mathbf{h}^{(t)} = \mathbf{o}^{(t)} * \tanh(\mathbf{C}^{(t)})$$

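Where $*$ denotes element-wise multiplication, $\sigma$ is the logistic sigmoid, and $[\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)}]$ is the concatenation of the previous hidden state and the current input.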
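
To make the equations concrete, here is a minimal NumPy sketch of a single LSTM time step for one example. The function name `lstm_step`, the dictionary layout of the weights, and the toy sizes are assumptions chosen for illustration; the arithmetic follows the gate, candidate, and state-update equations above term by term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step for a single example.

    W maps gate name -> matrix of shape (units, units + input_dim),
    b maps gate name -> vector of shape (units,).
    """
    concat = np.concatenate([h_prev, x_t])          # [h^(t-1), x^(t)]
    f = sigmoid(W["f"] @ concat + b["f"])           # forget gate
    i = sigmoid(W["i"] @ concat + b["i"])           # input gate
    o = sigmoid(W["o"] @ concat + b["o"])           # output gate
    c_tilde = np.tanh(W["C"] @ concat + b["C"])     # candidate values
    c_t = f * c_prev + i * c_tilde                  # element-wise cell update
    h_t = o * np.tanh(c_t)                          # new hidden state
    return h_t, c_t

# Toy sizes and random weights (illustrative only)
rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = {g: rng.normal(scale=0.5, size=(units, units + input_dim)) for g in "fioC"}
b = {g: np.zeros(units) for g in "fioC"}

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):         # a 5-step toy sequence
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)
```

Note that every entry of `h` stays strictly between -1 and 1, because it is a sigmoid output multiplied by a tanh output, whereas the cell state `c` is not bounded in that way.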

In [None]:
# Exercise 6: LSTM Cell Architecture Sketch and Implementation

def sketch_lstm_architecture():
    """
    Creates detailed visual sketch and explanation of LSTM architecture.
    """
    print("EXERCISE 6: LSTM CELL ARCHITECTURE")
    print("=" * 50)

    # Create comprehensive LSTM visualization
    fig, axes = plt.subplots(2, 2, figsize=(18, 12))

    # 1. Complete LSTM Cell Architecture
    ax = axes[0, 0]
    ax.set_title('LSTM Cell Architecture', fontsize=16, fontweight='bold')

    # Cell state line (top)
    ax.arrow(0.05, 0.8, 0.9, 0, head_width=0.02, head_length=0.02,
             fc='blue', ec='blue', linewidth=4)
    ax.text(0.5, 0.85, 'Cell State C(t-1) → C(t)', ha='center', fontsize=12,
           color='blue', fontweight='bold')

    # Gates and operations
    gates = [
        (0.15, 0.5, 'σ', 'Forget\nGate', 'lightcoral'),
        (0.35, 0.5, 'σ', 'Input\nGate', 'lightgreen'),
        (0.55, 0.5, 'tanh', 'Candidate\nValues', 'lightyellow'),
        (0.85, 0.5, 'σ', 'Output\nGate', 'lightblue')
    ]

    for x, y, symbol, label, color in gates:
        # Gate box
        rect = plt.Rectangle((x-0.04, y-0.08), 0.08, 0.16,
                           facecolor=color, edgecolor='black', linewidth=2)
        ax.add_patch(rect)
        ax.text(x, y+0.03, symbol, ha='center', va='center', fontsize=14, fontweight='bold')
        ax.text(x, y-0.05, label, ha='center', va='center', fontsize=8, fontweight='bold')

    # Multiplication and addition operations
    ax.text(0.25, 0.65, '×', ha='center', fontsize=24, fontweight='bold')  # Forget operation
    ax.text(0.45, 0.65, '×', ha='center', fontsize=24, fontweight='bold')  # Input operation
    ax.text(0.55, 0.8, '+', ha='center', fontsize=24, fontweight='bold', color='blue')  # Add to cell state
    ax.text(0.85, 0.65, '×', ha='center', fontsize=24, fontweight='bold')  # Output operation

    # Input and previous hidden state
    ax.arrow(0.15, 0.2, 0, 0.22, head_width=0.02, head_length=0.02, fc='black')
    ax.arrow(0.35, 0.2, 0, 0.22, head_width=0.02, head_length=0.02, fc='black')
    ax.arrow(0.55, 0.2, 0, 0.22, head_width=0.02, head_length=0.02, fc='black')
    ax.arrow(0.85, 0.2, 0, 0.22, head_width=0.02, head_length=0.02, fc='black')

    ax.text(0.45, 0.1, 'x(t), h(t-1)', ha='center', fontsize=12, fontweight='bold')

    # Output
    ax.arrow(0.85, 0.73, 0, 0.1, head_width=0.02, head_length=0.02, fc='red', linewidth=3)
    ax.text(0.9, 0.9, 'h(t)', ha='center', fontsize=12, color='red', fontweight='bold')

    # tanh for output
    ax.text(0.75, 0.65, 'tanh', ha='center', fontsize=10,
           bbox=dict(boxstyle="round,pad=0.2", facecolor='yellow', alpha=0.7))

    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis('off')

    # 2. Step-by-step computation flow
    ax = axes[0, 1]
    ax.set_title('LSTM Computation Steps', fontsize=14, fontweight='bold')

    steps = [
        '1. Forget Gate: f(t) = σ(Wf·[h(t-1), x(t)] + bf)',
        '2. Input Gate: i(t) = σ(Wi·[h(t-1), x(t)] + bi)',
        '3. Candidate: C̃(t) = tanh(WC·[h(t-1), x(t)] + bC)',
        '4. Update Cell: C(t) = f(t)⊙C(t-1) + i(t)⊙C̃(t)',
        '5. Output Gate: o(t) = σ(Wo·[h(t-1), x(t)] + bo)',
        '6. Hidden State: h(t) = o(t)⊙tanh(C(t))'
    ]

    colors = ['lightcoral', 'lightgreen', 'lightyellow', 'lightblue', 'lightcyan', 'lightpink']

    for i, (step, color) in enumerate(zip(steps, colors)):
        y_pos = 0.9 - i * 0.13
        rect = plt.Rectangle((0.05, y_pos-0.05), 0.9, 0.1,
                           facecolor=color, edgecolor='black', alpha=0.7)
        ax.add_patch(rect)
        ax.text(0.5, y_pos, step, ha='center', va='center', fontsize=10,
               fontweight='bold', fontfamily='monospace')

    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis('off')

    # 3. Information flow diagram
    ax = axes[1, 0]
    ax.set_title('Information Flow in LSTM', fontsize=14, fontweight='bold')

    # Draw information pathways
    # Long-term memory (cell state)
    ax.arrow(0.1, 0.8, 0.8, 0, head_width=0.03, head_length=0.03,
             fc='blue', ec='blue', linewidth=6, alpha=0.7)
    ax.text(0.5, 0.85, 'Long-term Memory (Cell State)', ha='center',
           fontsize=12, color='blue', fontweight='bold')

    # Short-term memory (hidden state)
    ax.arrow(0.1, 0.5, 0.8, 0, head_width=0.03, head_length=0.03,
             fc='red', ec='red', linewidth=4, alpha=0.7)
    ax.text(0.5, 0.55, 'Short-term Memory (Hidden State)', ha='center',
           fontsize=12, color='red', fontweight='bold')

    # Input flow
    ax.arrow(0.5, 0.1, 0, 0.3, head_width=0.03, head_length=0.03,
             fc='green', ec='green', linewidth=3, alpha=0.7)
    ax.text(0.55, 0.25, 'Input x(t)', ha='left', fontsize=12,
           color='green', fontweight='bold')

    # Gate controls
    gate_positions = [0.2, 0.4, 0.6, 0.8]
    gate_names = ['Forget', 'Input', 'Candidate', 'Output']
    gate_colors = ['coral', 'lightgreen', 'yellow', 'lightblue']

    for pos, name, color in zip(gate_positions, gate_names, gate_colors):
        circle = plt.Circle((pos, 0.65), 0.03, color=color, alpha=0.8)
        ax.add_patch(circle)
        ax.text(pos, 0.7, name, ha='center', fontsize=8, fontweight='bold')

    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis('off')

    # 4. Comparison with Simple RNN
    ax = axes[1, 1]
    ax.set_title('LSTM vs Simple RNN', fontsize=14, fontweight='bold')

    # Create comparison table
    comparison_data = [
        ['Aspect', 'Simple RNN', 'LSTM'],
        ['Memory', 'Only hidden state', 'Cell state + hidden state'],
        ['Gates', 'None', 'Forget, Input, Output'],
        ['Gradient Flow', 'Vanishing problem', 'Controlled by gates'],
        ['Long-term Deps', 'Poor', 'Much better'],
        ['Parameters', 'Fewer', 'More (4x weight matrices)'],
        ['Computation', 'Faster', 'Slower but more capable']
    ]

    table = ax.table(
        cellText=comparison_data[1:],
        colLabels=comparison_data[0],
        cellLoc='center',
        loc='center'
    )
    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1.2, 2.5)

    # Style the table
    for i in range(len(comparison_data[0])):
        table[(0, i)].set_facecolor('#40466e')
        table[(0, i)].set_text_props(weight='bold', color='white')

    # Color code the comparison
    for i in range(1, len(comparison_data)):
        table[(i, 1)].set_facecolor('#ffcccc')  # Simple RNN - light red
        table[(i, 2)].set_facecolor('#ccffcc')  # LSTM - light green

    ax.axis('off')

    plt.tight_layout()
    plt.show()

    # Detailed mathematical explanation
    print("\nDETAILED MATHEMATICAL EXPLANATION:")
    print("=" * 50)

    print("\n1. GATE COMPUTATIONS:")
    print("   Forget Gate:    f(t) = σ(Wf·[h(t-1), x(t)] + bf)")
    print("   Input Gate:     i(t) = σ(Wi·[h(t-1), x(t)] + bi)")
    print("   Output Gate:    o(t) = σ(Wo·[h(t-1), x(t)] + bo)")
    print("   Candidate:      C̃(t) = tanh(WC·[h(t-1), x(t)] + bC)")

    print("\n2. STATE UPDATES:")
    print("   Cell State:     C(t) = f(t) ⊙ C(t-1) + i(t) ⊙ C̃(t)")
    print("   Hidden State:   h(t) = o(t) ⊙ tanh(C(t))")

    print("\n3. KEY INNOVATIONS:")
    print("   • Separate cell state preserves long-term information")
    print("   • Forget gate selectively removes irrelevant information")
    print("   • Input gate controls what new information to store")
    print("   • Output gate controls what parts of cell state to reveal")
    print("   • Along the cell state, the gradient is scaled element-wise by the forget gate")
    print("     rather than by a repeated weight-matrix product")

    print("\n4. INTUITIVE UNDERSTANDING:")
    print("   • Cell state = long-term memory (conveyor belt)")
    print("   • Hidden state = working memory (what you're thinking about)")
    print("   • Gates = smart controllers deciding information flow")
    print("   • Mitigates vanishing gradients through controlled gradient flow")

def implement_custom_lstm():
    """
    Implements a custom LSTM cell to demonstrate the architecture.
    """
    print("\nCUSTOM LSTM IMPLEMENTATION:")
    print("=" * 40)

    class CustomLSTMCell(keras.layers.Layer):
        """
        Custom LSTM cell implementation for educational purposes.
        """
        def __init__(self, units, **kwargs):
            super().__init__(**kwargs)
            self.units = units
            self.state_size = [units, units]  # [hidden_state, cell_state]
            self.output_size = units

        def build(self, input_shape):
            input_dim = input_shape[-1]

            # Weight matrices for gates
            self.W_f = self.add_weight(shape=(input_dim + self.units, self.units),
                                      initializer='glorot_uniform', name='W_f')
            self.W_i = self.add_weight(shape=(input_dim + self.units, self.units),
                                      initializer='glorot_uniform', name='W_i')
            self.W_o = self.add_weight(shape=(input_dim + self.units, self.units),
                                      initializer='glorot_uniform', name='W_o')
            self.W_c = self.add_weight(shape=(input_dim + self.units, self.units),
                                      initializer='glorot_uniform', name='W_c')

            # Bias vectors
            self.b_f = self.add_weight(shape=(self.units,), initializer='zeros', name='b_f')
            self.b_i = self.add_weight(shape=(self.units,), initializer='zeros', name='b_i')
            self.b_o = self.add_weight(shape=(self.units,), initializer='zeros', name='b_o')
            self.b_c = self.add_weight(shape=(self.units,), initializer='zeros', name='b_c')

            super().build(input_shape)

        def call(self, inputs, states):
            h_prev, c_prev = states

            # Concatenate input and previous hidden state
            combined = tf.concat([inputs, h_prev], axis=1)

            # Compute gates
            f_gate = tf.sigmoid(tf.matmul(combined, self.W_f) + self.b_f)
            i_gate = tf.sigmoid(tf.matmul(combined, self.W_i) + self.b_i)
            o_gate = tf.sigmoid(tf.matmul(combined, self.W_o) + self.b_o)
            c_candidate = tf.tanh(tf.matmul(combined, self.W_c) + self.b_c)

            # Update cell state
            c_new = f_gate * c_prev + i_gate * c_candidate

            # Compute new hidden state
            h_new = o_gate * tf.tanh(c_new)

            return h_new, [h_new, c_new]

    # Test the custom LSTM
    print("Testing custom LSTM implementation...")

    # Create test data
    batch_size = 4
    seq_length = 10
    input_size = 5
    hidden_size = 8

    test_input = tf.random.normal((batch_size, seq_length, input_size))

    # Build model with custom LSTM
    model = keras.Sequential([
        keras.layers.RNN(CustomLSTMCell(hidden_size), return_sequences=True),
        keras.layers.Dense(1)
    ])

    # Test forward pass
    output = model(test_input)
    print(f"Custom LSTM output shape: {output.shape}")
    print(f"Expected shape: ({batch_size}, {seq_length}, 1)")

    # Compare with built-in LSTM
    builtin_model = keras.Sequential([
        keras.layers.LSTM(hidden_size, return_sequences=True),
        keras.layers.Dense(1)
    ])

    builtin_output = builtin_model(test_input)
    print(f"Built-in LSTM output shape: {builtin_output.shape}")

    print("\nCustom LSTM implementation successful!")
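    # Both counts cover the whole model, i.e. the recurrent layer plus the Dense head
    print(f"Custom LSTM model parameters (incl. Dense): {model.count_params():,}")
    print(f"Built-in LSTM model parameters (incl. Dense): {builtin_model.count_params():,}")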

    return model, builtin_model

# Run Exercise 6
sketch_lstm_architecture()
custom_lstm, builtin_lstm = implement_custom_lstm()
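
The parameter counts printed by the test above can be checked by hand. The custom cell creates four weight matrices of shape `(input_dim + units, units)` and four bias vectors of length `units`. The following sketch computes that count in plain Python; the helper names are hypothetical, and the sizes (5 input features, 8 units, a `Dense(1)` head) are the ones used in the test.

```python
def simple_rnn_param_count(input_dim, units):
    """One weight matrix over [h, x] plus one bias vector."""
    return (input_dim + units) * units + units

def lstm_param_count(input_dim, units):
    """Four blocks (forget, input, output, candidate), each sized like a simple RNN."""
    return 4 * simple_rnn_param_count(input_dim, units)

def dense_param_count(in_features, out_features):
    return in_features * out_features + out_features

rnn_params = simple_rnn_param_count(input_dim=5, units=8)
lstm_params = lstm_param_count(input_dim=5, units=8)
total_params = lstm_params + dense_param_count(8, 1)

print(rnn_params, lstm_params, total_params)  # 112 448 457
```

This is where the "4x" entry in the LSTM vs Simple RNN table comes from: each of the three gates and the candidate computation needs its own set of weights.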

### Exercise 7: 1D Convolutional Layers in RNNs

**Question**: Why would you want to use 1D convolutional layers in an RNN?

#### Theoretical Justification

Combining 1D CNNs with RNNs provides several advantages:

**1. Computational Efficiency**
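- **Parallelization**: A convolutional layer has no recurrence, so all time steps can be computed in parallel
- **Speed**: With a stride greater than 1 (or pooling), the CNN shortens the sequence before the RNN processes it
- **Memory**: A shorter sequence means fewer RNN steps to unroll and store during backpropagation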

**2. Feature Extraction**
- **Local Patterns**: CNNs excel at detecting local temporal patterns
- **Translation Equivariance**: The same filter detects a pattern wherever it occurs in the sequence
- **Hierarchical Features**: Multiple layers build complex representations

**3. Preprocessing Benefits**
- **Noise Reduction**: Convolutional filters can smooth noisy sequences
- **Downsampling**: Shortens the sequence while keeping the most useful information, which helps the RNN detect longer patterns
- **Feature Engineering**: Automatically learns relevant features

**Mathematical Framework:**
For a 1D convolution followed by RNN:
$$\mathbf{z}^{(t)} = \text{Conv1D}(\mathbf{x}^{(t-k+1:t)})$$
$$\mathbf{h}^{(t)} = \text{RNN}(\mathbf{z}^{(t)}, \mathbf{h}^{(t-1)})$$

Where the CNN extracts local features $\mathbf{z}^{(t)}$ from a window of $k$ consecutive inputs ($k$ is the kernel size), which the RNN then processes.
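
The downsampling effect can be illustrated without any deep learning library. The sketch below applies a single 1D filter with 'valid' padding and a stride of 2 to a toy sequence. The function name `conv1d_valid`, the moving-average kernel, and the sequence itself are assumptions for illustration; a trained `Conv1D` layer would learn its kernels and usually has many filters.

```python
import numpy as np

def conv1d_valid(x, kernel, stride=1):
    """Single-filter 1D convolution (cross-correlation, as in deep learning
    libraries) with 'valid' padding. x: (seq_len,), kernel: (k,)."""
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[j * stride : j * stride + k], kernel)
                     for j in range(out_len)])

x = np.arange(100, dtype=float)          # a toy sequence of 100 time steps
kernel = np.ones(4) / 4.0                # moving-average filter, k = 4

z = conv1d_valid(x, kernel, stride=2)
print(len(x), "->", len(z))              # 100 -> 49
```

With a kernel size of 4 and a stride of 2, the output length is `(100 - 4) // 2 + 1 = 49`, so an RNN placed after this layer would unroll over 49 steps instead of 100, and each step already summarizes a window of 4 inputs.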