In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from IPython.display import display, Markdown, Image

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14, 'figure.figsize': (12, 8), 'figure.dpi': 150})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg): display(Markdown(f"<div class='alert alert-block alert-info'>📝 **Note:** {msg}</div>"))
def sec(title): print(f"\n{80*'='}\n| {title.upper()} |\n{80*'='}")

note("Environment initialized for LSTMs and GRUs.")

# Chapter 7.9: LSTMs and GRUs: Architectures for Long-Term Memory

---

### Table of Contents

1.  [**Introduction: Solving the Vanishing Gradient Problem**](#intro)
2.  [**Long Short-Term Memory (LSTM)**](#lstm)
    - [Intuition: The Cell State as a Conveyor Belt](#lstm-intuition)
    - [The Gating Mechanism: A Formal View](#lstm-math)
3.  [**Gated Recurrent Unit (GRU)**](#gru)
    - [A Simpler Gating Architecture](#gru-math)
4.  [**Practical Considerations for Training**](#practical)
5.  [**Applications**](#applications)
    - [Application 1: Sentiment Analysis of Financial News](#app-nlp)
    - [Application 2: Macroeconomic Forecasting](#app-macro)
6.  [**A From-Scratch LSTM Cell**](#scratch)
7.  [**Exercises**](#exercises)
8.  [**Summary and Key Takeaways**](#summary)

<a id='intro'></a>
## 1. Introduction: Solving the Vanishing Gradient Problem

Simple Recurrent Neural Networks (RNNs), while elegant, suffer from the **vanishing gradient problem**, which makes it very hard for them to learn long-range dependencies in a sequence. During backpropagation through time, the gradient is multiplied by the recurrent weight matrix (and by activation derivatives no larger than 1) once per time step. When those factors are contractive, gradients from distant time steps shrink exponentially toward zero, which in effect gives the network a very short memory.
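
The shrinkage can be seen numerically. The sketch below is only an illustration under stated assumptions: the matrix `W`, its size, and the 0.9 scaling are hypothetical choices, and the activation derivative is left out (for `tanh` it is at most 1, so including it could only shrink the gradient further).

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16

# Hypothetical recurrent weight matrix, rescaled so its largest singular value is 0.9.
W = rng.standard_normal((hidden, hidden))
W *= 0.9 / np.linalg.norm(W, 2)

grad = np.ones(hidden)  # stand-in for the loss gradient at the final time step
norms = []
for _ in range(50):
    grad = W.T @ grad   # one backward step through time (activation derivative omitted)
    norms.append(np.linalg.norm(grad))

print(f"gradient norm after 1 step:   {norms[0]:.3e}")
print(f"gradient norm after 50 steps: {norms[-1]:.3e}")
```

Because every backward step scales the norm by at most 0.9 here, the signal from 50 steps back is bounded by $0.9^{50}$ times its starting size.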

This chapter explores the two dominant architectures designed to solve this problem: **Long Short-Term Memory (LSTM)** and **Gated Recurrent Units (GRU)**. These models introduce **gating mechanisms**—neural networks within the recurrent cell that learn to control the flow of information. By learning when to remember, when to forget, and when to output information, these gated architectures can maintain a memory over much longer time horizons, revolutionizing sequence modeling in fields from natural language processing to financial econometrics.

<a id='lstm'></a>
## 2. Long Short-Term Memory (LSTM)

<a id='lstm-intuition'></a>
### Intuition: The Cell State as a Conveyor Belt

LSTMs (Hochreiter & Schmidhuber, 1997) solve the vanishing gradient problem by introducing an explicit **cell state ($c_t$)**, which acts as a "conveyor belt" of information. The LSTM can read from, write to, and erase information from this cell state. The core innovation is that information flow along the cell state is primarily **additive**, which allows gradients to flow much more easily through time without being repeatedly diminished by matrix multiplication. This flow is controlled by three carefully designed **gates**.

![LSTM Diagram](../images/png/lstm_diagram.png)
*<center><b>Figure 1: The architecture of an LSTM cell.</b> The cell state ($c_t$) runs along the top, with gates controlling the information flow. (Source: Christopher Olah's Blog)</center>*

<a id='lstm-math'></a>
### The Gating Mechanism: A Formal View
Each gate is a sigmoid layer: an affine transformation of the previous hidden state $\mathbf{h}_{t-1}$ concatenated with the current input $\mathbf{x}_t$, followed by the sigmoid function $\sigma(\cdot)$. Its outputs lie between 0 (let nothing through) and 1 (let everything through).

1.  **Forget Gate ($f_t$):** Decides what fraction of the *previous* cell state, $c_{t-1}$, to forget. A value of 1 means "keep everything," and 0 means "forget everything."
    $$ f_t = \sigma(W_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + b_f) $$ 

2.  **Input Gate ($i_t$):** Decides what new information to store in the cell state. This has two parts: the sigmoid layer decides *which values* to update, and a `tanh` layer creates a vector of new candidate values, $\tilde{c}_t$.
    $$ i_t = \sigma(W_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + b_i) $$ 
    $$ \tilde{c}_t = \tanh(W_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + b_c) $$ 

3.  **Cell State Update:** The old cell state is updated to the new cell state. The old state is element-wise multiplied by the forget gate, and then we add the new candidate values, scaled by the input gate.
    $$ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t $$ 
    This additive interaction is the key. It creates a direct path for gradients to flow through time, largely unimpeded.

4.  **Output Gate ($o_t$):** Decides what to output as the new hidden state, $h_t$. The output is a filtered version of the cell state, passed through a `tanh` to scale it between -1 and 1.
    $$ o_t = \sigma(W_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + b_o) $$ 
    $$ h_t = o_t \odot \tanh(c_t) $$
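
The four steps above can be sketched as a single NumPy function. This is a minimal illustration rather than a reference implementation: the `params` dictionary, the helper names, and the small random weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step. params maps a block name to (W, b), with W of shape (n_hid, n_hid + n_in)."""
    concat = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    f_t = sigmoid(params["f"][0] @ concat + params["f"][1])      # forget gate
    i_t = sigmoid(params["i"][0] @ concat + params["i"][1])      # input gate
    c_tilde = np.tanh(params["c"][0] @ concat + params["c"][1])  # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                           # additive cell state update
    o_t = sigmoid(params["o"][0] @ concat + params["o"][1])      # output gate
    h_t = o_t * np.tanh(c_t)                                     # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4  # hypothetical sizes
params = {name: (0.1 * rng.standard_normal((n_hid, n_hid + n_in)), np.zeros(n_hid))
          for name in "fico"}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):  # a toy sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

Each line of `lstm_step` maps onto one of the equations above, so the function can be used to check intermediate values by hand on small inputs.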

<a id='gru'></a>
## 3. Gated Recurrent Unit (GRU)

The Gated Recurrent Unit (GRU), introduced by Cho et al. in 2014, is a popular and effective simplification of the LSTM. It merges the cell state and hidden state into a single state vector $\mathbf{h}_t$ and uses only two gates, making it computationally more efficient.

![GRU Diagram](../images/png/gru_diagram_1.png)
*<center><b>Figure 2: The architecture of a GRU cell.</b> It uses a reset gate and an update gate to control the information flow.</center>*

<a id='gru-math'></a>
### A Simpler Gating Architecture

The GRU's two gates are:

1.  **Reset Gate ($r_t$):** This gate determines how much of the past hidden state, $h_{t-1}$, to forget when proposing a new candidate state, $\tilde{h}_t$.
    $$ r_t = \sigma(W_r [\mathbf{h}_{t-1}, \mathbf{x}_t] + b_r) $$
    $$ \tilde{h}_t = \tanh(W_h [r_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + b_h) $$

2.  **Update Gate ($z_t$):** This gate plays the combined role of the LSTM's forget and input gates. It interpolates between the previous hidden state $h_{t-1}$ and the candidate state $\tilde{h}_t$: where $z_t$ is near 0 the old state is carried forward unchanged, and where $z_t$ is near 1 it is replaced by the candidate.
    $$ z_t = \sigma(W_z [\mathbf{h}_{t-1}, \mathbf{x}_t] + b_z) $$
    $$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$
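
As a rough sketch, the GRU equations above translate to NumPy as follows. The `params` layout, helper names, and random weights are hypothetical and chosen only to mirror the notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step. params maps a block name to (W, b), with W of shape (n_hid, n_hid + n_in)."""
    concat = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    r_t = sigmoid(params["r"][0] @ concat + params["r"][1])    # reset gate
    z_t = sigmoid(params["z"][0] @ concat + params["z"][1])    # update gate
    concat_reset = np.concatenate([r_t * h_prev, x_t])         # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(params["h"][0] @ concat_reset + params["h"][1])
    return (1.0 - z_t) * h_prev + z_t * h_tilde                # interpolate old state and candidate

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4  # hypothetical sizes
params = {name: (0.1 * rng.standard_normal((n_hid, n_hid + n_in)), np.zeros(n_hid))
          for name in "rzh"}

h = np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):  # a toy sequence of 5 time steps
    h = gru_step(x_t, h, params)
print(h.shape)
```

There is only one state vector to carry between steps, and the final line is a per-unit convex combination of the old state and the candidate.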

LSTMs, with a separate cell state and an additional gate, have somewhat more capacity per unit. GRUs have fewer parameters, are typically faster to train, and often perform comparably, particularly on smaller datasets.
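
To make the parameter difference concrete, the counts implied by the equations in this chapter can be computed directly. The function names are hypothetical, and library implementations may count slightly differently (for example, variants that keep separate input and recurrent biases).

```python
def lstm_params(n_in, n_hid):
    # Four weight blocks (forget, input, candidate, output),
    # each with W of shape (n_hid, n_hid + n_in) plus a bias of length n_hid.
    return 4 * (n_hid * (n_hid + n_in) + n_hid)

def gru_params(n_in, n_hid):
    # Three weight blocks (reset, update, candidate) of the same shape.
    return 3 * (n_hid * (n_hid + n_in) + n_hid)

n_in, n_hid = 64, 64  # hypothetical layer sizes
print(lstm_params(n_in, n_hid), gru_params(n_in, n_hid))  # prints: 33024 24768
```

Under these formulas a GRU layer always has three quarters of the parameters of an LSTM layer with the same input and hidden sizes.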

<a id='practical'></a>
## 4. Practical Considerations for Training
Effectively training deep recurrent models often requires additional techniques beyond standard optimizers.

**Layer Normalization:** Standard batch normalization is problematic in RNNs because the statistics (mean/variance) change at each time step. **Layer Normalization** normalizes the inputs across the features dimension for each time step independently. This stabilizes the hidden-to-hidden dynamics and can significantly speed up training.
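
A minimal NumPy sketch of the normalization step is shown below. The names `gamma` and `beta` stand for the learnable gain and bias, and the tensor shape is an assumption made for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last (feature) axis, separately for every sample and time step."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((2, 5, 8)) + 1.0  # (batch, time, features), hypothetical shape
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))

# Every (sample, time step) slice now has roughly zero mean and unit variance.
print(np.abs(out.mean(axis=-1)).max(), out.std(axis=-1).min())
```

Because the statistics are computed per time step from the features alone, the result does not depend on batch size or sequence length.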

**Recurrent Dropout:** Applying standard dropout to the recurrent connections can harm the network's ability to retain long-term memory. A more effective technique is to apply the *same* dropout mask at every time step for the recurrent connections. This effectively drops out a consistent set of connections for the entire sequence, preventing information loss while still providing regularization.
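
The difference between the two masking schemes can be illustrated with a simplified picture in which a unit carries information across the sequence only if it is kept at every step. The sizes and dropout rate below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_hid, rate = 6, 8, 0.25  # hypothetical sequence length, state size, dropout rate
keep_prob = 1.0 - rate

# Standard dropout: a fresh Bernoulli mask is drawn at every time step.
per_step_masks = rng.random((T, n_hid)) < keep_prob

# Recurrent dropout as described above: one mask, reused at every time step.
shared_mask = rng.random(n_hid) < keep_prob
shared_masks = np.tile(shared_mask, (T, 1))

# Count the units that are kept at all T steps under each scheme.
survive_per_step = per_step_masks.all(axis=0).sum()
survive_shared = shared_masks.all(axis=0).sum()
print("units kept at all steps, per-step masks:", survive_per_step)
print("units kept at all steps, shared mask:   ", survive_shared)
```

In expectation, a fraction `keep_prob ** T` (about 0.18 here) of units survives every step with per-step masks, against `keep_prob` (0.75) with a shared mask.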

<a id='applications'></a>
## 5. Applications

<a id='app-nlp'></a>
### Application 1: Sentiment Analysis of Financial News

In [None]:
sec("Sentiment Analysis of Financial News with LSTMs")

# 1. Load and preprocess data
try:
    df = pd.read_csv('../data/SEntFiN.csv')
    df['sentiment'] = df['Decisions'].apply(lambda x: x.split('@')[0].strip().lower())
    df = df[['Title', 'sentiment']].rename(columns={'Title': 'headline'})
    df = df[df['sentiment'].isin(['positive', 'negative', 'neutral'])]
    note(f"Loaded {len(df)} headlines.")

    # 2. Prepare data for Keras
    sentences = df['headline'].values
    labels = df['sentiment'].values

    X_train, X_test, y_train, y_test = train_test_split(sentences, labels, test_size=0.2, random_state=42)

    le = LabelEncoder()
    y_train_enc = le.fit_transform(y_train)
    y_test_enc = le.transform(y_test)

    tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
    tokenizer.fit_on_texts(X_train)
    X_train_seq = tokenizer.texts_to_sequences(X_train)
    X_test_seq = tokenizer.texts_to_sequences(X_test)
    X_train_pad = pad_sequences(X_train_seq, maxlen=50, padding='post', truncating='post')
    X_test_pad = pad_sequences(X_test_seq, maxlen=50, padding='post', truncating='post')

    # 3. Build and train the LSTM model
    note("Building and training a robust sentiment classification model with LSTMs.")
    embedding_dim = 64
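    # With num_words set, the tokenizer never emits ids >= num_words, so cap the embedding size.
    vocab_size = min(tokenizer.num_words, len(tokenizer.word_index) + 1)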

    nlp_model = keras.models.Sequential([
        keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=50),
        keras.layers.Bidirectional(keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2)),
        keras.layers.Dense(3, activation='softmax')
    ])
    nlp_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    history_nlp = nlp_model.fit(X_train_pad, y_train_enc, epochs=5, 
                                validation_data=(X_test_pad, y_test_enc), verbose=1)

except FileNotFoundError:
    note("SEntFiN.csv not found in '../data/' directory. Skipping NLP example.")

<a id='app-macro'></a>
### Application 2: Macroeconomic Forecasting
LSTMs are also powerful tools for multivariate time series forecasting, a common task in macroeconomics. Here, we'll build a model to forecast US GDP growth using other key indicators like inflation and unemployment.

In [None]:
sec("Macroeconomic Forecasting with LSTMs")

# This example requires the 'pandas_datareader' library
# (install it with: pip install pandas_datareader).
try:
    import pandas_datareader.data as web
    note("Downloading quarterly macroeconomic data from FRED...")
    
    # Fetch data from FRED
    start_date = '1960-01-01'
    end_date = '2022-12-31'
    gdp = web.DataReader('GDPC1', 'fred', start_date, end_date)
    cpi = web.DataReader('CPIAUCSL', 'fred', start_date, end_date)
    unrate = web.DataReader('UNRATE', 'fred', start_date, end_date)
    
    # Calculate growth rates and combine data
    gdp_growth = gdp.pct_change(4).dropna() * 100 # YoY growth
    inflation = (cpi.pct_change(12).dropna() * 100).resample('QS').mean() # YoY inflation, quarterly average
    unrate_q = unrate.resample('QS').mean() # Convert to quarterly
    
    macro_data = pd.concat([gdp_growth, inflation, unrate_q], axis=1, join='inner').dropna()
    macro_data.columns = ['GDP_Growth', 'Inflation', 'Unemployment']
    
    # --- Prepare data for LSTM ---
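    # Note: the scaler is fit on the full sample for simplicity. For a strict
    # out-of-sample evaluation, fit it on the training portion only.
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(macro_data)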
    
    def create_sequences(data, n_past, n_future):
        X, y = [], []
        for i in range(n_past, len(data) - n_future + 1):
            X.append(data[i - n_past:i, 0:data.shape[1]])
            y.append(data[i:i + n_future, 0]) # Predict GDP growth
        return np.array(X), np.array(y)
        
    n_past = 8 # Use past 8 quarters
    n_future = 4 # Predict next 4 quarters
    X, y = create_sequences(data_scaled, n_past, n_future)
    
    # Train/test split
    split = int(len(X) * 0.8)
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]
    
    # --- Build and Train LSTM Model ---
    note("Building and training a multi-step macro forecasting model.")
    macro_model = keras.models.Sequential([
        keras.layers.LSTM(50, activation='relu', input_shape=(n_past, X.shape[2]), return_sequences=True),
        keras.layers.LSTM(50, activation='relu'),
        keras.layers.Dense(n_future)
    ])
    macro_model.compile(optimizer='adam', loss='mse')
    history_macro = macro_model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test), verbose=0)
    
    note(f"Training complete. Final validation loss (MSE): {history_macro.history['val_loss'][-1]:.4f}")

except ImportError:
    note("Skipping macro example: 'pandas_datareader' is not installed. You can install it with 'pip install pandas_datareader'.")
except Exception as e:
    note(f"Could not download or process FRED data. Skipping example. Error: {e}")

<a id='scratch'></a>
## 6. A From-Scratch LSTM Cell

To fully demystify the LSTM, we can build a custom Keras layer that implements the gating logic from scratch. This makes the complex internal dynamics transparent and shows how the cell state and hidden state are updated at each step.

In [None]:
sec("Custom LSTM Cell Implementation")

class MyLSTMCell(keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.state_size = (units, units)

    def build(self, input_shape):
        self.kernel = self.add_weight(
            shape=(input_shape[-1], self.units * 4), initializer="glorot_uniform", name="kernel")
        self.recurrent_kernel = self.add_weight(
            shape=(self.units, self.units * 4), initializer="orthogonal", name="recurrent_kernel")
        self.bias = self.add_weight(
            shape=(self.units * 4,), initializer="zeros", name="bias")
        self.built = True

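    def call(self, inputs, states):
        h_prev, c_prev = states  # previous hidden state and cell state

        # One fused affine transform yields the pre-activations of all four blocks.
        z = tf.matmul(inputs, self.kernel) + tf.matmul(h_prev, self.recurrent_kernel) + self.bias

        # Split into input gate, candidate, forget gate, and output gate pre-activations.
        i, g, f, o = tf.split(z, num_or_size_splits=4, axis=1)

        input_gate = tf.sigmoid(i)
        forget_gate = tf.sigmoid(f)
        output_gate = tf.sigmoid(o)
        candidate_cell = tf.tanh(g)

        # c_t = f_t * c_{t-1} + i_t * c~_t, then h_t = o_t * tanh(c_t)
        c_new = forget_gate * c_prev + input_gate * candidate_cell
        h_new = output_gate * tf.tanh(c_new)

        # Return the step output and the list of new states.
        return h_new, [h_new, c_new]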

note("Building a model with our custom LSTM layer shows the same structure as the built-in Keras layer.")
vocab_size = 10000 # Placeholder from previous example
custom_lstm_model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size, 128, input_shape=[None]),
    keras.layers.RNN(MyLSTMCell(64), return_sequences=True),
    keras.layers.Dense(1, activation="sigmoid")
])
custom_lstm_model.summary()

<a id='exercises'></a>
## 7. Exercises

1.  **GRU vs. LSTM:** What is the key difference in how the GRU and LSTM control the flow of information from the previous state to the next? Which one is more computationally expensive and why?
2.  **The Forget Gate:** What would happen in an LSTM if the forget gate was permanently stuck at 1? What if it was permanently stuck at 0?
3.  **Recurrent Dropout:** Explain why standard dropout is not typically applied to the recurrent connections in an LSTM or GRU. What problem does recurrent dropout (applying the same mask at each time step) solve?
4.  **Application to Macroeconomics:** How might you use an LSTM to forecast GDP growth using a panel of leading economic indicators? What would be the inputs and outputs of such a model?

<a id='summary'></a>
## 8. Summary and Key Takeaways

This chapter introduced LSTMs and GRUs, two widely used recurrent architectures for sequence modeling, which were designed specifically to overcome the short-term memory limitations of simple RNNs.

**Key Concepts**:
- **Gating Mechanisms**: The core innovation of LSTMs and GRUs is the use of gates (sigmoid-activated layers) that learn to regulate the flow of information. They control when to forget old information and when to incorporate new information.
- **LSTM Architecture**: The LSTM uses three gates (forget, input, output) and a separate **cell state** ($c_t$). The cell state acts as a conveyor belt, allowing information to flow through time via simple additive operations, which mitigates the vanishing gradient problem.
- **GRU Architecture**: The GRU is a simplified alternative that merges the hidden state and cell state and uses only two gates (reset and update). It is computationally cheaper and often performs just as well.
- **Long-Range Dependencies**: By learning to control their memory, LSTMs and GRUs can capture dependencies over much longer time horizons than simple RNNs, which makes them a strong choice for many NLP and time-series tasks.
- **Practical Techniques**: Training deep recurrent models benefits from techniques like **Layer Normalization** (to stabilize hidden dynamics) and **Recurrent Dropout** (to regularize without destroying the memory).

### Solutions to Exercises

---

**1. GRU vs. LSTM:**
The key difference is that the LSTM has a separate cell state ($c_t$) that acts as the primary memory channel, and its flow is controlled by distinct forget and input gates. The GRU combines the cell state and hidden state into a single vector ($h_t$) and uses a single update gate ($z_t$) to control both forgetting the past and incorporating the new candidate state. The LSTM is more computationally expensive because it computes three gates plus a candidate and maintains two state vectors at each step, whereas the GRU computes two gates plus a candidate and maintains one state vector, resulting in fewer parameters and matrix multiplications.

---

**2. The Forget Gate:**
- **Stuck at 1:** If $f_t=1$ always, the LSTM would never forget anything. The cell state $c_t$ would become a running sum of all past gated candidates $i_k \odot \tilde{c}_k$. The network could not discard irrelevant past context, and the cell state values could grow without bound.
- **Stuck at 0:** If $f_t=0$ always, the cell state update would become $c_t = i_t \odot \tilde{c}_t$, so the cell state would carry nothing forward from $c_{t-1}$. The only remaining memory would be the hidden state $h_{t-1}$, which still feeds the gates and the candidate. The LSTM would therefore behave much like a simple RNN and lose the additive gradient path that lets it learn long-range dependencies.

---

**3. Recurrent Dropout:**
Standard dropout randomly sets different neurons to zero at each time step. If applied to the recurrent (hidden-to-hidden) connection, the network would be trying to learn a sequence with a randomly changing memory structure at every step, making it nearly impossible to retain information over time. Recurrent dropout fixes this by applying the *same* dropout mask (i.e., dropping the same set of recurrent connections) for the entire sequence. This allows a stable, albeit reduced, memory to persist, providing regularization without destroying the network's ability to learn long-range dependencies.

---

**4. Application to Macroeconomics:**
You would structure this as a sequence-to-sequence or sequence-to-vector forecasting problem.
- **Inputs:** The input at each time step $t$ would be a vector containing the values of all your leading indicators at that time (e.g., `[unemployment_rate_t, inflation_rate_t, interest_rate_t, ...]`). The full input would be a sequence of these vectors over a lookback window (e.g., the last 24 months).
- **Outputs:** The output could be the GDP growth value for the next quarter ($t+1$). Alternatively, in a sequence-to-sequence model, you could predict the GDP growth for the next several quarters ($t+1, ..., t+4$). The LSTM would learn the complex, dynamic relationships between the indicators and future GDP growth.