# Day 2: Understanding LSTM Networks 🧠

Welcome to Day 2 of 30 Papers in 30 Days!

Today we're diving into **Long Short-Term Memory (LSTM) networks** - one of the most important breakthroughs in deep learning. LSTMs address a critical problem that plagued early recurrent networks: **vanishing gradients**.

## What You'll Learn

1. **The Problem**: Why vanilla RNNs struggle with long sequences
2. **The Solution**: How LSTMs use gates to control information flow
3. **The Architecture**: Understanding the 4 key components (forget, input, cell, output)
4. **The Implementation**: Building LSTMs from scratch in NumPy
5. **The Visualization**: Seeing what LSTMs "remember" and "forget"

## The Big Idea (in 30 seconds)

Imagine you're managing a **to-do list** throughout your day:
- **Forget gate**: Cross off completed tasks ✖️
- **Input gate**: Add new tasks ➕
- **Cell state**: The actual list (your memory) 📝
- **Output gate**: What you're focusing on right now 👁️

LSTMs do the same thing with information - they decide what to remember, what to forget, and what to output at each step!

Let's get started! 🚀

In [None]:
# Setup
import numpy as np
import matplotlib.pyplot as plt
import sys
import os

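# Add the notebook's directory to the path for imports.
# Notebooks don't define __file__, so the quoted '__file__' is resolved
# relative to the current working directory.
sys.path.insert(0, os.path.dirname(os.path.abspath('__file__')))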

# Import our LSTM implementation
from implementation import LSTM
from visualization import (
    plot_gate_activations, 
    plot_cell_state_evolution,
    plot_gradient_flow_comparison,
    analyze_gate_patterns
)

# Set random seed for reproducibility
np.random.seed(42)

print("✅ All imports successful!")
print(f"NumPy version: {np.__version__}")

## 1. The Vanishing Gradient Problem 📉

Before LSTMs, we had **vanilla RNNs**. They worked great for short sequences but failed miserably for long ones. Why?

### The Bucket Brigade Analogy 🪣

Imagine passing water down a line of people (a "bucket brigade"):
- Person 1 fills a bucket and passes it
- Person 2 receives it (but spills 10%) and passes it
- Person 3 receives it (spills another 10%) and passes it
- ...and so on

By the time the bucket reaches Person 50, **almost all the water is gone**! 

This is the **vanishing gradient problem**: as information flows backward through time during training, the gradient gets multiplied by values < 1 at each step, eventually vanishing to zero.

### The Math

In a vanilla RNN, gradients flow backward like this:

$$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$

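Each term in the product is a Jacobian whose norm is often below 1 (the tanh derivative never exceeds 1), so the product shrinks exponentially. With a per-step factor of 0.9: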

$$0.9 \times 0.9 \times 0.9 \times ... \times 0.9 \text{ (50 times)} \approx 0.005$$

**The gradient vanishes!** The network can't learn long-range dependencies.
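
The scalar picture above can be checked on an actual product of Jacobians. The following is a minimal sketch, not the notebook's `implementation` module: `backprop_norms`, the orthogonal weight construction, and the `scale` parameter are assumptions made for illustration. It runs a small tanh RNN forward, then pushes a unit-norm gradient backward and records its norm at each step.

```python
import numpy as np

def backprop_norms(T=50, hidden=16, scale=0.9, seed=0):
    """Norm of dL/dh_t as a unit-norm gradient at step T is pushed back to step 1."""
    rng = np.random.default_rng(seed)
    # Hypothetical recurrent weights: an orthogonal matrix times `scale`,
    # so every singular value of W equals `scale`.
    Q, _ = np.linalg.qr(rng.standard_normal((hidden, hidden)))
    W = scale * Q

    # Forward pass of h_t = tanh(W h_{t-1} + x_t) with random inputs.
    h = np.zeros(hidden)
    hs = []
    for _ in range(T):
        h = np.tanh(W @ h + rng.standard_normal(hidden))
        hs.append(h)

    # Backward pass: dh_t/dh_{t-1} = diag(1 - h_t^2) @ W, so
    # dL/dh_{t-1} = W^T (diag(1 - h_t^2) dL/dh_t).
    grad = np.ones(hidden) / np.sqrt(hidden)
    norms = [np.linalg.norm(grad)]
    for h_t in reversed(hs[1:]):
        grad = W.T @ ((1.0 - h_t ** 2) * grad)
        norms.append(np.linalg.norm(grad))
    return norms

norms = backprop_norms()
print(f"Gradient norm at step T: {norms[0]:.3f}")
print(f"Gradient norm at step 1: {norms[-1]:.3e}")
```

Because the tanh derivative is at most 1 and every singular value of `W` is 0.9 here, the norm can shrink no slower than $0.9$ per step. In practice it shrinks faster, since tanh units are rarely at zero activation.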

In [None]:
# Demonstrate vanishing gradients
def simulate_gradient_flow(initial_grad=1.0, steps=50, factor=0.9):
    """Simulate gradient flowing backward through time."""
    gradients = [initial_grad]
    for _ in range(steps):
        gradients.append(gradients[-1] * factor)
    return gradients

# Compare different scenarios
steps = range(51)
vanilla_rnn = simulate_gradient_flow(1.0, 50, 0.9)
lstm_sim = simulate_gradient_flow(1.0, 50, 0.99)  # LSTMs preserve gradients better

plt.figure(figsize=(12, 5))
plt.plot(steps, vanilla_rnn, 'r-', linewidth=2, label='Vanilla RNN (0.9 factor)')
plt.plot(steps, lstm_sim, 'g-', linewidth=2, label='LSTM (0.99 factor)')
plt.axhline(y=0.1, color='orange', linestyle='--', alpha=0.5, label='Vanishing threshold')
plt.xlabel('Time Steps Backward', fontsize=12)
plt.ylabel('Gradient Magnitude', fontsize=12)
plt.title('Vanishing Gradients: RNN vs LSTM', fontsize=14, fontweight='bold')
plt.yscale('log')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Vanilla RNN after 50 steps: {vanilla_rnn[-1]:.6f}")
print(f"LSTM after 50 steps: {lstm_sim[-1]:.6f}")
print(f"\nLSTM preserves {lstm_sim[-1] / vanilla_rnn[-1]:.1f}x more gradient!")

## 2. The LSTM Solution: A Memory Highway 🛣️

LSTMs solve vanishing gradients with a clever trick: **the cell state**.

### The Highway Analogy

Think of gradients traveling backward in time:

**Vanilla RNN** = Country road with stop signs every block
- Gradients must stop at each time step
- Subject to multiplication by < 1 values
- Gets slower and weaker over distance

**LSTM** = Highway with direct exit ramps
- Cell state provides a "highway" for gradients
- Gradients can flow almost unchanged
- Only need to exit (via gates) when needed

### The Key Insight

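The cell state update is **additive**: new information is added to a gated copy of the old state, instead of pushing the whole state through a weight matrix and nonlinearity at every step: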

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

When gradients flow backward along the direct cell-state path (treating the gate values as fixed):

$$\frac{\partial C_t}{\partial C_{t-1}} = \text{diag}(f_t)$$

When the forget gate stays close to 1 (the network has learned to keep that information), **gradients flow almost unchanged!**

This is why LSTMs can learn dependencies spanning 100 or more steps, while vanilla RNNs often struggle beyond roughly 10.
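
The claim about the cell-state Jacobian can be verified numerically. This is a small sketch under a stated simplification: the gate values are held constant. In a full LSTM the gates also depend on $h_{t-1}$, which adds further gradient paths. The helper `cell_update` and the sampled gate values are illustrative, not taken from the notebook's implementation.

```python
import numpy as np

def cell_update(C_prev, f, i, C_tilde):
    """C_t = f_t * C_{t-1} + i_t * C~_t, with gate values treated as constants."""
    return f * C_prev + i * C_tilde

rng = np.random.default_rng(0)
n = 4
f = rng.uniform(0.9, 1.0, n)         # forget gate close to 1 (illustrative values)
i = rng.uniform(0.0, 1.0, n)         # input gate
C_tilde = rng.uniform(-1.0, 1.0, n)  # cell candidate
C_prev = rng.standard_normal(n)

# Central finite-difference Jacobian of C_t with respect to C_{t-1}
eps = 1e-6
J = np.zeros((n, n))
for k in range(n):
    d = np.zeros(n)
    d[k] = eps
    J[:, k] = (cell_update(C_prev + d, f, i, C_tilde)
               - cell_update(C_prev - d, f, i, C_tilde)) / (2 * eps)

print("Jacobian equals diag(f):", np.allclose(J, np.diag(f)))

# Fraction of gradient surviving 100 steps for two per-step factors
print(f"0.99^100 = {0.99 ** 100:.4f}, 0.9^100 = {0.9 ** 100:.2e}")
```

The off-diagonal entries are zero because each cell unit only sees its own previous value on this path, so the backward factor is an element-wise scaling by $f_t$ and not a full matrix product.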

## 3. LSTM Architecture: The 4 Gates 🚪

An LSTM has **4 key components** that work together: three gates (forget, input, output) and a cell candidate that proposes new values.

### 1. Forget Gate ($f_t$) - The Bouncer 🚫
**Job**: Decide what to throw away from cell state

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

- Output: 0 (forget everything) to 1 (keep everything)
- Analogy: A bouncer deciding who gets to stay in the club

### 2. Input Gate ($i_t$) - The Security Guard ✅
**Job**: Decide what new information to add

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

- Output: 0 (ignore new info) to 1 (accept it all)
- Analogy: A security guard deciding what new items to let in

### 3. Cell Candidate ($\tilde{C}_t$) - The New Information 📦
**Job**: Create potential new information

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

- Output: -1 to 1 (new values to potentially add)
- Analogy: The actual items trying to enter

### 4. Output Gate ($o_t$) - The Librarian 📚
**Job**: Decide what to output from cell state

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

- Output: 0 (hide everything) to 1 (show everything)
- Analogy: A librarian deciding what books to show you

### How They Work Together

1. **Forget**: $f_t \odot C_{t-1}$ (scale down old info)
2. **Add**: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ (add new info)
3. **Output**: $h_t = o_t \odot \tanh(C_t)$ (decide what to reveal)

**The cell state ($C_t$)** is the memory. **The hidden state ($h_t$)** is what gets passed on as output.
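
The three steps above can be sketched as a single function. This is a minimal, self-contained illustration and not the `LSTM` class imported in this notebook: the `params` dictionary, `init_params`, and the small random initialization are assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, params):
    """One LSTM time step: returns the new hidden state and cell state."""
    concat = np.concatenate([h_prev, x])                      # [h_{t-1}, x_t]
    f = sigmoid(params["Wf"] @ concat + params["bf"])         # forget gate
    i = sigmoid(params["Wi"] @ concat + params["bi"])         # input gate
    C_tilde = np.tanh(params["Wc"] @ concat + params["bc"])   # cell candidate
    o = sigmoid(params["Wo"] @ concat + params["bo"])         # output gate
    C = f * C_prev + i * C_tilde                              # forget, then add
    h = o * np.tanh(C)                                        # decide what to reveal
    return h, C

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical initializer: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        params["W" + name] = 0.1 * rng.standard_normal(
            (hidden_size, hidden_size + input_size))
        params["b" + name] = np.zeros(hidden_size)
    return params

# Run a short one-hot sequence through the cell
params = init_params(input_size=4, hidden_size=3)
h, C = np.zeros(3), np.zeros(3)
for idx in [0, 1, 2, 2, 3]:
    x = np.zeros(4)
    x[idx] = 1.0
    h, C = lstm_step(x, h, C, params)

print("h:", h)
print("C:", C)
```

Note that `h` is always strictly between -1 and 1 because it is a sigmoid output times a tanh output, while `C` itself is not bounded in the same way and can grow as information accumulates.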

In [None]:
# Let's build a tiny LSTM and see it in action!

# Create a small vocabulary (unique characters, in order of first appearance)
chars = list(dict.fromkeys("hello"))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
vocab_size = len(chars)

print(f"Vocabulary: {chars}")
print(f"Vocab size: {vocab_size}")

# Initialize a small LSTM
hidden_size = 10  # Small for visualization
lstm = LSTM(input_size=vocab_size, hidden_size=hidden_size, output_size=vocab_size)

print(f"\n✅ LSTM created!")
print(f"   - Input size: {vocab_size}")
print(f"   - Hidden size: {hidden_size}")
print(f"   - Output size: {vocab_size}")
print(f"   - Weight parameters (biases not counted): {sum(p.size for p in [lstm.Wf, lstm.Wi, lstm.Wc, lstm.Wo, lstm.Wy])}")

## 4. Forward Pass: Watching the Gates Work 👁️

Let's run a sequence through the LSTM and watch what the gates do!

We'll feed it the sequence **"hello"** and capture:
- Forget gate activations (what to keep)
- Input gate activations (what to add)
- Output gate activations (what to reveal)
- Cell state evolution (the memory)

In [None]:
# Prepare sequence
text = "hello"
inputs = [char_to_idx[ch] for ch in text]
print(f"Input sequence: {text}")
print(f"As indices: {inputs}")

# Storage for gate activations
gates_storage = {
    'forget': [],
    'input': [],
    'output': []
}
cell_states_storage = []

# Initialize hidden and cell states
h_prev = np.zeros(hidden_size)
C_prev = np.zeros(hidden_size)

# Forward pass through sequence
for t, idx in enumerate(inputs):
    # Create one-hot encoded input
    x = np.zeros(vocab_size)
    x[idx] = 1.0
    
    # Compute all gates (we'll manually compute to capture them)
    concat = np.concatenate([h_prev, x])
    
    # Forget gate
    f = lstm.sigmoid(np.dot(lstm.Wf, concat) + lstm.bf)
    gates_storage['forget'].append(f.copy())
    
    # Input gate
    i = lstm.sigmoid(np.dot(lstm.Wi, concat) + lstm.bi)
    gates_storage['input'].append(i.copy())
    
    # Cell candidate
    C_tilde = np.tanh(np.dot(lstm.Wc, concat) + lstm.bc)
    
    # Update cell state
    C_prev = f * C_prev + i * C_tilde
    cell_states_storage.append(C_prev.copy())
    
    # Output gate
    o = lstm.sigmoid(np.dot(lstm.Wo, concat) + lstm.bo)
    gates_storage['output'].append(o.copy())
    
    # Update hidden state
    h_prev = o * np.tanh(C_prev)
    
    print(f"\nStep {t} ('{text[t]}'):")
    print(f"  Forget gate avg: {f.mean():.3f} (1=keep, 0=forget)")
    print(f"  Input gate avg:  {i.mean():.3f} (1=add, 0=ignore)")
    print(f"  Output gate avg: {o.mean():.3f} (1=show, 0=hide)")

print("\n✅ Forward pass complete!")

## 5. Visualizing Gate Activations 📊

Now let's visualize what the gates are doing! This helps us understand:
- **Which hidden units are active** (bright colors)
- **When gates open/close** (across time steps)
- **Patterns in gate behavior** (do they learn structure?)
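
Before plotting, it can help to put the stored activations into one matrix and look at a few summary numbers. The sketch below is a hypothetical helper, not the `analyze_gate_patterns` function imported earlier: `summarize_gate` and its `low`/`high` thresholds are assumptions for illustration. It expects a list of 1-D arrays, one per time step, in the same layout as `gates_storage['forget']`.

```python
import numpy as np

def summarize_gate(activations, low=0.1, high=0.9):
    """Stack per-step gate activations and compute simple per-step statistics."""
    A = np.stack(activations, axis=1)  # shape: (hidden_size, num_steps)
    return {
        "matrix": A,                              # rows = hidden units, columns = time steps
        "mean_per_step": A.mean(axis=0),          # average openness at each step
        "frac_open": (A > high).mean(axis=0),     # share of units nearly fully open
        "frac_closed": (A < low).mean(axis=0),    # share of units nearly fully closed
    }

# Toy activations: 3 hidden units over 2 time steps
demo = [np.array([0.95, 0.5, 0.05]), np.array([0.2, 0.92, 0.99])]
summary = summarize_gate(demo)

print("Matrix shape:", summary["matrix"].shape)
print("Mean per step:", summary["mean_per_step"])
print("Fraction open:", summary["frac_open"])
print("Fraction closed:", summary["frac_closed"])
```

The `matrix` entry has hidden units as rows and time steps as columns, which is a convenient layout for a heatmap such as `plt.imshow(summary["matrix"], aspect="auto")`.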

In [None]:
# Visualize gate activations
plot_gate_activations(gates_storage, text)

# Visualize cell state evolution
plot_cell_state_evolution(cell_states_storage, text)

# Analyze patterns
analyze_gate_patterns(gates_storage, text)

## 6. Training on Real Text 📚

Let's train our LSTM on a real text dataset! We'll use a small training example to see how the LSTM learns to predict the next character.

For this demo, we'll use Shakespeare text (or any text you have).
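
To make sense of the loss values during training, it helps to know what "predicting the next character" is scored on. The sketch below assumes a softmax over the vocabulary with cross-entropy loss, which is the usual choice for character-level models. Whether `lstm_train.forward` returns a per-character mean or a sum over the sequence is not shown in this notebook, so treat the numbers as a per-character reference only. `softmax` and `next_char_loss` are illustrative names.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of scores."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_char_loss(logits, target_idx):
    """Cross-entropy for one prediction: -log p(target character)."""
    return -np.log(softmax(logits)[target_idx])

vocab = 30  # illustrative vocabulary size

# An untrained model that spreads probability evenly scores log(vocab)
uniform_loss = next_char_loss(np.zeros(vocab), target_idx=0)

# A model that puts a high score on the correct character scores much lower
confident_logits = np.zeros(vocab)
confident_logits[0] = 5.0
confident_loss = next_char_loss(confident_logits, target_idx=0)

print(f"Uniform guess loss: {uniform_loss:.4f} (log({vocab}) = {np.log(vocab):.4f})")
print(f"Confident correct loss: {confident_loss:.4f}")
```

A per-character loss near $\ln(\text{vocab size})$ means the model is doing no better than guessing uniformly, so that value is a useful baseline when reading the training curve.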

In [None]:
# Simple training data (you can replace with Shakespeare or any text file)
training_text = """
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them.
"""

# Create vocabulary
chars_train = sorted(list(set(training_text)))
char_to_idx_train = {ch: i for i, ch in enumerate(chars_train)}
idx_to_char_train = {i: ch for i, ch in enumerate(chars_train)}
vocab_size_train = len(chars_train)

print(f"Training text length: {len(training_text)} characters")
print(f"Vocabulary size: {vocab_size_train}")
print(f"Unique characters: {''.join(chars_train)}")

# Create LSTM
lstm_train = LSTM(input_size=vocab_size_train, 
                  hidden_size=64, 
                  output_size=vocab_size_train)

print("\n✅ Training LSTM created!")

In [None]:
# Training loop
seq_length = 25
learning_rate = 0.001
num_iterations = 1000

losses = []
h_prev = np.zeros(lstm_train.hidden_size)
C_prev = np.zeros(lstm_train.hidden_size)

print("Training LSTM...")
print("=" * 60)

for iteration in range(num_iterations):
    # Sample random starting point
    start_idx = np.random.randint(0, len(training_text) - seq_length - 1)
    
    # Get input and target sequences
    input_seq = training_text[start_idx:start_idx + seq_length]
    target_seq = training_text[start_idx + 1:start_idx + seq_length + 1]
    
    # Convert to indices
    inputs = [char_to_idx_train[ch] for ch in input_seq]
    targets = [char_to_idx_train[ch] for ch in target_seq]
    
    # Forward pass
    loss = lstm_train.forward(inputs, targets, h_prev, C_prev)
    losses.append(loss)
    
    # Backward pass
    dh_next, dC_next = lstm_train.backward()
    
    # Update weights
    lstm_train.update_weights(learning_rate)
    
    # Carry the final states into the next chunk as plain copies
    # (truncated BPTT: no gradient flows across chunk boundaries)
    h_prev = lstm_train.h_states[-1].copy()
    C_prev = lstm_train.C_states[-1].copy()
    
    # Print progress
    if iteration % 100 == 0:
        smooth_loss = np.mean(losses[-100:]) if len(losses) >= 100 else np.mean(losses)
        print(f"Iteration {iteration:4d} | Loss: {smooth_loss:.4f}")
        
        # Sample text
        if iteration % 500 == 0:
            sample = lstm_train.sample(idx_to_char_train, char_to_idx_train['T'], 100)
            print(f"Sample: {sample[:60]}...")
            print()

print("\n✅ Training complete!")

In [None]:
# Plot training curve
plt.figure(figsize=(10, 5))
plt.plot(losses, alpha=0.3, label='Raw loss')
# Smooth curve
window = 50
if len(losses) > window:
    smoothed = np.convolve(losses, np.ones(window)/window, mode='valid')
    plt.plot(range(window-1, len(losses)), smoothed, linewidth=2, label='Smoothed loss', color='red')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('LSTM Training Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Final loss: {losses[-1]:.4f}")
print(f"Improvement: {(losses[0] - losses[-1]) / losses[0] * 100:.1f}%")

## 7. Temperature Sampling Experiments 🌡️

Temperature controls how "creative" vs "conservative" the model is when generating text:

- **Low temperature (0.5)**: Conservative, mostly picks the likeliest characters → coherent but boring
- **Medium temperature (1.0)**: Balanced, samples from the model's own distribution → good mix
- **High temperature (1.5)**: Creative, picks unlikely characters more often → diverse but chaotic

Think of it like adjusting the "randomness knob" on the model!
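
Here is a minimal sketch of how the knob usually works: the scores (logits) are divided by the temperature before the softmax. How `lstm_train.sample` applies its `temperature` argument internally is not shown here, so `temperature_probs` and `sample_index` are illustrative stand-ins, not the notebook's implementation.

```python
import numpy as np

def temperature_probs(logits, temperature=1.0):
    """Softmax of logits / temperature (numerically stable)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()
    p = np.exp(scaled)
    return p / p.sum()

def sample_index(logits, temperature=1.0, rng=None):
    """Draw one index from the temperature-adjusted distribution."""
    rng = np.random.default_rng() if rng is None else rng
    p = temperature_probs(logits, temperature)
    return int(rng.choice(len(p), p=p))

logits = [2.0, 1.0, 0.0]  # toy scores for three characters
for t in (0.5, 1.0, 1.5):
    print(f"T={t}: {np.round(temperature_probs(logits, t), 3)}")

p_low = temperature_probs(logits, 0.5)
p_high = temperature_probs(logits, 1.5)
picked = sample_index(logits, temperature=1.0, rng=np.random.default_rng(42))
```

Dividing by a temperature below 1 widens the gaps between scores, so the top character gets a larger share of the probability. A temperature above 1 shrinks the gaps and flattens the distribution.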

In [None]:
# Try different temperatures
temperatures = [0.3, 0.7, 1.0, 1.5]
seed_char = 'T'

print("Sampling with different temperatures:")
print("=" * 70)

for temp in temperatures:
    sample = lstm_train.sample(idx_to_char_train, 
                               char_to_idx_train[seed_char], 
                               length=150, 
                               temperature=temp)
    print(f"\nTemperature = {temp}:")
    print(f"{sample[:120]}...")
    print("-" * 70)

## 8. Key Takeaways 🎯

### What We Learned

1. **The Problem**: Vanilla RNNs suffer from vanishing gradients
   - Gradients are typically scaled by a factor < 1 at each step
   - They struggle to learn dependencies more than roughly 10 steps away

2. **The Solution**: LSTMs use a cell state "highway"
   - The cell state update is **additive**, scaled only by the forget gate
   - When the forget gate stays near 1, gradients flow almost unchanged backward in time
   - Can learn dependencies 100+ steps away

3. **The Architecture**: 4 components work together
   - **Forget gate**: What to remove from memory
   - **Input gate**: What new info to add
   - **Cell state**: The actual memory
   - **Output gate**: What to reveal

4. **The Intuition**: Think of it as a smart to-do list
   - Cross off completed tasks (forget)
   - Add new tasks (input)
   - Keep the list (cell state)
   - Decide what's relevant now (output)

### When to Use LSTMs

✅ **Use LSTMs when:**
- You have sequential data (text, time series, audio)
- Long-range dependencies matter (>10 steps)
- You need interpretable gates
- Dataset is small-to-medium sized

❌ **Don't use LSTMs when:**
- You have non-sequential data (images, tables)
- Very long sequences (>1000 steps) → use Transformers
- You need maximum performance → use Transformers
- You have huge datasets → Transformers train better at scale

### Next Steps

- Try exercise 1: Build LSTM from scratch
- Experiment with different hyperparameters
- Train on your own text data
- Compare with vanilla RNN and GRU

**Tomorrow (Day 3)**: We'll explore another foundational paper!

---

**Congratulations!** 🎉 You now understand how LSTMs work and why they revolutionized sequence modeling!