# Day 1: The Unreasonable Effectiveness of Recurrent Neural Networks

**Welcome to the 30u30 challenge!** 🚀

Today we'll build a **character-level RNN** from scratch and see how simple models can generate surprisingly good text.

---

## What You'll Learn

1. ✅ How RNNs process sequences
2. ✅ Forward propagation through time
3. ✅ Backpropagation through time (BPTT)
4. ✅ Why RNNs can generate coherent text
5. ✅ Temperature sampling

---

## Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sys

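# For pretty plots (this style name needs matplotlib >= 3.6; skip it if unavailable)
if 'seaborn-v0_8-darkgrid' in plt.style.available:
    plt.style.use('seaborn-v0_8-darkgrid')

print("Setup complete! ✅")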

---

## Part 1: Understanding the Problem

### The Task: Predict the Next Character

Given a sequence of characters, predict what comes next.

**Example:**
```
Input:  "hello wor"
Output: "l" (most likely)
```

### Why is this hard?

- Context matters: "h" is a likely next character after "t", but an unlikely one after "wor"
- Long-range dependencies: opening quote → closing quote
- Multiple valid continuations
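
To make "multiple valid continuations" concrete, here is a minimal sketch that counts which characters follow each character in the toy string used later in this notebook. The variable names (`text`, `next_counts`) are illustrative only.

```python
from collections import Counter, defaultdict

text = "hello hello hello world world world"  # same toy string as in Part 2

# For every character, count which characters follow it.
next_counts = defaultdict(Counter)
for current, following in zip(text, text[1:]):
    next_counts[current][following] += 1

for ch in "lo":
    print(repr(ch), dict(next_counts[ch]))
# 'l' {'l': 3, 'o': 3, 'd': 3}
# 'o' {' ': 3, 'r': 3}
```

Seeing only the current character, "l" is followed equally often by "l", "o", and "d". The model needs earlier context to choose between them.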

### Why RNNs?

RNNs carry a **hidden state**, a vector that acts as **memory**: it is updated at every step and summarizes the characters seen so far.

---

## Part 2: Prepare Data

Let's start with a tiny dataset so we can see what's happening.

In [None]:
# Tiny training data
data = "hello hello hello world world world"

# Get unique characters
chars = sorted(list(set(data)))
vocab_size = len(chars)

print(f"Data: {data}")
print(f"Unique characters: {chars}")
print(f"Vocabulary size: {vocab_size}")

# Create mappings
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

print(f"\nMappings:")
for ch in chars[:5]:
    print(f"  '{ch}' → {char_to_idx[ch]}")
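
As a quick sanity check on the mappings, the sketch below round-trips a string through them and builds an input/target pair by shifting one character. The helpers `encode` and `decode` are hypothetical names introduced here; they are not used elsewhere in the notebook.

```python
text = "hello hello hello world world world"
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map a string to a list of character indices."""
    return [char_to_idx[ch] for ch in s]

def decode(indices):
    """Map a list of character indices back to a string."""
    return "".join(idx_to_char[i] for i in indices)

encoded = encode("hello")
print(encoded)          # [3, 2, 4, 4, 5]
print(decode(encoded))  # hello

# Targets are the inputs shifted one character ahead.
inputs, targets = encoded[:-1], encoded[1:]
print(decode(inputs), "->", decode(targets))  # hell -> ello
```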

---

## Part 3: Build the RNN

### The Math

At each time step $t$:

1. **Hidden state update:**
   $$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

2. **Output:**
   $$y_t = W_{hy} h_t + b_y$$

3. **Probabilities:**
   $$p_t = \text{softmax}(y_t)$$

### Visualizing the Flow

```
Input:  "h"  "e"  "l"  "l"  "o"
         ↓    ↓    ↓    ↓    ↓
       [RNN][RNN][RNN][RNN][RNN]
         ↓    ↓    ↓    ↓    ↓
Output: "e"  "l"  "l"  "o"  " "
```

Each RNN cell:
- Takes current input + previous hidden state
- Produces new hidden state + output
- Hidden state = memory
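
The three equations above can be sketched as a single step function. The sizes, the random seed, and the larger weight scale (0.5 instead of the notebook's 0.01) are assumptions chosen so the effect is easy to see; `one_hot`, `rnn_step`, and `run` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 8, 4  # illustrative sizes, not the notebook's settings

Wxh = rng.normal(0, 0.5, (hidden_size, vocab_size))
Whh = rng.normal(0, 0.5, (hidden_size, hidden_size))
Why = rng.normal(0, 0.5, (vocab_size, hidden_size))
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

def one_hot(idx):
    x = np.zeros((vocab_size, 1))
    x[idx] = 1
    return x

def rnn_step(x, h_prev):
    """One time step: returns the new hidden state and next-char probabilities."""
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)
    y = Why @ h + by
    e = np.exp(y - np.max(y))
    return h, e / np.sum(e)

def run(indices):
    """Feed a sequence of character indices, starting from a zero hidden state."""
    h = np.zeros((hidden_size, 1))
    for idx in indices:
        h, p = rnn_step(one_hot(idx), h)
    return h, p

# Same final input (index 4), different histories.
h_a, p_a = run([3, 2, 4])
h_b, p_b = run([7, 5, 4])

print("p shape:", p_a.shape, "sum:", float(p_a.sum()))
print("Same hidden state?", np.allclose(h_a, h_b))
```

Both runs end on the same input character, yet the hidden states differ because they encode what came before. That difference is what lets the model make context-dependent predictions.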

In [None]:
# Hyperparameters
hidden_size = 25  # Size of hidden state vector
seq_length = 10   # Number of steps to unroll
learning_rate = 0.1

# Initialize weights (small random values)
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01  # Input → Hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden → Hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01  # Hidden → Output
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((vocab_size, 1))   # Output bias

print(f"Weight shapes:")
print(f"  Wxh: {Wxh.shape} (hidden_size × vocab_size)")
print(f"  Whh: {Whh.shape} (hidden_size × hidden_size)")
print(f"  Why: {Why.shape} (vocab_size × hidden_size)")
print(f"  bh:  {bh.shape}")
print(f"  by:  {by.shape}")

---

## Part 4: Forward Pass

Let's process one sequence and see what happens.

In [None]:
def forward_pass(inputs, targets, h_prev):
    """
    Forward pass through RNN.
    
    Args:
        inputs: List of character indices (length = seq_length)
        targets: List of target character indices
        h_prev: Previous hidden state (hidden_size × 1)
        
    Returns:
        loss: Cross-entropy loss
        h_last: Final hidden state
        cache: Values for backward pass
    """
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = np.copy(h_prev)
    loss = 0
    
    # Forward through time
    for t in range(len(inputs)):
        # 1. One-hot encode input
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1
        
        # 2. Hidden state
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t-1] + bh)
        
        # 3. Output
        ys[t] = Why @ hs[t] + by
        
        # 4. Softmax (numerically stable: subtract the max logit before exponentiating)
        exp_y = np.exp(ys[t] - np.max(ys[t]))
        ps[t] = exp_y / np.sum(exp_y)
        
        # 5. Loss (cross-entropy)
        loss += -np.log(ps[t][targets[t], 0])
    
    return loss, hs[len(inputs)-1], (xs, hs, ys, ps)

# Test with first 10 characters
test_input = [char_to_idx[ch] for ch in data[:seq_length]]
test_target = [char_to_idx[ch] for ch in data[1:seq_length+1]]
h0 = np.zeros((hidden_size, 1))

loss, h_final, cache = forward_pass(test_input, test_target, h0)

print(f"Input sequence: '{data[:seq_length]}'")
print(f"Target sequence: '{data[1:seq_length+1]}'")
print(f"Loss: {loss:.4f}")
print(f"Final hidden state shape: {h_final.shape}")

### What just happened?

1. Each character was converted to a one-hot vector
2. The RNN processed them one by one, updating its hidden state
3. At each step, the RNN predicted the next character
4. Loss = how wrong the predictions were (cross-entropy summed over the sequence)

**Initial loss is high** because the weights are random, so the predictions are close to uniform over the vocabulary! 🎲
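
With weights initialized near zero, the logits are close to zero and the softmax is nearly uniform, so the untrained loss should sit near `seq_length * ln(vocab_size)`. The short calculation below gives that baseline, assuming the 8-character vocabulary of the toy string and `seq_length = 10`.

```python
import numpy as np

vocab_size = 8   # unique characters in "hello hello hello world world world"
seq_length = 10  # steps per training sequence, as set in Part 3

# A uniform prediction assigns probability 1/vocab_size to the correct character.
per_char_loss = -np.log(1.0 / vocab_size)
baseline_loss = per_char_loss * seq_length

print(f"Per-character loss: {per_char_loss:.4f}")  # 2.0794
print(f"Sequence loss:      {baseline_loss:.4f}")  # 20.7944
```

If the loss printed by the forward pass is close to this number, the model is behaving as expected before training.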

---

## Part 5: Sampling (Before Training)

Let's see what the model generates with random weights.

In [None]:
def sample(h, seed_idx, n, temperature=1.0):
    """
    Generate text by sampling from the model.
    
    Args:
        h: Initial hidden state
        seed_idx: Starting character index
        n: Number of characters to generate
        temperature: Sampling temperature (higher = more random)
        
    Returns:
        indices: List of generated character indices
    """
    x = np.zeros((vocab_size, 1))
    x[seed_idx] = 1
    indices = []
    
    for t in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        y = Why @ h + by
        
        # Apply temperature (must be > 0), then a numerically stable softmax
        y = y / temperature
        exp_y = np.exp(y - np.max(y))
        p = exp_y / np.sum(exp_y)
        
        # Sample from distribution
        idx = np.random.choice(range(vocab_size), p=p.ravel())
        
        # Update input for next step
        x = np.zeros((vocab_size, 1))
        x[idx] = 1
        indices.append(idx)
    
    return indices

# Generate before training
h = np.zeros((hidden_size, 1))
seed = char_to_idx['h']
sample_indices = sample(h, seed, 50)
sample_text = ''.join([idx_to_char[i] for i in sample_indices])

print("Generated text (untrained):")
print(f"'{sample_text}'")
print("\n(Looks like gibberish, right? That's expected!)")

---

## Part 6: Backward Pass (BPTT)

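Now we need to compute gradients so we can update the weights.

**Backpropagation Through Time** = unroll the network over the sequence and apply the chain rule backwards, from the last time step to the first. Because the same weights are reused at every step, their gradients are summed across time steps.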
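
For reference, applying the chain rule to the equations from Part 3 gives the recursions below. This is a sketch in notation introduced here for illustration: $a_t$ is the pre-activation inside the tanh, $L = \sum_t -\log p_t[\text{target}_t]$ is the total loss, and each $\delta$ is the gradient of $L$ with respect to the named quantity.

$$\delta y_t = p_t - \text{onehot}(\text{target}_t)$$

$$\delta h_t = W_{hy}^\top \, \delta y_t + W_{hh}^\top \, \delta a_{t+1} \quad (\delta a_{t+1} = 0 \text{ at the last step})$$

$$\delta a_t = (1 - h_t^2) \odot \delta h_t$$

$$\frac{\partial L}{\partial W_{hy}} = \sum_t \delta y_t \, h_t^\top, \qquad \frac{\partial L}{\partial W_{xh}} = \sum_t \delta a_t \, x_t^\top, \qquad \frac{\partial L}{\partial W_{hh}} = \sum_t \delta a_t \, h_{t-1}^\top$$

$$\frac{\partial L}{\partial b_y} = \sum_t \delta y_t, \qquad \frac{\partial L}{\partial b_h} = \sum_t \delta a_t$$

The factor $(1 - h_t^2)$ is the derivative of tanh, and the $W_{hh}^\top \, \delta a_{t+1}$ term is what carries gradient information back through time.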

In [None]:
def backward_pass(inputs, targets, cache):
    """
    Backward pass: compute gradients via BPTT.
    """
    xs, hs, ys, ps = cache
    
    # Initialize gradients
    dWxh = np.zeros_like(Wxh)
    dWhh = np.zeros_like(Whh)
    dWhy = np.zeros_like(Why)
    dbh = np.zeros_like(bh)
    dby = np.zeros_like(by)
    dh_next = np.zeros_like(hs[0])
    
    # Backward through time
    for t in reversed(range(len(inputs))):
        # Gradient of loss w.r.t. output
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1  # Softmax + cross-entropy gradient
        
        # Output layer gradients
        dWhy += dy @ hs[t].T
        dby += dy
        
        # Backprop to hidden state
        dh = Why.T @ dy + dh_next
        
        # Backprop through tanh
        dh_raw = (1 - hs[t] ** 2) * dh
        
        # Hidden layer gradients
        dbh += dh_raw
        dWxh += dh_raw @ xs[t].T
        dWhh += dh_raw @ hs[t-1].T
        
        # Gradient for next iteration
        dh_next = Whh.T @ dh_raw
    
    # Clip gradients to prevent explosion
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)
    
    return dWxh, dWhh, dWhy, dbh, dby

# Test backward pass
grads = backward_pass(test_input, test_target, cache)
print("Gradients computed successfully! ✅")
print(f"Gradient shapes: {[g.shape for g in grads]}")
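
A common way to gain confidence in a hand-written backward pass is a numerical gradient check. The self-contained sketch below compares the BPTT gradient for `Whh` against central finite differences on a tiny RNN. It is a simplified stand-in for `backward_pass`, not a test of it: biases and clipping are omitted, and the sizes, seed, and weight scale are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, T = 4, 3, 5  # toy vocab size, hidden size, sequence length (illustrative)
Wxh = rng.normal(0, 0.5, (H, V))
Whh = rng.normal(0, 0.5, (H, H))
Why = rng.normal(0, 0.5, (V, H))
inputs = rng.integers(0, V, size=T)
targets = rng.integers(0, V, size=T)

def loss_and_grad_Whh(Whh_local):
    """Return the summed cross-entropy loss and dL/dWhh computed via BPTT."""
    xs, hs, ps = {}, {-1: np.zeros((H, 1))}, {}
    loss = 0.0
    for t in range(T):
        xs[t] = np.zeros((V, 1))
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(Wxh @ xs[t] + Whh_local @ hs[t - 1])
        y = Why @ hs[t]
        e = np.exp(y - y.max())
        ps[t] = e / e.sum()
        loss += -np.log(ps[t][targets[t], 0])

    dWhh = np.zeros_like(Whh_local)
    dh_next = np.zeros((H, 1))
    for t in reversed(range(T)):
        dy = ps[t].copy()
        dy[targets[t]] -= 1                 # softmax + cross-entropy gradient
        dh = Why.T @ dy + dh_next           # from the output and from step t+1
        dh_raw = (1 - hs[t] ** 2) * dh      # back through tanh
        dWhh += dh_raw @ hs[t - 1].T
        dh_next = Whh_local.T @ dh_raw
    return loss, dWhh

_, analytic = loss_and_grad_Whh(Whh)

# Central finite differences, one weight at a time.
numeric = np.zeros_like(Whh)
eps = 1e-5
for i in range(H):
    for j in range(H):
        W_plus, W_minus = Whh.copy(), Whh.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss_and_grad_Whh(W_plus)[0]
                         - loss_and_grad_Whh(W_minus)[0]) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

A very small difference suggests the recursion is implemented correctly. Clipping is left out on purpose, since clipped gradients would no longer match the finite-difference estimate.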

---

## Part 7: Training Loop

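Now let's train! We'll use the **Adagrad** optimizer, which gives each parameter its own step size by dividing the gradient by the square root of that parameter's accumulated squared gradients.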
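
To see the Adagrad update in isolation, here is a minimal sketch on a toy quadratic. The objective, the learning rate of 1.0, and the helper name `adagrad_step` are assumptions for illustration; the update rule itself has the same form as the one in the training loop.

```python
import numpy as np

def adagrad_step(param, grad, mem, learning_rate=1.0, eps=1e-8):
    """In-place Adagrad update: accumulate squared gradients, then scale the step."""
    mem += grad * grad
    param -= learning_rate * grad / np.sqrt(mem + eps)

# Toy objective (illustrative): f(w) = sum((w - target)^2)
target = np.array([3.0, -1.0])
w = np.zeros(2)
mem = np.zeros_like(w)

for step in range(200):
    grad = 2 * (w - target)
    adagrad_step(w, grad, mem)

print(w)  # close to [ 3. -1.]
```

Because `mem` only grows, the effective step size of each parameter shrinks over time. That makes early progress fast and later updates more cautious.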

In [None]:
# Convert data to indices
data_indices = [char_to_idx[ch] for ch in data]
data_size = len(data_indices)

# Adagrad memory
mWxh = np.zeros_like(Wxh)
mWhh = np.zeros_like(Whh)
mWhy = np.zeros_like(Why)
mbh = np.zeros_like(bh)
mby = np.zeros_like(by)

# Training
losses = []
smooth_loss = -np.log(1.0/vocab_size) * seq_length
p = 0  # Data pointer
h_prev = np.zeros((hidden_size, 1))

for iteration in range(1000):
    # Reset if at end of data
    if p + seq_length + 1 >= data_size or iteration == 0:
        h_prev = np.zeros((hidden_size, 1))
        p = 0
    
    # Get the next chunk (targets are the inputs shifted one character ahead)
    inputs = data_indices[p:p+seq_length]
    targets = data_indices[p+1:p+seq_length+1]
    
    # Forward pass
    loss, h_prev, cache = forward_pass(inputs, targets, h_prev)
    smooth_loss = smooth_loss * 0.999 + loss * 0.001
    losses.append(smooth_loss)
    
    # Backward pass
    dWxh, dWhh, dWhy, dbh, dby = backward_pass(inputs, targets, cache)
    
    # Adagrad update
    for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                   [dWxh, dWhh, dWhy, dbh, dby],
                                   [mWxh, mWhh, mWhy, mbh, mby]):
        mem += dparam * dparam
        param -= learning_rate * dparam / np.sqrt(mem + 1e-8)
    
    p += seq_length
    
    # Print progress
    if iteration % 100 == 0:
        print(f"Iteration {iteration}, Loss: {smooth_loss:.4f}")
        
        # Generate sample
        sample_indices = sample(h_prev, inputs[0], 50, temperature=0.8)
        txt = ''.join(idx_to_char[i] for i in sample_indices)
        print(f"  Sample: '{txt}'\n")

print("Training complete! 🎉")

---

## Part 8: Visualize Training

In [None]:
plt.figure(figsize=(12, 5))
plt.plot(losses, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Training Loss Over Time', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

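# Note: `losses` holds the exponentially smoothed loss (decay 0.999), so after
# 1000 iterations it still carries about 37% weight (0.999**1000) from its starting value.
print(f"Final smoothed loss: {losses[-1]:.4f}")
print(f"Started at: {losses[0]:.4f}")
print(f"Improvement: {losses[0] - losses[-1]:.4f}")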

---

## Part 9: Play with Temperature

In [None]:
print("Effect of Temperature on Sampling\n" + "="*50 + "\n")

temperatures = [0.2, 0.5, 0.8, 1.0, 1.5, 2.0]
h = np.zeros((hidden_size, 1))
seed = char_to_idx['h']

for temp in temperatures:
    sample_indices = sample(h, seed, 100, temperature=temp)
    txt = ''.join(idx_to_char[i] for i in sample_indices)
    print(f"Temperature = {temp}:")
    print(f"  {txt}\n")

### Observations:

- **Low temperature (0.2)**: Very conservative, repetitive
- **Medium temperature (0.8)**: Balanced creativity
- **High temperature (2.0)**: More random, less coherent

**Why?** Temperature divides the logits before softmax:
- Low T → sharper distribution → almost always picks the most likely character
- High T → flatter distribution → more randomness
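
The effect is easy to check on a fixed set of logits. In the sketch below the logits are made-up values for three candidate characters, and `softmax_with_temperature` is an illustrative helper rather than a function from this notebook.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = np.asarray(logits, dtype=float) / temperature
    e = np.exp(scaled - np.max(scaled))
    return e / np.sum(e)

logits = [2.0, 1.0, 0.1]  # made-up logits for three candidate characters

for T in [0.2, 1.0, 2.0]:
    p = softmax_with_temperature(logits, T)
    print(f"T={T}: {np.round(p, 3)}")
```

As T drops, the probability of the top candidate moves toward 1. As T rises, the three probabilities move toward each other.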

---

## Part 10: What Did We Learn?

### Key Insights

1. **RNNs have memory** through hidden states
2. **They can learn patterns** from data
3. **Temperature controls creativity** vs. coherence
4. **Gradient clipping is crucial** to keep gradients from exploding
5. **Simple models can do surprising things**
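
To illustrate why clipping matters, the sketch below repeatedly multiplies a gradient by a recurrent matrix, as BPTT does when it steps back through time, and then applies the same element-wise clipping used in `backward_pass`. The matrix `W` is a toy assumption with all eigenvalues above 1, and the tanh derivative is ignored to keep the example short.

```python
import numpy as np

grad = np.ones((2, 1))
W = 1.5 * np.eye(2)  # toy recurrent matrix (assumption: eigenvalues > 1)

# Each step back in time multiplies the gradient by W.T.
for _ in range(20):
    grad = W.T @ grad

before = float(np.abs(grad).max())
np.clip(grad, -5, 5, out=grad)  # element-wise clipping, in place
after = float(np.abs(grad).max())

print(f"Before clipping: max |g| = {before:.1f}")
print(f"After clipping:  max |g| = {after:.1f}")
```

Twenty steps are enough for the gradient to grow by a factor of 1.5^20, which is why unclipped updates can destabilize training.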

### Why "Unreasonable Effectiveness"?

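The toy model in this notebook has only about a thousand parameters, yet the same next-character recipe, scaled up to larger networks and datasets, can: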
- Generate Shakespeare
- Write code
- Compose music

The model **discovers structure** in the data:
- Words
- Grammar
- Style

All from predicting one character at a time!

### Connection to Modern AI

This is the foundation of:
- **LSTMs** (Day 2) → better memory
- **Transformers** (Day 13+) → attention mechanism
- **GPT** → scale this up massively

---

## Next Steps

1. **Try larger datasets**: Shakespeare, Wikipedia, code
2. **Tune hyperparameters**: Hidden size, learning rate
3. **Complete the exercises** in `/exercises`
4. **Move to Day 2**: LSTMs!

---

## Resources

- 📖 [Original blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- 📊 [Visualizing RNNs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- 💻 [Implementation code](./implementation.py)
- 🎯 [Exercises](./exercises/)

---

**Congratulations!** You've completed Day 1. 🎉

Share your progress with **#30u30**!