# Lab 3: Sequence Models and Attention

In this lab, we'll explore sequence-to-sequence models and attention mechanisms that revolutionized NLP. These techniques enable neural machine translation, text summarization, and more.

## Learning Objectives

By the end of this lab, you will:
- Understand the encoder-decoder architecture
- Implement sequence-to-sequence models
- Build attention mechanisms from scratch
- Understand self-attention
- Implement multi-head attention
- Build a neural machine translation system

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import re

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)
tf.random.set_seed(42)

## Part 1: Sequence-to-Sequence (Seq2Seq) Architecture

**Seq2Seq** models transform one sequence into another, for example an English word into its French translation.

### Architecture:
1. **Encoder**: Processes input sequence → context vector
2. **Context Vector**: Fixed-size representation of input
3. **Decoder**: Generates output sequence from context
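To make the fixed-size context vector concrete, here is a minimal NumPy sketch of an encoder. It assumes a plain tanh RNN cell with random, untrained weights rather than the Keras LSTM used in this lab, and the names `encode`, `W_x`, and `W_h` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3

# Random, untrained weights for a plain tanh RNN cell (illustrative only).
W_x = rng.normal(size=(input_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))

def encode(sequence):
    """Run the RNN over a (seq_len, input_dim) array and return the final state."""
    h = np.zeros(hidden_dim)
    for x_t in sequence:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h  # the context vector

short_context = encode(rng.normal(size=(2, input_dim)))
long_context = encode(rng.normal(size=(50, input_dim)))

# Both contexts have the same size, however long the input was.
print(short_context.shape, long_context.shape)
```

A 2-step input and a 50-step input both end up as a vector of length `hidden_dim`, which is why long inputs are hard to represent faithfully.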

### Problem with Basic Seq2Seq:
- The fixed-size context vector is a bottleneck
- Information is lost for long sequences
- The entire input must be compressed into a single vector

### Solution: Attention!

In [None]:
# Simple translation task: English numbers to French
data_pairs = [
    ('one', 'un'),
    ('two', 'deux'),
    ('three', 'trois'),
    ('four', 'quatre'),
    ('five', 'cinq'),
    ('six', 'six'),
    ('seven', 'sept'),
    ('eight', 'huit'),
    ('nine', 'neuf'),
    ('ten', 'dix'),
]

# Split the pairs into input and target texts
input_texts = [pair[0] for pair in data_pairs]
target_texts = [pair[1] for pair in data_pairs]

# Build character-level vocabularies
input_chars = sorted(set(''.join(input_texts)))
target_chars = sorted(set(''.join(target_texts)))
target_chars = ['\t'] + target_chars + ['\n']  # Add start ('\t') and end ('\n') tokens

input_char_to_idx = {char: idx for idx, char in enumerate(input_chars)}
target_char_to_idx = {char: idx for idx, char in enumerate(target_chars)}

print(f"Input vocabulary: {input_chars}")
print(f"Target vocabulary: {target_chars}")
print(f"\nInput vocab size: {len(input_chars)}")
print(f"Target vocab size: {len(target_chars)}")

In [None]:
# Prepare training data
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts]) + 2  # +2 for start/end

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, len(input_chars)),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, len(target_chars)),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, len(target_chars)),
    dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    # Encoder input
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_char_to_idx[char]] = 1.0
    
    # Decoder input (with start token)
    decoder_input_data[i, 0, target_char_to_idx['\t']] = 1.0
    for t, char in enumerate(target_text):
        decoder_input_data[i, t + 1, target_char_to_idx[char]] = 1.0
    
    # Decoder target (shifted by one)
    for t, char in enumerate(target_text):
        decoder_target_data[i, t, target_char_to_idx[char]] = 1.0
    decoder_target_data[i, len(target_text), target_char_to_idx['\n']] = 1.0

print(f"Encoder input shape: {encoder_input_data.shape}")
print(f"Decoder input shape: {decoder_input_data.shape}")
print(f"Decoder target shape: {decoder_target_data.shape}")
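The decoder input and decoder target built above differ only by a one-step shift, a setup usually called teacher forcing: at each step the decoder receives the correct previous character and must predict the next one. The sketch below shows that shift on plain character lists; `teacher_forcing_pair` is a hypothetical helper, not part of the lab code.

```python
START, END = '\t', '\n'  # same start/end markers as the cells above

def teacher_forcing_pair(target_text):
    """Return (decoder_input, decoder_target) as lists of characters."""
    decoder_input = [START] + list(target_text)
    decoder_target = list(target_text) + [END]
    return decoder_input, decoder_target

dec_in, dec_out = teacher_forcing_pair('un')
for step, (given, expected) in enumerate(zip(dec_in, dec_out)):
    print(f"step {step}: input {given!r} -> target {expected!r}")
```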

In [None]:
# Build Seq2Seq model
latent_dim = 32

# Encoder
encoder_inputs = layers.Input(shape=(None, len(input_chars)))
encoder = layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = layers.Input(shape=(None, len(target_chars)))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = layers.Dense(len(target_chars), activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Model
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

In [None]:
# Train
history = model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=2,
    epochs=100,
    validation_split=0.2,  # Keras holds out the last 20% of samples (here 'nine' and 'ten')
    verbose=0
)

print("Training complete!")
print(f"Final training accuracy: {history.history['accuracy'][-1]:.3f}")
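Training alone does not produce translations. At inference time the decoder has to generate one character at a time, feeding each prediction back in. The sketch below shows greedy decoding in plain Python under a simplifying assumption: `step_fn` is a hypothetical callable standing in for the trained decoder, and `toy_step` is a hard-coded stub, not a real model.

```python
def greedy_decode(step_fn, start_token, end_token, max_len):
    """Generate tokens one at a time, always taking the most probable next token.

    step_fn(prefix) is a hypothetical callable that returns a dict mapping
    each candidate next token to its probability, given the tokens so far.
    """
    prefix = [start_token]
    for _ in range(max_len):
        probs = step_fn(prefix)
        next_token = max(probs, key=probs.get)
        if next_token == end_token:
            break
        prefix.append(next_token)
    return ''.join(prefix[1:])  # drop the start token

# Toy stand-in for a trained decoder: it always spells "un" and then stops.
def toy_step(prefix):
    script = {0: 'u', 1: 'n', 2: '\n'}
    best = script[len(prefix) - 1]
    return {ch: (0.9 if ch == best else 0.05) for ch in ['u', 'n', '\n']}

print(greedy_decode(toy_step, '\t', '\n', max_len=5))
```

With the Keras model above, `step_fn` would wrap a call to the decoder LSTM and dense layer while carrying the LSTM state between steps. That wiring is left out here.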

## Part 2: Attention Mechanism

**Attention** allows the decoder to focus on relevant parts of the input.

### Key Idea:
Instead of compressing the entire input into a single fixed-size vector:
- Keep all encoder hidden states
- At each decoding step, compute attention weights
- Create context vector as weighted sum of encoder states
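The three steps above can be traced by hand on a tiny example. This sketch assumes a plain dot product as the scoring function (the layer built in this part uses a learned score instead) and uses made-up encoder and decoder states.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Three encoder hidden states (one per input position), hidden size 2.
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])  # current decoder state (the query)

scores = encoder_states @ decoder_state      # one score per input position
weights = softmax(scores)                    # attention weights, sum to 1
context = weights @ encoder_states           # weighted sum of encoder states

print("weights:", np.round(weights, 3))
print("context:", np.round(context, 3))
```

Positions 0 and 2 align with the query and receive most of the weight, so the context vector is dominated by their states.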

### Attention Formula:
$$\text{attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- $Q$: Query (decoder state)
- $K$: Keys (encoder states)
- $V$: Values (encoder states)
- $d_k$: Dimension of keys
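The $\sqrt{d_k}$ factor in the formula has a simple statistical motivation: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$. The sketch below checks this empirically with random vectors; `dot_product_std` is an illustrative helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=10_000):
    """Empirical std of q.k for unit-variance random q and k, before and after scaling."""
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    dots = np.sum(q * k, axis=1)
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (4, 64, 512):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:5.2f}")
```

The raw spread grows roughly like $\sqrt{d_k}$, while the scaled scores stay near unit spread, which keeps the softmax out of its saturated region.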

In [None]:
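# Implement an additive (Bahdanau-style) attention layer.
# Note: it scores with V(tanh(W1(query) + W2(values))) rather than
# the scaled dot product shown in the formula above.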
class AttentionLayer(layers.Layer):
    def __init__(self, units):
        super(AttentionLayer, self).__init__()
        self.W1 = layers.Dense(units)
        self.W2 = layers.Dense(units)
        self.V = layers.Dense(1)
    
    def call(self, query, values):
        # query shape: (batch, hidden_dim)
        # values shape: (batch, seq_len, hidden_dim)
        
        # Expand query to match values shape
        query_with_time = tf.expand_dims(query, 1)
        
        # Compute attention scores
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time) + self.W2(values)
        ))
        
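        # Attention weights: softmax over the sequence axis
        # score and attention_weights shape: (batch, seq_len, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        
        # Context vector: weighted sum of values over the sequence axis
        # context_vector shape after the sum: (batch, hidden_dim)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)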
        
        return context_vector, attention_weights

print("Attention layer implemented!")

In [None]:
# Build Seq2Seq with Attention
class Seq2SeqAttention(keras.Model):
    def __init__(self, vocab_size, embedding_dim, units):
        super(Seq2SeqAttention, self).__init__()
        self.units = units
        
        # Encoder
        self.encoder_embedding = layers.Embedding(vocab_size, embedding_dim)
        self.encoder_lstm = layers.LSTM(units, return_sequences=True, return_state=True)
        
        # Attention
        self.attention = AttentionLayer(units)
        
        # Decoder
        self.decoder_embedding = layers.Embedding(vocab_size, embedding_dim)
        self.decoder_lstm = layers.LSTM(units, return_sequences=True, return_state=True)
        self.decoder_dense = layers.Dense(vocab_size)
    
    def call(self, inputs):
        encoder_input, decoder_input = inputs
        
        # Encode
        encoder_emb = self.encoder_embedding(encoder_input)
        encoder_output, state_h, state_c = self.encoder_lstm(encoder_emb)
        
        # Decode
        decoder_emb = self.decoder_embedding(decoder_input)
        decoder_output, _, _ = self.decoder_lstm(
            decoder_emb, initial_state=[state_h, state_c]
        )
        
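        # NOTE: attention is NOT applied in this simplified forward pass;
        # self.attention is defined but unused. A full implementation would
        # call it at each decoder step (decoder state as query, encoder_output
        # as values) and combine the context vector with the decoder output
        # before the dense layer.
        output = self.decoder_dense(decoder_output)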
        
        return output

print("Seq2Seq with Attention model defined!")
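The forward pass above leaves out the attention step. As a rough sketch of what that step could look like, the NumPy code below computes additive attention for every decoder step at once. It makes several assumptions: random matrices stand in for the learned `W1`, `W2`, and `V` projections, biases are omitted, and the context is concatenated with the decoder output, which is one common way to combine them but not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, src_len, tgt_len, units = 2, 5, 3, 4

encoder_output = rng.normal(size=(batch, src_len, units))  # all encoder states
decoder_output = rng.normal(size=(batch, tgt_len, units))  # one query per decoder step

# Random stand-ins for the learned W1, W2 and V projections of AttentionLayer.
W1 = rng.normal(size=(units, units))
W2 = rng.normal(size=(units, units))
v = rng.normal(size=(units,))

# Additive score for every (decoder step, encoder position) pair.
q_proj = (decoder_output @ W1)[:, :, np.newaxis, :]   # (batch, tgt_len, 1, units)
k_proj = (encoder_output @ W2)[:, np.newaxis, :, :]   # (batch, 1, src_len, units)
scores = np.tanh(q_proj + k_proj) @ v                 # (batch, tgt_len, src_len)

# Softmax over encoder positions, then a weighted sum of encoder states.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ encoder_output                    # (batch, tgt_len, units)

# One common choice: concatenate context and decoder output before the dense layer.
combined = np.concatenate([context, decoder_output], axis=-1)
print(weights.shape, context.shape, combined.shape)
```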

## Part 3: Self-Attention

**Self-attention** computes attention within the same sequence.

### Process:
1. Create Query (Q), Key (K), Value (V) from input
2. Compute attention scores: $QK^T$
3. Scale by $\sqrt{d_k}$
4. Apply softmax
5. Multiply by Values (V)

### Result:
Each position attends to every position in the sequence, including itself!
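The five steps can be followed on a single small sequence. This NumPy sketch assumes random, untrained projection matrices (`W_q`, `W_k`, `W_v` are illustrative names) and is meant to show that Q, K, and V all come from the same input.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 6, 3

x = rng.normal(size=(seq_len, d_model))       # one sequence of 4 token vectors

# Step 1: Q, K and V are all projections of the same input x.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Steps 2-3: pairwise scores, scaled by sqrt(d_k).
scores = Q @ K.T / np.sqrt(d_k)               # (seq_len, seq_len)

# Step 4: softmax over key positions (each row sums to 1).
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 5: each output row is a weighted mix of all value vectors.
output = weights @ V                          # (seq_len, d_k)

print(weights.shape, output.shape)
```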

In [None]:
# Implement Scaled Dot-Product Attention
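def scaled_dot_product_attention(q, k, v, mask=None):
    """
    Compute scaled dot-product attention.
    q, k, v: shape (..., seq_len, d_k), with any leading batch/head dimensions.
    mask: optional tensor broadcastable to (..., seq_len_q, seq_len_k),
          with 1 at positions to block.
    Returns the attended output and the attention weights.
    """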
    # Compute attention scores
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    
    # Scale
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(d_k)
    
    # Apply mask (if provided): positions where mask == 1 get a large
    # negative logit, so softmax assigns them a weight of ~0
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
    
    # Softmax
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    
    # Apply to values
    output = tf.matmul(attention_weights, v)
    
    return output, attention_weights

# Test
seq_len = 5
d_k = 8
batch_size = 2

q = tf.random.normal((batch_size, seq_len, d_k))
k = tf.random.normal((batch_size, seq_len, d_k))
v = tf.random.normal((batch_size, seq_len, d_k))

output, weights = scaled_dot_product_attention(q, k, v)

print(f"Input shape: {q.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

In [None]:
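# Visualize attention weights for the first example in the batch
# (q and k are random here, so the pattern itself carries no meaning)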
sample_weights = weights[0].numpy()

plt.figure(figsize=(8, 6))
sns.heatmap(sample_weights, annot=True, fmt='.2f', cmap='viridis',
           xticklabels=range(seq_len), yticklabels=range(seq_len))
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.title('Self-Attention Weights')
plt.show()

print("Each row shows which positions a query attends to.")
print("Brighter colors = stronger attention.")
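The `mask` argument of `scaled_dot_product_attention` is not exercised in the test above. One common use is a look-ahead mask that stops each position from attending to later positions. The NumPy sketch below follows the same convention as the function (1 marks a blocked position, which receives a logit of -1e9) and uses uniform scores so the effect is easy to read.

```python
import numpy as np

seq_len = 4

# 1 marks a position to block: everything above the diagonal (the "future").
look_ahead_mask = 1 - np.tril(np.ones((seq_len, seq_len)))

scores = np.zeros((seq_len, seq_len))          # uniform scores, for clarity
masked_scores = scores + look_ahead_mask * -1e9

weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
```

Row `t` spreads its weight evenly over positions `0..t` and gives zero weight to everything after it.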

## Part 4: Multi-Head Attention

**Multi-head attention** runs multiple attention operations in parallel.

### Benefits:
- Attend to different aspects simultaneously
- Learn different relationships
- More expressive than a single attention head

### Process:
1. Linear projections for each head
2. Apply attention in parallel
3. Concatenate outputs
4. Final linear projection
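Steps 1 to 3 depend on a reshape and transpose that are easy to get wrong. The NumPy sketch below traces only the shapes, with no learned projections, and checks that splitting into heads and concatenating back is lossless. The dimensions are example values.

```python
import numpy as np

batch, seq_len, d_model, num_heads = 2, 10, 64, 8
depth = d_model // num_heads

x = np.arange(batch * seq_len * d_model, dtype=np.float32).reshape(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
heads = x.reshape(batch, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

# Concatenate: undo the transpose, then merge the last two axes again.
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)

print(heads.shape, merged.shape, np.array_equal(merged, x))
```

Each head sees a contiguous slice of `depth` features for every position, so head 0 works on features `0..depth-1`, head 1 on the next `depth`, and so on.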

In [None]:
# Implement Multi-Head Attention
class MultiHeadAttention(layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        
        assert d_model % num_heads == 0
        
        self.depth = d_model // num_heads
        
        self.wq = layers.Dense(d_model)
        self.wk = layers.Dense(d_model)
        self.wv = layers.Dense(d_model)
        
        self.dense = layers.Dense(d_model)
    
    def split_heads(self, x, batch_size):
        """Split last dimension into (num_heads, depth)."""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, v, k, q, mask=None):
        batch_size = tf.shape(q)[0]
        
        # Linear projections
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        
        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        
        # Attention
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask
        )
        
        # Concatenate heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, 
                                     (batch_size, -1, self.d_model))
        
        # Final linear
        output = self.dense(concat_attention)
        
        return output, attention_weights

# Test
d_model = 64
num_heads = 8
mha = MultiHeadAttention(d_model, num_heads)

x = tf.random.normal((2, 10, d_model))
output, weights = mha(x, x, x)

print(f"Multi-Head Attention output shape: {output.shape}")
print(f"Number of heads: {num_heads}")
print(f"Depth per head: {d_model // num_heads}")

## Part 5: Positional Encoding

**Positional encoding** adds position information to embeddings.

### Why needed?
- Attention has no notion of order
- Need to inject position information

### Formula:
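$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$$

Here $pos$ is the position in the sequence, $i$ indexes each sine/cosine pair of dimensions, and $d$ is the embedding dimension (`d_model` in the code).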
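Evaluating the formula entry by entry for a very small dimension makes the even/odd pattern visible. This sketch is a direct, unvectorized transcription of the two equations, and `pe_entry` is an illustrative helper name.

```python
import numpy as np

def pe_entry(pos, j, d):
    """One entry of the encoding: sin for even j = 2i, cos for odd j = 2i + 1."""
    i = j // 2
    angle = pos / (10000 ** (2 * i / d))
    return np.sin(angle) if j % 2 == 0 else np.cos(angle)

d = 4
for pos in range(3):
    row = [pe_entry(pos, j, d) for j in range(d)]
    print(pos, np.round(row, 4))
```

At position 0 every angle is zero, so the row alternates between `sin(0) = 0` and `cos(0) = 1`. Later positions rotate the low-index pairs quickly and the high-index pairs slowly.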

In [None]:
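def positional_encoding(position, d_model):
    """
    Create sinusoidal positional encodings.
    position: number of positions to encode (rows).
    d_model: embedding dimension (columns).
    Returns a float32 tensor of shape (1, position, d_model).
    """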
    angle_rads = np.arange(position)[:, np.newaxis] / np.power(
        10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model)
    )
    
    # Apply sin to even indices
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    
    # Apply cos to odd indices
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    pos_encoding = angle_rads[np.newaxis, ...]
    
    return tf.cast(pos_encoding, dtype=tf.float32)

# Visualize
pos_encoding = positional_encoding(50, 128)

plt.figure(figsize=(12, 6))
plt.pcolormesh(pos_encoding[0], cmap='RdBu')
plt.xlabel('Depth')
plt.ylabel('Position')
plt.colorbar()
plt.title('Positional Encoding')
plt.show()

print("Positional encoding adds unique patterns for each position.")

## Key Takeaways

1. **Seq2Seq** uses an encoder-decoder architecture
2. **Attention** lets the decoder focus on the relevant parts of the input
3. **Self-attention** relates different positions within the same sequence
4. **Scaling by $\sqrt{d_k}$** keeps large dot products from saturating the softmax, where gradients become very small
5. **Multi-head attention** captures different relationships in parallel
6. **Positional encoding** adds order information
7. **Attention** is the foundation of Transformers

## Exercises

1. **Bidirectional Encoder**: Implement bidirectional LSTM encoder
2. **Beam Search**: Add beam search decoding
3. **Attention Visualization**: Visualize attention for translation
4. **Different Attention**: Implement Luong attention
5. **Longer Sequences**: Train on longer translation pairs
6. **Transformer Block**: Combine MHA with feed-forward

## Next Steps

In Lab 4, we'll explore:
- Complete Transformer architecture
- BERT and GPT models
- Pre-training and fine-tuning
- Using Hugging Face Transformers

Excellent work! You now understand attention mechanisms that power modern NLP.