# Day 68: Attention Mechanisms and Transformers

## Introduction

Welcome to Day 68 of the 100 Days of Machine Learning challenge! Today, we explore one of the most revolutionary breakthroughs in modern machine learning: **Attention Mechanisms** and **Transformers**. These concepts have fundamentally changed how we approach sequence modeling, natural language processing, and even computer vision.

Traditional recurrent neural networks (RNNs), including Long Short-Term Memory networks (LSTMs), process sequences step by step, which makes long-range dependencies hard to capture. Attention mechanisms address this problem by allowing models to focus on different parts of the input sequence when producing each output, regardless of position. The Transformer architecture, introduced in the seminal 2017 paper "Attention Is All You Need," takes this concept further by relying entirely on attention mechanisms, eliminating recurrence altogether.

### Why Are Attention Mechanisms and Transformers Important?

Attention mechanisms and transformers have revolutionized machine learning for several compelling reasons:

1. **Long-Range Dependencies**: Any two positions in a sequence are connected by a single attention step, so information does not have to survive many recurrent updates as it does in RNNs.
2. **Parallelization**: Unlike RNNs, transformers process all positions in a sequence simultaneously during training, dramatically reducing training time.
3. **State-of-the-Art Performance**: Transformers power modern language models like BERT, GPT, and many others that have achieved unprecedented performance across NLP tasks.
4. **Versatility**: Beyond NLP, transformers have been successfully applied to computer vision (Vision Transformers), speech recognition, protein structure prediction (AlphaFold), and more.

### Learning Objectives

By the end of this lesson, you will be able to:

- Understand the motivation behind attention mechanisms and how they differ from recurrent architectures
- Explain the mathematical foundations of self-attention and scaled dot-product attention
- Comprehend the transformer architecture and its key components
- Implement a basic attention mechanism from scratch in Python
- Work with pre-trained transformer models using the Hugging Face library
- Visualize attention patterns to interpret model behavior

## Theory: Understanding Attention Mechanisms

### The Motivation for Attention

Consider the task of machine translation. When translating "The cat sat on the mat" to another language, different words in the output depend on different parts of the input. The translation of "cat" depends primarily on "cat" in the input, but it also needs context from "The" and potentially "sat."

In traditional encoder-decoder architectures, the encoder compresses the entire input sequence into a single fixed-size context vector. This creates an information bottleneck, especially for long sequences. **Attention mechanisms** address this by allowing the decoder to "attend to" different parts of the input sequence for each output step.

### Self-Attention: The Core Concept

Self-attention allows a sequence to attend to itself, computing representations that capture relationships between all positions in the sequence. For each position, self-attention computes a weighted sum over the representations of all positions, where the weights indicate how much that position should "pay attention to" each position in the sequence, including itself.

### Mathematical Foundation

#### Query, Key, and Value

Self-attention is built on three concepts:

- **Query (Q)**: What am I looking for?
- **Key (K)**: What do I contain?
- **Value (V)**: What information do I actually hold?

For each input element, we create three vectors by multiplying the input with learned weight matrices $W^Q$, $W^K$, and $W^V$:

$$Q = XW^Q$$
$$K = XW^K$$
$$V = XW^V$$

where $X$ is the input matrix and each row represents a position in the sequence.
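
As a minimal sketch of these projections, the NumPy snippet below builds $Q$, $K$, and $V$ from a small input matrix. The sizes and the random matrices standing in for the learned weights are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the lesson)
seq_len, d_model, d_k, d_v = 4, 6, 3, 3

X = rng.normal(size=(seq_len, d_model))   # one row per sequence position

# Random stand-ins for the learned weight matrices W^Q, W^K, W^V
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q = X @ W_Q   # queries, one row per position
K = X @ W_K   # keys
V = X @ W_V   # values

print(Q.shape, K.shape, V.shape)   # (4, 3) (4, 3) (4, 3)
```

Each row of `Q`, `K`, and `V` depends only on the matching row of `X`, so the projections alone do not mix information across positions. The mixing happens in the attention step.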

#### Scaled Dot-Product Attention

The attention mechanism computes attention scores, which determine how much focus to place on other parts of the input when encoding a specific position. The formula for scaled dot-product attention is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let's break this down:

1. **$QK^T$**: Compute the dot product between queries and keys to get raw attention scores. This measures the similarity between the query and each key.

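2. **$\frac{1}{\sqrt{d_k}}$**: Scale by the square root of the key dimension $d_k$. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so dividing by $\sqrt{d_k}$ brings it back to unit variance. This prevents the dot products from becoming too large, which would push the softmax into regions with extremely small gradients.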

3. **$\text{softmax}$**: Apply softmax to normalize the scores into a probability distribution. Each position gets a weight between 0 and 1, and all weights sum to 1.

4. **Multiply by $V$**: Use these attention weights to compute a weighted sum of the value vectors.
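
The four steps above can be traced on a tiny example. This is a sketch in plain NumPy with random inputs. The sizes and the helper name `softmax` are assumptions for illustration. The last few lines check empirically why the scaling factor is $\sqrt{d_k}$.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, d_v = 3, 4, 2                        # illustrative sizes (assumptions)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

raw_scores = Q @ K.T                         # step 1: query-key similarities
scaled_scores = raw_scores / np.sqrt(d_k)    # step 2: scale by sqrt(d_k)
weights = softmax(scaled_scores, axis=-1)    # step 3: each row becomes a distribution
output = weights @ V                         # step 4: weighted sum of value vectors

print(weights.sum(axis=-1))                  # each row sums to 1 (up to rounding)
print(output.shape)                          # (3, 2)

# Why scale? For unit-variance components, the dot product has variance near d_k.
d_big = 64
q = rng.normal(size=(10_000, d_big))
k = rng.normal(size=(10_000, d_big))
dots = (q * k).sum(axis=-1)
print(dots.var(), (dots / np.sqrt(d_big)).var())   # roughly 64 and roughly 1
```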

#### Multi-Head Attention

Instead of performing a single attention function, transformers use **multi-head attention**, which runs multiple attention operations in parallel:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each head is:

$$\text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$

This allows the model to attend to information from different representation subspaces at different positions. For example, one head might focus on syntactic relationships while another focuses on semantic relationships.
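
To make the formula concrete, here is a small NumPy sketch that follows it literally: one set of projection matrices per head, concatenation, then the output projection $W^O$. The sizes and the random weights are assumptions. Practical implementations usually fuse the per-head projections into single matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for 2-D inputs (seq_len, dim)."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 8, 2      # illustrative sizes (assumptions)
d_k = d_model // h                 # per-head dimension

X = rng.normal(size=(seq_len, d_model))   # self-attention: Q = K = V = X

heads = []
for i in range(h):
    # Random stand-ins for the learned per-head matrices W_i^Q, W_i^K, W_i^V
    W_q = rng.normal(size=(d_model, d_k))
    W_k = rng.normal(size=(d_model, d_k))
    W_v = rng.normal(size=(d_model, d_k))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

concat = np.concatenate(heads, axis=-1)       # (seq_len, h * d_k)
W_o = rng.normal(size=(h * d_k, d_model))     # stand-in for W^O
multi_head = concat @ W_o                     # (seq_len, d_model)

print(heads[0].shape, concat.shape, multi_head.shape)   # (5, 4) (5, 8) (5, 8)
```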

### The Transformer Architecture

The transformer architecture consists of:

1. **Encoder**: Processes the input sequence and produces contextualized representations
2. **Decoder**: Generates the output sequence using the encoder's representations

Each encoder layer contains:
- Multi-head self-attention mechanism
- Position-wise feed-forward network
- Layer normalization and residual connections

Each decoder layer contains:
- Masked multi-head self-attention (to prevent attending to future positions)
- Multi-head attention over encoder outputs
- Position-wise feed-forward network
- Layer normalization and residual connections
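
The masking used in the decoder's self-attention can be sketched with a lower-triangular (causal) mask. The NumPy example below is illustrative only: the sizes are assumptions, and the large negative fill value is one common way to block positions before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                      # illustrative sizes (assumptions)
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)

# True where attention is allowed: position i may see positions 0..i only
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Blocked positions get a very negative score, so softmax gives them weight 0
masked_scores = np.where(causal_mask, scores, -1e9)
weights = softmax(masked_scores)

print(np.round(weights, 2))   # entries above the diagonal are 0
```

The first row places all of its weight on position 0, because that is the only position the first token is allowed to see.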

#### Positional Encoding

Since transformers have no inherent notion of sequence order (unlike RNNs), we add **positional encodings** to the input embeddings. These are typically sinusoidal functions:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

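where $pos$ is the position, $i$ indexes the sine/cosine dimension pairs, and $d$ is the model (embedding) dimension. The original paper hypothesized that this choice makes it easy for the model to learn to attend by relative positions, because for any fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$.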
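
The relative-position property of sinusoidal encodings can be checked numerically: shifting a position by a fixed offset $k$ amounts to rotating each (sin, cos) pair by an angle that depends only on $k$. The sketch below verifies this in NumPy. The helper name `sinusoidal_pe` and the sizes are assumptions for illustration.

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """Sinusoidal positional encodings of shape (max_len, d); d must be even."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

d, max_len, k = 16, 50, 7          # illustrative sizes and offset (assumptions)
pe = sinusoidal_pe(max_len, d)
freqs = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))

# Block-diagonal matrix that depends only on the offset k, not on the position
M = np.zeros((d, d))
for i, w in enumerate(freqs):
    c, s = np.cos(w * k), np.sin(w * k)
    M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]

# Applying M to PE(pos) gives PE(pos + k) for every position at once
shifted = pe[:-k] @ M.T
print(np.allclose(shifted, pe[k:]))   # True
```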

## Python Implementation

Let's implement attention mechanisms from scratch to build intuition, then work with modern transformer libraries.

In [1]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")

Libraries imported successfully!
PyTorch version: 2.0.1


### Implementing Scaled Dot-Product Attention from Scratch

Let's build the core attention mechanism step by step.

In [2]:
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.
    
    Parameters:
    -----------
    Q : torch.Tensor
        Query matrix of shape (batch_size, num_queries, d_k)
    K : torch.Tensor
        Key matrix of shape (batch_size, num_keys, d_k)
    V : torch.Tensor
        Value matrix of shape (batch_size, num_keys, d_v)
    mask : torch.Tensor, optional
        Mask tensor to prevent attention to certain positions
    
    Returns:
    --------
    output : torch.Tensor
        Attention output of shape (batch_size, num_queries, d_v)
    attention_weights : torch.Tensor
        Attention weights of shape (batch_size, num_queries, num_keys)
    """
    # Get the dimension of keys
    d_k = Q.size(-1)
    
    # Compute attention scores: Q @ K^T / sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / np.sqrt(d_k)
    
    # Apply mask if provided (set masked positions to large negative value)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Apply softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)
    
    # Compute weighted sum of values
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights

# Test the function with a simple example
batch_size = 1
seq_length = 4
d_k = 8
d_v = 8

# Create random Q, K, V matrices
Q = torch.randn(batch_size, seq_length, d_k)
K = torch.randn(batch_size, seq_length, d_k)
V = torch.randn(batch_size, seq_length, d_v)

# Compute attention
output, attention_weights = scaled_dot_product_attention(Q, K, V)

print(f"Input shape (Q, K, V): {Q.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
print(f"\nAttention weights (should sum to 1 across last dimension):")
print(attention_weights[0].detach().numpy())
print(f"\nSum of attention weights per row: {attention_weights[0].sum(dim=-1).detach().numpy()}")

Input shape (Q, K, V): torch.Size([1, 4, 8])
Output shape: torch.Size([1, 4, 8])
Attention weights shape: torch.Size([1, 4, 4])

Attention weights (should sum to 1 across last dimension):
[[0.24681453 0.26893723 0.25156406 0.23268418]
 [0.24450386 0.24914244 0.26014507 0.24620863]
 [0.23794635 0.25784355 0.25535721 0.24885289]
 [0.25842696 0.24587208 0.24839835 0.24730261]]

Sum of attention weights per row: [1. 1. 1. 1.]
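
As a sanity check, a from-scratch implementation like the one above can be compared with PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`, which is available in PyTorch 2.0 and later. The sketch below is self-contained, so it redefines a compact version under the hypothetical name `manual_attention`. The tensor sizes and tolerance are assumptions.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def manual_attention(Q, K, V, mask=None):
    """Compact scaled dot-product attention; positions where mask == 0 are blocked."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    return F.softmax(scores, dim=-1) @ V

# Shape (batch, heads, seq_len, d_k); sizes are illustrative assumptions
Q = torch.randn(1, 1, 4, 8)
K = torch.randn(1, 1, 4, 8)
V = torch.randn(1, 1, 4, 8)

# Unmasked attention
manual = manual_attention(Q, K, V)
builtin = F.scaled_dot_product_attention(Q, K, V)
print(torch.allclose(manual, builtin, atol=1e-4))

# Causal attention: a lower-triangular mask vs. the built-in is_causal flag
causal = torch.tril(torch.ones(4, 4))
manual_causal = manual_attention(Q, K, V, mask=causal)
builtin_causal = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
print(torch.allclose(manual_causal, builtin_causal, atol=1e-4))
```

The built-in function returns only the output, not the attention weights, so the from-scratch version remains useful when you want to visualize the weights.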


### Visualizing Attention Patterns

Let's visualize the attention weights to understand what the mechanism is "paying attention to".

In [3]:
def visualize_attention(attention_weights, sentence=None):
    """
    Visualize attention weights as a heatmap.
    
    Parameters:
    -----------
    attention_weights : torch.Tensor or numpy.ndarray
        Attention weights of shape (seq_len, seq_len)
    sentence : list of str, optional
        List of tokens to use as labels
    """
    if isinstance(attention_weights, torch.Tensor):
        attention_weights = attention_weights.detach().numpy()
    
    plt.figure(figsize=(10, 8))
    
    if sentence is not None:
        sns.heatmap(attention_weights, annot=True, fmt='.2f', cmap='YlOrRd',
                    xticklabels=sentence, yticklabels=sentence,
                    cbar_kws={'label': 'Attention Weight'})
        plt.xlabel('Key Position', fontsize=12)
        plt.ylabel('Query Position', fontsize=12)
    else:
        sns.heatmap(attention_weights, annot=True, fmt='.2f', cmap='YlOrRd',
                    cbar_kws={'label': 'Attention Weight'})
        plt.xlabel('Key Position (Index)', fontsize=12)
        plt.ylabel('Query Position (Index)', fontsize=12)
    
    plt.title('Attention Weights Heatmap', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Create a simple example with labeled tokens
# (Q, K, V are random here, so the resulting pattern is arbitrary)
sentence = ['The', 'cat', 'sat', 'down']
seq_len = len(sentence)

# Create Q, K, V for this sentence
Q_sent = torch.randn(1, seq_len, d_k)
K_sent = torch.randn(1, seq_len, d_k)
V_sent = torch.randn(1, seq_len, d_v)

# Compute attention
output_sent, attention_weights_sent = scaled_dot_product_attention(Q_sent, K_sent, V_sent)

# Visualize
visualize_attention(attention_weights_sent[0], sentence)

<Figure size 1000x800 with 2 subplots>

The heatmap shows how much each query position (row) attends to each key position (column). Darker colors indicate higher attention weights. Because Q, K, and V are random in this example, the pattern itself carries no linguistic meaning. In a real transformer trained on language data, you might see patterns like:
- Nouns attending to their articles ("cat" attending to "The")
- Verbs attending to their subjects ("sat" attending to "cat")
- Large diagonal elements, showing tokens attending to themselves

### Implementing Multi-Head Attention

Now let's implement multi-head attention, which allows the model to attend to different aspects of the input simultaneously.

In [4]:
class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention mechanism.
    """
    def __init__(self, d_model, num_heads):
        """
        Parameters:
        -----------
        d_model : int
            Dimensionality of the model (must be divisible by num_heads)
        num_heads : int
            Number of attention heads
        """
        super(MultiHeadAttention, self).__init__()
        
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear layers for Q, K, V transformations
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
        # Output linear layer
        self.W_o = nn.Linear(d_model, d_model)
    
    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (num_heads, d_k).
        Transpose to get shape (batch_size, num_heads, seq_len, d_k)
        """
        x = x.view(batch_size, -1, self.num_heads, self.d_k)
        return x.transpose(1, 2)
    
    def forward(self, Q, K, V, mask=None):
        """
        Forward pass for multi-head attention.
        
        Parameters:
        -----------
        Q, K, V : torch.Tensor
            Input tensors of shape (batch_size, seq_len, d_model)
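        mask : torch.Tensor, optional
            Mask broadcastable to (batch_size, num_heads, seq_len, seq_len);
            positions where mask == 0 are blocked from attention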
        
        Returns:
        --------
        output : torch.Tensor
            Output of shape (batch_size, seq_len, d_model)
        attention_weights : torch.Tensor
            Attention weights for visualization
        """
        batch_size = Q.size(0)
        
        # Linear transformations and split into heads
        Q = self.split_heads(self.W_q(Q), batch_size)
        K = self.split_heads(self.W_k(K), batch_size)
        V = self.split_heads(self.W_v(V), batch_size)
        
        # Apply scaled dot-product attention for each head
        attention_output, attention_weights = scaled_dot_product_attention(Q, K, V, mask)
        
        # Concatenate heads
        attention_output = attention_output.transpose(1, 2).contiguous()
        attention_output = attention_output.view(batch_size, -1, self.d_model)
        
        # Final linear transformation
        output = self.W_o(attention_output)
        
        return output, attention_weights

# Test the multi-head attention module
d_model = 512
num_heads = 8
seq_length = 10
batch_size = 2

mha = MultiHeadAttention(d_model, num_heads)

# Create sample input
sample_input = torch.randn(batch_size, seq_length, d_model)

# Forward pass
output, attn_weights = mha(sample_input, sample_input, sample_input)

print(f"Input shape: {sample_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f"\nMulti-head attention implementation successful!")

Input shape: torch.Size([2, 10, 512])
Output shape: torch.Size([2, 10, 512])
Attention weights shape: torch.Size([2, 8, 10, 10])

Multi-head attention implementation successful!
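
For comparison, PyTorch ships its own `nn.MultiheadAttention` module. The sketch below runs it with the same sizes as the test above and asks for per-head weights. It assumes a PyTorch version that supports `batch_first` and `average_attn_weights` (both are present in PyTorch 2.0).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads, seq_length, batch_size = 512, 8, 10, 2

# batch_first=True makes inputs (batch, seq_len, d_model), as in this lesson
builtin_mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads,
                                    batch_first=True)
builtin_mha.eval()

x = torch.randn(batch_size, seq_length, d_model)

with torch.no_grad():
    # average_attn_weights=False keeps one weight matrix per head
    out, weights = builtin_mha(x, x, x, need_weights=True,
                               average_attn_weights=False)

n_params = sum(p.numel() for p in builtin_mha.parameters())

print(out.shape)       # torch.Size([2, 10, 512])
print(weights.shape)   # torch.Size([2, 8, 10, 10])
print(f"{n_params:,}") # 1,050,624
```

The parameter count matches four `d_model x d_model` linear layers with biases, which is the same layout as the from-scratch `MultiHeadAttention` class. The built-in module simply stores the Q, K, and V projections in one fused weight matrix.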


### Understanding Positional Encoding

Since transformers don't have any inherent notion of sequence order, we need to inject positional information. Let's implement and visualize positional encodings.

In [5]:
class PositionalEncoding(nn.Module):
    """
    Implements positional encoding using sinusoidal functions.
    """
    def __init__(self, d_model, max_len=5000):
        """
        Parameters:
        -----------
        d_model : int
            Dimensionality of the model
        max_len : int
            Maximum sequence length
        """
        super(PositionalEncoding, self).__init__()
        
        # Create a matrix to hold positional encodings
        pe = torch.zeros(max_len, d_model)
        
        # Create position indices
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        
        # Create division term for the sinusoidal functions
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        
        # Apply sin to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        
        # Apply cos to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add batch dimension
        pe = pe.unsqueeze(0)
        
        # Register as buffer (not a parameter, but should be saved with the model)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        """
        Add positional encoding to input embeddings.
        
        Parameters:
        -----------
        x : torch.Tensor
            Input tensor of shape (batch_size, seq_len, d_model)
        
        Returns:
        --------
        torch.Tensor
            Input with added positional encoding
        """
        x = x + self.pe[:, :x.size(1), :]
        return x

# Create and visualize positional encodings
d_model = 128
max_len = 100

pos_encoding = PositionalEncoding(d_model, max_len)

# Get the positional encoding matrix
pe_matrix = pos_encoding.pe[0].numpy()

# Visualize
plt.figure(figsize=(14, 6))

# Plot the full positional encoding matrix
plt.subplot(1, 2, 1)
plt.imshow(pe_matrix, cmap='RdBu', aspect='auto')
plt.colorbar(label='Value')
plt.xlabel('Embedding Dimension', fontsize=12)
plt.ylabel('Position', fontsize=12)
plt.title('Positional Encoding Matrix', fontsize=14, fontweight='bold')

# Plot specific dimensions over positions
plt.subplot(1, 2, 2)
positions = range(max_len)
plt.plot(positions, pe_matrix[:, 4], label='Dimension 4 (sin)', linewidth=2)
plt.plot(positions, pe_matrix[:, 5], label='Dimension 5 (cos)', linewidth=2)
plt.plot(positions, pe_matrix[:, 8], label='Dimension 8 (sin)', linewidth=2)
plt.plot(positions, pe_matrix[:, 9], label='Dimension 9 (cos)', linewidth=2)
plt.xlabel('Position', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.title('Positional Encoding Values Across Positions', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Positional encoding shape:", pe_matrix.shape)
print("\nThe sinusoidal positional encodings are designed to:")
print("1. Make it easy to learn to attend by relative positions")
print("2. Possibly extrapolate to sequence lengths longer than those seen during training")
print("3. Provide a distinct encoding for each position")

<Figure size 1400x600 with 4 subplots>

Positional encoding shape: (100, 128)

The sinusoidal positional encodings are designed to:
1. Make it easy to learn to attend by relative positions
2. Possibly extrapolate to sequence lengths longer than those seen during training
3. Provide a distinct encoding for each position


## Working with Pre-trained Transformers

Now that we understand the mechanics, let's work with real transformer models using the Hugging Face Transformers library.

In [6]:
# Note: You may need to install transformers first
# !pip install transformers

try:
    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
    print("Transformers library loaded successfully!")
    transformers_available = True
except ImportError:
    print("Transformers library not available. Skipping this section.")
    print("To install: pip install transformers")
    transformers_available = False

Transformers library not available. Skipping this section.
To install: pip install transformers


In [7]:
if transformers_available:
    # Load a pre-trained BERT model and tokenizer
    print("Loading BERT model and tokenizer...")
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
    
    # Put model in evaluation mode
    model.eval()
    
    print("Model loaded successfully!")
    print(f"Model has {sum(p.numel() for p in model.parameters()):,} parameters")
    
    # Example sentence
    sentence = "The transformer architecture revolutionized natural language processing."
    
    # Tokenize
    inputs = tokenizer(sentence, return_tensors='pt')
    
    print(f"\nOriginal sentence: {sentence}")
    print(f"Tokenized: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")
    print(f"Token IDs shape: {inputs['input_ids'].shape}")
    
    # Get model outputs
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Extract outputs
    last_hidden_state = outputs.last_hidden_state
    attentions = outputs.attentions
    
    print(f"\nOutput shape (last hidden state): {last_hidden_state.shape}")
    print(f"Number of attention layers: {len(attentions)}")
    print(f"Attention shape (layer 0): {attentions[0].shape}")
    print(f"  - Batch size: {attentions[0].shape[0]}")
    print(f"  - Number of heads: {attentions[0].shape[1]}")
    print(f"  - Sequence length: {attentions[0].shape[2]}")
else:
    print("Skipping section - transformers library not available")

Skipping section - transformers library not available


### Visualizing BERT's Attention

Let's visualize the attention patterns from different heads and layers of BERT.

In [8]:
if transformers_available:
    def visualize_bert_attention(attentions, tokens, layer=0, head=0):
        """
        Visualize attention weights from a specific layer and head.
        
        Parameters:
        -----------
        attentions : tuple of torch.Tensor
            Attention tensors from BERT model
        tokens : list of str
            List of tokens
        layer : int
            Which layer to visualize
        head : int
            Which attention head to visualize
        """
        # Get attention weights for specified layer and head
        attention = attentions[layer][0, head].detach().numpy()
        
        plt.figure(figsize=(10, 8))
        sns.heatmap(attention, xticklabels=tokens, yticklabels=tokens,
                    cmap='viridis', cbar_kws={'label': 'Attention Weight'})
        plt.xlabel('Key Tokens', fontsize=12)
        plt.ylabel('Query Tokens', fontsize=12)
        plt.title(f'BERT Attention Weights - Layer {layer}, Head {head}',
                  fontsize=14, fontweight='bold')
        plt.xticks(rotation=45, ha='right')
        plt.yticks(rotation=0)
        plt.tight_layout()
        plt.show()
    
    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    
    # Visualize attention from first layer, first head
    print("Attention patterns from BERT (Layer 0, Head 0):")
    visualize_bert_attention(attentions, tokens, layer=0, head=0)
    
    # Visualize attention from a middle layer
    print("\nAttention patterns from BERT (Layer 6, Head 3):")
    visualize_bert_attention(attentions, tokens, layer=6, head=3)
    
    print("\nNotice how different layers and heads learn different patterns!")
    print("Early layers often attend to nearby tokens (local patterns)")
    print("Later layers often attend to semantically related tokens (global patterns)")
else:
    print("Skipping section - transformers library not available")

Skipping section - transformers library not available


## Hands-On Activity: Sentence Similarity with Transformers

Let's use transformer embeddings to compute semantic similarity between sentences.

In [9]:
if transformers_available:
    from sklearn.metrics.pairwise import cosine_similarity
    
    def get_sentence_embedding(sentence, tokenizer, model):
        """
        Get sentence embedding by averaging token embeddings.
        
        Parameters:
        -----------
        sentence : str
            Input sentence
        tokenizer : transformers.PreTrainedTokenizer
            Tokenizer
        model : transformers.PreTrainedModel
            Pre-trained transformer model
        
        Returns:
        --------
        numpy.ndarray
            Sentence embedding vector
        """
        # Tokenize
        inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True)
        
        # Get model outputs
        with torch.no_grad():
            outputs = model(**inputs)
        
        # Average the token embeddings, excluding [CLS] and [SEP]
        # (the 1:-1 slice assumes a single, unpadded sentence)
        embeddings = outputs.last_hidden_state[0, 1:-1, :].mean(dim=0)
        
        return embeddings.numpy()
    
    # Example sentences
    sentences = [
        "The cat sat on the mat.",
        "A feline rested on the rug.",
        "The dog barked loudly.",
        "Machine learning is fascinating.",
        "Deep learning models are powerful."
    ]
    
    # Get embeddings for all sentences
    embeddings = []
    for sent in sentences:
        emb = get_sentence_embedding(sent, tokenizer, model)
        embeddings.append(emb)
    
    embeddings = np.array(embeddings)
    
    # Compute pairwise cosine similarities
    similarities = cosine_similarity(embeddings)
    
    # Visualize similarity matrix
    plt.figure(figsize=(10, 8))
    sns.heatmap(similarities, annot=True, fmt='.3f', cmap='coolwarm',
                xticklabels=[f"S{i+1}" for i in range(len(sentences))],
                yticklabels=[f"S{i+1}" for i in range(len(sentences))],
                vmin=0, vmax=1, cbar_kws={'label': 'Cosine Similarity'})
    plt.title('Sentence Similarity Matrix Using BERT Embeddings',
              fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Print sentences with their labels
    print("\nSentences:")
    for i, sent in enumerate(sentences):
        print(f"S{i+1}: {sent}")
    
    print("\nWhat to look for (check against your own heatmap):")
    print("- S1 and S2 should be among the most similar pairs (paraphrases)")
    print("- S4 and S5 should be fairly similar (both about ML/DL)")
    print("- S3 should be less similar to S4/S5 (different topics)")
else:
    print("Skipping activity - transformers library not available")
    print("\nThis activity demonstrates using BERT embeddings to compute semantic similarity.")
    print("Semantically similar sentences (even with different words) will have high similarity scores.")

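Skipping activity - transformers library not available

This activity demonstrates using BERT embeddings to compute semantic similarity.
Semantically similar sentences (even with different words) will have high similarity scores.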
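
The pooling and similarity steps do not depend on BERT itself, so they can be sketched without the `transformers` library. The NumPy example below uses random arrays as stand-ins for `last_hidden_state` and the tokenizer's attention mask. The helper names `masked_mean_pool` and `cosine_similarity_matrix` are hypothetical. Unlike the single-sentence slice used in the activity, masked mean pooling also handles padded batches, although this sketch does not exclude special tokens such as `[CLS]` and `[SEP]`.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden_states  : (batch, seq_len, dim) array
    attention_mask : (batch, seq_len) array with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

def cosine_similarity_matrix(E):
    """Pairwise cosine similarity between the rows of E."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    unit = E / np.clip(norms, 1e-12, None)
    return unit @ unit.T

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 4))       # random stand-in for model outputs
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])        # second sentence has 2 padded tokens

emb = masked_mean_pool(hidden, mask)
sims = cosine_similarity_matrix(emb)

print(emb.shape)    # (2, 4)
print(sims.shape)   # (2, 2)
```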


## Comparing Attention Mechanisms

Let's create a visual comparison of how attention differs from traditional sequence processing.

In [10]:
# Create a visualization comparing RNN vs Transformer processing
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# RNN sequential processing
ax1 = axes[0]
positions = range(5)
for i in positions:
    # Draw hidden states
    ax1.add_patch(plt.Circle((i, 0.5), 0.3, color='lightblue', ec='black', linewidth=2))
    ax1.text(i, 0.5, f'h{i}', ha='center', va='center', fontsize=12, fontweight='bold')
    
    # Draw inputs
    ax1.add_patch(plt.Circle((i, -0.5), 0.2, color='lightgreen', ec='black', linewidth=2))
    ax1.text(i, -0.5, f'x{i}', ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Draw arrows from input to hidden
    ax1.arrow(i, -0.3, 0, 0.5, head_width=0.1, head_length=0.1, fc='black', ec='black')
    
    # Draw sequential connections
    if i < len(positions) - 1:
        ax1.arrow(i+0.3, 0.5, 0.4, 0, head_width=0.1, head_length=0.1,
                  fc='red', ec='red', linewidth=2)

ax1.set_xlim(-0.5, 4.5)
ax1.set_ylim(-1, 1.5)
ax1.axis('off')
ax1.set_title('RNN: Sequential Processing\n(Each step depends on previous step)',
              fontsize=14, fontweight='bold', pad=20)

# Transformer parallel processing with attention
ax2 = axes[1]
for i in positions:
    # Draw hidden states
    ax2.add_patch(plt.Circle((i, 0.5), 0.3, color='lightcoral', ec='black', linewidth=2))
    ax2.text(i, 0.5, f'z{i}', ha='center', va='center', fontsize=12, fontweight='bold')
    
    # Draw inputs
    ax2.add_patch(plt.Circle((i, -0.5), 0.2, color='lightgreen', ec='black', linewidth=2))
    ax2.text(i, -0.5, f'x{i}', ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Draw attention connections (each position attends to all others)
    for j in positions:
        if i != j:
            # Draw lighter arrows for attention connections
            ax2.annotate('', xy=(j, -0.3), xytext=(i, 0.2),
                        arrowprops=dict(arrowstyle='->', color='blue', alpha=0.2, lw=1))
    
    # Draw direct connection from input to output
    ax2.arrow(i, -0.3, 0, 0.5, head_width=0.1, head_length=0.1, fc='black', ec='black')

ax2.set_xlim(-0.5, 4.5)
ax2.set_ylim(-1, 1.5)
ax2.axis('off')
ax2.set_title('Transformer: Parallel Processing with Attention\n(Each position attends to all positions)',
              fontsize=14, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("Key Differences:")
print("\nRNN (Left):")
print("  - Processes sequentially (one step at a time)")
print("  - Information flows left-to-right through hidden states")
print("  - Cannot be easily parallelized")
print("  - Struggles with long-range dependencies")
print("\nTransformer (Right):")
print("  - Processes all positions in parallel")
print("  - Each position attends to all positions (blue arrows)")
print("  - Highly parallelizable")
print("  - Captures long-range dependencies directly")

<Figure size 1600x600 with 2 subplots>

Key Differences:

RNN (Left):
  - Processes sequentially (one step at a time)
  - Information flows left-to-right through hidden states
  - Cannot be easily parallelized
  - Struggles with long-range dependencies

Transformer (Right):
  - Processes all positions in parallel
  - Each position attends to all positions (blue arrows)
  - Highly parallelizable
  - Captures long-range dependencies directly
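
The contrast can also be seen in code. The NumPy sketch below uses a toy tanh recurrence and a self-attention layer with identity projections. Both are simplifications assumed for illustration, not the architectures used in practice. The recurrence needs a loop in which each step waits for the previous one, while the attention output for all positions comes from a few matrix operations.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                          # sequence length and dimension (assumptions)
X = rng.normal(size=(T, d))

# Toy RNN: step t cannot start until step t-1 has produced h
W_x = rng.normal(size=(d, d)) * 0.5
W_h = rng.normal(size=(d, d)) * 0.5
h = np.zeros(d)
rnn_states = []
for t in range(T):                   # T dependent steps
    h = np.tanh(X[t] @ W_x + h @ W_h)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Toy self-attention (Q = K = V = X): all positions computed at once
scores = X @ X.T / np.sqrt(d)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
Z = weights @ X                      # every row sees every position directly

print(rnn_states.shape, Z.shape)     # (6, 4) (6, 4)
```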


## Key Takeaways

Congratulations! You've learned about attention mechanisms and transformers, two of the most important innovations in modern machine learning. Let's summarize the key points:

### Core Concepts

1. **Attention Mechanisms** allow models to focus on relevant parts of the input when producing each output, solving the information bottleneck problem in traditional encoder-decoder architectures.

2. **Self-Attention** computes representations by allowing each position in a sequence to attend to all positions, creating contextualized representations that capture relationships across the entire sequence.

3. **Scaled Dot-Product Attention** uses queries, keys, and values to compute attention weights: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

4. **Multi-Head Attention** runs multiple attention functions in parallel, allowing the model to attend to different aspects of the input simultaneously.

5. **Transformers** rely entirely on attention mechanisms, eliminating recurrence and enabling parallel processing of sequences.

### Practical Skills

You can now:
- Implement scaled dot-product attention from scratch
- Build multi-head attention mechanisms
- Understand and implement positional encodings
- Work with pre-trained transformer models using Hugging Face
- Visualize and interpret attention patterns
- Use transformer embeddings for downstream tasks

### Why This Matters

Transformers have become the foundation of modern AI:
- **Language Models**: GPT, BERT, T5, and countless others
- **Computer Vision**: Vision Transformers (ViT), DETR
- **Multimodal AI**: CLIP, DALL-E, Flamingo
- **Scientific Applications**: AlphaFold for protein structure prediction

Understanding transformers is essential for working with state-of-the-art AI systems and staying current in the rapidly evolving field of machine learning.

## Further Resources

To deepen your understanding of attention mechanisms and transformers, explore these resources:

### Essential Papers

1. **"Attention Is All You Need" (Vaswani et al., 2017)**
   - The original transformer paper
   - https://arxiv.org/abs/1706.03762

2. **"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018)**
   - Introduced bidirectional pre-training
   - https://arxiv.org/abs/1810.04805

3. **"Language Models are Few-Shot Learners" (Brown et al., 2020)**
   - The GPT-3 paper
   - https://arxiv.org/abs/2005.14165

### Interactive Tutorials and Visualizations

4. **The Illustrated Transformer by Jay Alammar**
   - Excellent visual guide to transformers
   - https://jalammar.github.io/illustrated-transformer/

5. **The Annotated Transformer**
   - Line-by-line implementation with explanations
   - http://nlp.seas.harvard.edu/2018/04/03/attention.html

### Documentation and Libraries

6. **Hugging Face Transformers Documentation**
   - Comprehensive library for working with transformers
   - https://huggingface.co/docs/transformers/

7. **PyTorch Transformer Tutorial**
   - Official PyTorch tutorial on transformers
   - https://pytorch.org/tutorials/beginner/transformer_tutorial.html

### Advanced Topics

8. **"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al., 2020)**
   - Vision Transformers
   - https://arxiv.org/abs/2010.11929

9. **"Formal Algorithms for Transformers" (Phuong & Hutter, 2022)**
   - Mathematical treatment of transformer architectures
   - https://arxiv.org/abs/2207.09238

### Practice and Experimentation

- Experiment with different attention head visualizations in BERT
- Try fine-tuning a pre-trained transformer on your own dataset
- Implement variants like relative positional encoding
- Explore transformer applications beyond NLP (e.g., time series, graphs)

Keep learning and experimenting with these powerful architectures!