# PyTorch Tutorial: Transformers and Attention

The Transformer architecture underpins most modern large language models, including the GPT family behind ChatGPT, and its attention mechanism also appears in image generators such as Stable Diffusion. In this notebook, we'll understand the core mechanism behind it: **Self-Attention**.

## Learning Objectives
- Understand the Self-Attention mechanism
- Implement a single Self-Attention head from scratch
- Use PyTorch's built-in Transformer modules
- Understand Tokenization


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

torch.manual_seed(42)

## 1. The Core Idea: Self-Attention

In a sentence like "The animal didn't cross the street because **it** was too tired", what does "it" refer to? The street or the animal?

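Self-attention lets each word look at every word in the sentence (including itself) to figure this out. For each position, it computes a weighted sum of the value vectors of all positions, where the weights reflect how well that position's query matches each key.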

Formula:
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Where:
- **Q (Query)**: What I'm looking for
- **K (Key)**: What I have to offer
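- **V (Value)**: What I actually contain
- **$d_k$**: The dimension of the query and key vectors; dividing by $\sqrt{d_k}$ keeps the scores from growing with the dimension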
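To see why the formula divides by $\sqrt{d_k}$, here is a minimal sketch that measures the variance of dot products between random vectors. The sizes (`d_k = 256`, `num_pairs = 10_000`) are assumptions chosen for illustration, and the inputs are standard normal samples rather than real queries and keys.

```python
import math
import torch

# Assumed sizes for illustration only; they are not part of the tutorial's model.
d_k = 256
num_pairs = 10_000

gen = torch.Generator().manual_seed(0)
q = torch.randn(num_pairs, d_k, generator=gen)
k = torch.randn(num_pairs, d_k, generator=gen)

raw_scores = (q * k).sum(dim=-1)             # one dot product per (q, k) pair
scaled_scores = raw_scores / math.sqrt(d_k)  # the scaling used in the formula

print(f"Variance of raw scores:    {raw_scores.var().item():.1f}")
print(f"Variance of scaled scores: {scaled_scores.var().item():.2f}")
```

With unit-variance inputs, the raw variance comes out close to `d_k` and the scaled variance close to 1. Keeping the scores in that range helps prevent softmax from saturating on a single position.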

In [None]:
def scaled_dot_product_attention(query, key, value):
    """Return (output, attention_weights) for inputs shaped (batch, seq_len, d_k)."""
    d_k = query.size(-1)
    # 1. Compute similarity scores Q @ K^T, scaled by sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # 2. Apply softmax over the key dimension so each row of weights sums to 1
    attention_weights = F.softmax(scores, dim=-1)

    # 3. Take the weighted sum of the value vectors
    output = torch.matmul(attention_weights, value)

    return output, attention_weights

# Example
d_model = 4  # Embedding dimension
seq_len = 3  # "I love AI"

# Random stand-ins for Q, K, V (a real model derives them from the input via learned projections)
q = torch.randn(1, seq_len, d_model)
k = torch.randn(1, seq_len, d_model)
v = torch.randn(1, seq_len, d_model)

output, weights = scaled_dot_product_attention(q, k, v)

print("Attention Weights:")
print(weights)
print("\nOutput:")
print(output)
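The example above feeds random tensors in as Q, K, and V. In a real self-attention head, all three are learned linear projections of the same input. The following is a minimal sketch of such a head; the class name `SelfAttentionHead`, the `d_head` parameter, and the bias-free projections are illustrative choices, not taken from a specific library.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentionHead(nn.Module):
    """One attention head: project x to Q, K, V, then apply scaled dot-product attention."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        # Bias-free projections are a design choice here, not a requirement.
        self.w_q = nn.Linear(d_model, d_head, bias=False)
        self.w_k = nn.Linear(d_model, d_head, bias=False)
        self.w_v = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        weights = F.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)
        return weights @ v, weights


torch.manual_seed(0)
head = SelfAttentionHead(d_model=4, d_head=4)
x_demo = torch.randn(1, 3, 4)  # toy sizes: 3 tokens, 4 dimensions each
head_out, head_weights = head(x_demo)
print(head_out.shape)      # torch.Size([1, 3, 4])
print(head_weights.shape)  # torch.Size([1, 3, 3])
```

Each row of `head_weights` sums to 1 and describes how much one token attends to every token in the sequence.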

## 2. PyTorch's Transformer Modules

PyTorch provides optimized implementations so you don't have to write everything from scratch.

In [None]:
# Single Multi-Head Attention Layer
multihead_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

# Create dummy input: (Batch, Seq_Len, Embed_Dim)
x = torch.randn(32, 10, 256)

# Self-attention: Q=x, K=x, V=x
attn_output, _ = multihead_attn(x, x, x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {attn_output.shape}")
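The call above discards the second return value, which holds the attention weights. As a rough sketch, the weights can be inspected per head as shown below. This assumes a PyTorch version that supports the `average_attn_weights` argument (1.11 or later, as far as I know), and the names `mha` and `x_demo` are placeholders for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x_demo = torch.randn(32, 10, 256)  # (batch, seq_len, embed_dim)

# average_attn_weights=False keeps one weight matrix per head
# instead of averaging the heads together.
out, per_head_weights = mha(
    x_demo, x_demo, x_demo, need_weights=True, average_attn_weights=False
)

head_dim = mha.embed_dim // mha.num_heads
print(f"Per-head dimension: {head_dim}")                # 32
print(f"Per-head weights:   {per_head_weights.shape}")  # torch.Size([32, 8, 10, 10])
```

Each of the 8 heads works in a 32-dimensional subspace of the 256-dimensional embedding and produces its own 10 x 10 attention map per example.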

## 3. Full Transformer Encoder

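A Transformer Encoder is a stack of identical layers. Each layer applies multi-head self-attention followed by a position-wise feed-forward network, with a residual connection and layer normalization around each of the two sub-layers.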

In [None]:
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Reuses x from the previous cell: (Batch, Seq_Len, Embed_Dim) = (32, 10, 256)
output = transformer_encoder(x)
print(f"Encoder Output shape: {output.shape}")
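To make the layer structure concrete, the sketch below builds the same kind of encoder and inspects its sub-modules. It relies on the attribute names `layers`, `self_attn`, `linear1`, and `linear2` used by PyTorch's implementation, and on the default `dim_feedforward=2048`. The variable names `layer`, `encoder`, and `first` are placeholders.

```python
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

print(f"Number of layers: {len(encoder.layers)}")  # 6

first = encoder.layers[0]
# Self-attention sub-layer
print(type(first.self_attn).__name__)  # MultiheadAttention
# Feed-forward sub-layer: expand to dim_feedforward, then project back to d_model
print(first.linear1.in_features, first.linear1.out_features)  # 256 2048
print(first.linear2.in_features, first.linear2.out_features)  # 2048 256

num_params = sum(p.numel() for p in encoder.parameters())
print(f"Total parameters: {num_params:,}")
```

The feed-forward network widens each position's vector to 2048 dimensions and then projects it back to 256, which is why the encoder's output shape matches its input shape.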

## 4. Tokenization (Concept)

Transformers operate on numbers, not raw text. Tokenization splits text into tokens and maps each token to an integer ID.

1. Text: "I love AI"
2. Tokens: ["I", "love", "AI"]
3. IDs: [101, 204, 505]

*(In practice, we use libraries like Hugging Face `tokenizers`, which typically split text into subword units rather than whole words.)*
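The three steps above can be sketched with a toy whitespace tokenizer. The vocabulary, its IDs, and the `encode` helper are made up for illustration and do not reflect how a production tokenizer works. The final lines show one common next step: an embedding layer that turns each ID into a vector.

```python
import torch
import torch.nn as nn

# Toy vocabulary; these IDs are made up for illustration.
vocab = {"<unk>": 0, "I": 1, "love": 2, "AI": 3}


def encode(text):
    """Split on whitespace and map each token to its ID (unknown words map to <unk>)."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.split()]


ids = encode("I love AI")
print(ids)  # [1, 2, 3]

# An embedding layer turns each ID into a vector the Transformer can process.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)
vectors = embedding(torch.tensor([ids]))  # (batch=1, seq_len=3, embed_dim=4)
print(vectors.shape)  # torch.Size([1, 3, 4])
```

Words missing from the vocabulary fall back to the `<unk>` ID. Subword tokenizers reduce how often this happens by splitting rare words into smaller pieces.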

## Key Takeaways

1. **Self-Attention**: The mechanism that relates different positions of a sequence.
2. **Q, K, V**: The three projections used to compute attention.
3. **Multi-Head Attention**: Running multiple attention mechanisms in parallel.
4. **Transformer**: A stack of attention and feed-forward layers.