# Transformer Basics

The **Transformer architecture** reshaped Natural Language Processing (NLP) by replacing recurrent models (RNNs and LSTMs) with a model built entirely on **attention**, which lets every position in a sequence be processed in parallel.

Introduced in 2017 in the paper _“Attention Is All You Need”_ (Vaswani et al.), it is now the foundation of models such as **BERT**, **GPT**, **T5**, and **Vision Transformers (ViT)**.

---

## Objectives
- Understand the core components of the Transformer.
- Learn how Attention works.
- Implement a simple Transformer model using PyTorch.
- Visualize how input sequences are transformed.

## 1. Transformer Architecture Overview

The original Transformer consists of two main parts, an encoder stack and a decoder stack:

```
Input → [Encoder Stack] → [Decoder Stack] → Output
```

### Components
| Component | Function |
|------------|-----------|
| **Encoder** | Reads and encodes input text into contextual representations. |
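| **Decoder** | Generates the output sequence (for example, a translation) one token at a time, attending to the encoder outputs. |
| **Attention Mechanism** | Weights every word in the sequence by its relevance to the word being processed. |
| **Feed Forward Network** | Applies the same two-layer network to each position independently, adding non-linearity. |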
| **Positional Encoding** | Adds information about word order (since there’s no recurrence). |
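The encoder-to-decoder flow in the table can be sketched with PyTorch's built-in `nn.Transformer`. This is a minimal illustration, not a trained model: the sizes (`d_model=8`, one layer per stack) and the random inputs are assumptions chosen to keep shapes easy to read.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny encoder-decoder; every size here is an illustrative assumption.
model = nn.Transformer(
    d_model=8,
    nhead=2,
    num_encoder_layers=1,
    num_decoder_layers=1,
    dim_feedforward=32,
    dropout=0.0,
    batch_first=True,
)

src = torch.rand(1, 5, 8)  # encoder input: batch=1, 5 source positions
tgt = torch.rand(1, 3, 8)  # decoder input: batch=1, 3 target positions

# Causal mask so each target position only sees earlier target positions.
tgt_mask = model.generate_square_subsequent_mask(3)

# The encoder reads the source; the decoder attends to its output ("memory").
memory = model.encoder(src)
out = model.decoder(tgt, memory, tgt_mask=tgt_mask)

print("Encoder output shape:", memory.shape)
print("Decoder output shape:", out.shape)
```

The encoder output keeps one vector per source position, while the decoder output has one vector per target position.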

## 2. Understanding Self-Attention

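In self-attention, each word attends to every word in the sentence, including itself, so that its representation reflects the full context.

The **Self-Attention** mechanism computes three vectors for every word, each a learned linear projection of that word's embedding:
- **Q (Query)**: what the word is looking for in other words.
- **K (Key)**: what the word exposes for other words to match against.
- **V (Value)**: the content that is passed on once a match is scored.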
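As a rough sketch of where Q, K, and V come from, the snippet below projects a batch of word embeddings with three separate linear layers. The names `W_q`, `W_k`, `W_v` and the sizes are hypothetical, chosen only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

embed_dim, d_k = 8, 4             # illustrative sizes (assumptions)
x = torch.rand(1, 3, embed_dim)   # batch=1, 3 words, 8-dim embeddings

# One learned projection per role; bias is omitted to keep the sketch simple.
W_q = nn.Linear(embed_dim, d_k, bias=False)
W_k = nn.Linear(embed_dim, d_k, bias=False)
W_v = nn.Linear(embed_dim, d_k, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)

# Each result holds one d_k-dimensional vector per word.
print(Q.shape, K.shape, V.shape)
```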

The formula for attention is:

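$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$

where $d_k$ is the dimension of the Key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the vector dimension, which would otherwise push the softmax into regions with very small gradients.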

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # d_k is the size of each query/key vector (last dimension)
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    # Normalize each row so the weights over the keys sum to 1
    attention_weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

# Example tensors: batch_size=1, seq_len=3, d_k=4
Q = torch.rand(1, 3, 4)
K = torch.rand(1, 3, 4)
V = torch.rand(1, 3, 4)

output, attn = scaled_dot_product_attention(Q, K, V)
print("Attention Output:\n", output)
print("\nAttention Weights:\n", attn)
```
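To see why the scores are divided by $\sqrt{d_k}$, the sketch below measures the variance of dot products between random vectors. It assumes query and key components are independent with unit variance, which is a simplification; the helper name `dot_product_variance` is hypothetical.

```python
import torch

torch.manual_seed(0)

def dot_product_variance(d_k, n_samples=10_000):
    """Return (raw, scaled) variance of q.k for random unit-variance vectors."""
    q = torch.randn(n_samples, d_k)
    k = torch.randn(n_samples, d_k)
    raw = (q * k).sum(dim=-1)       # one dot product per sample
    scaled = raw / d_k ** 0.5       # same scaling as in the attention formula
    return raw.var().item(), scaled.var().item()

for d_k in (4, 64, 512):
    raw_var, scaled_var = dot_product_variance(d_k)
    print(f"d_k={d_k:>3}  raw variance={raw_var:8.2f}  scaled variance={scaled_var:.2f}")
```

Under these assumptions the raw variance grows roughly in proportion to `d_k`, while the scaled variance stays close to 1, so the softmax inputs remain in a comparable range regardless of dimension.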

## 3. Multi-Head Attention
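Instead of a single attention head, the Transformer runs several heads in parallel, each on its own learned projection of the input, so that different heads can learn **different relationships** between words.

The head outputs are concatenated and projected back to the embedding dimension, which lets the model attend to different parts of the sentence simultaneously. In `nn.MultiheadAttention`, `embed_dim` must be divisible by `num_heads`.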

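```python
import torch
from torch import nn

# 2 heads over an 8-dim embedding: each head works with 8 / 2 = 4 dimensions
multihead_attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.rand(2, 5, 8)  # batch_size=2, seq_len=5, embedding_dim=8
# Self-attention: the same tensor serves as query, key, and value
attn_output, attn_weights = multihead_attn(x, x, x)

print("Multi-Head Attention Output Shape:", attn_output.shape)
# By default the returned weights are averaged over the heads
print("Attention Weights Shape:", attn_weights.shape)
```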
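The head splitting that `nn.MultiheadAttention` performs internally can be sketched by hand. The version below is a simplified illustration: it assumes the queries, keys, and values are already projected and leaves out the learned input and output projections. The helper `split_heads` is a hypothetical name.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch, seq_len, embed_dim, num_heads = 2, 5, 8, 2
head_dim = embed_dim // num_heads  # 4 dimensions per head

def split_heads(t):
    # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
    return t.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

# Stand-ins for already-projected queries, keys, and values (assumption).
Q = torch.rand(batch, seq_len, embed_dim)
K = torch.rand(batch, seq_len, embed_dim)
V = torch.rand(batch, seq_len, embed_dim)

q, k, v = split_heads(Q), split_heads(K), split_heads(V)

# Each head runs scaled dot-product attention on its own slice.
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # (batch, heads, seq, seq)
weights = F.softmax(scores, dim=-1)
per_head = weights @ v                               # (batch, heads, seq, head_dim)

# Concatenate the heads back into one vector per position.
merged = per_head.transpose(1, 2).reshape(batch, seq_len, embed_dim)

print("Per-head weights shape:", weights.shape)
print("Merged output shape:", merged.shape)
```

Each head produces its own `seq_len x seq_len` weight matrix, which is what allows different heads to focus on different word pairs.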

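## 4. Building a Simple Transformer Block

A Transformer encoder block combines multi-head self-attention with a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization.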

```python
import torch
from torch import nn

class SimpleTransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_hidden_dim):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Position-wise feed-forward network: expand, apply ReLU, project back
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_hidden_dim),
            nn.ReLU(),
            nn.Linear(ff_hidden_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Self-attention sub-layer with residual connection, then LayerNorm
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_output)
        # Feed-forward sub-layer with residual connection, then LayerNorm
        ff_output = self.ff(x)
        x = self.norm2(x + ff_output)
        return x

# Test the block: the output keeps the input shape (batch, seq_len, embed_dim)
block = SimpleTransformerBlock(embed_dim=8, num_heads=2, ff_hidden_dim=32)
inp = torch.rand(1, 5, 8)
out = block(inp)
print("Transformer Block Output Shape:", out.shape)
```
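Because a block returns the same shape it receives, blocks can be stacked. As a hedged sketch, the snippet below stacks three of PyTorch's built-in `nn.TransformerEncoderLayer` modules, configured with the same sizes as `SimpleTransformerBlock` above. The number of layers and `dropout=0.0` are assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Built-in encoder layer (post-norm by default), same sizes as the block above.
layer = nn.TransformerEncoderLayer(
    d_model=8, nhead=2, dim_feedforward=32, dropout=0.0, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=3)  # three stacked copies

inp = torch.rand(1, 5, 8)  # batch=1, seq_len=5, embed_dim=8
out = encoder(inp)
print("Stacked encoder output shape:", out.shape)
```

The built-in layer also includes dropout and configurable activation, which the simplified block leaves out.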

## 5. Positional Encoding
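Because Transformers have no recurrence and self-attention is otherwise insensitive to word order, we add **positional encodings** to the input embeddings to represent order.

The original paper uses a fixed sinusoidal encoding, in which each pair of dimensions holds the sine and cosine of the same frequency.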
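For reference, the sinusoidal encoding from the original paper can be written as follows, where $pos$ is the position in the sequence, $i$ indexes the sine/cosine pairs, and $d_{model}$ is the embedding dimension:

$$ PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$

$$ PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$

Both members of a pair use the same exponent $2i/d_{model}$, so low dimensions oscillate quickly with position and high dimensions oscillate slowly.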

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    PE = torch.zeros(seq_len, d_model)
    for pos in range(seq_len):
        # i runs over the even dimensions; each (i, i + 1) pair shares one frequency
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            PE[pos, i] = math.sin(angle)
            if i + 1 < d_model:
                PE[pos, i + 1] = math.cos(angle)
    return PE

pos_encoding = positional_encoding(10, 8)
print(pos_encoding[:5])  # encodings for the first five positions
```
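The nested loops can be replaced with tensor operations, and the result is then added to the word embeddings. The sketch below is one possible vectorized form. The function name `sinusoidal_encoding`, the vocabulary size, and the random token ids are hypothetical and serve only to show the shapes.

```python
import math
import torch
from torch import nn

def sinusoidal_encoding(seq_len, d_model):
    """Vectorized sinusoidal encoding; each sin/cos pair shares one frequency."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    even_idx = torch.arange(0, d_model, 2, dtype=torch.float32)         # 0, 2, 4, ...
    inv_freq = torch.exp(-math.log(10000.0) * even_idx / d_model)       # 1 / 10000^(2i/d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * inv_freq)
    # Slice handles odd d_model, where there is one fewer cosine column.
    pe[:, 1::2] = torch.cos(position * inv_freq[: d_model // 2])
    return pe

torch.manual_seed(0)
vocab_size, d_model, seq_len = 100, 8, 10             # hypothetical sizes
embedding = nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (1, seq_len))   # one sequence of random token ids

# Position information is added element-wise to the word embeddings.
x = embedding(tokens) + sinusoidal_encoding(seq_len, d_model)
print(x.shape)
```

Adding rather than concatenating keeps the model dimension unchanged, so the result can be fed directly into an attention layer.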

## Summary
- Transformers replace recurrence with **self-attention**, so all positions in a sequence are processed in parallel.
- **Encoder** extracts meaning; **Decoder** generates text.
- **Multi-head attention** allows the model to focus on multiple parts of the sequence.
- **Positional encodings** retain order information.

Modern NLP models are built on top of this architecture: **BERT** uses only the encoder stack, **GPT** uses only the decoder stack, and **T5** uses both.

---
**Next:** `13-BERT_Text_Classification.ipynb` → Understanding how BERT uses Transformers for downstream NLP tasks.