# Text Transformers: Encoders, Decoders, and Encoder–Decoder

This notebook introduces transformer architectures for text: encoder-only (BERT-style), decoder-only (GPT-style), and encoder–decoder (T5/BART-style). It provides intuition, key equations, and a minimal runnable demo.

## Learning objectives
- Understand self-attention and multi-head attention
- Compare encoder-only, decoder-only, and encoder–decoder designs
- Run a tiny forward pass to solidify concepts

## Outline
1. Recap: Scaled dot-product attention
2. Encoder-only: contextual understanding (BERT family)
3. Decoder-only: autoregressive generation (GPT family)
4. Encoder–decoder: seq2seq (T5/BART)
5. Minimal demo
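
As a reference point for item 1, scaled dot-product attention as defined in Vaswani et al. (2017) can be written as follows, for queries $Q$, keys $K$, values $V$, and key dimension $d_k$. The mask symbol $M$ in the last line is notation assumed here for illustration: an encoder uses no mask (all entries zero), while a decoder sets the entries for future positions to $-\infty$.

$$
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O \\
\mathrm{CausalAttention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V,
\qquad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}
\end{aligned}
$$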

## References (Papers)
- Vaswani et al., 2017 — "Attention Is All You Need" (arXiv:1706.03762)
- Devlin et al., 2018 — "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (arXiv:1810.04805)
- Radford et al., 2019 — "Language Models are Unsupervised Multitask Learners" (OpenAI)
- Raffel et al., 2019 — "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (arXiv:1910.10683)
- Lewis et al., 2019 — "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" (arXiv:1910.13461)


```python
# Minimal demo: attention + encoder-only forward
import torch
from attention_mechanism import scaled_dot_product_attention
from model_architectures import EncoderOnlyTransformer

# Attention on toy tensors
B, T, dk, dv = 2, 5, 8, 8
Q = torch.randn(B, T, dk)
K = torch.randn(B, T, dk)
V = torch.randn(B, T, dv)
attn_out, attn_w = scaled_dot_product_attention(Q, K, V)
print('Attention output:', attn_out.shape, 'Weights:', attn_w.shape)

# Tiny encoder-only forward
vocab_size = 1000
model = EncoderOnlyTransformer(
    vocab_size=vocab_size,
    d_model=64,
    num_heads=4,
    num_layers=2,
    d_ff=256,
    max_seq_len=32,
)
input_ids = torch.randint(0, vocab_size, (B, T))
enc = model(input_ids)
print('Encoder-only output:', enc.shape)
```
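
The demo above depends on the local `attention_mechanism` module, whose implementation is not shown here. The block below is a minimal standalone sketch of scaled dot-product attention in plain PyTorch, with an optional causal mask to illustrate how decoder-only attention differs from the encoder-only case. The function name `sdpa_sketch` and its `causal` flag are hypothetical and chosen for illustration; they are not claimed to match the module's actual code.

```python
import math
import torch


def sdpa_sketch(Q, K, V, causal=False):
    """Illustrative scaled dot-product attention (hypothetical helper,
    not the notebook's attention_mechanism implementation)."""
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (B, T_q, T_k)
    if causal:
        T_q, T_k = scores.shape[-2:]
        # True strictly above the diagonal marks future positions to hide
        future = torch.triu(
            torch.ones(T_q, T_k, dtype=torch.bool, device=scores.device),
            diagonal=1,
        )
        scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V, weights


torch.manual_seed(0)
B, T, dk, dv = 2, 5, 8, 8
Q = torch.randn(B, T, dk)
K = torch.randn(B, T, dk)
V = torch.randn(B, T, dv)

# Bidirectional (encoder-style): every position can attend to every other
out_bi, w_bi = sdpa_sketch(Q, K, V)
# Causal (decoder-style): position i attends only to positions j <= i
out_causal, w_causal = sdpa_sketch(Q, K, V, causal=True)

print('Bidirectional:', out_bi.shape, w_bi.shape)
print('Causal weights for the first sequence:')
print(w_causal[0])
```

In the causal case the printed weight matrix is lower triangular, and the first position places all of its weight on itself because it has no earlier tokens to attend to.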
