# Advanced Concepts: Attention Variants, Positional Biases, Norms, and Optimizers

This notebook surveys advanced techniques: linear, sparse, and local attention; rotary (RoPE) and ALiBi positional methods; RMSNorm as an alternative to LayerNorm; and recent optimizers (Lion, Sophia). It closes with a small snippet that compares standard and linear attention on toy data.

## Learning objectives
- Know when to use alternative attention mechanisms
- Understand modern positional encodings and their benefits
- Try practical snippets to compare behaviors

## Outline
1. Linear vs standard attention
2. Local and sparse attention
3. Rotary (RoPE) and ALiBi
4. RMSNorm vs LayerNorm
5. Lion and Sophia optimizers
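
As a concrete taste of item 3, the sketch below builds the ALiBi attention bias in plain Python. It assumes the slope schedule described by Press et al. for power-of-two head counts (a geometric sequence starting at `2^(-8/n)` for `n` heads). The function names `alibi_slopes` and `alibi_bias` are hypothetical, and folding the causal mask into the bias as `-inf` is a convenience of this sketch, not something the paper requires.

```python
def alibi_slopes(num_heads):
    """Per-head slopes for ALiBi (sketch: power-of-two head counts only)."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / num_heads)
    # Head h (0-based) gets slope ratio^(h+1): 1/2, 1/4, ... for 8 heads.
    return [ratio ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j], to be added to attention logits before softmax.

    For key position j <= query position i the bias is -slope * (i - j);
    future positions (j > i) are set to -inf, i.e. causally masked.
    """
    bias = []
    for m in alibi_slopes(num_heads):
        head = [
            [-m * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)
        ]
        bias.append(head)
    return bias


bias = alibi_bias(num_heads=8, seq_len=4)
# Last query row of the first head: the penalty grows linearly with distance.
print(bias[0][3])
```

Because the bias depends only on the distance `i - j`, the same rule applies unchanged to sequences longer than those seen in training, which is the extrapolation property the paper's title refers to.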

## References (Papers)
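- Tay et al., 2020. "Efficient Transformers: A Survey" (arXiv:2009.06732)
- Su et al., 2021. "RoFormer: Enhanced Transformer with Rotary Position Embedding" (arXiv:2104.09864)
- Press et al., 2021. "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (ALiBi; arXiv:2108.12409)
- Dao et al., 2022. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (arXiv:2205.14135)
- Chen et al., 2023. "Symbolic Discovery of Optimization Algorithms" (Lion; arXiv:2302.06675)
- Liu et al., 2023. "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training" (arXiv:2305.14342)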


In [None]:
# Compare standard vs linear attention on toy data
import torch
from attention_variants import LinearAttention

B, T, C = 2, 64, 128
x = torch.randn(B, T, C)

# Standard multi-head attention (for reference).
# batch_first=True lets it take (batch, seq, embed) directly, so no transposes are needed.
std_attn = torch.nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
std_out, _ = std_attn(x, x, x)  # self-attention: query = key = value = x
print('Standard attention out:', std_out.shape)

# Linear attention
lin_attn = LinearAttention(d_model=C, num_heads=4)
lin_out = lin_attn(x)
print('Linear attention out:', lin_out.shape)
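
The cell above relies on `attention_variants.LinearAttention`, whose implementation is not shown here. As a rough illustration of what such a layer might compute, the following sketch implements non-causal kernelized linear attention with the feature map `elu(x) + 1` (the formulation popularized by Katharopoulos et al., 2020). The function names `linear_attention` and `quadratic_reference` are hypothetical and are not taken from that module; projections and head splitting are omitted, so the inputs are assumed to be already shaped `(batch, heads, seq, head_dim)`.

```python
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention, O(T) in sequence length (illustrative sketch).

    q, k: (B, H, T, D); v: (B, H, T, Dv). Returns (B, H, T, Dv).
    """
    q = F.elu(q) + 1.0  # positive feature map phi(q)
    k = F.elu(k) + 1.0  # positive feature map phi(k)
    # Summarize keys and values once: sum_j phi(k_j) v_j^T -> (B, H, D, Dv)
    kv = torch.einsum("bhtd,bhte->bhde", k, v)
    # Normalizer: phi(q_i) . sum_j phi(k_j) -> (B, H, T)
    denom = torch.einsum("bhtd,bhd->bht", q, k.sum(dim=2))
    out = torch.einsum("bhtd,bhde->bhte", q, kv)
    return out / (denom.unsqueeze(-1) + eps)


def quadratic_reference(q, k, v, eps=1e-6):
    """Same attention weights computed the O(T^2) way, used only as a check."""
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    scores = q @ k.transpose(-2, -1)  # (B, H, T, T)
    weights = scores / (scores.sum(dim=-1, keepdim=True) + eps)
    return weights @ v


torch.manual_seed(0)
B, H, T, D = 2, 4, 64, 32  # matches C = 128 split across 4 heads
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
out = linear_attention(q, k, v)
ref = quadratic_reference(q, k, v)
print("Linear attention (sketch) out:", out.shape)
print("Matches quadratic form:", torch.allclose(out, ref, atol=1e-4))
```

The saving comes from reordering the matrix products: the `(D, Dv)` key-value summary is built once and reused for every query, so the `T x T` score matrix is never materialized. Note that this is not a drop-in approximation of softmax attention; the kernel feature map defines a different set of attention weights.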
