The main advantage of the transformer architecture over previous approaches in NLP is that every token attends to every other token simultaneously, instead of processing the sequence step by step. Order is preserved through positional encoding.
There are 3 types of architectures:
- Encoder - processes input and builds understanding (e.g., BERT)
- Decoder - generates output autoregressively (e.g., GPT)
- Encoder-Decoder - reads one sequence, writes another (e.g., translation)
The K, Q, V projection matrices take the embedding vectors and transform them into a representation space for a given attention head.
- K (Key) - what I look like in this space (in this head)
- Q (Query) - what I'm looking for
- V (Value) - if you attend to me, here is what I will give you
Consider the sequence "The cat sat":
- Sequence length: n = 3 tokens
- Model dimension: d_model = 512
- Number of heads: h = 8
- Per-head dimension: d_k = 512 / 8 = 64
graph TD
A["Input X (n × d_model)"] --> B[Project to Q]
A --> C[Project to K]
A --> D[Project to V]
B --> E["Compute Q·K^T<br/>(n × n)"]
C --> E
E --> F["Scale by √d_k"]
F --> G["Softmax per row"]
G --> H["Attention Weights<br/>(n × n)"]
H --> I["Weighted sum:<br/>Attention·V"]
D --> I
I --> J["Output (n × d_model)"]
- Project to Q, K, V: Input X is projected through learned weight matrices $W_Q$, $W_K$, $W_V$ to create Query, Key, and Value matrices.
- Compute attention scores: We multiply Q·K^T, which produces raw attention scores for every pair of tokens in the sequence.
- Scale the scores: To keep these scores from getting too large, each one is divided by √d_k. 📘 Detailed explanation: DIVIDE_BY_SQRT_D.md
- Apply softmax: This converts scores into a probability distribution (each row sums to 1).
Attention Weights Example:
      The  cat  sat
The   0.6  0.3  0.1
cat   0.2  0.5  0.3
sat   0.1  0.2  0.7

Each row shows how much each token attends to others (probabilities sum to 1).
- Weighted sum of values: Multiply the attention weights by V to get a context-aware representation of each token.

Formula:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$
📐 Dimension compatibility details: ATTENTION_DIMENTIONS.md
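A minimal sketch of these steps in PyTorch for a single head, using the toy dimensions from the "The cat sat" example (n = 3, d_model = 512, d_k = 64); the weights are random, just to show the shapes:

```python
import math
import torch

n, d_model, d_k = 3, 512, 64          # toy dimensions from the example above
X = torch.randn(n, d_model)           # embeddings for "The cat sat"

# Learned projections for a single attention head
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # each (n, d_k)

scores = Q @ K.T / math.sqrt(d_k)         # (n, n) raw scores, scaled by sqrt(d_k)
weights = torch.softmax(scores, dim=-1)   # each row sums to 1
output = weights @ V                      # (n, d_k) context-aware token representations
```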
The final function has the form mha(Q, K, V) (multi-head attention):
- mha(x, x, x) - self-attention (a token attends to its own sequence)
- mha(x, y, y) - cross-attention (tokens from one sequence attend to another sequence)
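A sketch of both calls with PyTorch's nn.MultiheadAttention, which exposes exactly this (query, key, value) signature; x and y are arbitrary example tensors:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(1, 3, 512)    # a sequence of 3 tokens
y = torch.randn(1, 5, 512)    # a different sequence of 5 tokens

self_attn, _ = mha(x, x, x)   # self-attention: x attends to itself  -> (1, 3, 512)
cross_attn, _ = mha(x, y, y)  # cross-attention: x queries y         -> (1, 3, 512)
```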
graph TD
A[Input X] --> B[Multi-Head Attention]
A -.Skip Connection.-> C[Add]
B --> C
C --> D[LayerNorm]
D --> E[Feed-Forward Network]
D -.Skip Connection.-> F[Add]
E --> F
F --> G[LayerNorm]
G --> H[Output to next layer]
style A fill:#e1f5ff
style C fill:#ffe1e1
style F fill:#ffe1e1
style H fill:#e1ffe1
Note: This describes the Post-LN architecture from the original "Attention is All You Need" paper, where LayerNorm comes after the residual addition. This is PyTorch's default (norm_first=False). The formula for each sublayer is: x = LayerNorm(x + Sublayer(x)).
- Multi-Head Attention (MHA) - first sublayer; it allows tokens to exchange information across the sequence.
- Add - element-wise addition combining the sublayer output with the input (residual/skip connection).
- LayerNorm - normalizes each token's embedding independently of the others. This keeps training stable and avoids value explosion. Strict normalization to $\mathcal{N}(0,1)$ can be too restrictive, because valuable information can be encoded in distribution differences and offsets. So each layer has learnable parameters $\gamma_i$ and $\beta_i$, where $i$ indexes the layer (not the token position); the same parameters are shared across all positions in the sequence (see the sketch after this list). Example:

  Before LayerNorm:  [0.1, 10.5, -5.2, 8.3, ...]
  After normalize:   [0.0, 1.2, -0.8, 0.9, ...]   (mean≈0, std≈1)
  After scale/shift: γ·value + β (per dimension)

  Two steps:
  - Normalize across the embedding dimension
  - Rescale with per-feature parameters (so we don't lose information about the relationships between features)
- Feed-Forward Network (FFN) - second sublayer; it transforms each token independently.
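A minimal sketch of LayerNorm's two steps in PyTorch, assuming d_model = 4 for readability (the manual computation mirrors what nn.LayerNorm does internally):

```python
import torch
import torch.nn as nn

x = torch.tensor([[0.1, 10.5, -5.2, 8.3]])   # one token embedding (d_model = 4)

# Step 1: normalize across the embedding dimension (mean ≈ 0, std ≈ 1)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
normalized = (x - mean) / torch.sqrt(var + 1e-5)

# Step 2: rescale/shift with learnable per-feature parameters gamma and beta
ln = nn.LayerNorm(4)          # gamma = ln.weight (init 1), beta = ln.bias (init 0)
out = ln(x)                   # equals gamma * normalized + beta
```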
The encoder follows this pattern twice (once for MHA, once for FFN):
- Apply sublayer (MHA or FFN)
- Add residual connection: x_out = x_in + sublayer_output
- Normalize: x = LayerNorm(x_out)
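A sketch of this pattern for one encoder block (Post-LN); self_attn, ffn, norm1, and norm2 are hypothetical callables standing in for the real sublayers:

```python
def encoder_block(x, self_attn, ffn, norm1, norm2):
    # Sublayer 1: multi-head self-attention, then residual add + LayerNorm (Post-LN)
    x = norm1(x + self_attn(x))
    # Sublayer 2: position-wise feed-forward network, then residual add + LayerNorm
    x = norm2(x + ffn(x))
    return x
```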
But how does the MHA output dimension match the input dimension? Here's the Multi-Head Attention dimensionality trick:
Each of the $h$ heads outputs vectors of dimension $d_k = d_{\text{model}} / h$, so concatenating them restores the original model dimension:

$\text{Concat}(\text{head}_1, \ldots, \text{head}_h) \in \mathbb{R}^{h \cdot d_k} = \mathbb{R}^{d_{\text{model}}}$
Multi-Head Parallel Processing:
graph TD
A["Input: (n × d_model)"] --> B["Split into h heads"]
B --> C1["Head 1<br/>(n × d_k)"]
B --> C2["Head 2<br/>(n × d_k)"]
B --> C3["..."]
B --> C4["Head h<br/>(n × d_k)"]
C1 --> D1["Attention<br/>(n × d_v)"]
C2 --> D2["Attention<br/>(n × d_v)"]
C3 --> D3["..."]
C4 --> D4["Attention<br/>(n × d_v)"]
D1 --> E["Concatenate"]
D2 --> E
D3 --> E
D4 --> E
E --> F["Linear W_O"]
F --> G["Output: (n × d_model)"]
Each head processes the input in parallel, learning different aspects of the relationships. Output shape is preserved. A final projection $W_O$ maps the concatenated heads back to (n × d_model).
📐 See ATTENTION_DIMENTIONS.md for the full step-by-step breakdown.
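A minimal shape walkthrough of this split-and-concatenate trick, using the toy dimensions from earlier (the projections and names are illustrative, not PyTorch's internal implementation):

```python
import torch
import torch.nn as nn

n, d_model, h = 3, 512, 8
d_k = d_model // h                         # 64

x = torch.randn(n, d_model)
qkv_proj = nn.Linear(d_model, 3 * d_model)
W_O = nn.Linear(d_model, d_model)          # final output projection

q, k, v = qkv_proj(x).chunk(3, dim=-1)     # each (n, d_model)
q = q.view(n, h, d_k).transpose(0, 1)      # (h, n, d_k): h heads in parallel
k = k.view(n, h, d_k).transpose(0, 1)
v = v.view(n, h, d_k).transpose(0, 1)

scores = q @ k.transpose(-2, -1) / d_k**0.5          # (h, n, n)
heads = torch.softmax(scores, dim=-1) @ v             # (h, n, d_k)
concat = heads.transpose(0, 1).reshape(n, d_model)    # (n, h·d_k) = (n, d_model)
out = W_O(concat)                                      # (n, d_model): shape preserved
```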
After attention has mixed information across tokens, each token now carries context from the others. But attention alone doesn't give the model much power to transform each token's features independently. That's why every encoder block also includes a feed-forward network.
Example (typical 4× expansion):
graph TD
A["Token vector: [512 dims]"] -->|W_1| B["Expanded: [2048 dims] (4× larger)"]
B -->|ReLU| C["Activated: [2048 dims]"]
C -->|W_2| D["Compressed: [512 dims] (back to original)"]
Think of it as giving each token more "thinking space" before compressing the insights back.
📘 Details: FFN.md
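A sketch of that expansion and compression as a position-wise module (dimensions match the diagram above):

```python
import torch.nn as nn

d_model, d_ff = 512, 2048        # 4× expansion as in the diagram

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),    # expand: 512 -> 2048
    nn.ReLU(),                   # non-linearity ("thinking space")
    nn.Linear(d_ff, d_model),    # compress back: 2048 -> 512
)
# Applied to each token independently: input (n, 512) -> output (n, 512)
```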
Just like with attention, the FFN output is added to its input (residual connection) and then normalized.
A single encoder block: Multi-Head Attention → Add & Norm → FFN → Add & Norm.
nn.TransformerEncoderLayer(
d_model, # embedding dimension
nhead, # number of attention heads (d_model must be divisible by nhead)
dim_feedforward=2048, # d_ff, inner FFN dimension
dropout=0.1,
activation='relu', # or 'gelu'
batch_first=False, # if True, input shape is (batch, seq, d_model)
norm_first=False, # if True, uses Pre-LN instead of Post-LN
)

Stacks N encoder layers with an optional final LayerNorm.
nn.TransformerEncoder(
encoder_layer, # an instance of TransformerEncoderLayer
num_layers, # N — how many times to repeat the layer
norm=None, # optional final LayerNorm applied after all layers
)

| batch_first | Input shape | Output shape |
|---|---|---|
| False (default) | (seq, batch, d_model) | (seq, batch, d_model) |
| True | (batch, seq, d_model) | (batch, seq, d_model) |
- src_key_padding_mask — boolean mask of shape (batch, seq), marks padding tokens to ignore
- mask — additive attention mask of shape (seq, seq), used to block certain positions (e.g. causal masking)
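A minimal usage sketch combining the layer, the stack, and a padding mask (random tensors are used just to show the shapes):

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

src = torch.randn(2, 10, 512)                       # (batch=2, seq=10, d_model=512)
padding_mask = torch.zeros(2, 10, dtype=torch.bool)
padding_mask[:, 8:] = True                          # last two positions are padding

out = encoder(src, src_key_padding_mask=padding_mask)  # (2, 10, 512)
```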
The key differences of the decoder compared to the encoder:
- Causal masking: During inference, each token only has access to what comes before it (not future tokens). This is enforced by masks in the decoder.
- Training with causal masks: The causal mask is also applied during training (not just inference).
Example sequence: "Il gatto mangia pesce"
Attention mask (1=allowed, 0=blocked):
Il gatto mangia pesce
Il 1 0 0 0
gatto 1 1 0 0
mangia 1 1 1 0
pesce 1 1 1 1
Each token can only attend to itself and previous tokens (lower triangular matrix).
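A sketch of building that lower-triangular mask in PyTorch; the built-in helper returns the additive form (0 where attention is allowed, -inf where blocked), which is equivalent to the 1/0 picture above:

```python
import torch
import torch.nn as nn

seq_len = 4  # "Il gatto mangia pesce"

# Boolean version of the picture above: True = allowed (lower triangular)
allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# PyTorch's additive version: 0.0 where allowed, -inf above the diagonal
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
```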
The decoder has an additional cross-attention sublayer compared to the encoder:
graph TD
A[Decoder Input] --> B[Masked Self-Attention]
A -.Skip.-> C1[Add]
B --> C1
C1 --> D1[LayerNorm]
D1 --> E[Cross-Attention]
ENC["Encoder Output<br/>(K, V)"] -.->|Keys & Values| E
D1 -.Skip.-> C2[Add]
E --> C2
C2 --> D2[LayerNorm]
D2 --> F[Feed-Forward Network]
D2 -.Skip.-> C3[Add]
F --> C3
C3 --> D3[LayerNorm]
D3 --> G[Output]
style A fill:#e1f5ff
style ENC fill:#ffe1d4
style C1 fill:#ffe1e1
style C2 fill:#ffe1e1
style C3 fill:#ffe1e1
style G fill:#e1ffe1
Three sublayers in each decoder block:
- Masked Self-Attention - attends only to previous positions in the decoder
- Cross-Attention - queries the encoder output (Keys and Values from encoder, Query from decoder)
- Feed-Forward Network - same as in encoder
Each sublayer has a residual connection followed by LayerNorm (Post-LN architecture).
You can define the decoder using the same classes as the encoder with subtle additions:
Optional parameters for causal behavior (all passed at forward time, not to the constructor):
- is_causal=True for TransformerEncoder and TransformerEncoderLayer (a hint that the supplied mask is causal)
- src_mask=custom_mask for TransformerEncoderLayer
- mask=custom_mask for TransformerEncoder
Example (seq_len and x are assumed to be defined elsewhere):

decoder_layer = nn.TransformerEncoderLayer(
    d_model=512,
    nhead=8,
)
# Causal masking is applied when calling the layer, not in the constructor:
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
out = decoder_layer(x, src_mask=causal_mask, is_causal=True)

graph LR
A["Source Sequence<br/>(English)"] --> B[Encoder Stack]
B --> C["Encoder Output<br/>(K, V memory)"]
D["Target Sequence<br/>(Italian so far)"] --> E[Decoder Stack]
C -.Cross-Attention.-> E
E --> F["Linear Layer +<br/>Softmax"]
F --> G["Next Token<br/>Probabilities"]
The encoder reads and memorizes the English sentence. The decoder generates Italian word by word, consulting that memory at every step.
Think of it as a human translator:
- First they read the full English sentence and build an understanding of it
- Then they write the Italian translation one word at a time, glancing back at the English as needed
| Step | Type | Q | K | V | Intuition |
|---|---|---|---|---|---|
| 1 | Masked self-attention | Italian so far | Italian so far | Italian so far | "What have I written so far?" |
| 2 | Cross-attention | Italian so far | English memory | English memory | "Which part of the English is relevant right now?" |
| 3 | FFN | — | — | — | Per-token transformation |
Q comes from the decoder (Italian), K and V come from the encoder (English).
graph LR
A[Encoder<br/>English: 'The cat'] -->|K, V| C[Cross-Attention]
B[Decoder<br/>Italian: 'Il'] -->|Q| C
C --> D[Attended Output]
The key intuition: Q is the searcher, K/V is the database.
- The Italian decoder state asks: "I'm about to generate the next word — what English context is relevant?"
- The English memory just sits there, frozen, waiting to be queried
Timeline:
- English is done — it ran once through the encoder and produced a fixed set of vectors
- Italian is live — it grows one token at a time, and at each step reaches back into the English memory
It wouldn't make sense the other way around: the English has nothing to search for, it's already fully processed.
Example query: When generating "gatto" (cat), the decoder's Q vector searches the English K/V memory and finds high attention to "cat".
"The cat eats fish"
↓
Encoder → fixed English memory (K, V)
Decoder step by step:
<BOS> → asks English memory → "Il"
"Il" → asks English memory → "gatto"
"gatto" → asks English memory → "mangia"
"mangia" → asks English memory → "pesce"
"pesce" → asks English memory → <EOS>
At each step, the decoder has two sources of information:
- Its own past output (via masked self-attention) — what Italian it has written so far
- The English memory (via cross-attention) — what the source sentence says
The decoder output is just a vector of size d_model per position; to turn it into a word, it must be projected onto the vocabulary.
Pipeline:
graph TD
A["Decoder output (d_model)"] --> B[Linear projection]
B --> C["30,000 logits (one per Italian word)"]
C --> D[Softmax]
D --> E[Probability distribution]
E --> F[argmax]
F --> G[Next token]
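A minimal sketch of that pipeline, assuming a vocabulary of 30,000 Italian tokens as in the diagram (greedy argmax decoding for simplicity):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 30_000
to_vocab = nn.Linear(d_model, vocab_size)  # linear projection to logits

decoder_output = torch.randn(d_model)      # decoder output for the current position
logits = to_vocab(decoder_output)          # 30,000 raw scores, one per Italian token
probs = torch.softmax(logits, dim=-1)      # probability distribution over the vocabulary
next_token = probs.argmax()                # greedy choice of the next token id
```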
The transformer architecture revolutionized NLP by replacing sequential processing with parallel attention mechanisms:
Key insights:
- Attention is communication - tokens exchange information by asking "what should I attend to?"
- FFN is computation - each token transforms its representation independently
- Residuals preserve information - skip connections ensure gradients flow and information persists
- LayerNorm keeps training stable - normalization prevents value explosion while preserving meaningful distributions
The three architectures serve different purposes:
- Encoder (BERT-style) - bidirectional understanding, best for classification and understanding tasks
- Decoder (GPT-style) - causal generation, best for text generation and completion
- Encoder-Decoder (T5-style) - sequence-to-sequence, best for translation and summarization
Dimension trick: Multi-head attention splits d_model into h heads of size d_k = d_model / h, runs them in parallel, then concatenates the heads and projects with W_O, so input and output shapes match.
Why it works: By allowing every token to look at every other token simultaneously (with positional encoding for order), transformers capture long-range dependencies that RNNs struggled with, while remaining fully parallelizable for efficient training.
📚 For deeper dives:
- ATTENTION_DIMENTIONS.md - Complete dimensional analysis
- DIVIDE_BY_SQRT_D.md - Mathematical justification for scaling
- FFN.md - Feed-forward network details
