
🤖 Transformer Description

📖 Intro

The main advantage of the transformer architecture over previous approaches in NLP is that every word can attend to every other word simultaneously, instead of the sequence being processed step by step. Word order is preserved through positional encoding.

There are 3 types of architectures:

  • Encoder - processes input and builds understanding (e.g., BERT)
  • Decoder - generates output autoregressively (e.g., GPT)
  • Encoder-Decoder - reads one sequence, writes another (e.g., translation)

📚 Contents

🔍 Attention

The K, Q, V projection matrices take the embedding vectors and transform them into a representation space for a given attention head.

  • K (Key) - what I look like in this space (in this head)
  • Q (Query) - what I'm looking for
  • V (Value) - if you attend to me, here is what I will give you

Attention Mechanism Dimensions

Example with Numbers

Consider the sequence "The cat sat":

  • Sequence length: n = 3 tokens
  • Model dimension: d_model = 512
  • Number of heads: h = 8
  • Per-head dimension: d_k = 512 / 8 = 64

How Attention Works

graph TD
    A["Input X (n × d_model)"] --> B[Project to Q]
    A --> C[Project to K]
    A --> D[Project to V]
    B --> E["Compute Q·K^T<br/>(n × n)"]
    C --> E
    E --> F["Scale by √d_k"]
    F --> G["Softmax per row"]
    G --> H["Attention Weights<br/>(n × n)"]
    H --> I["Weighted sum:<br/>Attention·V"]
    D --> I
    I --> J["Output (n × d_model)"]
  1. Project to Q, K, V: Input X is projected through learned weight matrices $W_Q$, $W_K$, $W_V$ to create Query, Key, and Value matrices.

  2. Compute attention scores: We multiply Q * K^T, which produces raw attention scores for every pair of words in the sequence.

  3. Scale the scores: To keep these scores from getting too large, each one is divided by √d_k. 📘 Detailed explanation: DIVIDE_BY_SQRT_D.md

  4. Apply softmax: This converts scores into a probability distribution (each row sums to 1).

    Attention Weights Example:

         The   cat   sat
    The  0.6   0.3   0.1
    cat  0.2   0.5   0.3
    sat  0.1   0.2   0.7
    

    Each row shows how much each token attends to others (probabilities sum to 1).

  5. Weighted sum of values: Multiply the attention weights by V to get a context-aware representation of each token.

    Formula: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

📐 Dimension compatibility details: ATTENTION_DIMENTIONS.md
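The five steps above can be sketched in plain NumPy (a minimal illustration for a single head, not a library implementation; the toy `d_k = 4` is an assumption for readability):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) raw pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

n, d_k = 3, 4                                       # 3 tokens ("The cat sat"), toy head size
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d_k))              # three random (n, d_k) matrices
out, weights = scaled_dot_product_attention(Q, K, V)
# out: (3, 4) context-aware token vectors; weights: (3, 3) attention matrix
```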

Multi-Head Attention Function

The final function has the form mha(Q, K, V) (multi-head attention):

  • mha(x, x, x) - self-attention (a token attends to its own sequence)
  • mha(x, y, y) - cross-attention (tokens from one sequence attend to another sequence)
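In PyTorch this function is nn.MultiheadAttention; a minimal sketch of the two calling patterns (zero tensors are used only to show the convention and the output shapes):

```python
import torch
from torch import nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.zeros(1, 3, 512)       # e.g. "The cat sat"
y = torch.zeros(1, 5, 512)       # another sequence (e.g. encoder memory)

self_out, _ = mha(x, x, x)       # self-attention: tokens attend within x
cross_out, _ = mha(x, y, y)      # cross-attention: queries from x, keys/values from y
# both outputs keep the query's shape: (1, 3, 512)
```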

🏗️ Encoder

Encoder Block Flow (Post-LN Architecture)

graph TD
    A[Input X] --> B[Multi-Head Attention]
    A -.Skip Connection.-> C[Add]
    B --> C
    C --> D[LayerNorm]
    D --> E[Feed-Forward Network]
    D -.Skip Connection.-> F[Add]
    E --> F
    F --> G[LayerNorm]
    G --> H[Output to next layer]

    style A fill:#e1f5ff
    style C fill:#ffe1e1
    style F fill:#ffe1e1
    style H fill:#e1ffe1

Note: This describes the Post-LN architecture from the original "Attention is All You Need" paper, where LayerNorm comes after the residual addition. This is PyTorch's default (norm_first=False). The formula for each sublayer is: x = LayerNorm(x + Sublayer(x)).

Components Explained

  • Multi-Head Attention (MHA) - first sublayer that allows tokens to exchange information across the sequence.

  • Add - element-wise addition combining the sublayer output with the input (residual/skip connection).

  • LayerNorm - normalizes each token's embedding vector independently of the others. This keeps activations on a stable scale during training and prevents values from exploding.

    Strict normalization to $\mathcal{N}(0,1)$ can be too restrictive, because valuable information can be encoded in the scale and shift of the distribution. So each LayerNorm has learnable parameters $\gamma$ and $\beta$ (vectors of size $d_{\text{model}}$, applied per feature dimension, not per token).

    Example:

    Before LayerNorm: [0.1, 10.5, -5.2, 8.3, ...]
    After normalize:  [0.0,  1.2, -0.8, 0.9, ...]  (mean≈0, std≈1)
    After scale/shift: γ·value + β (per dimension)
    

    Two steps:

    1. Normalize across the embedding dimension
    2. Rescale with the learnable per-feature $\gamma$ and $\beta$ (so useful information about relative feature scales isn't lost)
  • Feed-Forward Network (FFN) - second sublayer that transforms each token independently.

The encoder follows this pattern twice (once for MHA, once for FFN):

  1. Apply sublayer (MHA or FFN)
  2. Add residual connection: x_out = x_in + sublayer_output
  3. Normalize: x = LayerNorm(x_out)
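The pattern x = LayerNorm(x + Sublayer(x)) can be sketched in NumPy; the sublayer here is a hypothetical stand-in for MHA or the FFN:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each token's embedding across its features, then rescale/shift
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

d_model = 8
x = np.arange(2.0 * d_model).reshape(2, d_model)   # 2 tokens, toy values
gamma, beta = np.ones(d_model), np.zeros(d_model)  # learnable in a real model

def sublayer(x):
    return 0.5 * x  # hypothetical stand-in for MHA or FFN

out = layer_norm(x + sublayer(x), gamma, beta)     # residual add, then normalize
# each row of out now has mean ~0 and std ~1
```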

Multi-Head Attention Dimensionality

But how does the MHA output dimension match the input dimension? Here's the Multi-Head Attention dimensionality trick:

Each of the $h$ heads operates in a reduced subspace $d_k = d_{\text{model}} / h$, so concatenating all heads yields:

$\text{Concat}(\text{head}_1, \ldots, \text{head}_h) \in \mathbb{R}^{h \cdot d_k} = \mathbb{R}^{d_{\text{model}}}$

Multi-Head Parallel Processing:

graph TD
    A["Input: (n × d_model)"] --> B["Split into h heads"]
    B --> C1["Head 1<br/>(n × d_k)"]
    B --> C2["Head 2<br/>(n × d_k)"]
    B --> C3["..."]
    B --> C4["Head h<br/>(n × d_k)"]

    C1 --> D1["Attention<br/>(n × d_v)"]
    C2 --> D2["Attention<br/>(n × d_v)"]
    C3 --> D3["..."]
    C4 --> D4["Attention<br/>(n × d_v)"]

    D1 --> E["Concatenate"]
    D2 --> E
    D3 --> E
    D4 --> E

    E --> F["Linear W_O"]
    F --> G["Output: (n × d_model)"]

Each head processes the input in parallel, learning different aspects of the relationships. Output shape is preserved. A final projection $W_O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ mixes the heads.

📐 See ATTENTION_DIMENTIONS.md for the full step-by-step breakdown.
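The shape bookkeeping can be verified in a few lines of NumPy (shapes only; the per-head attention itself is omitted):

```python
import numpy as np

n, d_model, h = 3, 512, 8
d_k = d_model // h                                     # 64 dims per head

x = np.zeros((n, d_model))
heads = x.reshape(n, h, d_k).transpose(1, 0, 2)        # (h, n, d_k): one subspace per head
# ... each head would run scaled dot-product attention here ...
concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenated: (n, d_model)

W_O = np.zeros((d_model, d_model))                     # final mixing projection
out = concat @ W_O                                     # (n, d_model), same as the input
```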


Feed-Forward Network Details

After attention has mixed information across tokens, each token now carries context from the others. But attention alone doesn't give the model much power to transform each token's features independently. That's why every encoder block also includes a feed-forward network.

Example (typical 4× expansion):

graph TD
    A["Token vector: [512 dims]"] -->|W_1| B["Expanded: [2048 dims] (4× larger)"]
    B -->|ReLU| C["Activated: [2048 dims]"]
    C -->|W_2| D["Compressed: [512 dims] (back to original)"]

Think of it as giving each token more "thinking space" before compressing the insights back.

📘 Details: FFN.md

Just like with attention, the FFN output is added to its input (residual connection) and then normalized.
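A minimal NumPy sketch of the position-wise FFN with the 4× expansion (random weights stand in for trained ones):

```python
import numpy as np

d_model, d_ff = 512, 2048          # 4x expansion, as in the original paper

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

def ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every token independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

tokens = rng.normal(size=(3, d_model))   # "The cat sat", after attention
out = ffn(tokens)                        # (3, 512): same shape in and out
```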


Main PyTorch Transformer Encoder Classes

nn.TransformerEncoderLayer

A single encoder block: Multi-Head Attention → Add & Norm → FFN → Add & Norm.

nn.TransformerEncoderLayer(
    d_model,       # embedding dimension
    nhead,         # number of attention heads (d_model must be divisible by nhead)
    dim_feedforward=2048,  # d_ff, inner FFN dimension
    dropout=0.1,
    activation='relu',     # or 'gelu'
    batch_first=False,     # if True, input shape is (batch, seq, d_model)
    norm_first=False,      # if True, uses Pre-LN instead of Post-LN
)

nn.TransformerEncoder

Stacks N encoder layers with an optional final LayerNorm.

nn.TransformerEncoder(
    encoder_layer,   # an instance of TransformerEncoderLayer
    num_layers,      # N — how many times to repeat the layer
    norm=None,       # optional final LayerNorm applied after all layers
)

Input/Output shape

| batch_first | Input shape | Output shape |
|---|---|---|
| False (default) | (seq, batch, d_model) | (seq, batch, d_model) |
| True | (batch, seq, d_model) | (batch, seq, d_model) |

Notable optional arguments at forward time

  • src_key_padding_mask — boolean mask of shape (batch, seq), marks padding tokens to ignore
  • mask — additive attention mask of shape (seq, seq), used to block certain positions (e.g. causal masking)
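Putting the classes and the padding mask together (a minimal sketch with zero tensors; shapes assume batch_first=True):

```python
import torch
from torch import nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

src = torch.zeros(2, 5, 512)                        # (batch, seq, d_model)
padding_mask = torch.zeros(2, 5, dtype=torch.bool)  # True = ignore this position
padding_mask[1, 3:] = True                          # 2nd sentence: 3 real tokens + 2 pads

out = encoder(src, src_key_padding_mask=padding_mask)  # (2, 5, 512)
```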

🎯 Decoder

The key differences of the decoder compared to the encoder:

  1. Causal masking: each token can attend only to itself and earlier positions, never to future tokens. This is enforced by an attention mask in the decoder.

  2. Training with causal masks: the same mask is applied during training, not just inference, so the model never learns to rely on future tokens.

Causal Mask Visualization

Example sequence: "Il gatto mangia pesce"

Attention mask (1=allowed, 0=blocked):

         Il  gatto  mangia  pesce
Il       1    0      0       0
gatto    1    1      0       0
mangia   1    1      1       0
pesce    1    1      1       1

Each token can only attend to itself and previous tokens (lower triangular matrix).
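The lower-triangular mask above is one line of NumPy (PyTorch's additive equivalent, with -inf above the diagonal, comes from nn.Transformer.generate_square_subsequent_mask):

```python
import numpy as np

n = 4  # "Il gatto mangia pesce"
allowed = np.tril(np.ones((n, n), dtype=int))  # 1 = allowed, 0 = blocked
print(allowed)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```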

Decoder Block Flow

The decoder has an additional cross-attention sublayer compared to the encoder:

graph TD
    A[Decoder Input] --> B[Masked Self-Attention]
    A -.Skip.-> C1[Add]
    B --> C1
    C1 --> D1[LayerNorm]

    D1 --> E[Cross-Attention]
    ENC["Encoder Output<br/>(K, V)"] -.->|Keys & Values| E
    D1 -.Skip.-> C2[Add]
    E --> C2
    C2 --> D2[LayerNorm]

    D2 --> F[Feed-Forward Network]
    D2 -.Skip.-> C3[Add]
    F --> C3
    C3 --> D3[LayerNorm]
    D3 --> G[Output]

    style A fill:#e1f5ff
    style ENC fill:#ffe1d4
    style C1 fill:#ffe1e1
    style C2 fill:#ffe1e1
    style C3 fill:#ffe1e1
    style G fill:#e1ffe1

Three sublayers in each decoder block:

  1. Masked Self-Attention - attends only to previous positions in the decoder
  2. Cross-Attention - queries the encoder output (Keys and Values from encoder, Query from decoder)
  3. Feed-Forward Network - same as in encoder

Each sublayer has a residual connection followed by LayerNorm (Post-LN architecture).
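PyTorch packages these three sublayers as nn.TransformerDecoderLayer; a minimal shape sketch (zero tensors and untrained weights, just to show the calling convention):

```python
import torch
from torch import nn

dec_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)

tgt = torch.zeros(1, 4, 512)     # target tokens generated so far
memory = torch.zeros(1, 6, 512)  # encoder output (Keys and Values for cross-attention)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(4)  # causal mask

out = dec_layer(tgt, memory, tgt_mask=tgt_mask)  # (1, 4, 512)
```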

How to Define in PyTorch

You can define a decoder-only model using the same classes as the encoder, with a causal mask added at forward time:

Optional parameters for causal behavior (all passed to forward, not the constructor):

  • src_mask=causal_mask for TransformerEncoderLayer
  • mask=causal_mask for TransformerEncoder
  • is_causal=True — a hint that the supplied mask is causal, which lets PyTorch pick a faster code path

Example:

decoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
out = decoder_layer(src, src_mask=causal_mask, is_causal=True)  # causal masking applied

🔄 Encoder-Decoder Transformer (English → Italian)

Architecture Overview

graph LR
    A["Source Sequence<br/>(English)"] --> B[Encoder Stack]
    B --> C["Encoder Output<br/>(K, V memory)"]
    D["Target Sequence<br/>(Italian so far)"] --> E[Decoder Stack]
    C -.Cross-Attention.-> E
    E --> F["Linear Layer +<br/>Softmax"]
    F --> G["Next Token<br/>Probabilities"]

The big picture intuition

The encoder reads and memorizes the English sentence. The decoder generates Italian word by word, consulting that memory at every step.

Think of it as a human translator:

  • First they read the full English sentence and build an understanding of it
  • Then they write the Italian translation one word at a time, glancing back at the English as needed

The three operations in each decoder layer

| Step | Type | Q | K | V | Intuition |
|---|---|---|---|---|---|
| 1 | Masked self-attention | Italian so far | Italian so far | Italian so far | "What have I written so far?" |
| 2 | Cross-attention | Italian so far | English memory | English memory | "Which part of the English is relevant right now?" |
| 3 | FFN | - | - | - | Per-token transformation |

Cross-attention: the core of translation

Q comes from the decoder (Italian), K and V come from the encoder (English).

graph LR
    A[Encoder<br/>English: 'The cat'] -->|K, V| C[Cross-Attention]
    B[Decoder<br/>Italian: 'Il'] -->|Q| C
    C --> D[Attended Output]

The key intuition: Q is the searcher, K/V is the database.

  • The Italian decoder state asks: "I'm about to generate the next word — what English context is relevant?"
  • The English memory just sits there, frozen, waiting to be queried

Timeline:

  • English is done — it ran once through the encoder and produced a fixed set of vectors
  • Italian is live — it grows one token at a time, and at each step reaches back into the English memory

It wouldn't make sense the other way around: the English has nothing to search for, it's already fully processed.

Example query: When generating "gatto" (cat), the decoder's Q vector searches the English K/V memory and finds high attention to "cat".


Autoregressive generation loop

"The cat eats fish"
↓
Encoder  →  fixed English memory (K, V)

Decoder step by step:
<BOS>    → asks English memory → "Il"
"Il"     → asks English memory → "gatto"
"gatto"  → asks English memory → "mangia"
"mangia" → asks English memory → "pesce"
"pesce"  → asks English memory → <EOS>

At each step, the decoder has two sources of information:

  1. Its own past output (via masked self-attention) — what Italian it has written so far
  2. The English memory (via cross-attention) — what the source sentence says
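The loop above can be sketched in pure Python; the lookup table NEXT is a hypothetical stand-in for the entire decoder + projection + argmax pipeline:

```python
BOS, EOS = "<BOS>", "<EOS>"
english_memory = "The cat eats fish"  # stands in for the encoder's fixed (K, V)

# hypothetical stand-in for decoder(memory, tokens) -> most likely next token
NEXT = {BOS: "Il", "Il": "gatto", "gatto": "mangia",
        "mangia": "pesce", "pesce": EOS}

def generate(memory, max_len=10):
    tokens = [BOS]
    while tokens[-1] != EOS and len(tokens) < max_len:
        tokens.append(NEXT[tokens[-1]])  # a real model re-runs the decoder here
    return tokens[1:-1]                  # strip <BOS> and <EOS>

generate(english_memory)  # ['Il', 'gatto', 'mangia', 'pesce']
```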

Final step: picking the next word

The decoder output is just a vector of size $d_{\text{model}}$. A linear layer projects it to vocabulary size $|V_{\text{Italian}}|$ (e.g., 30,000), softmax gives probabilities, and you pick the most likely next token.

Pipeline:

graph TD
    A["Decoder output (d_model)"] --> B[Linear projection]
    B --> C["30,000 logits (one per Italian word)"]
    C --> D[Softmax]
    D --> E[Probability distribution]
    E --> F[argmax]
    F --> G[Next token]
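The same pipeline in NumPy (random weights stand in for the trained projection; the vocabulary size is the 30,000 from the example above):

```python
import numpy as np

d_model, vocab_size = 512, 30_000
rng = np.random.default_rng(0)

W_vocab = rng.normal(scale=0.02, size=(d_model, vocab_size))  # linear projection
decoder_out = rng.normal(size=d_model)                        # one decoder output vector

logits = decoder_out @ W_vocab                 # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax: a probability distribution
next_token_id = int(np.argmax(probs))          # greedy pick of the next token
```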

🎓 Conclusion

The transformer architecture revolutionized NLP by replacing sequential processing with parallel attention mechanisms:

Key insights:

  • Attention is communication - tokens exchange information by asking "what should I attend to?"
  • FFN is computation - each token transforms its representation independently
  • Residuals preserve information - skip connections ensure gradients flow and information persists
  • LayerNorm keeps training stable - normalization prevents value explosion while preserving meaningful distributions

The three architectures serve different purposes:

  • Encoder (BERT-style) - bidirectional understanding, best for classification and understanding tasks
  • Decoder (GPT-style) - causal generation, best for text generation and completion
  • Encoder-Decoder (T5-style) - sequence-to-sequence, best for translation and summarization

Dimension trick: Multi-head attention splits $d_{\text{model}}$ into $h$ smaller subspaces of size $d_k = d_{\text{model}} / h$. Each head learns different patterns, then outputs are concatenated and mixed via $W_O$.

Why it works: By allowing every token to look at every other token simultaneously (with positional encoding for order), transformers capture long-range dependencies that RNNs struggled with, while remaining fully parallelizable for efficient training.

📚 For deeper dives:
