# Understanding Decoder‑Only Transformers: A Token‑Level Journey

While there are many excellent overviews of decoder‑only architectures, I found that most leave out the deeper, token‑level intuition (or at least I didn’t get it while reading them). I’d been following each multiplication and decision mechanically without fully grasping what was happening under the hood, since many tutorials don’t show shapes explicitly or assume that matrix multiplications are intuitive.

This notebook aims to change that. It dives into every token-level transformation, showing shapes, intermediate vectors, and how information flows through a decoder‑only stack. So if you think you already know the high-level stuff but catch yourself being unsure when taking things a level deeper, this might be just right for you.

Hopefully, this walk‑through will spark the same “aha!” moments it did for me. If not, and you still don’t fully grasp it, I strongly suggest you make such a notebook yourself, as this was a great way of learning the stuff!

Like any other insight out there, take it with a grain of salt! 

Feel free to open an issue if I got something wrong. I’m not an expert myself, but I’m trying to get somewhere decent!

___
___

# Design Parameters of the GPT

___
___

- **dim embd/dim model** (denoted $d$): The size of our embeddings, also called the model’s (hidden) size
- **vocab size** (denoted $v$): The size of our vocabulary (number of different input tokens)
- **context length** (denoted $c$): The maximal context length (input length in tokens) we can feed our model
- **n heads** (denoted $H$): The number of attention heads in each transformer block
- **n layers** (denoted $N$): The number of transformer blocks
___
___

# From Token IDs to Input Embeddings
___
___

*Note:* We omit the batch dimension for simplicity, so every tensor represents a single sequence!

### What do we have and what does it look like?

- **Input token sequence**: $\mathbf{I}_{\mathrm{ids}} \in \{0, \dots, v-1\}^{c}$ - this is our input vector of (maximal) length $c$ containing the integer token IDs.
- **Token embedding matrix**: $\mathbf{E}_{\text{token}} \in \mathbb{R}^{v \times d}$ - this is our learned embedding matrix which we use as lookup table to map each of the $v$ vocabulary items to a $d$-dimensional embedding vector.
- **Positional embedding matrix**: $\mathbf{E}_{\text{pos}} \in \mathbb{R}^{c \times d}$ - this is our learned positional embedding matrix. It provides a $d$-dimensional vector that encodes positional information for each of the $c$ positions in our input.

### What happens here?

1. Use $\mathbf{I}_{\mathrm{ids}}$ to index into $\mathbf{E}_{\text{token}}$, resulting in:  
   $I_{\mathrm{emb}} \in \mathbb{R}^{c \times d}$  
2. Add the positional embeddings elementwise:  
   $I_{\mathrm{emb}} = I_{\mathrm{emb}} + \mathbf{E}_{\text{pos}}$

### What do we end up with?

After this step, each token is represented by a vector that encodes both **its identity and position**:  
$I_{\mathrm{emb}} \in \mathbb{R}^{c \times d}$
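
To see the lookup and the addition with concrete shapes, here is a minimal NumPy sketch. The sizes, the random matrices, and the token IDs are made-up toy values (in a real model both matrices are learned); only the shapes and the two operations mirror the steps above.

```python
import numpy as np

# Toy sizes, chosen only for illustration (not taken from any real model).
v, c, d = 10, 4, 8

rng = np.random.default_rng(0)
E_token = rng.normal(size=(v, d))  # token embedding matrix, shape (v, d)
E_pos = rng.normal(size=(c, d))    # positional embedding matrix, shape (c, d)

I_ids = np.array([3, 1, 3, 7])     # hypothetical token IDs, shape (c,); ID 3 appears twice

I_emb = E_token[I_ids]             # step 1: row lookup, shape (c, d)
I_emb = I_emb + E_pos              # step 2: elementwise add, shape stays (c, d)

print(I_emb.shape)  # (4, 8)

# Positions 0 and 2 hold the same token, so their rows agree before the add
# and differ afterwards.
print(np.allclose(E_token[I_ids][0], E_token[I_ids][2]))
print(np.allclose(I_emb[0], I_emb[2]))
```

Positions 0 and 2 contain the same token, so their rows start out identical and only become distinguishable once the positional vectors are added.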

### Why add position information this way?

It may seem unintuitive to **add** positional vectors instead of appending them. After all, the sum of a token embedding and its positional vector could, in theory, be identical to the sum of a different token and a different position — leading to ambiguous representations. *(That was my initial concern.)*

But actually, spreading positional information across **all $d$ dimensions** allows every attention head to access positional context. Appending would isolate position in a subspace, which weakens its utility. Additionally, since the model learns both the token embeddings and the positional embeddings, it can make sure on its own that these combinations stay separable in a meaningful way.
___
___

# The $N$ Transformer Blocks

Now that we have our embedded input, we move into the **transformer blocks**, the core component of the architecture. The original GPT-1 paper [<cite>Radford, Narasimhan, Salimans & Sutskever (2018)</cite>](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) used 12 such transformer blocks. We will discuss the block structure used there, which applies the LayerNorm after each residual connection (post-LN), as in [Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762). Later models such as GPT-2 moved the LayerNorm to the input of each sub-layer instead (pre-LN).

___
___

Each of these blocks has the same structure. We will go through its components in order so that we can follow the manipulations in detail.
*Note:* In the subsequent descriptions I will use a notation that only strictly makes sense for the first block, right after embedding the tokens. I think that is more intuitive and easier to follow, but in general keep in mind that the input of a block is of course not the initially embedded input but the output of the previous block!

1. **Masked Multi-Head Self-Attention**
2. **Residual Connection**  
3. **LayerNorm**  
4. **Feedforward Network (MLP)**  
5. **Residual Connection**
6. **LayerNorm**  

---
---

# 1: Masked Multihead Self Attention (MHA)

[Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762)

---
---

Recall: after the embedding step, our input is:  
$I_{\mathrm{emb}} \in \mathbb{R}^{c \times d}$


### First step – Query $Q$, Key $K$, and Value $V$ calculation

The goal of this step is to transform each token’s embedding into three distinct representations (*query*, *key*, and *value*) which serve complementary roles. The *query* and *key* allow us to measure how relevant other tokens are, and the *value* contains the content to be pooled if that relevance is high. At this stage, the transformations are still applied token-wise and no interactions between tokens have occurred yet.

1. **Linear projections and head splitting**

    In Multi-Head Attention, we split the total model dimension $d$ into $H$ smaller parts, one per head, each of size $d_{head}$, typically with $d_{head} = d / H$.

    There are two equivalent ways to compute the  *query*, *key*, and *value* projections for each head:

    - **Single large projection + slicing** (commonly used in practice):  
        A single learned projection $W^Q \in \mathbb{R}^{d \times (H \cdot d_{head})}$ transforms the input into one large matrix, which is then reshaped into $H$ individual heads:

        $$
        Q = I_{\mathrm{emb}} W^Q \in \mathbb{R}^{c \times (H \cdot d_{head})} \quad \Rightarrow \quad Q \in \mathbb{R}^{c \times H \times d_{head}}
        $$

        Even though this projection is implemented as a single large matrix, each head operates only on its own slice of the output. During training, the attention computed within each head only influences the corresponding slice of $W^Q$, so each part of the projection matrix is still **effectively head-specific** in how it learns and updates. The same structure applies to $K$ and $V$.


    - **Separate projection per head** (conceptually clearer):  
        Each head has its own smaller projection matrix, $W^Q_h \in \mathbb{R}^{d \times d_{head}}$, applied independently to the input to compute $Q_h \in \mathbb{R}^{c \times d_{head}}$ for head $h$.

        Both approaches produce the same mathematical result: a tensor of shape $\mathbb{R}^{c \times H \times d_{head}}$ for each of $Q, K, V$, where each token now has multiple versions of itself—one per head.

2. **Choosing the per-head dimension $d_{head}$**

    - **Balanced capacity**: Often $d_{head} = d / H$, keeping the total number of parameters and computation constant across different numbers of heads.
    - **Trade-offs**: Larger $d_{head}$ per head can model more complex relationships, but increases cost; smaller $d_{head}$ is cheaper but may underrepresent fine-grained patterns.

3. **Token-level intuition**

    - Each row of $Q$, $K$, and $V$ corresponds to a token representation—now split across multiple heads.
    - The **query** vector encodes what this token is "looking for" in the sequence.
    - The **key** vector encodes what this token "offers" to other queries.
    - The **value** vector contains the actual content to be passed along if the match (query–key similarity) is strong.
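
To check the claim that the two projection variants give the same result, here is a small NumPy sketch. The sizes and the random weights are toy assumptions; the point is only the reshape versus per-head slicing.

```python
import numpy as np

# Toy sizes (assumptions for illustration).
c, d, H = 4, 8, 2
d_head = d // H

rng = np.random.default_rng(0)
I_emb = rng.normal(size=(c, d))
W_Q = rng.normal(size=(d, H * d_head))

# Variant 1: one large projection, then reshape to (c, H, d_head).
Q_big = (I_emb @ W_Q).reshape(c, H, d_head)

# Variant 2: one projection per head, using the matching column slice of W_Q.
Q_per_head = np.stack(
    [I_emb @ W_Q[:, h * d_head:(h + 1) * d_head] for h in range(H)],
    axis=1,
)

print(Q_big.shape)  # (4, 2, 4)
print(np.allclose(Q_big, Q_per_head))
```

The same slicing argument applies to $K$ and $V$ with their own weight matrices.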


### Second step – Raw Attention score calculation

Once we have the projected representations (per head) $Q, K \in \mathbb{R}^{c \times d_{head}}$, we compute the attention scores by taking the dot product between each query and all keys, followed by scaling. This results in a score matrix $A \in \mathbb{R}^{c \times c}$, where each entry $A_{i,j}$ represents the unnormalized relevance of token $j$ to token $i$.

Formally, the attention score matrix is computed as:

$$
A = \frac{Q K^{\top}}{\sqrt{d_{head}}} \in \mathbb{R}^{c \times c}
$$

Since the dot product can produce pretty big numbers if the dimensionality $d_{head}$ is large, we divide by $\sqrt{d_{head}}$ to counteract this growth. This stabilizes the gradients (and thereby learning), and it also makes the outcomes of the softmax applied later less peaky, which helps distribute the attention better over the different tokens.

By now each row in $A$ contains the (scaled) attention logits for a single token’s query vector against all other tokens' key vectors. For example, *A[2][5]* is the scaled dot product between the query vector of token $2$ and the key vector of token $5$. These scores are still unnormalized and are sometimes referred to as the "raw" attention scores.
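
As a quick sanity check of the indexing, here is a sketch for a single head with random stand-ins for $Q$ and $K$ (toy sizes, zero-based indices as in the `A[2][5]` example).

```python
import numpy as np

# Toy sizes (assumptions); Q and K are random stand-ins for one head.
c, d_head = 6, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(c, d_head))
K = rng.normal(size=(c, d_head))

A = (Q @ K.T) / np.sqrt(d_head)  # raw attention scores, shape (c, c)

print(A.shape)  # (6, 6)

# A[2][5] is the scaled dot product of query 2 with key 5.
print(np.isclose(A[2][5], Q[2] @ K[5] / np.sqrt(d_head)))
```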

### Third step – Causal Masking

In decoder-only architectures like GPT, masking is a required step. It ensures that **each token can only attend to itself and preceding tokens**, never to future ones. This preserves the autoregressive property of the model, where predictions are generated left-to-right without *peeking* ahead.

To enforce this, we apply a **causal mask** to the attention score matrix $A$ before the softmax. The mask suppresses all attention to future tokens by adding large negative values (effectively $-\infty$) to those positions:

Define the mask:

$$
M_{i,j} = \begin{cases}
0, & j \le i; \\\\
-\infty, & j > i,
\end{cases}
$$

and set:

$$
A_{\text{masked}} = A + M \in \mathbb{R}^{c \times c}
$$

This ensures that when softmax is applied, all entries corresponding to future tokens have probability zero, and each token attends only to itself and earlier tokens in the sequence.
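
The mask is easy to build directly from its definition. A minimal sketch, using a made-up score matrix so the effect is visible:

```python
import numpy as np

c = 4
A = np.arange(c * c, dtype=float).reshape(c, c)  # stand-in scores, not real attention values

rows = np.arange(c)[:, None]  # index i (the query position)
cols = np.arange(c)[None, :]  # index j (the key position)

# M[i, j] = 0 where j <= i, and -inf where j > i.
M = np.where(cols > rows, -np.inf, 0.0)

A_masked = A + M
print(A_masked)
```

Everything on and below the diagonal keeps its score, everything above it becomes $-\infty$.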

### Fourth step - Softmax

After masking, we apply the softmax function to each row of the attention score matrix  $A_{\text{masked}}$. This converts the raw (and masked) scores into a probability distribution over the input tokens:

$$
\alpha_{i,j} = \frac{\exp(A_{\text{masked}, i,j})}{\sum_{k=1}^{c} \exp(A_{\text{masked}, i,k})}
$$

This gives us the **attention weight matrix** $\alpha \in \mathbb{R}^{c \times c}$, where each row $\alpha_i$ represents how much token $i$ attends to every other token.
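
Here is a sketch of the row-wise softmax on masked scores. The scores are random stand-ins; subtracting the row maximum is a common stability trick that is my addition and does not change the result.

```python
import numpy as np

def softmax_rows(x):
    # Subtracting the row maximum avoids overflow and leaves the result unchanged.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)  # exp(-inf) is exactly 0, so masked entries vanish
    return e / e.sum(axis=-1, keepdims=True)

c = 4
rng = np.random.default_rng(0)
A = rng.normal(size=(c, c))  # stand-in raw scores
M = np.where(np.arange(c)[None, :] > np.arange(c)[:, None], -np.inf, 0.0)

alpha = softmax_rows(A + M)  # attention weights, shape (c, c)

print(alpha.sum(axis=-1))  # every row sums to 1
print(alpha[0])            # the first token can only attend to itself
```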

### Fifth step - Context calculation

These attention weights are then used to compute a **weighted sum of the value vectors**:

$$
\text{Attention}(Q, K, V) = \alpha V \in \mathbb{R}^{c \times d_{head}}
$$

The result is a new representation for each token, enriched by selectively aggregating information from the other tokens based on relevance.

Each token now has access to context-dependent information, shaped by the learned query-key matching and the value vectors it attends to.
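
Putting steps two to five together for one head gives the following sketch. $Q$, $K$, and $V$ are random stand-ins and the sizes are toy assumptions.

```python
import numpy as np

c, d_head = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(c, d_head)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d_head)                     # raw scores, (c, c)
allowed = np.tril(np.ones((c, c), dtype=bool))         # True where j <= i
scores = np.where(allowed, scores, -np.inf)            # causal mask
scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
alpha = np.exp(scores)
alpha = alpha / alpha.sum(axis=-1, keepdims=True)      # attention weights, (c, c)

context = alpha @ V                                    # (c, d_head)

# Row i of the result is a weighted sum of the value vectors.
manual = sum(alpha[2, j] * V[j] for j in range(c))

print(context.shape)  # (4, 8)
print(np.allclose(context[2], manual))
print(np.allclose(context[0], V[0]))  # token 0 only sees itself
```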

### Final step – Combining the heads

So far, each of the $H$ attention heads has produced its own context output of shape $\mathbb{R}^{c \times d_{head}}$. To merge these into a single representation per token, we **concatenate** the outputs along the feature dimension:

$$
\text{ConcatHead} = \text{Concat}(\text{head}_1, \ldots, \text{head}_H) \in \mathbb{R}^{c \times (H \cdot d_{head})}
$$

Then, we apply a final learned linear transformation $W^O \in \mathbb{R}^{(H \cdot d_{head}) \times d}$ to project this concatenated representation back into the model’s embedding space:

$$
\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H) \cdot W^O \in \mathbb{R}^{c \times d}
$$

The final projection matrix $W^O \in \mathbb{R}^{(H \cdot d_{head}) \times d}$ is applied independently to each token’s concatenated head output. It does **not** mix information across different tokens—each token is processed separately.

What it does do is mix information **across heads**: since each token’s vector includes all $H$ heads stacked into a single $\mathbb{R}^{H \cdot d_{head}}$ vector, the projection learns how to combine and weight the contributions from each head. This is the only step where information from the different heads is fused into a single representation per token.
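
A small sketch of the concatenation and the output projection, with random stand-ins for the head outputs and for $W^O$. It also shows an equivalent view that I find helpful: each head contributes through its own block of rows of $W^O$, and the contributions are summed.

```python
import numpy as np

c, d, H = 4, 8, 2
d_head = d // H
rng = np.random.default_rng(0)

heads = [rng.normal(size=(c, d_head)) for _ in range(H)]  # stand-ins for head outputs
W_O = rng.normal(size=(H * d_head, d))

concat = np.concatenate(heads, axis=-1)  # (c, H * d_head)
mha_out = concat @ W_O                   # (c, d)

# Equivalent view: head h only touches rows h*d_head .. (h+1)*d_head of W_O.
summed = sum(heads[h] @ W_O[h * d_head:(h + 1) * d_head] for h in range(H))

print(mha_out.shape)  # (4, 8)
print(np.allclose(mha_out, summed))

# No mixing across tokens: changing the last token's row leaves the others alone.
changed = concat.copy()
changed[-1] += 1.0
print(np.allclose((changed @ W_O)[:-1], mha_out[:-1]))
```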

### Some final thoughts on multihead self attention

I initially questioned the need for explicitly splitting attention into multiple heads. From a mathematical standpoint, it seemed plausible that a single head could learn to capture everything needed by selectively focusing on different aspects of the input. However, a key limitation emerges when we consider how attention behaves in practice.

Imagine one token is highly relevant to another; in a single-head setup, this dominant relevance would likely cause the attention mechanism to concentrate most, if not all, of the attention weight on that single token, effectively zeroing out contributions from others. This creates a bottleneck: the model is forced to choose one narrow interpretation or interaction per layer.

Multi-head attention addresses this by allowing the model to attend to information from different representation subspaces simultaneously. Each head learns to focus on different types of relationships (syntactic, semantic, positional, or otherwise) without competing for a single attention budget.

---
---

# 2: Residual Connection
[He et al., 2015](https://arxiv.org/abs/1512.03385)

---
---

Recall: after the multi-head self-attention, our input still has the same shape, but the representation of each token is no longer based solely on our learned embeddings and positions. It is now enriched with information from all the tokens that come before it. We will use the following notation for the output of the multi-head self-attention:
$I_{\mathrm{MHA}} \in \mathbb{R}^{c \times d}$

### What do we do here?

The residual connection is conceptually and mathematically quite simple (but, as we will see later, it has multiple nice effects and rationales). We just add the input of the preceding sub-layer to its output. Here that means $I_{\mathrm{emb}} + I_{\mathrm{MHA}}$.

So more generally stated:

If $x$ is the input to a sub-layer and $\mathrm{Sublayer}(x)$ its transformation, the output becomes:


$$
y = x + \mathrm{Sublayer}(x)
$$

### So what are the effects and rationales?

- **Improved gradient flow**: Residual connections **create a shortcut path** that **allows gradients to bypass** the inner operations of a sub-layer—such as multi-head attention or a feedforward block—during backpropagation. More precisely, they make it possible to compute gradient updates for earlier layers without the gradients having to pass entirely through the potentially unstable or compressive operations inside the sub-layer. This **reduces the risk of gradients becoming extremely small** due to repeated matrix multiplications, making it easier to train deep networks effectively.

- **Modeling residual functions**: Instead of learning the full output mapping  $\mathcal{F}(x) = y$, residual connections reframe the problem as learning the *difference* between the input and the desired output:

$$
\mathcal{F}(x) = y - x \quad \Rightarrow \quad y = x + \mathcal{F}(x)
$$

This means that the sub-layer only needs to learn how to adjust or refine the input $x$, rather than construct $y$ from scratch. If no change is needed, $\mathcal{F}(x)$ can simply output zero, and the identity mapping is preserved, which of course is much easier to learn than forcing a layer to reconstruct the input through transformations.
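
Here is a tiny sketch of both points, using a plain linear map as a stand-in sub-layer (an assumption for illustration, not the real attention or FFN). With zero weights the block is the identity, and with small weights the input still passes through almost unchanged.

```python
import numpy as np

def residual(x, sublayer):
    # y = x + Sublayer(x)
    return x + sublayer(x)

c, d = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(c, d))

# A sub-layer that has learned "no change needed": it outputs zero.
W_zero = np.zeros((d, d))
y_identity = residual(x, lambda h: h @ W_zero)
print(np.allclose(y_identity, x))

# A sub-layer with small weights only nudges the input.
W_small = 0.01 * rng.normal(size=(d, d))
y = residual(x, lambda h: h @ W_small)

# For this linear stand-in, y = x (I + W): the identity term is the shortcut
# path that keeps the signal (and its gradient) alive even if W is tiny.
print(np.allclose(y, x @ (np.eye(d) + W_small)))
```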

___
___

# 3: LayerNorm
[<cite>Ba, Kiros & Hinton (2016)</cite>](https://arxiv.org/abs/1607.06450) 
___
___

Recall: after the Multihead self attention and the residual connection, our input is still in the following shape:
$I_{\mathrm{MHA}} \in \mathbb{R}^{c \times d}$

This is a matrix with one $d$-dimensional row vector per token.

### What do we do here?

The Layer Normalization is applied rowwise, so **independently to each token vector**. It:

- Normalizes that vector to have **zero mean and unit variance**  
- Applies a learned scale ($\gamma$) and shift ($\beta$)

### Token-Level

Let $\mathbf{x}_i = (I_{\mathrm{MHA}})_i \in \mathbb{R}^d$ be the $i$-th token vector (the $i$-th row of $I_{\mathrm{MHA}}\in\mathbb{R}^{c\times d}$).

Then:

$$
\mathrm{LayerNorm}(\mathbf{x}_i)
=
\boldsymbol\gamma
\;\odot\;
\frac{\mathbf{x}_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}
\;+\;
\boldsymbol\beta
$$

Where:
- $\displaystyle \mu_i=\frac1d\sum_{j=1}^d x_{i,j}$  
- $\displaystyle \sigma_i^2=\frac1d\sum_{j=1}^d (x_{i,j}-\mu_i)^2$  
- $\epsilon>0$ is a small scalar for stability  
- $\boldsymbol\gamma,\boldsymbol\beta\in\mathbb{R}^d$ are _shared_ across all tokens  
- “$\odot$” denotes the element‐wise (Hadamard) product
- The subtraction and division are applied element-wise (broadcasting the scalars $\mu_i$ and $\sqrt{\sigma_i^2 + \epsilon}$ over all $d$ dimensions)


### Matrix-Level

You can equivalently write the same per‐row normalization in one shot on the full matrix, using broadcasting.

1. **Compute row‐means and variances**

   Let  
   - $I_{\mathrm{MHA}}\in\mathbb{R}^{c\times d}$  
   - $\mathbf{1}_d\in\mathbb{R}^d$ be a vector of all ones  
   - “$\odot$” denote element‐wise (Hadamard) product  

   Then the **row‐means** $\boldsymbol\mu\in\mathbb{R}^{c\times1}$ are
   $$
     \boldsymbol\mu
     = \frac{1}{d}\;I_{\mathrm{MHA}}\;\mathbf{1}_d
     \,,
   $$
   and the **row‐variances** $\boldsymbol\sigma^2\in\mathbb{R}^{c\times1}$ are
   $$
     \boldsymbol\sigma^2
     = \frac{1}{d}\,\bigl(I_{\mathrm{MHA}}\odot I_{\mathrm{MHA}}\bigr)\,\mathbf{1}_d
       \;-\;\boldsymbol\mu\odot\boldsymbol\mu
     \,.
   $$

2. **Normalize & apply scale/shift**

   Let  
   - $\boldsymbol\gamma,\boldsymbol\beta\in\mathbb{R}^{d}$ be the shared learnable scale and shift vectors,  
   - $\mathbf{1}_c\in\mathbb{R}^c$ be a vector of all ones,  
   - $\epsilon>0$ a small constant for numerical stability.

   Then
   $$
     \mathrm{LayerNorm}\bigl(I_{\mathrm{MHA}}\bigr)
     = \;
     \mathbf{1}_c\,\boldsymbol\gamma^{\!\top}
     \;\odot\;
     \frac{\,I_{\mathrm{MHA}}
           \;-\;\boldsymbol\mu\,\mathbf{1}_d^{\!\top}\,}
          {\sqrt{\boldsymbol\sigma^2 + \epsilon}\,\mathbf{1}_d^{\!\top}}
     \;+\;
     \mathbf{1}_c\,\boldsymbol\beta^{\!\top}
     \,.
   $$
   - The square root and the division are taken element-wise. Multiplying the $c\times1$ vectors $\boldsymbol\mu$ and $\sqrt{\boldsymbol\sigma^2+\epsilon}$ by $\mathbf{1}_d^{\!\top}$ **broadcasts** them across the $d$ columns.  
   - Multiplying $\mathbf{1}_c$ by $\boldsymbol\gamma^{\!\top}$ and $\boldsymbol\beta^{\!\top}$ **broadcasts** these vectors down to shape $c\times d$ before the element‐wise multiply/add.
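
To convince myself that the token-level and the matrix-level formulation agree, I find a direct comparison helpful. The sketch below uses toy sizes, a random input, and $\boldsymbol\gamma = 1$, $\boldsymbol\beta = 0$ as assumed initial values; the value of `eps` is also just a typical choice.

```python
import numpy as np

def layernorm_token(x, gamma, beta, eps=1e-5):
    # x is a single token vector of shape (d,).
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def layernorm_matrix(X, gamma, beta, eps=1e-5):
    # X has shape (c, d); the statistics have shape (c, 1) and broadcast over columns.
    mu = X.mean(axis=-1, keepdims=True)
    # E[x^2] - mu^2, as in the matrix formula (less robust numerically than the
    # two-pass version above, but fine for this illustration).
    var = (X * X).mean(axis=-1, keepdims=True) - mu * mu
    return gamma * (X - mu) / np.sqrt(var + eps) + beta

c, d = 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(c, d))
gamma, beta = np.ones(d), np.zeros(d)

row_by_row = np.stack([layernorm_token(X[i], gamma, beta) for i in range(c)])
all_at_once = layernorm_matrix(X, gamma, beta)

print(np.allclose(row_by_row, all_at_once))
print(all_at_once.mean(axis=-1))  # close to 0 for every token
print(all_at_once.std(axis=-1))   # close to 1 for every token
```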

### And why all that?

Layer normalization normalizes each token vector by subtracting its own mean and dividing by its own standard deviation. Like batch normalization, it is motivated by **internal covariate shift**, the phenomenon where changing parameters in earlier layers shifts the distribution of later-layer inputs [<cite>Ioffe & Szegedy (2015)</cite>](https://arxiv.org/abs/1502.03167). Keeping every token vector on a comparable scale helps keep gradient magnitudes well conditioned across depth, and because the statistics are computed per token, the result does not depend on the sequence length.

The subsequent learned affine parameters $\boldsymbol\gamma,\boldsymbol\beta$ then reintroduce per-feature scaling and bias, ensuring that the model’s representational power is fully preserved and that the network can recover any necessary distributional shape [<cite>Ba, Kiros & Hinton (2016)</cite>](https://arxiv.org/abs/1607.06450). 

In practice, this row-wise normalization tends to speed up and stabilize training, and because its statistics do not depend on the batch, it works even with very small batch sizes. This makes *LayerNorm* a natural fit for architectures such as the Transformer, where it supports stable and efficient training of deep attention layers [<cite>Vaswani et al. (2017)</cite>](https://arxiv.org/abs/1706.03762).

---
---

# 4: Feed Forward Network (FFN)
This is more of a standard component derived from classical neural network architectures, rather than a novel mechanism introduced in a single paper. Nevertheless we will discuss the structure proposed by [Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762). 

---
---

Recall: at this point our input is the output of the MHA plus the input of this MHA (due to the residual connection) and then normalized with the LayerNorm.

We will use the following notation for the input of this layer:
$I_{\mathrm{FFN}} \in \mathbb{R}^{c \times d}$

### What do we do and how do we do it?

The FFN consists of two linear transformations with a non-linearity in between:

$$
\text{FFN}(x) = \mathrm{Activation}(x W_1 + b_1) W_2 + b_2
$$

with:
- $W_1 \in \mathbb{R}^{d \times d_{\text{ff}}}$ and $b_1 \in \mathbb{R}^{d_{\text{ff}}}$
- $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d}$ and $b_2 \in \mathbb{R}^{d}$

Typically, $d_{\text{ff}} = 4d$, expanding the internal dimension significantly before projecting it back down.

This transformation is applied **independently to each token**—no interaction between tokens occurs in this layer. Each row in $I_{\mathrm{FFN}}$ is passed through the same FFN.

The original Transformer used *ReLU* as the activation function, but many modern variants (e.g., GPT) use *GELU*, which is smoother and often yields better performance. The *GELU* function was introduced by [Hendrycks & Gimpel (2016)](https://arxiv.org/abs/1606.08415).
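
Here is a sketch of the FFN with *GELU*. I use the tanh approximation of GELU given by Hendrycks & Gimpel rather than the exact erf-based form; the weights are random toy values and the scaling factor `0.1` is an arbitrary choice to keep the numbers small.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel, 2016).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = Activation(x W1 + b1) W2 + b2
    return gelu(x @ W1 + b1) @ W2 + b2

c, d = 4, 8
d_ff = 4 * d
rng = np.random.default_rng(0)

I_ffn = rng.normal(size=(c, d))
W1, b1 = 0.1 * rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = 0.1 * rng.normal(size=(d_ff, d)), np.zeros(d)

out = ffn(I_ffn, W1, b1, W2, b2)  # (c, d)
print(out.shape)  # (4, 8)

# Token-wise: running a single row on its own gives the same result.
single = ffn(I_ffn[2], W1, b1, W2, b2)
print(np.allclose(out[2], single))
```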

### Why all that?

While attention mixes information across tokens, the FFN allows the model to **process each token's enriched representation further**, giving it more capacity to model nonlinear transformations at each position.

---
---

# 5: Residual Connection
[He et al., 2015](https://arxiv.org/abs/1512.03385)

---
---

Just like the masked multi-head self-attention, the FFN is also wrapped in a residual connection: its input is added to its output.

---
---

# 6: LayerNorm
[<cite>Ba, Kiros & Hinton (2016)</cite>](https://arxiv.org/abs/1607.06450) 

---
---

Just like the first LayerNorm we again normalize to zero mean and unit variance. 

---
---

# Final Composition: Stacking $N$ Transformer Blocks

---
---

Each Transformer block we've discussed operates on a full sequence of token embeddings, applying LayerNorm, attention, MLP, and residuals. In practice, the model stacks $N$ such blocks sequentially, where the output of one block becomes the input to the next.

If the input to the first block is $I_0 = I_{\text{emb}}$, then the output of block $i$ is:

$$
I_i = \text{TransformerBlock}_i(I_{i-1})
$$

This stacking allows the model to iteratively refine each token’s representation, integrating broader context and more complex dependencies layer by layer.
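
The following sketch stacks a few simplified blocks in the order of the six steps. To stay short it makes several assumptions that are not part of the description above: a single attention head, *ReLU* without biases in the FFN, LayerNorm without $\boldsymbol\gamma$ and $\boldsymbol\beta$, and random instead of trained weights.

```python
import numpy as np

def layernorm(X, eps=1e-5):
    # Per-token normalization; gamma = 1 and beta = 0 are left out for brevity.
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def causal_attention(X, p):
    # Single head only, to keep the sketch short.
    c, d = X.shape
    Q, K, V = X @ p["W_Q"], X @ p["W_K"], X @ p["W_V"]
    scores = Q @ K.T / np.sqrt(d)
    allowed = np.tril(np.ones((c, c), dtype=bool))  # True where j <= i
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)
    return (alpha @ V) @ p["W_O"]

def ffn(X, p):
    return np.maximum(X @ p["W_1"], 0.0) @ p["W_2"]

def transformer_block(X, p):
    X = layernorm(X + causal_attention(X, p))  # steps 1-3
    X = layernorm(X + ffn(X, p))               # steps 4-6
    return X

def init_block(rng, d, d_ff):
    shapes = {"W_Q": (d, d), "W_K": (d, d), "W_V": (d, d), "W_O": (d, d),
              "W_1": (d, d_ff), "W_2": (d_ff, d)}
    return {name: rng.normal(size=s) / np.sqrt(s[0]) for name, s in shapes.items()}

def run_stack(I_0, blocks):
    I_i = I_0
    for p in blocks:
        I_i = transformer_block(I_i, p)  # I_i = TransformerBlock_i(I_{i-1})
    return I_i

c, d, N = 4, 8, 3
rng = np.random.default_rng(0)
blocks = [init_block(rng, d, 4 * d) for _ in range(N)]

I_0 = rng.normal(size=(c, d))  # stands in for I_emb
I_N = run_stack(I_0, blocks)
print(I_N.shape)  # (4, 8)
```

A nice property to check on such a stack: changing the last token of the input leaves the outputs of all earlier positions untouched, because the only cross-token operation is the causally masked attention.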

---
---

# Output Projection and Language Modeling Head

---
---

Recall: Once a token representation has passed through all $N$ transformer blocks, we obtain a final hidden state:

$$
H \in \mathbb{R}^{c \times d}
$$

Each row in $H$ is the final, context-enriched representation of a token: it carries both the token’s identity and everything it has attended to across layers. (Here $H$ denotes the hidden-state matrix, not the number of heads from earlier.)

### What do we do and how do we do it?

To make a prediction (i.e., guess the next token), we need to convert each $d$-dimensional hidden vector into a score for each possible token in the vocabulary. This is done by applying a final linear projection:

$$
\text{logits} = H W^T + b
$$

Where:
- $W \in \mathbb{R}^{v \times d}$ is the output weight matrix
- $b \in \mathbb{R}^v$ is a learned bias
- $\text{logits} \in \mathbb{R}^{c \times v}$ are the unnormalized scores for each vocabulary token at each position

### From Logits to Probabilities

Immediately after we get $\text{logits}$, we apply softmax **row-wise** to obtain a probability distribution:

$$
P = \text{softmax}(\text{logits}) \quad\in\; \mathbb{R}^{c \times v}
$$

- **Training**: We compare $P$ against the ground-truth “next token” at every position using cross-entropy loss.  
- **Inference**: We typically only look at the **last row** $P_{c}\in \mathbb{R}^{v}$, then sample from it, pick the arg-max as the next token, or use something fancier like beam search.
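
Here is a sketch of the whole path from hidden states to a loss and to a next token. The hidden states, the weights, and the target IDs are random or made-up stand-ins, so the numbers themselves mean nothing; only the shapes and operations follow the text.

```python
import numpy as np

c, d, v = 4, 8, 10
rng = np.random.default_rng(0)

H_final = rng.normal(size=(c, d))  # stand-in for the final hidden states
W = rng.normal(size=(v, d))        # output weight matrix
b = np.zeros(v)                    # output bias

logits = H_final @ W.T + b         # (c, v)

# Row-wise softmax (with the usual max subtraction for stability).
z = logits - logits.max(axis=-1, keepdims=True)
P = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)  # (c, v)

# Training: cross-entropy against the next token at every position.
targets = np.array([2, 5, 1, 7])   # hypothetical next-token IDs
loss = -np.log(P[np.arange(c), targets]).mean()

# Inference: only the last row matters.
next_greedy = int(P[-1].argmax())
next_sampled = int(rng.choice(v, p=P[-1]))

print(logits.shape, P.shape)  # (4, 10) (4, 10)
print(loss, next_greedy, next_sampled)
```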

### Weight tying

In practice, the output weight matrix $W$ is often **tied** with the input token embedding matrix $\mathbf{E}_{\text{token}}$ as proposed by [<cite>Press & Wolf (2017)</cite>](https://arxiv.org/pdf/1608.05859):

$$
W = \mathbf{E}_{\text{token}}
$$

This means the same learned matrix is used:
- At the input stage: as a **lookup table** to map token IDs to embeddings
- At the output stage: as a **matrix** in a **real-valued projection** to compute vocabulary logits via matrix multiplication

While the lookup table simply retrieves a row of $\mathbf{E}_{\text{token}}$ for each token ID, the output projection takes a full hidden vector and performs a dot product with **every row** of $\mathbf{E}_{\text{token}}$ .
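
The two uses of the same matrix can be shown side by side. In this sketch the matrix and the hidden vector are random stand-ins, and the bias is left out.

```python
import numpy as np

v, d = 10, 8
rng = np.random.default_rng(0)
E_token = rng.normal(size=(v, d))  # shared matrix, W = E_token

# Input stage: lookup of one row.
token_id = 3
embedding = E_token[token_id]      # (d,)

# The lookup is the same as multiplying a one-hot vector with the matrix.
one_hot = np.zeros(v)
one_hot[token_id] = 1.0
print(np.allclose(one_hot @ E_token, embedding))

# Output stage: dot product of a hidden vector with every row.
h = rng.normal(size=d)             # stand-in for one row of the final hidden states
logits = h @ E_token.T             # (v,)
print(logits.shape)  # (10,)
print(np.isclose(logits[5], h @ E_token[5]))
```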

### Why tie weights?

- **Parameter efficiency**: Tying reduces the number of parameters, especially with large vocabularies.
- **Consistency**: The model learns a shared space for representing and predicting tokens.
- **Regularization**: It encourages the model to reuse structure in a meaningful way.
- **Empirical gains**: Tied embeddings have been shown to slightly improve performance in many NLP tasks.

### One output per position

During **training**, we compute logits for **every position** and apply a cross-entropy loss against the next token.  
During **inference**, we typically only use the **last token’s output** to generate the next token step-by-step.



# For those of you who stuck around till the end, here are some questions to see if you got the gist of it!

1. After mapping token IDs to embeddings with positional information, what's the shape of the result?
2. How many trainable parameters do we have in total for the token embeddings and the learned positional embeddings?
3. What is the shape of the output in training/inference, and why are/aren't they different?
4. Compute $\mathbb{E}[QK^\top]$, $\mathrm{Var}[QK^\top]$ and $\mathrm{Var}\left[\frac{1}{\sqrt{d}} QK^\top\right]$. Assume that all entries in $Q,K$ are i.i.d. with $\mathcal{N}(0, 1)$. What can we see here, and what does it justify when considering this as the input of the softmax function?

# Answers

1. Trivial, right? But I wanted you to stick around! The shape is $c \times d$, i.e. the result lives in $\mathbb{R}^{c \times d}$.

2. We have $d \times v + d \times c$ trainable parameters, since we only need positions up to the maximum context length!

3. This one is a bit more subtle.

    During **training**, we don’t generate tokens one by one — that would be inefficient and wasteful. Instead, we input the **entire sequence** at once (often maximizing context length to fully utilize the GPU). 
    
    **SIDENOTE** 
    
    That means multiple documents may be packed into a single input sequence, sometimes separated by `<EOS>` tokens.

    Now, this might raise a concern: _can the model attend across document boundaries?_ Technically yes: it **could** attend to unrelated tokens if they're still in the attention window. In practice, models are commonly assumed to learn that `<EOS>` tokens mark semantic boundaries and to adapt by **suppressing attention** beyond them. It’s a learned behavior, not a hard constraint.
    
    **SIDENOTE END**

    So what does training output look like?

    - The model returns a tensor of shape:  
    $$
    \mathbb{R}^{c \times v}
    $$  
    where $c$ is the number of tokens (context length), and $v$ is the vocabulary size.
    - Each row is the logits for predicting the **next token** at that position.
    - We then apply cross-entropy loss against the next-token ground truth for **each position** in the input.

    In contrast, **inference** is **autoregressive** — we start with a user input, compute all keys and values once for that prompt (per head and block!), and then generate one token at a time.

    - For each new token, we compute just a **single** query vector.
    - This query attends to the cached keys and values from both the prompt and any previously generated tokens.
    - The model produces a **single output vector** of shape:  
    $$
    \mathbb{R}^{v}
    $$  
    representing the logits for the next token.

    And importantly, we update our **KV cache** with the new token’s key and value vectors — so we never recompute the entire sequence again, keeping inference efficient.

4. Let's first tackle the expected value of $QK^\top$. As we know by now this matrix captures how much each token "attends to" every other token, with shape $\mathbb{R}^{c \times c}$, where $c$ is the context length.

    Each entry $(i, j)$ in $QK^\top$ is the dot product between the $i$-th row of $Q$ and the $j$-th row of $K$, i.e., $\mathbf{q}_i \cdot \mathbf{k}_j^\top$. Since all entries in both $Q$ and $K$ are assumed to be i.i.d. samples from $\mathcal{N}(0, 1)$, the expected value of each individual product $q_{i,l} \cdot k_{j,l}$ is zero. 
    
    **SIDENOTE** 

    Let's quickly talk about why this assumption is actually quite reasonable.
    - In **later layers**, the inputs to the attention mechanism come **directly from LayerNorm**, which normalizes each token vector to **zero mean and unit variance**. That makes the standard i.i.d. $\mathcal{N}(0,1)$ approximation quite valid (especially early in training, before strong correlations emerge).
    - Even in **initial layers**, if the model is initialized with common weight schemes  (Xavier or Kaiming), and inputs are roughly uncorrelated, the outputs of the linear projections used to compute $Q$ and $K$ will behave like i.i.d. zero-mean Gaussians due to the central limit effect from summing across input dimensions.
    - Moreover, this is a **common theoretical simplification** used in analyzing attention mechanisms — not to model exact values, but to understand scaling behavior, such as why the $1/\sqrt{d}$ factor is crucial.

    **SIDENOTE END**

    The dot product is a sum of these zero-mean terms, so:
    $$
    \mathbb{E}[\mathbf{q}_i \cdot \mathbf{k}_j^\top] = 0
    $$
    and therefore:
    $$
    \mathbb{E}[QK^\top] = 0
    $$

    Now for the variance: Each entry in $QK^\top$ is a sum of $d$ products of independent standard normals. Each term $q_{i,l} \cdot k_{j,l}$ has:
    - Mean 0
    - Variance 1 (since $\text{Var}(XY) = 1$ when $X, Y \sim \mathcal{N}(0, 1)$ independently)

    So each entry in $QK^\top$ has variance:
    $$
    \mathrm{Var}[(QK^\top)_{ij}] = \sum_{l=1}^d \mathrm{Var}(q_{i,l} \cdot k_{j,l}) = d
    $$

    If we instead scale the dot product by $\frac{1}{\sqrt{d}}$, we get:
    $$
    \mathrm{Var}\left[\left(\frac{1}{\sqrt{d}} QK^\top\right)_{ij}\right] = \frac{1}{d} \cdot d = 1
    $$

    **Why does this matter for softmax?**
    Well we all heard that before but here is a recap:

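    The softmax function becomes very sharp (close to one-hot) when its input values are large, which happens when the variance of the logits grows with dimensionality. Without scaling, the attention logits in $QK^\top$ would have variance $d$, leading to:
    - Large logits
    - A saturated softmax with vanishingly small gradients
    - Poor learning dynamics

    Dividing by $\sqrt{d}$ keeps the variance at $1$ regardless of the dimension, so the softmax starts out in a well-behaved range.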
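
If you don't trust the derivation for question 4, you can check it empirically. The sketch below samples many independent query/key pairs under the same i.i.d. $\mathcal{N}(0, 1)$ assumption; the dimension `d = 64` and the sample count are arbitrary choices of mine.

```python
import numpy as np

d, n = 64, 20000  # hypothetical dimension and number of sampled (q, k) pairs
rng = np.random.default_rng(0)
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))

dots = (q * k).sum(axis=1)  # n independent samples of one entry of Q K^T
scaled = dots / np.sqrt(d)

print("mean           :", dots.mean())   # should be close to 0
print("variance       :", dots.var())    # should be close to d
print("scaled variance:", scaled.var())  # should be close to 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One query against 8 keys: the unscaled logits give the sharper distribution.
row = dots[:8]
print(softmax(row).max(), softmax(row / np.sqrt(d)).max())
```

The last line compares the largest attention weight with and without scaling. Dividing the logits by a factor larger than one can only lower (or keep) the maximum softmax probability, which is the "less peaky" effect from the attention section.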