# **KV Cache in Transformer Decoders**
## **1. What is KV Cache?**
The **KV Cache** (Key-Value Cache) is an optimization technique used in **autoregressive decoding** to **reuse previously computed key (K) and value (V) matrices** instead of recomputing them at every decoding step. This significantly reduces **computational cost** and **latency**, at the price of **extra memory** to hold the cached tensors. It is standard in inference for decoder-only transformer models such as the **GPT family and LLaMA**.

---

## **2. Why Do We Need KV Cache?**
### **The Problem in Standard Decoding**
In an **autoregressive transformer decoder**, each token in the sequence is generated **one-by-one**, and at every step \( t \), we compute:

1. **Query (Q), Key (K), and Value (V)** matrices:
   \[
   Q_t = X_t W_Q, \quad K_t = X_t W_K, \quad V_t = X_t W_V
   \]
2. Compute **self-attention**:
   \[
   A_t = \text{softmax} \left( \frac{Q_t K^T}{\sqrt{d_k}} \right) V
   \]

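Without a cache, step \( t \) has to **recompute** the keys and values of all \( t \) tokens before it can attend over them, even though causal masking means the keys and values of earlier tokens never change. In a multi-layer decoder this amounts to re-running attention over the whole prefix, which costs \( O(N^2 d) \) **per step** for a prefix of length \( N \).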
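
The recomputation described above can be sketched in NumPy. This is a minimal single-head, single-layer illustration, not how any particular library implements decoding; the function name `attend_without_cache`, the toy dimensions, and the random stand-in embeddings are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend_without_cache(X, W_Q, W_K, W_V):
    """Attention output for the newest token, re-projecting every row of X.

    X has shape (t, d_model): embeddings of all tokens generated so far.
    """
    d_k = W_K.shape[1]
    K = X @ W_K                         # (t, d_k): recomputed for all t tokens
    V = X @ W_V                         # (t, d_v): recomputed for all t tokens
    q_t = X[-1] @ W_Q                   # (d_k,): only the newest token needs a query
    scores = q_t @ K.T / np.sqrt(d_k)   # (t,)
    return softmax(scores) @ V          # (d_v,)

# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 4
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))
X = rng.normal(size=(5, d_model))       # stand-in for 5 token embeddings

# One call per decoding step; step t projects all t rows of X again.
outputs = [attend_without_cache(X[:t], W_Q, W_K, W_V) for t in range(1, len(X) + 1)]
```

Every call repeats the `X @ W_K` and `X @ W_V` products for rows that were already projected on the previous step, which is the redundancy a cache removes.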

---
### **How KV Cache Fixes This**
Instead of recomputing **all** keys and values from scratch at every step, we **store the past keys and values** and only append new ones.

1. **Store Previous \( K, V \) in Memory**:
   \[
   K_{\text{cache}} = [K_1, K_2, ..., K_{t-1}]
   \]
   \[
   V_{\text{cache}} = [V_1, V_2, ..., V_{t-1}]
   \]
2. **At Step \( t \), Only Compute for New Token**:
   - Compute **only \( K_t \) and \( V_t \)**.
   - Append them to the cache:
     \[
     K_{\text{cache}} = [K_{\text{cache}}, K_t]
     \]
     \[
     V_{\text{cache}} = [V_{\text{cache}}, V_t]
     \]
3. **Compute Attention Efficiently**:
   \[
   A_t = \text{softmax} \left( \frac{Q_t K_{\text{cache}}^T}{\sqrt{d_k}} \right) V_{\text{cache}}
   \]

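Now, instead of recomputing all previous key-value pairs, the model **projects only the new token** into \( Q_t, K_t, V_t \) and attends over the cached keys and values. The attention itself still spans all \( t \) positions, so the per-step attention cost becomes \( O(t d) \) rather than disappearing.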
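
The three steps above might look like the following in NumPy. It is a sketch under simplifying assumptions (one head, one layer, no batching); the `KVCache` class and `decode_step` function are hypothetical names introduced for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Hypothetical single-head, single-layer cache that grows one row per step."""

    def __init__(self, d_k, d_v):
        self.K = np.empty((0, d_k))
        self.V = np.empty((0, d_v))

    def append(self, k_t, v_t):
        # vstack promotes the 1-D vectors to rows and appends them.
        self.K = np.vstack([self.K, k_t])
        self.V = np.vstack([self.V, v_t])

def decode_step(x_t, cache, W_Q, W_K, W_V):
    """Project only the new token, append to the cache, attend over the cache."""
    q_t, k_t, v_t = x_t @ W_Q, x_t @ W_K, x_t @ W_V
    cache.append(k_t, v_t)
    scores = q_t @ cache.K.T / np.sqrt(cache.K.shape[1])
    return softmax(scores) @ cache.V

# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 4
W_Q, W_K, W_V = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
X = rng.normal(size=(5, d_model))       # stand-in for 5 token embeddings

cache = KVCache(d_k, d_v)
outputs = [decode_step(x_t, cache, W_Q, W_K, W_V) for x_t in X]
```

Growing the arrays with `np.vstack` keeps the sketch close to the append notation in the formulas; it copies the cache on every step, so practical implementations usually write into a preallocated buffer instead.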

---

## **3. Mathematical Formulation of KV Cache**
### **Without KV Cache (Recomputing Each Step)**
At every step \( t \), standard self-attention requires:
\[
A_t = \text{softmax} \left( \frac{Q_t K^T}{\sqrt{d_k}} \right) V
\]

Where:
- \( Q_t \) is the query for the current token.
- \( K = [K_1, ..., K_t] \) (computed fresh at every step).
- \( V = [V_1, ..., V_t] \).

Recomputing attention over the full prefix costs \( O(N^2 d) \) per step, although only the output for the newest token is actually needed.
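
One rough way to quantify the redundancy is to count how many token rows pass through \( W_K \) (and equally \( W_V \)) while generating \( N \) tokens. The helper below is a hypothetical illustration for a single layer and a single sequence; it counts projections only and ignores the attention scores themselves.

```python
def kv_projections(n_steps, use_cache):
    """Number of token rows sent through W_K (and likewise W_V) over n_steps."""
    if use_cache:
        return n_steps                     # one new token projected per step
    return n_steps * (n_steps + 1) // 2    # step t re-projects all t tokens

N = 2048
without_cache = kv_projections(N, use_cache=False)   # 2,098,176 rows
with_cache = kv_projections(N, use_cache=True)       # 2,048 rows
```

For \( N = 2048 \) the uncached loop performs roughly a thousand times more key and value projections than the cached one.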

---

### **With KV Cache (Efficient Computation)**
Instead of recalculating old \( K \) and \( V \), we store:

\[
K_{\text{cache}} = [K_1, K_2, ..., K_{t-1}]
\]
\[
V_{\text{cache}} = [V_1, V_2, ..., V_{t-1}]
\]

Then, at **step \( t \)**:
1. Compute only **new key and value**:
   \[
   K_t = X_t W_K, \quad V_t = X_t W_V
   \]
2. Append to the cache:
   \[
   K_{\text{cache}} = [K_{\text{cache}}, K_t], \quad V_{\text{cache}} = [V_{\text{cache}}, V_t]
   \]
3. Compute **attention only on stored values**:
   \[
   A_t = \text{softmax} \left( \frac{Q_t K_{\text{cache}}^T}{\sqrt{d_k}} \right) V_{\text{cache}}
   \]
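
As a sanity check on the formulation above, the sketch below compares step-by-step cached decoding against full causal self-attention computed in one pass. It assumes a single head and a single layer, uses a preallocated cache buffer, and all names, sizes, and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(X, W_Q, W_K, W_V):
    """Reference: causal self-attention over all rows of X at once."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)               # hide future positions
    return softmax(scores) @ V

rng = np.random.default_rng(0)
N, d_model, d_k, d_v = 6, 8, 4, 4
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))
X = rng.normal(size=(N, d_model))

# Preallocated cache: row t is filled in at step t.
K_cache = np.zeros((N, d_k))
V_cache = np.zeros((N, d_v))
cached_out = np.zeros((N, d_v))
for t in range(N):
    K_cache[t] = X[t] @ W_K                  # only the new key
    V_cache[t] = X[t] @ W_V                  # only the new value
    q_t = X[t] @ W_Q
    scores = q_t @ K_cache[: t + 1].T / np.sqrt(d_k)
    cached_out[t] = softmax(scores) @ V_cache[: t + 1]

reference = causal_attention(X, W_Q, W_K, W_V)
max_abs_diff = np.abs(cached_out - reference).max()
```

If the cache is implemented correctly, `max_abs_diff` should be at the level of floating-point rounding: the cache changes how much work is done, not the result.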

---

## **4. Memory Usage of KV Cache**
Since we store all previous keys and values, memory usage **grows linearly** with sequence length \( N \).

### **Memory Calculation**
Each token stores, per attention head and per layer:
\[
K_t \in \mathbb{R}^{d_k}, \quad V_t \in \mathbb{R}^{d_v}
\]

For a **batch size \( B \)**, **number of heads \( H \)**, and **sequence length \( N \)**:
- \( K_{\text{cache}} \) has shape \( (B, H, N, d_k) \).
- \( V_{\text{cache}} \) has shape \( (B, H, N, d_v) \).

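Thus, the KV cache memory for **one attention layer** is:
\[
\text{Memory} = B \times H \times N \times (d_k + d_v) \times \text{data type size}
\]

A model with \( L \) layers keeps one such cache per layer, so the total is \( L \) times this amount.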
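
The formula translates directly into a small helper. The function name `kv_cache_bytes` is hypothetical, and the `layers` parameter is an addition not present in the formula above: it is there because a multi-layer model stores one cache per layer. The sample values are illustrative, not taken from any specific model.

```python
def kv_cache_bytes(batch, heads, seq_len, d_k, d_v, bytes_per_elem, layers=1):
    """Bytes needed to hold cached K and V for `layers` attention layers."""
    return layers * batch * heads * seq_len * (d_k + d_v) * bytes_per_elem

# Illustrative values: 1 sequence, 8 heads, 1024 tokens, d_k = d_v = 64, FP16.
one_layer = kv_cache_bytes(1, 8, 1024, 64, 64, 2)               # 2,097,152 bytes
twelve_layers = kv_cache_bytes(1, 8, 1024, 64, 64, 2, layers=12)

one_layer_mib = one_layer / 2**20          # 2.0 MiB
twelve_layers_mib = twelve_layers / 2**20  # 24.0 MiB
```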

---

### **Example: Memory Usage for an Illustrative Configuration**
Assume:
- Batch size **\( B = 4 \)**
- Heads **\( H = 16 \)**
- Sequence length **\( N = 2048 \)**
- Key/Value size **\( d_k = d_v = 64 \)**
- Data type **FP16 (2 bytes per element)**

#### **Compute Memory for KV Cache**
\[
\text{Memory} = 4 \times 16 \times 2048 \times (64 + 64) \times 2
\]

\[
= 4 \times 16 \times 2048 \times 128 \times 2
\]

\[
= 4 \times 16 \times 524,288
\]

\[
= 33,554,432 \text{ bytes} = 32 \text{ MiB per attention layer}
\]

For **longer sequences**, memory grows **linearly**: at 8K tokens (\( N = 8192 \)) the same configuration needs four times as much, 128 MiB per attention layer.
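
The linear growth can be checked by evaluating the memory formula at a few sequence lengths. The snippet reuses the example configuration above (\( B = 4 \), \( H = 16 \), \( d_k = d_v = 64 \), FP16); the helper name `per_layer_mib` is an assumption for illustration, and the figures are per attention layer.

```python
# Configuration from the example above: B=4, H=16, d_k=d_v=64, FP16 (2 bytes).
B, H, d_k, d_v, bytes_per_elem = 4, 16, 64, 64, 2

def per_layer_mib(seq_len):
    """KV cache size for one attention layer, in MiB (1 MiB = 2**20 bytes)."""
    return B * H * seq_len * (d_k + d_v) * bytes_per_elem / 2**20

sizes = {n: per_layer_mib(n) for n in (2048, 4096, 8192)}
for n, mib in sizes.items():
    print(f"N={n}: {mib:.0f} MiB per layer")
# N=2048: 32 MiB per layer
# N=4096: 64 MiB per layer
# N=8192: 128 MiB per layer
```

Doubling the sequence length doubles the cache, so long contexts and large batches are usually what make the KV cache a memory bottleneck.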

---

## **5. Advantages of KV Cache**
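✅ **Speeds Up Decoding** → Avoids recomputation, reducing per-token latency.  
✅ **Reduces Computational Cost** → Past keys and values are not re-projected; each step computes one new \( K_t, V_t \) and a single attention row over the cache.  
✅ **Efficient for Large Models** → Standard in inference for autoregressive decoders such as **GPT-style models and LLaMA**.  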

---

## **6. KV Cache vs. No KV Cache**
| Feature | Without KV Cache | With KV Cache |
|---------|-----------------|--------------|
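| **Attention cost per step** | \( O(N^2 d) \) | \( O(N d) \) |
| **Memory Usage** | Lower (nothing kept between steps) | Higher (stores all past K, V; grows linearly with \( N \)) |
| **Decoding Speed** | Slow | Fast |
| **Typical Use** | Training with teacher forcing, simple reference implementations | Autoregressive inference (GPT-style models, LLaMA) |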

🚀 **KV Caching makes large-scale transformers efficient for text generation!**