# **KV Cache in Transformer Decoders**
## **1. What is KV Cache?**
The **KV Cache** (Key-Value Cache) is an optimization technique used in **autoregressive decoding** to **reuse previously computed key (K) and value (V) matrices** instead of recomputing them at every decoding step. This significantly reduces **computational cost**, at the price of **extra memory** for the stored keys and values, in large transformer models like **GPT, LLaMA, and ChatGPT**.

---

## **2. Why Do We Need KV Cache?**
### **The Problem in Standard Decoding**
In an **autoregressive transformer decoder**, each token in the sequence is generated **one-by-one**, and at every step \( t \), we compute:

1. **Query (Q), Key (K), and Value (V)** matrices:
   \[
   Q_t = X_t W_Q, \quad K_t = X_t W_K, \quad V_t = X_t W_V
   \]
2. Compute **self-attention**:
   \[
   A_t = \text{softmax} \left( \frac{Q_t K^T}{\sqrt{d_k}} \right) V
   \]

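At each step \( t \), a decoder without a cache must **recompute** the keys and values of all previous tokens before it can attend to them. Re-running attention over the whole prefix costs \( O(N^2 d) \) **per decoding step** for a sequence of length \( N \).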

---
### **How KV Cache Fixes This**
Instead of recomputing **all** keys and values from scratch at every step, we **store the past keys and values** and only append new ones.

1. **Store Previous \( K, V \) in Memory**:
   \[
   K_{\text{cache}} = [K_1, K_2, ..., K_{t-1}]
   \]
   \[
   V_{\text{cache}} = [V_1, V_2, ..., V_{t-1}]
   \]
2. **At Step \( t \), Only Compute for New Token**:
   - Compute **only \( K_t \) and \( V_t \)**.
   - Append them to the cache:
     \[
     K_{\text{cache}} = [K_{\text{cache}}, K_t]
     \]
     \[
     V_{\text{cache}} = [V_{\text{cache}}, V_t]
     \]
3. **Compute Attention Efficiently**:
   \[
   A_t = \text{softmax} \left( \frac{Q_t K_{\text{cache}}^T}{\sqrt{d_k}} \right) V_{\text{cache}}
   \]

Now, instead of recomputing all previous key-value pairs, the model **projects only the new token and attends over the cached keys and values**, which costs \( O(N d) \) per step.
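
The caching steps can be sketched in a few lines of NumPy. This is a minimal single-head, single-layer illustration under assumed toy sizes (`d_model`, `d_k`, and `N` are chosen here, not taken from any real model), and it checks that the cached path gives the same outputs as re-projecting everything at every step.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: (d_k,), K: (t, d_k), V: (t, d_v) -> (d_v,)
    scores = K @ q / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d_model, d_k, N = 8, 4, 6   # toy sizes, assumed for illustration
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
X = rng.standard_normal((N, d_model))   # one input row per token

# With a cache: project only the new token, then append to the cache.
K_cache = np.empty((0, d_k))
V_cache = np.empty((0, d_k))
cached_out = []
for t in range(N):
    x_t = X[t]
    K_cache = np.vstack([K_cache, x_t @ W_K])
    V_cache = np.vstack([V_cache, x_t @ W_V])
    cached_out.append(attend(x_t @ W_Q, K_cache, V_cache))

# Without a cache: re-project every token seen so far at each step.
naive_out = [
    attend(X[t] @ W_Q, X[: t + 1] @ W_K, X[: t + 1] @ W_V) for t in range(N)
]

outputs_match = np.allclose(cached_out, naive_out)
```

Both paths produce the same attention outputs; the cached path simply performs one key/value projection per step instead of `t`.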

---

## **3. Mathematical Formulation of KV Cache**
### **Without KV Cache (Recomputing Each Step)**
At every step \( t \), standard self-attention requires:
\[
A_t = \text{softmax} \left( \frac{Q_t K^T}{\sqrt{d_k}} \right) V
\]

Where:
- \( Q_t \) is the query for the current token.
- \( K = [K_1, ..., K_t] \) (computed fresh at every step).
- \( V = [V_1, ..., V_t] \).

This leads to \( O(N^2 d) \) complexity per decoding step, because attention over the whole prefix is recomputed each time.

---

### **With KV Cache (Efficient Computation)**
Instead of recalculating old \( K \) and \( V \), we store:

\[
K_{\text{cache}} = [K_1, K_2, ..., K_{t-1}]
\]
\[
V_{\text{cache}} = [V_1, V_2, ..., V_{t-1}]
\]

Then, at **step \( t \)**:
1. Compute only **new key and value**:
   \[
   K_t = X_t W_K, \quad V_t = X_t W_V
   \]
2. Append to the cache:
   \[
   K_{\text{cache}} = [K_{\text{cache}}, K_t], \quad V_{\text{cache}} = [V_{\text{cache}}, V_t]
   \]
3. Compute **attention only on stored values**:
   \[
   A_t = \text{softmax} \left( \frac{Q_t K_{\text{cache}}^T}{\sqrt{d_k}} \right) V_{\text{cache}}
   \]
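
To make the difference concrete, the sketch below counts only how many key/value projections (one per token row) each strategy performs over a whole generation. The function names are hypothetical helpers introduced for this illustration, and the count ignores the attention product itself.

```python
def kv_projections_without_cache(n_tokens: int) -> int:
    # Step t re-projects all t tokens seen so far: 1 + 2 + ... + N.
    return sum(range(1, n_tokens + 1))

def kv_projections_with_cache(n_tokens: int) -> int:
    # Step t projects only the newly generated token.
    return n_tokens

N = 2048
without_cache = kv_projections_without_cache(N)   # N * (N + 1) / 2
with_cache = kv_projections_with_cache(N)         # N
ratio = without_cache / with_cache                # (N + 1) / 2
```

For `N = 2048` the uncached decoder performs about a thousand times more key/value projections than the cached one.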

---

## **4. Memory Usage of KV Cache**
Since we store all previous keys and values, memory usage **grows linearly** with sequence length \( N \).

### **Memory Calculation**
Each token stores:
\[
K_t \in \mathbb{R}^{d_k}, \quad V_t \in \mathbb{R}^{d_v}
\]

For a **batch size \( B \)**, **number of heads \( H \)**, and **sequence length \( N \)**:
- \( K_{\text{cache}} \) has shape \( (B, H, N, d_k) \).
- \( V_{\text{cache}} \) has shape \( (B, H, N, d_v) \).

Thus, the **total KV cache memory** for one attention layer is (multiply by the number of layers for the whole model):
\[
\text{Memory} = B \times H \times N \times (d_k + d_v) \times \text{data type size}
\]

---

### **Example: Memory Usage for an Illustrative Configuration**
Assume (illustrative values; GPT-4's actual configuration is not public):
- Batch size **\( B = 4 \)**
- Heads **\( H = 16 \)**
- Sequence length **\( N = 2048 \)**
- Key/Value size **\( d_k = d_v = 64 \)**
- Data type **FP16 (2 bytes per element)**

#### **Compute Memory for KV Cache**
\[
\text{Memory} = 4 \times 16 \times 2048 \times (64 + 64) \times 2
\]

\[
= 4 \times 16 \times 2048 \times 128 \times 2
\]

\[
= 4 \times 16 \times 524,288
\]

\[
= 33,554,432 \text{ bytes} = 32MB
\]

For **longer sequences (e.g., 8K tokens)**, memory grows **linearly**: the same configuration at \( N = 8192 \) needs 4x as much, or 128MB.
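
The arithmetic above can be wrapped in a small helper. This is a sketch: the `layers` argument is an addition not present in the formula above (which counts a single attention layer), and the 32-layer figure is a hypothetical model depth used only to show the scaling.

```python
def kv_cache_bytes(batch, heads, seq_len, d_k, d_v, bytes_per_elem=2, layers=1):
    # B * H * N * (d_k + d_v) * element size, optionally times the layer count.
    return layers * batch * heads * seq_len * (d_k + d_v) * bytes_per_elem

one_layer_bytes = kv_cache_bytes(4, 16, 2048, 64, 64)        # FP16, one layer
one_layer_mib = one_layer_bytes / 2**20

long_context_mib = kv_cache_bytes(4, 16, 8192, 64, 64) / 2**20

# Hypothetical 32-layer model with the same per-layer shape.
full_model_gib = kv_cache_bytes(4, 16, 2048, 64, 64, layers=32) / 2**30
```

Under these assumptions one layer needs 32 MiB, the 8K-token case needs four times that, and a 32-layer stack needs 1 GiB.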

---

## **5. Advantages of KV Cache**
✅ **Speeds Up Decoding** → Avoids recomputation, reducing latency.  
✅ **Reduces Computational Cost** → No need to re-project keys and values for past tokens.  
✅ **Efficient for Large Models** → Used in **GPT-3, GPT-4, LLaMA, ChatGPT**.  

---

## **6. KV Cache vs. No KV Cache**
| Feature | Without KV Cache | With KV Cache |
|---------|-----------------|--------------|
| **Attention cost per step** | \( O(N^2 d) \) | \( O(N d) \) |
| **Memory Usage** | Lower | Higher (stores all past K, V) |
| **Decoding Speed** | Slow | Fast |
| **Used In** | Training and full-sequence scoring | Autoregressive inference (GPT-3, LLaMA, ChatGPT) |

🚀 **KV Caching makes large-scale transformers efficient for text generation!**

# **KV Cache Optimization: Shared KV Cache**
Using a **shared KV cache** is an optimization strategy that allows multiple queries (Q) to use the same precomputed **key (K) and value (V) matrices** for the context they have in common, reducing redundant computations and improving efficiency in transformer decoders.

---

## **1. What is Shared KV Cache?**
### **Standard KV Cache vs. Shared KV Cache**
- **Standard KV Cache**: Each sequence in a batch has its own independent **K and V** cache.
- **Shared KV Cache**: Multiple sequences that start with the same tokens **reuse the same** precomputed **K, V** cache for that common prefix to **reduce memory usage**.

### **Why Use Shared KV Cache?**
1. **Lower Memory Footprint**  
   - Instead of storing separate KV caches for each sequence, we **store the common prefix once** and share it across sequences.
  
2. **Faster Decoding**  
   - Since the **same K and V** are used across different queries, we avoid recomputing the prefix's keys/values for every new sequence.

3. **Ideal for Multi-Turn Chatbots & Beam Search**  
   - When generating multiple responses for the same prompt, shared KV cache allows efficient reuse of **previous K, V states**.

---

## **2. Mathematical Explanation of Shared KV Cache**
### **Standard KV Cache (Without Sharing)**
For a sequence of length \( N \), at each decoding step \( t \):

1. Compute **new K and V**:
   \[
   K_t = X_t W_K, \quad V_t = X_t W_V
   \]

2. Append to KV Cache:
   \[
   K_{\text{cache}} = [K_{\text{cache}}, K_t], \quad V_{\text{cache}} = [V_{\text{cache}}, V_t]
   \]

3. Compute Attention:
   \[
   A_t = \text{softmax} \left( \frac{Q_t K_{\text{cache}}^T}{\sqrt{d_k}} \right) V_{\text{cache}}
   \]

This is done **independently** for each sequence, so sequences that begin with the same prefix store identical copies of its \( K \) and \( V \).

---

### **Shared KV Cache (Optimized)**
Instead of storing separate KV caches, we **reuse one shared cache** for the prefix that multiple sequences have in common.

1. Store **only one copy** of \( K, V \) for multiple sequences:
   \[
   K_{\text{shared}} = K_{\text{cache}}, \quad V_{\text{shared}} = V_{\text{cache}}
   \]

2. Compute attention **using the shared cache**:
   \[
   A_t = \text{softmax} \left( \frac{Q_t K_{\text{shared}}^T}{\sqrt{d_k}} \right) V_{\text{shared}}
   \]

3. No need to **duplicate K and V** for each sequence; only tokens generated after the shared prefix need per-sequence entries.

For the shared prefix, this reduces **memory overhead from \( O(BHN) \) to \( O(HN) \)**, where:
- \( B \) = batch size
- \( H \) = number of attention heads
- \( N \) = sequence length
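
A minimal single-head sketch of prefix sharing follows. It assumes a batch whose sequences share a prompt of length `n_prefix` and then each hold `n_suffix` tokens of their own (both names are introduced here for illustration). The prefix keys and values are stored once, scores over the prefix and the per-sequence suffix are combined in one softmax, and the result is checked against a cache that duplicates the prefix for every sequence.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_shared(q, K_prefix, V_prefix, K_suffix, V_suffix):
    # q: (B, d); prefix K/V: (P, d), stored once; suffix K/V: (B, G, d).
    scale = np.sqrt(q.shape[-1])
    n_prefix = K_prefix.shape[0]
    s_prefix = q @ K_prefix.T / scale                         # (B, P)
    s_suffix = np.einsum("bd,bgd->bg", q, K_suffix) / scale   # (B, G)
    w = softmax(np.concatenate([s_prefix, s_suffix], axis=1)) # joint softmax
    return w[:, :n_prefix] @ V_prefix + np.einsum(
        "bg,bgd->bd", w[:, n_prefix:], V_suffix
    )

rng = np.random.default_rng(0)
B, n_prefix, n_suffix, d = 4, 10, 3, 8   # toy sizes, assumed for illustration
K_prefix = rng.standard_normal((n_prefix, d))
V_prefix = rng.standard_normal((n_prefix, d))
K_suffix = rng.standard_normal((B, n_suffix, d))
V_suffix = rng.standard_normal((B, n_suffix, d))
q = rng.standard_normal((B, d))

shared_out = attend_shared(q, K_prefix, V_prefix, K_suffix, V_suffix)

# Reference: a standard cache that copies the prefix into every sequence.
K_full = np.concatenate([np.repeat(K_prefix[None], B, axis=0), K_suffix], axis=1)
V_full = np.concatenate([np.repeat(V_prefix[None], B, axis=0), V_suffix], axis=1)
scores = np.einsum("bd,btd->bt", q, K_full) / np.sqrt(d)
duplicated_out = np.einsum("bt,btd->bd", softmax(scores), V_full)

shared_elems = K_prefix.size + V_prefix.size + K_suffix.size + V_suffix.size
duplicated_elems = K_full.size + V_full.size
```

The outputs agree, while the shared layout stores 352 elements against 832 for the duplicated one in this toy setting.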

---

## **3. When to Use Shared KV Cache?**
### **Use Case 1: Beam Search**
- In **beam search**, we generate multiple sequences **from the same prompt**.
- Instead of storing **separate KV caches for each beam**, we **share one KV cache** for the prompt across all beams; only the tokens on which the beams diverge need their own entries.
  
### **Use Case 2: Multi-Turn Conversations**
- When responding to **multiple follow-up queries** in a chatbot, we can **reuse** the previous KV cache instead of recomputing it.
  
### **Use Case 3: Parallel Decoding in Multi-Agent Models**
- In **multi-agent transformer systems**, different agents may **reference the same past context**, making shared KV cache **efficient**.

---

## **4. Memory Comparison: Standard vs. Shared KV Cache**
### **Standard KV Cache Memory Usage**
Each sequence stores **independent** KV caches:

\[
\text{Memory} = B \times H \times N \times (d_k + d_v) \times 2
\]

For **batch size \( B = 4 \), heads \( H = 16 \), sequence \( N = 2048 \), \( d_k = d_v = 64 \)** (FP16):
\[
\text{Memory} = 4 \times 16 \times 2048 \times 128 \times 2 = 32MB
\]

---

### **Shared KV Cache Memory Usage**
With a **shared KV cache**, we **remove batch duplication**:
\[
\text{Memory} = H \times N \times (d_k + d_v) \times 2
\]

\[
= 16 \times 2048 \times 128 \times 2 = 8MB
\]

✅ **Saves 75% of memory!**
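
As a rough worked form of this comparison, suppose the batch shares a prompt of length \( P \) and each sequence then generates \( G \) tokens of its own (\( P \) and \( G \) are symbols introduced here for illustration, with \( N = P + G \)). Only the prefix can be stored once, so under that assumption:

\[
\text{Memory}_{\text{standard}} = B \times H \times (P + G) \times (d_k + d_v) \times 2
\]
\[
\text{Memory}_{\text{shared}} = H \times (P + B \times G) \times (d_k + d_v) \times 2
\]

With \( G = 0 \) the ratio is \( 1/B \), which is the 75% saving for \( B = 4 \). As \( G \) grows relative to \( P \), the saving shrinks.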

---

## **5. Summary**
| Feature | Standard KV Cache | Shared KV Cache |
|---------|-----------------|----------------|
| **Memory Usage** | \( O(BHN) \) | \( O(HN) \) for the shared prefix (Much Lower) |
| **Computation Speed** | Recomputes shared K, V for each sequence | Faster (Reuses KV Cache) |
| **Ideal for** | Single-sequence generation | Beam search, multi-turn chatbots |

🚀 **Shared KV Cache significantly reduces memory while speeding up decoding!**

## **1. Windows Optimization for Large Language Models (LLMs)**
Running LLMs efficiently on Windows is challenging due to hardware limitations like **VRAM, RAM, and CPU constraints**. Here are **three main optimization techniques**:

---

### **(1.1) StreamingLLM (Streaming Inference)**
**StreamingLLM** is a technique that keeps **only part of the KV cache** so that a model can process very long streams **with bounded memory**.

#### **How StreamingLLM Works:**
- Instead of storing **all KV cache** entries, it keeps the first few tokens (the **attention sinks**) plus a **sliding window of the most recent tokens** in **GPU memory (VRAM)**.
- Keys and values that fall out of the window are **evicted** rather than offloaded, so the model can no longer attend to them.
- This allows the model to keep generating over **very long streams (e.g., 100K+ tokens)** without running out of memory, although it does not extend the context the model can actually use.

#### **Mathematical View of StreamingLLM**
Normally, the **KV cache grows linearly** with sequence length \( N \):

\[
\text{Memory} = H \times N \times (d_k + d_v) \times 2
\]

For **very long sequences**, only the sink tokens and the recent window stay active:
\[
K_{\text{active}} = [K_{\text{cache}}[:S], \; K_{\text{cache}}[-M:]]
\]
\[
V_{\text{active}} = [V_{\text{cache}}[:S], \; V_{\text{cache}}[-M:]]
\]
where \( S \) is the number of attention-sink tokens (typically a handful) and \( M \) is a **moving window** (e.g., last 4K tokens) that fits in VRAM.

✅ **Reduces VRAM usage significantly**  
✅ **Allows arbitrarily long streams** with a cache of fixed size \( S + M \)  

🚀 **Related Models:**  
- **Llama-2, MPT, Falcon, and Pythia** were evaluated in the StreamingLLM paper  
- **Mistral 7B** builds a similar sliding-window attention into its architecture  
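
A bounded cache of this kind can be sketched as follows. This is a toy illustration, not the StreamingLLM implementation: the class name `SinkWindowCache` and its parameters are hypothetical, it keeps the first `n_sink` entries plus the most recent `window` entries and drops everything in between, and it leaves out the position re-indexing that a real system also has to handle.

```python
import numpy as np

class SinkWindowCache:
    """Keeps the first `n_sink` entries and the most recent `window` entries."""

    def __init__(self, n_sink, window, d):
        # Assumes window >= 1.
        self.n_sink, self.window = n_sink, window
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def append(self, k_t, v_t):
        self.K = np.vstack([self.K, k_t])
        self.V = np.vstack([self.V, v_t])
        if len(self.K) > self.n_sink + self.window:
            # Evict the oldest entry that is not a sink token.
            self.K = np.concatenate([self.K[: self.n_sink], self.K[-self.window:]])
            self.V = np.concatenate([self.V[: self.n_sink], self.V[-self.window:]])

cache = SinkWindowCache(n_sink=2, window=4, d=3)
for t in range(20):
    # Each toy key/value is filled with its token index so evictions are visible.
    cache.append(np.full(3, t), np.full(3, t))

kept_positions = cache.K[:, 0].astype(int).tolist()
```

After 20 tokens the cache still holds only six entries: the two sink tokens and the last four tokens.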

---

### **(1.2) FlashAttention for Windows**
FlashAttention is an exact, **memory-efficient attention algorithm** that reduces the amount of data moved between GPU high-bandwidth memory and fast on-chip SRAM.

Instead of computing the **entire softmax attention at once**, FlashAttention **splits computation into smaller tiles** that fit into on-chip GPU memory, so the full \( N \times N \) score matrix is never written out.

\[
\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V
\]

**Optimized by FlashAttention:**
- **Rearranges memory access** to improve speed.
- **Computes the softmax incrementally** (online softmax), rescaling partial results instead of making a separate normalization pass.
- **Works well on consumer GPUs** (like RTX 4090).
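
The tiling idea can be illustrated on the CPU with NumPy. This sketch only shows the block-wise online softmax; it is not the fused GPU kernel, it omits causal masking, and the block size is an arbitrary choice. It processes keys and values one block at a time, keeps a running maximum and denominator per query row, and checks the result against ordinary attention.

```python
import numpy as np

def reference_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    n_q, d = Q.shape
    row_max = np.full(n_q, -np.inf)       # running max of the scores per row
    denom = np.zeros(n_q)                 # running softmax denominator
    acc = np.zeros((n_q, V.shape[-1]))    # running unnormalized output
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                 # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=-1))
        rescale = np.exp(row_max - new_max)       # correct earlier partial sums
        P = np.exp(S - new_max[:, None])
        denom = denom * rescale + P.sum(axis=-1)
        acc = acc * rescale[:, None] + P @ Vb
        row_max = new_max
    return acc / denom[:, None]

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((10, 8))
V = rng.standard_normal((10, 8))
matches = np.allclose(tiled_attention(Q, K, V, block=4), reference_attention(Q, K, V))
```

The tiled version never holds more than one block of scores at a time, yet it produces the same output as the reference.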

🚀 **Example Libraries:**  
- `xformers` (used in Stable Diffusion)  
- `flash-attn` (the reference implementation, used with LLaMA- and GPT-style models)  

---

### **(1.3) Offloading for Low VRAM Windows Systems**
For Windows users with **limited GPU VRAM**, offloading is key.

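- **CUDA + PagedAttention** → Manages the KV cache in fixed-size blocks, so blocks can be swapped between **GPU VRAM** and **CPU RAM**.
- **4-bit Quantization** → Reduces memory size of model weights to roughly a quarter of FP16.
- **GGUF Format** → File format used by **llama.cpp**, which runs quantized LLaMA-family models on CPUs and can offload some layers to a GPU.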

✅ **Allows LLMs to run on laptops and consumer GPUs**  
✅ **Enables 65B models on 24GB VRAM GPUs (e.g., RTX 4090)** when the 4-bit weights (about 32.5 GB) are split between VRAM and CPU RAM  
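
A quick back-of-the-envelope check shows why offloading is still needed at this scale. The helper below is a hypothetical function that estimates weight storage only, ignoring the KV cache, activations, and quantization metadata.

```python
def weight_gib(n_params: float, bits_per_weight: int) -> float:
    # Parameters * bits, converted to bytes and then to GiB.
    return n_params * bits_per_weight / 8 / 2**30

fp16_gib = weight_gib(65e9, 16)   # roughly 121 GiB
int4_gib = weight_gib(65e9, 4)    # roughly 30 GiB

fits_in_24gb_vram = int4_gib <= 24
```

Even at 4 bits the weights of a 65B model exceed 24 GB, so part of the model has to live in CPU RAM.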

---

## **2. Quantization and Coefficient Reduction**
Quantization **reduces the bit precision** of model weights to **save memory and speed up inference**.

---

### **(2.1) What is Quantization?**
In standard models:
- **Weights use 16-bit (FP16) or 32-bit (FP32) floating-point numbers**.

Quantization **reduces precision** to:
- **8-bit (INT8)**
- **4-bit (INT4)**
- **3-bit or 2-bit (Extreme Quantization)**

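\[
W_{\text{quantized}} = \text{clip}\left( \text{round}\left( \frac{W}{S} \right) + Z, \; q_{\min}, \; q_{\max} \right)
\]
where:
- \( S \) is a **scaling factor**.
- \( Z \) is a **zero-point offset**.
- \( q_{\min}, q_{\max} \) are the limits of the integer range (e.g., 0 and 255 for unsigned 8-bit).

The original weight is recovered approximately as \( W \approx S \, (W_{\text{quantized}} - Z) \).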
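
A minimal sketch of affine quantization to unsigned 8-bit integers, with rounding and clipping to the integer range, is shown below. It assumes per-tensor scaling, a weight tensor that is not constant, and random toy weights; production schemes such as GPTQ or AWQ are considerably more elaborate.

```python
import numpy as np

def quantize(W, n_bits=8):
    q_min, q_max = 0, 2**n_bits - 1
    S = (W.max() - W.min()) / (q_max - q_min)   # scaling factor
    Z = int(np.round(q_min - W.min() / S))      # zero-point offset
    W_q = np.clip(np.round(W / S) + Z, q_min, q_max).astype(np.uint8)
    return W_q, S, Z

def dequantize(W_q, S, Z):
    # Cast first so the subtraction cannot wrap around in uint8.
    return S * (W_q.astype(np.int32) - Z)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))   # toy FP weights, assumed for illustration

W_q, S, Z = quantize(W)
W_hat = dequantize(W_q, S, Z)
max_err = np.abs(W - W_hat).max()   # bounded by the step size S
```

Each weight now occupies one byte, and the reconstruction error stays within one quantization step `S`.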

### **(2.2) Types of Quantization**
| Type | Bit Precision | Speed Boost (rough, hardware-dependent) | Memory Reduction (vs. FP16) |
|------|-------------|-------------|------------------|
| **FP16 (Default)** | 16-bit | ✅ Baseline | ❌ Baseline (High Memory) |
| **INT8 Quantization** | 8-bit | ⚡ ~1.5x Faster | ✅ 2x Smaller |
| **4-bit Quantization (GPTQ, AWQ)** | 4-bit | ⚡⚡ ~2-3x Faster | ✅✅ 4x Smaller |
| **2-bit Quantization** | 2-bit | ⚡⚡⚡ ~4-5x Faster | ✅✅✅ 8x Smaller |

🚀 **Example Models Using Quantization:**
- **LLaMA-3 8B (4-bit GGUF)**
- **Mixtral (4-bit AWQ)**

---

### **(2.3) Coefficient Optimization**
Transformers rely on **large parameter matrices** (coefficients) to represent knowledge.

- **Low-Rank Adaptation (LoRA)**  
  - Instead of **fine-tuning all weights**, LoRA freezes \( W \) and **trains only two small matrices** whose product forms a low-rank update \( \Delta W = BA \).
  - Reduces the number of **trainable parameters and the storage per fine-tuned variant** while keeping accuracy close to full fine-tuning.

- **Weight Pruning**  
  - Removes **less important weights** in neural networks.
  - Example: **Magnitude-based pruning**, motivated by the **Lottery Ticket Hypothesis (LTH)**.
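
The LoRA parameter saving can be seen in a small NumPy sketch. The layer size and rank below are assumed toy values, the function name `lora_forward` is hypothetical, and the usual scaling factor applied to the update is left out for brevity.

```python
import numpy as np

def lora_forward(x, W, A, B):
    # Equivalent to (W + B @ A) @ x, without forming the full update matrix.
    return W @ x + B @ (A @ x)

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4                    # toy layer size and rank

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                      # trainable, zero init: update starts at 0
x = rng.standard_normal(d_in)

y_initial = lora_forward(x, W, A, B)          # equals W @ x before any training

B_trained = rng.standard_normal((d_out, r))   # stand-in for a trained B
y_adapted = lora_forward(x, W, A, B_trained)

full_params = W.size                          # weights touched by full fine-tuning
lora_params = A.size + B.size                 # weights trained by LoRA
```

Here LoRA trains 512 parameters instead of 4096, and the adapted output matches what a dense `W + B @ A` would give.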

✅ **Combining Quantization + LoRA can reduce fine-tuning memory needs by up to about 10x, depending on the model and settings.**  

---

## **3. Storage & Computational Optimization**
This includes techniques to **store and compute large models efficiently**.

---

### **(3.1) KV Cache Compression**
Since KV Cache **grows with sequence length**, we can **compress it** to **save memory**.

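- **Low-Rank KV Cache** → Stores a **low-rank projection** of keys and values instead of the full vectors.
- **Grouped KV Storage** → Lets several query heads share one key-value head (as in grouped-query attention) or merges similar key-value vectors.

✅ **Can cut KV memory usage by 50% or more, usually with a small accuracy cost.**  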
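
One way to read grouped KV storage is grouped-query attention, sketched below for a single decoding step. The head counts are assumed toy values and `grouped_query_attention` is a hypothetical helper. With 8 key/value heads serving 16 query heads, the cache holds half as many entries as full multi-head attention.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(q, K, V):
    # q: (H_q, d); K and V: (H_kv, T, d) with H_q divisible by H_kv.
    n_q_heads, d = q.shape
    group = n_q_heads // K.shape[0]
    # Each stored K/V head serves `group` consecutive query heads.
    K_exp = np.repeat(K, group, axis=0)   # expanded only for the computation
    V_exp = np.repeat(V, group, axis=0)
    scores = np.einsum("hd,htd->ht", q, K_exp) / np.sqrt(d)
    return np.einsum("ht,htd->hd", softmax(scores), V_exp)

rng = np.random.default_rng(0)
H_q, H_kv, T, d = 16, 8, 32, 64           # toy sizes, assumed for illustration
q = rng.standard_normal((H_q, d))
K = rng.standard_normal((H_kv, T, d))     # only H_kv heads are cached
V = rng.standard_normal((H_kv, T, d))

out = grouped_query_attention(q, K, V)
cache_ratio = H_kv / H_q                  # cache size relative to one K/V head per query head
```

Query heads 2 and 3 both read key/value head 1 in this layout, which is where the memory saving comes from.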

---

### **(3.2) FlashInfer: Fast Attention Kernels for Serving**
Large models require **fast tensor computation**.

- FlashInfer is a library of **CUDA kernels** for LLM inference, focused on **attention over KV caches** (including paged layouts) rather than general matrix multiplication (MatMul).
- It replaces **generic PyTorch attention ops** with **fused GPU kernels**.

✅ **Can speed up inference (the 2-4x figure on an RTX 4090 is workload-dependent)**  

---

### **(3.3) Checkpointing & Gradient Offloading**
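For **training large models**:
- **Gradient Offloading** → Moves gradients (and often optimizer state) to CPU RAM.
- **Activation Checkpointing** → Stores only a subset of activations in the forward pass and recomputes the rest during the backward pass.

✅ **Helps fit fine-tuning of LLaMA 65B-class models on 48GB GPUs, typically together with quantization and LoRA**  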


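### **(3.4) Flash-Decoding and vLLM PagedAttention**
For **fast inference with long KV caches**:
- **Flash-Decoding** → Splits the KV cache into chunks along the sequence dimension, computes attention over each chunk in parallel, and then reduces the partial results into the final output.
- **PagedAttention (vLLM)** → Stores the KV cache in fixed-size blocks that do not need to be contiguous in memory, tracked through a per-sequence block table.

✅ **Allows LLaMA-3 and similar models to decode faster and serve more requests per GPU**  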
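
The idea of storing the cache in non-contiguous blocks can be sketched with a toy block table. This is an illustration only, not vLLM's implementation: the class `PagedKVCache` and its methods are hypothetical, it stores keys only (values would be handled the same way), and it never frees blocks.

```python
import numpy as np

class PagedKVCache:
    """Toy block-table cache: logical token positions map to scattered blocks."""

    def __init__(self, n_blocks, block_size, d):
        self.block_size = block_size
        self.pool = np.zeros((n_blocks, block_size, d))  # physical storage
        self.free = list(range(n_blocks))                # unused block ids
        self.tables = {}                                 # seq_id -> block ids
        self.lengths = {}                                # seq_id -> tokens stored

    def append(self, seq_id, k_t):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # no block yet, or last one is full
            table.append(self.free.pop())     # take any free block
        self.pool[table[-1], n % self.block_size] = k_t
        self.lengths[seq_id] = n + 1

    def gather(self, seq_id):
        # Rebuild the logical sequence by following the block table.
        n = self.lengths[seq_id]
        blocks = self.pool[self.tables[seq_id]]
        return blocks.reshape(-1, self.pool.shape[-1])[:n]

cache = PagedKVCache(n_blocks=8, block_size=2, d=3)
for t in range(5):
    # Two sequences grow in an interleaved fashion, so their blocks interleave too.
    cache.append("a", np.full(3, 10 * t))
    cache.append("b", np.full(3, 10 * t + 1))

blocks_of_a = cache.tables["a"]
keys_of_a = cache.gather("a")
```

Sequence `a` ends up in blocks 7, 5, and 3, which are not adjacent, yet `gather` returns its keys in the original order.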

---

## **Final Summary**
| **Optimization Type** | **Technique** | **Benefit** |
|------------------|----------------|------------|
| **Windows Optimizations** | StreamingLLM, FlashAttention, Offloading | Run LLMs on low VRAM GPUs |
| **Quantization** | 8-bit, 4-bit, LoRA | Reduce memory usage 4-10x |
| **Storage Optimization** | KV Cache Compression, Checkpointing | Lower GPU memory needs |
| **Compute Optimization** | FlashInfer, CUDA Kernels | Speed up inference |

🚀 **Using these techniques, you can run massive LLMs (e.g., 65B models) efficiently on consumer hardware!**