# Lab 3.3.3: vLLM Continuous Batching & PagedAttention

**Module:** 3.3 - Model Deployment & Inference Engines  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how PagedAttention eliminates KV cache memory fragmentation
- [ ] Configure vLLM's continuous batching for maximum throughput
- [ ] Measure throughput under varying concurrent loads
- [ ] Tune vLLM parameters for DGX Spark's 128GB unified memory

---

## 📚 Prerequisites

- Completed: Lab 3.3.1 (Engine Benchmark)
- Knowledge of: REST APIs, concurrency concepts
- Having: Docker with GPU support, HF token for gated models

---

## 🌍 Real-World Context

**The Problem:** When serving LLMs to many users, you face a dilemma:

- **Sequential processing:** One request at a time → Low GPU utilization, bad for business
- **Static batching:** Wait for N requests → High latency for early arrivals
- **Memory waste:** Each request reserves max possible KV cache → Fewer concurrent users

**vLLM's Solution:**
1. **Continuous Batching:** Dynamically add/remove requests from the batch
2. **PagedAttention:** Manage KV cache like OS virtual memory pages

**Real Impact:**
- 2-4x higher throughput than earlier serving systems such as FasterTransformer and Orca, as reported in the vLLM paper
- Near-100% GPU utilization under load
- Serve 3-4x more users with the same hardware

---

## 🧒 ELI5: What is Continuous Batching?

> **Imagine you're running a ferry service...**
>
> **OLD WAY (Static Batching):**
> - Ferry waits until it has 10 passengers
> - First passenger might wait 30 minutes for others
> - If only 3 people show up, ferry is mostly empty
>
> **NEW WAY (Continuous Batching):**
> - Ferry keeps moving continuously
> - Passengers hop on and off at each dock
> - Person A gets off at dock 3, Person B gets on at dock 3
> - Ferry is always full, everyone arrives faster!
>
> **In AI terms:** Instead of waiting for a batch of requests, vLLM processes tokens continuously. When one request finishes, a new one immediately takes its place in the batch.

---

## 🧒 ELI5: What is PagedAttention?

> **Imagine a parking lot for thoughts (KV cache)...**
>
> **OLD WAY:**
> - Each car (request) reserves a HUGE space, just in case it grows
> - A Smart car reserves 10 spaces "in case I become a bus"
> - Parking lot fills up with mostly empty reserved spaces
>
> **PagedAttention WAY:**
> - Parking lot is divided into small, equal-sized pages
> - Each car gets exactly the spaces it needs right now
> - As a car grows (longer response), it gets more pages
> - When a car leaves, its pages are immediately reused
>
> **In AI terms:** Instead of pre-allocating max KV cache per request, PagedAttention allocates memory in small blocks (pages) on demand. This eliminates fragmentation and allows ~3x more concurrent requests.

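The parking-lot picture can be made concrete with a toy allocator. This is a minimal sketch of the idea only, not vLLM's actual implementation: the class name `BlockAllocator` and its methods are hypothetical, and real block tables also handle sharing and copy-on-write. It shows that a request only takes a new block when its last block is full, and that released blocks go straight back to the pool.

```python
class BlockAllocator:
    """Toy fixed-size block pool illustrating on-demand KV cache allocation."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, request_id: str) -> None:
        """Store one more token; take a new block only when the last one is full."""
        tokens = self.num_tokens.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if tokens == len(table) * self.block_size:  # no block yet, or last block full
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = tokens + 1

    def release(self, request_id: str) -> None:
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


pool = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    pool.append_token("req-A")  # 20 tokens need 2 blocks
for _ in range(5):
    pool.append_token("req-B")  # 5 tokens need 1 block
print(len(pool.free_blocks))    # 5 blocks still free
pool.release("req-A")
print(len(pool.free_blocks))    # 7 blocks free after req-A leaves
```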
---

## Part 1: Setting Up vLLM on DGX Spark

In [None]:
# Standard imports
import asyncio
import json
import os
import sys
import time
import subprocess
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
import warnings
warnings.filterwarnings('ignore')

# Third-party imports
import requests
import numpy as np

# Add scripts directory to path
scripts_path = Path("../scripts").resolve()
sys.path.insert(0, str(scripts_path))

try:
    from benchmark_utils import InferenceBenchmark, BenchmarkResult, BatchBenchmarkResult
    from monitoring import GPUMonitor
    print("✅ Custom utilities loaded")
except ImportError as e:
    print(f"⚠️ Could not load custom utilities: {e}")

print(f"📁 Scripts path: {scripts_path}")

In [None]:
# Check GPU status
def get_gpu_info():
    """Get GPU information for capacity planning."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total,memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            values = result.stdout.strip().split(",")
            return {
                "name": values[0].strip(),
                "total_gb": int(values[1]) / 1024,
                "free_gb": int(values[2]) / 1024
            }
    except (OSError, ValueError, IndexError):
        # nvidia-smi is missing, or a memory field came back non-numeric
        pass
    return None

gpu_info = get_gpu_info()
if gpu_info:
    print(f"üñ•Ô∏è GPU: {gpu_info['name']}")
    print(f"   Total: {gpu_info['total_gb']:.1f}GB")
    print(f"   Free: {gpu_info['free_gb']:.1f}GB")
    
    # Estimate max model size
    usable = gpu_info['free_gb'] * 0.85  # Leave 15% for overhead
    max_model_fp16 = usable / 2  # 2 bytes per parameter for FP16
    print(f"\nüìä Estimated capacity:")
    print(f"   Max FP16 model: ~{max_model_fp16:.0f}B parameters")
    print(f"   Max BF16 model: ~{max_model_fp16:.0f}B parameters")
else:
    print("⚠️ Could not get GPU info")

### 🔧 Starting vLLM on DGX Spark

vLLM requires `--enforce-eager` on ARM64 (DGX Spark) to disable CUDA graphs:

```bash
# Option 1: Using NGC PyTorch container (recommended)
docker run --gpus all -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e HF_TOKEN=$HF_TOKEN \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.11-py3 \
    bash -c "pip install vllm && python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --enforce-eager \
        --dtype bfloat16 \
        --max-model-len 8192 \
        --gpu-memory-utilization 0.85"
```

**Key vLLM configuration flags:**

| Flag | Description | DGX Spark Value |
|------|-------------|----------------|
| `--enforce-eager` | Disable CUDA graphs | **Required** on ARM64 |
| `--dtype` | Model precision | `bfloat16` (native support) |
| `--max-model-len` | Max sequence length | 8192 (adjust based on needs) |
| `--gpu-memory-utilization` | GPU memory to use | 0.85 (leave headroom) |
| `--max-num-seqs` | Max concurrent sequences | Default: 256 |
| `--block-size` | PagedAttention block size | 16 (default) |

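To see how `--block-size` and `--max-model-len` interact, the arithmetic below counts how many blocks a sequence of a given length occupies. It is a small sketch: the helper name `blocks_for_tokens` is made up for illustration and is not part of vLLM.

```python
import math


def blocks_for_tokens(num_tokens: int, block_size: int = 16) -> int:
    """Number of fixed-size KV cache blocks needed to hold num_tokens tokens."""
    return math.ceil(num_tokens / block_size)


# A sequence that fills the whole 8192-token window with the default block size
print(blocks_for_tokens(8192, 16))  # 512

# A short 100-token exchange only touches 7 blocks
short_blocks = blocks_for_tokens(100, 16)
unused_slots = short_blocks * 16 - 100  # waste is confined to the last block
print(short_blocks, unused_slots)       # 7 12
```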
In [None]:
# Check vLLM server status
VLLM_URL = "http://localhost:8000"

def check_vllm_server(url: str = VLLM_URL) -> Dict[str, Any]:
    """Check vLLM server status and get model info."""
    try:
        response = requests.get(f"{url}/v1/models", timeout=5)
        if response.status_code == 200:
            data = response.json()
            models = data.get("data", [])
            print(f"✅ vLLM server running at {url}")
            for model in models:
                print(f"   Model: {model.get('id', 'unknown')}")
            return {"available": True, "models": models}
    except requests.exceptions.ConnectionError:
        print(f"❌ vLLM server not running at {url}")
        print("\n📝 Start with the command in the cell above")
    except Exception as e:
        print(f"❌ Error: {e}")
    return {"available": False, "models": []}

vllm_status = check_vllm_server()

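Model loading can take several minutes after `docker run`, so a single status check often fails on the first try. The sketch below polls the same `/v1/models` endpoint until it answers. The function name `wait_for_server` and its default timeout and interval are assumptions chosen for illustration, and it uses only the standard library so it runs without `requests`.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll {base_url}/v1/models until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; retry after a pause
        time.sleep(interval_s)
    return False


# Example usage (blocks for up to 5 minutes while the model loads):
# ready = wait_for_server("http://localhost:8000")
```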
---

## Part 2: Understanding Continuous Batching

Let's visualize how continuous batching works compared to static batching.

In [None]:
# Visualize batching strategies
print("""
📊 STATIC BATCHING vs CONTINUOUS BATCHING
==================================================

STATIC BATCHING (Traditional):
──────────────────────────────
Time:    0   1   2   3   4   5   6   7   8   9   10
Request1 [=====]                                    (5 tokens)
Request2 [=================]                        (10 tokens)
Request3 [===]                                      (3 tokens) WASTED
                                                    capacity!
         ↑ Batch starts    ↑ Batch ends (slowest)
         
Problem: Request1 and Request3 finish early but wait for Request2
         GPU sits idle for fast requests

CONTINUOUS BATCHING (vLLM):
──────────────────────────────
Time:    0   1   2   3   4   5   6   7   8   9   10
Request1 [=====]                                    (done at 5)
Request2 [=================]                        (done at 10)
Request3 [===]                                      (done at 3)
Request4       [=========]                          (starts at 3, done at 8)
Request5             [=======]                      (starts at 5, done at 9)

Benefit: As soon as Request3 finishes, Request4 starts
         GPU always working at full capacity!
""")

In [None]:
# Simulate requests with varying lengths
def simulate_request_lengths(num_requests: int = 20) -> List[Dict]:
    """Generate simulated requests with varying output lengths."""
    np.random.seed(42)
    
    requests_sim = []
    for i in range(num_requests):
        # Simulate varying response lengths (some short, some long)
        output_tokens = int(np.random.lognormal(4, 1))  # Log-normal distribution
        output_tokens = max(10, min(500, output_tokens))  # Clamp to 10-500
        
        requests_sim.append({
            "id": i,
            "output_tokens": output_tokens,
            "arrival_time": i * 0.5  # New request every 0.5 seconds
        })
    
    return requests_sim

simulated_requests = simulate_request_lengths()
print("üìä Simulated request distribution:")
print(f"   Total requests: {len(simulated_requests)}")
print(f"   Min tokens: {min(r['output_tokens'] for r in simulated_requests)}")
print(f"   Max tokens: {max(r['output_tokens'] for r in simulated_requests)}")
print(f"   Median tokens: {np.median([r['output_tokens'] for r in simulated_requests]):.0f}")

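The difference between the two strategies can be checked with a small step-counting simulation. This is a simplified model under stated assumptions: every decode step emits one token per running request, all requests are already queued, and prefill cost is ignored. The function names are illustrative. The request lengths are the five from the diagram above (5, 10, 3, 5 and 4 tokens) with three batch slots.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch of `slots` requests runs until its longest member finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """A finished request frees its slot for the next queued request immediately."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1  # one decode step emits one token per running request
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [5, 10, 3, 5, 4]  # Request1..Request5 from the diagram
slots = 3
static_steps = static_batching_steps(lengths, slots)
continuous_steps = continuous_batching_steps(lengths, slots)
total_tokens = sum(lengths)

print(f"static:     {static_steps} steps, "
      f"slot utilization {total_tokens / (static_steps * slots):.0%}")      # 15 steps, 60%
print(f"continuous: {continuous_steps} steps, "
      f"slot utilization {total_tokens / (continuous_steps * slots):.0%}")  # 10 steps, 90%
```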
---

## Part 3: Benchmarking Throughput Under Load

Let's measure how vLLM handles increasing concurrent request loads.

In [None]:
# Prepare test prompts
TEST_PROMPTS = [
    "What is machine learning? Explain briefly.",
    "Write a short poem about the ocean.",
    "List 5 tips for learning a new programming language.",
    "Explain the water cycle in simple terms.",
    "What are the benefits of exercise?",
    "Describe the solar system.",
    "How does the internet work?",
    "What is artificial intelligence?",
    "Explain photosynthesis.",
    "What causes weather changes?",
    "Describe the process of making bread.",
    "How do airplanes fly?",
] * 4  # 48 prompts total

print(f"üìù Prepared {len(TEST_PROMPTS)} test prompts")

In [None]:
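def send_request_sync(url: str, prompt: str, max_tokens: int = 100,
                      model: Optional[str] = None) -> Dict:
    """Send a single request and return timing info."""
    # vLLM expects the served model name, so default to the first model
    # reported by check_vllm_server() instead of a placeholder.
    if model is None:
        served = vllm_status.get("models", [])
        model = served[0].get("id", "default") if served else "default"

    start_time = time.perf_counter()
    
    try:
        response = requests.post(
            f"{url}/v1/chat/completions",
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
                "temperature": 0.7
            },
            timeout=120
        )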
        end_time = time.perf_counter()
        
        if response.status_code == 200:
            data = response.json()
            usage = data.get("usage", {})
            return {
                "success": True,
                "latency": end_time - start_time,
                "tokens": usage.get("completion_tokens", 0),
                "prompt_tokens": usage.get("prompt_tokens", 0)
            }
        else:
            return {"success": False, "latency": end_time - start_time, "error": response.status_code}
            
    except Exception as e:
        return {"success": False, "latency": time.perf_counter() - start_time, "error": str(e)}


def benchmark_throughput(
    url: str,
    prompts: List[str],
    concurrency: int,
    max_tokens: int = 100
) -> Dict[str, Any]:
    """
    Benchmark throughput at a given concurrency level.
    
    Returns:
        Dictionary with throughput metrics
    """
    results = []
    start_time = time.perf_counter()
    
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [
            executor.submit(send_request_sync, url, prompt, max_tokens)
            for prompt in prompts
        ]
        
        for future in as_completed(futures):
            results.append(future.result())
    
    total_time = time.perf_counter() - start_time
    
    # Compute metrics
    successful = [r for r in results if r["success"]]
    failed = len(results) - len(successful)
    
    if successful:
        latencies = [r["latency"] for r in successful]
        total_tokens = sum(r["tokens"] for r in successful)
        
        return {
            "concurrency": concurrency,
            "total_requests": len(prompts),
            "successful": len(successful),
            "failed": failed,
            "total_time": total_time,
            "throughput_rps": len(successful) / total_time,
            "total_tokens": total_tokens,
            "tokens_per_second": total_tokens / total_time,
            "avg_latency": np.mean(latencies),
            "p50_latency": np.percentile(latencies, 50),
            "p90_latency": np.percentile(latencies, 90),
            "p99_latency": np.percentile(latencies, 99),
        }
    else:
        return {
            "concurrency": concurrency,
            "total_requests": len(prompts),
            "successful": 0,
            "failed": failed,
            "error": "All requests failed"
        }

In [None]:
# Run throughput benchmarks at various concurrency levels
CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 32]
NUM_REQUESTS_PER_LEVEL = 24  # Adjust based on patience; levels above this value are capped at this many in-flight requests

if vllm_status["available"]:
    print("üöÄ Running throughput benchmark...")
    print(f"   Concurrency levels: {CONCURRENCY_LEVELS}")
    print(f"   Requests per level: {NUM_REQUESTS_PER_LEVEL}")
    print("\n" + "="*70)
    
    throughput_results = []
    
    for concurrency in CONCURRENCY_LEVELS:
        print(f"\nTesting concurrency = {concurrency}...")
        
        result = benchmark_throughput(
            url=VLLM_URL,
            prompts=TEST_PROMPTS[:NUM_REQUESTS_PER_LEVEL],
            concurrency=concurrency,
            max_tokens=100
        )
        
        throughput_results.append(result)
        
        if "error" not in result:
            print(f"   ✅ Throughput: {result['throughput_rps']:.2f} req/s")
            print(f"      Token throughput: {result['tokens_per_second']:.0f} tok/s")
            print(f"      Avg latency: {result['avg_latency']*1000:.0f}ms")
            print(f"      P90 latency: {result['p90_latency']*1000:.0f}ms")
        else:
            print(f"   ❌ Failed: {result.get('error')}")
else:
    print("⚠️ vLLM not available. Start the server to run this benchmark.")
    throughput_results = []

In [None]:
# Visualize throughput results
if throughput_results:
    try:
        import matplotlib.pyplot as plt
        
        successful_results = [r for r in throughput_results if "error" not in r]
        
        if successful_results:
            fig, axes = plt.subplots(1, 3, figsize=(15, 4))
            
            # Plot 1: Throughput vs Concurrency
            concs = [r["concurrency"] for r in successful_results]
            throughputs = [r["throughput_rps"] for r in successful_results]
            axes[0].plot(concs, throughputs, 'bo-', linewidth=2, markersize=8)
            axes[0].set_xlabel("Concurrency")
            axes[0].set_ylabel("Throughput (req/s)")
            axes[0].set_title("Throughput vs Concurrency")
            axes[0].grid(True, alpha=0.3)
            
            # Plot 2: Token throughput
            tok_throughputs = [r["tokens_per_second"] for r in successful_results]
            axes[1].plot(concs, tok_throughputs, 'go-', linewidth=2, markersize=8)
            axes[1].set_xlabel("Concurrency")
            axes[1].set_ylabel("Token Throughput (tok/s)")
            axes[1].set_title("Token Throughput vs Concurrency")
            axes[1].grid(True, alpha=0.3)
            
            # Plot 3: Latency
            p50s = [r["p50_latency"]*1000 for r in successful_results]
            p90s = [r["p90_latency"]*1000 for r in successful_results]
            axes[2].plot(concs, p50s, 'b^-', label='P50', linewidth=2, markersize=8)
            axes[2].plot(concs, p90s, 'rs-', label='P90', linewidth=2, markersize=8)
            axes[2].set_xlabel("Concurrency")
            axes[2].set_ylabel("Latency (ms)")
            axes[2].set_title("Latency vs Concurrency")
            axes[2].legend()
            axes[2].grid(True, alpha=0.3)
            
            plt.tight_layout()
            plt.savefig("vllm_throughput.png", dpi=150, bbox_inches='tight')
            plt.show()
            
            print("\nüìà Charts saved to vllm_throughput.png")
            
    except ImportError:
        print("‚ö†Ô∏è matplotlib not available. Install with: pip install matplotlib")

### üîç What Just Happened?

We measured how vLLM handles increasing load:

1. **Throughput increases with concurrency** (up to a point)
   - More concurrent requests = better GPU utilization
   - Continuous batching keeps the GPU busy

2. **Latency increases with concurrency**
   - Individual requests take longer when batched
   - But total throughput is higher (good tradeoff for high-load scenarios)

3. **Saturation point**
   - At some concurrency level, throughput stops increasing
   - GPU is fully utilized, adding more requests just adds queue time

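One way to locate the saturation point from a list of benchmark results is to look for the first step where token throughput grows by less than some threshold. The sketch below assumes result dictionaries with the same keys that `benchmark_throughput` returns. The 10% threshold, the helper names, and the example numbers are made up for illustration and are not measurements. Little's law (requests in flight = throughput × latency) gives a quick sanity check on how many requests were actually being served at once.

```python
def find_saturation(results, min_gain=0.10):
    """Return the concurrency after which throughput gains drop below min_gain."""
    ordered = sorted(results, key=lambda r: r["concurrency"])
    for prev, cur in zip(ordered, ordered[1:]):
        gain = cur["tokens_per_second"] / prev["tokens_per_second"] - 1
        if gain < min_gain:
            return prev["concurrency"]
    return None  # no plateau within the tested range


def littles_law_in_flight(throughput_rps, avg_latency_s):
    """Little's law: average requests in flight = completion rate x time in system."""
    return throughput_rps * avg_latency_s


# Illustrative numbers only; replace with your own throughput_results.
example_results = [
    {"concurrency": 1,  "tokens_per_second": 40.0,  "throughput_rps": 0.4, "avg_latency": 2.5},
    {"concurrency": 4,  "tokens_per_second": 150.0, "throughput_rps": 1.5, "avg_latency": 2.6},
    {"concurrency": 16, "tokens_per_second": 480.0, "throughput_rps": 4.8, "avg_latency": 3.2},
    {"concurrency": 64, "tokens_per_second": 500.0, "throughput_rps": 5.0, "avg_latency": 12.0},
]

print("Saturates around concurrency:", find_saturation(example_results))
for r in example_results:
    in_flight = littles_law_in_flight(r["throughput_rps"], r["avg_latency"])
    print(f"concurrency={r['concurrency']:>3}  estimated in flight={in_flight:.1f}")
```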
---

## Part 4: PagedAttention Memory Efficiency

Let's understand how PagedAttention improves memory efficiency.

In [None]:
# Calculate memory savings from PagedAttention
def calculate_kv_cache_memory(
    model_hidden_size: int = 4096,  # Llama 3.1 8B
    num_layers: int = 32,
    num_kv_heads: int = 8,
    head_dim: int = 128,
    max_seq_len: int = 8192,
    num_sequences: int = 32,
    dtype_bytes: int = 2  # bfloat16
) -> Dict[str, float]:
    """
    Calculate KV cache memory requirements.
    
    KV cache size = 2 (K+V) * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes
    """
    # Per-token KV cache size
    per_token_kv = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    
    # Traditional: Reserve max_seq_len for each sequence
    traditional_per_seq = per_token_kv * max_seq_len
    traditional_total = traditional_per_seq * num_sequences
    
    # PagedAttention: Allocate as needed (assume average 50% utilization)
    avg_utilization = 0.5
    paged_per_seq = per_token_kv * max_seq_len * avg_utilization
    paged_total = paged_per_seq * num_sequences
    
    # Also account for fragmentation in traditional (typically 30-50%)
    fragmentation = 0.35
    traditional_with_frag = traditional_total / (1 - fragmentation)
    
    return {
        "per_token_bytes": per_token_kv,
        "traditional_total_gb": traditional_total / (1024**3),
        "traditional_with_frag_gb": traditional_with_frag / (1024**3),
        "paged_total_gb": paged_total / (1024**3),
        "memory_savings": 1 - (paged_total / traditional_with_frag),
        "max_sequences_traditional": int(128 * (1024**3) / traditional_with_frag * num_sequences),
        "max_sequences_paged": int(128 * (1024**3) / paged_total * num_sequences),
    }

kv_analysis = calculate_kv_cache_memory()

print("üìä KV Cache Memory Analysis (Llama 3.1 8B, 32 concurrent sequences)")
print("="*60)
print(f"\nPer-token KV cache: {kv_analysis['per_token_bytes']} bytes")
print(f"\nTraditional allocation:")
print(f"   Total reserved: {kv_analysis['traditional_total_gb']:.1f} GB")
print(f"   With fragmentation: {kv_analysis['traditional_with_frag_gb']:.1f} GB")
print(f"\nPagedAttention:")
print(f"   Total used: {kv_analysis['paged_total_gb']:.1f} GB")
print(f"\nüöÄ Memory savings: {kv_analysis['memory_savings']:.0%}")
print(f"\nMax concurrent sequences on 128GB:")
print(f"   Traditional: ~{kv_analysis['max_sequences_traditional']}")
print(f"   PagedAttention: ~{kv_analysis['max_sequences_paged']}")
print(f"\n   That's {kv_analysis['max_sequences_paged'] / kv_analysis['max_sequences_traditional']:.1f}x more users!")

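The estimate above treats all 128GB as KV cache. A slightly tighter sketch subtracts the model weights and applies the `--gpu-memory-utilization` fraction first. The assumptions here are rough: about 8 billion parameters at 2 bytes each, an average of 4096 tokens held per sequence, and no allowance for activations or other processes sharing the unified memory. The function name `kv_budget_sequences` is hypothetical.

```python
GIB = 1024 ** 3


def kv_budget_sequences(
    total_mem_gib: float = 128.0,
    gpu_memory_utilization: float = 0.85,
    num_params_billion: float = 8.0,   # assumed parameter count
    weight_bytes: int = 2,             # bfloat16
    per_token_kv_bytes: int = 131072,  # 2 * 32 layers * 8 KV heads * 128 dim * 2 bytes
    avg_tokens_per_seq: int = 4096,    # assumed average tokens held per sequence
) -> int:
    """Rough count of sequences that fit in the KV cache pool after loading weights."""
    usable_bytes = total_mem_gib * GIB * gpu_memory_utilization
    weight_total = num_params_billion * 1e9 * weight_bytes
    kv_pool_bytes = usable_bytes - weight_total
    return int(kv_pool_bytes // (per_token_kv_bytes * avg_tokens_per_seq))


print(kv_budget_sequences())                         # 187 with the defaults above
print(kv_budget_sequences(avg_tokens_per_seq=8192))  # fewer if every sequence fills the window
```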
---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting `--enforce-eager` on ARM64

```bash
# ❌ Wrong - CUDA graphs fail on ARM64
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct

# ✅ Right - Disable CUDA graphs for DGX Spark
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --enforce-eager
```

### Mistake 2: Setting `--max-model-len` Too High

```bash
# ❌ Wrong - Uses too much memory for KV cache
--max-model-len 131072  # 128K context

# ✅ Right - Set based on your actual needs
--max-model-len 8192   # Most conversations don't need 128K
```

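**Why:** A higher `--max-model-len` lets each sequence grow to occupy more KV cache blocks and raises the minimum cache vLLM must be able to fit for a single sequence, which reduces concurrent capacity.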

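The cost of a long context window is easy to quantify for one sequence that actually fills it. The sketch below reuses the Llama 3.1 8B shape from Part 4 (32 layers, 8 KV heads, head dimension 128, bfloat16); the helper name `kv_bytes_per_token` is illustrative.

```python
def kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache bytes for one token: keys and values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


for max_len in (2048, 4096, 8192, 16384, 131072):
    gib = kv_bytes_per_token() * max_len / 1024 ** 3
    print(f"--max-model-len {max_len:>6}: {gib:5.2f} GiB for one full-length sequence")
```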
### Mistake 3: Not Using `--ipc=host` in Docker

```bash
# ❌ Wrong - Container's default shared memory is too small for PyTorch inter-process communication
docker run --gpus all ...

# ✅ Right - Enable host IPC for shared memory
docker run --gpus all --ipc=host ...
```

---

## ✋ Try It Yourself

### Exercise 1: Find the Saturation Point

Test higher concurrency levels to find where throughput plateaus.

In [None]:
# Exercise 1: Your code here
# Test concurrency levels: [1, 4, 8, 16, 32, 64, 128]
# Find where throughput stops increasing

EXTENDED_CONCURRENCY = [1, 4, 8, 16, 32, 64, 128]

# TODO: Run benchmarks at these levels
# TODO: Plot throughput vs concurrency
# TODO: Identify the saturation point

### Exercise 2: Tune for Your Workload

Experiment with different `--max-model-len` values and measure impact.

In [None]:
# Exercise 2: Notes for different configurations
# 
# Test these max-model-len values and note the max concurrency:
# - 2048: Short conversations
# - 4096: Medium conversations
# - 8192: Long conversations
# - 16384: Very long conversations
#
# For each, restart vLLM with the new value and measure:
# 1. Maximum concurrent sequences before OOM
# 2. Throughput at concurrency=16
# 3. Memory usage (nvidia-smi)

---

## 🎉 Checkpoint

You've learned:
- ✅ How continuous batching maximizes GPU utilization
- ✅ How PagedAttention eliminates memory fragmentation
- ✅ How to benchmark throughput under varying loads
- ✅ How to configure vLLM for DGX Spark

---

## 🚀 Challenge (Optional)

**Build an Adaptive Concurrency Controller**

Create a system that:
1. Monitors current P90 latency
2. Automatically adjusts max concurrency to meet SLA
3. Alerts when capacity is exceeded

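As a possible starting point, the control loop can follow an additive-increase, multiplicative-decrease (AIMD) rule: raise the limit by one while P90 latency meets the SLA, and halve it when the SLA is violated. This is only one of several reasonable designs. The class name, its parameters, and the alert condition below are all assumptions for illustration, and wiring it to real latency measurements is left as the exercise.

```python
class AIMDConcurrencyController:
    """Adjust a concurrency limit from observed P90 latency using AIMD."""

    def __init__(self, sla_p90_s: float, min_limit: int = 1,
                 max_limit: int = 128, decrease_factor: float = 0.5):
        self.sla_p90_s = sla_p90_s
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.decrease_factor = decrease_factor
        self.limit = min_limit  # start conservatively

    def update(self, observed_p90_s: float) -> int:
        """Feed one P90 measurement and return the new concurrency limit."""
        if observed_p90_s > self.sla_p90_s:
            # SLA violated: back off multiplicatively
            self.limit = max(self.min_limit, int(self.limit * self.decrease_factor))
        else:
            # SLA met: probe for more capacity one step at a time
            self.limit = min(self.max_limit, self.limit + 1)
        return self.limit

    def capacity_exceeded(self, observed_p90_s: float) -> bool:
        """Alert condition: SLA is missed even at the lowest allowed limit."""
        return self.limit == self.min_limit and observed_p90_s > self.sla_p90_s


controller = AIMDConcurrencyController(sla_p90_s=2.0, max_limit=8)
history = [controller.update(p90) for p90 in (1.0, 1.2, 1.5, 3.0, 1.0)]
print(history)  # [2, 3, 4, 2, 3]
```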
---

## 📖 Further Reading

- [vLLM Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180)
- [vLLM Documentation](https://docs.vllm.ai/)
- [Continuous Batching Explained (Anyscale Blog)](https://www.anyscale.com/blog/continuous-batching-llm-inference)

---

## 🧹 Cleanup

In [None]:
# Cleanup
import gc

# Clear Python garbage
gc.collect()

# Clear GPU memory cache if torch is available
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        print("✅ GPU memory cache cleared!")
except ImportError:
    pass

print("✅ Cleanup complete!")
print("\n📝 To stop vLLM server:")
print("   docker stop $(docker ps -q --filter ancestor=nvcr.io/nvidia/pytorch:25.11-py3)")