# Lab 3.3.2: vLLM Deployment with Continuous Batching

**Module:** 3.3 - Model Deployment & Inference Engines  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand what makes vLLM special (PagedAttention, continuous batching)
- [ ] Deploy vLLM on DGX Spark with optimal configuration
- [ ] Implement and test continuous batching under load
- [ ] Monitor and tune vLLM for your workload

---

## 📚 Prerequisites

- Completed: Lab 3.3.1 (Engine Benchmark)
- Access to: HuggingFace models (HF_TOKEN set if using gated models)
- Docker installed and configured for GPU access

---

## 🌍 Real-World Context

**Why vLLM matters for production deployments:**

Imagine you're running a customer service chatbot that needs to handle 100 concurrent conversations. Traditional inference would:
- Process one request at a time → 100 users waiting
- Or batch requests → users wait until the batch is full

**vLLM's continuous batching** lets you:
- Add new requests to an ongoing batch dynamically
- Start streaming each user's response without waiting for a batch to fill
- Keep the GPU busy instead of idling between batches

Companies like Anyscale, Modal, and Replicate use vLLM to serve millions of requests efficiently.

---

## 🧒 ELI5: Continuous Batching & PagedAttention

### Continuous Batching

> **Imagine you're a DJ at a party...**
>
> **Traditional batching** = You wait until 10 people request songs, then play them all at once.
> Everyone waits, and some people leave frustrated.
>
> **Continuous batching** = You're mixing multiple songs simultaneously. As soon as someone requests
> a song, you blend it into the current mix. No one waits!
>
> **In AI terms:** vLLM can add new requests to an in-progress batch. When a request finishes,
> a new one immediately takes its place - the GPU never sits idle.

### PagedAttention

> **Imagine your GPU memory is like a parking lot...**
>
> **Traditional KV cache** = Each car (request) needs to reserve a parking LANE from start to end.
> Even if the lane is mostly empty, no one else can use it. Wastes 60-80% of space!
>
> **PagedAttention** = Cars park in individual SPOTS. A request might use spots scattered around
> the lot. When a spot is freed, anyone can use it. Under 4% waste, per the vLLM authors!
>
> **In AI terms:** PagedAttention manages KV cache in fixed-size "pages" that can be allocated
> and freed dynamically, dramatically improving memory efficiency.
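
The parking-lot idea can be sketched as a toy block allocator. This is a minimal illustration under stated assumptions, not vLLM's actual allocator: the class name `ToyBlockAllocator`, the 16-token block size, and the 8-block pool are all hypothetical. It shows that a sequence only takes a new block when its current one is full, and that a finished sequence returns its blocks to a shared pool.

```python
from dataclasses import dataclass, field
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


@dataclass
class ToyBlockAllocator:
    """Toy model of paged KV-cache bookkeeping (not vLLM's real allocator)."""
    num_blocks: int
    free_blocks: List[int] = field(default_factory=list)
    block_tables: Dict[str, List[int]] = field(default_factory=dict)
    token_counts: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every block starts out free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> None:
        """Account for one new token, taking a fresh block only when needed."""
        count = self.token_counts.get(seq_id, 0)
        if count % BLOCK_SIZE == 0:  # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop(0))
        self.token_counts[seq_id] = count + 1

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.token_counts.pop(seq_id, None)


alloc = ToyBlockAllocator(num_blocks=8)
for _ in range(20):   # request A: 20 tokens, so it needs 2 blocks
    alloc.append_token("A")
for _ in range(5):    # request B: 5 tokens, so it needs 1 block
    alloc.append_token("B")
print("block tables:", alloc.block_tables)
alloc.free("A")       # A finishes; its blocks become reusable right away
print("free blocks after A finishes:", len(alloc.free_blocks))
```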

---

## Part 1: Setting Up vLLM on DGX Spark

DGX Spark has some special requirements for vLLM due to its ARM64 architecture.

In [None]:
# First, let's check our system
import subprocess
import os
import sys
from pathlib import Path

def run_command(cmd):
    """Run a shell command and return output."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else f"Error: {result.stderr}"

print("🔍 System Information:")
print(f"   Architecture: {run_command('uname -m')}")
print(f"   Python: {sys.version.split()[0]}")
# Query Docker once and reuse the result
docker_version = run_command('docker --version')
print(f"   Docker: {docker_version.split(',')[0] if not docker_version.startswith('Error') else 'Not installed'}")

# Check NVIDIA driver (uses the first GPU if several are listed)
nvidia_output = run_command('nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv,noheader')
if nvidia_output and not nvidia_output.startswith('Error'):
    driver, gpu, memory = [part.strip() for part in nvidia_output.splitlines()[0].split(',')]
    print(f"   GPU: {gpu}")
    print(f"   Memory: {memory}")
    print(f"   Driver: {driver}")
else:
    print("   GPU: nvidia-smi not available")

### 🚀 Starting vLLM

There are two ways to run vLLM on DGX Spark:

#### Option 1: PyTorch NGC Container with vLLM (Recommended for DGX Spark)

```bash
# For DGX Spark ARM64, use the PyTorch NGC container and install vLLM
docker run --gpus all -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e HF_TOKEN=$HF_TOKEN \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.11-py3 \
    bash -c "pip install vllm && python -m vllm.entrypoints.openai.api_server \
        --model Qwen/Qwen3-8B-Instruct \
        --enforce-eager \
        --dtype bfloat16 \
        --max-model-len 4096 \
        --gpu-memory-utilization 0.9"
```

#### Option 2: Official vLLM Container (verify ARM64 support)

```bash
# Check https://hub.docker.com/r/vllm/vllm-openai for ARM64 availability
docker run --gpus all -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen3-8B-Instruct \
    --enforce-eager
```

#### Important Flags for DGX Spark:

| Flag | Purpose |
|------|--------|
| `--enforce-eager` | Disable CUDA graphs (required for ARM64) |
| `--max-model-len` | Limit context to save memory |
| `--gpu-memory-utilization` | How much GPU memory to use (0.9 = 90%) |
| `--dtype bfloat16` | Use BF16 for Blackwell optimization |
| `--ipc=host` | Share the host's shared memory with the container, which PyTorch uses to pass data between processes (Docker flag) |

In [None]:
# Generate the vLLM startup command
def generate_vllm_command(
    model: str = "Qwen/Qwen3-8B-Instruct",
    max_model_len: int = 4096,
    gpu_memory_utilization: float = 0.9,
    port: int = 8000,
    dtype: str = "bfloat16"
) -> str:
    """
    Generate the docker command to start vLLM on DGX Spark.
    
    Args:
        model: HuggingFace model ID
        max_model_len: Maximum context length
        gpu_memory_utilization: Fraction of GPU memory to use (0.0-1.0)
        port: Port to expose the API on
        dtype: Data type for inference (bfloat16 recommended for Blackwell)
        
    Returns:
        Docker command string ready to execute
    """
    # Reference the shell variable instead of embedding the token value,
    # so the printed command never exposes the secret.
    token_flag = '-e HF_TOKEN=$HF_TOKEN'
    
    # Use PyTorch NGC container for ARM64 compatibility
    cmd = f"""docker run --gpus all -p {port}:8000 \\
    -v ~/.cache/huggingface:/root/.cache/huggingface \\
    {token_flag} \\
    --ipc=host \\
    nvcr.io/nvidia/pytorch:25.11-py3 \\
    bash -c "pip install vllm && python -m vllm.entrypoints.openai.api_server \\
        --model {model} \\
        --enforce-eager \\
        --max-model-len {max_model_len} \\
        --gpu-memory-utilization {gpu_memory_utilization} \\
        --dtype {dtype}"
"""
    
    return cmd

print("🚀 vLLM Startup Command for DGX Spark:")
print("=" * 60)
print(generate_vllm_command())
print("=" * 60)
print("\n💡 Copy and run this in a separate terminal!")
print("   Note: First run will install vLLM (takes a few minutes)")

### Checking if vLLM is Running

In [None]:
import requests
import time

VLLM_URL = "http://localhost:8000"

def check_vllm_status(url: str = VLLM_URL, timeout: int = 5) -> dict:
    """Check if vLLM server is running and get its status."""
    try:
        # Check models endpoint
        response = requests.get(f"{url}/v1/models", timeout=timeout)
        if response.status_code == 200:
            models = response.json()
            model_list = [m["id"] for m in models.get("data", [])]
            return {
                "status": "running",
                "models": model_list,
                "url": url
            }
    except requests.exceptions.ConnectionError:
        pass
    except Exception as e:
        return {"status": "error", "error": str(e)}
    
    return {"status": "not_running"}

status = check_vllm_status()

if status["status"] == "running":
    print(f"✅ vLLM is running at {status['url']}")
    print(f"   Available models: {', '.join(status['models'])}")
else:
    print("❌ vLLM is not running")
    print("   Please start vLLM using the command above")
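
Because the first container start installs vLLM and loads the model, a single status check often runs too early. A minimal polling sketch, assuming a hypothetical helper named `wait_until_ready` that takes any zero-argument probe function, could look like this. The demo uses a fake probe so it runs without a server.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Demo with a fake probe that succeeds on its third call (no server needed).
calls = {"n": 0}


def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until_ready(fake_probe, timeout_s=1.0, interval_s=0.01)
print("ready:", ready, "after", calls["n"], "probes")
```

In the notebook, the probe could be `lambda: check_vllm_status()["status"] == "running"`, with a timeout long enough to cover the install and model download.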

---

## Part 2: Understanding Continuous Batching

Let's visualize how continuous batching works compared to static batching.

In [None]:
# Simulation of static vs continuous batching
import random

def simulate_batching(n_requests: int = 10, batch_size: int = 4):
    """
    Simulate static vs continuous batching to show the difference.
    """
    # Request arrival times (random within first 5 seconds)
    arrivals = sorted([random.uniform(0, 5) for _ in range(n_requests)])
    # Processing time per request (varies based on output length)
    process_times = [random.uniform(0.5, 2.0) for _ in range(n_requests)]
    
    # STATIC BATCHING
    # Wait for batch_size requests, process together, repeat
    static_completions = []
    current_batch = []
    batch_start = 0
    
    for i, (arrival, proc_time) in enumerate(zip(arrivals, process_times)):
        current_batch.append((i, arrival, proc_time))
        
        if len(current_batch) >= batch_size or i == n_requests - 1:
            # Process batch: starts when last request arrives
            batch_start = max(batch_start, max(x[1] for x in current_batch))
            max_time = max(x[2] for x in current_batch)
            
            for req_id, arr_time, _ in current_batch:
                completion = batch_start + max_time
                wait_time = completion - arr_time
                static_completions.append((req_id, arr_time, completion, wait_time))
            
            batch_start = completion
            current_batch = []
    
    # CONTINUOUS BATCHING
    # Process requests as they arrive, overlapping execution
    continuous_completions = []
    for i, (arrival, proc_time) in enumerate(zip(arrivals, process_times)):
        completion = arrival + proc_time  # Simplified: starts immediately
        wait_time = completion - arrival
        continuous_completions.append((i, arrival, completion, wait_time))
    
    return static_completions, continuous_completions, arrivals

# Run simulation
static, continuous, arrivals = simulate_batching(n_requests=8, batch_size=4)

print("📊 Batching Comparison (8 requests, batch_size=4)")
print("=" * 60)
print("\n🔴 STATIC BATCHING:")
print(f"{'Request':<10} {'Arrival':<10} {'Complete':<10} {'Wait Time':<10}")
print("-" * 40)
for req_id, arr, comp, wait in static:
    print(f"{req_id:<10} {arr:<10.2f} {comp:<10.2f} {wait:<10.2f}")
avg_static = sum(w for _, _, _, w in static) / len(static)
print(f"\nAverage wait time: {avg_static:.2f}s")

print("\n🟢 CONTINUOUS BATCHING:")
print(f"{'Request':<10} {'Arrival':<10} {'Complete':<10} {'Wait Time':<10}")
print("-" * 40)
for req_id, arr, comp, wait in continuous:
    print(f"{req_id:<10} {arr:<10.2f} {comp:<10.2f} {wait:<10.2f}")
avg_continuous = sum(w for _, _, _, w in continuous) / len(continuous)
print(f"\nAverage wait time: {avg_continuous:.2f}s")

print(f"\n🎯 Improvement: {(avg_static - avg_continuous) / avg_static * 100:.1f}% lower latency")

### 🔍 What Just Happened?

In the simulation:
- **Static batching** waits until a batch is full, so early arrivals sit idle until the last request shows up, and every request in the batch then waits for the slowest one to finish.
- **Continuous batching** starts processing each request as it arrives. As requests complete, new ones join the active batch. The simulation simplifies this by assuming a free slot is always available.

The real magic happens when combined with PagedAttention - memory is used efficiently even with variable-length sequences.
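
To make the slot-refill idea concrete when capacity is limited, here is a small step-based sketch. It is a simplified model, not vLLM's scheduler: the function names, the two-slot limit, and the request lengths are made up, every request is assumed to be queued at step 0 and to need at least one token, and one "step" stands for one decode iteration that produces one token per running request.

```python
from collections import deque
from typing import Dict, List


def continuous_batching_steps(lengths: List[int], max_slots: int) -> Dict[int, int]:
    """Finish step of each request under iteration-level scheduling."""
    waiting = deque(enumerate(lengths))  # (request_id, tokens still to generate)
    running: Dict[int, int] = {}
    finished: Dict[int, int] = {}
    step = 0
    while waiting or running:
        # Fill any free slot before the next decode step.
        while waiting and len(running) < max_slots:
            req_id, remaining = waiting.popleft()
            running[req_id] = remaining
        step += 1
        # One decode step yields one token for every running request.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                finished[req_id] = step
                del running[req_id]
    return finished


def static_batching_steps(lengths: List[int], batch_size: int) -> Dict[int, int]:
    """Finish step of each request when a batch runs until its longest member ends."""
    finished: Dict[int, int] = {}
    step = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        step += max(batch)
        for offset in range(len(batch)):
            finished[start + offset] = step
    return finished


lengths = [2, 8, 3, 8, 2, 2]  # tokens to generate per request (made-up values)
cont = continuous_batching_steps(lengths, max_slots=2)
stat = static_batching_steps(lengths, batch_size=2)
print("continuous finish steps:", dict(sorted(cont.items())))
print("static finish steps:    ", dict(sorted(stat.items())))
print("mean finish step:", sum(cont.values()) / len(cont), "vs", sum(stat.values()) / len(stat))
```

With these inputs, short requests no longer wait for the long request sharing their batch, which is where the lower average completion time comes from.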

---

## Part 3: Testing vLLM Under Load

Now let's send real requests to vLLM and see continuous batching in action.

In [None]:
import asyncio
import aiohttp
import time
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RequestResult:
    """Result from a single request."""
    request_id: int
    start_time: float
    first_token_time: Optional[float]
    end_time: float
    tokens_generated: int
    success: bool
    error: Optional[str] = None
    
    @property
    def ttft(self) -> float:
        """Time to first token in seconds."""
        if self.first_token_time:
            return self.first_token_time - self.start_time
        return 0.0
    
    @property
    def total_time(self) -> float:
        return self.end_time - self.start_time
    
    @property
    def tokens_per_second(self) -> float:
        decode_time = self.total_time - self.ttft
        if decode_time > 0:
            return self.tokens_generated / decode_time
        return 0.0


async def send_vllm_request(
    session: aiohttp.ClientSession,
    request_id: int,
    prompt: str,
    model: str,
    max_tokens: int = 100
) -> RequestResult:
    """
    Send a single request to vLLM and measure timing.
    """
    start_time = time.perf_counter()
    first_token_time = None
    tokens = 0
    
    try:
        async with session.post(
            f"{VLLM_URL}/v1/chat/completions",
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
                "stream": True
            },
            timeout=aiohttp.ClientTimeout(total=60)
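        ) as response:
            # Treat HTTP errors (4xx/5xx) as failures instead of empty successes
            response.raise_for_status()
            import json  # used to decode each streamed chunk
            async for line in response.content:
                line_str = line.decode().strip()
                if line_str.startswith("data: "):
                    data_str = line_str[6:]
                    if data_str == "[DONE]":
                        break
                    try:
                        chunk = json.loads(data_str)
                        # Some chunks may carry an empty "choices" list
                        choices = chunk.get("choices") or [{}]
                        delta = choices[0].get("delta", {})
                        # Each content chunk is counted as one token (approximation)
                        if delta.get("content"):
                            if first_token_time is None:
                                first_token_time = time.perf_counter()
                            tokens += 1
                    except json.JSONDecodeError:
                        # Skip malformed or partial SSE lines
                        pass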
        
        return RequestResult(
            request_id=request_id,
            start_time=start_time,
            first_token_time=first_token_time,
            end_time=time.perf_counter(),
            tokens_generated=tokens,
            success=True
        )
        
    except Exception as e:
        return RequestResult(
            request_id=request_id,
            start_time=start_time,
            first_token_time=None,
            end_time=time.perf_counter(),
            tokens_generated=0,
            success=False,
            error=str(e)
        )

In [None]:
async def run_load_test(
    prompts: List[str],
    model: str,
    concurrency: int = 4,
    max_tokens: int = 100
) -> List[RequestResult]:
    """
    Run a load test with specified concurrency.
    """
    semaphore = asyncio.Semaphore(concurrency)
    
    async def limited_request(session, req_id, prompt):
        async with semaphore:
            return await send_vllm_request(session, req_id, prompt, model, max_tokens)
    
    async with aiohttp.ClientSession() as session:
        tasks = [
            limited_request(session, i, prompt)
            for i, prompt in enumerate(prompts)
        ]
        results = await asyncio.gather(*tasks)
    
    return results


def analyze_load_test(results: List[RequestResult]) -> dict:
    """Analyze load test results."""
    successful = [r for r in results if r.success]
    
    if not successful:
        return {"error": "No successful requests"}
    
    ttfts = [r.ttft * 1000 for r in successful]  # ms
    latencies = [r.total_time * 1000 for r in successful]  # ms
    speeds = [r.tokens_per_second for r in successful if r.tokens_per_second > 0]
    
    total_time = max(r.end_time for r in results) - min(r.start_time for r in results)
    
    return {
        "total_requests": len(results),
        "successful": len(successful),
        "failed": len(results) - len(successful),
        "total_time_s": total_time,
        "throughput_rps": len(successful) / total_time if total_time > 0 else 0,
        "avg_ttft_ms": sum(ttfts) / len(ttfts) if ttfts else 0,
        "p50_ttft_ms": sorted(ttfts)[len(ttfts)//2] if ttfts else 0,
        "p90_ttft_ms": sorted(ttfts)[int(len(ttfts)*0.9)] if ttfts else 0,
        "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0,
        "p90_latency_ms": sorted(latencies)[int(len(latencies)*0.9)] if latencies else 0,
        "avg_tokens_per_sec": sum(speeds) / len(speeds) if speeds else 0
    }

In [None]:
# Run load test (only if vLLM is running)
status = check_vllm_status()

if status["status"] == "running":
    # Prepare test prompts
    test_prompts = [
        "What is the capital of France?",
        "Explain quantum computing in simple terms.",
        "Write a haiku about programming.",
        "What are the benefits of exercise?",
        "How does photosynthesis work?",
        "Describe the water cycle.",
        "What is machine learning?",
        "Name three famous scientists.",
        "What causes earthquakes?",
        "How do vaccines work?",
        "Explain the theory of relativity.",
        "What is artificial intelligence?",
    ]
    
    model = status["models"][0] if status["models"] else "Qwen/Qwen3-8B-Instruct"
    
    print(f"🚀 Running load test against {model}")
    print(f"   Requests: {len(test_prompts)}")
    print(f"   Concurrency levels: [1, 2, 4, 8]")
    print("="*60)
    
    load_test_results = {}
    
    for concurrency in [1, 2, 4, 8]:
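        print(f"\n📊 Testing concurrency={concurrency}...")
        
        # Run the test. Jupyter already runs an event loop, so use top-level
        # `await` here; asyncio.run() raises RuntimeError inside a notebook.
        results = await run_load_test(
            prompts=test_prompts,
            model=model,
            concurrency=concurrency,
            max_tokens=100
        )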
        
        analysis = analyze_load_test(results)
        load_test_results[concurrency] = analysis
        
        print(f"   ✅ Throughput: {analysis['throughput_rps']:.2f} req/s")
        print(f"   ✅ Avg TTFT: {analysis['avg_ttft_ms']:.0f}ms")
        print(f"   ✅ P90 Latency: {analysis['p90_latency_ms']:.0f}ms")
        print(f"   ✅ Success Rate: {analysis['successful']}/{analysis['total_requests']}")

else:
    print("⚠️ vLLM is not running. Please start it first.")
    print("   Simulating results for demonstration...")
    
    # Simulated results for demonstration
    load_test_results = {
        1: {"throughput_rps": 1.2, "avg_ttft_ms": 45, "p90_latency_ms": 850},
        2: {"throughput_rps": 2.3, "avg_ttft_ms": 52, "p90_latency_ms": 920},
        4: {"throughput_rps": 4.1, "avg_ttft_ms": 68, "p90_latency_ms": 1100},
        8: {"throughput_rps": 6.5, "avg_ttft_ms": 95, "p90_latency_ms": 1450},
    }

In [None]:
# Visualize load test results
try:
    import matplotlib.pyplot as plt
    import numpy as np
    
    concurrencies = sorted(load_test_results.keys())
    throughputs = [load_test_results[c]["throughput_rps"] for c in concurrencies]
    ttfts = [load_test_results[c]["avg_ttft_ms"] for c in concurrencies]
    latencies = [load_test_results[c]["p90_latency_ms"] for c in concurrencies]
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Throughput
    axes[0].bar(concurrencies, throughputs, color='steelblue')
    axes[0].set_xlabel('Concurrency')
    axes[0].set_ylabel('Throughput (req/s)')
    axes[0].set_title('Throughput vs Concurrency')
    axes[0].set_xticks(concurrencies)
    
    # TTFT
    axes[1].plot(concurrencies, ttfts, 'o-', color='green', linewidth=2, markersize=8)
    axes[1].set_xlabel('Concurrency')
    axes[1].set_ylabel('Avg TTFT (ms)')
    axes[1].set_title('Time to First Token vs Concurrency')
    axes[1].set_xticks(concurrencies)
    axes[1].grid(True, alpha=0.3)
    
    # Latency
    axes[2].plot(concurrencies, latencies, 's-', color='red', linewidth=2, markersize=8)
    axes[2].set_xlabel('Concurrency')
    axes[2].set_ylabel('P90 Latency (ms)')
    axes[2].set_title('P90 Latency vs Concurrency')
    axes[2].set_xticks(concurrencies)
    axes[2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('vllm_load_test.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("\n📈 Chart saved to vllm_load_test.png")
    
except ImportError:
    print("⚠️ matplotlib not available for visualization")
    print("   Install with: pip install matplotlib")
    print("   Or in NGC container: pip install matplotlib --user")

### 🔍 Understanding the Results

What we observe with continuous batching:

1. **Throughput increases** with concurrency (more requests/second)
2. **TTFT increases slightly** because new prefills share the GPU with requests that are already decoding
3. **Latency increases**, but sub-linearly (the key benefit!)

Without continuous batching, requests would queue behind one another and latency would grow roughly linearly with concurrency.
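
One way to check the "sub-linear" claim on your own numbers is to compare each concurrency level against the lowest one. The sketch below uses the placeholder figures from this notebook's simulated fallback results, not measurements, and `scaling_table` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Placeholder numbers copied from the notebook's simulated fallback results.
results: Dict[int, Dict[str, float]] = {
    1: {"throughput_rps": 1.2, "p90_latency_ms": 850},
    2: {"throughput_rps": 2.3, "p90_latency_ms": 920},
    4: {"throughput_rps": 4.1, "p90_latency_ms": 1100},
    8: {"throughput_rps": 6.5, "p90_latency_ms": 1450},
}


def scaling_table(data: Dict[int, Dict[str, float]]) -> List[Tuple[int, float, float, float]]:
    """Return (concurrency, load factor, throughput speedup, latency ratio) rows."""
    base_c = min(data)
    base = data[base_c]
    rows = []
    for c in sorted(data):
        speedup = data[c]["throughput_rps"] / base["throughput_rps"]
        latency_ratio = data[c]["p90_latency_ms"] / base["p90_latency_ms"]
        rows.append((c, c / base_c, speedup, latency_ratio))
    return rows


rows = scaling_table(results)
for c, load, speedup, latency_ratio in rows:
    print(f"concurrency={c}: load x{load:.0f}, "
          f"throughput x{speedup:.2f}, p90 latency x{latency_ratio:.2f}")
```

With the placeholder data, 8x the load gives roughly 5.4x the throughput for about 1.7x the P90 latency. A latency ratio close to the load factor would instead indicate that requests are queuing.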

---

## Part 4: vLLM Configuration Tuning

Let's explore key vLLM parameters for DGX Spark optimization.

In [None]:
# vLLM Configuration Guide for DGX Spark

vllm_configs = {
    "basic": {
        "description": "Simple setup for testing",
        "flags": {
            "--model": "Qwen/Qwen3-8B-Instruct",
            "--enforce-eager": True,  # Required for ARM64
            "--max-model-len": 4096,
        },
        "use_case": "Development, testing"
    },
    "high_throughput": {
        "description": "Maximize concurrent request handling",
        "flags": {
            "--model": "Qwen/Qwen3-8B-Instruct",
            "--enforce-eager": True,
            "--max-model-len": 4096,
            "--gpu-memory-utilization": 0.95,  # Use more memory
            "--max-num-seqs": 256,  # More concurrent sequences
            "--max-num-batched-tokens": 8192,
        },
        "use_case": "Batch processing, high load"
    },
    "low_latency": {
        "description": "Minimize time to first token",
        "flags": {
            "--model": "Qwen/Qwen3-8B-Instruct",
            "--enforce-eager": True,
            "--max-model-len": 2048,  # Smaller context = faster prefill
            "--gpu-memory-utilization": 0.8,
            "--max-num-seqs": 32,  # Fewer concurrent = faster per-request
        },
        "use_case": "Interactive chat, real-time"
    },
    "large_context": {
        "description": "For long documents and RAG",
        "flags": {
            "--model": "Qwen/Qwen3-8B-Instruct",
            "--enforce-eager": True,
            "--max-model-len": 32768,  # Full context length
            "--gpu-memory-utilization": 0.95,
            "--max-num-seqs": 16,  # Fewer sequences due to memory
        },
        "use_case": "RAG, document QA"
    },
    "large_model": {
        "description": "Running a larger 32B model on 128GB",
        "flags": {
            "--model": "Qwen/Qwen3-32B-Instruct",
            "--enforce-eager": True,
            "--max-model-len": 4096,
            "--gpu-memory-utilization": 0.98,  # Max memory
            "--dtype": "bfloat16",
            "--max-num-seqs": 8,  # Limited by memory
        },
        "use_case": "Highest quality responses"
    }
}

print("📋 vLLM Configuration Profiles for DGX Spark")
print("=" * 70)

for name, config in vllm_configs.items():
    print(f"\n🔧 {name.upper()}")
    print(f"   {config['description']}")
    print(f"   Use case: {config['use_case']}")
    print(f"   Flags:")
    for flag, value in config['flags'].items():
        if value is True:
            print(f"      {flag}")
        else:
            print(f"      {flag} {value}")
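
A profile's flag dictionary can be turned into an actual command line with a few lines of Python. The sketch below is one possible approach: `flags_to_args` is a hypothetical helper, and the example dictionary copies a subset of the `low_latency` profile values.

```python
import shlex
from typing import Dict, List, Union

FlagValue = Union[bool, int, float, str]


def flags_to_args(flags: Dict[str, FlagValue]) -> List[str]:
    """Convert a {flag: value} dict into a list of CLI arguments."""
    args: List[str] = []
    for flag, value in flags.items():
        if value is True:        # boolean switch such as --enforce-eager
            args.append(flag)
        elif value is False or value is None:
            continue             # switch turned off: leave it out
        else:
            args.extend([flag, str(value)])
    return args


example_flags: Dict[str, FlagValue] = {
    "--model": "Qwen/Qwen3-8B-Instruct",
    "--enforce-eager": True,
    "--max-model-len": 2048,
    "--gpu-memory-utilization": 0.8,
}
command = "python -m vllm.entrypoints.openai.api_server " + shlex.join(
    flags_to_args(example_flags)
)
print(command)
```

Building a list first and joining it with `shlex.join` keeps values with spaces or shell metacharacters safely quoted.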

---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting `--enforce-eager` on ARM64

```bash
# ❌ Wrong - Will crash on DGX Spark (ARM64)
python -m vllm.entrypoints.openai.api_server --model llama

# ✅ Right - Disable CUDA graphs for ARM compatibility
python -m vllm.entrypoints.openai.api_server --model llama --enforce-eager
```

**Why:** CUDA graphs have limited ARM64 support. `--enforce-eager` uses standard execution.

### Mistake 2: Setting max-model-len Too High

```bash
# ❌ Wrong - May OOM with many concurrent requests
--max-model-len 131072

# ✅ Right - Balance context length with concurrency
--max-model-len 8192 --max-num-seqs 64
```

**Why:** Worst-case KV cache demand scales with max_model_len × number of concurrent sequences. Don't reserve more context than you'll use.
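
A rough back-of-the-envelope estimate makes this scaling concrete. The sketch below assumes a hypothetical model with 36 layers, 8 KV heads, and a head dimension of 128 stored in 2-byte BF16; these are illustrative values, so read the real ones from your model's `config.json`. The helper names are also hypothetical, and the result is an upper bound that assumes every sequence fills its whole context.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache for one token: keys plus values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def worst_case_kv_gib(max_model_len: int, max_num_seqs: int, bytes_per_token: int) -> float:
    """Upper bound in GiB if every sequence fills the full context."""
    return max_model_len * max_num_seqs * bytes_per_token / 2**30


# ASSUMED architecture values for illustration only.
per_token = kv_cache_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=128)

one_long_sequence = worst_case_kv_gib(131072, 1, per_token)
balanced_setting = worst_case_kv_gib(8192, 64, per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"1 sequence x 131072 tokens: {one_long_sequence:.1f} GiB")
print(f"64 sequences x 8192 tokens: {balanced_setting:.1f} GiB")
```

Real usage is usually lower because most sequences stop well before the context limit, which is the gap PagedAttention exploits by allocating blocks on demand.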

### Mistake 3: Using Wrong Model Format

```python
# ❌ Wrong - GGUF is for llama.cpp, not vLLM
model = "TheBloke/Llama-2-7B-GGUF"

# ✅ Right - Use HuggingFace format
model = "Qwen/Qwen3-8B-Instruct"
```

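**Why:** vLLM is built around Hugging Face format checkpoints and loads them directly. Its GGUF support is experimental, so prefer the original Hugging Face weights.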

---

## ✋ Try It Yourself

### Exercise 1: Find the Optimal Batch Size

Test different `--max-num-seqs` values (8, 16, 32, 64, 128) and find the optimal setting for your workload.

In [None]:
# Exercise 1: Your code here
# TODO: For each max-num-seqs value:
#   1. Start vLLM with that configuration
#   2. Run the load test
#   3. Record throughput and latency
#   4. Find the sweet spot

# Hint: Create a function that generates the vLLM command
# and run load tests at different concurrency levels


### Exercise 2: Compare Streaming vs Non-Streaming

Measure the throughput difference between streaming and non-streaming requests.

In [None]:
# Exercise 2: Your code here
# TODO: Modify send_vllm_request to support non-streaming
# TODO: Compare throughput for streaming vs non-streaming
# TODO: When would you use each?


---

## 🎉 Checkpoint

You've learned:
- ✅ How continuous batching works and why it's powerful
- ✅ How to deploy vLLM on DGX Spark with optimal settings
- ✅ How to measure and analyze performance under load
- ✅ Key configuration parameters for different use cases

---

## 🚀 Challenge (Optional)

**Build an Auto-Scaling vLLM Deployment**

Create a system that:
1. Monitors request queue depth
2. Dynamically adjusts `--max-num-seqs` based on load
3. Alerts when latency exceeds thresholds
4. Logs performance metrics for analysis
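
As a possible starting point for steps 1 and 3, the sketch below reads queue depth from the server's Prometheus-style `/metrics` endpoint. It rests on assumptions you should verify against your running server: the endpoint URL, and the metric names `vllm:num_requests_running` and `vllm:num_requests_waiting`. The helper names and the threshold are hypothetical, and the demo parses a hand-written sample rather than real server output.

```python
import urllib.request
from typing import Dict, Iterable

# ASSUMPTION: metric names as exposed by recent vLLM versions; confirm them
# against the /metrics output of the server you are running.
RUNNING = "vllm:num_requests_running"
WAITING = "vllm:num_requests_waiting"


def fetch_metrics(url: str = "http://localhost:8000/metrics", timeout: float = 5.0) -> str:
    """Download the raw Prometheus text exposition from the server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_gauges(text: str, wanted: Iterable[str]) -> Dict[str, float]:
    """Sum the samples of each wanted metric across all label sets."""
    wanted_set = set(wanted)
    totals: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line and "}" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if name in wanted_set and rest:
            try:
                totals[name] = totals.get(name, 0.0) + float(rest[0])
            except ValueError:
                continue  # ignore samples that are not plain numbers
    return totals


def queue_alert(gauges: Dict[str, float], max_waiting: float = 10.0) -> bool:
    """Return True when the waiting queue is deeper than the threshold."""
    return gauges.get(WAITING, 0.0) > max_waiting


# Offline demo on a hand-written sample (not real server output).
sample = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 12.0
"""
gauges = parse_gauges(sample, [RUNNING, WAITING])
print(gauges, "alert:", queue_alert(gauges))
```

Against a live server, `parse_gauges(fetch_metrics(), [RUNNING, WAITING])` could be called in a loop and the values logged alongside the latency figures from the load test.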

---

## 📖 Further Reading

- [vLLM: Easy, Fast, and Cheap LLM Serving (Paper)](https://arxiv.org/abs/2309.06180)
- [PagedAttention Explained](https://blog.vllm.ai/2023/06/20/vllm.html)
- [vLLM Performance Tuning Guide](https://docs.vllm.ai/en/latest/serving/performance.html)
- [Continuous Batching vs Static Batching](https://www.anyscale.com/blog/continuous-batching-llm-inference)

---

## 🧹 Cleanup

In [None]:
# Cleanup
import gc

# Clear variables
load_test_results = None

gc.collect()

print("✅ Cleanup complete!")
print("\n💡 To stop vLLM container:")
print("   docker ps  # Find the container ID")
print("   docker stop <container_id>")
print("\n   Or to stop all PyTorch containers:")
print("   docker stop $(docker ps -q --filter ancestor=nvcr.io/nvidia/pytorch:25.11-py3)")