# Lab 3.3.1: Inference Engine Benchmark

**Module:** 3.3 - Model Deployment & Inference Engines  
**Time:** 3 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the key metrics for comparing inference engines (TTFT, tokens/sec, throughput)
- [ ] Set up and benchmark the engines you have running (Ollama, vLLM, SGLang), and understand where llama.cpp and TensorRT-LLM fit
- [ ] Know when to use each engine based on your requirements
- [ ] Create comprehensive benchmark reports with visualizations

---

## 📚 Prerequisites

- Completed: Module 3.2 (Quantization & Optimization) or equivalent knowledge
- Knowledge of: Basic Python, REST APIs, understanding of LLM inference
- Installed: at least Ollama, with the server running (`ollama serve`)

---

## 🌍 Real-World Context

**Choosing the right inference engine is like choosing the right vehicle for a journey:**

| If you need... | Choose... | Like... |
|----------------|-----------|--------|
| Quick local testing | Ollama | A bicycle - easy to start, good for short trips |
| Maximum decode speed | llama.cpp | A sports car - fastest on open roads |
| High concurrent users | vLLM | A bus - carries many passengers efficiently |
| Best first-token latency | TensorRT-LLM | A rocket - fastest acceleration |
| Speculative decoding | SGLang | A teleporter - skip the boring parts |

**Real deployment scenarios:**
- **Interactive chatbot** → Optimize for TTFT (time to first token) - users want instant response
- **Batch processing** → Optimize for throughput (requests/sec) - you want maximum efficiency
- **Code completion** → Optimize for decode speed (tokens/sec) - fast completions feel snappy
- **Document summarization** → Optimize for prefill speed - long inputs need fast processing

---

## 🧒 ELI5: What Are Inference Engines?

> **Imagine you're running a restaurant kitchen...**
>
> The **LLM model** is your recipe book - it contains all the knowledge about how to make dishes.
>
> The **inference engine** is your kitchen setup - the stove, ovens, and how you organize your cooks.
>
> Different kitchen setups work better for different situations:
> - **Ollama** = A home kitchen. Easy to use, great for trying recipes, but not built for 100 customers.
> - **llama.cpp** = A food truck. Super efficient, serves one customer at a time REALLY fast.
> - **vLLM** = A commercial kitchen. Can handle many orders at once by sharing the grill.
> - **TensorRT-LLM** = A high-tech kitchen designed specifically for your stove brand. Optimized for NVIDIA GPUs.
>
> **In AI terms:** The inference engine determines HOW your model runs - how memory is managed, how requests are batched, and how fast tokens are generated.

---

## 🔑 Key Metrics Explained

| Metric | What It Measures | Why It Matters |
|--------|------------------|----------------|
| **TTFT** (Time To First Token) | How long until you see the first word | Users perceive this as "response time" |
| **Prefill Speed** | Tokens/sec processing the input | Important for long prompts (RAG, documents) |
| **Decode Speed** | Tokens/sec generating output | How fast text appears after it starts |
| **Throughput** | Requests/sec the system handles | How many users you can serve simultaneously |
| **Latency (P50/P90/P99)** | Time to complete requests | P99 is the latency that 99% of requests stay under, so it captures the slowest 1% (tail latency) |
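
As a rough illustration of how these metrics relate, the sketch below derives them from per-request timestamps. The `RequestTiming` record, its field names, and the `percentile` helper are hypothetical and not part of `benchmark_utils`; treat this as one plausible way to compute the numbers, not the lab's implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request record (all times in seconds)."""
    start: float          # request sent
    first_token: float    # first output token received
    end: float            # last output token received
    prompt_tokens: int
    output_tokens: int


def percentile(values, pct):
    """Nearest-rank percentile (no interpolation, unlike np.percentile's default)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(timings):
    """Derive the metrics in the table above from raw timestamps."""
    ttft = [t.first_token - t.start for t in timings]
    latency = [t.end - t.start for t in timings]
    # Approximation: treats all of TTFT as prefill time (ignores queueing)
    prefill = [t.prompt_tokens / (t.first_token - t.start) for t in timings]
    # The first token arrives at TTFT, so decode covers the remaining tokens
    decode = [(t.output_tokens - 1) / (t.end - t.first_token) for t in timings]
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        "ttft_avg_s": sum(ttft) / len(ttft),
        "prefill_tps": sum(prefill) / len(prefill),
        "decode_tps": sum(decode) / len(decode),
        "throughput_rps": len(timings) / wall,
        "latency_p50_s": percentile(latency, 50),
        "latency_p99_s": percentile(latency, 99),
    }


# Two made-up requests that were sent at the same time
timings = [
    RequestTiming(0.0, 0.2, 2.2, prompt_tokens=100, output_tokens=101),
    RequestTiming(0.0, 0.4, 2.4, prompt_tokens=200, output_tokens=101),
]
stats = summarize(timings)
for key, value in stats.items():
    print(f"{key:>16}: {value:.3f}")
```

With only a handful of samples, high percentiles such as P99 collapse onto the slowest request, so they become meaningful only with many more requests than this lab sends.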

---

## Part 1: Environment Setup

Let's start by setting up our benchmarking environment and checking what's available on your DGX Spark.

In [None]:
# Standard library imports
import json
import os
import sys
import time
import subprocess
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
import warnings
warnings.filterwarnings('ignore')

# Third-party imports
import requests
import numpy as np

# Add scripts directory to path
scripts_path = Path("../scripts").resolve()
sys.path.insert(0, str(scripts_path))

# Import our custom utilities
from benchmark_utils import (
    InferenceBenchmark,
    BenchmarkResult,
    BatchBenchmarkResult,
    load_benchmark_prompts,
    compare_engines,
    format_comparison_table,
    get_gpu_memory_usage
)

print("✅ Imports successful!")
print(f"📁 Scripts path: {scripts_path}")

In [None]:
# Check GPU availability and status
def check_gpu_status() -> bool:
    """
    Check GPU availability and memory.
    
    Returns:
        bool: True if GPU is available and accessible, False otherwise
    """
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total,memory.free,memory.used,utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True
        )
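        if result.returncode == 0:
            # Multi-GPU systems print one line per device; report the first
            first_gpu = result.stdout.strip().splitlines()[0]
            values = [v.strip() for v in first_gpu.split(",")]
            name, util = values[0], values[4]

            print("🖥️ GPU Status:")
            print(f"   Name: {name}")
            try:
                # nvidia-smi reports MiB with nounits; convert to GiB
                total, free, used = (int(v) / 1024 for v in values[1:4])
                print(f"   Memory: {used:.1f}GB used / {total:.1f}GB total ({free:.1f}GB free)")
            except ValueError:
                # Memory fields can be non-numeric (e.g. "[N/A]") on some systems
                print("   Memory: not reported by nvidia-smi")
            print(f"   Utilization: {util}%")
            return True
        print(f"⚠️ nvidia-smi exited with code {result.returncode}")
    except Exception as e:
        print(f"⚠️ GPU check failed: {e}")
    return False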

check_gpu_status()

In [None]:
# Check which inference engines are available
def check_engine_availability() -> Dict[str, str]:
    """
    Check which inference engines are running and accessible.
    
    Returns:
        Dict[str, str]: Dictionary mapping engine names to their base URLs
    """
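    # Note: vLLM and TensorRT-LLM are both probed on port 8000 with the same
    # endpoint, so a single server on that port is reported as both engines.
    # Change one of the ports if you run them side by side.
    engines = {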
        "ollama": "http://localhost:11434",
        "vllm": "http://localhost:8000",
        "sglang": "http://localhost:30000",
        "tensorrt-llm": "http://localhost:8000",
    }
    
    health_endpoints = {
        "ollama": "/api/tags",
        "vllm": "/v1/models",
        "sglang": "/v1/models",
        "tensorrt-llm": "/v1/models",
    }
    
    available = {}
    
    print("🔍 Checking inference engine availability...\n")
    
    for engine, base_url in engines.items():
        endpoint = health_endpoints[engine]
        try:
            response = requests.get(f"{base_url}{endpoint}", timeout=3)
            if response.status_code == 200:
                print(f"   ✅ {engine.ljust(15)} Available at {base_url}")
                available[engine] = base_url
                
                # For Ollama, list available models
                if engine == "ollama":
                    models = response.json().get("models", [])
                    model_names = [m["name"] for m in models[:3]]  # First 3
                    if model_names:
                        print(f"                       Models: {', '.join(model_names)}")
            else:
                print(f"   ❌ {engine.ljust(15)} Not responding (status {response.status_code})")
        except requests.exceptions.ConnectionError:
            print(f"   ❌ {engine.ljust(15)} Not running")
        except Exception as e:
            print(f"   ❌ {engine.ljust(15)} Error: {e}")
    
    print(f"\n📊 Available engines: {len(available)}")
    return available

available_engines = check_engine_availability()

### 🔧 Starting Inference Engines

If you need to start any engines, here are the commands:

**Ollama** (simplest to start):
```bash
# In a separate terminal
ollama serve

# Pull a model if you haven't already
ollama pull llama3.1:8b
```

**vLLM** (using PyTorch NGC container):
```bash
# Start vLLM with Llama 3.1 8B
# Note: Use PyTorch NGC container and install vLLM for DGX Spark ARM64 compatibility
docker run --gpus all -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e HF_TOKEN=$HF_TOKEN \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.11-py3 \
    bash -c "pip install vllm && python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --enforce-eager \
        --dtype bfloat16 \
        --max-model-len 4096"
```

**SGLang**:
```bash
python -m sglang.launch_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 30000
```

---

## Part 2: Load Benchmark Prompts

We'll use a variety of prompts to test different aspects of inference performance.

In [None]:
# Load benchmark prompts from our data file
data_path = Path("../data/benchmark_prompts.json")

if data_path.exists():
    with open(data_path) as f:
        all_prompts = json.load(f)
    
    print("📝 Loaded benchmark prompts:")
    for category, prompts in all_prompts.items():
        if category != "chat":  # Skip chat format for now
            print(f"   • {category}: {len(prompts)} prompts")
else:
    print("⚠️ Benchmark prompts file not found. Creating sample prompts...")
    all_prompts = {
        "short": [
            {"id": "s1", "text": "What is the capital of France?", "expected_tokens": 10},
            {"id": "s2", "text": "What is 2 + 2?", "expected_tokens": 5},
        ],
        "medium": [
            {"id": "m1", "text": "Explain machine learning in 3 sentences.", "expected_tokens": 100},
        ]
    }

In [None]:
# Select prompts for benchmarking
# We'll use a mix of short and medium prompts for this benchmark

benchmark_prompts = []

# Add short prompts (for latency testing)
for p in all_prompts.get("short", [])[:3]:
    benchmark_prompts.append({
        "id": p["id"],
        "text": p["text"],
        "category": "short",
        "max_tokens": 50
    })

# Add medium prompts (for throughput testing)
for p in all_prompts.get("medium", [])[:2]:
    benchmark_prompts.append({
        "id": p["id"],
        "text": p["text"],
        "category": "medium",
        "max_tokens": 200
    })

# Add a code prompt
for p in all_prompts.get("code", [])[:1]:
    benchmark_prompts.append({
        "id": p["id"],
        "text": p["text"],
        "category": "code",
        "max_tokens": 300
    })

print(f"📊 Selected {len(benchmark_prompts)} prompts for benchmarking:")
for p in benchmark_prompts:
    print(f"   [{p['category']}] {p['text'][:50]}...")

---

## Part 3: Single Request Benchmarks

Let's start by benchmarking single requests to understand baseline latency.

### 🧒 ELI5: Latency vs Throughput

> Think of a water slide at a water park:
>
> **Latency** = How long it takes YOU to go from top to bottom (your wait time)
> **Throughput** = How many people go down the slide per hour (total capacity)
>
> You can have a fast slide (low latency) but if only one person can go at a time, throughput is low.
> Or you can have multiple lanes (high throughput) but each lane might be a bit slower.
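
The relationship between the two can be made concrete with Little's Law: in steady state, the average number of requests in flight equals throughput times average latency. The sketch below rearranges that to estimate throughput; the function name and the numbers are illustrative, not measurements from this lab.

```python
def throughput_rps(concurrency: int, avg_latency_s: float) -> float:
    """Little's Law rearranged: throughput = in-flight requests / average latency."""
    return concurrency / avg_latency_s


# Illustrative numbers only
single_lane = throughput_rps(concurrency=1, avg_latency_s=2.0)  # one fast lane
multi_lane = throughput_rps(concurrency=8, avg_latency_s=4.0)   # eight slower lanes

print(f"1 lane : {single_lane:.2f} req/s at 2.0 s per request")
print(f"8 lanes: {multi_lane:.2f} req/s at 4.0 s per request")
```

In this made-up case each request takes twice as long with eight lanes, yet the system serves four times as many requests per second.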

In [None]:
# Benchmark a single engine with single requests
def benchmark_single_requests(
    engine: str,
    model: str,
    prompts: List[Dict],
    base_url: Optional[str] = None
) -> List[BenchmarkResult]:
    """
    Run single-request benchmarks for an engine.
    
    Args:
        engine: Engine name ("ollama", "vllm", etc.)
        model: Model identifier
        prompts: List of prompt dictionaries
        base_url: Optional custom URL
    
    Returns:
        List of BenchmarkResult objects
    """
    print(f"\n🚀 Benchmarking {engine} ({model})...")
    
    try:
        benchmark = InferenceBenchmark(engine=engine, model=model, base_url=base_url)
        
        # Warmup
        print("   Warming up (3 requests)...")
        benchmark.warmup(3)
        
        results = []
        for i, prompt in enumerate(prompts):
            print(f"   Running prompt {i+1}/{len(prompts)}: {prompt['text'][:40]}...", end="")
            
            result = benchmark.run_single(
                prompt["text"],
                max_tokens=prompt.get("max_tokens", 100),
                temperature=0.7,
                stream=True
            )
            results.append(result)
            
            if result.error:
                print(f" ❌ Error: {result.error}")
            else:
                print(f" ✅ TTFT: {result.time_to_first_token*1000:.0f}ms, "
                      f"{result.tokens_per_second:.1f} tok/s")
        
        return results
        
    except Exception as e:
        print(f"   ❌ Failed: {e}")
        return []

In [None]:
# Run benchmarks on available engines
all_results = {}

# Define engine configurations
engine_configs = {
    "ollama": {
        "model": "llama3.1:8b",  # Change this to your installed model
        "base_url": "http://localhost:11434"
    },
    "vllm": {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "base_url": "http://localhost:8000"
    },
    "sglang": {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "base_url": "http://localhost:30000"
    }
}

# Run benchmarks only on available engines
for engine, config in engine_configs.items():
    if engine in available_engines:
        results = benchmark_single_requests(
            engine=engine,
            model=config["model"],
            prompts=benchmark_prompts,
            base_url=config["base_url"]
        )
        if results:
            all_results[engine] = results

print(f"\n📊 Completed benchmarks for {len(all_results)} engines")

In [None]:
# Analyze and compare results
def analyze_results(results_by_engine: Dict[str, List[BenchmarkResult]]) -> Dict[str, Dict[str, Any]]:
    """
    Compute statistics for each engine.
    
    Args:
        results_by_engine: Dictionary mapping engine names to lists of BenchmarkResult
        
    Returns:
        Dictionary with computed statistics for each engine
    """
    import numpy as np  # Ensure numpy is available even if run out of order
    
    analysis = {}
    
    for engine, results in results_by_engine.items():
        successful = [r for r in results if r.error is None]
        
        if not successful:
            continue
            
        ttfts = [r.time_to_first_token * 1000 for r in successful]  # in ms
        speeds = [r.tokens_per_second for r in successful]
        latencies = [r.total_time * 1000 for r in successful]  # in ms
        
        analysis[engine] = {
            "count": len(successful),
            "ttft_avg_ms": np.mean(ttfts),
            "ttft_p50_ms": np.percentile(ttfts, 50),
            "ttft_p90_ms": np.percentile(ttfts, 90),
            "speed_avg_tps": np.mean(speeds),
            "speed_max_tps": np.max(speeds),
            "latency_avg_ms": np.mean(latencies),
            "latency_p90_ms": np.percentile(latencies, 90),
        }
    
    return analysis

if all_results:
    analysis = analyze_results(all_results)
    
    print("\n" + "="*70)
    print("📊 SINGLE REQUEST BENCHMARK RESULTS")
    print("="*70)
    
    # Create comparison table
    print(f"\n{'Engine':<15} {'Avg TTFT':>12} {'P90 TTFT':>12} {'Avg Speed':>12} {'P90 Latency':>12}")
    print("-"*70)
    
    for engine, stats in analysis.items():
        print(f"{engine:<15} "
              f"{stats['ttft_avg_ms']:>10.1f}ms "
              f"{stats['ttft_p90_ms']:>10.1f}ms "
              f"{stats['speed_avg_tps']:>10.1f}/s "
              f"{stats['latency_p90_ms']:>10.1f}ms")
else:
    print("\n⚠️ No results to analyze. Make sure at least one engine is running.")

### 🔍 What Just Happened?

We just measured three key metrics for each engine:

1. **TTFT (Time To First Token)**: How quickly the model starts responding
   - Lower is better for interactive applications
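   - Ollama and llama.cpp often do well here for single requests, but check your own numbers: results depend on hardware, quantization, and prompt length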

2. **Speed (tokens/second)**: How fast tokens are generated
   - Higher is better for long responses
   - llama.cpp typically leads in single-request decode speed

3. **Latency (total request time)**: End-to-end request completion
   - Combines TTFT + (output_length / speed)
   - Important for batch processing
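
The latency formula in item 3 can be worked through with a few lines of arithmetic. This is a minimal sketch with illustrative values (not measured results); `estimate_latency_s` is a hypothetical helper, not part of `benchmark_utils`.

```python
def estimate_latency_s(ttft_s: float, output_tokens: int, decode_tps: float) -> float:
    """Approximate end-to-end latency as TTFT plus decode time."""
    return ttft_s + output_tokens / decode_tps


TTFT_S = 0.15       # illustrative time to first token
DECODE_TPS = 40.0   # illustrative decode speed

chat = estimate_latency_s(TTFT_S, output_tokens=50, decode_tps=DECODE_TPS)
essay = estimate_latency_s(TTFT_S, output_tokens=1000, decode_tps=DECODE_TPS)

print(f"Short answer: {chat:.2f} s (TTFT is {TTFT_S / chat:.0%} of the total)")
print(f"Long answer : {essay:.2f} s (TTFT is {TTFT_S / essay:.0%} of the total)")
```

For short outputs TTFT is a visible share of the total, while for long outputs decode speed dominates, which is why the right metric to optimize depends on the workload.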

---

## Part 4: Concurrent Request Benchmarks

Now let's test how each engine handles multiple simultaneous requests.

### 🧒 ELI5: Why Concurrency Matters

> Imagine you're a chef:
>
> **Single request** = Making one sandwich at a time
> **Concurrent requests** = Making 10 sandwiches at once
>
> Some chefs (engines) can juggle many sandwiches efficiently by reusing ingredients (KV cache sharing).
> Others might get overwhelmed and slow down for everyone.
>
> vLLM is like a chef with a special system for managing multiple orders efficiently.
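
To see what "concurrency" means mechanically, here is a minimal sketch that keeps a fixed number of requests in flight using a thread pool. It does not call any real engine: `fake_request` is a stand-in that just sleeps, and `run_at_concurrency` is a hypothetical helper, not the `run_batch` method used in this lab.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(prompt: str) -> float:
    """Stand-in for an HTTP call to an engine; returns the latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.05)  # pretend the engine takes 50 ms per request
    return time.perf_counter() - start


def run_at_concurrency(prompts, concurrency):
    """Send all prompts with at most `concurrency` requests in flight."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(fake_request, prompts))
    wall = time.perf_counter() - start
    return {
        "throughput_rps": len(prompts) / wall,
        "avg_latency_s": sum(latencies) / len(latencies),
    }


prompts = ["Hello"] * 8
serial = run_at_concurrency(prompts, concurrency=1)
parallel = run_at_concurrency(prompts, concurrency=4)

for name, stats in [("concurrency=1", serial), ("concurrency=4", parallel)]:
    print(f"{name}: {stats['throughput_rps']:.1f} req/s, "
          f"avg latency {stats['avg_latency_s'] * 1000:.0f} ms")
```

The fake engine scales almost perfectly because sleeping threads do not compete for anything. A real engine shares one GPU across requests, so per-request latency usually rises as concurrency grows, and that trade-off is what the benchmark in this part measures.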

In [None]:
# Concurrent benchmark settings
CONCURRENCY_LEVELS = [1, 2, 4, 8]  # Number of simultaneous requests
REQUESTS_PER_LEVEL = 12  # Total requests at each concurrency level

# Prepare prompts for concurrent testing (cycle through them until we have enough,
# even if only one or two benchmark prompts were loaded)
from itertools import cycle, islice
concurrent_prompts = list(islice(cycle(p["text"] for p in benchmark_prompts), REQUESTS_PER_LEVEL))

print("📊 Concurrent Benchmark Configuration:")
print(f"   Concurrency levels: {CONCURRENCY_LEVELS}")
print(f"   Requests per level: {REQUESTS_PER_LEVEL}")
print(f"   Prompts prepared: {len(concurrent_prompts)}")

In [None]:
# Run concurrent benchmarks
concurrent_results = {}

for engine, config in engine_configs.items():
    if engine not in available_engines:
        continue
    
    print(f"\n🚀 Concurrent benchmark: {engine}")
    print("-" * 50)
    
    concurrent_results[engine] = {}
    
    try:
        benchmark = InferenceBenchmark(
            engine=engine,
            model=config["model"],
            base_url=config["base_url"]
        )
        
        for concurrency in CONCURRENCY_LEVELS:
            print(f"   Testing concurrency={concurrency}...", end=" ")
            
            result = benchmark.run_batch(
                prompts=concurrent_prompts,
                max_tokens=100,
                concurrency=concurrency,
                stream=False  # Faster for batch testing
            )
            
            concurrent_results[engine][concurrency] = result
            
            print(f"✅ {result.throughput_rps:.2f} req/s, "
                  f"avg TTFT: {result.avg_ttft*1000:.0f}ms")
            
    except Exception as e:
        print(f"   ❌ Error: {e}")

print("\n✅ Concurrent benchmarks complete!")

In [None]:
# Visualize concurrent benchmark results
try:
    import matplotlib.pyplot as plt
    
    if concurrent_results:
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        
        # Plot 1: Throughput vs Concurrency
        ax1 = axes[0]
        for engine, results in concurrent_results.items():
            concurrencies = sorted(results.keys())
            throughputs = [results[c].throughput_rps for c in concurrencies]
            ax1.plot(concurrencies, throughputs, 'o-', label=engine, linewidth=2, markersize=8)
        
        ax1.set_xlabel("Concurrency (simultaneous requests)", fontsize=12)
        ax1.set_ylabel("Throughput (requests/second)", fontsize=12)
        ax1.set_title("Throughput vs Concurrency", fontsize=14)
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        ax1.set_xticks(CONCURRENCY_LEVELS)
        
        # Plot 2: Latency vs Concurrency
        ax2 = axes[1]
        for engine, results in concurrent_results.items():
            concurrencies = sorted(results.keys())
            latencies = [results[c].p90_latency * 1000 for c in concurrencies]  # in ms
            ax2.plot(concurrencies, latencies, 's-', label=engine, linewidth=2, markersize=8)
        
        ax2.set_xlabel("Concurrency (simultaneous requests)", fontsize=12)
        ax2.set_ylabel("P90 Latency (ms)", fontsize=12)
        ax2.set_title("Latency vs Concurrency", fontsize=14)
        ax2.legend()
        ax2.grid(True, alpha=0.3)
        ax2.set_xticks(CONCURRENCY_LEVELS)
        
        plt.tight_layout()
        plt.savefig("benchmark_results.png", dpi=150, bbox_inches='tight')
        plt.show()
        
        print("\n📈 Chart saved to benchmark_results.png")
    else:
        print("⚠️ No results to visualize")

except ImportError:
    print("⚠️ matplotlib not available for visualization")
    print("   Install with: pip install matplotlib")

---

## Part 5: Generate Benchmark Report

Let's create a comprehensive report of our findings.

In [None]:
# Generate comprehensive benchmark report
def generate_report(single_results, concurrent_results, engine_configs):
    """Generate a markdown benchmark report."""
    
    report = []
    report.append("# Inference Engine Benchmark Report")
    report.append(f"\n**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    report.append(f"**Platform:** DGX Spark (128GB Unified Memory)")
    report.append("")
    
    # GPU Info
    report.append("## Hardware Configuration")
    report.append("")
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            report.append(f"- **GPU:** {result.stdout.strip()}")
    except Exception:
        pass  # nvidia-smi missing or failed; omit the GPU line
    report.append("")
    
    # Engines tested
    report.append("## Engines Tested")
    report.append("")
    report.append("| Engine | Model | URL |")
    report.append("|--------|-------|-----|")
    for engine, config in engine_configs.items():
        if engine in single_results or engine in concurrent_results:
            report.append(f"| {engine} | {config['model']} | {config['base_url']} |")
    report.append("")
    
    # Single request results
    if single_results:
        analysis = analyze_results(single_results)
        
        report.append("## Single Request Performance")
        report.append("")
        report.append("| Engine | Avg TTFT (ms) | P90 TTFT (ms) | Avg Speed (tok/s) | P90 Latency (ms) |")
        report.append("|--------|---------------|---------------|-------------------|------------------|")
        
        for engine, stats in analysis.items():
            report.append(
                f"| {engine} | {stats['ttft_avg_ms']:.1f} | {stats['ttft_p90_ms']:.1f} | "
                f"{stats['speed_avg_tps']:.1f} | {stats['latency_p90_ms']:.1f} |"
            )
        report.append("")
    
    # Concurrent results
    if concurrent_results:
        report.append("## Concurrent Request Performance")
        report.append("")
        report.append("| Engine | Concurrency | Throughput (req/s) | Avg TTFT (ms) | P90 Latency (ms) |")
        report.append("|--------|-------------|-------------------|---------------|------------------|")
        
        for engine, results in concurrent_results.items():
            for conc, result in sorted(results.items()):
                report.append(
                    f"| {engine} | {conc} | {result.throughput_rps:.2f} | "
                    f"{result.avg_ttft*1000:.1f} | {result.p90_latency*1000:.1f} |"
                )
        report.append("")
    
    # Recommendations
    report.append("## Recommendations")
    report.append("")
    report.append("Based on the benchmark results:")
    report.append("")
    
    # analyze_results returns {} when every request failed, so guard before min()/max()
    analysis = analyze_results(single_results) if single_results else {}
    if analysis:
        # Find best for TTFT
        best_ttft = min(analysis.items(), key=lambda x: x[1]['ttft_avg_ms'])
        report.append(f"- **Best for interactive chat (lowest TTFT):** {best_ttft[0]} "
                     f"({best_ttft[1]['ttft_avg_ms']:.0f}ms average)")
        
        # Find best for speed
        best_speed = max(analysis.items(), key=lambda x: x[1]['speed_avg_tps'])
        report.append(f"- **Best for long responses (highest speed):** {best_speed[0]} "
                     f"({best_speed[1]['speed_avg_tps']:.0f} tokens/sec)")
    
    if concurrent_results:
        # Find best throughput at highest concurrency
        max_conc = max(CONCURRENCY_LEVELS)
        best_throughput = None
        best_engine = None
        for engine, results in concurrent_results.items():
            if max_conc in results:
                if best_throughput is None or results[max_conc].throughput_rps > best_throughput:
                    best_throughput = results[max_conc].throughput_rps
                    best_engine = engine
        
        if best_engine:
            report.append(f"- **Best for high load (concurrency={max_conc}):** {best_engine} "
                         f"({best_throughput:.1f} req/sec)")
    
    report.append("")
    
    return "\n".join(report)

# Generate and save report
report = generate_report(all_results, concurrent_results, engine_configs)
print(report)

# Save report
report_path = Path("benchmark_report.md")
with open(report_path, "w") as f:
    f.write(report)
print(f"\n📄 Report saved to: {report_path}")

---

## ⚠️ Common Mistakes

### Mistake 1: Not Warming Up the Engine

```python
# ❌ Wrong - First request is slow due to model loading
result = benchmark.run_single("Hello")  # Includes model load time!

# ✅ Right - Warm up first
benchmark.warmup(3)  # Model is loaded and cached
result = benchmark.run_single("Hello")  # Now measures actual inference
```

**Why:** The first request often includes model loading and JIT compilation overhead. Warming up ensures you're measuring steady-state performance.
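
The effect is easy to reproduce without a real model. The sketch below uses a hypothetical `FakeEngine` whose first call pays a simulated one-time load cost, plus a small `time_with_warmup` helper (also an assumption, not the lab's `warmup` method) that discards warmup calls before timing.

```python
import time


def time_with_warmup(fn, warmup=3, runs=5):
    """Call fn `warmup` times untimed, then return `runs` timed durations in seconds."""
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


class FakeEngine:
    """Hypothetical engine whose first call pays a one-time load cost."""

    def __init__(self):
        self.loaded = False

    def generate(self):
        if not self.loaded:
            time.sleep(0.2)   # simulated model load
            self.loaded = True
        time.sleep(0.01)      # simulated steady-state inference


cold = time_with_warmup(FakeEngine().generate, warmup=0, runs=3)
warm = time_with_warmup(FakeEngine().generate, warmup=1, runs=3)

print(f"No warmup  : first timed run {cold[0] * 1000:.0f} ms")
print(f"With warmup: first timed run {warm[0] * 1000:.0f} ms")
```

Without warmup, the load cost lands in the first timed run and inflates both the average and the tail percentiles.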

### Mistake 2: Comparing Different Model Sizes

```python
# ❌ Wrong - Not a fair comparison
ollama_8b = InferenceBenchmark(engine="ollama", model="llama3.1:8b")
vllm_70b = InferenceBenchmark(engine="vllm", model="Llama-3.1-70B")  # Different size!

# ✅ Right - Same model size
ollama_8b = InferenceBenchmark(engine="ollama", model="llama3.1:8b")
vllm_8b = InferenceBenchmark(engine="vllm", model="meta-llama/Llama-3.1-8B-Instruct")
```

**Why:** Larger models are inherently slower. Compare apples to apples.

### Mistake 3: Ignoring Quantization Differences

```python
# ❌ Wrong - Different quantization levels
ollama_q4 = "llama3.1:8b"  # Usually Q4_K_M
vllm_fp16 = "Llama-3.1-8B"  # Usually FP16

# ✅ Right - Document the quantization
# Note: Ollama uses Q4 GGUF, vLLM uses FP16/BF16
# This explains some performance differences
```

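**Why:** Quantized models (Q4) are typically faster and use less memory than higher-precision ones (FP16/BF16), at some cost in accuracy. Comparing them without noting this mixes up engine speed with quantization speed.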
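
A back-of-the-envelope estimate shows how large the gap is. The sketch below computes approximate weight storage from parameter count and bits per weight. The 4.5 bits-per-weight figure for a 4-bit GGUF quantization is an assumed round number for illustration (the real average depends on the quantization scheme), and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes): parameters x bits / 8."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


fp16 = weight_memory_gb(8, 16)   # FP16/BF16 stores 16 bits per weight
q4 = weight_memory_gb(8, 4.5)    # ASSUMPTION: ~4.5 bits per weight on average

print(f"8B model at FP16: ~{fp16:.1f} GB of weights")
print(f"8B model at Q4  : ~{q4:.1f} GB of weights")
print(f"Ratio           : {fp16 / q4:.1f}x fewer bytes to read per token")
```

Single-request decoding is typically limited by how fast weights can be read from memory, so a model with fewer bytes per weight can look faster regardless of which engine runs it. The estimate ignores the KV cache and activations.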

---

## ✋ Try It Yourself

Now it's your turn! Complete these exercises:

### Exercise 1: Test Different Prompt Lengths

Create a benchmark that tests how different engines handle:
- Very short prompts (5-10 tokens)
- Medium prompts (50-100 tokens)
- Long prompts (500+ tokens)

Which engine handles long contexts best?

In [None]:
# Exercise 1: Your code here
# Hint: Create prompts of different lengths and compare prefill times

# Repetition counts are chosen to land roughly in the token ranges listed above
short_prompt = "What is the capital of France?"
medium_prompt = "Explain the theory of relativity and its implications for modern physics. " * 5
long_prompt = "Write a detailed analysis of the following text: " + ("Lorem ipsum dolor sit amet. " * 100)

# TODO: Run benchmarks on each prompt length
# TODO: Compare TTFT (prefill) performance
# TODO: Which engine is best for long contexts?


<details>
<summary>💡 Hint</summary>

Focus on the `time_to_first_token` metric, which for a single request is dominated by prefill time. Create a table like:

```python
prompt_lengths = [
    ("short", short_prompt),
    ("medium", medium_prompt),
    ("long", long_prompt)
]

for name, prompt in prompt_lengths:
    result = benchmark.run_single(prompt, max_tokens=50, stream=True)  # stream so TTFT can be measured
    print(f"{name}: TTFT={result.time_to_first_token*1000:.0f}ms")
```

</details>

### Exercise 2: Find the Saturation Point

At what concurrency level does each engine start to struggle? Find the "knee" in the latency curve.
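
One possible heuristic for spotting the knee, sketched on synthetic numbers: flag the first concurrency level where latency grows by more than some factor over the previous level. Both the `find_knee` helper and the 1.5x threshold are assumptions chosen for illustration; other definitions of the knee are equally reasonable.

```python
def find_knee(latency_by_concurrency, max_growth=1.5):
    """Return the first concurrency level whose latency exceeds max_growth times
    the latency of the previous level, or None if no level does."""
    levels = sorted(latency_by_concurrency)
    for prev, cur in zip(levels, levels[1:]):
        if latency_by_concurrency[cur] > max_growth * latency_by_concurrency[prev]:
            return cur
    return None


# Synthetic P90 latencies in ms, keyed by concurrency (not measured results)
sample = {1: 900, 2: 950, 4: 1050, 8: 1300, 16: 2600, 32: 5400}

knee = find_knee(sample)
print(f"Latency starts to spike at concurrency={knee}")
```

With real data you would build the dictionary from your own measurements at each concurrency level and sanity-check the result against the plotted curve.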

In [None]:
# Exercise 2: Your code here
# Hint: Test concurrency levels [1, 2, 4, 8, 16, 32, 64] and look for when latency spikes

# TODO: Run concurrent benchmarks at higher levels
# TODO: Plot latency vs concurrency to find the saturation point
# TODO: What's the optimal concurrency for each engine?


---

## 🎉 Checkpoint

You've learned:
- ✅ The key metrics for LLM inference: TTFT, tokens/sec, throughput, latency
- ✅ How to benchmark Ollama, vLLM, and other inference engines
- ✅ The trade-offs between single-request latency and concurrent throughput
- ✅ How to generate comprehensive benchmark reports

---

## 🚀 Challenge (Optional)

**Create an Automated Benchmark Suite**

Build a script that:
1. Automatically detects which engines are running
2. Runs a comprehensive benchmark suite
3. Generates an HTML report with interactive charts
4. Sends alerts if performance drops below thresholds

This is useful for production monitoring!

---

## 📖 Further Reading

- [vLLM Paper: Efficient Memory Management for LLM Serving](https://arxiv.org/abs/2309.06180)
- [TensorRT-LLM Optimization Guide](https://nvidia.github.io/TensorRT-LLM/)
- [llama.cpp Performance Tips](https://github.com/ggerganov/llama.cpp/blob/master/docs/development.md)
- [Continuous Batching Explained](https://www.anyscale.com/blog/continuous-batching-llm-inference)

---

## 🧹 Cleanup

In [None]:
# Cleanup
import gc

# Clear results if they're taking too much memory
# all_results = None
# concurrent_results = None

# Clear Python garbage
gc.collect()

# Clear GPU memory cache if torch is available
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        print("✅ GPU memory cache cleared!")
except ImportError:
    pass

print("✅ Cleanup complete!")
print(f"\n📊 GPU Memory: {get_gpu_memory_usage():.2f} GB used")