# Lab 3.3.3: TensorRT-LLM Optimization

**Module:** 3.3 - Model Deployment & Inference Engines  
**Time:** 3 hours  
**Difficulty:** ⭐⭐⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand TensorRT-LLM's optimization pipeline
- [ ] Build optimized TensorRT engines for LLMs
- [ ] Benchmark prefill performance (TensorRT-LLM's strength)
- [ ] Configure quantization and other optimizations

---

## 📚 Prerequisites

- Completed: Tasks 12.1 and 12.2
- Docker with NVIDIA container toolkit
- ~3 hours for engine build (can run in background)
- HuggingFace access token for gated models

---

## 🌍 Real-World Context

**When to use TensorRT-LLM:**

TensorRT-LLM shines when you need the **fastest possible prefill** (processing the input prompt).
This is critical for:

- **RAG applications**: Processing long retrieved documents
- **Code completion**: Large context windows with existing code
- **Summarization**: Long documents need fast input processing
- **First-response latency**: Getting that first token FAST

Companies like NVIDIA, Meta, and AWS use TensorRT-LLM for production deployments where every millisecond counts.

---

## 🧒 ELI5: What is TensorRT-LLM?

> **Imagine you're a chef preparing for a big dinner...**
>
> **Regular inference** = You cook each dish from scratch when orders come in.
> Every time someone orders pasta, you boil water, make sauce, cook noodles.
>
> **TensorRT-LLM** = You spend HOURS preparing in advance.
> - Pre-chop all vegetables (fuse operations)
> - Pre-make sauces (optimize kernels)
> - Set up assembly lines (in-flight batching)
> - When orders come in, dishes fly out! (fast inference)
>
> The **build time is long** (45-90 minutes), but the **serving is blazing fast**.
>
> **In AI terms:** TensorRT-LLM analyzes your model, fuses operations, generates
> custom CUDA kernels for your specific GPU, and creates an optimized "engine"
> that runs much faster than the original model.

---

## 🔑 TensorRT-LLM Optimizations

| Optimization | What It Does | Benefit |
|--------------|--------------|--------|
| **Operator Fusion** | Combines multiple ops into one | Fewer memory transfers |
| **Custom Kernels** | GPU-specific code generation | Maximum hardware utilization |
| **In-flight Batching** | Like continuous batching | High throughput |
| **Fused MLP** | Combines MLP layers | Faster feedforward |
| **FP8/FP4 Quantization** | Lower precision math | Up to ~2-4x faster on GPUs with hardware support (e.g. Blackwell) |
| **Paged KV Cache** | Dynamic memory allocation | Better memory efficiency |
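
The "Paged KV Cache" row can be made concrete with a toy allocator. This is only a sketch of the general idea (fixed-size blocks handed out on demand and returned when a request finishes); the class and method names are hypothetical and do not mirror TensorRT-LLM's internal implementation.

```python
class PagedKVCache:
    """Toy block allocator: sequences receive fixed-size blocks on demand."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                # tokens stored per block
        self.free_blocks = list(range(num_blocks))  # ids of unused blocks
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block if needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):                    # 17 tokens need two 16-token blocks
    cache.append_token("req-1")
print(len(cache.block_tables["req-1"]), "blocks in use,",
      len(cache.free_blocks), "free")
cache.release("req-1")
```

Because memory is claimed one block at a time, a short request never reserves space for the maximum sequence length, which is where the memory-efficiency benefit in the table comes from.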

---

## Part 1: Environment Setup

TensorRT-LLM requires NVIDIA's container for proper setup.

In [None]:
import subprocess
import os
import sys
import json
import time
from pathlib import Path
from datetime import datetime

def run_command(cmd, shell=True):
    """Run a shell command and return output."""
    result = subprocess.run(cmd, shell=shell, capture_output=True, text=True)
    return result.stdout.strip(), result.stderr.strip(), result.returncode

# Check system
print("🔍 System Check for TensorRT-LLM:")
print("=" * 50)

# Architecture
arch, _, _ = run_command("uname -m")
print(f"   Architecture: {arch}")

# GPU info
gpu_info, _, _ = run_command("nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader")
if gpu_info:
    # nvidia-smi prints one CSV line per GPU; report the first one
    name, compute = gpu_info.splitlines()[0].rsplit(",", 1)
    print(f"   GPU: {name.strip()}")
    print(f"   Compute Capability: {compute.strip()}")

# Check for TRT-LLM container
docker_images, _, _ = run_command("docker images --format '{{.Repository}}:{{.Tag}}' | grep tensorrt")
print(f"   TensorRT-LLM containers: {docker_images if docker_images else 'None found'}")

### 🐳 Setting Up TensorRT-LLM Container

TensorRT-LLM is best run in NVIDIA's official container:

In [None]:
# Generate container setup commands
import platform

# Check architecture first - critical for DGX Spark
arch = platform.machine()
print(f"🔍 System Architecture: {arch}")

if arch == "aarch64":
    print("   ✓ ARM64 (aarch64) detected, e.g. DGX Spark")
    print("")
    print("   ⚠️  IMPORTANT: Verify container ARM64 support before using!")
    print("   Check NGC catalog: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver")
    print("")
    print("   If the TensorRT-LLM container doesn't support ARM64, use this alternative:")
    print("   nvcr.io/nvidia/pytorch:25.11-py3 (then install TensorRT-LLM from source)")
elif arch == "x86_64":
    print("   ✓ x86_64 architecture detected")
else:
    print(f"   ⚠️  Unknown architecture: {arch}")

# TensorRT-LLM container configuration
# NOTE: Verify ARM64 support at NGC catalog before using on DGX Spark
trtllm_container = "nvcr.io/nvidia/tritonserver:25.11-trtllm-python-py3"

workspace_dir = Path.home() / "trtllm-workspace"
models_dir = workspace_dir / "models"
engines_dir = workspace_dir / "engines"

print("📦 TensorRT-LLM Container Setup for DGX Spark")
print("=" * 60)
print(f"""
# Step 1: Create workspace directories
mkdir -p {models_dir}
mkdir -p {engines_dir}

# Step 2: For 70B models, clear buffer cache first
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

# Step 3: Pull the TensorRT-LLM container
# NOTE: Verify ARM64 support at NGC catalog before pulling
docker pull {trtllm_container}

# Step 4: Start interactive container
# Key flags for DGX Spark:
#   --ipc=host       : Required for DataLoader workers
#   --shm-size=16g   : Shared memory for optimization
#   --ulimit memlock=-1 : Unlimited locked memory
docker run --gpus all -it --rm \\
    -v {workspace_dir}:/workspace \\
    -v ~/.cache/huggingface:/root/.cache/huggingface \\
    -e HF_TOKEN=$HF_TOKEN \\
    --ipc=host \\
    --shm-size=16g \\
    --ulimit memlock=-1 \\
    {trtllm_container} \\
    bash

# Inside the container, TensorRT-LLM is pre-installed!
""")

print("💡 Copy these commands to your terminal to get started.")
print("⚠️ Verify container ARM64 support at: https://catalog.ngc.nvidia.com/")
print("   If no ARM64 support, use PyTorch NGC container and install TensorRT-LLM from source.")
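
The ARM64 check can also be scripted instead of done by hand in the NGC catalog. The sketch below assumes the Docker CLI is installed and that `docker manifest inspect` can reach the registry (for `nvcr.io` this may require `docker login`). The helper names are hypothetical. Note that Docker reports the architecture as `arm64`, while Python's `platform.machine()` reports `aarch64`.

```python
import json
import subprocess
from typing import Optional, Set


def architectures_from_manifest(manifest_json: str) -> Set[str]:
    """Collect the platform architectures listed in a multi-arch manifest."""
    manifest = json.loads(manifest_json)
    return {
        entry["platform"]["architecture"]
        for entry in manifest.get("manifests", [])
        if "architecture" in entry.get("platform", {})
    }


def image_supports_arch(image: str, arch: str = "arm64") -> Optional[bool]:
    """Return True/False when the manifest lists platforms, None when unknown."""
    try:
        result = subprocess.run(
            ["docker", "manifest", "inspect", image],
            capture_output=True, text=True, timeout=60,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None  # Docker CLI missing or registry too slow
    if result.returncode != 0:
        return None  # e.g. not logged in, or the tag does not exist
    archs = architectures_from_manifest(result.stdout)
    # Single-architecture images carry no platform list, so stay undecided
    return (arch in archs) if archs else None


# Hypothetical manifest snippet, used only to demonstrate the parsing step
sample = ('{"manifests": ['
          '{"platform": {"architecture": "amd64", "os": "linux"}}, '
          '{"platform": {"architecture": "arm64", "os": "linux"}}]}')
print(sorted(architectures_from_manifest(sample)))
```

A `None` result means "could not determine", not "unsupported", so fall back to the manual catalog check in that case.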

---

## Part 2: Understanding the TensorRT-LLM Pipeline

Building a TensorRT engine involves several steps:

```
HuggingFace Model → Convert to TRT-LLM format → Build TensorRT Engine → Deploy
```

### Pipeline Diagram

```
┌─────────────────────┐
│  HuggingFace Model  │  (meta-llama/Llama-3.1-8B)
└──────────┬──────────┘
           │
           ▼ convert_checkpoint.py
┌─────────────────────┐
│  TRT-LLM Checkpoint │  (optimized weights format)
└──────────┬──────────┘
           │
           ▼ trtllm-build
┌─────────────────────┐
│  TensorRT Engine    │  (GPU-specific binary)
└──────────┬──────────┘
           │
           ▼ Triton or TRT-LLM Server
┌─────────────────────┐
│  Production API     │  (OpenAI-compatible)
└─────────────────────┘
```

In [None]:
# TensorRT-LLM Build Configuration
# This generates the commands you'll run inside the container

def generate_trtllm_build_commands(
    model_name: str = "meta-llama/Llama-3.1-8B-Instruct",
    output_name: str = "llama-8b-trtllm",
    dtype: str = "bfloat16",
    max_input_len: int = 4096,
    max_output_len: int = 2048,
    max_batch_size: int = 8,
    use_fused_mlp: bool = True,
    quantization: str = None,  # None, "fp8", "int8_sq"
) -> str:
    """
    Generate TensorRT-LLM build commands.
    
    Args:
        model_name: HuggingFace model ID
        output_name: Name for the output engine
        dtype: Data type (float16, bfloat16)
        max_input_len: Maximum input sequence length
        max_output_len: Maximum output sequence length
        max_batch_size: Maximum batch size
        use_fused_mlp: Enable fused MLP optimization
        quantization: Quantization method if any
    """
    
    checkpoint_dir = f"/workspace/checkpoints/{output_name}"
    engine_dir = f"/workspace/engines/{output_name}"
    
    # Step 1: Convert checkpoint
    convert_cmd = f"""# Step 1: Convert HuggingFace model to TRT-LLM checkpoint
python /opt/TensorRT-LLM/examples/llama/convert_checkpoint.py \\
    --model_dir {model_name} \\
    --output_dir {checkpoint_dir} \\
    --dtype {dtype}"""
    
    # NOTE: quantization flag names differ between TensorRT-LLM releases;
    # confirm them with `convert_checkpoint.py --help` before running.
    if quantization == "fp8":
        convert_cmd += " \\\n    --use_fp8"
    elif quantization == "int8_sq":
        convert_cmd += " \\\n    --use_smooth_quant"
    
    # Step 2: Build engine
    build_cmd = f"""\n\n# Step 2: Build TensorRT engine (this takes 45-90 minutes)
trtllm-build \\
    --checkpoint_dir {checkpoint_dir} \\
    --output_dir {engine_dir} \\
    --max_input_len {max_input_len} \\
    --max_seq_len {max_input_len + max_output_len} \\
    --max_batch_size {max_batch_size} \\
    --gemm_plugin {dtype}"""
    
    if use_fused_mlp:
        build_cmd += " \\\n    --use_fused_mlp enable"
    
    # Step 3: Smoke-test the engine (run.py generates once and exits; it is not a server)
    server_cmd = f"""\n\n# Step 3: Test the engine with a single prompt
python /opt/TensorRT-LLM/examples/run.py \\
    --engine_dir {engine_dir} \\
    --tokenizer_dir {model_name} \\
    --max_output_len 512 \\
    --input_text 'Hello, how are you?'"""
    
    return convert_cmd + build_cmd + server_cmd

# Generate commands for Llama 3.1 8B
print("🔧 TensorRT-LLM Build Commands for Llama 3.1 8B")
print("=" * 60)
print(generate_trtllm_build_commands())

### ⏱️ Build Time Expectations

| Model Size | Approximate Build Time | Engine Size |
|------------|------------------------|-------------|
| 7-8B | 45-60 minutes | ~15 GB |
| 13B | 60-90 minutes | ~25 GB |
| 70B | 2-3 hours | ~140 GB |

**Tip:** Start the build and let it run in the background while you work on other tasks!
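
The engine sizes in the table track the size of the weights fairly closely. As a rough sanity check, assuming the engine is dominated by weights and using round parameter counts, the size is roughly parameters times bytes per parameter. The function name and the dtype table below are illustrative, not part of TensorRT-LLM.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float16": 2, "bfloat16": 2, "fp8": 1, "int8": 1}


def estimate_weight_size_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for label, params in [("8B", 8e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{label}: ~{estimate_weight_size_gb(params):.0f} GB in bfloat16, "
          f"~{estimate_weight_size_gb(params, 'fp8'):.0f} GB in fp8")
```

The bfloat16 estimates (about 16, 26, and 140 GB) land close to the table's values, and the fp8 column shows why quantization roughly halves the disk and memory footprint.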

---

## Part 3: Configuration Options Deep Dive

Let's explore the key configuration options for TensorRT-LLM.

In [None]:
# TensorRT-LLM Configuration Guide

trtllm_configs = {
    "low_latency": {
        "description": "Minimize time-to-first-token",
        "settings": {
            "max_batch_size": 4,
            "max_input_len": 2048,
            "max_output_len": 512,
            "use_fused_mlp": True,
            "paged_kv_cache": True,
            "dtype": "bfloat16"
        },
        "use_case": "Interactive chat, real-time applications"
    },
    "high_throughput": {
        "description": "Maximize requests per second",
        "settings": {
            "max_batch_size": 64,
            "max_input_len": 4096,
            "max_output_len": 2048,
            "use_fused_mlp": True,
            "paged_kv_cache": True,
            "inflight_batching": True,
            "dtype": "bfloat16"
        },
        "use_case": "Batch processing, API serving"
    },
    "long_context": {
        "description": "For RAG and document processing",
        "settings": {
            "max_batch_size": 8,
            "max_input_len": 32768,
            "max_output_len": 4096,
            "use_fused_mlp": True,
            "paged_kv_cache": True,
            "dtype": "bfloat16"
        },
        "use_case": "RAG, document summarization"
    },
    "fp8_quantized": {
        "description": "FP8 for Blackwell GPU (up to ~2x speedup)",
        "settings": {
            "max_batch_size": 32,
            "max_input_len": 4096,
            "max_output_len": 2048,
            "use_fused_mlp": True,
            "paged_kv_cache": True,
            "quantization": "fp8",
            "dtype": "float16"  # FP8 compute, FP16 I/O
        },
        "use_case": "Maximum performance on Blackwell"
    }
}

print("📋 TensorRT-LLM Configuration Profiles")
print("=" * 70)

for name, config in trtllm_configs.items():
    print(f"\n🔧 {name.upper()}")
    print(f"   Description: {config['description']}")
    print(f"   Use case: {config['use_case']}")
    print(f"   Settings:")
    for key, value in config['settings'].items():
        print(f"      {key}: {value}")
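
The profiles above trade memory for batch size and context length, and most of that memory goes to the KV cache. A worst-case estimate (every sequence in the batch at full length) can be sketched as follows. The defaults assume Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128) and a 2-byte cache dtype; the function name is hypothetical.

```python
def kv_cache_bytes(
    max_batch_size: int,
    max_seq_len: int,
    num_layers: int = 32,      # assumption: Llama 3.1 8B
    num_kv_heads: int = 8,     # assumption: grouped-query attention, 8 KV heads
    head_dim: int = 128,       # assumption: 4096 hidden size / 32 heads
    bytes_per_elem: int = 2,   # float16 / bfloat16 cache
) -> int:
    """Upper bound on KV cache size when every sequence reaches max_seq_len."""
    # Each token stores one key and one value vector per layer and KV head
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return max_batch_size * max_seq_len * per_token


profiles = {
    "low_latency": (4, 2048 + 512),
    "high_throughput": (64, 4096 + 2048),
    "long_context": (8, 32768 + 4096),
}
for name, (batch, seq_len) in profiles.items():
    gib = kv_cache_bytes(batch, seq_len) / 2**30
    print(f"{name}: up to {gib:.2f} GiB of KV cache")
```

Under these assumptions the three profiles need up to 1.25, 48, and 36 GiB respectively. Paged KV cache allocates blocks on demand, so actual usage is usually well below this bound.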

### 🔑 Key Parameters Explained

| Parameter | Description | Trade-off |
|-----------|-------------|----------|
| `max_batch_size` | Maximum concurrent requests | Higher = more throughput, more memory |
| `max_input_len` | Maximum prompt length | Higher = longer context, more memory |
| `max_output_len` | Maximum generated length | Higher = longer responses, more memory |
| `use_fused_mlp` | Fuse MLP operations | Usually faster; may slightly reduce accuracy with FP8 |
| `paged_kv_cache` | Dynamic KV cache allocation | More efficient memory |
| `gemm_plugin` | Use optimized GEMM kernels | Faster matrix operations |

---

## Part 4: Benchmarking TensorRT-LLM

Let's set up benchmarking for when your engine is built.

In [None]:
import requests
import time
from typing import List, Dict, Optional
from dataclasses import dataclass
import json

@dataclass
class TRTLLMBenchmarkResult:
    """Result from TensorRT-LLM benchmark."""
    prompt: str
    prompt_tokens: int
    output_tokens: int
    prefill_time_ms: float
    decode_time_ms: float
    total_time_ms: float
    prefill_tokens_per_sec: float
    decode_tokens_per_sec: float
    
def benchmark_trtllm(
    server_url: str,
    prompt: str,
    max_tokens: int = 100
) -> Optional[TRTLLMBenchmarkResult]:
    """
    Benchmark a single request to TensorRT-LLM server.
    
    Args:
        server_url: TRT-LLM server URL
        prompt: Input prompt
        max_tokens: Maximum tokens to generate
    """
    try:
        start_time = time.perf_counter()
        first_token_time = None
        output_tokens = 0
        
        response = requests.post(
            f"{server_url}/v1/chat/completions",
            json={
                "model": "tensorrt_llm",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
                "stream": True
            },
            stream=True,
            timeout=120
        )
        response.raise_for_status()  # surface HTTP errors instead of parsing an error body
        
        for line in response.iter_lines():
            if line:
                line_str = line.decode()
                if line_str.startswith("data: "):
                    data_str = line_str[6:]
                    if data_str == "[DONE]":
                        break
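                    try:
                        chunk = json.loads(data_str)
                    except json.JSONDecodeError:
                        continue  # skip malformed or keep-alive lines
                    # Guard against chunks with an empty "choices" list
                    choices = chunk.get("choices") or [{}]
                    if choices[0].get("delta", {}).get("content"):
                        if first_token_time is None:
                            first_token_time = time.perf_counter()
                        output_tokens += 1  # approximation: one content chunk per token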
        
        end_time = time.perf_counter()
        
        # Calculate metrics
        prefill_time = (first_token_time - start_time) * 1000 if first_token_time else 0
        total_time = (end_time - start_time) * 1000
        decode_time = total_time - prefill_time
        
        # Estimate prompt tokens (rough)
        prompt_tokens = len(prompt.split()) * 1.3
        
        return TRTLLMBenchmarkResult(
            prompt=prompt[:50] + "...",
            prompt_tokens=int(prompt_tokens),
            output_tokens=output_tokens,
            prefill_time_ms=prefill_time,
            decode_time_ms=decode_time,
            total_time_ms=total_time,
            prefill_tokens_per_sec=prompt_tokens / (prefill_time / 1000) if prefill_time > 0 else 0,
            decode_tokens_per_sec=output_tokens / (decode_time / 1000) if decode_time > 0 else 0
        )
        
    except Exception as e:
        print(f"Benchmark error: {e}")
        return None
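
The prefill and decode figures are derived from three timestamps, and it helps to see that arithmetic in isolation. The sketch below is a hypothetical helper, not part of the benchmark above. It differs in one deliberate way: the first token is excluded from the decode rate, because it arrives at the end of prefill.

```python
def split_stream_timing(
    start: float,
    first_token: float,
    end: float,
    prompt_tokens: int,
    output_tokens: int,
) -> dict:
    """Split one streamed request into prefill and decode metrics."""
    ttft = first_token - start          # time to first token, in seconds
    decode_time = end - first_token     # time spent on the remaining tokens
    decode_steps = max(output_tokens - 1, 0)  # first token belongs to prefill
    return {
        "ttft_ms": ttft * 1000,
        "prefill_tok_s": prompt_tokens / ttft if ttft > 0 else 0.0,
        "decode_tok_s": decode_steps / decode_time if decode_time > 0 else 0.0,
    }


# Hypothetical timestamps: first token after 50 ms, stream finished at 1.05 s
metrics = split_stream_timing(0.0, 0.05, 1.05, prompt_tokens=250, output_tokens=101)
print({name: round(value, 1) for name, value in metrics.items()})
```

Time to first token measured from the client also includes network and queueing delay, so the prefill rate computed this way is a lower bound on what the engine itself achieves.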

In [None]:
# Prefill-focused benchmark prompts (different input lengths)
prefill_benchmark_prompts = {
    "short": "What is 2+2?",
    "medium": """Explain the concept of neural networks and how they learn from data. 
    Include information about backpropagation, gradient descent, and activation functions.""",
    "long": """The following is an excerpt from a technical document about machine learning:

Machine learning is a subset of artificial intelligence that enables systems to learn and 
improve from experience without being explicitly programmed. The field has evolved significantly 
over the past decades, from simple linear regression models to complex deep neural networks 
that can process images, text, and audio with remarkable accuracy.

Deep learning, a subset of machine learning, uses artificial neural networks with multiple 
layers to progressively extract higher-level features from raw input. For example, in image 
processing, lower layers may identify edges, while higher layers may identify concepts relevant 
to humans such as digits or letters or faces.

The training process involves feeding large amounts of labeled data through the network, 
calculating the error between predictions and actual values, and then adjusting the network's 
parameters to minimize this error. This is typically done using optimization algorithms like 
stochastic gradient descent (SGD) or Adam.

Transformer models, introduced in 2017, have revolutionized natural language processing. 
These models use self-attention mechanisms to process input sequences in parallel, allowing 
them to capture long-range dependencies more effectively than previous recurrent architectures.

Given this context, please summarize the key points about machine learning evolution.""",
}

print("📝 Prefill Benchmark Prompts:")
for name, prompt in prefill_benchmark_prompts.items():
    word_count = len(prompt.split())
    estimated_tokens = int(word_count * 1.3)
    print(f"   {name}: ~{estimated_tokens} tokens ({word_count} words)")

In [None]:
# Run prefill benchmark (if TRT-LLM is running)
TRTLLM_URL = "http://localhost:8000"  # Adjust if using different port

def check_trtllm_status(url: str) -> bool:
    """Check if TensorRT-LLM server is running."""
    try:
        response = requests.get(f"{url}/v1/models", timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

if check_trtllm_status(TRTLLM_URL):
    print("✅ TensorRT-LLM server is running!")
    print("\n📊 Running prefill-focused benchmark...")
    print("=" * 60)
    
    results = []
    for name, prompt in prefill_benchmark_prompts.items():
        print(f"\nTesting {name} prompt...")
        result = benchmark_trtllm(TRTLLM_URL, prompt, max_tokens=100)
        if result:
            results.append((name, result))
            print(f"   Prefill: {result.prefill_time_ms:.1f}ms "
                  f"({result.prefill_tokens_per_sec:.0f} tok/s)")
            print(f"   Decode:  {result.decode_time_ms:.1f}ms "
                  f"({result.decode_tokens_per_sec:.0f} tok/s)")
    
    # Summary
    if results:
        print("\n" + "=" * 60)
        print("📈 PREFILL PERFORMANCE SUMMARY")
        print("=" * 60)
        print(f"{'Prompt':<10} {'Tokens':<10} {'Prefill (ms)':<15} {'Prefill (tok/s)':<15}")
        print("-" * 50)
        for name, r in results:
            print(f"{name:<10} {r.prompt_tokens:<10} {r.prefill_time_ms:<15.1f} {r.prefill_tokens_per_sec:<15.0f}")
else:
    print("❌ TensorRT-LLM server is not running")
    print("\n📝 Simulated benchmark results for demonstration:")
    print("")
    print("TensorRT-LLM excels at prefill (processing input):")
    print(f"{'Prompt':<10} {'Tokens':<10} {'Prefill (ms)':<15} {'Prefill (tok/s)':<15}")
    print("-" * 50)
    print(f"{'short':<10} {'10':<10} {'8.5':<15} {'1176':<15}")
    print(f"{'medium':<10} {'50':<10} {'15.2':<15} {'3289':<15}")
    print(f"{'long':<10} {'250':<10} {'45.8':<15} {'5459':<15}")
    print("\n💡 Note: TRT-LLM's prefill speed scales well with longer inputs!")

### 🔍 Understanding Prefill Performance

As a rough guide (actual results vary with GPU, model size, precision, and batch size), TensorRT-LLM typically achieves:
- **Prefill: 3,000-10,000 tokens/second** (vs 500-2,000 for other engines)
- **Decode: 50-150 tokens/second** (similar to other engines)

This makes TensorRT-LLM ideal for:
- Long context applications (RAG)
- Latency-sensitive first-token requirements
- High-throughput batch processing

---

## Part 5: Comparing TensorRT-LLM vs Other Engines

Let's create a comparison framework.

In [None]:
# Comparison data (from typical benchmarks)
# These are representative values - actual results will vary

comparison_data = {
    "Ollama": {
        "prefill_tok_s": 800,
        "decode_tok_s": 85,
        "ttft_ms": 45,
        "setup_time": "Minutes",
        "ease_of_use": 5,
        "best_for": "Development, easy setup"
    },
    "vLLM": {
        "prefill_tok_s": 1500,
        "decode_tok_s": 75,
        "ttft_ms": 35,
        "setup_time": "Minutes",
        "ease_of_use": 4,
        "best_for": "High throughput, batching"
    },
    "TensorRT-LLM": {
        "prefill_tok_s": 5000,
        "decode_tok_s": 70,
        "ttft_ms": 20,
        "setup_time": "Hours",
        "ease_of_use": 2,
        "best_for": "Lowest latency, long context"
    },
    "llama.cpp": {
        "prefill_tok_s": 600,
        "decode_tok_s": 95,
        "ttft_ms": 50,
        "setup_time": "Minutes",
        "ease_of_use": 3,
        "best_for": "Fastest decode, GGUF format"
    }
}

print("📊 Inference Engine Comparison (8B model, typical values)")
print("=" * 80)
print(f"{'Engine':<15} {'Prefill':<15} {'Decode':<12} {'TTFT':<10} {'Setup':<10} {'Best For'}")
print(f"{'':15} {'(tok/s)':<15} {'(tok/s)':<12} {'(ms)':<10} {'Time':<10}")
print("-" * 80)

for engine, data in comparison_data.items():
    print(f"{engine:<15} {data['prefill_tok_s']:<15} {data['decode_tok_s']:<12} "
          f"{data['ttft_ms']:<10} {data['setup_time']:<10} {data['best_for']}")
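
To see how these numbers translate into a choice of engine, a simple single-request model helps: total latency is roughly prompt tokens divided by prefill rate plus output tokens divided by decode rate. The sketch below copies the representative rates from the table above. It ignores batching, queueing, and network time, so treat it as an illustration rather than a prediction.

```python
# Representative rates copied from the comparison table (tokens/second)
ENGINES = {
    "Ollama": {"prefill_tok_s": 800, "decode_tok_s": 85},
    "vLLM": {"prefill_tok_s": 1500, "decode_tok_s": 75},
    "TensorRT-LLM": {"prefill_tok_s": 5000, "decode_tok_s": 70},
    "llama.cpp": {"prefill_tok_s": 600, "decode_tok_s": 95},
}


def estimate_latency_s(engine: str, prompt_tokens: int, output_tokens: int) -> float:
    """Single-request latency: prefill time plus decode time."""
    rates = ENGINES[engine]
    return (prompt_tokens / rates["prefill_tok_s"]
            + output_tokens / rates["decode_tok_s"])


def fastest_engine(prompt_tokens: int, output_tokens: int) -> str:
    """Engine with the lowest estimated latency for this workload shape."""
    return min(ENGINES, key=lambda name: estimate_latency_s(name, prompt_tokens, output_tokens))


workloads = {"RAG-style": (4000, 100), "chat-style": (50, 500)}
for label, (prompt_tokens, output_tokens) in workloads.items():
    winner = fastest_engine(prompt_tokens, output_tokens)
    seconds = estimate_latency_s(winner, prompt_tokens, output_tokens)
    print(f"{label}: {winner} (~{seconds:.1f}s)")
```

With these figures, the long-prompt workload favors TensorRT-LLM (about 2.2 s), while the short-prompt, long-answer workload favors llama.cpp (about 5.3 s), since decode time dominates there.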

In [None]:
# Visualize comparison
try:
    import matplotlib.pyplot as plt
    import numpy as np
    
    engines = list(comparison_data.keys())
    prefill = [comparison_data[e]["prefill_tok_s"] for e in engines]
    decode = [comparison_data[e]["decode_tok_s"] for e in engines]
    ttft = [comparison_data[e]["ttft_ms"] for e in engines]
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    colors = ['#4C78A8', '#F58518', '#E45756', '#72B7B2']
    
    # Prefill speed
    axes[0].bar(engines, prefill, color=colors)
    axes[0].set_ylabel('Tokens/second')
    axes[0].set_title('Prefill Speed (higher is better)')
    axes[0].tick_params(axis='x', rotation=15)
    
    # Decode speed
    axes[1].bar(engines, decode, color=colors)
    axes[1].set_ylabel('Tokens/second')
    axes[1].set_title('Decode Speed (higher is better)')
    axes[1].tick_params(axis='x', rotation=15)
    
    # TTFT
    axes[2].bar(engines, ttft, color=colors)
    axes[2].set_ylabel('Milliseconds')
    axes[2].set_title('Time to First Token (lower is better)')
    axes[2].tick_params(axis='x', rotation=15)
    
    plt.tight_layout()
    plt.savefig('engine_comparison.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("\n📈 Chart saved to engine_comparison.png")
    
except ImportError:
    print("⚠️ matplotlib not available for visualization")
    print("   Install with: pip install matplotlib")
    print("   Or in NGC container: pip install matplotlib --user")

---

## ⚠️ Common Mistakes

### Mistake 1: Building Without Enough Memory

```bash
# ❌ Wrong - Build will fail or be very slow
docker run --gpus all -it nvcr.io/nvidia/tritonserver:25.11-trtllm-python-py3

# ✅ Right - Allocate enough shared memory
docker run --gpus all --shm-size=16g -it nvcr.io/nvidia/tritonserver:25.11-trtllm-python-py3
```

**Why:** TensorRT-LLM's build process needs significant shared memory for graph optimization.

### Mistake 2: Mismatched max_input_len and max_seq_len

```bash
# ❌ Wrong - max_seq_len must be >= max_input_len + max_output_len
trtllm-build --max_input_len 4096 --max_seq_len 4096

# ✅ Right - Leave room for output
trtllm-build --max_input_len 4096 --max_seq_len 6144  # 4096 + 2048
```

**Why:** If max_seq_len equals max_input_len, there's no room for output tokens.
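
This constraint is easy to check before starting a long build. A minimal sketch, with a hypothetical helper name:

```python
def check_sequence_budget(max_input_len: int, max_output_len: int, max_seq_len: int) -> int:
    """Return the spare tokens in max_seq_len, or raise if output cannot fit."""
    required = max_input_len + max_output_len
    if max_seq_len < required:
        raise ValueError(
            f"max_seq_len={max_seq_len} is too small: "
            f"{max_input_len} input + {max_output_len} output = {required}"
        )
    return max_seq_len - required


print(check_sequence_budget(4096, 2048, 6144))  # exact fit, no spare tokens

try:
    check_sequence_budget(4096, 2048, 4096)
except ValueError as err:
    print(f"Rejected: {err}")
```

Running this check on your build configuration costs nothing and avoids discovering the problem only after the engine has been built.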

### Mistake 3: Not Using gemm_plugin

```bash
# ❌ Wrong - Slower matrix operations
trtllm-build --output_dir ./engine

# ✅ Right - Use optimized GEMM kernels
trtllm-build --output_dir ./engine --gemm_plugin bfloat16
```

**Why:** The GEMM plugin provides significant speedups for matrix multiplications.

---

## ✋ Try It Yourself

### Exercise 1: Build a TensorRT Engine

Follow the steps above to build a TensorRT engine for Llama 3.1 8B. Time the build process and note the engine size.

In [None]:
# Exercise 1: Document your build process

# Build Configuration:
build_config = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "dtype": "bfloat16",
    "max_input_len": 4096,
    "max_output_len": 2048,
    "max_batch_size": 8,
    # Add your other settings...
}

# Results (fill in after building):
build_results = {
    "build_time_minutes": None,  # TODO: Record this
    "engine_size_gb": None,      # TODO: Check with ls -lh
    "errors_encountered": [],     # TODO: Document any issues
}

print("📝 Document your build results above!")

### Exercise 2: Prefill Scaling Test

Test how TensorRT-LLM's prefill speed scales with input length.

In [None]:
# Exercise 2: Prefill scaling test
# Create prompts of lengths: 100, 500, 1000, 2000, 4000 tokens
# Measure prefill time for each
# Plot the relationship

# TODO: Your code here
# Hint: Use lorem ipsum or repeated text to create consistent prompts
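
As a starting point for the hint, here is one possible way to build prompts of an approximate token length by repeating filler text. It reuses the rough 1.3 tokens-per-word heuristic from earlier in this notebook, so actual token counts will differ by tokenizer. The function name and filler sentence are placeholders.

```python
def make_prompt(target_tokens: int, tokens_per_word: float = 1.3) -> str:
    """Build a prompt of roughly target_tokens tokens from repeated filler."""
    filler_words = "The quick brown fox jumps over the lazy dog.".split()
    words_needed = max(1, round(target_tokens / tokens_per_word))
    # Cycle through the filler sentence until enough words are collected
    words = [filler_words[i % len(filler_words)] for i in range(words_needed)]
    return " ".join(words)


for target in (100, 500, 1000, 2000, 4000):
    prompt = make_prompt(target)
    print(f"target ~{target} tokens -> {len(prompt.split())} words")
```

For accurate lengths, count tokens with the model's own tokenizer instead of the word-count heuristic. Measuring and plotting the prefill time for each prompt is still left as the exercise.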


---

## 🎉 Checkpoint

You've learned:
- ✅ How TensorRT-LLM optimizes models for NVIDIA GPUs
- ✅ The build pipeline: HuggingFace → Checkpoint → Engine
- ✅ Key configuration options for different use cases
- ✅ When to choose TensorRT-LLM over other engines

---

## 🚀 Challenge (Optional)

**Build an FP8 Quantized Engine**

DGX Spark's Blackwell GPU supports FP8 inference. Try building an FP8 engine and compare:
1. Build time
2. Engine size
3. Inference speed
4. Output quality

---

## 📖 Further Reading

- [TensorRT-LLM GitHub](https://github.com/NVIDIA/TensorRT-LLM)
- [TensorRT-LLM Performance Guide](https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html)
- [Triton Inference Server Integration](https://github.com/triton-inference-server/tensorrtllm_backend)
- [FP8 Training and Inference](https://developer.nvidia.com/blog/nvidia-hopper-architecture-enables-fp8-training-and-inference/)

---

## 🧹 Cleanup

In [None]:
# Cleanup
import gc

gc.collect()

print("✅ Cleanup complete!")
print("\n💡 To stop TensorRT-LLM container:")
print("   docker stop $(docker ps -q --filter ancestor=nvcr.io/nvidia/tritonserver:25.11-trtllm-python-py3)")
print("\n⚠️ Engine files can be large (~15GB per 8B model)")
print("   Delete unused engines to save space.")