# Lab 3.3.6: TensorRT-LLM Optimization

**Module:** 3.3 - Model Deployment & Inference Engines  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand TensorRT-LLM's architecture and optimization pipeline
- [ ] Build an optimized TRT engine for Llama models
- [ ] Benchmark prefill and decode performance
- [ ] Know when TensorRT-LLM is the right choice

---

## 📚 Prerequisites

- Completed: Labs 3.3.1-3.3.5
- Knowledge of: Model quantization, CUDA basics
- Access to: NGC containers, plus 128GB+ of free disk space for engine builds

---

## 🌍 Real-World Context

**When Maximum Performance Matters:**

TensorRT-LLM is NVIDIA's highest-performance inference engine, designed for:
- Production deployments where every millisecond counts
- High-throughput batch processing
- Applications with long input prompts (RAG, document analysis)

**Trade-offs:**
- Best performance → Requires engine compilation (45-90 minutes)
- NVIDIA optimized → Less portable than other solutions
- Most complex → Steeper learning curve

**Real Impact:**
- 2-5x faster prefill than PyTorch
- Best FP8/NVFP4 support on Blackwell
- Used by major cloud providers for LLM serving

---

## 🧒 ELI5: What is TensorRT-LLM?

> **Imagine you have a recipe for a fancy cake...**
>
> **PyTorch/HuggingFace:** You follow the recipe step by step, measuring each ingredient as you go.
> This is flexible - you can adjust on the fly - but not the fastest.
>
> **TensorRT-LLM:** Before baking, you spend an hour organizing:
> - Pre-measure ALL ingredients
> - Arrange tools in optimal order
> - Figure out which steps can happen simultaneously
> - Create a highly optimized "production line"
>
> Now when it's time to bake, everything flows perfectly - much faster!
>
> **The downside?** You spent an hour planning. And if you want a different cake,
> you need a new plan. But for making 1000 identical cakes? Way faster!
>
> **In AI terms:** TensorRT-LLM pre-compiles the model into a highly optimized "engine"
> that runs on NVIDIA GPUs with maximum efficiency. The compilation takes time,
> but inference is blazing fast.

---

## 📊 TensorRT-LLM Optimization Pipeline

```
HuggingFace Model → Convert → Quantize → Build Engine → Deploy
     (FP16)          (TRT)     (FP8)      (TRT-LLM)    (Triton)

Time:  ~10min        ~5min     ~15min     ~60min       ~2min
```

## Part 1: Understanding TensorRT-LLM Architecture

In [None]:
# Standard imports
import json
import os
import sys
import time
import subprocess
from pathlib import Path
from typing import Dict, List, Any
import warnings
warnings.filterwarnings('ignore')

# Third-party imports
import requests
import numpy as np

# Add scripts directory to path
scripts_path = Path("../scripts").resolve()
sys.path.insert(0, str(scripts_path))

print("✅ Imports successful!")
print(f"📁 Scripts path: {scripts_path}")

In [None]:
# Visualize TensorRT-LLM components
print("""
📊 TENSORRT-LLM ARCHITECTURE
============================================================

┌─────────────────────────────────────────────────────────────────┐
│                    TensorRT-LLM Stack                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────┐    ┌───────────────┐    ┌───────────────┐    │
│  │  Model API    │    │   Executor    │    │   Runtime     │    │
│  │  (Python)     │───►│   (Batch Mgr) │───►│   (C++/CUDA)  │    │
│  └───────────────┘    └───────────────┘    └───────────────┘    │
│          │                    │                    │            │
│          │                    │                    │            │
│          ▼                    ▼                    ▼            │
│  ┌───────────────────────────────────────────────────────┐      │
│  │              TensorRT Engine (Compiled)               │      │
│  │  - Fused CUDA kernels                                 │      │
│  │  - Optimized memory layout                            │      │
│  │  - Hardware-specific tuning                           │      │
│  └───────────────────────────────────────────────────────┘      │
│                              │                                  │
│                              ▼                                  │
│  ┌───────────────────────────────────────────────────────┐      │
│  │              NVIDIA GPU (Blackwell GB10)              │      │
│  │  - 6,144 CUDA cores                                   │      │
│  │  - 192 Tensor Cores (5th gen)                         │      │
│  │  - Native FP8/NVFP4 support                           │      │
│  └───────────────────────────────────────────────────────┘      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
""")

In [None]:
# Key optimizations in TensorRT-LLM
print("""
📊 KEY TENSORRT-LLM OPTIMIZATIONS
============================================================

1. KERNEL FUSION
────────────────
   Before: MatMul → Add Bias → Activation (3 kernel launches)
   After:  FusedLinear (1 kernel launch)

   Benefit: ~30% less kernel launch overhead

2. FLASH ATTENTION INTEGRATION
──────────────────────────────
   - Fused multi-head attention kernel
   - Memory-efficient: O(N) attention memory instead of O(N²)
   - Specialized for GQA/MQA architectures

   Benefit: ~40% faster attention, ~60% less memory

3. IN-FLIGHT BATCHING
─────────────────────
   Like vLLM's continuous batching, but at the engine level:
   - Dynamic batch management
   - Request-level scheduling

   Benefit: Higher throughput under varied load

4. QUANTIZATION (FP8/NVFP4)
───────────────────────────
   - Native FP8 on Blackwell (no emulation)
   - NVFP4 exclusive to Blackwell architecture
   - Per-channel or per-block scaling

   Benefit: 2-4x memory reduction, 1.5-2x faster compute

5. PAGED KV CACHE
─────────────────
   - Similar to vLLM's PagedAttention
   - Managed at the engine level

   Benefit: More concurrent sequences
""")

---

## Part 2: Building a TensorRT-LLM Engine

In [None]:
# Check if TensorRT-LLM is available
def check_trt_llm_installation():
    """Check if TensorRT-LLM is installed and accessible."""
    try:
        # Use the interpreter running this notebook, not whichever "python" is on PATH
        result = subprocess.run(
            [sys.executable, "-c", "import tensorrt_llm; print(tensorrt_llm.__version__)"],
            capture_output=True,
            text=True
        )
        if result.returncode == 0:
            version = result.stdout.strip()
            print(f"✅ TensorRT-LLM installed: v{version}")
            return True
        else:
            print("❌ TensorRT-LLM not installed")
            print("\n📝 To install via NGC container:")
            print("   docker pull nvcr.io/nvidia/tritonserver:25.11-trtllm-python-py3")
            return False
    except FileNotFoundError:
        print("❌ Python not found")
        return False

trt_llm_available = check_trt_llm_installation()

In [None]:
# Engine build configuration
print("""
📊 TENSORRT-LLM ENGINE BUILD OPTIONS
============================================================

Key Build Parameters:
─────────────────────

--dtype:
  • float16  - Standard half precision
  • bfloat16 - Recommended for Blackwell (native support)
  • float8   - FP8 E4M3 for inference (2x memory reduction)
  • fp4      - NVFP4 for Blackwell only (4x memory reduction)

--max_input_len:
  • Maximum input prompt length the engine will accept
  • Affects memory allocation
  • Default: 2048, can set to 8192 for RAG

--max_output_len:
  • Maximum tokens to generate
  • Affects KV cache allocation
  • Set based on your use case

--max_batch_size:
  • Maximum concurrent sequences
  • DGX Spark with 128GB can handle 32-64 for 8B model

--use_fused_mlp:
  • Fuse gate/up/down projections
  • ~15% speedup, slightly more memory

--enable_context_fmha:
  • Flash attention for context (prefill)
  • Essential for long input performance

Example Build Command:
──────────────────────
python -m tensorrt_llm.commands.build \\
    --model_dir ./llama-3.1-8b-hf \\
    --output_dir ./llama-3.1-8b-trt \\
    --dtype bfloat16 \\
    --max_input_len 4096 \\
    --max_output_len 2048 \\
    --max_batch_size 32 \\
    --use_fused_mlp \\
    --enable_context_fmha
""")

In [None]:
# Estimate engine build time and size
def estimate_build_resources(
    model_size_b: float,
    dtype: str = "bfloat16"
) -> Dict[str, Any]:
    """
    Estimate build time and disk space for TensorRT-LLM engine.
    
    Args:
        model_size_b: Model size in billions of parameters
        dtype: Target data type
    
    Returns:
        Dictionary with estimates
    """
    # Bytes per parameter
    dtype_bytes = {
        "float32": 4,
        "float16": 2,
        "bfloat16": 2,
        "float8": 1,
        "fp4": 0.5
    }
    
    bytes_per_param = dtype_bytes.get(dtype, 2)
    
    # Model file size
    model_size_gb = (model_size_b * 1e9 * bytes_per_param) / (1024**3)
    
    # Engine is typically 1.1-1.3x model size
    engine_size_gb = model_size_gb * 1.2
    
    # Temp space during build
    temp_space_gb = model_size_gb * 3  # Need space for intermediate files
    
    # Build time estimate (roughly 5 min per billion params)
    build_time_min = model_size_b * 5
    
    return {
        "model_size_gb": round(model_size_gb, 1),
        "engine_size_gb": round(engine_size_gb, 1),
        "temp_space_gb": round(temp_space_gb, 1),
        "total_disk_needed_gb": round(model_size_gb + engine_size_gb + temp_space_gb, 1),
        "build_time_min": round(build_time_min),
        "build_time_formatted": f"{int(build_time_min // 60)}h {int(build_time_min % 60)}m"
    }

# Estimate for common models
print("📊 Engine Build Estimates\n")
print(f"{'Model':<20} {'Dtype':<10} {'Model Size':<12} {'Engine Size':<12} {'Build Time'}")
print("-" * 65)

models_to_estimate = [
    ("Llama 3.1 8B", 8, "bfloat16"),
    ("Llama 3.1 8B", 8, "float8"),
    ("Llama 3.1 70B", 70, "bfloat16"),
    ("Llama 3.1 70B", 70, "float8"),
    ("Llama 3.1 70B", 70, "fp4"),
]

for model_name, size_b, dtype in models_to_estimate:
    est = estimate_build_resources(size_b, dtype)
    print(f"{model_name:<20} {dtype:<10} {est['model_size_gb']:<10}GB {est['engine_size_gb']:<10}GB {est['build_time_formatted']}")

### 🔧 Building a TensorRT-LLM Engine

Here's the complete workflow for building an engine on DGX Spark:

```bash
# Step 1: Start the TensorRT-LLM container
docker run --gpus all -it --rm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v ~/trt-engines:/workspace/engines \
    -e HF_TOKEN=$HF_TOKEN \
    --ipc=host \
    nvcr.io/nvidia/tritonserver:25.11-trtllm-python-py3

# Step 2: Convert HuggingFace model to TensorRT-LLM format
python -m tensorrt_llm.commands.convert_checkpoint \
    --model_dir meta-llama/Llama-3.1-8B-Instruct \
    --output_dir /workspace/engines/llama-3.1-8b-ckpt \
    --dtype bfloat16

# Step 3: Build the TensorRT engine
python -m tensorrt_llm.commands.build \
    --checkpoint_dir /workspace/engines/llama-3.1-8b-ckpt \
    --output_dir /workspace/engines/llama-3.1-8b-trt \
    --max_input_len 4096 \
    --max_output_len 2048 \
    --max_batch_size 32 \
    --use_fused_mlp \
    --enable_context_fmha

# Step 4: Run the server
python -m tensorrt_llm.commands.serve \
    --engine_dir /workspace/engines/llama-3.1-8b-trt \
    --host 0.0.0.0 \
    --port 8000
```
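
When you build several engine variants (different dtypes or sequence limits), assembling the build command from a dictionary keeps the runs reproducible. This is a minimal sketch: `make_build_command` is a hypothetical helper, and the flag names are copied from the build step above rather than checked against a specific TensorRT-LLM release, so confirm them with `--help` in your container.

```python
import shlex
from typing import Dict, List, Union


def make_build_command(options: Dict[str, Union[str, int, bool]]) -> List[str]:
    """Turn {flag_name: value} into an argument list for the build module."""
    cmd = ["python", "-m", "tensorrt_llm.commands.build"]
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            cmd.append(flag)            # boolean switch, no argument
        elif value is False or value is None:
            continue                    # switch turned off: omit it
        else:
            cmd.extend([flag, str(value)])
    return cmd


# Values mirror Step 3 of the workflow above
options = {
    "checkpoint_dir": "/workspace/engines/llama-3.1-8b-ckpt",
    "output_dir": "/workspace/engines/llama-3.1-8b-trt",
    "max_input_len": 4096,
    "max_output_len": 2048,
    "max_batch_size": 32,
    "use_fused_mlp": True,
    "enable_context_fmha": True,
}
command = make_build_command(options)
print(shlex.join(command))
```

Passing `command` to `subprocess.run(command, check=True)` inside the container would start the build; keeping it as a list avoids shell-quoting problems with paths.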

---

## Part 3: Benchmarking TensorRT-LLM

In [None]:
# Check if TensorRT-LLM server is running
TRT_LLM_URL = "http://localhost:8000"

def check_trt_server(url: str) -> bool:
    """Check if TensorRT-LLM server is running."""
    try:
        response = requests.get(f"{url}/v1/models", timeout=5)
    except requests.exceptions.RequestException as exc:
        # Covers refused connections as well as timeouts
        print(f"❌ No TensorRT-LLM server at {url} ({type(exc).__name__})")
        print("\n📝 See build instructions above to start a server")
        return False

    if response.status_code != 200:
        print(f"❌ Server at {url} returned HTTP {response.status_code}")
        return False

    print(f"✅ TensorRT-LLM server running at {url}")
    for model in response.json().get("data", []):
        print(f"   Model: {model.get('id', 'unknown')}")
    return True

trt_server_available = check_trt_server(TRT_LLM_URL)
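
Loading an engine takes a while (the pipeline overview budgets about two minutes for deployment), so a single check right after launch often fails. A polling loop is one way to handle that. The sketch below is an assumption-laden illustration: `wait_until_ready` and its default timings are hypothetical, and the probe is injected so the loop can be tried without a server.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
) -> bool:
    """Call `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in probe that succeeds on the third attempt
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts), timeout_s=10.0, interval_s=0.0)
print(f"Server ready: {ready}")
```

In this notebook the probe would be `lambda: check_trt_server(TRT_LLM_URL)`, at the cost of one status message per attempt.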

In [None]:
# Benchmark prompts for prefill testing
PREFILL_TEST_PROMPTS = {
    "short": "What is 2+2?",
    "medium": "Explain the concept of machine learning in 2-3 sentences. " * 10,  # ~100 tokens
    "long": """Analyze the following text and provide insights:

Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that 
focuses on using data and algorithms to enable AI to imitate the way that humans learn, 
gradually improving its accuracy. Machine learning is an important component of the growing 
field of data science. Through the use of statistical methods, algorithms are trained to make 
classifications or predictions, and to uncover key insights in data mining projects.

These insights subsequently drive decision making within applications and businesses, ideally 
impacting key growth metrics. As big data continues to expand and grow, the market demand for 
data scientists will increase. They will be required to help identify the most relevant business 
questions and the data to answer them.

Machine learning algorithms are typically created using frameworks that accelerate solution 
development, such as TensorFlow and PyTorch. Machine learning models are improving in accuracy, 
thanks to the vast amounts of data now available and to the increasing power of computers. 
""" + "Please summarize the key points. " * 5,  # ~250 tokens by the word-count estimate below
}

# Estimate token counts
for name, prompt in PREFILL_TEST_PROMPTS.items():
    token_estimate = len(prompt.split()) * 1.3  # Rough estimate
    print(f"{name}: ~{int(token_estimate)} tokens")

In [None]:
def benchmark_prefill(
    url: str,
    prompts: Dict[str, str],
    max_tokens: int = 50,
    num_runs: int = 3
) -> Dict[str, Dict]:
    """
    Benchmark prefill performance for different prompt lengths.
    
    Prefill (TTFT) is TensorRT-LLM's strength - measures time to process input.
    """
    print("\n🧪 Benchmarking prefill performance...")
    print("="*60)
    
    results = {}
    
    for name, prompt in prompts.items():
        print(f"\n{name} prompt ({len(prompt.split())} words):")
        
        ttfts = []
        
        for run in range(num_runs):
            start_time = time.perf_counter()
            first_token_time = None
            
            try:
                # Open the stream as a context manager so the connection is
                # released even though reading stops after the first token.
                with requests.post(
                    f"{url}/v1/chat/completions",
                    json={
                        "model": "default",
                        "messages": [{"role": "user", "content": prompt}],
                        "max_tokens": max_tokens,
                        "temperature": 0.7,
                        "stream": True
                    },
                    stream=True,
                    timeout=120
                ) as response:
                    response.raise_for_status()  # Surface HTTP errors instead of timing them
                    for line in response.iter_lines():
                        # First streamed line that carries a "content" field
                        if line and "content" in line.decode("utf-8"):
                            first_token_time = time.perf_counter()
                            break

                if first_token_time:
                    ttft = (first_token_time - start_time) * 1000  # ms
                    ttfts.append(ttft)
                    print(f"   Run {run+1}: TTFT = {ttft:.1f}ms")
                else:
                    print(f"   Run {run+1}: No response received")

            except requests.exceptions.RequestException as e:
                print(f"   Run {run+1}: Error - {e}")
        
        if ttfts:
            results[name] = {
                "avg_ttft_ms": np.mean(ttfts),
                "min_ttft_ms": np.min(ttfts),
                "max_ttft_ms": np.max(ttfts),
                "word_count": len(prompt.split())
            }
    
    return results
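
The substring test `"content" in line_str` is a quick heuristic: in OpenAI-style streaming the first chunk may carry only the role with an empty `content`, which would stop the timer before any text arrives. A stricter alternative is to parse each server-sent-event line. The sketch below assumes the server emits OpenAI-compatible `data: {...}` chunks (implied by the `/v1/chat/completions` endpoint, but not verified here); `extract_stream_text` is a hypothetical helper.

```python
import json
from typing import Optional


def extract_stream_text(line: bytes) -> Optional[str]:
    """Return the generated text carried by one SSE line, or None if it has none."""
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None
    payload = text[len("data:"):].strip()
    if payload == "[DONE]":           # end-of-stream sentinel
        return None
    try:
        chunk = json.loads(payload)
    except json.JSONDecodeError:
        return None
    choices = chunk.get("choices") or []
    if not choices:
        return None
    delta = choices[0].get("delta") or {}
    return delta.get("content") or None   # empty string counts as "no text yet"


role_only = b'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}'
first_text = b'data: {"choices": [{"delta": {"content": "Hi"}}]}'
print(extract_stream_text(role_only), extract_stream_text(first_text))
```

Stopping the timer on the first line for which this helper returns text measures time to the first generated token rather than time to the first chunk.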

In [None]:
# Run prefill benchmark if server is available
if trt_server_available:
    prefill_results = benchmark_prefill(
        url=TRT_LLM_URL,
        prompts=PREFILL_TEST_PROMPTS,
        max_tokens=50,
        num_runs=3
    )
    
    # Print summary
    print("\n" + "="*60)
    print("📊 PREFILL (TTFT) BENCHMARK RESULTS")
    print("="*60)
    print(f"\n{'Prompt':<10} {'Words':<8} {'Avg TTFT':<12} {'Min':<10} {'Max'}")
    print("-" * 50)
    
    for name, stats in prefill_results.items():
        print(f"{name:<10} {stats['word_count']:<8} "
              f"{stats['avg_ttft_ms']:.1f}ms{'':>4} "
              f"{stats['min_ttft_ms']:.1f}ms{'':>3} "
              f"{stats['max_ttft_ms']:.1f}ms")
else:
    print("\n📊 Expected TensorRT-LLM prefill performance (Llama 3.1 8B, BF16):")
    print("   Short (10 tokens): ~25ms TTFT")
    print("   Medium (100 tokens): ~50ms TTFT")
    print("   Long (500 tokens): ~120ms TTFT")
    print("\n   Compare to vLLM:")
    print("   Short: ~35ms, Medium: ~80ms, Long: ~250ms")
    print("\n   TensorRT-LLM prefill is roughly 1.4-2x faster, and the gap widens with prompt length!")
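
The speedup implied by those reference figures can be worked out per prompt length. This short sketch uses only the expected TTFT values quoted in the cell above (they are the lab's reference numbers, not measurements) and divides the vLLM figure by the TensorRT-LLM figure.

```python
# Expected TTFT in milliseconds, copied from the reference figures above
expected_ttft_ms = {
    "short": {"trt_llm": 25, "vllm": 35},
    "medium": {"trt_llm": 50, "vllm": 80},
    "long": {"trt_llm": 120, "vllm": 250},
}

# Speedup = baseline latency / TensorRT-LLM latency
speedups = {
    name: t["vllm"] / t["trt_llm"] for name, t in expected_ttft_ms.items()
}

for name, factor in speedups.items():
    print(f"{name:<8} {factor:.2f}x")
```

The same dictionary shape works for your own measurements: replace the values with the `avg_ttft_ms` results from `benchmark_prefill` for each engine.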

---

## Part 4: When to Choose TensorRT-LLM

In [None]:
# Decision matrix
print("""
📊 INFERENCE ENGINE DECISION MATRIX
============================================================

| Use Case                      | Best Choice   | Why                         |
|-------------------------------|---------------|-----------------------------|
| Local development             | Ollama        | Easy setup, model mgmt      |
| Quick prototyping             | vLLM          | Fast to start, good perf    |
| Interactive chat (latency)    | SGLang+Medusa | Lowest decode latency       |
| Long inputs (RAG, docs)       | TensorRT-LLM  | Best prefill performance    |
| Max throughput (batch)        | TensorRT-LLM  | Best GPU utilization        |
| FP8/NVFP4 quantization        | TensorRT-LLM  | Native support on Blackwell |
| Production (NVIDIA GPUs)      | TensorRT-LLM  | Fully optimized             |
| Edge deployment               | llama.cpp     | Smallest footprint          |

TensorRT-LLM Trade-offs:
────────────────────────
✅ PROS:
  • Best raw performance on NVIDIA hardware
  • Best prefill speed (crucial for long prompts)
  • Native FP8/NVFP4 support
  • Tight integration with Triton Inference Server

❌ CONS:
  • Long engine build time (45-90 minutes)
  • Engine tied to specific hardware
  • More complex setup
  • Less flexible for experimentation
""")

---

## ⚠️ Common Mistakes

### Mistake 1: Building Engine on Wrong Hardware

```bash
# ❌ Wrong - Building on a different GPU than the deployment target
# Built on A100, deploying to DGX Spark (Blackwell)

# ✅ Right - Build on the target hardware, or specify the target at build time
python -m tensorrt_llm.commands.build \
    --target_architecture blackwell  # Specify target (confirm this flag with --help in your release)
```

### Mistake 2: Setting max_input_len Too High

```python
# ❌ Wrong - Wastes memory on unused context length
max_input_len = 131072  # 128K context - but you only use 4K

# ✅ Right - Set based on actual usage
max_input_len = 8192   # Sufficient for most RAG applications
```
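
To put numbers on this, the worst-case KV cache size can be estimated from the model shape. This is a rough sketch under stated assumptions: the Llama 3.1 8B dimensions (32 layers, 8 KV heads, head dimension 128) are taken from the model's published configuration, the cache is stored in 2-byte BF16, and `kv_cache_gib` is a hypothetical helper. With a paged KV cache the runtime allocates pages on demand, so treat the result as an upper bound reached only when every sequence fills its limit.

```python
def kv_cache_gib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    tokens_per_sequence: int,
    batch_size: int,
    bytes_per_value: int = 2,   # BF16/FP16
) -> float:
    """Worst-case KV cache size in GiB."""
    # Factor 2: one key vector and one value vector per layer and KV head
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * tokens_per_sequence * batch_size / (1024 ** 3)


LLAMA_31_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)  # assumed config

for tokens in (8192, 131072):
    single = kv_cache_gib(**LLAMA_31_8B, tokens_per_sequence=tokens, batch_size=1)
    batch32 = kv_cache_gib(**LLAMA_31_8B, tokens_per_sequence=tokens, batch_size=32)
    print(f"{tokens:>7} tokens: {single:.0f} GiB per sequence, {batch32:.0f} GiB at batch 32")
```

Under these assumptions each token costs 128 KiB of cache, so a full 128K-token sequence needs 16 GiB on its own, while an 8K-token sequence needs 1 GiB.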

### Mistake 3: Forgetting --enable_context_fmha

```bash
# ❌ Wrong - Missing flash attention for prefill
python -m tensorrt_llm.commands.build --model_dir ...

# ✅ Right - Enable context FMHA for fast prefill
python -m tensorrt_llm.commands.build \
    --model_dir ... \
    --enable_context_fmha
```
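
Why does a fused attention kernel matter so much for long prompts? The core idea behind flash-attention-style kernels is that the softmax can be computed block by block with running statistics, so the full score matrix never has to be stored. The pure-Python sketch below illustrates that idea for a single query vector. It is a teaching sketch, not the TensorRT-LLM kernel; the function names and the block size are chosen here for illustration.

```python
import math
from typing import List


def naive_attention(q: List[float], keys: List[List[float]], values: List[List[float]]) -> List[float]:
    """Reference: materialize every score, then softmax, then weight the values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def streaming_attention(q: List[float], keys: List[List[float]],
                        values: List[List[float]], block_size: int = 2) -> List[float]:
    """Same result, but only one block of scores exists at any time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention(q, keys, values)
streamed = streaming_attention(q, keys, values, block_size=2)
print(max(abs(a - b) for a, b in zip(reference, streamed)))
```

The printed value is the largest difference between the two results, which should be at floating-point noise level. The extra state per query is a running maximum, a running sum, and one output vector, independent of the prompt length.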

---

## ✋ Try It Yourself

### Exercise 1: Compare Different dtypes

Build engines with different dtypes and compare performance.

In [None]:
# Exercise 1: Your code here
# If you have TensorRT-LLM access, build engines with:
# - bfloat16
# - float8
# - fp4 (if on Blackwell)
#
# For each, measure:
# - Build time
# - Engine size
# - Prefill latency
# - Decode speed
# - Output quality (perplexity if possible)

# TODO: Document your findings

### Exercise 2: Optimize for Your Workload

Tune TensorRT-LLM build parameters for a specific use case.

In [None]:
# Exercise 2: Your code here
# Choose one of these workloads:
# A) Customer support chatbot (short prompts, short responses)
# B) Document QA (long prompts, medium responses)
# C) Batch summarization (long prompts, long responses, high batch)
#
# For your chosen workload, determine optimal:
# - max_input_len
# - max_output_len
# - max_batch_size
# - dtype
#
# Document your reasoning

# TODO: Implement your optimization

---

## 🎉 Checkpoint

You've learned:
- ✅ TensorRT-LLM's architecture and optimization pipeline
- ✅ How to build optimized TRT engines for LLMs
- ✅ When TensorRT-LLM is the right choice (long inputs, max throughput)
- ✅ Key build parameters and their effects

---

## 📖 Further Reading

- [TensorRT-LLM GitHub](https://github.com/NVIDIA/TensorRT-LLM)
- [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/)
- [NVIDIA NGC Container Catalog](https://catalog.ngc.nvidia.com/)
- [Triton Inference Server](https://github.com/triton-inference-server)

---

## 🧹 Cleanup

In [None]:
# Cleanup
import gc

# Clear Python garbage
gc.collect()

# Clear GPU memory cache if torch is available
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        print("✅ GPU memory cache cleared!")
except ImportError:
    pass

print("✅ Cleanup complete!")