# Lab 3.2.8: TensorRT-LLM Engine

## Production-Ready LLM Deployment with NVIDIA TensorRT-LLM

**Duration:** 2 hours

---

### Learning Objectives

By the end of this lab, you will be able to:

1. **Understand TensorRT-LLM architecture** and why it's essential for production
2. **Convert models to TensorRT-LLM engines** with quantization
3. **Optimize for DGX Spark Blackwell** using native FP8/FP4 support
4. **Configure batching and KV-cache** for maximum throughput
5. **Benchmark and compare** TensorRT-LLM vs vanilla transformers

---

### Why TensorRT-LLM?

```
Professor SPARK says:

"TensorRT-LLM is like having a Formula 1 pit crew for your AI model.

Vanilla PyTorch is a regular car - gets you there, but not optimized.
TensorRT-LLM rebuilds your engine with:
- Fused operations (fewer pit stops)
- Optimized memory (bigger fuel tank)
- Native quantization (lighter weight)
- Paged KV-cache (smart fuel management)

Result? 2-5x faster inference. On Blackwell? Even more with FP8!"
```

### TensorRT-LLM Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    TensorRT-LLM Pipeline                │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────┐  │
│  │ HuggingFace  │ -> │   Convert    │ -> │  Engine  │  │
│  │    Model     │    │  to TRT-LLM  │    │ (.engine)│  │
│  └──────────────┘    └──────────────┘    └──────────┘  │
│                              │                          │
│                              v                          │
│  ┌────────────────────────────────────────────────────┐│
│  │              Optimizations Applied:                ││
│  │  • Kernel fusion (attention, MLP, LayerNorm)       ││
│  │  • Quantization (INT8, FP8, INT4, FP4)             ││
│  │  • Flash Attention 2                               ││
│  │  • Paged KV-cache                                  ││
│  │  • In-flight batching                              ││
│  │  • Tensor parallelism                              ││
│  └────────────────────────────────────────────────────┘│
│                              │                          │
│                              v                          │
│  ┌──────────────────────────────────────────────────┐  │
│  │        Runtime: Triton Inference Server           │  │
│  │        or direct Python API                       │  │
│  └──────────────────────────────────────────────────┘  │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

---

## Section 1: Environment Setup

### 1.1 Install TensorRT-LLM

In [None]:
# TensorRT-LLM installation
# Note: TensorRT-LLM requires specific CUDA and driver versions
# On DGX Spark, it's pre-installed in the AI container

# If not installed, use:
# !pip install tensorrt-llm

# Or pull the official container:
# docker pull nvcr.io/nvidia/tensorrt-llm:latest

print("Checking TensorRT-LLM installation...")

In [None]:
# Core imports
import os
import sys
import json
import time
import subprocess
from pathlib import Path
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

# PyTorch
import torch

# Try importing TensorRT-LLM
try:
    import tensorrt_llm
    from tensorrt_llm import LLM, SamplingParams
    from tensorrt_llm.quantization import QuantMode
    TRT_LLM_AVAILABLE = True
    print(f"TensorRT-LLM version: {tensorrt_llm.__version__}")
except ImportError:
    TRT_LLM_AVAILABLE = False
    print("TensorRT-LLM not available - running in simulation mode")
    print("Install with: pip install tensorrt-llm")

# Local utilities
sys.path.append('..')
from scripts import (
    get_gpu_memory,
    clear_memory,
    MemoryTracker,
    print_dgx_spark_status,
    benchmark_inference
)

# Plotting
plt.style.use('seaborn-v0_8-whitegrid')

print("\nEnvironment setup complete!")
print_dgx_spark_status()

### 1.2 Configuration

In [None]:
@dataclass
class TRTLLMConfig:
    """Configuration for TensorRT-LLM engine building."""
    
    # Model settings
    model_name: str = "meta-llama/Llama-3.2-3B-Instruct"
    
    # Engine settings
    engine_dir: str = "../data/trt_engines"
    
    # Quantization (for Blackwell)
    quantization: str = "fp8"  # Options: none, int8, int8_sq, fp8, int4_awq, int4_gptq, fp4
    
    # Batch settings
    max_batch_size: int = 8
    max_input_len: int = 2048
    max_output_len: int = 512
    max_beam_width: int = 1
    
    # KV-cache settings
    kv_cache_type: str = "paged"  # paged or contiguous
    max_num_tokens: int = 8192
    
    # Hardware settings (DGX Spark)
    tp_size: int = 1  # Tensor parallelism (1 for single GPU)
    pp_size: int = 1  # Pipeline parallelism
    
    # Builder settings
    builder_opt_level: int = 3  # 0-5, higher = more optimization
    use_fused_mlp: bool = True
    use_flash_attention: bool = True
    
    # Benchmark settings
    warmup_runs: int = 5
    benchmark_runs: int = 50


# Create configuration
config = TRTLLMConfig()

# Ensure directories exist
Path(config.engine_dir).mkdir(parents=True, exist_ok=True)

print(f"TensorRT-LLM Configuration:")
print(f"  Model: {config.model_name}")
print(f"  Quantization: {config.quantization}")
print(f"  Max batch size: {config.max_batch_size}")
print(f"  Max input length: {config.max_input_len}")
print(f"  Engine directory: {config.engine_dir}")

---

## Section 2: Understanding TensorRT-LLM Optimizations

### 2.1 Key Optimizations Explained

```
Professor SPARK's Optimization Guide:

1. KERNEL FUSION
   Before: LayerNorm -> Linear -> GELU -> Linear (4 kernel launches)
   After:  FusedMLP (1 kernel launch)
   Benefit: Less memory traffic, fewer synchronization points

2. QUANTIZATION
   FP8 on Blackwell: 2x throughput with minimal quality loss
   INT4: 4x memory reduction for larger batch sizes

3. FLASH ATTENTION
   Standard: O(n²) memory for attention
   Flash:    O(n) memory, tiled computation
   Benefit:  Handle longer sequences, faster

4. PAGED KV-CACHE
   Like virtual memory for your model
   Only allocates cache pages as needed
   Benefit: More requests in parallel, less memory waste

5. IN-FLIGHT BATCHING
   Requests join/leave batch dynamically
   No waiting for slowest request
   Benefit: Higher GPU utilization
```
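
To make the paged KV-cache point concrete, here is a rough sketch of the memory arithmetic. The layer count, KV-head count, and head dimension are assumed values for a Llama-3.2-3B-style model, and the 64-token page size is hypothetical; substitute the numbers from your own model and runtime.

```python
import math

def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV-cache one token occupies (keys + values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def contiguous_tokens(num_requests: int, max_seq_len: int) -> int:
    """A contiguous cache reserves max_seq_len slots per request up front."""
    return num_requests * max_seq_len

def paged_tokens(seq_lens, page_size: int = 64) -> int:
    """A paged cache allocates whole pages only for tokens that exist."""
    return sum(math.ceil(n / page_size) * page_size for n in seq_lens)

# Assumed model shape (hypothetical values; check your model's config.json)
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=8, head_dim=128)

seq_lens = [100, 700, 1500, 2560]   # current lengths of four live requests
reserved = contiguous_tokens(len(seq_lens), max_seq_len=2560)
paged = paged_tokens(seq_lens, page_size=64)

print(f"KV-cache per token: {per_token / 1024:.0f} KiB")
print(f"Contiguous: {reserved * per_token / 2**30:.2f} GiB reserved")
print(f"Paged:      {paged * per_token / 2**30:.2f} GiB allocated")
```

The gap between the two totals is the memory that paging frees up for additional concurrent requests.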

In [None]:
# Visualization of optimization impact (illustrative cumulative speedups, not measurements)
optimizations = {
    'Baseline (PyTorch)': 1.0,
    '+ Kernel Fusion': 1.4,
    '+ Flash Attention': 1.8,
    '+ FP8 Quantization': 2.5,
    '+ Paged KV-Cache': 3.0,
    '+ In-flight Batching': 4.0,
}

fig, ax = plt.subplots(figsize=(12, 6))

names = list(optimizations.keys())
speedups = list(optimizations.values())
colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(names)))

bars = ax.barh(names, speedups, color=colors, edgecolor='black', linewidth=1.2)

# Add value labels
for bar, speedup in zip(bars, speedups):
    ax.text(bar.get_width() + 0.05, bar.get_y() + bar.get_height()/2,
            f'{speedup:.1f}x', va='center', fontweight='bold')

ax.set_xlabel('Speedup Factor', fontsize=12)
ax.set_title('Cumulative TensorRT-LLM Optimizations on DGX Spark', 
             fontsize=14, fontweight='bold')
ax.set_xlim(0, 5)

# Add vertical line at 1x
ax.axvline(x=1.0, color='red', linestyle='--', alpha=0.5, label='Baseline')
ax.legend(loc='lower right')  # without this call the 'Baseline' label is never drawn

plt.tight_layout()
plt.savefig(Path(config.engine_dir) / 'optimization_impact.png', dpi=150)
plt.show()

### 2.2 Quantization Modes in TensorRT-LLM

In [None]:
# TensorRT-LLM Quantization Options
quant_modes = {
    'none': {
        'description': 'FP16/BF16 weights and activations',
        'memory_ratio': 1.0,
        'speedup': 1.0,
        'quality_loss': 'None',
        'blackwell_optimized': False
    },
    'int8_weight_only': {
        'description': 'INT8 weights, FP16 compute',
        'memory_ratio': 0.5,
        'speedup': 1.5,
        'quality_loss': 'Minimal',
        'blackwell_optimized': False
    },
    'int8_sq': {
        'description': 'SmoothQuant INT8 weights + activations',
        'memory_ratio': 0.5,
        'speedup': 2.0,
        'quality_loss': 'Low',
        'blackwell_optimized': True
    },
    'fp8': {
        'description': 'FP8 weights and activations (E4M3)',
        'memory_ratio': 0.5,
        'speedup': 2.5,
        'quality_loss': 'Very Low',
        'blackwell_optimized': True
    },
    'int4_awq': {
        'description': 'AWQ 4-bit weight quantization',
        'memory_ratio': 0.25,
        'speedup': 2.0,
        'quality_loss': 'Low-Medium',
        'blackwell_optimized': False
    },
    'int4_gptq': {
        'description': 'GPTQ 4-bit weight quantization',
        'memory_ratio': 0.25,
        'speedup': 2.0,
        'quality_loss': 'Low-Medium',
        'blackwell_optimized': False
    },
    'fp4': {
        'description': 'NVFP4 weights (Blackwell native)',
        'memory_ratio': 0.25,
        'speedup': 3.5,
        'quality_loss': 'Low',
        'blackwell_optimized': True
    }
}

# Display as table
df_quant = pd.DataFrame(quant_modes).T
df_quant.index.name = 'Mode'

print("TensorRT-LLM Quantization Modes:")
print("="*80)
print(df_quant.to_string())
print("\n" + "="*80)
print("\nRecommendation for DGX Spark Blackwell: Use 'fp8' or 'fp4' for best performance")
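
The `memory_ratio` column follows directly from bits per weight. A minimal sketch of that arithmetic, assuming nominal round parameter counts and ignoring quantization scales, KV-cache, and activations (the helper name is hypothetical):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9

# Nominal parameter counts (assumed round numbers for illustration)
models = {"3B": 3e9, "8B": 8e9, "70B": 70e9}
precisions = {"fp16": 16, "fp8/int8": 8, "fp4/int4": 4}

for name, params in models.items():
    row = ", ".join(f"{p}: {weight_memory_gb(params, bits):.1f} GB"
                    for p, bits in precisions.items())
    print(f"{name:>4} -> {row}")
```

Real engines use somewhat more than this, since scales, the KV-cache, and runtime buffers all add to the footprint.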

---

## Section 3: Building TensorRT-LLM Engine

### 3.1 Engine Builder Class

In [None]:
class TRTLLMEngineBuilder:
    """
    Helper class for building TensorRT-LLM engines.
    
    Supports various quantization modes optimized for DGX Spark Blackwell.
    """
    
    def __init__(self, config: TRTLLMConfig):
        self.config = config
        self.engine_path = None
    
    def get_quant_config(self) -> dict:
        """
        Get quantization configuration based on mode.
        """
        quant_configs = {
            'none': {},
            'int8': {
                'quant_mode': 'int8_weight_only'
            },
            'int8_sq': {
                'quant_mode': 'int8_sq',
                'per_token': True,
                'per_channel': True
            },
            'fp8': {
                'quant_mode': 'fp8',
                'fp8_kv_cache': True
            },
            'int4_awq': {
                'quant_mode': 'int4_awq',
                'group_size': 128
            },
            'int4_gptq': {
                'quant_mode': 'int4_gptq',
                'group_size': 128
            },
            'fp4': {
                'quant_mode': 'fp4',
                'fp4_kv_cache': True
            }
        }
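        # Fail loudly on typos instead of silently building an unquantized engine
        if self.config.quantization not in quant_configs:
            raise ValueError(
                f"Unknown quantization mode {self.config.quantization!r}; "
                f"expected one of {sorted(quant_configs)}"
            )
        return quant_configs[self.config.quantization]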
    
    def build_engine(self) -> str:
        """
        Build TensorRT-LLM engine from HuggingFace model.
        
        Returns:
            Path to built engine
        """
        if not TRT_LLM_AVAILABLE:
            print("TensorRT-LLM not available - returning simulated path")
            return f"{self.config.engine_dir}/simulated_engine"
        
        print(f"\nBuilding TensorRT-LLM engine...")
        print(f"  Model: {self.config.model_name}")
        print(f"  Quantization: {self.config.quantization}")
        
        # Create engine name
        model_short = self.config.model_name.split('/')[-1]
        engine_name = f"{model_short}_{self.config.quantization}_tp{self.config.tp_size}"
        engine_path = Path(self.config.engine_dir) / engine_name
        
        # Build using TensorRT-LLM LLM class
        try:
            with MemoryTracker() as tracker:
                # Get quantization config
                quant_config = self.get_quant_config()
                
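                # Build engine (simplified sketch: the quant_config keys are illustrative;
                # check how your TensorRT-LLM version expects quantization settings to be passed)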
                llm = LLM(
                    model=self.config.model_name,
                    tensor_parallel_size=self.config.tp_size,
                    **quant_config
                )
                
                # Save engine
                llm.save(str(engine_path))
            
            print(f"\nEngine built successfully!")
            print(f"  Path: {engine_path}")
            print(f"  Build time: {tracker.elapsed_time:.1f}s")
            print(f"  Peak memory: {tracker.peak_memory_gb:.2f} GB")
            
            self.engine_path = str(engine_path)
            return self.engine_path
            
        except Exception as e:
            print(f"Engine build failed: {e}")
            raise
    
    def build_engine_cli(self) -> str:
        """
        Build engine using CLI tools (alternative method).
        
        Useful for more control over the build process.
        """
        print("\nBuilding engine using CLI tools...")
        
        model_short = self.config.model_name.split('/')[-1]
        checkpoint_dir = Path(self.config.engine_dir) / f"{model_short}_checkpoint"
        engine_dir = Path(self.config.engine_dir) / f"{model_short}_{self.config.quantization}_engine"
        
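        # Step 1: Convert checkpoint
        # NOTE: flags are illustrative; verify them against the convert script
        # shipped with your TensorRT-LLM version.
        # .strip() drops the surrounding newlines so the flags appended below
        # stay on the same command line.
        convert_cmd = f"""
python -m tensorrt_llm.commands.convert_checkpoint \
    --model_dir {self.config.model_name} \
    --output_dir {checkpoint_dir} \
    --dtype float16 \
    --tp_size {self.config.tp_size}
""".strip()
        
        # Add quantization flags
        if self.config.quantization == 'fp8':
            convert_cmd += " --use_fp8"
        elif self.config.quantization == 'int8':
            # INT8 weight-only, matching get_quant_config()
            convert_cmd += " --use_weight_only --weight_only_precision int8"
        elif self.config.quantization == 'int4_awq':
            convert_cmd += " --use_weight_only --weight_only_precision int4_awq"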
        
        print(f"Convert command:\n{convert_cmd}")
        
        # Step 2: Build engine
        build_cmd = f"""
trtllm-build \
    --checkpoint_dir {checkpoint_dir} \
    --output_dir {engine_dir} \
    --max_batch_size {self.config.max_batch_size} \
    --max_input_len {self.config.max_input_len} \
    --max_seq_len {self.config.max_input_len + self.config.max_output_len} \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable
"""
        
        print(f"\nBuild command:\n{build_cmd}")
        
        # Return path (actual execution would be done in terminal)
        self.engine_path = str(engine_dir)
        return self.engine_path


# Create builder
builder = TRTLLMEngineBuilder(config)
print("\nEngine builder ready!")

### 3.2 Build Engine (Demonstration)

In [None]:
# Demonstrate CLI commands for building engine
print("TensorRT-LLM Engine Build Commands:")
print("="*60)

engine_path = builder.build_engine_cli()

print("\n" + "="*60)
print("\nTo build the engine, run these commands in your terminal.")
print("The build process can take 10-30 minutes depending on model size.")
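
If you would rather script the build than paste strings into a terminal, one option is to assemble the command as an argument list. The sketch below reuses only the `trtllm-build` flags already shown above; the helper name and example paths are hypothetical, and nothing is executed unless you uncomment the `subprocess` call.

```python
import shlex

def trtllm_build_args(checkpoint_dir: str, engine_dir: str,
                      max_batch_size: int, max_input_len: int,
                      max_output_len: int) -> list:
    """Assemble a trtllm-build invocation as an argument list (hypothetical helper)."""
    return [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", engine_dir,
        "--max_batch_size", str(max_batch_size),
        "--max_input_len", str(max_input_len),
        # max_seq_len covers prompt plus generated tokens
        "--max_seq_len", str(max_input_len + max_output_len),
    ]

args = trtllm_build_args("ckpt dir/llama", "engines/llama_fp8", 8, 2048, 512)
print(shlex.join(args))  # shell-safe string; paths containing spaces get quoted

# To execute instead of printing (only where trtllm-build is installed):
# import subprocess
# subprocess.run(args, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems that f-string commands can run into.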

### 3.3 Build with Different Quantization Modes

In [None]:
def generate_build_commands(model_name: str, output_base: str) -> dict:
    """
    Generate build commands for all quantization modes.
    """
    modes = ['fp16', 'int8', 'fp8', 'int4_awq', 'fp4']
    commands = {}
    
    for mode in modes:
        checkpoint_dir = f"{output_base}/{mode}_checkpoint"
        engine_dir = f"{output_base}/{mode}_engine"
        
        # Convert checkpoint command
        convert_cmd = f"python -m tensorrt_llm.commands.convert_checkpoint --model_dir {model_name} --output_dir {checkpoint_dir} --dtype float16"
        
        if mode == 'fp8':
            convert_cmd += " --use_fp8"
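        elif mode == 'int8':
            # INT8 weight-only (an INT8 KV-cache flag alone leaves the weights in FP16)
            convert_cmd += " --use_weight_only --weight_only_precision int8"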
        elif mode == 'int4_awq':
            convert_cmd += " --use_weight_only --weight_only_precision int4_awq"
        elif mode == 'fp4':
            convert_cmd += " --use_fp4"  # Blackwell native
        
        # Build engine command
        build_cmd = f"trtllm-build --checkpoint_dir {checkpoint_dir} --output_dir {engine_dir} --max_batch_size 8 --max_input_len 2048 --max_seq_len 2560 --gemm_plugin float16 --gpt_attention_plugin float16 --paged_kv_cache enable"
        
        commands[mode] = {
            'convert': convert_cmd,
            'build': build_cmd
        }
    
    return commands

# Generate commands for all modes
all_commands = generate_build_commands(config.model_name, config.engine_dir)

print("Build Commands for All Quantization Modes:")
print("="*60)
for mode, cmds in all_commands.items():
    print(f"\n### {mode.upper()} ###")
    # Print the full commands: truncating them hides the per-mode quantization flags
    print(f"1. Convert: {cmds['convert']}")
    print(f"2. Build: {cmds['build']}")

---

## Section 4: Running Inference with TensorRT-LLM

### 4.1 TensorRT-LLM Runtime

In [None]:
class TRTLLMInference:
    """
    TensorRT-LLM inference wrapper.
    
    Provides easy-to-use interface for running inference.
    """
    
    def __init__(self, engine_dir: str):
        self.engine_dir = engine_dir
        self.llm = None
        self.loaded = False
    
    def load(self):
        """Load the TensorRT-LLM engine."""
        if not TRT_LLM_AVAILABLE:
            print("TensorRT-LLM not available - using simulation mode")
            self.loaded = False
            return
        
        print(f"Loading engine from {self.engine_dir}...")
        
        try:
            self.llm = LLM(model=self.engine_dir)
            self.loaded = True
            print("Engine loaded successfully!")
        except Exception as e:
            print(f"Failed to load engine: {e}")
            self.loaded = False
    
    def generate(
        self,
        prompts: List[str],
        max_tokens: int = 100,
        temperature: float = 0.7,
        top_p: float = 0.9
    ) -> List[str]:
        """
        Generate text from prompts.
        
        Args:
            prompts: List of input prompts
            max_tokens: Maximum tokens to generate
            temperature: Sampling temperature
            top_p: Nucleus sampling parameter
            
        Returns:
            List of generated texts
        """
        if not self.loaded:
            # Simulation mode
            return [f"[Simulated output for: {p[:50]}...]" for p in prompts]
        
        sampling_params = SamplingParams(
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
        )
        
        outputs = self.llm.generate(prompts, sampling_params)
        
        return [output.outputs[0].text for output in outputs]
    
    def benchmark(
        self,
        prompt: str,
        max_tokens: int = 100,
        num_runs: int = 50,
        warmup_runs: int = 5,
        batch_sizes: Tuple[int, ...] = (1, 2, 4, 8)  # immutable default
    ) -> dict:
        """
        Benchmark inference performance.
        
        Returns:
            Dictionary with benchmark results
        """
        results = {}
        
        for batch_size in batch_sizes:
            prompts = [prompt] * batch_size
            
            # Warmup
            for _ in range(warmup_runs):
                self.generate(prompts, max_tokens=max_tokens)
            
            # Benchmark
            latencies = []
            for _ in range(num_runs):
                start = time.perf_counter()
                outputs = self.generate(prompts, max_tokens=max_tokens)
                latency = (time.perf_counter() - start) * 1000  # ms
                latencies.append(latency)
            
            mean_latency = np.mean(latencies)
            std_latency = np.std(latencies)
            # Upper bound: assumes every request generates all max_tokens (no early EOS)
            tokens_per_sec = (batch_size * max_tokens) / (mean_latency / 1000)
            
            results[batch_size] = {
                'mean_latency_ms': mean_latency,
                'std_latency_ms': std_latency,
                'tokens_per_second': tokens_per_sec,
                'throughput_requests_per_sec': batch_size / (mean_latency / 1000)
            }
            
            print(f"Batch {batch_size}: {mean_latency:.1f}ms, {tokens_per_sec:.0f} tok/s")
        
        return results


# Note: Actual loading requires built engine
print("TensorRT-LLM inference class ready!")
print("(Engine loading requires pre-built engine files)")
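
The `benchmark` method reports only mean and standard deviation. For serving workloads, tail latency is often more informative, so a small helper like the following could summarize the collected `latencies` list. This is a sketch using only the standard library; the function name and the synthetic samples are made up for illustration.

```python
import statistics

def latency_summary(latencies_ms):
    """Summarize latency samples (ms) with mean and tail percentiles."""
    # quantiles(n=100) returns the 99 cut points p1..p99
    pct = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p50_ms": pct[49],
        "p95_ms": pct[94],
        "p99_ms": pct[98],
    }

# Synthetic samples: mostly 20 ms with two slow outliers
samples = [20.0] * 48 + [80.0, 120.0]
summary = latency_summary(samples)
print({k: round(v, 1) for k, v in summary.items()})
```

A few slow requests barely move the median but show up clearly in p99, which is why both are worth tracking.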

### 4.2 Compare with Vanilla Transformers

In [None]:
def compare_inference_methods(
    model_name: str,
    prompt: str = "The future of artificial intelligence is",
    max_tokens: int = 50,
    num_runs: int = 20
) -> pd.DataFrame:
    """
    Compare vanilla transformers vs TensorRT-LLM performance.
    
    Note: For demonstration, uses simulated TRT-LLM results.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    results = []
    
    # Vanilla Transformers FP16
    print("Benchmarking vanilla Transformers (FP16)...")
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        
        # Warmup
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        for _ in range(3):
            model.generate(**inputs, max_new_tokens=max_tokens, do_sample=False)
        
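        # Benchmark (synchronize so pending GPU work finishes before the timer stops)
        latencies = []
        for _ in tqdm(range(num_runs), desc="Transformers"):
            start = time.perf_counter()
            outputs = model.generate(**inputs, max_new_tokens=max_tokens, do_sample=False)
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            latency = (time.perf_counter() - start) * 1000
            latencies.append(latency)
        
        mean_lat = np.mean(latencies)
        # Count tokens actually generated; greedy decoding can stop early at EOS
        new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
        tps = new_tokens / (mean_lat / 1000)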
        
        results.append({
            'Method': 'Transformers FP16',
            'Latency (ms)': mean_lat,
            'Tokens/sec': tps,
            'Memory (GB)': get_gpu_memory().get('used_gb', 0),
            'Speedup': 1.0
        })
        
        baseline_lat = mean_lat
        
        del model
        clear_memory()
        
    except Exception as e:
        print(f"Transformers benchmark failed: {e}")
        baseline_lat = 100  # Fallback
    
    # Simulated TensorRT-LLM results (illustrative factors, not measurements)
    # Tuple layout: (name, latency factor, throughput factor, memory factor)
    trt_configs = [
        ('TRT-LLM FP16', 1.0, 0.5, 0.85),
        ('TRT-LLM INT8', 0.75, 0.6, 0.55),
        ('TRT-LLM FP8', 0.55, 0.7, 0.50),
        ('TRT-LLM INT4-AWQ', 0.65, 0.55, 0.35),
        ('TRT-LLM FP4 (Blackwell)', 0.40, 0.85, 0.30),
    ]
    
    # Fix the memory baseline once. Reading results[0] inside the loop would pick up
    # a simulated row whenever the Transformers benchmark failed.
    baseline_mem = results[0]['Memory (GB)'] if results else 3.0
    
    print("\nSimulated TensorRT-LLM results (typical improvements):")
    for name, lat_factor, tps_factor, mem_factor in trt_configs:
        # Apply typical TRT-LLM improvement factors
        lat = baseline_lat * lat_factor
        tps = (max_tokens / (lat / 1000)) * tps_factor * 2  # TRT-LLM typically 2x+ faster
        mem = baseline_mem * mem_factor
        
        results.append({
            'Method': name,
            'Latency (ms)': lat,
            'Tokens/sec': tps,
            'Memory (GB)': mem,
            'Speedup': baseline_lat / lat
        })
    
    return pd.DataFrame(results)


# Run comparison
print("Running inference method comparison...")
comparison_df = compare_inference_methods(config.model_name)

print("\n" + "="*60)
print("Inference Method Comparison:")
print("="*60)
print(comparison_df.to_string(index=False))

### 4.3 Visualize Performance Comparison

In [None]:
def plot_inference_comparison(df: pd.DataFrame):
    """
    Visualize inference performance comparison.
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Throughput comparison
    ax1 = axes[0]
    colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(df)))
    bars = ax1.bar(df['Method'], df['Tokens/sec'], color=colors, edgecolor='black')
    ax1.set_xlabel('Method', fontsize=10)
    ax1.set_ylabel('Tokens/second', fontsize=10)
    ax1.set_title('Throughput Comparison', fontsize=12, fontweight='bold')
    ax1.tick_params(axis='x', rotation=45, labelsize=8)
    
    for bar, val in zip(bars, df['Tokens/sec']):
        ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
                 f'{val:.0f}', ha='center', va='bottom', fontsize=8)
    
    # Latency comparison
    ax2 = axes[1]
    bars = ax2.bar(df['Method'], df['Latency (ms)'], color=colors, edgecolor='black')
    ax2.set_xlabel('Method', fontsize=10)
    ax2.set_ylabel('Latency (ms)', fontsize=10)
    ax2.set_title('Latency Comparison', fontsize=12, fontweight='bold')
    ax2.tick_params(axis='x', rotation=45, labelsize=8)
    
    # Memory comparison
    ax3 = axes[2]
    bars = ax3.bar(df['Method'], df['Memory (GB)'], color=colors, edgecolor='black')
    ax3.set_xlabel('Method', fontsize=10)
    ax3.set_ylabel('Memory (GB)', fontsize=10)
    ax3.set_title('Memory Usage', fontsize=12, fontweight='bold')
    ax3.tick_params(axis='x', rotation=45, labelsize=8)
    
    plt.tight_layout()
    plt.savefig(Path(config.engine_dir) / 'inference_comparison.png', dpi=150, bbox_inches='tight')
    plt.show()


# Visualize comparison
plot_inference_comparison(comparison_df)

---

## Section 5: Production Deployment Configuration

### 5.1 Triton Inference Server Configuration

In [None]:
def generate_triton_config(
    model_name: str,
    max_batch_size: int = 8,
    instance_count: int = 1
) -> str:
    """
    Generate Triton Inference Server model configuration.
    """
    config = f"""
# Triton Model Configuration for {model_name}
# Generated for DGX Spark with TensorRT-LLM

name: "{model_name.replace('/', '_')}"
backend: "tensorrtllm"
max_batch_size: {max_batch_size}

model_transaction_policy {{
  decoupled: true
}}

input [
  {{
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }},
  {{
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }},
  {{
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  }},
  {{
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  }}
]

output [
  {{
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }}
]

instance_group [
  {{
    count: {instance_count}
    kind: KIND_GPU
  }}
]

parameters: {{
  key: "gpt_model_type"
  value: {{
    string_value: "inflight_fused_batching"
  }}
}}

parameters: {{
  key: "kv_cache_type"
  value: {{
    string_value: "paged"
  }}
}}

parameters: {{
  key: "max_tokens_in_paged_kv_cache"
  value: {{
    string_value: "16384"
  }}
}}

dynamic_batching {{
  preferred_batch_size: [1, 2, 4, 8]
  max_queue_delay_microseconds: 100
}}
"""
    return config


# Generate config
triton_config = generate_triton_config(config.model_name)

# Save config
config_path = Path(config.engine_dir) / "triton_config.pbtxt"
with open(config_path, 'w') as f:
    f.write(triton_config)

print(f"Triton configuration saved to: {config_path}")
print("\n" + triton_config)
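
Once a server is running with this configuration, a client request might look like the sketch below. The JSON keys mirror the input names declared in the config. The URL layout of the generate endpoint, the host, and the helper names are assumptions; because the config enables decoupled mode, the streaming variant of the endpoint may be required instead.

```python
import json
import urllib.request

def build_generate_request(prompt: str, max_tokens: int = 64,
                           temperature: float = 0.7, top_p: float = 0.9) -> dict:
    """JSON body whose keys mirror the input names in the model config."""
    return {"text_input": prompt, "max_tokens": max_tokens,
            "temperature": temperature, "top_p": top_p}

def send_generate(model: str, body: dict, host: str = "localhost:8000") -> dict:
    """POST to Triton's generate endpoint (assumed URL layout; needs a live server)."""
    req = urllib.request.Request(
        f"http://{host}/v2/models/{model}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())

body = build_generate_request("The future of artificial intelligence is")
print(json.dumps(body, indent=2))

# With the server running:
# print(send_generate("meta-llama_Llama-3.2-3B-Instruct", body)["text_output"])
```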

### 5.2 Docker Compose for Production

In [None]:
docker_compose = """
# Docker Compose for TensorRT-LLM Production Deployment
# Optimized for DGX Spark

version: '3.8'

services:
  triton-trtllm:
    image: nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
    container_name: triton-llm
    ports:
      - "8000:8000"  # HTTP
      - "8001:8001"  # gRPC
      - "8002:8002"  # Metrics
    volumes:
      - ./model_repository:/models
      - ./engines:/engines
    environment:
      - CUDA_VISIBLE_DEVICES=0
    command: |
      tritonserver 
        --model-repository=/models 
        --http-port=8000 
        --grpc-port=8001 
        --metrics-port=8002
        --log-verbose=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v2/health/ready"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    depends_on:
      - triton-trtllm

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  grafana-data:
"""

# Save docker-compose
compose_path = Path(config.engine_dir) / "docker-compose.yml"
with open(compose_path, 'w') as f:
    f.write(docker_compose)

print(f"Docker Compose configuration saved to: {compose_path}")
print("\nTo start the production server:")
print(f"  cd {config.engine_dir}")
print("  docker-compose up -d")
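
The Compose file mounts `./prometheus.yml`, which nothing above creates. A minimal sketch of generating one follows; the job name and 15-second interval are arbitrary choices, and the target assumes the `triton-trtllm` service name and metrics port 8002 from the Compose file.

```python
from pathlib import Path

# Minimal Prometheus config: scrape Triton's metrics port via the Compose service name
prometheus_yml = """\
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: triton
    static_configs:
      - targets: ['triton-trtllm:8002']
"""

def write_prometheus_config(directory: str) -> Path:
    """Write prometheus.yml into `directory` and return its path."""
    path = Path(directory) / "prometheus.yml"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(prometheus_yml)
    return path

# Example (writes next to docker-compose.yml):
# write_prometheus_config(config.engine_dir)
```

Creating the file before `docker-compose up` matters: if the host path is missing, Docker creates a directory in its place and Prometheus fails to start.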

---

## Section 6: Performance Optimization Tips

### 6.1 DGX Spark Blackwell Optimizations

In [None]:
optimization_tips = """
============================================================
TensorRT-LLM Optimization Tips for DGX Spark Blackwell
============================================================

1. QUANTIZATION SELECTION
   -----------------------
   • Use FP8 for best quality/speed balance (native Blackwell support)
   • Use FP4 for maximum throughput (2-3x faster than FP16)
   • Avoid INT4 if quality is critical - use FP4 instead

2. BATCHING CONFIGURATION
   ----------------------
   • Enable inflight batching for variable-length requests
   • Set max_batch_size based on expected load (8-16 typical)
   • Use paged KV-cache to maximize concurrent requests

3. MEMORY OPTIMIZATION
   -------------------
   • DGX Spark has 128GB unified memory - use it!
   • Set max_num_tokens based on expected sequence lengths
   • Enable a quantized KV-cache (FP8 halves KV-cache memory relative to FP16)

4. KERNEL OPTIMIZATION
   -------------------
   • Always enable Flash Attention 2
   • Enable fused MLP operations
   • Use GEMM plugin with appropriate dtype

5. BUILD OPTIMIZATION
   ------------------
   • Use builder_optimization_level=5 for production
   • Enable remove_input_padding for variable lengths
   • Profile with different batch sizes to find optimal

6. BLACKWELL-SPECIFIC
   ------------------
   • Native FP4 support = 3-4x throughput vs FP16
   • FP8 compute = minimal quality loss, 2x speedup
   • Unified memory lets the GPU address the full 128GB pool, but weights + KV-cache must still fit in it

EXPECTED PERFORMANCE ON DGX SPARK (tokens/sec, batch=1):
========================================================

| Model Size | FP16 | FP8 | FP4 |
|------------|------|-----|-----|
| 3B params  | 100  | 200 | 350 |
| 8B params  | 40   | 80  | 150 |
| 70B params | 15   | 30  | 60  |

* Treat these as rough expectations; benchmark your own engine to confirm
* With batching, throughput scales nearly linearly up to memory limits
"""

print(optimization_tips)
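
The in-flight batching tip can be illustrated with a toy scheduling model. The sketch below counts decode steps only: one step produces one token per active slot, prefill is ignored, and the request lengths are made up. It is a simplified model of the idea, not how the TensorRT-LLM scheduler is implemented.

```python
import heapq

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def inflight_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next waiting request."""
    slots = [0] * batch_size   # step at which each slot becomes free
    heapq.heapify(slots)
    for n in lengths:
        free_at = heapq.heappop(slots)      # earliest available slot
        heapq.heappush(slots, free_at + n)
    return max(slots)

# Output lengths (decode steps) for eight requests; values are made up
lengths = [500, 20, 30, 40, 450, 25, 35, 45]
static_steps = static_batching_steps(lengths, batch_size=4)
inflight_steps = inflight_batching_steps(lengths, batch_size=4)
print("static:   ", static_steps)
print("in-flight:", inflight_steps)
```

With static batching, short requests wait on the longest one in their batch. With in-flight batching, the total is bounded mainly by the longest single request.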

### 6.2 Troubleshooting Guide

In [None]:
troubleshooting_guide = """
============================================================
TensorRT-LLM Troubleshooting Guide
============================================================

ISSUE: "Out of memory" during engine build
SOLUTION:
  - Reduce max_batch_size
  - Reduce max_input_len + max_output_len
  - Use weight-only quantization first (INT8/INT4)
  - Close other GPU processes

ISSUE: Engine build fails with quantization
SOLUTION:
  - Ensure calibration data is valid
  - Check model architecture is supported
  - Try simpler quantization (INT8 before FP8)
  - Update TensorRT-LLM to latest version

ISSUE: Slow inference speed
SOLUTION:
  - Enable all plugins (GEMM, attention, etc.)
  - Check GPU utilization with nvidia-smi
  - Profile with nsys/ncu for bottlenecks
  - Verify quantization is applied correctly

ISSUE: Quality degradation after quantization
SOLUTION:
  - Use larger calibration dataset (1000+ samples)
  - Try SmoothQuant for INT8
  - Use FP8 instead of INT4
  - Decrease group size for weight quantization (e.g. 128 -> 64) for finer-grained scales

ISSUE: Triton server crashes
SOLUTION:
  - Check model configuration (pbtxt)
  - Verify engine path is correct
  - Check CUDA/TensorRT version compatibility
  - Review Triton logs: docker logs triton-llm

COMMON COMMANDS:
================

# Check GPU status
nvidia-smi

# Check TensorRT-LLM version
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Profile engine build
nsys profile -o trtllm_build trtllm-build ...

# Test Triton health
curl localhost:8000/v2/health/ready

# Get model metrics
curl localhost:8002/metrics | grep trtllm
"""

print(troubleshooting_guide)
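
The `nvidia-smi` check from the guide can be scripted so it runs alongside a benchmark. A minimal sketch, assuming `nvidia-smi` is on the PATH. The helper names are illustrative, and any field the driver reports as non-numeric (for example `[N/A]`, which may appear for memory on some systems) is returned as `None`:

```python
import subprocess

QUERY_FIELDS = ["utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output, one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        row = {}
        for field, value in zip(QUERY_FIELDS, values):
            try:
                row[field] = float(value)
            except ValueError:
                row[field] = None  # e.g. "[N/A]" or "[Not Supported]"
        stats.append(row)
    return stats


def query_gpu_stats() -> list[dict]:
    """Run nvidia-smi and return utilization (%) and memory (MiB) per GPU."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(output)


# Example with captured output, so the parsing can be checked without a GPU
print(parse_gpu_stats("87, 10240, 24576\n"))
```

Calling `query_gpu_stats()` periodically during generation shows whether the GPU is actually busy. Persistently low utilization usually points to a bottleneck outside the engine, such as tokenization or request handling.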

---

## Section 7: Expected Results Summary

### 7.1 Benchmark Results on DGX Spark

In [None]:
# Expected TensorRT-LLM results on DGX Spark
import pandas as pd  # harmless if already imported earlier in the notebook
expected_results = pd.DataFrame([
    {
        'Configuration': 'Llama-3.2-3B FP16',
        'Batch=1 (tok/s)': 95,
        'Batch=4 (tok/s)': 350,
        'Batch=8 (tok/s)': 650,
        'Memory (GB)': 6.0,
        'Quality': '100%'
    },
    {
        'Configuration': 'Llama-3.2-3B FP8',
        'Batch=1 (tok/s)': 180,
        'Batch=4 (tok/s)': 650,
        'Batch=8 (tok/s)': 1150,
        'Memory (GB)': 3.2,
        'Quality': '99.5%'
    },
    {
        'Configuration': 'Llama-3.2-3B FP4',
        'Batch=1 (tok/s)': 280,
        'Batch=4 (tok/s)': 1000,
        'Batch=8 (tok/s)': 1800,
        'Memory (GB)': 1.8,
        'Quality': '98%'
    },
    {
        'Configuration': 'Llama-3.2-8B FP8',
        'Batch=1 (tok/s)': 85,
        'Batch=4 (tok/s)': 320,
        'Batch=8 (tok/s)': 580,
        'Memory (GB)': 8.5,
        'Quality': '99.3%'
    },
    {
        'Configuration': 'Llama-3.1-70B FP4',
        'Batch=1 (tok/s)': 45,
        'Batch=4 (tok/s)': 160,
        'Batch=8 (tok/s)': 280,
        'Memory (GB)': 42,
        'Quality': '97%'
    },
])

print("Expected TensorRT-LLM Results on DGX Spark Blackwell:")
print("="*75)
print(expected_results.to_string(index=False))
print("\n" + "="*75)
print("\nKey Observations:")
print("  • FP8 provides ~1.9x throughput over FP16 (3B, batch=1) with <1% quality loss")
print("  • FP4 provides ~3x throughput over FP16 - ideal for high-traffic applications")
print("  • 70B model fits in memory with FP4 quantization")
print("  • Batching scales throughput close to linearly: about 6-7x at batch=8")
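
The observations above can be recomputed directly from the table. A small sketch in plain Python, using the Llama-3.2-3B figures listed in `expected_results`. Both helpers are illustrative, and an efficiency of 1.0 would mean perfectly linear scaling with batch size:

```python
# Throughput figures (tokens/sec) copied from the expected-results table above
throughput = {
    "FP16": {1: 95, 4: 350, 8: 650},
    "FP8": {1: 180, 4: 650, 8: 1150},
    "FP4": {1: 280, 4: 1000, 8: 1800},
}


def speedup_vs_fp16(precision: str, batch: int = 1) -> float:
    """Throughput of `precision` relative to FP16 at the same batch size."""
    return throughput[precision][batch] / throughput["FP16"][batch]


def scaling_efficiency(precision: str, batch: int) -> float:
    """Batch throughput divided by ideal linear scaling from batch=1."""
    return throughput[precision][batch] / (throughput[precision][1] * batch)


for precision in throughput:
    print(
        f"{precision}: {speedup_vs_fp16(precision):.2f}x vs FP16, "
        f"batch=8 efficiency {scaling_efficiency(precision, 8):.0%}"
    )
```

The same two ratios are worth computing on your own measurements, since they show at a glance whether quantization or batching is delivering the expected gain.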

---

## Summary and Key Takeaways

### What We Learned

```
Professor SPARK's Summary:

"TensorRT-LLM is THE production solution for LLM deployment on NVIDIA hardware.

Key benefits:
1. 2-5x faster than vanilla PyTorch/Transformers
2. Native FP8/FP4 support on Blackwell = maximum performance
3. Paged KV-cache = handle more concurrent requests
4. In-flight batching = maximize GPU utilization
5. Triton integration = production-ready serving

For DGX Spark:
- Use FP8 for best quality/speed balance
- Use FP4 for maximum throughput
- 128GB unified memory enables large models without swapping
- Single GPU can serve production traffic efficiently

Remember: The engine build takes time (10-30 min), but runtime is FAST!"
```

### Module 3.2 Complete!

You have now learned:

1. **Data types** (FP32 → FP16 → BF16 → INT8 → INT4 → FP8 → FP4)
2. **NVFP4** - Blackwell's native 4-bit format with micro-block scaling
3. **FP8 training and inference** - E4M3 vs E5M2 with Transformer Engine
4. **GPTQ** - Hessian-based post-training quantization
5. **AWQ** - Activation-aware weight quantization
6. **GGUF** - llama.cpp format for CPU/mixed inference
7. **Quality benchmarks** - Perplexity + MMLU evaluation
8. **TensorRT-LLM** - Production deployment with optimized engines

### Next Steps

- Module 3.3: Model Deployment with Triton and FastAPI
- Apply quantization to your own models
- Build production inference pipelines

---

## Exercises

### Exercise 1: Build Custom Engine

Build a TensorRT-LLM engine with your own configuration.

In [None]:
# Exercise 1: Create custom engine configuration

# TODO: Modify these parameters for your use case
custom_config = TRTLLMConfig(
    model_name="your-model-name",
    quantization="fp8",  # Try: none, int8, fp8, int4_awq, fp4
    max_batch_size=16,
    max_input_len=4096,
    max_output_len=1024
)

# Generate build commands
# custom_builder = TRTLLMEngineBuilder(custom_config)
# custom_builder.build_engine_cli()
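
Before building, it can help to estimate how much KV-cache the chosen limits imply, since it grows with `max_batch_size * (max_input_len + max_output_len)`. A rough sketch, assuming an FP16 cache and Llama-3.2-3B-like dimensions (28 layers, 8 KV heads, head size 128). These dimensions are assumptions to replace with the values from your model's `config.json`, and `kv_cache_gib` is an illustrative helper rather than a TensorRT-LLM API:

```python
def kv_cache_gib(
    max_batch_size: int,
    max_input_len: int,
    max_output_len: int,
    num_layers: int = 28,      # assumed: Llama-3.2-3B-like
    num_kv_heads: int = 8,     # assumed: grouped-query attention
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # FP16 cache; 1 for an FP8 cache
) -> float:
    """Worst-case KV-cache size in GiB if every sequence reaches max length."""
    # Keys and values (factor 2) are stored per layer, per KV head, per token
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    max_tokens = max_batch_size * (max_input_len + max_output_len)
    return bytes_per_token * max_tokens / 2**30


# Same limits as custom_config above
print(f"{kv_cache_gib(16, 4096, 1024):.2f} GiB")  # 8.75 GiB
```

This is an upper bound: with a paged KV-cache, blocks are allocated as sequences grow, so typical usage is lower unless every request runs to the maximum length.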

### Exercise 2: Benchmark Different Batch Sizes

Find the optimal batch size for your workload.

In [None]:
# Exercise 2: Batch size optimization

# TODO: Run benchmarks with different batch sizes
batch_sizes = [1, 2, 4, 8, 16, 32]

# For each batch size, measure:
# - Throughput (tokens/second)
# - Latency (per-request)
# - Memory usage

# Find the sweet spot for your latency requirements
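
One way to structure the sweep, as a sketch: the `generate_batch` callable is hypothetical and stands in for whatever runs your engine (it takes a list of prompts and returns the number of tokens generated). A dummy implementation is used here so the harness itself runs anywhere.

```python
import time
from typing import Callable


def sweep_batch_sizes(
    generate_batch: Callable[[list[str]], int],
    batch_sizes: list[int],
    prompt: str = "Hello",
    repeats: int = 3,
) -> list[dict]:
    """Measure throughput and per-batch latency for each batch size."""
    results = []
    for batch_size in batch_sizes:
        prompts = [prompt] * batch_size
        generate_batch(prompts)  # warm-up run, not timed
        start = time.perf_counter()
        total_tokens = sum(generate_batch(prompts) for _ in range(repeats))
        elapsed = time.perf_counter() - start
        results.append({
            "batch_size": batch_size,
            "tokens_per_sec": total_tokens / elapsed,
            "latency_s": elapsed / repeats,  # time to finish one full batch
        })
    return results


def dummy_generate(prompts: list[str]) -> int:
    """Stand-in for a real engine call: pretends each prompt yields 32 tokens."""
    time.sleep(0.001 + 0.0005 * len(prompts))
    return 32 * len(prompts)


for row in sweep_batch_sizes(dummy_generate, [1, 2, 4, 8]):
    print(row)
```

Memory usage is not captured by this harness and would need a separate measurement. The "sweet spot" is typically the largest batch size whose `latency_s` still meets your latency target.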

### Exercise 3: Deploy to Triton

Create a complete Triton deployment.

In [None]:
# Exercise 3: Triton deployment

# TODO:
# 1. Build your TensorRT-LLM engine
# 2. Create model repository structure:
#    model_repository/
#      your_model/
#        1/
#          model.plan  (or engine files)
#        config.pbtxt

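# 3. Start Triton server (named so `docker logs triton-llm` works; port 8002 serves metrics):
#    docker run --gpus all --name triton-llm \
#      -p 8000:8000 -p 8001:8001 -p 8002:8002 \
#      -v "$(pwd)/model_repository":/models \
#      nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3 \
#      tritonserver --model-repository=/models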

# 4. Test with curl:
#    curl -X POST localhost:8000/v2/models/your_model/infer \
#      -H 'Content-Type: application/json' \
#      -d '{"inputs": [...]}'
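
Before sending inference requests, it is worth waiting until the server reports ready. A minimal sketch using only the standard library, polling the same `/v2/health/ready` endpoint used in the troubleshooting guide. The function names and the default URL are assumptions for a local deployment:

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET /v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or an HTTP error status


def wait_until_ready(
    base_url: str = "http://localhost:8000",
    max_wait_s: float = 120.0,
    poll_s: float = 2.0,
) -> bool:
    """Poll the readiness endpoint until it succeeds or max_wait_s elapses."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if is_ready(base_url):
            return True
        time.sleep(poll_s)
    return False


# Single check against the default local address
print("Triton ready:", is_ready())
```

Loading a large engine can take a while, so calling `wait_until_ready()` right after `docker run` avoids failures caused by requests that arrive before the model has finished loading.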