## Overview

What makes LLM inference fast on GPUs? This post summarizes the core CUDA kernels, memory layouts, and scheduling tricks that dominate latency and throughput.

- Attention with KV cache (paged KV, continuous batching, block-sparse)
- GEMM-heavy layers (QKV projections, MLP) and tensor cores
- Quantization (weight-only, KV cache) and dequant overhead
- Serving stacks (vLLM, TensorRT-LLM, Triton Inference Server) and scheduling




```{=html}
<div style="text-align:center;">
  <img src="vllm-logo-text-light.png" alt="vLLM" width="35%"/>
  <p><em>Figure 1. vLLM logo</em></p>
</div>
```



## Kernel hotspots

- Attention: QK^T, softmax, and AV; fused kernels reduce memory traffic.
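- KV cache: paged allocation with block tables that map logical token positions to physical blocks; in-block layouts (swizzling) keep attention reads coalesced.
- GEMM: QKV, output projection, and MLP GEMMs dominate FLOPs and run on tensor cores; at small decode batches they are bound by weight bandwidth rather than compute.
- Quantization: weight-only (e.g., W8A16) vs. weight-and-activation quantization; KV cache quantization (e.g., 8-bit) saves VRAM.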
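
To make the weight-only idea concrete, here is a minimal pure-Python sketch of symmetric per-row int8 quantization. It is only an illustration: the helper names (`quantize_rows`, `matvec_dequant`) are made up for this post, activations stay as Python floats standing in for fp16, and real kernels fuse the dequantization into the GEMM rather than looping in Python.

```python
from typing import List, Tuple

Matrix = List[List[float]]

def quantize_rows(w: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per output row."""
    q_rows, scales = [], []
    for row in w:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the int8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def matvec_dequant(q_rows: List[List[int]], scales: List[float], x: List[float]) -> List[float]:
    """y = W x using int8 weights; the scale is applied once per row."""
    return [s * sum(qv * xi for qv, xi in zip(row, x))
            for row, s in zip(q_rows, scales)]

def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Full-precision reference."""
    return [sum(wv * xi for wv, xi in zip(row, x)) for row in w]

# Hypothetical toy weights and activations.
w = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
x = [1.0, 2.0, 3.0]

q, s = quantize_rows(w)
y_ref = matvec(w, x)
y_q = matvec_dequant(q, s, x)
max_err = max(abs(a - b) for a, b in zip(y_ref, y_q))
print("reference:", y_ref)
print("quantized:", y_q)
print("max abs error:", max_err)
```

Applying the scale after the dot product, rather than dequantizing each weight first, is the same algebra a fused dequant-GEMM kernel exploits: the weights travel through memory as one byte each.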



In [None]:
# Simple throughput estimator
from typing import Optional

def estimate_throughput(tokens_per_request: int,
                        requests_per_second: float,
                        decode_tokens_per_s: float,
                        prefill_tokens_per_s: Optional[float] = None,
                        prefill_tokens: int = 0) -> float:
    """
    Crude load estimate in decode-equivalent tokens/sec.
    - tokens_per_request: average generated tokens per request
    - requests_per_second: steady-state request rate
    - decode_tokens_per_s: aggregate decode rate (tokens/s) across the batch
    - prefill_tokens_per_s: optional prefill rate (tokens/s); if None, prefill is ignored
    - prefill_tokens: average prompt tokens per request
    Compare the result against decode_tokens_per_s to gauge utilization.
    """
    decode_toks = tokens_per_request * requests_per_second
    total = decode_toks
    if prefill_tokens_per_s is not None and prefill_tokens > 0:
        # Convert prompt tokens to decode-equivalent tokens: each one costs
        # decode_tokens_per_s / prefill_tokens_per_s of a decode token.
        total += (prefill_tokens * requests_per_second) * (decode_tokens_per_s / prefill_tokens_per_s)
    return total

print(estimate_throughput(tokens_per_request=200,
                          requests_per_second=5.0,
                          decode_tokens_per_s=200000.0,
                          prefill_tokens_per_s=1200000.0,
                          prefill_tokens=1000))


## vLLM vs TensorRT-LLM vs Triton (high level)

- vLLM: Continuous batching, paged KV cache, flexible Python API; great for dynamic workloads; supports multi-LoRA serving on a shared base model.
- TensorRT-LLM: Highly optimized CUDA/TensorRT kernels, graph capture, INT8/FP8 pipelines; excels on NVIDIA stacks for max perf.
- Triton (NVIDIA inference server): Orchestrates models/runtimes (can host vLLM or TensorRT-LLM backends), handles deployment/HTTP/gRPC/metrics.

Rule of thumb: for maximum single-GPU performance with static graphs, TensorRT-LLM can edge out vLLM; for dynamic batching and easy multi-tenant serving, vLLM is an excellent fit; Triton adds production serving features on top of either.



## KV cache sizing and paging

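- KV size per token ≈ 2 * num_layers * num_heads * head_dim * dtype_bytes; the factor 2 covers K and V, and with grouped-query or multi-query attention num_heads is the number of KV heads, not query heads.
- Multiply by max sequence length and batch size to estimate worst-case VRAM.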
- Paged KV: allocate in fixed-size blocks (e.g., 16/32/64 tokens) and map logical positions → physical pages; reduces fragmentation and enables efficient eviction.
- Swizzling/contiguous layout within blocks keeps memory coalesced for attention reads.
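
The logical-to-physical mapping can be sketched with a toy allocator. This is an assumption-laden illustration, not vLLM's implementation: the class name `PagedKVAllocator` and its methods are hypothetical, and it tracks only block indices, not the tensors themselves.

```python
from typing import Dict, List, Tuple

class PagedKVAllocator:
    """Toy block table: maps (sequence, token position) to (physical block, offset)."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.seq_lens: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> Tuple[int, int]:
        """Reserve a slot for the next token of seq_id."""
        table = self.block_tables.setdefault(seq_id, [])
        pos = self.seq_lens.get(seq_id, 0)
        if pos % self.block_size == 0:
            # Last block is full (or the sequence is new): grab a fresh block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would preempt or swap")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = pos + 1
        return table[pos // self.block_size], pos % self.block_size

    def lookup(self, seq_id: str, pos: int) -> Tuple[int, int]:
        """Translate a logical token position to its physical slot."""
        return self.block_tables[seq_id][pos // self.block_size], pos % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=4)
slots_a = [alloc.append_token("a") for _ in range(5)]  # spills into a second block
slot_b = alloc.append_token("b")                       # gets the next free block
looked_up = alloc.lookup("a", 4)
print(slots_a, slot_b, looked_up)

alloc.free("a")
print("free blocks after releasing 'a':", alloc.free_blocks)
```

Because blocks are handed out one at a time, a sequence wastes at most one partially filled block, which is where the fragmentation savings come from.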



In [None]:
# KV cache memory estimator (bytes)
from typing import Literal

dtype_sizes = {
    "fp16": 2,
    "bf16": 2,
    "fp32": 4,
    "int8": 1,
}

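def kv_cache_bytes(num_layers: int,
                   num_heads: int,
                   head_dim: int,
                   seq_len: int,
                   batch_size: int,
                   dtype: Literal["fp16", "bf16", "fp32", "int8"] = "fp16") -> int:
    """Worst-case KV cache size in bytes (every sequence at full seq_len).

    num_heads is the number of KV heads; with grouped-query or multi-query
    attention this is smaller than the number of query heads.
    """
    # Factor 2: one K and one V vector per head, per layer, per token.
    per_token = 2 * num_layers * num_heads * head_dim * dtype_sizes[dtype]
    total_tokens = seq_len * batch_size
    return per_token * total_tokens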

# Example: 32 layers, 32 heads, 128 dim, seq 4k, batch 16, fp16
print(kv_cache_bytes(32, 32, 128, 4096, 16, "fp16") / (1024**3), "GiB")
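
The same arithmetic can be turned around to ask how many sequences fit in a given KV budget. The sketch below is a rough estimate under stated assumptions: the 40 GiB budget is hypothetical, the helper names are invented for illustration, and quantization scale overhead is ignored for the int8 case.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of K and V stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_concurrent_sequences(kv_budget_bytes: int, bytes_per_token: int,
                             seq_len: int, block_size: int = 16) -> int:
    """Sequences of length seq_len that fit when KV is allocated in fixed blocks."""
    blocks_total = kv_budget_bytes // (bytes_per_token * block_size)
    blocks_per_seq = -(-seq_len // block_size)  # ceiling division
    return blocks_total // blocks_per_seq

budget = 40 * 1024**3  # hypothetical 40 GiB left for KV after weights

results = {}
for name, nbytes in (("fp16", 2), ("int8", 1)):
    per_tok = kv_bytes_per_token(32, 32, 128, nbytes)
    results[name] = max_concurrent_sequences(budget, per_tok, seq_len=4096)
    print(name, "->", results[name], "sequences of 4096 tokens")
```

Halving the bytes per element doubles the number of resident sequences, which is the main reason KV cache quantization matters for serving capacity.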


## Batching and scheduling

- Continuous batching: backfill finished sequences to keep kernels busy; great for variable-length requests.
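- Short-first / length-aware: schedule shorter jobs first to cut queueing latency; mix lengths within a batch for throughput and to avoid starving long requests.
- Speculative decoding: a cheap draft proposes several tokens that the target model verifies in one pass, trading extra compute for fewer sequential steps; helps most when decode is memory-bound and there is spare compute and memory for the draft.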
- Pinned memory and overlap: overlap H2D/D2H copies with compute using streams.
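
A small simulation shows why backfilling matters. It is a deliberately simplified model: every decode step is assumed to cost the same regardless of how many slots are occupied, prefill is ignored, and the function names are made up for this sketch.

```python
from collections import deque
from typing import List

def static_batching_steps(lengths: List[int], max_batch: int) -> int:
    """Fixed batches: each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))

def continuous_batching_steps(lengths: List[int], max_batch: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active: List[int] = []  # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        steps += 1
        # Each running sequence emits one token; drop the ones that just finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

# Hypothetical output lengths (tokens to generate per request).
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, max_batch=4)
continuous_steps = continuous_batching_steps(lengths, max_batch=4)
print("static:", static_steps, "continuous:", continuous_steps)
```

With static batching the short requests leave their slots idle while the long one finishes; backfilling lets the second long request start as soon as a slot opens.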



## Kernel-level optimization details

- Fused attention: fuse QK^T, scaling, softmax, and AV to reduce DRAM round-trips.
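- FlashAttention-style tiling: split Q, K, and V into tiles that fit in on-chip SRAM (shared memory) and use an online softmax, so the full score matrix never round-trips through global memory.
- Layouts: keep row/column-major layouts consistent with what the GEMM kernels expect; prefer tensor-core-friendly shapes (dimensions that are multiples of 8 or 16).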
- Persistent kernels: reduce launch overhead; good for steady-state decode.
- Quant & dequant fusion: fuse dequant→GEMM to avoid bandwidth blowups.
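
The tiling trick rests on the online softmax: keys and values are processed one tile at a time while a running max, running sum, and running output are rescaled as needed. The following is a pure-Python sketch for a single query vector, meant only to show the arithmetic; it is not the FlashAttention kernel, and the function names and toy data are assumptions.

```python
import math
from typing import List

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def attention_naive(q: List[float], keys: List[List[float]],
                    values: List[List[float]]) -> List[float]:
    """Reference: materialize all scores, then softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / denom
            for j in range(len(values[0]))]

def attention_tiled(q: List[float], keys: List[List[float]],
                    values: List[List[float]], tile: int = 2) -> List[float]:
    """Same result, but only one tile of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    d_v = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * d_v
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) * scale for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (exp(-inf) is 0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            p = math.exp(s - new_max)
            running_sum += p
            for j in range(d_v):
                acc[j] += p * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]

# Hypothetical toy inputs: 5 cached tokens, head_dim 2.
q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-0.5, 3.0]]

out_ref = attention_naive(q, keys, values)
out_tiled = attention_tiled(q, keys, values, tile=2)
max_diff = max(abs(a - b) for a, b in zip(out_ref, out_tiled))
print(out_ref, out_tiled, max_diff)
```

On a GPU the per-tile state lives in registers and shared memory, so the only global-memory traffic is reading K and V once and writing the final output.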



## System-level tips

- CUDA graphs: capture the steady-state decode step to cut CPU launch overhead.
- Streams: run prefill, decode, and I/O on separate streams, using stream priorities where needed.
- NUMA & pinning: pin host buffers and align worker affinity to avoid cross-socket traffic.
- Overlap: prefetch next batch to GPU while decoding current tokens.



## References

- vLLM docs and papers (continuous batching, paged KV)
- TensorRT-LLM samples and perf guides
- FlashAttention papers and kernels
- NVIDIA Triton Inference Server docs

