## Overview

What makes LLM inference fast on GPUs? This post summarizes the core CUDA kernels, memory layouts, and scheduling tricks that dominate latency and throughput.

- Attention with KV cache (paged KV, continuous batching, block-sparse)
- GEMM-heavy layers (QKV projections, MLP) and tensor cores
- Quantization (weight-only, KV cache) and dequant overhead
- Serving stacks (vLLM, TensorRT-LLM, Triton Inference Server) and scheduling




```{=html}
<div style="text-align:center;">
  <img src="vllm-logo-text-light.png" alt="vLLM" width="35%"/>
  <p><em>Figure 1. vLLM logo.</em></p>
</div>
```



## Kernel hotspots

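- Attention: QK^T, softmax, and AV. Fused kernels (e.g., FlashAttention) compute all three in one pass without writing the full score matrix to GPU memory, which cuts memory traffic.
- KV cache: paged layouts store keys and values in fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks. Swizzling the layout within a block keeps memory accesses coalesced.
- GEMM: the QKV projection, output projection, and MLP GEMMs dominate FLOPs, so they should run on tensor cores.
- Quantization: weight-only (e.g., W8A16) vs. weight-and-activation quantization. KV cache quantization (e.g., 8-bit) saves VRAM.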
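
To make the "fused kernels reduce memory traffic" point concrete, here is a minimal pure-Python sketch of the streaming (online) softmax trick that fused attention kernels build on. It handles a single query row and visits keys and values chunk by chunk, so the full row of scores is never stored. The function names and the chunk size are illustrative assumptions, and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math
from typing import List, Sequence


def naive_attention_row(q: Sequence[float],
                        keys: Sequence[Sequence[float]],
                        values: Sequence[Sequence[float]]) -> List[float]:
    """Reference: materialize all scores, softmax, then weight the values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom
            for d in range(dim)]


def streaming_attention_row(q: Sequence[float],
                            keys: Sequence[Sequence[float]],
                            values: Sequence[Sequence[float]],
                            chunk: int = 2) -> List[float]:
    """Same result, but only a running max, denominator, and sum are kept."""
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), chunk):
        k_chunk = keys[start:start + chunk]
        v_chunk = values[start:start + chunk]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old running max.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            denom += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.25]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
naive_out = naive_attention_row(q, keys, values)
stream_out = streaming_attention_row(q, keys, values)
print(naive_out, stream_out)
```

The two outputs agree up to floating-point rounding. On a GPU the same rescaling lets each thread block keep its partial results in registers or shared memory instead of round-tripping the score matrix through global memory.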

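The block-table idea behind a paged KV cache can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the implementation used by vLLM or any other serving stack: `BLOCK_SIZE`, `BlockAllocator`, and `physical_slot` are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per KV block (assumed value, for illustration only)


class BlockAllocator:
    """Toy allocator: hands out physical block ids from a free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> Tuple[int, int]:
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:
            # The last block is full (or the sequence is new): take a new block.
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def physical_slot(self, seq_id: str, token_pos: int) -> Tuple[int, int]:
        """Translate a logical token position to (physical_block, offset)."""
        if not 0 <= token_pos < self.lengths[seq_id]:
            raise IndexError(token_pos)
        table = self.block_tables[seq_id]
        return table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=8)
# Interleave two sequences so their physical blocks end up non-contiguous.
for _ in range(5):
    alloc.append_token("a")
    alloc.append_token("b")

print(alloc.block_tables)
print(alloc.physical_slot("a", 4))
```

Because a sequence only ever holds whole blocks, at most `BLOCK_SIZE - 1` slots per sequence are wasted, and a finished sequence's blocks can be reused immediately by another request.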

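For a rough sense of how much VRAM KV cache quantization saves, the sketch below computes the cache size from the model shape. The dimensions are illustrative assumptions (roughly the shape of a 7B-class decoder with full multi-head attention), and `kv_cache_bytes` is a hypothetical helper rather than an API from any library.

```python
def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   num_tokens: int,
                   bytes_per_element: float) -> float:
    """Bytes needed to cache keys and values for num_tokens tokens.

    The factor of 2 accounts for storing both K and V at every layer.
    Quantization scales and zero-points are ignored for simplicity.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


# Assumed shape: 32 layers, 32 KV heads, head_dim 128 (illustrative only).
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)
GIB = 1024 ** 3

fp16_gib = kv_cache_bytes(**shape, num_tokens=4096, bytes_per_element=2) / GIB
int8_gib = kv_cache_bytes(**shape, num_tokens=4096, bytes_per_element=1) / GIB
print(f"FP16 KV cache, 4096 tokens: {fp16_gib:.2f} GiB")
print(f"8-bit KV cache, 4096 tokens: {int8_gib:.2f} GiB")
```

Under these assumptions a single 4096-token sequence needs 2 GiB of cache in FP16 and 1 GiB at 8 bits, so halving the element size roughly doubles how many tokens fit in the same memory budget.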

```python
# Simple throughput estimator
from typing import Optional


def estimate_throughput(tokens_per_request: int,
                        requests_per_second: float,
                        decode_tokens_per_s: float,
                        prefill_tokens_per_s: Optional[float] = None,
                        prefill_tokens: int = 0) -> float:
    """
    Crude throughput estimate in decode-equivalent tokens/sec.
    Assumes the server keeps up with the request rate.
    - tokens_per_request: average generated tokens per request
    - requests_per_second: steady-state request rate
    - decode_tokens_per_s: steady-state decode speed (tokens/sec) across the batch
    - prefill_tokens_per_s: optional prefill speed (tokens/sec); if None, prefill is ignored
    - prefill_tokens: average prompt tokens per request
    """
    # Decode work: generated tokens demanded per second.
    decode_toks = tokens_per_request * requests_per_second
    total = decode_toks
    if prefill_tokens_per_s is not None and prefill_tokens > 0:
        # Prefill work, converted to decode-equivalent tokens: one prompt token
        # takes 1 / prefill_tokens_per_s seconds, during which the GPU could
        # have decoded decode_tokens_per_s / prefill_tokens_per_s tokens.
        total += (prefill_tokens * requests_per_second) * (decode_tokens_per_s / prefill_tokens_per_s)
    return total


print(estimate_throughput(tokens_per_request=200,
                          requests_per_second=5.0,
                          decode_tokens_per_s=200000.0,
                          prefill_tokens_per_s=1200000.0,
                          prefill_tokens=1000))
```
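
One way to sanity-check an estimate like this is to ask what fraction of each second the GPU would be busy. The sketch below uses a hypothetical helper, `busy_fraction`, that is not part of the estimator above. It assumes prefill and decode run back to back on one device at the stated steady-state speeds, which ignores batching effects and any overlap between the two phases.

```python
def busy_fraction(tokens_per_request: int,
                  requests_per_second: float,
                  decode_tokens_per_s: float,
                  prefill_tokens_per_s: float,
                  prefill_tokens: int) -> float:
    """Seconds of GPU work demanded per wall-clock second.

    Values above 1.0 mean the request rate cannot be sustained.
    """
    decode_time = tokens_per_request * requests_per_second / decode_tokens_per_s
    prefill_time = prefill_tokens * requests_per_second / prefill_tokens_per_s
    return decode_time + prefill_time


util = busy_fraction(tokens_per_request=200,
                     requests_per_second=5.0,
                     decode_tokens_per_s=200000.0,
                     prefill_tokens_per_s=1200000.0,
                     prefill_tokens=1000)
print(f"GPU busy fraction: {util:.4f}")
```

With the same example inputs the device is busy for under 1% of each second, so this load is far from saturating it. Scaling `requests_per_second` until the fraction reaches 1.0 gives a crude upper bound on the sustainable request rate.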
