Bridging Runtime Gaps in LLVM: Vendor-Agnostic Dispatch for ML Kernels
EuroLLVM Developers' Meeting, Dublin 2026 — Poster Session
MLIR can compile a single gpu.module for multiple GPU vendors (NVIDIA, AMD, Intel) and pack the resulting images into one OffloadBinary. At runtime, however, the offload stack picks the first compatible image and stops: there is no metadata vocabulary for describing variants, no published measurement of dispatch cost, and no "best-compatible" selection mechanism.
This project addresses three gaps:
- OffloadBinary Metadata Vocabulary: 5 new standard keys (min_sm, min_gfx, requires_features, variant_priority, variant_tag) extending isMetadataCompatible()
- First Dispatch Stack Flame Graph: per-layer latency decomposition of the LLVM GPU dispatch path, measured on real hardware
- #gpu.runtime_select Design: an MLIR attribute that defers binary selection to runtime, inspired by CPU Function Multi-Versioning (IFunc)
| Metric | Value |
|---|---|
| Selection overhead | 3–6 ns per dispatch |
| Cold module load (cuModuleLoadData) | 36.0 µs (90% of cold path) |
| Hot-path total (launch + sync) | 4.1 µs |
| Overhead vs. 10 ms ML kernel | < 0.0001% |
| Prototype LOC | 5,100 (libkdl) + 664 (PoC) |
At 3–6 ns per dispatch, selection takes less time than a single L2 cache access.
├── poster/ # Conference poster (A0 HTML, slides)
│ ├── poster-combo-a.html # Main poster — open in browser, print to PDF
│ ├── slides.tex # Beamer slides
│ └── slides.pdf
├── experiments/
│ └── prototype/
│ ├── src/
│ │ ├── kdl.c / kdl.h # libkdl runtime library
│ │ ├── runtime_select_poc.c # PoC: real OffloadBinary dispatch
│ │ ├── bench_layers.c # Per-layer latency benchmark
│ │ ├── RuntimeSelectAttr.cpp.sketch # MLIR attribute design
│ │ └── Makefile
│ ├── benchmarks/ # Benchmark drivers + plotting
│ └── results/ # Benchmark figures
├── research/
│ ├── combo-a-deep-dive/
│ │ ├── proposals/ # RFCs, extended abstract, Q&A cards
│ │ ├── research/ # Benchmark data, statistical analysis
│ │ └── critiques/ # Review feedback
│ └── mega-survey/ # Literature survey (~450 sources)
├── literature/ # 40+ annotated paper summaries
└── findings.md # Core research findings
cd experiments/prototype/src
make # Builds libkdl.so + all benchmarks
./runtime_select_poc # Run the PoC dispatcher
./bench_layers # Run per-layer latency benchmark

Requirements: CUDA Toolkit 12+, GCC/Clang, Linux (tested on GTX 1650 sm_75, CUDA 13.1)
# Open in browser
xdg-open poster/poster-combo-a.html
# Print to PDF (A0)
# Chrome → Ctrl+P → Paper: Custom 841×1189mm → Margins: None → Background graphics: ON

- Metadata RFC on discourse.llvm.org: 5 keys, ~30 LOC header patch
- Flame graph benchmark in llvm-test-suite
- #gpu.runtime_select RFC: ~780 LOC, implements OffloadingLLVMTranslationAttrInterface
| System | Runtime Select? | Cross-Vendor? | MLIR-Native? | Ranked? |
|---|---|---|---|---|
| IREE HAL | Yes | Yes | Yes | Partial |
| chipStar | Yes (SPIR-V) | Yes | No | No |
| Proteus (LLNL) | Yes (JIT) | Partial | No | No |
| HetGPU | Yes (IR translate) | Yes | No | No |
| liboffload #186088 | Yes | Yes | No | No (first-wins) |
| CPU FMV (target_clones) | Yes | N/A | No | Yes (IFunc) |
| This Work | Metadata + Measurement + Design | Yes | Yes | Yes |
- IREE HAL — iree.dev
- chipStar — github.com/CHIP-SPV
- Proteus — CGO 2025
- HetGPU — arXiv:2506.15993
- Universal GPU ISA — arXiv:2603.28793
- KernelEvolve — ISCA 2026, arXiv:2512.23236
- AdaptiveCpp — IWOCL 2025
S. Akash — IIT Patna | CERN GSoC | vLLM contributor
Research project — see individual files for applicable licenses.