Manifest-driven ahead-of-time (AOT) builder for AITER kernels. QoLA wraps AITER's build_module() JIT compilation system with a declarative TOML manifest, producing either:
- pybind11 Python modules: standard `.so` files importable from Python (requires PyTorch)
- torch-free C-linkable shared libraries (`cpp_itfs` mode): plain `.so` files linked via HIP/ROCm with no PyTorch dependency
QoLA is designed for Transformer Engine to pre-build AITER attention (MHA) kernels at package install time, replacing hours-long JIT compilation with a structured, reproducible build.
- Declarative manifests: a single TOML file pins the AITER commit, target architectures, kernel modules, and MHA variant matrix
- torch-free builds: `cpp_itfs` mode eliminates the PyTorch build dependency for C-linkable libraries
- Symbol isolation: linker version scripts and C++ namespace wrapping prevent symbol collisions when multiple AITER-backed `.so` files coexist in one process
- No AITER modifications: QoLA reconstructs AITER's build namespace without importing `aiter`, and compiles AITER sources unmodified
- Python >= 3.10
- ROCm / HIP toolchain (hipcc)
- AITER source tree (included as a git submodule at `3rdparty/aiter/`)
- PyTorch (pybind mode only)
pip install -e .

# Build all modules declared in a manifest (pybind mode)
qola build \
--manifest example/te-manifest.toml \
--aiter-root 3rdparty/aiter \
--output-dir /tmp/qola-out
# Build in cpp_itfs mode (no PyTorch dependency)
qola build \
--manifest example/te-manifest.toml \
--aiter-root 3rdparty/aiter \
--output-dir /tmp/qola-out \
--mode cpp_itfs

| Option | Description |
|---|---|
| `--manifest` | Path to the TOML manifest file |
| `--aiter-root` | Path to the AITER source tree |
| `--output-dir` | Directory for build artifacts |
| `--arch` | Target GPU architecture (repeatable, e.g. `--arch gfx950`) |
| `--mode` | Build mode: `pybind` (default) or `cpp_itfs` |
| `--verbose` | Enable verbose build output |
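When the build is driven from a Python script (for example a packaging hook), the command line can be assembled programmatically. The sketch below uses only the options listed in the table above. The helper name `qola_build_argv` is hypothetical and not part of QoLA, and treating `--verbose` as a value-less flag is an assumption.

```python
import shlex

VALID_MODES = ("pybind", "cpp_itfs")


def qola_build_argv(manifest, aiter_root, output_dir,
                    archs=(), mode="pybind", verbose=False):
    """Assemble an argv list for `qola build` (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    argv = [
        "qola", "build",
        "--manifest", str(manifest),
        "--aiter-root", str(aiter_root),
        "--output-dir", str(output_dir),
        "--mode", mode,
    ]
    # --arch is repeatable: emit one flag per target architecture.
    for arch in archs:
        argv += ["--arch", arch]
    # Assumption: --verbose takes no value.
    if verbose:
        argv.append("--verbose")
    return argv


argv = qola_build_argv(
    "example/te-manifest.toml", "3rdparty/aiter", "/tmp/qola-out",
    archs=["gfx950"], mode="cpp_itfs",
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` so that a failed build raises an exception.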
The manifest is a TOML file that declares what to build. See example/te-manifest.toml for a full example.
[qola]
aiter_commit = "33f2e6a..." # Pinned AITER commit
namespace = "te" # C++ namespace and .so prefix
rocm_versions = ["7.2"]
[build]
architectures = ["gfx950"]
# Static modules from AITER's optCompilerConfig.json
[[modules]]
name = "libmha_fwd"
mode = "cpp_itfs"
drop_srcs = ["mha_fwd_split.cu", "mha_fwd_batch_prefill.cu"]
drop_directions = ["fwd_splitkv", "batch_prefill"]
[[modules]]
name = "libmha_bwd"
mode = "cpp_itfs"
# MHA variant matrix — Cartesian expansion of CK codegen filters
[[mha_fwd_variants]]
dtype = ["bf16", "fp16"]
has_lse = true
has_skip = false
[[mha_bwd_variants]]
dtype = ["bf16", "fp16"]

The `pybind` mode produces pybind11 `.so` modules importable from Python. It requires PyTorch at both build time and runtime.

The `cpp_itfs` mode produces torch-free C-linkable shared libraries. Each module exposes a C++ API under the configured namespace:
#include "qola_mha_fwd.h"
// With namespace = "te":
float ret = qola::te::mha_fwd(args, stream_config);

Source replacement is driven by `cpp_itfs/registry.toml`: pybind entry points are swapped for thin C wrappers that expose a namespace-guarded C++ API.
output-dir/
lib/ # Compiled .so files
te_libmha_fwd.so
te_libmha_bwd.so
configs/ # AITER tuning CSVs
manifest.json # Build metadata and per-module results
| Module | Description | cpp_itfs API |
|---|---|---|
| `libmha_fwd` | Multi-head attention forward | `qola::te::mha_fwd()` |
| `libmha_bwd` | Multi-head attention backward | `qola::te::mha_bwd()` |
QoLA reconstructs AITER's build-time eval namespace from a source tree path alone, without ever running import aiter. This avoids AITER's __init__.py side effects and torch import requirements. See resolver.py.
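The idea can be illustrated with a small sketch: path constants that would normally become available by importing the package are derived from the source tree root, and config entries written as Python expressions are evaluated against that dictionary. The variable names (`AITER_ROOT_DIR`, `AITER_CSRC_DIR`, `CK_DIR`), the directory layout, and the sample expression are assumptions for illustration; the actual logic in `resolver.py` may differ.

```python
from pathlib import Path


def build_eval_namespace(aiter_root):
    """Build the names that config expressions may reference (assumed set)."""
    root = Path(aiter_root)
    return {
        "__builtins__": {},  # expressions only see the names listed here
        "AITER_ROOT_DIR": root.as_posix(),
        "AITER_CSRC_DIR": (root / "csrc").as_posix(),
        "CK_DIR": (root / "3rdparty" / "composable_kernel").as_posix(),
    }


def resolve(expr, namespace):
    """Evaluate one config expression without importing the package."""
    # Evaluate against a copy so the shared namespace is never mutated.
    return eval(expr, dict(namespace))


ns = build_eval_namespace("3rdparty/aiter")
# Hypothetical config entry: a list of sources written as an f-string expression.
srcs = resolve("[f'{AITER_CSRC_DIR}/kernels/example.cu']", ns)
print(srcs)
```

Because `eval` executes arbitrary expressions, this approach is only appropriate for configuration that ships with a trusted, pinned source tree.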
Two layers prevent symbol leaks when multiple .so files coexist:
- C++ namespace wrapping: `QOLA_NS_BEGIN`/`QOLA_NS_END` macros place all public symbols under `qola::<namespace>::`
- Linker version script: `qola_exports.lds` forces all non-`qola::*` symbols local, including AITER symbols with explicit `visibility("default")`
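As a rough illustration of the second layer, the sketch below renders a GNU ld version script that keeps C++ symbols matching `qola::*` global and makes everything else local. This is an assumption about what such a script could look like, not the contents of `qola_exports.lds`; the function name `render_version_script` is hypothetical.

```python
def render_version_script(namespace_root="qola"):
    """Render a GNU ld version script exporting only <namespace_root>::* symbols."""
    return (
        "{\n"
        "  global:\n"
        # Patterns inside extern "C++" are matched against demangled names.
        '    extern "C++" {\n'
        f"      {namespace_root}::*;\n"
        "    };\n"
        "  local:\n"
        # Everything else becomes local to the shared library.
        "    *;\n"
        "};\n"
    )


script = render_version_script()
print(script)
```

A script like this is typically handed to the linker with `-Wl,--version-script=<path>`.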
The manifest's `[[mha_fwd_variants]]` / `[[mha_bwd_variants]]` sections declare option dimensions (`dtype`, `has_bias`, `has_mask`, etc.) that are expanded into CK codegen filter patterns. This controls which of the ~34K possible kernel instances are actually compiled. See `variant_matrix.py`. Variant filtering is currently supported only for pybind11 output.
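The Cartesian expansion step can be sketched as follows, assuming that list-valued keys are treated as dimensions and scalar keys as fixed options. This is an illustration only: `expand_variants` is a hypothetical name, and the translation of each combination into a CK codegen filter pattern (done by `variant_matrix.py`) is not shown.

```python
from itertools import product


def expand_variants(entry):
    """Expand one variant table into concrete option combinations.

    List values are dimensions to iterate over; scalar values are fixed.
    """
    keys = list(entry)
    axes = [v if isinstance(v, list) else [v] for v in entry.values()]
    return [dict(zip(keys, combo)) for combo in product(*axes)]


# Mirrors the [[mha_fwd_variants]] table from the manifest example.
fwd = {"dtype": ["bf16", "fp16"], "has_lse": True, "has_skip": False}
variants = expand_variants(fwd)
for v in variants:
    print(v)

# Turning a second key into a list doubles the number of combinations.
wider = expand_variants({"dtype": ["bf16", "fp16"], "has_lse": [True, False]})
print(len(wider))
```

Keeping scalar options as one-element axes means a manifest entry only grows the matrix along the dimensions it lists explicitly.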
generate_embedded_hsa.py converts binary .co ASM blobs into a C++ header with compile-time byte arrays, enabling kernel distribution without a runtime AITER_ASM_DIR.
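A minimal sketch of the conversion, assuming one byte array per blob and a C++17 consumer (for `inline constexpr`). The function name, symbol naming scheme, and header layout are hypothetical; the real `generate_embedded_hsa.py` may emit something different.

```python
def bytes_to_cpp_header(symbol, blob, per_line=12):
    """Render a binary blob as a C++ header containing a constexpr byte array."""
    if not blob:
        # A zero-length array is not valid C++, so reject empty input.
        raise ValueError("blob must not be empty")
    rows = []
    for i in range(0, len(blob), per_line):
        chunk = blob[i:i + per_line]
        rows.append("    " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(rows)
    return (
        "#pragma once\n"
        "#include <cstddef>\n\n"
        f"inline constexpr unsigned char {symbol}[] = {{\n{body}\n}};\n"
        f"inline constexpr std::size_t {symbol}_size = {len(blob)};\n"
    )


# Four placeholder bytes stand in for the contents of a real .co file.
header = bytes_to_cpp_header("demo_co", b"\x7fELF")
print(header)
```

In practice the blob would come from `Path(co_file).read_bytes()`, with the symbol name derived from the file name.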
- CI support for building and publishing pre-built libraries from manifests
- Kernel filtering for `libmha`: prune CK codegen instances based on manifest variant declarations in `cpp_itfs` mode (currently pybind-only)
- C-level JIT for `libmha`: compile MHA variant `.so` files on first use at the C layer, avoiding ahead-of-time compilation of the full variant matrix
See the parent repository for license terms.