AITER v0.1.15
Bi-weekly release, paired with ATOM v0.1.4 (first AITER+ATOM paired release in the bi-weekly cadence pilot — see cadence proposal).
Same commit as v0.1.15-rc0 (8ddfc7510) — zero delta after 6-day RC soak with no partner issues filed. Release branch: release/v0.1.15.
Highlights since v0.1.14 (80 commits on main + 1 cherry-pick)
- DSv4-Pro / V4-Flash kernels — fused compress attention (#3357), sparse prefill OPUS (#3225), fp8_mqa_logits re-add, indexer_qk_rope_quant_and_cache non-contiguous k support (#3301), DSv4 padding fix (#3184), DSv4 bf16 + fp8 a8w8 blockscale tunes (#3284 #3339 #3394)
- MoE — fused dynamic MXFP8 quant + moe_sort HIP path (#3312), drop a_scale_one for fp8 stage1 + remove fp8 fuse_quant bypass (#3367), optimised prefill mxfp8 quant moe sort (#3398), LDS-aware num_stages selection for gfx950 (#3372), GLUON a8w4 optimisations (#3317)
- FlyDSL — pin bumped to 0.1.9.dev599, fused qk_norm_rope_quant for DSv4-Pro decode (#3320), fused_compress_attn for V4-Pro/V4-Flash (#3357), dynamic layout fix (#3373)
- Triton — tl.dot(..., acc=...) accumulator form (#3231), split-k common reduce (#3230), MoE gfx1250 optimisations (#3293), MoE routing support for expert_map (#3348), splitk deadlock fix (#3288)
- mla — fp8 qh32 seqlen=1 persistent kernel for gfx950 (#3304, cherry-picked)
- mhc_post / mhc_pre — fused rmsnorm (#3396), split-k acc_sq mask fix (#3278)
- OPUS — bf16 gemm support (#2945), pa_sparse_prefill_opus (#3225), mono version m align assert fix (#3382), unroll loop + scale mfma update (#3329), synchronous fallback _async_load (#3336), CDNA-only v_pk_mul_f32 ASM guards (#3322 #3356)
- gfx1200/1201 RDNA4 — FP8 dtype map (#3332), Gluon MoE optimisations (#3317)
- DP-attention — CUDAGraph capture compatibility fix (#3375)
- CK — submodule pin fix after CK re-sync with rocm-libraries (#3387)
- CI/build infra — install_triton.sh pipefail bug fix, workflow installs flydsl from AMD mirror at build time
Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15)
| Model | Score | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 (TP=8, fp8 KV) | 0.9431 | 0.94 | PASS |
| MiniMax-M2.5 (TP=2, fp8 KV) | 0.9340 | 0.92 | PASS |
| Qwen3-235B-A22B-FP8 (TP=8, fp8 KV) | 0.8795 | 0.87 | PASS |
| GLM-5-FP8 (TP=8, fp8 KV) | 0.9431 | 0.93 | PASS |
| Kimi-K2.5-MXFP4 (TP=4, fp8 KV) | 0.9340 | 0.93 | PASS |
5/5 PASS. Qwen3-235B-A22B passes cleanly for the first time on this base.
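
The PASS column can be re-derived from the table. The sketch below assumes the gate is simply score >= threshold (the notes do not state the rule explicitly) and also reports each model's margin, which shows how close each result sits to its threshold. The `gate` helper is a hypothetical name, not part of AITER.

```python
# Scores and thresholds copied from the validation table above.
RESULTS = {
    "DeepSeek-R1-0528": (0.9431, 0.94),
    "MiniMax-M2.5": (0.9340, 0.92),
    "Qwen3-235B-A22B-FP8": (0.8795, 0.87),
    "GLM-5-FP8": (0.9431, 0.93),
    "Kimi-K2.5-MXFP4": (0.9340, 0.93),
}


def gate(results):
    """Return {model: (passed, margin)}; assumes a model passes when score >= threshold."""
    return {
        model: (score >= threshold, round(score - threshold, 4))
        for model, (score, threshold) in results.items()
    }


verdicts = gate(RESULTS)
for model, (passed, margin) in verdicts.items():
    print(f"{model}: {'PASS' if passed else 'FAIL'} (margin {margin:+.4f})")
```

Under that assumption, DeepSeek-R1-0528 has the narrowest margin of the five.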
Wheel Matrix (6 wheels)
ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI (glibc 2.28+). Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).
| ROCm | Python | torch ABI | Size |
|---|---|---|---|
| 7.0 | 3.10 | 2.10 | 466 MB |
| 7.0 | 3.12 | 2.10 | 467 MB |
| 7.1 | 3.10 | 2.10 | 459 MB |
| 7.1 | 3.12 | 2.10 | 459 MB |
| 7.2 | 3.10 | 2.11 | 452 MB |
| 7.2 | 3.12 | 2.11 | 453 MB |
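
To pick a wheel from the matrix above, a small lookup can help. This is a minimal sketch: the table data is copied from the matrix, while `compatible_wheels` is a hypothetical helper and the assumption that torch ABI compatibility is decided by the major.minor version alone is ours, not stated in the notes.

```python
# (ROCm, Python) -> torch ABI, copied from the wheel matrix above.
WHEEL_MATRIX = {
    ("7.0", "3.10"): "2.10",
    ("7.0", "3.12"): "2.10",
    ("7.1", "3.10"): "2.10",
    ("7.1", "3.12"): "2.10",
    ("7.2", "3.10"): "2.11",
    ("7.2", "3.12"): "2.11",
}


def compatible_wheels(python_version, torch_version):
    """List (rocm, python) wheel variants for a Python version and a torch version.

    Assumption: only the torch major.minor matters, so '2.10.0+rocm7.1' -> '2.10'.
    """
    torch_abi = ".".join(torch_version.split("+")[0].split(".")[:2])
    return sorted(
        key
        for key, abi in WHEEL_MATRIX.items()
        if key[1] == python_version and abi == torch_abi
    )


print(compatible_wheels("3.12", "2.10.0"))
print(compatible_wheels("3.10", "2.11.0"))
```

An empty result means no published wheel matches that combination, for example Python 3.11.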
Install
pip install \
--extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
https://github.com/ROCm/aiter/releases/download/v0.1.15/<wheel-filename>

The --extra-index-url is required; see "Partner dependencies" below.
Partner dependencies (READ BEFORE INSTALLING)
1. flydsl==0.1.9.dev599 (REQUIRED)
setup.py calls start_aot(), which imports aiter.aot.flydsl.gemm at build time. Importing aiter at runtime also requires this exact version. The package is available only from the AMD nightlies mirror:
https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/
Always pass --extra-index-url with the mirror URL above to pip. Without it, the ROCm/HIP JIT is silently disabled.
2. triton>=3.6.0 (REQUIRED)
aiter/__init__.py enforces triton>=3.6.0 for the new Gluon kernels. Either use the paired ATOM v0.1.4 container, which ships triton 3.6.0, or run the following before installing the aiter wheel:
pip install --force-reinstall triton==3.6.0

Paired container
rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4 ships this AITER wheel pre-installed with matching triton + flydsl. Recommended pin point for partners.
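
Outside the paired container, the two pins under "Partner dependencies" can be checked before importing aiter. The following is a minimal sketch, assuming the installed distributions are named `flydsl` and `triton` as in the pip commands above (ROCm builds may use a different distribution name). The helper names are hypothetical, and the version parsing is deliberately simplified: pre-release suffixes are ignored.

```python
from importlib import metadata

REQUIRED_FLYDSL = "0.1.9.dev599"  # exact pin from the release notes
MIN_TRITON = (3, 6, 0)            # lower bound from the release notes


def release_tuple(version):
    """Parse the leading numeric release segment, e.g. '3.6.0+rocm' -> (3, 6, 0).

    Simplification: parsing stops at the first non-numeric piece, and the
    result is padded to three components, so '3.6' -> (3, 6, 0).
    """
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_pins(installed):
    """Return a list of problems for {'flydsl': version or None, 'triton': version or None}."""
    problems = []
    flydsl = installed.get("flydsl")
    if flydsl != REQUIRED_FLYDSL:
        problems.append(f"flydsl is {flydsl}, need =={REQUIRED_FLYDSL}")
    triton = installed.get("triton")
    if triton is None or release_tuple(triton) < MIN_TRITON:
        problems.append(f"triton is {triton}, need >=3.6.0")
    return problems


def installed_versions(names=("flydsl", "triton")):
    """Look up installed distribution versions; missing packages map to None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for problem in check_pins(installed_versions()):
        print("WARNING:", problem)
```

Keeping `check_pins` separate from the metadata lookup makes the pin logic easy to exercise without either package installed.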
Known Issues
- The rocm7.2 wheels are built against the torch 2.11 ABI. For deployments still on torch 2.10 ATOM containers, install the rocm7.1 wheel, which uses the torch 2.10 ABI (validated PASS on all 5 models).
- pip 26.0.1 fails with "wrong number of parts in filename". Workaround: download the wheel and rename it to drop the .manylinux.2.28 infix from the version segment before running pip install. Tracked for v0.1.16.
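
The rename workaround for the pip 26.0.1 issue can be scripted. This is a sketch only: the notes do not give the actual wheel filename, so the example name in the comments is hypothetical, and `strip_infix` / `rename_wheel` are illustrative helpers. It relies on the standard wheel naming layout, where the version is the second dash-separated segment.

```python
from pathlib import Path

INFIX = ".manylinux.2.28"  # infix named in the known issue above


def strip_infix(filename, infix=INFIX):
    """Return the wheel filename with the infix removed from the version segment only."""
    parts = filename.split("-")
    if len(parts) < 2:
        return filename  # not a wheel-style name; leave untouched
    parts[1] = parts[1].replace(infix, "")
    return "-".join(parts)


def rename_wheel(path):
    """Rename a downloaded wheel on disk and return the new path."""
    src = Path(path)
    dst = src.with_name(strip_infix(src.name))
    if dst != src:
        src.rename(dst)
    return dst


# Hypothetical filename, for illustration only:
example = "amd_aiter-0.1.15+rocm7.1.manylinux.2.28-cp312-cp312-linux_x86_64.whl"
print(strip_infix(example))
```

After renaming, pass the new local path to pip install together with the --extra-index-url described above.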
Feedback
- Bug reports: https://github.com/ROCm/aiter/issues (tag v0.1.15)
- Direct: peng.sun@amd.com