AITER v0.1.15.post1
Hotfix release to unblock downstream vLLM partners. Adds 4 cherry-picked fixes on top of v0.1.15, plus 5 prerequisite commits so that they apply cleanly.
Inherited from v0.1.15 (full list in the v0.1.15 release notes), including PR #3304, "mla fp8 qh32 seqlen=1 persistent kernel for gfx950" (required for DSv3.2 with --kv-cache-dtype fp8_e4m3). The cherry-pick lives at commit 6415d586 on the release/v0.1.15 branch (added during v0.1.15-rc0); the commit title matches PR #3304 verbatim.
Cherry-picks (4 fixes Kenny Roche requested for vLLM unblock)
| PR | Title | Fixes |
|---|---|---|
| #3540 | Rebuild 32x384 kernel from new sources | MiniMax M2.5 OOB access in fmoe (#3471) |
| #3428 | add MX_FP4_A8 tuned configs and dispatch for moe_gemm_a8w4 | gpt-oss W4A8 regressions + crashes |
| #3492 | Enabled stride-aware KV-cache block dim for non-contiguous layouts (fused_qk_norm_rope_cache_pts_quant_shuffle) | Qwen fusion non-contiguous KV-cache |
| #3546 | [Triton][Gluon] fused_qk_rope_cat_and_cache_mla new grid layout | MLA perf + vLLM unit-test fix |
Prerequisite commits (required for the above to cherry-pick cleanly): #3372 (LDS-aware num_stages), #3159 (hip kl refactor), #3358 (partial rope), #2888 (FP4 GFX12 support), and a GFX12 import fix.
Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15)
| Model | v0.1.15.post1 | v0.1.15 | Threshold | Result |
|---|---|---|---|---|
| MiniMax-M2.5 (TP=2, fp8 KV) | 0.9378 | 0.9340 | 0.92 | PASS ↑ |
| DeepSeek-R1-0528 (TP=8, fp8 KV) | 0.9515 | 0.9431 | 0.94 | PASS ↑ |
| Qwen3-235B-A22B-FP8 (TP=8, fp8 KV) | 0.8756 | 0.8795 | 0.87 | PASS (within noise band) |
| GLM-5-FP8 (TP=8, fp8 KV) | 0.9454 | 0.9431 | 0.93 | PASS ↑ |
| Kimi-K2.5-MXFP4 (TP=4, fp8 KV) | 0.9363 | 0.9340 | 0.92 | PASS ↑ |
5/5 PASS. 4/5 improved on the v0.1.15 baseline; Qwen3 is 0.0039 lower, which is treated as run-to-run noise and stays above its 0.87 threshold.
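The pass/fail tally can be re-derived from the table with a few lines of Python. This is only a sketch: the scores are copied from the table above, while the dictionary layout and the `summarize` helper are illustrative names, not part of AITER or of the validation harness.

```python
# Scores copied from the validation table: (v0.1.15.post1, v0.1.15, threshold).
# The structure and helper below are illustrative, not part of AITER.
RESULTS = {
    "MiniMax-M2.5": (0.9378, 0.9340, 0.92),
    "DeepSeek-R1-0528": (0.9515, 0.9431, 0.94),
    "Qwen3-235B-A22B-FP8": (0.8756, 0.8795, 0.87),
    "GLM-5-FP8": (0.9454, 0.9431, 0.93),
    "Kimi-K2.5-MXFP4": (0.9363, 0.9340, 0.92),
}


def summarize(results):
    """Return (models meeting their threshold, models at or above baseline)."""
    passed = [m for m, (new, _, thr) in results.items() if new >= thr]
    not_worse = [m for m, (new, base, _) in results.items() if new >= base]
    return passed, not_worse


passed, not_worse = summarize(RESULTS)
print(f"{len(passed)}/{len(RESULTS)} PASS, "
      f"{len(not_worse)}/{len(RESULTS)} at or above baseline")

# Per-model delta against v0.1.15 and margin over the pass threshold.
for model, (new, base, thr) in RESULTS.items():
    print(f"{model}: delta={new - base:+.4f}, margin={new - thr:+.4f}")
```

The only model below its baseline in this tally is Qwen3-235B-A22B-FP8, and its margin over the threshold remains positive.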
Related vLLM PRs (downstream)
For the full gpt-oss + Qwen path, you still need the vLLM-side companion PRs:
- vLLM #44893 — Pass GateMode.INTERLEAVE for MXFP4 W4A16 fused MoE (Rohan138)
- vLLM #44804 — Hybrid CDNA4 swizzle gate for A8W4 MoE (xiaohuguo2023)
- vLLM intermediate_pad TP-aware fix (Rohan138)
Wheel Matrix
ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. torch 2.10 (rocm7.0/7.1) / torch 2.11 (rocm7.2). Fat binary gfx942 + gfx950.
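To see which combinations the matrix implies, the sketch below enumerates them. The version lists and the ROCm-to-torch pairing come from the line above; the variable and function names are made up for illustration, and no wheel filenames are generated because the naming scheme is not stated here.

```python
from itertools import product

# Values taken from the wheel matrix description; names are illustrative.
ROCM_VERSIONS = ["7.0", "7.1", "7.2"]
PYTHON_VERSIONS = ["3.10", "3.12"]
TORCH_FOR_ROCM = {"7.0": "2.10", "7.1": "2.10", "7.2": "2.11"}


def wheel_matrix():
    """List every (ROCm, Python) pair with the torch version it is built against."""
    return [
        {"rocm": rocm, "python": py, "torch": TORCH_FOR_ROCM[rocm]}
        for rocm, py in product(ROCM_VERSIONS, PYTHON_VERSIONS)
    ]


matrix = wheel_matrix()
for row in matrix:
    print(f"ROCm {row['rocm']} / Python {row['python']} / torch {row['torch']}")
```

This yields six combinations, each of which ships as a single fat binary covering gfx942 and gfx950.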
Install
pip install \
--extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
https://github.com/ROCm/aiter/releases/download/v0.1.15.post1/<wheel-filename>

Partner deps same as v0.1.15: flydsl==0.1.9.dev599, triton>=3.6.0.
Known Issues
- pip 26.0.1 rejects the wheel filename with "wrong number of parts": rename the wheel to drop the .manylinux.2.28 infix before installing. A fix is planned for v0.1.16.
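The rename workaround can be scripted. The sketch below assumes the downloaded wheel sits on local disk and that the infix appears literally as `.manylinux.2.28` in its name; the example filename is hypothetical and does not reflect the real AITER wheel naming, and both helper functions are illustrative.

```python
from pathlib import Path

# Infix named in the known issue above.
INFIX = ".manylinux.2.28"


def strip_infix(wheel_path, infix=INFIX):
    """Return the path with the first occurrence of the infix removed from the filename."""
    path = Path(wheel_path)
    if infix not in path.name:
        return path  # nothing to change
    return path.with_name(path.name.replace(infix, "", 1))


def rename_wheel(wheel_path):
    """Rename the wheel on disk if its name contains the infix; return the new path."""
    src = Path(wheel_path)
    dst = strip_infix(src)
    if dst != src:
        src.rename(dst)
    return dst


# Hypothetical filename, used only to show the string transformation.
example = "example_pkg-1.0+local.manylinux.2.28-cp312-cp312-linux_x86_64.whl"
print(strip_infix(example).name)
```

After calling `rename_wheel` on the real file, pass the renamed path to `pip install` in place of the release URL.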
Feedback
- Bug reports: https://github.com/ROCm/aiter/issues (tag v0.1.15.post1)
- Direct: peng.sun@amd.com