v0.1.15.post1

@sunway513 sunway513 released this 08 Jun 20:29
· 178 commits to main since this release

AITER v0.1.15.post1

Hotfix release to unblock downstream vLLM partners. Adds 4 cherry-picked fixes on top of v0.1.15, plus the 5 prerequisite commits needed for them to apply cleanly.

Everything in v0.1.15 is inherited (full list in the v0.1.15 release notes), including PR #3304, the mla fp8 qh32 seqlen=1 persistent kernel for gfx950 (required for DSv3.2 with --kv-cache-dtype fp8_e4m3). That cherry-pick lives at commit 6415d586 on the release/v0.1.15 branch (added during v0.1.15-rc0); the commit title matches PR #3304 verbatim.

Cherry-picks (4 fixes Kenny Roche requested for vLLM unblock)

| PR | Title | Fixes |
|---|---|---|
| #3540 | Rebuild 32x384 kernel from new sources | MiniMax M2.5 OOB access in fmoe (#3471) |
| #3428 | add MX_FP4_A8 tuned configs and dispatch for moe_gemm_a8w4 | gpt-oss W4A8 regressions + crashes |
| #3492 | Enabled stride-aware KV-cache block dim for non-contiguous layouts (fused_qk_norm_rope_cache_pts_quant_shuffle) | Qwen fusion non-contiguous KV-cache |
| #3546 | [Triton][Gluon] fused_qk_rope_cat_and_cache_mla new grid layout | MLA perf + vLLM unit-test fix |

Prerequisite commits (required for the above to cherry-pick cleanly): #3372 (LDS-aware num_stages), #3159 (hip kl refactor), #3358 (partial rope), #2888 (FP4 GFX12 support), GFX12 import fix.

Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15)

| Model | v0.1.15.post1 | v0.1.15 | Threshold | Result |
|---|---|---|---|---|
| MiniMax-M2.5 (TP=2, fp8 KV) | 0.9378 | 0.9340 | 0.92 | PASS ↑ |
| DeepSeek-R1-0528 (TP=8, fp8 KV) | 0.9515 | 0.9431 | 0.94 | PASS ↑ |
| Qwen3-235B-A22B-FP8 (TP=8, fp8 KV) | 0.8756 | 0.8795 | 0.87 | PASS (within noise band) |
| GLM-5-FP8 (TP=8, fp8 KV) | 0.9454 | 0.9431 | 0.93 | PASS ↑ |
| Kimi-K2.5-MXFP4 (TP=4, fp8 KV) | 0.9363 | 0.9340 | 0.92 | PASS ↑ |

5/5 PASS. 4/5 scored above the v0.1.15 baseline; Qwen3 is 0.0039 below it, which is within the noise band and still above its 0.87 threshold.
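
The pass/fail and delta arithmetic behind the validation table can be reproduced with a short sketch. The scores and thresholds are copied from the table; the `summarize` helper and the row layout are illustrative names, not part of AITER or its test harness.

```python
# Scores copied from the validation table:
# (model, v0.1.15.post1 score, v0.1.15 score, threshold)
RESULTS = [
    ("MiniMax-M2.5", 0.9378, 0.9340, 0.92),
    ("DeepSeek-R1-0528", 0.9515, 0.9431, 0.94),
    ("Qwen3-235B-A22B-FP8", 0.8756, 0.8795, 0.87),
    ("GLM-5-FP8", 0.9454, 0.9431, 0.93),
    ("Kimi-K2.5-MXFP4", 0.9363, 0.9340, 0.92),
]


def summarize(results):
    """Return one dict per model with the pass flag and delta vs. baseline."""
    rows = []
    for model, new, base, threshold in results:
        rows.append(
            {
                "model": model,
                "passed": new >= threshold,      # PASS means at or above threshold
                "delta": round(new - base, 4),   # positive means above baseline
            }
        )
    return rows


rows = summarize(RESULTS)
num_pass = sum(r["passed"] for r in rows)
num_improved = sum(r["delta"] > 0 for r in rows)

for r in rows:
    status = "PASS" if r["passed"] else "FAIL"
    print(f'{r["model"]:<22} {status} {r["delta"]:+.4f}')
print(f"{num_pass}/{len(rows)} pass, {num_improved}/{len(rows)} above baseline")
```

Only the Qwen3 row has a negative delta, and it stays above its threshold.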

Related vLLM PRs (downstream)

For the full gpt-oss and Qwen paths, you still need the vLLM-side companion PRs:

  • vLLM #44893 — Pass GateMode.INTERLEAVE for MXFP4 W4A16 fused MoE (Rohan138)
  • vLLM #44804 — Hybrid CDNA4 swizzle gate for A8W4 MoE (xiaohuguo2023)
  • vLLM intermediate_pad TP-aware fix (Rohan138)

Wheel Matrix

ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. torch 2.10 (rocm7.0/7.1) / torch 2.11 (rocm7.2). Fat binary gfx942 + gfx950.
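
As a quick way to see which combinations the matrix covers, the sketch below enumerates it. The ROCm, Python, torch, and GPU values come from the line above; the dictionary layout and the `wheel_matrix` name are illustrative assumptions and do not reflect actual wheel filenames.

```python
from itertools import product

# ROCm version -> torch version, as stated in the wheel matrix.
ROCM_TO_TORCH = {"7.0": "2.10", "7.1": "2.10", "7.2": "2.11"}
PYTHON_VERSIONS = ("3.10", "3.12")
GPU_ARCHS = ("gfx942", "gfx950")  # every wheel is a fat binary covering both


def wheel_matrix():
    """Return one entry per (ROCm, Python) combination."""
    return [
        {
            "rocm": rocm,
            "python": py,
            "torch": ROCM_TO_TORCH[rocm],
            "archs": GPU_ARCHS,
        }
        for rocm, py in product(ROCM_TO_TORCH, PYTHON_VERSIONS)
    ]


for entry in wheel_matrix():
    print(entry)
```

This yields six ROCm and Python combinations, each built for both GPU targets.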

Install

pip install \
  --extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
  https://github.com/ROCm/aiter/releases/download/v0.1.15.post1/<wheel-filename>

Partner dependencies are the same as in v0.1.15: flydsl==0.1.9.dev599, triton>=3.6.0.

Known Issues

  • pip 26.0.1 rejects the wheel filename with a "wrong number of parts" error. Workaround: rename the wheel to drop the .manylinux.2.28 infix before installing. A fix is planned for v0.1.16.
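
The rename workaround can be scripted. This is a minimal sketch that assumes the infix appears in the downloaded filename exactly as written above; the helper names and the example filename are hypothetical and are not real AITER wheel names.

```python
from pathlib import Path

INFIX = ".manylinux.2.28"  # copied verbatim from the known-issue note


def strip_infix(filename, infix=INFIX):
    """Return the filename with the first occurrence of the infix removed."""
    return filename.replace(infix, "", 1)


def rename_wheel(path, infix=INFIX):
    """Rename a wheel on disk so its name no longer contains the infix.

    Returns the new path, or the original path if the infix is absent.
    """
    path = Path(path)
    new_name = strip_infix(path.name, infix)
    if new_name == path.name:
        return path
    return path.rename(path.with_name(new_name))


# Hypothetical filename, used only to show the string transformation.
print(strip_infix("example-1.0.manylinux.2.28-py3-none-any.whl"))
```

After renaming, pass the new local path to pip install in place of the release URL.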