AITER v0.1.15.post1

Hotfix release to unblock downstream vLLM partners. Adds 4 cherry-picked fixes on top of v0.1.15, plus the 5 prerequisite commits needed for them to apply cleanly.

This release inherits everything in v0.1.15 (full list in the v0.1.15 release notes), including PR #3304, the mla fp8 qh32 seqlen=1 persistent kernel for gfx950 (required for DSv3.2 with --kv-cache-dtype fp8_e4m3). That cherry-pick lives at commit 6415d586 on the release/v0.1.15 branch (added during v0.1.15-rc0); the commit title matches PR #3304 verbatim.

Cherry-picks (4 fixes Kenny Roche requested for vLLM unblock)

PR	Title	Fixes
#3540	Rebuild 32x384 kernel from new sources	MiniMax M2.5 OOB access in fmoe (#3471)
#3428	add MX_FP4_A8 tuned configs and dispatch for moe_gemm_a8w4	gpt-oss W4A8 regressions + crashes
#3492	Enabled stride-aware KV-cache block dim for non-contiguous layouts (fused_qk_norm_rope_cache_pts_quant_shuffle)	Qwen fusion non-contiguous KV-cache
#3546	[Triton][Gluon] fused_qk_rope_cat_and_cache_mla new grid layout	MLA perf + vLLM unit-test fix

Prerequisite commits (required for the above to cherry-pick cleanly): #3372 (LDS-aware num_stages), #3159 (hip kl refactor), #3358 (partial rope), #2888 (FP4 GFX12 support), and the GFX12 import fix.

Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15)

Model	v0.1.15.post1	v0.1.15	Threshold	Result
MiniMax-M2.5 (TP=2, fp8 KV)	0.9378	0.9340	0.92	PASS ↑
DeepSeek-R1-0528 (TP=8, fp8 KV)	0.9515	0.9431	0.94	PASS ↑
Qwen3-235B-A22B-FP8 (TP=8, fp8 KV)	0.8756	0.8795	0.87	PASS (within noise band)
GLM-5-FP8 (TP=8, fp8 KV)	0.9454	0.9431	0.93	PASS ↑
Kimi-K2.5-MXFP4 (TP=4, fp8 KV)	0.9363	0.9340	0.92	PASS ↑

5/5 PASS. 4/5 models improved over the v0.1.15 baseline; the Qwen3 difference (0.8756 vs. 0.8795) is attributed to single-question noise and stays above the 0.87 threshold.
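
The PASS column and the direction arrows in the table above can be re-derived with a few lines of Python. This is only a sketch for readers who want to re-check the arithmetic: the scores and thresholds are copied from the table, while the names RESULTS, summarize and SUMMARY are illustrative and not part of AITER or the validation harness.

```python
# GSM8K scores copied from the validation table:
# model -> (v0.1.15.post1 score, v0.1.15 score, pass threshold)
RESULTS = {
    "MiniMax-M2.5": (0.9378, 0.9340, 0.92),
    "DeepSeek-R1-0528": (0.9515, 0.9431, 0.94),
    "Qwen3-235B-A22B-FP8": (0.8756, 0.8795, 0.87),
    "GLM-5-FP8": (0.9454, 0.9431, 0.93),
    "Kimi-K2.5-MXFP4": (0.9363, 0.9340, 0.92),
}


def summarize(results):
    """Return {model: (passed, delta)} with delta = post1 score - baseline score."""
    summary = {}
    for model, (post1, baseline, threshold) in results.items():
        passed = post1 >= threshold
        delta = round(post1 - baseline, 4)  # round away float noise
        summary[model] = (passed, delta)
    return summary


SUMMARY = summarize(RESULTS)

for model, (passed, delta) in SUMMARY.items():
    status = "PASS" if passed else "FAIL"
    print(f"{model:22s} {status}  delta={delta:+.4f}")
```

A model passes when its v0.1.15.post1 score is at or above the threshold, independent of whether it moved up or down against the baseline.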

Related vLLM PRs (downstream)

For the full gpt-oss + Qwen path, you still need the following vLLM-side companion PRs:

vLLM #44893 — Pass GateMode.INTERLEAVE for MXFP4 W4A16 fused MoE (Rohan138)
vLLM #44804 — Hybrid CDNA4 swizzle gate for A8W4 MoE (xiaohuguo2023)
vLLM intermediate_pad TP-aware fix (Rohan138)

Wheel Matrix

ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. torch 2.10 (rocm7.0/7.1) / torch 2.11 (rocm7.2). Fat binary gfx942 + gfx950.
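
As a rough illustration, the matrix above expands to six ROCm/Python combinations, each paired with one torch version. The sketch below only enumerates what the line above states; the names ROCM_VERSIONS, TORCH_FOR_ROCM and wheel_matrix are hypothetical and do not describe the actual build tooling or wheel filenames.

```python
from itertools import product

ROCM_VERSIONS = ["7.0", "7.1", "7.2"]
PYTHON_VERSIONS = ["3.10", "3.12"]
# torch pairing as stated in the release notes:
# torch 2.10 for ROCm 7.0/7.1, torch 2.11 for ROCm 7.2
TORCH_FOR_ROCM = {"7.0": "2.10", "7.1": "2.10", "7.2": "2.11"}
# Every wheel is a fat binary covering both architectures.
GPU_ARCHS = ("gfx942", "gfx950")


def wheel_matrix():
    """Enumerate every ROCm x Python combination with its torch version."""
    return [
        {"rocm": rocm, "python": py, "torch": TORCH_FOR_ROCM[rocm], "archs": GPU_ARCHS}
        for rocm, py in product(ROCM_VERSIONS, PYTHON_VERSIONS)
    ]


for entry in wheel_matrix():
    print(entry)
```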

Install

pip install \
  --extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
  https://github.com/ROCm/aiter/releases/download/v0.1.15.post1/<wheel-filename>

Partner dependencies are the same as for v0.1.15: flydsl==0.1.9.dev599, triton>=3.6.0.
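
After installing, it can be useful to confirm that the environment actually satisfies these two pins. The following is a minimal sketch using only the standard library; it assumes the distributions are registered under the names flydsl and triton (ROCm builds may ship triton under a different distribution name), and the helpers release_tuple, check_pins and installed_versions are illustrative rather than part of AITER.

```python
import re
from importlib import metadata


def release_tuple(version):
    """Leading numeric release segment, padded to 3 parts: '3.6' -> (3, 6, 0).

    Simplification: pre-release and local suffixes (rc1, +git...) are ignored,
    so this is not a full PEP 440 comparison.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    return parts + (0,) * (3 - len(parts))


def check_pins(installed):
    """Return a list of problems for {distribution name: version or None}."""
    problems = []
    flydsl = installed.get("flydsl")
    if flydsl != "0.1.9.dev599":  # exact pin from the release notes
        problems.append(f"flydsl is {flydsl}, expected 0.1.9.dev599")
    triton = installed.get("triton")
    if triton is None or release_tuple(triton) < (3, 6, 0):  # lower bound
        problems.append(f"triton is {triton}, expected >=3.6.0")
    return problems


def installed_versions(names=("flydsl", "triton")):
    """Look up installed versions; a missing distribution maps to None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found
```

Calling check_pins(installed_versions()) in the target environment returns an empty list when both pins are met, under the naming assumption above.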

Known Issues

pip 26.0.1 rejects the wheel filename with a "wrong number of parts" error. Workaround: rename the wheel to drop the .manylinux.2.28 infix before installing. A fix is planned for v0.1.16.
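
The rename workaround can be scripted. The sketch below assumes the infix appears in the downloaded filename literally as .manylinux.2.28, as written above; fixed_wheel_name and rename_wheel are hypothetical helper names, and the real wheel filename is not reproduced here.

```python
from pathlib import Path

INFIX = ".manylinux.2.28"  # infix named in the known issue


def fixed_wheel_name(filename):
    """Return the filename with the first occurrence of the infix removed."""
    return filename.replace(INFIX, "", 1)


def rename_wheel(path):
    """Rename the wheel on disk if it carries the infix; return the final path."""
    src = Path(path)
    dst = src.with_name(fixed_wheel_name(src.name))
    if dst != src:
        src.rename(dst)
    return dst
```

After downloading, pass the local wheel path to rename_wheel and give the returned path to pip install. A filename without the infix is left untouched.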

Feedback

Bug reports: https://github.com/ROCm/aiter/issues — tag v0.1.15.post1
Direct: peng.sun@amd.com
