v0.1.15

@sunway513 sunway513 released this 06 Jun 20:01
· 170 commits to main since this release

AITER v0.1.15

Bi-weekly release, paired with ATOM v0.1.4 (first AITER+ATOM paired release in the bi-weekly cadence pilot — see cadence proposal).

Same commit as v0.1.15-rc0 (8ddfc7510) — zero delta after 6-day RC soak with no partner issues filed. Release branch: release/v0.1.15.

Highlights since v0.1.14 (80 commits on main + 1 cherry-pick)

  • DSv4-Pro / V4-Flash kernels — fused compress attention (#3357), sparse prefill OPUS (#3225), fp8_mqa_logits re-add, indexer_qk_rope_quant_and_cache non-contiguous k support (#3301), DSv4 padding fix (#3184), DSv4 bf16 + fp8 a8w8 blockscale tunes (#3284 #3339 #3394)
  • MoE — fused dynamic MXFP8 quant + moe_sort HIP path (#3312), drop a_scale_one for fp8 stage1 + remove fp8 fuse_quant bypass (#3367), optimised prefill mxfp8 quant moe sort (#3398), LDS-aware num_stages selection for gfx950 (#3372), GLUON a8w4 optimisations (#3317)
  • FlyDSL — pin bumped to 0.1.9.dev599, fused qk_norm_rope_quant for DSv4-Pro decode (#3320), fused_compress_attn for V4-Pro/V4-Flash (#3357), dynamic layout fix (#3373)
  • Triton — tl.dot(..., acc=...) accumulator form (#3231), split-k common reduce (#3230), MoE gfx1250 optimisations (#3293), MoE routing support for expert_map (#3348), splitk deadlock fix (#3288)
  • mla — fp8 qh32 seqlen=1 persistent kernel for gfx950 (#3304, cherry-picked)
  • mhc_post / mhc_pre — fused rmsnorm (#3396), split-k acc_sq mask fix (#3278)
  • OPUS — bf16 gemm support (#2945), pa_sparse_prefill_opus (#3225), mono version m align assert fix (#3382), unroll loop + scale mfma update (#3329), synchronous fallback _async_load (#3336), CDNA-only v_pk_mul_f32 ASM guards (#3322 #3356)
  • gfx1200/1201 RDNA4 — FP8 dtype map (#3332), Gluon MoE optimisations (#3317)
  • DP-attention — CUDAGraph capture compatibility fix (#3375)
  • CK — submodule pin fix after CK re-sync with rocm-libraries (#3387)
  • CI/build infra — install_triton.sh pipefail bug fix, workflow installs flydsl from AMD mirror at build time

Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15)

Model                              | Score  | Threshold | Result
DeepSeek-R1-0528 (TP=8, fp8 KV)    | 0.9431 | 0.94      | PASS
MiniMax-M2.5 (TP=2, fp8 KV)        | 0.9340 | 0.92      | PASS
Qwen3-235B-A22B-FP8 (TP=8, fp8 KV) | 0.8795 | 0.87      | PASS
GLM-5-FP8 (TP=8, fp8 KV)           | 0.9431 | 0.93      | PASS
Kimi-K2.5-MXFP4 (TP=4, fp8 KV)     | 0.9340 | 0.93      | PASS

5/5 PASS. Qwen3-235B-A22B passes cleanly for the first time on this base.
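
The Result column follows from comparing each score with its threshold. A minimal Python sketch of that check, using the values from the table above; the `>=` comparison and the names `RESULTS` and `gate` are assumptions for illustration, since these notes do not say how the gate is implemented:

```python
# Scores and thresholds copied from the validation table above.
RESULTS = {
    "DeepSeek-R1-0528": (0.9431, 0.94),
    "MiniMax-M2.5": (0.9340, 0.92),
    "Qwen3-235B-A22B-FP8": (0.8795, 0.87),
    "GLM-5-FP8": (0.9431, 0.93),
    "Kimi-K2.5-MXFP4": (0.9340, 0.93),
}


def gate(results):
    """Return {model: (passed, margin)}; assumes PASS means score >= threshold."""
    return {
        model: (score >= threshold, round(score - threshold, 4))
        for model, (score, threshold) in results.items()
    }


verdicts = gate(RESULTS)
for model, (passed, margin) in verdicts.items():
    print(f"{model:22s} {'PASS' if passed else 'FAIL'}  margin={margin:+.4f}")
```

Printing the margin alongside the verdict shows how close each model sits to its gate: DeepSeek-R1-0528 clears its threshold by 0.0031, the narrowest of the five.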

Wheel Matrix (6 wheels)

ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI (glibc 2.28+). Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).

ROCm | Python | torch ABI | Size
7.0  | 3.10   | 2.10      | 466 MB
7.0  | 3.12   | 2.10      | 467 MB
7.1  | 3.10   | 2.10      | 459 MB
7.1  | 3.12   | 2.10      | 459 MB
7.2  | 3.10   | 2.11      | 452 MB
7.2  | 3.12   | 2.11      | 453 MB
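
Picking a wheel from this matrix is a lookup on Python version and torch ABI. The following is a rough sketch rather than an official selector: `WHEEL_TORCH_ABI` and `compatible_wheels` are hypothetical names, and it assumes ABI compatibility is decided by the torch major.minor version alone.

```python
# (ROCm, Python) -> torch ABI, copied from the wheel matrix above.
WHEEL_TORCH_ABI = {
    ("7.0", "3.10"): "2.10",
    ("7.0", "3.12"): "2.10",
    ("7.1", "3.10"): "2.10",
    ("7.1", "3.12"): "2.10",
    ("7.2", "3.10"): "2.11",
    ("7.2", "3.12"): "2.11",
}


def compatible_wheels(torch_version, python_version):
    """List ROCm versions whose wheel matches the given torch and Python.

    Assumption: only the torch major.minor matters for ABI compatibility.
    """
    torch_mm = ".".join(torch_version.split(".")[:2])
    return sorted(
        rocm
        for (rocm, py), abi in WHEEL_TORCH_ABI.items()
        if py == python_version and abi == torch_mm
    )


print(compatible_wheels("2.10.0", "3.12"))  # ['7.0', '7.1']
print(compatible_wheels("2.11.0", "3.10"))  # ['7.2']
```

The sketch does not check the installed ROCm runtime; when it returns more than one ROCm version, choose the one that matches the target system.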

Install

pip install \
  --extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
  https://github.com/ROCm/aiter/releases/download/v0.1.15/<wheel-filename>

The --extra-index-url is required — see "Partner dependencies" below.

Partner dependencies (READ BEFORE INSTALLING)

1. flydsl==0.1.9.dev599 (REQUIRED)

setup.py calls start_aot(), which imports aiter.aot.flydsl.gemm at build time. Importing aiter at runtime also requires this exact version. It is available only from the AMD nightlies mirror:

https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/

Always pass --extra-index-url <above> to pip. Without it, the ROCm/HIP JIT is silently disabled.

2. triton>=3.6.0 (REQUIRED)

aiter/__init__.py enforces triton>=3.6.0 for the new Gluon kernels. Either use the paired ATOM v0.1.4 container, which ships triton 3.6.0, or run the following before installing the aiter wheel:

pip install --force-reinstall triton==3.6.0
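
Both partner requirements can be checked before installing the wheel. The sketch below is illustrative only: the helper names are hypothetical, it assumes the distributions are registered under the names `flydsl` and `triton`, and its version parser is a simplification that ignores pre-release and local suffixes.

```python
from importlib import metadata


def version_tuple(version):
    """Leading numeric components, padded to three: '3.6' -> (3, 6, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component (e.g. 'dev599')
        parts.append(int(piece))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check_partner_deps(installed):
    """Return a list of problems given {package: version string or None}."""
    problems = []
    flydsl = installed.get("flydsl")
    if flydsl != "0.1.9.dev599":
        problems.append(f"flydsl must be 0.1.9.dev599, found {flydsl}")
    triton = installed.get("triton")
    if triton is None or version_tuple(triton) < (3, 6, 0):
        problems.append(f"triton must be >=3.6.0, found {triton}")
    return problems


def installed_versions(names=("flydsl", "triton")):
    """Look up installed distribution versions; None when a package is absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for problem in check_partner_deps(installed_versions()):
        print("WARNING:", problem)
```

Keeping the comparison in `check_partner_deps` separate from the environment lookup lets the rules be exercised with plain dictionaries, without touching the installed packages.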

Paired container

rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4 ships this AITER wheel pre-installed with matching triton + flydsl. Recommended pin point for partners.

Known Issues

  • The rocm7.2 wheels are built against the torch 2.11 ABI. For deployments still on torch 2.10 ATOM containers, install the rocm7.1 wheel, which uses the torch 2.10 ABI (validated PASS on all 5 models).
  • pip 26.0.1 fails with "wrong number of parts in filename". Workaround: download the wheel, rename it to drop the .manylinux.2.28 infix from the version segment, then run pip install on the renamed file. Tracked for v0.1.16.
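
The rename workaround for the pip filename error can be scripted. This is a minimal sketch that assumes the infix appears literally as `.manylinux.2.28` in the downloaded filename; the function names and the example filename are hypothetical and do not reflect the actual release asset names.

```python
from pathlib import Path

INFIX = ".manylinux.2.28"  # infix named in the known issue above


def stripped_wheel_name(filename):
    """Return the filename with the infix removed, or unchanged if absent."""
    return filename.replace(INFIX, "", 1)


def rename_wheel(path):
    """Rename a downloaded wheel on disk and return the new path."""
    src = Path(path)
    dst = src.with_name(stripped_wheel_name(src.name))
    if dst != src:
        src.rename(dst)
    return dst


# Hypothetical filename for illustration only; substitute the real asset name.
example = "example_pkg-1.0+local.manylinux.2.28-cp312-cp312-linux_x86_64.whl"
print(stripped_wheel_name(example))
```

After renaming, pass the new local path to pip install in place of the release URL, keeping the --extra-index-url option.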

Feedback