Release v0.1.16 · ROCm/aiter

What's Changed

LDS-aware num_stages selection for gfx950 MoE GEMM by @xiaohuguo2023 in #3372
[GLUON] Moe a8w4 optimizations by @lburzawa in #3317
[Triton] Flash Attention Nightly Build Issues by @micmelesse in #3363
CI: increase Aiter test shards to 8 by @gyohuangxin in #3399
[module_fused_qk_norm_rope_cache_quant_shuffle] hip kl refactor by @amd-ruitang3 in #3159
Add GFX950 BF16 MLA persistent kernels for GQA 16/32 qseqlen4 by @fangche123 in #3338
Remove sorting for fmoe by @JohnNikolay84 in #3001
[Triton] mhc fix by @k50112113 in #3411
[FLYDSL] Add streaming-k with mixed slice-k and split-k for flydsl hgemm by @xytpai in #3279
[tune] enhance glm47 tune file for both gemm and moe by @gbyu-amd in #3371
Revert "Fix TP4 bf16 MLA persistent dispatch (#3233)"-MI355 mla bf16 nhead32,1 go through m32x4 kernel by @minmengdie in #3365
support partial rope with fused_qk_norm_rope_cache_quant_shuffle by @gbyu-amd in #3358
moe enable padding params as runtime by @zhiding512 in #3404
[OPUS] [ATOM] Add 16mx1_16nx4 sparse prefill variant with H-based kernel dispatch by @kaiyang-1 in #3415
Fix mhc_pre_big_fuse accuracy issue in rocm7.2.3 by @junhaha666 in #3417
Fix causal f8 mha kernel to properly support causal mask by @JohnNikolay84 in #3416
re-tune flydsl bf16 gemm tuned config by @XiaobingSuper in #3434
Revert "moe enable padding params as runtime" by @valarLip in #3435
[FLYDSL MOE] fix MoE accuracy with opening pad by @Zzz9990 in #3401
Add DO runner smoke test (workflow_dispatch on linux-aiter-do-mi350x-8) by @okakarpa in #3436
Revert DO runner smoke test by @okakarpa in #3437
sglang downstream: run 8-GPU tests on the DO MI350X runner label by @okakarpa in #3438
[PERF]update mi300x moe qwen bf16 configs by @lalala-sh in #3384
ew kernels support on gfx12. by @stevenshenyj in #3330
[gfx1250] prepare for gfx12 asm kernel porting by @feifei14119 in #3448
Fix moe topk gating test by @yzhou103 in #3449
[OPUS]opus_gemm acc fix by @demonsan in #3445
ci: Kimi-K2.5-MXFP4 downstream accuracy gates for AITER (ATOM + vLLM + SGLang) on MI350X by @sunway513 in #3441
add MX_FP4_A8 tuned configs and dispatch for moe_gemm_a8w4 by @xiaohuguo2023 in #3428
ci(release): install flydsl from AMD mirror + fix install_triton.sh dpkg/pipefail by @sunway513 in #3440
[Triton] [Gluon] [GFX12] add FP4 support for UA3D, MLA, KV cache flushing by @k50112113 in #2888
Align aiter prebuild jobs with build runner CPU limit by @gyohuangxin in #3433
feat: support pad args for rmsnorm kernel by @PerryZhang01 in #3265
ci(atom-downstream): fix MiniMax-M2.7 via HSA_NO_SCRATCH_RECLAIM by @sunway513 in #3474
Add triton-kernels support and improve triton install messaging by @mengfei-jiang in #3335
MI350 mla ps fp8 mode support nhead*qseqlen >= 128 through kernel mla_a8w8_qh32_qseqlen4_gqaratio32_ps by @minmengdie in #3380
feat(quant): align MXFP4 / MXFP8 E8M0 scale to ROUND_UP across HIP / Python ref / FlyDSL by @yzhou103 in #3212
fix(dist): route fused qknorm through compile guard by @XiaobingSuper in #3478
[FlyDSL][MOE][BugFix] Support moe inter_dim align with 128. by @lalala-sh in #3476
configs: tune gfx1201 Qwen3-8B-FP8 blockscale GEMMs by @chuanbowang2026 in #3484
[Triton] GFX12 import fix by @k50112113 in #3490
[Triton] Fix Flash Attention Graph capture issues by @micmelesse in #2764
[OPUS]Opus gemm 4G reject by @demonsan in #3403
Add fused Allreduce + RMSNorm + MXFP4 quant by @hubertlu-tw in #3229
[Triton-Gluon-MLA-GFX950] add stage1 only kernel wrapper for MLA decode by @Dewei-Wang-sh in #3402
tune deepseek v3.2 ptpc a8w8 moe by @XiaobingSuper in #3487
[CK] Adapt aiter to CK changes from ROCm/rocm-libraries#6978 by @DDEle in #3392
support global load by @fangche123 in #3461
CI: remove internal pip index from vLLM benchmark by @gyohuangxin in #3511
fix gpt oss unified attention unbounded error by @HaonanWang98 in #3509
CI: use Triton wheelhouse for multi-GPU tests by @gyohuangxin in #3512
Add lru cache for flydsl a16w16 gemm compile by @xytpai in #3503
Add correct varlen causal kernel for f8 mha for gfx950 by @JohnNikolay84 in #3499
[Triton] [ATOM] MXFP8 GEMM and A8W4 MOE optimization by @k50112113 in #3292
[Triton] skip Flash attention v3 tests on RDNA by @micmelesse in #3497
[TRITON]perf(sage_attention): defer q_descale * k_descale in no-mask kernel to fuse with row-max subtract by @Chi-Chu319 in #3247
[Triton] SAGE: compensate softmax_lse for K-smoothing shift to enable ring-attention compatibility by @ksikiric in #3334
[triton-mha] add gfx1151 tuning config by @mgehre-amd in #3423
opus_gemm splitk: per-stream workspace ownership for TBO by @demonsan in #3516
[TRITON] Add gfx1250 Gluon fused RMSNorm kernel by @vgokhale in #3444
[module_pa_v1] dead code by @amd-ruitang3 in #3508
[module_pa] dead code by @amd-ruitang3 in #3502
Revert "tune deepseek v3.2 ptpc a8w8 moe" by @XiaobingSuper in #3534
[FLYDSL] bump version to 0.2.0 by @coderfeli in #3485
[OPUS] gfx942 a16w16 bf16 gemm full pipeline family by @yifehuan in #3491
Revert "[CK] Adapt aiter to CK changes from ROCm/rocm-libraries#6978" by @valarLip in #3531
add per-(batch, head) fp8 quant ops for fused QK norm/rope and V by @LiuYinfeng01 in #3353
[HIP] add chunk_gated_delta_rule_fwd_h_hip kernel for prefill GDN support and optimize triton kernel impl for AWS cases by @yiijin in #2774
Fix: add missing end_sync barrier in cross_device_reduce_1stage by @zovonoir in #3514
Fix dsv4_rotate(Hadamard) and top_k_per_row on wave32 by @junhaha666 in #3528
[module_pa_ragged] dead code by @amd-ruitang3 in #3507
CI: Set ATOM image build timeout by @gyohuangxin in #3544
Enabled stride-aware KV-cache block dim for non-contiguous layouts for fused_qk_norm_rope_cache_pts_quant_shuffle() by @jhu960213 in #3492
fix(triton/decode): honor real paged KV block stride (support non-contiguous cache) by @lorri-rao in #3498
Jim/dev/mi400 mha bwd by @slippedJim in #3281
[FLYDSL][GEMM] Full re-tuning for mixed stream-k a16w16 gemm & enhance co-issue by @xytpai in #3469
Fix OPUS warning on gfx1250 by @Boss2002n in #3524
flydsl moe: EP reduce path with masked gather + tuned-key fix by @yadaish in #3377
Rebuild 32x384 kernel from new sources by @JohnNikolay84 in #3540
CI: add timeouts for artifact downloads by @gyohuangxin in #3550
[Triton-Gluon-MLA-GFX950] return_lse: full decode + merged fp32 lse by @Dewei-Wang-sh in #3542
fix multithread_reduce_max_dpp on gfx1250 by @yzhou103 in #3554
CI: add fallback Triton wheel download by @gyohuangxin in #3561
[Triton] [Gluon] fused_qk_rope_cat_and_cache_mla new grid layout by @k50112113 in #3546
Add Kimi K2.5/K2.6 FP4 fused MoE tunings for TP2 (inter_dim=1024, 385 experts, top-9) by @xaguilar-amd in #3287
CI: extend large artifact download timeouts by @gyohuangxin in #3563
[Gluon][gfx1250] Gemm MXFP4 preshuffled by @Boss2002n in #3359
Moe a8w4 optimization for decode by @lburzawa in #3504
Integrate DS R1 GroupedTopk + Sigmoid Routing Into DS Routing by @amirumoAMD in #3522
fix(ci): build multi-Python wheels and publish versioned S3 manifest by @Jasen2201 in #3572
[module_aiter_operator] refactor by @amd-ruitang3 in #3559
feat: Support gfx1250 in deepgemm fp8 paged MQA logits by @charlieguo1106 in #3501
[gfx1250][FlyDSL] Add PTPC FP8 GEMM by @aoli26 in #3106
CI: retry Aiter wheel artifact download by @gyohuangxin in #3584
[gfx1250][FLYDSL] mha with qkdim 192 for DSv3 & KIMI K2 by @Zzz9990 in #3576
[Triton-Gluon-MLA-GFX950] optimize lse store and add support for small ctxlen(1-256) by @Dewei-Wang-sh in #3555
[REFACTOR] move non-hipblaslt bf16 GEMM tuning to csrc/gemm_a16w16 by @yzhou103 in #3482
CI: schedule daily tuning tests by @gyohuangxin in #3592
CI: install dependencies for vLLM benchmark wheel by @gyohuangxin in #3590
Fix flydsl chunk_gdn_h aot failure by @huizzhan in #3385
CI: auto-update split test FILE_TIMES by @aiter-gh-app[bot] in #3447
Validate paged_attention input by @JohnNikolay84 in #3455
[OPUS] align opus gemm tuning tolerance with other solutions by @demonsan in #3558
[pa_mqa_logits] Replace IS_GFX1250 bool constexpr with an ARCH string by @charlieguo1106 in #3621
[Triton] fused_flatten_fp8_group_quant: add transpose_scale param by @Jacob0226 in #3041
Introduce new 64x384 kernel by @JohnNikolay84 in #3630
[gfx12] enable per-tensor scaled-Q/K/V attention by @quintinwang5 in #3543
[gfx1250][FlyDSL] Add Ragged-M OOB Support for PTPC FP8 GEMM by @aoli26 in #3582
[TRITON][GLUON] Unified attention 2d gluon kernel by @cagrikymk in #3577
Tune gfx1151 MHA forward default tile config by @mgehre-amd in #3560
Add hip mhc_fused_post_pre by @junhaha666 in #3623
[Fix] fix MI355 mha fwd_v3 hd192x128 kernel wait LDS bug by @shay-li77 in #3633
[module_custom] refactor by @amd-ruitang3 in #3625
[module_causal_conv1d_update] refactor hip kernel by @amd-ruitang3 in #3595
[FlyDSL MoE] Add no combine feature by @zx3xyy in #3408
[OPUS] gfx942 a16w16 bf16 GEMM pipeline family for DSV4 by @yifehuan in #3594
CI: map ATOM MI350X runner label by @gyohuangxin in #3647
add env var for kernel arg preload by @HaonanWang98 in #3649
Refine flydsl gemm config selection code by @xytpai in #3608
Fix opus gemm aiter check by @yzhou103 in #3622
mha_native: native HIP D64 BF16 split-K forward backend for flash_attn_func by @rocking5566 in #3581
[Feat] Support FP4 gather_kv_b_proj by @qichu-yun in #3597
[Triton][gfx1250] gemm a16w16 cleanup by @azaidy in #3646
add torch compile guard + tdm descriptor fix in routing.py by @ahmed-bsod in #3530
test: skip pa_decode_bf16_asm off gfx1250 by @yhl-amd in #3660
[Triton] [Gluon] [GFX12] UA3D update config by @k50112113 in #3612
feat: reduce_scatter support all dim by @TennyWang1223 in #3464
[TRITON] Fix Assert in Triton fused_kv_cache by @leonling-ll in #3601
Updated MiniMax M2.5 FMoE tuned configs with new 64x384 kernels by @akii96 in #3644
Fix HK MLA decode fwd: per-batch output bounds check + reduce-time per-tile split cap by @ruanjm in #3391
fix: fix hang issue by @PerryZhang01 in #3664
add gptoss and ds gemm config for gfx1250 by @HaonanWang98 in #3676
readd #3117, EP prefill optimization by @inkcherry in #3537
[Triton] Sage MXFP4 return LSE by @ksikiric in #3349
Add and tune fused GEMM A8W8 blockscale A16W16 benchmark by @nidal567 in #3568
gfx950 MoE A8W4: tuned entries for gpt-oss shapes + fallback hardening by @xiaohuguo2023 in #3580
[Triton] Add New Features and Performance Improvement for GMM Kernel by @brunomazzottiamd in #3407
[triton-mha] hint head-stride div-by-8 for vectorized global load by @mgehre-amd in #3424
Remove FP8 varlen MHA async-copy compiler skips by @nidal567 in #3643
Drop the loop carried percentage by @Boss2002n in #3661
[Triton][CDNA4] Optimize gluon blockscale a8w8 gemm kernel by @lijinpei-amd in #3307
Add GLM GQA FP8 KV paged attention test by @ThomasNing in #3609
Mhc large m by @LiuYinfeng01 in #3651
fix(opus): guard gfx942 bf16ws splitk reduce by @yifehuan in #3684
Optimize qk norm rope quant FlyDSL launch path by @yhl-amd in #3618
flydsl: skip unsupported architectures instead of crashing at import by @mgehre-amd in #3683
[Triton] Support non-interleaved tensor layout in fused reshape causal conv1d update kernel for Qwen3.5 by @hellozhuo-amd in #3251
use stride check by @HaonanWang98 in #3697
[Triton] Add fused_rms_mxfp4_quant to model benchmarking tool by @vgokhale in #3687
[Triton] [Gluon] gather_kv_b_proj with shuffled kv_buffer support by @k50112113 in #3688
[Gluon][GFX950][MLA] Fix mla decode accuracy issue with empty kv split by @leonling-ll in #3641
Enabled stride-aware KV-cache block dim for non-contiguous layouts for fused_qk_norm_rope_cache_pts_quant_shuffle() part 2 by @jhu960213 in #3640
Temporary GPT OSS MoE tuning fix by @azaidy in #3701
Add GLM-4.7-FP8 tuned/untuned BF16 GEMM configs (gfx950) by @omirosh in #3285
feat(fmoe): key tuned configs by (gfx, cu_num) to disambiguate archs by @yzhou103 in #3703
refactor: de-torch module_fused_qk_norm_mrope_cache_quant_shuffle by @amd-ruitang3 in #3696
fmha f16 by @feifei14119 in #3039
disable MXFP4 1250 gluon path by @Boss2002n in #3704

New Contributors

@stevenshenyj made their first contribution in #3330
@yifehuan made their first contribution in #3491
@jhu960213 made their first contribution in #3492
@lorri-rao made their first contribution in #3498
@charlieguo1106 made their first contribution in #3501
@aoli26 made their first contribution in #3106
@quintinwang5 made their first contribution in #3543
@zx3xyy made their first contribution in #3408
@qichu-yun made their first contribution in #3597
@leonling-ll made their first contribution in #3601

Full Changelog: v0.1.15...v0.1.16
