v0.1.16
What's Changed
- LDS-aware num_stages selection for gfx950 MoE GEMM by @xiaohuguo2023 in #3372
- [GLUON] Moe a8w4 optimizations by @lburzawa in #3317
- [Triton] Flash Attention Nightly Build Issues by @micmelesse in #3363
- CI: increase Aiter test shards to 8 by @gyohuangxin in #3399
- [module_fused_qk_norm_rope_cache_quant_shuffle] hip kl refactor by @amd-ruitang3 in #3159
- Add GFX950 BF16 MLA persistent kernels for GQA 16/32 qseqlen4 by @fangche123 in #3338
- Remove sorting for fmoe by @JohnNikolay84 in #3001
- [Triton] mhc fix by @k50112113 in #3411
- [FLYDSL] Add streaming-k with mixed slice-k and split-k for flydsl hgemm by @xytpai in #3279
- [tune] enhance glm47 tune file for both gemm and moe by @gbyu-amd in #3371
- Revert "Fix TP4 bf16 MLA persistent dispatch (#3233)"-MI355 mla bf16 nhead32,1 go through m32x4 kernel by @minmengdie in #3365
- support partial rope with fused_qk_norm_rope_cache_quant_shuffle by @gbyu-amd in #3358
- moe enable padding params as runtime by @zhiding512 in #3404
- [OPUS] [ATOM] Add 16mx1_16nx4 sparse prefill variant with H-based kernel dispatch by @kaiyang-1 in #3415
- Fix mhc_pre_big_fuse accuracy issue in rocm7.2.3 by @junhaha666 in #3417
- Fix causal f8 mha kernel to properly support causal mask by @JohnNikolay84 in #3416
- re-tune flydsl bf16 gemm tuned config by @XiaobingSuper in #3434
- Revert "moe enable padding params as runtime" by @valarLip in #3435
- [FLYDSL MOE] fix MoE accuracy with opening pad by @Zzz9990 in #3401
- Add DO runner smoke test (workflow_dispatch on linux-aiter-do-mi350x-8) by @okakarpa in #3436
- Revert DO runner smoke test by @okakarpa in #3437
- sglang downstream: run 8-GPU tests on the DO MI350X runner label by @okakarpa in #3438
- [PERF]update mi300x moe qwen bf16 configs by @lalala-sh in #3384
- ew kernels support on gfx12. by @stevenshenyj in #3330
- [gfx1250] prepare for gfx12 asm kernel porting by @feifei14119 in #3448
- Fix moe topk gating test by @yzhou103 in #3449
- [OPUS]opus_gemm acc fix by @demonsan in #3445
- ci: Kimi-K2.5-MXFP4 downstream accuracy gates for AITER (ATOM + vLLM + SGLang) on MI350X by @sunway513 in #3441
- add MX_FP4_A8 tuned configs and dispatch for moe_gemm_a8w4 by @xiaohuguo2023 in #3428
- ci(release): install flydsl from AMD mirror + fix install_triton.sh dpkg/pipefail by @sunway513 in #3440
- [Triton] [Gluon] [GFX12] add FP4 support for UA3D, MLA, KV cache flushing by @k50112113 in #2888
- Align aiter prebuild jobs with build runner CPU limit by @gyohuangxin in #3433
- feat: support pad args for rmsnorm kernel by @PerryZhang01 in #3265
- ci(atom-downstream): fix MiniMax-M2.7 via HSA_NO_SCRATCH_RECLAIM by @sunway513 in #3474
- Add triton-kernels support and improve triton install messaging by @mengfei-jiang in #3335
- MI350 mla ps fp8 mode support nhead*qseqlen >=128 through kernel mla_a8w8_qh32_qseqlen4_gqaratio32_ps by @minmengdie in #3380
- feat(quant): align MXFP4 / MXFP8 E8M0 scale to ROUND_UP across HIP / Python ref / FlyDSL by @yzhou103 in #3212
- fix(dist): route fused qknorm through compile guard by @XiaobingSuper in #3478
- [FlyDSL][MOE][BugFix] Support moe inter_dim align with 128. by @lalala-sh in #3476
- configs: tune gfx1201 Qwen3-8B-FP8 blockscale GEMMs by @chuanbowang2026 in #3484
- [Triton] GFX12 import fix by @k50112113 in #3490
- [Triton] Fix Flash Attention Graph capture issues by @micmelesse in #2764
- [OPUS]Opus gemm 4G reject by @demonsan in #3403
- Add fused Allreduce + RMSNorm + MXFP4 quant by @hubertlu-tw in #3229
- [Triton-Gluon-MLA-GFX950] add stage1 only kernel wrapper for MLA decode by @Dewei-Wang-sh in #3402
- tune deepseek v3.2 ptpc a8w8 moe by @XiaobingSuper in #3487
- [CK] Adapt aiter to CK changes from ROCm/rocm-libraries#6978 by @DDEle in #3392
- support global load by @fangche123 in #3461
- CI: remove internal pip index from vLLM benchmark by @gyohuangxin in #3511
- fix gpt oss unified attention unbounded error by @HaonanWang98 in #3509
- CI: use Triton wheelhouse for multi-GPU tests by @gyohuangxin in #3512
- Add lru cache for flydsl a16w16 gemm compile by @xytpai in #3503
- Add correct varlen causal kernel for f8 mha for gfx950 by @JohnNikolay84 in #3499
- [Triton] [ATOM] MXFP8 GEMM and A8W4 MOE optimization by @k50112113 in #3292
- [Triton] skip Flash attention v3 tests on RDNA by @micmelesse in #3497
- [TRITON]perf(sage_attention): defer q_descale * k_descale in no-mask kernel to fuse with row-max subtract by @Chi-Chu319 in #3247
- [Triton] SAGE: compensate softmax_lse for K-smoothing shift to enable ring-attention compatibility by @ksikiric in #3334
- [triton-mha] add gfx1151 tuning config by @mgehre-amd in #3423
- opus_gemm splitk: per-stream workspace ownership for TBO by @demonsan in #3516
- [TRITON] Add gfx1250 Gluon fused RMSNorm kernel by @vgokhale in #3444
- [module_pa_v1] dead code by @amd-ruitang3 in #3508
- [module_pa] dead code by @amd-ruitang3 in #3502
- Revert "tune deepseek v3.2 ptpc a8w8 moe" by @XiaobingSuper in #3534
- [FLYDSL] bump version to 0.2.0 by @coderfeli in #3485
- [OPUS] gfx942 a16w16 bf16 gemm full pipeline family by @yifehuan in #3491
- Revert "[CK] Adapt aiter to CK changes from ROCm/rocm-libraries#6978" by @valarLip in #3531
- add per-(batch, head) fp8 quant ops for fused QK norm/rope and V by @LiuYinfeng01 in #3353
- [HIP] add chunk_gated_delta_rule_fwd_h_hip kernel for prefill GDN support and optimize triton kernel impl for AWS cases by @yiijin in #2774
- Fix: add missing end_sync barrier in cross_device_reduce_1stage by @zovonoir in #3514
- Fix dsv4_rotate(Hadamard) and top_k_per_row on wave32 by @junhaha666 in #3528
- [module_pa_ragged] dead code by @amd-ruitang3 in #3507
- CI: Set ATOM image build timeout by @gyohuangxin in #3544
- Enabled stride-aware KV-cache block dim for non-contiguous layouts for fused_qk_norm_rope_cache_pts_quant_shuffle() by @jhu960213 in #3492
- fix(triton/decode): honor real paged KV block stride (support non-contiguous cache) by @lorri-rao in #3498
- Jim/dev/mi400 mha bwd by @slippedJim in #3281
- [FLYDSL][GEMM] Full re-tuning for mixed stream-k a16w16 gemm & enhance co-issue by @xytpai in #3469
- Fix OPUS warning on gfx1250 by @Boss2002n in #3524
- flydsl moe: EP reduce path with masked gather + tuned-key fix by @yadaish in #3377
- Rebuild 32x384 kernel from new sources by @JohnNikolay84 in #3540
- CI: add timeouts for artifact downloads by @gyohuangxin in #3550
- [Triton-Gluon-MLA-GFX950] return_lse: full decode + merged fp32 lse by @Dewei-Wang-sh in #3542
- fix multithread_reduce_max_dpp on gfx1250 by @yzhou103 in #3554
- CI: add fallback Triton wheel download by @gyohuangxin in #3561
- [Triton] [Gluon] fused_qk_rope_cat_and_cache_mla new grid layout by @k50112113 in #3546
- Add Kimi K2.5/K2.6 FP4 fused MoE tunings for TP2 (inter_dim=1024, 385 experts, top-9) by @xaguilar-amd in #3287
- CI: extend large artifact download timeouts by @gyohuangxin in #3563
- [Gluon][gfx1250] Gemm MXFP4 preshuffled by @Boss2002n in #3359
- Moe a8w4 optimization for decode by @lburzawa in #3504
- Integrate DS R1 GroupedTopk + Sigmoid Routing Into DS Routing by @amirumoAMD in #3522
- fix(ci): build multi-Python wheels and publish versioned S3 manifest by @Jasen2201 in #3572
- [module_aiter_operator] refactor by @amd-ruitang3 in #3559
- feat Support gfx1250 in deepgemm fp8 paged MQA logits by @charlieguo1106 in #3501
- [gfx1250][FlyDSL] Add PTPC FP8 GEMM by @aoli26 in #3106
- CI: retry Aiter wheel artifact download by @gyohuangxin in #3584
- [gfx1250][FLYDSL] mha with qkdim 192 for DSv3 & KIMI K2 by @Zzz9990 in #3576
- [Triton-Gluon-MLA-GFX950] optimize lse store and add support for small ctxlen(1-256) by @Dewei-Wang-sh in #3555
- [REFACTOR] move non-hipblaslt bf16 GEMM tuning to csrc/gemm_a16w16 by @yzhou103 in #3482
- CI: schedule daily tuning tests by @gyohuangxin in #3592
- CI: install dependencies for vLLM benchmark wheel by @gyohuangxin in #3590
- Fix flydsl chunk_gdn_h aot failure by @huizzhan in #3385
- CI: auto-update split test FILE_TIMES by @aiter-gh-app[bot] in #3447
- Validate paged_attention input by @JohnNikolay84 in #3455
- [OPUS] align opus gemm tuning tolerance with other solutions by @demonsan in #3558
- [pa_mqa_logits] Replace IS_GFX1250 bool constexpr with an ARCH string by @charlieguo1106 in #3621
- [Triton] fused_flatten_fp8_group_quant: add transpose_scale param by @Jacob0226 in #3041
- Introduce new 64x384 kernel by @JohnNikolay84 in #3630
- [gfx12] enable per-tensor scaled-Q/K/V attention by @quintinwang5 in #3543
- [gfx1250][FlyDSL] Add Ragged-M OOB Support for PTPC FP8 GEMM by @aoli26 in #3582
- [TRITON][GLUON] Unified attention 2d gluon kernel by @cagrikymk in #3577
- Tune gfx1151 MHA forward default tile config by @mgehre-amd in #3560
- Add hip mhc_fused_post_pre by @junhaha666 in #3623
- [Fix] fix MI355 mha fwd_v3 hd192x128 kernel wait LDS bug by @shay-li77 in #3633
- [module_custom] refactor by @amd-ruitang3 in #3625
- [module_causal_conv1d_update] refactor hip kernel by @amd-ruitang3 in #3595
- [FlyDSL MoE] Add no combine feature by @zx3xyy in #3408
- [OPUS] gfx942 a16w16 bf16 GEMM pipeline family for DSV4 by @yifehuan in #3594
- CI: map ATOM MI350X runner label by @gyohuangxin in #3647
- add env var for kernel arg preload by @HaonanWang98 in #3649
- Refine flydsl gemm config selection code by @xytpai in #3608
- Fix opus gemm aiter check by @yzhou103 in #3622
- mha_native: native HIP D64 BF16 split-K forward backend for flash_attn_func by @rocking5566 in #3581
- [Feat] Support FP4 gather_kv_b_proj by @qichu-yun in #3597
- [Triton][gfx1250] gemm a16w16 cleanup by @azaidy in #3646
- add torch compile guard + tdm descriptor fix in routing.py by @ahmed-bsod in #3530
- test: skip pa_decode_bf16_asm off gfx1250 by @yhl-amd in #3660
- [Triton] [Gluon] [GFX12] UA3D update config by @k50112113 in #3612
- feat: reduce_scatter support all dim by @TennyWang1223 in #3464
- [TRITON] Fix Assert in Triton fused_kv_cache by @leonling-ll in #3601
- Updated MiniMax M2.5 FMoE tuned configs with new 64x384 kernels by @akii96 in #3644
- Fix HK MLA decode fwd: per-batch output bounds check + reduce-time per-tile split cap by @ruanjm in #3391
- fix: fix hang issue by @PerryZhang01 in #3664
- add gptoss and ds gemm config for gfx1250 by @HaonanWang98 in #3676
- readd #3117, EP prefill optimization by @inkcherry in #3537
- [Triton] Sage MXFP4 return LSE by @ksikiric in #3349
- Add and tune fused GEMM A8W8 blockscale A16W16 benchmark by @nidal567 in #3568
- gfx950 MoE A8W4: tuned entries for gpt-oss shapes + fallback hardening by @xiaohuguo2023 in #3580
- [Triton] Add New Features and Performance Improvement for GMM Kernel by @brunomazzottiamd in #3407
- [triton-mha] hint head-stride div-by-8 for vectorized global load by @mgehre-amd in #3424
- Remove FP8 varlen MHA async-copy compiler skips by @nidal567 in #3643
- Drop the loop carried percentage by @Boss2002n in #3661
- [Triton][CDNA4] Optimize gluon blockscale a8w8 gemm kernel by @lijinpei-amd in #3307
- Add GLM GQA FP8 KV paged attention test by @ThomasNing in #3609
- Mhc large m by @LiuYinfeng01 in #3651
- fix(opus): guard gfx942 bf16ws splitk reduce by @yifehuan in #3684
- Optimize qk norm rope quant FlyDSL launch path by @yhl-amd in #3618
- flydsl: skip unsupported architectures instead of crashing at import by @mgehre-amd in #3683
- [Triton] Support non-interleaved tensor layout in fused reshape causal conv1d update kernel for Qwen3.5 by @hellozhuo-amd in #3251
- use stride check by @HaonanWang98 in #3697
- [Triton] Add fused_rms_mxfp4_quant to model benchmarking tool by @vgokhale in #3687
- [Triton] [Gluon] gather_kv_b_proj with shuffled kv_buffer support by @k50112113 in #3688
- [Gluon][GFX950][MLA] Fix mla decode accuracy issue with empty kv split by @leonling-ll in #3641
- Enabled stride-aware KV-cache block dim for non-contiguous layouts for fused_qk_norm_rope_cache_pts_quant_shuffle() part 2 by @jhu960213 in #3640
- Temporary GPT OSS MoE tuning fix by @azaidy in #3701
- Add GLM-4.7-FP8 tuned/untuned BF16 GEMM configs (gfx950) by @omirosh in #3285
- feat(fmoe): key tuned configs by (gfx, cu_num) to disambiguate archs by @yzhou103 in #3703
- refactor: de-torch module_fused_qk_norm_mrope_cache_quant_shuffle by @amd-ruitang3 in #3696
- fmha f16 by @feifei14119 in #3039
- disable MXFP4 1250 gluon path by @Boss2002n in #3704
New Contributors
- @stevenshenyj made their first contribution in #3330
- @yifehuan made their first contribution in #3491
- @jhu960213 made their first contribution in #3492
- @lorri-rao made their first contribution in #3498
- @charlieguo1106 made their first contribution in #3501
- @aoli26 made their first contribution in #3106
- @quintinwang5 made their first contribution in #3543
- @zx3xyy made their first contribution in #3408
- @qichu-yun made their first contribution in #3597
- @leonling-ll made their first contribution in #3601
Full Changelog: v0.1.15...v0.1.16