Add HIP MLA reduce kernel dispatch logic by ftyghome · Pull Request #1018 · ROCm/ATOM

ftyghome · 2026-06-01T14:56:52Z

Add HIP MLA reduce kernel dispatch logic. Depends on ROCm/aiter#3468.

Performance Result

E2E (rocprof, Kimi K2.5, conc128): the decode reduce drops from 7997 ns
(stock kn_mla_reduce_v1_ps) to 6176 ns (ours), ~1.3×.
Microbench (conc64, plain TP4, H=16 / K=512): stock mla_reduce_v1 ~4.7 µs →
ours (adaptive) ~2.5 µs, ~1.9×.
Correctness: output checked against a torch reference within atol/rtol 2e-2.
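As a quick sanity check on the figures above, the sketch below recomputes both speedup ratios from the reported timings and shows the kind of elementwise tolerance rule that an atol/rtol comparison implies. The helper names `speedup` and `within_tolerance` are hypothetical and not part of the PR; the tolerance rule mirrors the `|a - e| <= atol + rtol * |e|` form used by `torch.allclose`, written in plain Python so it runs without a GPU.

```python
def speedup(baseline, optimized):
    """Return how many times faster the optimized timing is than the baseline."""
    return baseline / optimized


def within_tolerance(actual, expected, atol=2e-2, rtol=2e-2):
    """Elementwise check of |a - e| <= atol + rtol * |e| over two sequences."""
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


# Timings taken from the performance results above.
e2e = speedup(7997, 6176)    # decode reduce kernel, nanoseconds
micro = speedup(4.7, 2.5)    # microbenchmark, microseconds

print(f"E2E {e2e:.2f}x, microbench {micro:.2f}x")
```

The ratios round to the ~1.3× and ~1.9× quoted in the description.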

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an opt-in ROCm/HIP MLA decode reduction path controlled by a new environment flag, falling back to the existing mla_decode_fwd implementation when not applicable.

Changes:

Introduces ATOM_ENABLE_HIP_MLA_REDUCE env toggle (default enabled).
Adds a new decode fast-path using mla_reduce_decode under ROCm/fp4bmm + head-count constraints.
Retains the existing mla_decode_fwd path as a fallback.
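The dispatch described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: only the names `ATOM_ENABLE_HIP_MLA_REDUCE`, `mla_reduce_decode`, and `mla_decode_fwd` come from the source. The helper names, the env parsing rule, the default value, and the `supported_heads` constraint are all hypothetical, and the function returns the name of the chosen path instead of calling any kernel.

```python
import os


def hip_mla_reduce_enabled(environ=os.environ):
    """Read the toggle; the default of "1" and accepted values are assumptions."""
    value = environ.get("ATOM_ENABLE_HIP_MLA_REDUCE", "1")
    return value.strip().lower() in ("1", "true")


def select_decode_path(is_rocm, use_fp4_bmm, num_heads,
                       environ=os.environ, supported_heads=(16,)):
    """Return the name of the decode path that would be dispatched.

    supported_heads is a hypothetical stand-in for the head-count
    constraint mentioned in the review; the real condition may differ.
    """
    fast_path_ok = (
        hip_mla_reduce_enabled(environ)
        and is_rocm
        and use_fp4_bmm
        and num_heads in supported_heads
    )
    if fast_path_ok:
        return "mla_reduce_decode"
    # Any unmet condition falls back to the existing implementation.
    return "mla_decode_fwd"


print(select_decode_path(True, True, 16, environ={}))
print(select_decode_path(True, True, 16,
                         environ={"ATOM_ENABLE_HIP_MLA_REDUCE": "0"}))
```

Keeping the fallback as the final unconditional return means an unsupported configuration never reaches the new kernel, which matches the "falls back when not applicable" behaviour described in the overview.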

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| atom/utils/envs.py | Adds an environment flag to control enabling the HIP MLA reduce decode path. |
| atom/model_ops/attention_mla.py | Imports and conditionally uses `mla_reduce_decode` to accelerate decode on supported ROCm configurations. |


    batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant as _aiter_triton_fp8_bmm,
 )

+from aiter.mla_v_up_proj import mla_reduce_decode


+                reduced = mla_reduce_decode(
+                    q,
+                    kv_buffer,
+                    o,


mla reduce only

939452c

Copilot AI review requested due to automatic review settings June 1, 2026 14:56

Copilot AI reviewed Jun 1, 2026

ftyghome wants to merge 1 commit into ROCm:main from RadeonFlow:rf-mla


2 participants
