
[TRITON] Fix unit tests on gfx950 - part 2 #2491

Merged
brunomazzottiamd merged 7 commits into main from bmazzott/fix-gluon-mfma-layout-on-gfx950 on Apr 7, 2026

Conversation

@brunomazzottiamd (Contributor) commented Mar 26, 2026

Motivation

The Triton test suite isn't passing on gfx950. This PR fixes test_gemm_afp4wfp4.py and test_gemm_a8w8.py, slightly improving the situation.

Technical Details

Fix test_gemm_afp4wfp4.py

Triton commit de2ba3946b changed AMDMFMALayout.instr_shape from a 2-element [M, N] to a 3-element [M, N, K] list. Extend the previously 2-element [32, 32] to [32, 32, 64]. K=64 is the K dimension of the mfma_scale_f32_32x32x64_f8f6f4 hardware instruction used for FP4 on gfx950.
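The version-dependent shape can be sketched as follows. This is illustrative only: `make_instr_shape` is a hypothetical helper, not code from this PR.

```python
# Illustrative sketch only: pick AMDMFMALayout.instr_shape based on the
# Triton version. `make_instr_shape` is a hypothetical helper name.
def make_instr_shape(triton_version: tuple) -> list:
    # Triton >= 3.6 (after commit de2ba3946b) expects a 3-element [M, N, K];
    # older releases expect a 2-element [M, N].
    if triton_version >= (3, 6):
        # K=64 matches the mfma_scale_f32_32x32x64_f8f6f4 instruction on gfx950
        return [32, 32, 64]
    return [32, 32]
```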

Restrict AFP4/WFP4 AOT tests to Triton 3.5. Avoid using prebuilt AOT kernels on newer Triton versions where the metadata format is incompatible.

Fix test_gemm_a8w8.py

Fix _gemm_a8w8_kernel

Same instr_shape API break (Triton de2ba3946b). The kernel uses mfma_scaled for FP8 and plain mfma for INT8, which target different hardware instructions with different K dimensions:

  • FP8 mfma_scale_f32_16x16x128_f8f6f4 (K=128, K_WIDTH=32)
  • INT8 mfma_i32_16x16x64_i8 (K=64, K_WIDTH=16)

SwizzledSharedLayout.vec is updated to match K_WIDTH per data type specialization.
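The per-dtype parameters above can be summarized in a small table. Names here (`MFMA_PARAMS`, `swizzled_vec`) are assumptions for illustration, not the kernel's actual code.

```python
# Illustrative table (names assumed): per-dtype MFMA instruction parameters,
# and the SwizzledSharedLayout.vec choice that must track K_WIDTH for each
# data-type specialization.
MFMA_PARAMS = {
    "fp8":  {"instr": "mfma_scale_f32_16x16x128_f8f6f4", "K": 128, "K_WIDTH": 32},
    "int8": {"instr": "mfma_i32_16x16x64_i8",            "K": 64,  "K_WIDTH": 16},
}

def swizzled_vec(dtype: str) -> int:
    # vec is set to K_WIDTH so shared-memory vectorization matches the dot operand
    return MFMA_PARAMS[dtype]["K_WIDTH"]
```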

Fix _gemm_a8w8_preshuffled_kernel

The linear_nk layout and its reshape - permute - reshape - trans unshuffle sequence were designed for K=32 / K_WIDTH=16, so applying K=128 breaks the layout conversion. Since mfma_scaled was already invoked with a_scale=None and b_scale=None (per-tensor scale applied to the accumulator separately), replace it with plain mfma, targeting the unscaled mfma_f32_16x16x32_fp8_fp8 (K=32) that the preshuffled layout was built for.
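The equivalence that makes this substitution safe: with both scales None, scaling the accumulator after the dot equals scaling the operands before it. A pure-Python stand-in (not real mfma code) for the two orderings:

```python
# Pure-Python stand-in (not real mfma code) showing why a plain dot plus a
# software accumulator scale matches mfma_scaled called with scale=None:
def dot_then_scale(a, b, a_scale, b_scale):
    acc = sum(x * y for x, y in zip(a, b))  # unscaled dot, like plain mfma
    return acc * (a_scale * b_scale)        # per-tensor scale on the accumulator

def scale_then_dot(a, b, a_scale, b_scale):
    # equivalent: apply the per-tensor scales to the operands before the dot
    return sum((x * a_scale) * (y * b_scale) for x, y in zip(a, b))
```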

Relax test_gemm_a8w8.py tolerance

Relax absolute tolerance from 0.02 to 0.03 to accommodate the preshuffled FP8 path (unscaled dot + software accumulator scale).

Fix ff_a16w16_fused_ungated.py

The k-loop staggers each N-block's start position by k_cyclic_offset = pid_n % cdiv(K, BLOCK_SIZE_K) to reduce tl.atomic_add contention on y_ptrs. The y_mask K-boundary check incorrectly used the raw loop counter k (always starting at 0) instead of k_cyclic_offset (the actual K position). When the cyclic offset is non-zero, k understates the real offset, producing a wrong mask and corrupting partial sums near the K boundary. Replace k with k_cyclic_offset, consistent with the analogous bound already used in the w2 load mask.
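A minimal sketch of the staggering, with assumed names (`pid_n`, `K`, `BLOCK_SIZE_K` mirror the kernel's parameters): the K-boundary mask must be computed from the staggered position `k_cyclic_offset`, not the raw loop counter `k`.

```python
# Illustrative sketch only, not the kernel's actual code.
def cdiv(a, b):
    return -(-a // b)

def k_blocks(pid_n, K, BLOCK_SIZE_K):
    n = cdiv(K, BLOCK_SIZE_K)
    start = pid_n % n  # each N-block starts its k-loop at a different K block
    for k in range(n):
        k_cyclic_offset = (start + k) % n
        # number of valid elements in this K block; the bug computed this
        # from the raw counter `k` instead of `k_cyclic_offset`
        valid = min(BLOCK_SIZE_K, K - k_cyclic_offset * BLOCK_SIZE_K)
        yield k, k_cyclic_offset, valid
```

For pid_n = 1, K = 80, BLOCK_SIZE_K = 32, the blocks are visited in order 1, 2, 0; at k = 1 the visited block is the ragged tail (16 valid elements), which a mask based on `k` alone would miss.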

Compatibility fixes for older Gluon API (Triton < 3.6.0)

This PR also implements compatibility with the older Gluon API, supporting Triton compilers older than version 3.6.
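The gating can be sketched as a version check. Helper names here are assumptions for illustration; the PR's actual detection logic lives in the refactored version-detection module.

```python
# Hedged sketch (helper names assumed): gate Gluon API usage on the
# installed Triton version string.
def parse_version(version: str):
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

def use_new_gluon_api(triton_version: str) -> bool:
    # Triton >= 3.6 exposes the newer Gluon API (e.g. 3-element instr_shape);
    # older compilers take the compatibility path this PR adds.
    return parse_version(triton_version) >= (3, 6)
```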

Test Plan

Run respective tests on gfx950:

```shell
for t in \
  op_tests/triton_tests/gemm/basic/test_gemm_afp4wfp4.py \
  op_tests/triton_tests/gemm/basic/test_gemm_a8w8.py \
  op_tests/triton_tests/test_pa_decode_gluon.py \
; do echo "${t}"; pytest -q --no-header "${t}" | tail -1; done
```

The tests should pass on the latest Triton TOT and on Triton 3.5.0 (< 3.6.0).

Test Result

test_gemm_afp4wfp4.py, test_gemm_a8w8.py and test_pa_decode_gluon.py pass on gfx950.

TOT Triton - all test cases:

op_tests/triton_tests/gemm/basic/test_gemm_afp4wfp4.py
10656 passed, 3680 skipped in 570.54s (0:09:30)
op_tests/triton_tests/gemm/basic/test_gemm_a8w8.py
9382 passed, 1856 skipped in 1024.83s (0:17:04)
op_tests/triton_tests/test_pa_decode_gluon.py
4 passed, 1 warning in 618.54s (0:10:18)

Triton 3.5.0 - Gluon kernel test cases only, to check compatibility with the older API:

op_tests/triton_tests/gemm/basic/test_gemm_afp4wfp4.py
3584 passed, 3584 skipped in 223.31s (0:03:43)
op_tests/triton_tests/gemm/basic/test_gemm_a8w8.py
6414 passed, 1272 skipped in 689.21s (0:11:29)
op_tests/triton_tests/test_pa_decode_gluon.py
4 passed, 1 warning in 618.00s (0:10:18)

Running only the Gluon kernels was achieved through the following patch:

```diff
diff --git a/op_tests/triton_tests/gemm/basic/test_gemm_a8w8.py b/op_tests/triton_tests/gemm/basic/test_gemm_a8w8.py
index 10d8f99d8..2010a2ec0 100644
--- a/op_tests/triton_tests/gemm/basic/test_gemm_a8w8.py
+++ b/op_tests/triton_tests/gemm/basic/test_gemm_a8w8.py
@@ -175,7 +175,6 @@ def generate_gemm_a8w8_inputs(
 @pytest.mark.parametrize(
     "impl",
     [
-        "triton",
         "gluon",
         "gluon_shuffle",
     ],
diff --git a/op_tests/triton_tests/gemm/basic/test_gemm_afp4wfp4.py b/op_tests/triton_tests/gemm/basic/test_gemm_afp4wfp4.py
index 8a4953811..671f516ef 100644
--- a/op_tests/triton_tests/gemm/basic/test_gemm_afp4wfp4.py
+++ b/op_tests/triton_tests/gemm/basic/test_gemm_afp4wfp4.py
@@ -241,7 +241,7 @@ def run_triton(
     [True, False],
 )
 @pytest.mark.parametrize("skip_reduce", [True, False])
-@pytest.mark.parametrize("impl", ["triton", "gluon"])
+@pytest.mark.parametrize("impl", ["gluon"])
 def test_gemm_afp4_wfp4(
     M: int,
     N: int,

```

Submission Checklist

@brunomazzottiamd brunomazzottiamd self-assigned this Mar 26, 2026
@brunomazzottiamd brunomazzottiamd requested a review from a team March 26, 2026 20:09
@brunomazzottiamd brunomazzottiamd added bug Something isn't working triton ci:all labels Mar 26, 2026
@brunomazzottiamd brunomazzottiamd force-pushed the bmazzott/fix-gluon-mfma-layout-on-gfx950 branch from 1fa2e5a to 3564334 Compare March 27, 2026 18:06

@brunomazzottiamd brunomazzottiamd force-pushed the bmazzott/fix-gluon-mfma-layout-on-gfx950 branch 4 times, most recently from e4fae56 to 67bde7c Compare March 30, 2026 19:22
@brunomazzottiamd brunomazzottiamd force-pushed the bmazzott/fix-gluon-mfma-layout-on-gfx950 branch from 32f1746 to d2035ae Compare March 31, 2026 16:44

@brunomazzottiamd brunomazzottiamd force-pushed the bmazzott/fix-gluon-mfma-layout-on-gfx950 branch from 7fec184 to 9a16391 Compare April 1, 2026 13:41
@brunomazzottiamd brunomazzottiamd force-pushed the bmazzott/fix-gluon-mfma-layout-on-gfx950 branch from 9a16391 to 6cd4cc0 Compare April 1, 2026 18:36

@brunomazzottiamd brunomazzottiamd force-pushed the bmazzott/fix-gluon-mfma-layout-on-gfx950 branch from 6cd4cc0 to 4bd8467 Compare April 1, 2026 20:41
@brunomazzottiamd brunomazzottiamd requested a review from azaidy April 1, 2026 20:41
@brunomazzottiamd brunomazzottiamd force-pushed the bmazzott/fix-gluon-mfma-layout-on-gfx950 branch from df7ac2c to a5c73f9 Compare April 2, 2026 12:57
@brunomazzottiamd brunomazzottiamd force-pushed the bmazzott/fix-gluon-mfma-layout-on-gfx950 branch 3 times, most recently from 8986535 to dd453ba Compare April 7, 2026 14:39
@azaidy (Contributor) left a comment:

LGTM!

@brunomazzottiamd (Contributor, Author) commented:

The only test failures are the expected ones, i.e. test_mha_backward on Triton Tests (MI35X) / Shard 0:

=========================== short test summary info ============================
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-512-512-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-512-512-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-1024-1024-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-1024-1024-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-2048-2048-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-2048-2048-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-512-512-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-512-512-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-1024-1024-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-1024-1024-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-2048-2048-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-2048-2048-4]
==== 12 failed, 5269 passed, 2016 skipped, 6 warnings in 3920.96s (1:05:20) ====

These failures aren't in the scope of this PR. We're good to merge, everything else passed.

FYI: @gyohuangxin

@brunomazzottiamd brunomazzottiamd merged commit 957c1aa into main Apr 7, 2026
65 of 69 checks passed
@brunomazzottiamd brunomazzottiamd deleted the bmazzott/fix-gluon-mfma-layout-on-gfx950 branch April 7, 2026 20:57
yzhou103 pushed a commit that referenced this pull request Apr 8, 2026
* Fix `test_gemm_afp4wfp4.py`

Triton commit de2ba3946b ("[AMD] Refactor mfma layout") changed
`AMDMFMALayout.instr_shape` from a 2-element `[M, N]` to a 3-element
`[M, N, K]` list. Extend the previously 2-element `[32, 32]` to
`[32, 32, 64]`. K=64 is the K dimension of the
`mfma_scale_f32_32x32x64_f8f6f4` hardware instruction used for FP4
on `gfx950`.

* Fix `test_gemm_a8w8.py`

* Fix `_gemm_a8w8_kernel`:
  Same `instr_shape` API break (Triton de2ba3946b). The kernel uses
  `mfma_scaled` for FP8 and plain `mfma` for INT8, which target
  different hardware instructions with different K dimensions:
    - FP8 `mfma_scale_f32_16x16x128_f8f6f4` (K=128, K_WIDTH=32)
    - INT8 `mfma_i32_16x16x64_i8` (K=64, K_WIDTH=16)
  `SwizzledSharedLayout.vec` is updated to match K_WIDTH per data type
  specialisation.

* Fix `_gemm_a8w8_preshuffled_kernel`:
  The `linear_nk` layout and its `reshape - permute - reshape - trans`
  unshuffle sequence were designed for K=32 / K_WIDTH=16, so applying
  K=128 breaks the layout conversion. Since `mfma_scaled` was already
  invoked with `a_scale=None` and `b_scale=None` (per-tensor scale
  applied to the accumulator separately), replace it with plain `mfma`,
  targeting the unscaled `mfma_f32_16x16x32_fp8_fp8` (K=32) that the
  preshuffled layout was built for.

* Fix `test_gemm_a8w8.py`:
  Relax absolute tolerance from 0.02 to 0.03 to accommodate the
  preshuffled FP8 path (unscaled dot + software accumulator scale).

* Refactor Triton version detection logic out of `pa_decode_gluon.py`

This aspect should be also used by other Gluon kernels, namely
`gemm_afp4wfp4.py` and `gemm_a8w8.py`.

* Fix `test_gemm_afp4wfp4.py`

* Restrict AFP4/WFP4 AOT tests to Triton 3.5. Avoid using prebuilt AOT kernels
on newer Triton versions where the metadata format is incompatible.

* Implement compatibility for old Gluon API

* Support Gluon API for Triton compiler older than 3.6.
* Conditionally skip some cases of `test_gemm_a8w8.py::test_gemm_splitk` on
  Triton 3.5. Ragged FP8 split-K lowering fails in Triton 3.5.

* Fix `ff_a16w16_fused_ungated.py`

The k-loop staggers each N-block's start position by
`k_cyclic_offset = pid_n % cdiv(K, BLOCK_SIZE_K)` to reduce `tl.atomic_add`
contention on `y_ptrs`. The `y_mask` K-boundary check incorrectly used the raw
loop counter `k` (always starting at 0) instead of `k_cyclic_offset` (the actual
K position). When the cyclic offset is non-zero, `k` understates the real
offset, producing a wrong mask and corrupting partial sums near the K boundary.
Replace `k` with `k_cyclic_offset`, consistent with the analogous bound already
used in the `w2` load mask.

* Set RNG seed in `test_pa_decode.py`