Ensure GemmTuner doesn't generate invalid input combinations for SplitK kernels by apicciau · Pull Request #2721 · ROCm/aiter

apicciau · 2026-04-13T13:28:06Z

Motivation

ASM SplitK kernels use a semaphore array of gdx * gdy entries (grid X × grid Y). When that product exceeds the 1024 entries available, semaphore writes go out of bounds, causing silent corruption or crashes during tuning.

Large matrix shapes (e.g. M=4096, N=4096 with a 64×64 tile) produce gdx * gdy = 4096, well above the limit.
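The arithmetic behind that figure can be checked with a short sketch. It assumes each grid dimension is the ceiling of a matrix dimension divided by the matching tile dimension; the helper name and the mapping of N to gdx and M to gdy are illustrative assumptions, not taken from the tuner source (the product is the same either way).

```python
def splitk_grid_size(m: int, n: int, tile_m: int, tile_n: int) -> int:
    """Return gdx * gdy for an M x N output covered by tile_m x tile_n tiles.

    Hypothetical helper: assumes ceil-division tiling in both dimensions.
    """
    gdx = (n + tile_n - 1) // tile_n  # tiles along N (assumed grid X)
    gdy = (m + tile_m - 1) // tile_m  # tiles along M (assumed grid Y)
    return gdx * gdy


# 4096 / 64 = 64 tiles per dimension, so 64 * 64 = 4096 entries.
print(splitk_grid_size(4096, 4096, 64, 64))  # 4096
```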

Technical Details

Adds a guard in Gemm.asm_gemm_all_solutions() (gradlib/gradlib/GemmTuner.py) that skips any SplitK candidate where gdx * gdy > 1024 before appending to the task list.

The check applies only when splitK > 1, so non-SplitK ("clean") kernels are unaffected.
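A minimal sketch of such a guard, for illustration only. The `Candidate` type, its fields, and `filter_splitk_candidates` are hypothetical names; the real logic lives inside `asm_gemm_all_solutions()` and may compute the grid differently.

```python
from typing import List, NamedTuple

SEMAPHORE_LIMIT = 1024  # limit quoted in this PR


class Candidate(NamedTuple):
    """Hypothetical stand-in for one ASM kernel tuning candidate."""
    kernel_name: str
    tile_m: int
    tile_n: int
    split_k: int


def _ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def filter_splitk_candidates(m: int, n: int,
                             candidates: List[Candidate]) -> List[Candidate]:
    """Drop SplitK candidates whose grid would overflow the semaphore array."""
    kept = []
    for cand in candidates:
        grid = _ceil_div(n, cand.tile_n) * _ceil_div(m, cand.tile_m)
        # Only SplitK candidates (split_k > 1) are subject to the limit.
        if cand.split_k > 1 and grid > SEMAPHORE_LIMIT:
            continue
        kept.append(cand)
    return kept


candidates = [
    Candidate("k64_plain", 64, 64, 1),     # grid 4096, no SplitK: kept
    Candidate("k64_split4", 64, 64, 4),    # grid 4096, SplitK: skipped
    Candidate("k128_split4", 128, 128, 4), # grid 1024, at the limit: kept
]
print([c.kernel_name for c in filter_splitk_candidates(4096, 4096, candidates)])
```

Filtering before the task list is built means an invalid combination never reaches the kernel launch, which is where the out-of-bounds write would occur.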

Test Plan

Added offline unit tests in gradlib/test_gemm_tuner_splitk.py covering three cases:

Large grid (gdx*gdy > 1024): all generated tasks must satisfy the constraint
Small grid (gdx*gdy = 1): SplitK tasks must still be generated (no false filtering)
Boundary (gdx*gdy = 1024): tasks at the exact limit must be kept
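The three cases above can be sketched as a small `unittest` class. This is not the PR's test file: `splitk_grid_allowed` is a hypothetical predicate standing in for the guard, and the grid values are chosen only to hit each case.

```python
import unittest

SEMAPHORE_LIMIT = 1024  # limit quoted in this PR


def splitk_grid_allowed(gdx: int, gdy: int, split_k: int) -> bool:
    """Hypothetical predicate: True if the candidate may be tuned."""
    if split_k <= 1:
        return True  # the guard applies to SplitK candidates only
    return gdx * gdy <= SEMAPHORE_LIMIT


class SplitKGridCases(unittest.TestCase):
    def test_large_grid_is_filtered(self):
        # 64 * 64 = 4096 entries, above the limit.
        self.assertFalse(splitk_grid_allowed(64, 64, split_k=4))

    def test_small_grid_is_kept(self):
        # 1 * 1 = 1 entry, so SplitK must not be filtered out.
        self.assertTrue(splitk_grid_allowed(1, 1, split_k=4))

    def test_boundary_grid_is_kept(self):
        # 32 * 32 = 1024 entries, exactly at the limit.
        self.assertTrue(splitk_grid_allowed(32, 32, split_k=4))
```

The boundary case matters because an off-by-one in the comparison (`>=` instead of `>`) would silently discard valid candidates.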

Tests stub the full aiter/ROCm stack and run without a GPU:

cd gradlib && python -m pytest test_gemm_tuner_splitk.py -v

Test Result

3 passed in 0.93s

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-04-13T13:28:54Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-355`	Run Triton tests on MI355 in addition to MI325
`ci:sglang`	SGLang integration tests
`ci:atom`	ATOM benchmark (DeepSeek-R1 + GPT-OSS)
`ci:vllm`	vLLM benchmark
`ci:all`	All of the above

Add labels via the sidebar or gh pr edit 2721 --add-label <label>

…tK kernels

Tests verify that asm_gemm_all_solutions correctly filters candidates where gdx*gdy exceeds the 1024-entry semaphore array limit, while preserving valid small-grid and boundary-grid combinations. No GPU required; aiter stack is stubbed for offline execution.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Copilot

Pull request overview

Prevents GemmTuner.asm_gemm_all_solutions() from generating SplitK ASM tuning tasks whose grid size would exceed the semaphore workspace capacity, avoiding potential out-of-bounds writes during tuning for large GEMM shapes.

Changes:

Add a SplitK-only guard in asm_gemm_all_solutions() to skip candidates where gdx * gdy > 1024.
Add offline unit tests validating filtering behavior for large, small, and boundary grid sizes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`gradlib/gradlib/GemmTuner.py`	Adds SplitK task filtering based on computed grid size to avoid semaphore overflow scenarios.
`gradlib/test_gemm_tuner_splitk.py`	Adds offline tests (with stubs) to validate the new SplitK grid-size guard behavior.

Comments suppressed due to low confidence (1)

gradlib/gradlib/GemmTuner.py:405

The 1024 limit is a hard-coded magic number here. Since the semaphore workspace size appears to be defined elsewhere (e.g. aiter/ops/gemm_op_a16w16.py allocates a fixed (16, 64) semaphore tensor = 1024 entries), consider centralizing this as a named constant (or deriving it from the allocator) to prevent future drift and to document why 1024 is the correct bound.

        ) and get_gfx() == "gfx950":
            logger.warning(
                f"ASM gemm only supports indtype=bf16 and outdtype=bf16 and k%256==0 and not scaleAB is supported in {get_gfx()}, but actual indtype is {self.indtype}, outdtype is {self.outdtype}, k is  {self.k}, scaleAB is {self.scaleAB}"
            )

            self.asm_gtimedf = pd.DataFrame(columns=["gtimems", "libtype"])
            return []
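One way the named-constant suggestion could look, as a sketch. The constant and function names are hypothetical; the (16, 64) shape is the one the comment attributes to aiter/ops/gemm_op_a16w16.py and is not verified here.

```python
import math

# Assumed semaphore tensor shape, as described in the review comment.
SEMAPHORE_SHAPE = (16, 64)

# Derive the bound from the shape so the two cannot drift apart.
MAX_SPLITK_GRID_ENTRIES = math.prod(SEMAPHORE_SHAPE)


def splitk_grid_fits(gdx: int, gdy: int) -> bool:
    """True if a gdx x gdy grid fits in the semaphore workspace."""
    return gdx * gdy <= MAX_SPLITK_GRID_ENTRIES


print(MAX_SPLITK_GRID_ENTRIES)  # 1024
```

If the allocator and the tuner both imported the same constant, changing the workspace size would update the guard automatically.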


Copilot · 2026-04-17T15:13:35Z

+    }
+    for name, mod in stubs.items():
+        sys.modules.setdefault(name, mod)
+
+
+_install_stubs()


Installing stub modules into sys.modules at import time can leak into the rest of the test process (e.g., later tests may unexpectedly import the stubs instead of the real aiter package). Consider scoping this to the test module via unittest.mock.patch.dict(sys.modules, ...) (and/or only stubbing when aiter cannot be imported) so other tests aren’t impacted.
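A sketch of the scoped approach using `unittest.mock.patch.dict`. The module name `fake_gpu_stack` and the `make_stubs` helper are hypothetical, chosen so the example cannot collide with a real package; the PR's own stub set is not reproduced here.

```python
import sys
import types
import unittest
from unittest.mock import patch


def make_stubs() -> dict:
    """Build hypothetical stub modules keyed by import name."""
    fake = types.ModuleType("fake_gpu_stack")
    fake.get_gfx = lambda: "gfx942"
    return {"fake_gpu_stack": fake}


class ScopedStubExample(unittest.TestCase):
    def setUp(self):
        # patch.dict restores sys.modules to its prior state when stopped,
        # so the stubs exist only for the duration of each test.
        patcher = patch.dict(sys.modules, make_stubs())
        patcher.start()
        self.addCleanup(patcher.stop)

    def test_stub_is_importable(self):
        import fake_gpu_stack  # resolved from the patched sys.modules
        self.assertEqual(fake_gpu_stack.get_gfx(), "gfx942")
```

Compared with `sys.modules.setdefault` at import time, this keeps the stubs out of any test that runs afterwards in the same process.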

@copilot apply changes based on this feedback

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

github-actions · 2026-04-17T15:27:59Z

+
+class TestSplitKSemaphoreGuard(unittest.TestCase):
+
+    @patch("gradlib.GemmTuner.get_gfx", return_value="gfx942")