[OPUS] Add gfx950 smem transpose load by kaiyang-1 · Pull Request #2480 · ROCm/aiter

kaiyang-1 · 2026-03-26T07:55:25Z

Summary

This change adds shared memory transpose load support in opus.hpp for AMDGPU gfx950, aligned with ds_read_tr* builtins from Clang.

Changes

Add smem::_tr_load, dispatched on scalar_type and the template vec (vec*vector_size): tr6 / tr4 vs tr8 for 2×i32; tr16 for i16/u16 (u16 only when __clang_major__ >= 20), fp16, and bf16.
Public API: tr_load, tr_load(const Layout&), and tr_load_if, plus free-function tr_load / tr_load_if for is_smem_v.
Non-gfx950 device builds get a clear static_assert when tr_load is instantiated; host builds fall back to an ordinary load.
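
For readers unfamiliar with the gating pattern in the last item, here is a minimal host-compilable sketch. It is not the code in opus.hpp: `tr_load_sketch`, `vec_sketch`, and `dependent_false_v` are hypothetical names, and the gfx950 builtin call is deliberately omitted. It only illustrates how a `static_assert` on a dependent-false value fires when the template is instantiated in a non-gfx950 device build, while other builds take an ordinary load.

```cpp
#include <cstddef>
#include <cstring>

// Dependent-false helper: the assertion below is evaluated only when the
// template is actually instantiated, not when the header is merely parsed.
template <typename T>
inline constexpr bool dependent_false_v = false;

// Hypothetical stand-in for a small register vector; not the opus vector type.
template <typename T, int N>
struct vec_sketch {
    T data[N];
};

// Hypothetical sketch of the gating described above; not the opus.hpp code.
template <int N, typename T>
vec_sketch<T, N> tr_load_sketch(const T* base, std::size_t offset) {
#if defined(__HIP_DEVICE_COMPILE__) && !defined(__gfx950__)
    // Device pass for a non-gfx950 target: fail at compile time, but only if used.
    static_assert(dependent_false_v<T>, "tr_load requires gfx950 (ds_read_tr*)");
    return {};
#else
    // Host pass: ordinary, non-transposing load.
    // In a gfx950 device pass the real code would call a ds_read_tr* builtin
    // here instead; that call is omitted from this sketch.
    vec_sketch<T, N> out;
    std::memcpy(out.data, base + offset, sizeof(T) * N);
    return out;
#endif
}
```

On the host, `tr_load_sketch<4>(buf, 2)` simply copies `buf[2]` through `buf[5]`, which mirrors the "host falls back to ordinary load" behaviour described above.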

Testing

Added a new test kernel and host wrapper in test_tr_load_f16.cu to exercise and validate the tr_load functionality for 16x32 fp16 tiles, storing the result in MFMA B layout (32x16).
Registered the new test in the Python test runner (test_opus_device.py), including device-side invocation, architecture gating (only runs on gfx950), and result verification.
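
As a rough illustration of what a host-side check for such a test could compare against, the sketch below transposes a row-major 16x32 tile into a row-major 32x16 tile. It assumes the 32x16 result is the plain logical transpose of the input; the per-lane MFMA B layout used by the actual kernel is not described in this PR text and is not modelled here. `reference_transpose` and `half_bits` are hypothetical names, and fp16 values are treated as raw 16-bit patterns because a transpose only moves them.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kRows = 16;  // source tile rows
constexpr std::size_t kCols = 32;  // source tile columns

// fp16 values carried as raw 16-bit patterns (assumption for this sketch).
using half_bits = std::uint16_t;

// Logical transpose: row-major 16x32 input to row-major 32x16 output.
std::array<half_bits, kRows * kCols>
reference_transpose(const std::array<half_bits, kRows * kCols>& src) {
    std::array<half_bits, kRows * kCols> dst{};
    for (std::size_t r = 0; r < kRows; ++r) {
        for (std::size_t c = 0; c < kCols; ++c) {
            // Element (r, c) of the source becomes element (c, r) of the result.
            dst[c * kRows + r] = src[r * kCols + c];
        }
    }
    return dst;
}
```

A device test could copy its 32x16 output back to the host and compare it element by element with this reference, provided the stored layout really is the plain transpose assumed here.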

Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.

github-actions · 2026-03-26T07:56:17Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-355` | Run Triton tests on MI355 in addition to MI325 |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2480 --add-label <label>`

Copilot

Pull request overview

Adds gfx950-specific shared-memory transpose load support to opus::smem, exposing new tr_load / tr_load_if APIs that map to Clang ds_read_tr* builtins on device, while providing host fallback and clear non-gfx950 device diagnostics.

Changes:

Implement smem::_tr_load using __builtin_amdgcn_ds_read_tr* intrinsics for supported scalar/vector combinations on __gfx950__.
Add smem::tr_load(...) overloads for scalar offsets and for Layout, plus smem::tr_load_if(...) for predicated bulk loads.
Add free-function wrappers opus::tr_load / opus::tr_load_if for is_smem_v.
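
To make "predicated bulk load" concrete, here is a small host-only sketch of the general idea. It is not the opus API: `load_if_sketch`, its parameters, and the choice to write a fill value for skipped elements are assumptions made for illustration, since the PR text does not say what `tr_load_if` produces where the predicate is false.

```cpp
#include <array>
#include <cstddef>

// Hypothetical predicated bulk load: element i is read only when pred(i) is
// true; otherwise `fill` is stored and the source is never dereferenced.
template <std::size_t N, typename T, typename Pred>
std::array<T, N> load_if_sketch(const T* base, Pred pred, T fill = T{}) {
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = pred(i) ? base[i] : fill;
    }
    return out;
}
```

For example, `load_if_sketch<4>(buf, [](std::size_t i) { return i < 2; }, -1)` reads the first two elements of an `int` buffer and fills the remaining two with `-1`, which is the usual way to guard loads that would otherwise run past a tile boundary.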


carlushuang

LGTM

* OPUS: add gfx950 smem transpose load path

  Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.
* tr_load example layout and unit test

#2498)

* Add ctypes C-ABI error bridging to prevent worker crashes during kernel tuning

  AITER_CHECK and HIP_CALL now throw std::runtime_error instead of calling std::terminate()/exit(0), so exceptions can be caught at the C-ABI boundary.

  New header aiter_ctypes_error.h provides:
  - AITER_CTYPES_ERROR_DEF: per-TU thread-local error storage + ABI version probe
  - AITER_CTYPES_DEFINE_ENTRYPOINT: macro that generates extern "C" int wrapper with automatic try/catch bridging (developer writes normal function body)
  - aiter_safe_call: template that catches C++ exceptions, stores in TLS, returns -1

  Python side (core.py) probes each .so for aiter_ctypes_abi_version to auto-detect the new int-returning convention and raises RuntimeError on failure.

  asm_moe_2stage.cu is the first kernel converted as a reference implementation.
* update gemm
* add _VOID macro to define function without return value
* [OPUS] Add gfx950 smem transpose load (#2480)
  * OPUS: add gfx950 smem transpose load path

    Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.
  * tr_load example layout and unit test
* Fix error checking in aiter_hip_common.h (#2225)
* replace ck_tile api with opus api in some hip kernels (#2533)
  * replace ck_tile api with opus api in some hip kernels(topk_softmax, moe_fused_gate. sample)
  * update
  * rm ck_tile in topk_softmax_kernels_group.cu

  Co-authored-by: Xin Huang <Xin.Huang@amd.com>
* Fix some benchmark scripts so that they generate the output CSVs (#2555)
  * Fix some benchmark scripts so that they generate the output CSVs

    Affects the following Triton-based benchmarks:
    * bench_moe_gemm_a4w4.py
    * bench_moe_gemm_a8w4.py
    * bench_moe_gemm_a8w8.py
    * bench_moe_gemm_a8w8_blockscale.py
    * bench_moe_gemm_int8_smoothquant.py
  * Reformat some MoE GEMM benchmarks with Black
  * Change comments to proper type annotations
* fix conflict
* keep abort behavior if not wrap with aiter_safe_call
* abort when hip_call error
* fix format
* rm changes not related

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: YANG Kai <106952055+kaiyang-1@users.noreply.github.com>
Co-authored-by: Dragan Mladjenovic <dragan.mladjenovic@amd.com>
Co-authored-by: la <46212055+junhaha666@users.noreply.github.com>
Co-authored-by: Andrea Picciau <andrea.picciau@amd.com>


OPUS: add gfx950 smem transpose load path

2d4d627

Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.

kaiyang-1 requested review from a team and Copilot March 26, 2026 07:55

Copilot started reviewing on behalf of kaiyang-1 March 26, 2026 07:57

Copilot AI reviewed Mar 26, 2026

kaiyang-1 and others added 3 commits March 30, 2026 20:07

tr_load example layout and unit test

0a86605

Merge branch 'main' into kaiyang-1/opus_smem_tr_load

f800d28

Merge branch 'main' into kaiyang-1/opus_smem_tr_load

ca4ddbd

kaiyang-1 requested review from carlushuang and valarLip March 31, 2026 06:08

kaiyang-1 added 5 commits March 31, 2026 15:41

Merge branch 'main' into kaiyang-1/opus_smem_tr_load

3f12b95

Merge branch 'main' into kaiyang-1/opus_smem_tr_load

faaa630

Merge branch 'main' into kaiyang-1/opus_smem_tr_load

81a0f10

Merge branch 'main' into kaiyang-1/opus_smem_tr_load

854ab24

Merge branch 'main' into kaiyang-1/opus_smem_tr_load

e6802e3

carlushuang approved these changes Apr 2, 2026

carlushuang merged commit b879f21 into main Apr 2, 2026
24 checks passed

carlushuang deleted the kaiyang-1/opus_smem_tr_load branch April 2, 2026 05:37

carlushuang merged 9 commits into main from kaiyang-1/opus_smem_tr_load
