[OPUS] Add gfx950 smem transpose load #2480
Merged
carlushuang merged 9 commits into main on Apr 2, 2026
Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.
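The scalar/vec dispatch mentioned above can be pictured as a compile-time selection. The following is only a rough sketch, not the code in opus.hpp: `TrKind`, `select_tr_kind`, `half_bits`, `bf16_bits`, and `kHasU16` are hypothetical names, and the vector widths (2, 3, 4) are an assumption taken from the suffixes of the Clang `ds_read_tr*` builtin names. It compiles as ordinary host C++17 so the selection logic can be checked without a GPU.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical tags for the result shape a transpose load would produce.
// Illustrative only; these are not identifiers from opus.hpp.
enum class TrKind { b64_v2i32, b96_v3i32, b64_v4x16, unsupported };

// Stand-in storage types for fp16 / bf16 so the sketch builds on any host compiler.
struct half_bits { std::uint16_t v; };
struct bf16_bits { std::uint16_t v; };

// The PR enables u16 only for clang >= 20; other compilers see it as unsupported here.
#if defined(__clang_major__) && __clang_major__ >= 20
inline constexpr bool kHasU16 = true;
#else
inline constexpr bool kHasU16 = false;
#endif

// Pick a result shape from the scalar type and the vector width.
template <typename Scalar, int Vec>
constexpr TrKind select_tr_kind() {
    if constexpr (std::is_same_v<Scalar, std::int32_t>) {
        if (Vec == 2) return TrKind::b64_v2i32;  // assumed: tr4 / tr8 paths, 2 x i32
        if (Vec == 3) return TrKind::b96_v3i32;  // assumed: tr6 path, 3 x i32
        return TrKind::unsupported;
    } else if constexpr (std::is_same_v<Scalar, std::uint16_t>) {
        return (kHasU16 && Vec == 4) ? TrKind::b64_v4x16 : TrKind::unsupported;
    } else if constexpr (std::is_same_v<Scalar, std::int16_t> ||
                         std::is_same_v<Scalar, half_bits> ||
                         std::is_same_v<Scalar, bf16_bits>) {
        return Vec == 4 ? TrKind::b64_v4x16 : TrKind::unsupported;  // assumed: tr16 path
    } else {
        return TrKind::unsupported;
    }
}
```

In a real implementation each tag would presumably correspond to one builtin call; the sketch stops at the selection step because the exact wiring is not shown in this conversation.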
Pull request overview
Adds gfx950-specific shared-memory transpose load support to opus::smem, exposing new tr_load / tr_load_if APIs that map to Clang ds_read_tr* builtins on device, while providing host fallback and clear non-gfx950 device diagnostics.
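The three behaviours described above (gfx950 device path, diagnostic on other device targets, host fallback) could be arranged with preprocessor gating roughly as follows. This is a minimal sketch under assumptions: `tr_load_sketch` is a hypothetical helper, the gfx950 branch uses a plain load as a placeholder where a real implementation would call a `ds_read_tr*` builtin, and the diagnostic text is invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper; not the opus.hpp implementation.
template <typename T>
inline T tr_load_sketch(const T* smem, std::size_t offset) {
#if defined(__gfx950__)
    // gfx950 device pass: a real implementation would issue a
    // __builtin_amdgcn_ds_read_tr* builtin here. Placeholder: plain load.
    return smem[offset];
#elif defined(__HIP_DEVICE_COMPILE__)
    // Device pass for any other architecture: the condition depends on T,
    // so the assertion fires only if this template is actually instantiated.
    static_assert(sizeof(T) == 0, "tr_load requires gfx950 (illustrative message)");
    return T{};
#else
    // Host pass: fall back to an ordinary load.
    return smem[offset];
#endif
}
```

Making the `static_assert` condition depend on `T` keeps headers that merely declare the function usable on every target, while any kernel that really calls it on an unsupported device fails at compile time.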
Changes:
- Implement smem::_tr_load using __builtin_amdgcn_ds_read_tr* intrinsics for supported scalar/vector combinations on __gfx950__.
- Add smem::tr_load(...) overloads for scalar offsets and for Layout, plus smem::tr_load_if(...) for predicated bulk loads.
- Add free-function wrappers opus::tr_load / opus::tr_load_if for is_smem_v.
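To make "predicated bulk load" concrete, here is a host-side sketch of the general idea: each element is read only when a predicate allows it. The name `load_if_sketch`, its signature, and the use of a fill value for skipped elements are all assumptions made for illustration; the conversation does not show how smem::tr_load_if treats skipped elements.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host emulation of a predicated bulk load.
// For each index i, read base[offsets[i]] only if pred(i) is true;
// otherwise store `fill` (an assumed policy, not taken from opus.hpp).
template <typename T, std::size_t N, typename Pred>
std::array<T, N> load_if_sketch(const T* base,
                                const std::array<std::size_t, N>& offsets,
                                Pred pred,
                                T fill = T{}) {
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = pred(i) ? base[offsets[i]] : fill;
    }
    return out;
}
```

A typical predicate would be a bounds check, so that lanes whose offsets fall outside the valid tile never touch memory.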
yzhou103 pushed a commit that referenced this pull request on Apr 2, 2026:
* OPUS: add gfx950 smem transpose load path
  Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.
* tr_load example layout and unit test
yzhou103 added a commit that referenced this pull request on Apr 8, 2026:
#2498)
* Add ctypes C-ABI error bridging to prevent worker crashes during kernel tuning
  AITER_CHECK and HIP_CALL now throw std::runtime_error instead of calling std::terminate()/exit(0), so exceptions can be caught at the C-ABI boundary.
  New header aiter_ctypes_error.h provides:
  - AITER_CTYPES_ERROR_DEF: per-TU thread-local error storage + ABI version probe
  - AITER_CTYPES_DEFINE_ENTRYPOINT: macro that generates extern "C" int wrapper with automatic try/catch bridging (developer writes normal function body)
  - aiter_safe_call: template that catches C++ exceptions, stores in TLS, returns -1
  Python side (core.py) probes each .so for aiter_ctypes_abi_version to auto-detect the new int-returning convention and raises RuntimeError on failure.
  asm_moe_2stage.cu is the first kernel converted as a reference implementation.
* update gemm
* add _VOID marco to define function without return value
* [OPUS] Add gfx950 smem transpose load (#2480)
  * OPUS: add gfx950 smem transpose load path
    Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.
  * tr_load example layout and unit test
* Fix error checking in aiter_hip_common.h (#2225)
* replace ck_tile api with opus api in some hip kernels (#2533)
  * replace ck_tile api with opus api in some hip kernels(topk_softmax, moe_fused_gate. sample)
  * update
  * rm ck_tile in topk_softmax_kernels_group.cu
  ---------
  Co-authored-by: Xin Huang <Xin.Huang@amd.com>
* Fix some benchmark scripts so that they generate the output CSVs (#2555)
  * Fix some benchmark scripts so that they generate the output CSVs
    Affects the following Triton-based benchmarks:
    * bench_moe_gemm_a4w4.py
    * bench_moe_gemm_a8w4.py
    * bench_moe_gemm_a8w8.py
    * bench_moe_gemm_a8w8_blockscale.py
    * bench_moe_gemm_int8_smoothquant.py
  * Reformat some MoE GEMM benchmarks with Black
  * Change comments to proper type annotations
* fix conflict
* keep abort behavior if not wrap with aiter_safe_call
* abort when hip_call error
* fix format
* rm changes not related
---------
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: YANG Kai <106952055+kaiyang-1@users.noreply.github.com>
Co-authored-by: Dragan Mladjenovic <dragan.mladjenovic@amd.com>
Co-authored-by: la <46212055+junhaha666@users.noreply.github.com>
Co-authored-by: Andrea Picciau <andrea.picciau@amd.com>
yzhou103 pushed a commit that referenced this pull request on Apr 8, 2026:
* OPUS: add gfx950 smem transpose load path
  Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.
* tr_load example layout and unit test
Summary
This change adds shared memory transpose load support in opus.hpp for AMDGPU gfx950, aligned with ds_read_tr* builtins from Clang.
Changes
- smem::_tr_load dispatch by scalar_type and template vec (vec * vector_size): tr6 / tr4 vs tr8 for 2×i32, tr16 for i16/u16 (u16 when __clang_major__ >= 20), fp16, bf16.
- tr_load, tr_load(const Layout&), tr_load_if, plus free-function tr_load / tr_load_if for is_smem_v.
- static_assert on non-gfx950 device builds when tr_load is instantiated; host falls back to ordinary load.
Testing
- test_tr_load_f16.cu exercises and validates the tr_load functionality for 16x32 fp16 tiles, storing the result in MFMA B layout (32x16).
- Test wiring (test_opus_device.py), including device-side invocation, architecture gating (only runs on gfx950), and result verification.
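For result verification of a 16x32 tile stored as 32x16, a host-side reference can be as simple as a plain transpose. The sketch below is an assumption about what such a check might compare against, not the code in test_tr_load_f16.cu: `transpose_ref` and `make_tile` are hypothetical helpers, fp16 values are carried as raw 16-bit patterns, and the output is an ordinary row-major 32x16 array rather than any per-lane MFMA register mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kRows = 16;  // source tile rows
constexpr std::size_t kCols = 32;  // source tile columns

// Fill a row-major 16x32 tile with distinct bit patterns (fp16 stored as uint16_t).
std::vector<std::uint16_t> make_tile() {
    std::vector<std::uint16_t> tile(kRows * kCols);
    for (std::size_t i = 0; i < tile.size(); ++i) {
        tile[i] = static_cast<std::uint16_t>(i);
    }
    return tile;
}

// Reference transpose: dst is row-major 32x16, with dst[c][r] = src[r][c].
std::vector<std::uint16_t> transpose_ref(const std::vector<std::uint16_t>& src) {
    std::vector<std::uint16_t> dst(kRows * kCols);
    for (std::size_t r = 0; r < kRows; ++r) {
        for (std::size_t c = 0; c < kCols; ++c) {
            dst[c * kRows + r] = src[r * kCols + c];
        }
    }
    return dst;
}
```

Comparing raw bit patterns avoids any floating-point tolerance question, since a transpose load moves data without changing values.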