[OPUS] Add gfx950 smem transpose load#2480

Merged
carlushuang merged 9 commits into main from kaiyang-1/opus_smem_tr_load on Apr 2, 2026

Conversation

@kaiyang-1
Contributor

@kaiyang-1 kaiyang-1 commented Mar 26, 2026

Summary

This change adds shared memory transpose load support in opus.hpp for AMDGPU gfx950, aligned with ds_read_tr* builtins from Clang.

Changes

  • Add smem::_tr_load dispatch by scalar_type and template vec (vec*vector_size): tr6 / tr4 vs tr8 for 2×i32, tr16 for i16/u16 (u16 when __clang_major__ >= 20), fp16, bf16.
  • Public API: tr_load, tr_load(const Layout&), tr_load_if, plus free-function tr_load / tr_load_if for is_smem_v.
  • Non-gfx950 device builds get a clear static_assert when tr_load is instantiated; host falls back to ordinary load.
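The architecture gating in the last bullet can be sketched as follows. This is a minimal host-compilable illustration, not the code in opus.hpp: `tr_load_sketch` and `always_false` are hypothetical names, the signature is an assumption, and the gfx950 device path that would call a `ds_read_tr*` builtin is deliberately left out. `__HIP_DEVICE_COMPILE__` is the macro HIP defines during device compilation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Dependent false: the static_assert below fires only if the template is instantiated.
template <typename T>
struct always_false : std::false_type {};

// Hypothetical stand-in for smem::tr_load; the real signature is not shown in this PR text.
template <typename T, std::size_t N>
std::array<T, N> tr_load_sketch(const T* smem, std::size_t offset) {
#if defined(__HIP_DEVICE_COMPILE__) && !defined(__gfx950__)
    // Device build for another architecture: reject at compile time, on instantiation only.
    static_assert(always_false<T>::value, "tr_load requires gfx950");
    return {};
#else
    // Host build: ordinary element-wise load.
    // On a gfx950 device build the real code would issue a ds_read_tr* builtin instead;
    // that path is not modeled in this sketch.
    assert(smem != nullptr);
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = smem[offset + i];
    }
    return out;
#endif
}
```

Using a dependent condition in the `static_assert` keeps unrelated translation units compiling on other architectures as long as they never instantiate the transpose load.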

Testing

  • Added a new test kernel and host wrapper in test_tr_load_f16.cu to exercise and validate the tr_load functionality for 16x32 fp16 tiles, storing the result in MFMA B layout (32x16).
  • Registered the new test in the Python test runner (test_opus_device.py), including device-side invocation, architecture gating (only runs on gfx950), and result verification.
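For result verification, a host-side reference for the logical 16x32 to 32x16 transpose could look like the sketch below. It assumes row-major storage on both sides and treats fp16 values as opaque 16-bit patterns; `transpose_tile` and `half_bits` are hypothetical names, and the per-lane MFMA B register layout used by the actual test is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// fp16 values are carried as raw 16-bit patterns; a transpose only moves them.
using half_bits = std::uint16_t;

// Transpose a row-major (rows x cols) tile into a row-major (cols x rows) tile.
std::vector<half_bits> transpose_tile(const std::vector<half_bits>& in,
                                      std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    std::vector<half_bits> out(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // Element (r, c) of the input becomes element (c, r) of the output.
            out[c * rows + r] = in[r * cols + c];
        }
    }
    return out;
}
```

A test could fill the 16x32 input with distinct bit patterns, run the device kernel, and compare its output against `transpose_tile(in, 16, 32)` element by element.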

Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.
@kaiyang-1 kaiyang-1 requested review from a team and Copilot March 26, 2026 07:55
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-355 | Run Triton tests on MI355 in addition to MI325 |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or gh pr edit 2480 --add-label <label>

Contributor

Copilot AI left a comment

Pull request overview

Adds gfx950-specific shared-memory transpose load support to opus::smem, exposing new tr_load / tr_load_if APIs that map to Clang ds_read_tr* builtins on device, while providing host fallback and clear non-gfx950 device diagnostics.

Changes:

  • Implement smem::_tr_load using __builtin_amdgcn_ds_read_tr* intrinsics for supported scalar/vector combinations on __gfx950__.
  • Add smem::tr_load(...) overloads for scalar offsets and for Layout, plus smem::tr_load_if(...) for predicated bulk loads.
  • Add free-function wrappers opus::tr_load / opus::tr_load_if for is_smem_v.
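The idea of a predicated bulk load, as in `tr_load_if`, can be illustrated on the host with the sketch below. The predicate form (a callable taking the element index) and the fill value for masked-off elements are assumptions made for illustration; the PR text does not state how `tr_load_if` expresses either, and `load_if_sketch` is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Load N elements from src, keeping element i only when pred(i) is true.
// Elements whose predicate is false are set to `fill` (an assumed convention).
template <typename T, std::size_t N, typename Pred>
std::array<T, N> load_if_sketch(const T* src, Pred pred, T fill = T{}) {
    assert(src != nullptr);
    std::array<T, N> out;
    out.fill(fill);
    for (std::size_t i = 0; i < N; ++i) {
        if (pred(i)) {
            out[i] = src[i];  // src[i] is read only when the predicate allows it
        }
    }
    return out;
}
```

Skipping the read entirely when the predicate is false is what makes this pattern safe at tile boundaries, where some indices would otherwise fall outside the valid buffer.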

Collaborator

@carlushuang carlushuang left a comment

LGTM

@carlushuang carlushuang merged commit b879f21 into main Apr 2, 2026
24 checks passed
@carlushuang carlushuang deleted the kaiyang-1/opus_smem_tr_load branch April 2, 2026 05:37
yzhou103 pushed a commit that referenced this pull request Apr 2, 2026
* OPUS: add gfx950 smem transpose load path

Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.

* tr_load example layout and unit test
yzhou103 added a commit that referenced this pull request Apr 8, 2026
#2498)

* Add ctypes C-ABI error bridging to prevent worker crashes during kernel tuning

AITER_CHECK and HIP_CALL now throw std::runtime_error instead of calling
std::terminate()/exit(0), so exceptions can be caught at the C-ABI boundary.

New header aiter_ctypes_error.h provides:
- AITER_CTYPES_ERROR_DEF: per-TU thread-local error storage + ABI version probe
- AITER_CTYPES_DEFINE_ENTRYPOINT: macro that generates extern "C" int wrapper
  with automatic try/catch bridging (developer writes normal function body)
- aiter_safe_call: template that catches C++ exceptions, stores in TLS, returns -1

Python side (core.py) probes each .so for aiter_ctypes_abi_version to auto-detect
the new int-returning convention and raises RuntimeError on failure.

asm_moe_2stage.cu is the first kernel converted as a reference implementation.

* update gemm

* add _VOID macro to define function without return value

* [OPUS] Add gfx950 smem transpose load (#2480)

* OPUS: add gfx950 smem transpose load path

Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.

* tr_load example layout and unit test

* Fix error checking in aiter_hip_common.h (#2225)

* replace ck_tile api with opus api in some hip kernels (#2533)

* replace ck_tile api with opus api in some hip kernels(topk_softmax, moe_fused_gate. sample)

* update

* rm ck_tile in topk_softmax_kernels_group.cu

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* Fix some benchmark scripts so that they generate the output CSVs (#2555)

* Fix some benchmark scripts so that they generate the output CSVs

Affects the following Triton-based benchmarks:
* bench_moe_gemm_a4w4.py
* bench_moe_gemm_a8w4.py
* bench_moe_gemm_a8w8.py
* bench_moe_gemm_a8w8_blockscale.py
* bench_moe_gemm_int8_smoothquant.py

* Reformat some MoE GEMM benchmarks with Black

* Change comments to proper type annotations

* fix conflict

* keep abort behavior if not wrap with aiter_safe_call

* abort when hip_call error

* fix format

* rm changes not related

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: YANG Kai <106952055+kaiyang-1@users.noreply.github.com>
Co-authored-by: Dragan Mladjenovic <dragan.mladjenovic@amd.com>
Co-authored-by: la <46212055+junhaha666@users.noreply.github.com>
Co-authored-by: Andrea Picciau <andrea.picciau@amd.com>
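
The C-ABI error bridging described in the commit message above (catch C++ exceptions, store the message in thread-local storage, return -1) can be sketched as follows. Only the general pattern is taken from the commit message; `sketch::safe_call`, `sketch::last_error`, `sketch_entry`, and `sketch_last_error` are hypothetical names, and the real `aiter_safe_call` and its macros may differ in signature and behavior.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

namespace sketch {

// Per-thread storage for the most recent error message.
inline std::string& last_error() {
    thread_local std::string msg;
    return msg;
}

// Run `body`; return 0 on success, or -1 after recording the exception message.
template <typename F>
int safe_call(F&& body) {
    try {
        body();
        last_error().clear();
        return 0;
    } catch (const std::exception& e) {
        last_error() = e.what();
    } catch (...) {
        last_error() = "unknown C++ exception";
    }
    return -1;
}

}  // namespace sketch

// Example entry point: no exception crosses the extern "C" boundary.
extern "C" int sketch_entry(int value) {
    return sketch::safe_call([&] {
        if (value < 0) {
            throw std::runtime_error("value must be non-negative");
        }
    });
}

// Lets a caller (for example Python via ctypes) fetch the message after a -1 return.
extern "C" const char* sketch_last_error() {
    return sketch::last_error().c_str();
}
```

A caller would check the integer return value and, on -1, read `sketch_last_error()` from the same thread before raising its own error.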
yzhou103 pushed a commit that referenced this pull request Apr 8, 2026
* OPUS: add gfx950 smem transpose load path

Add smem tr_load/tr_load_if APIs and wire _tr_load to gfx950 ds_read_tr* builtins with scalar/vec dispatch, including clang>=20 u16 support and simplified diagnostics.

* tr_load example layout and unit test