add limit parameter to silu_and_mul for input clamping by yzhou103 · Pull Request #3104 · ROCm/aiter

yzhou103 · 2026-05-09T08:57:44Z

Add compile-time if constexpr (HAS_LIMIT) specialization to avoid runtime branch overhead. Use v_med3_f32 intrinsic for efficient y-clamping. Tested on MI300X with all paths passing.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Add compile-time `if constexpr (HAS_LIMIT)` specialization to avoid runtime branch overhead. Use `v_med3_f32` intrinsic for efficient y-clamping. Tested on MI300X with all paths passing.

github-actions · 2026-05-09T08:58:04Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All of the above

Add labels via the sidebar or gh pr edit 3104 --add-label <label>

Copilot

Pull request overview

This PR extends the silu_and_mul activation/gating op with an optional limit parameter to clamp inputs, using a compile-time if constexpr (HAS_LIMIT) specialization in the GPU kernel to avoid per-element runtime branching.

Changes:

Add limit parameter to silu_and_mul across the C++ kernel, C++ header, pybind interface, and Python stub.
Implement a specialized kernel path (HAS_LIMIT=true) that clamps x (max) and y (to [-limit, limit]) using AMDGCN intrinsics.
Extend op_tests/test_activation.py to run/record an additional benchmark case with limit > 0.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
op_tests/test_activation.py	Updates reference implementation and test harness to pass/record the new `limit` parameter and adds a limited benchmark run.
csrc/kernels/activation_kernels.cu	Adds `HAS_LIMIT` specialization and host-side dispatch for limited vs non-limited `silu_and_mul`.
csrc/include/rocm_ops.hpp	Exposes `limit` to Python via pybind with a default value.
csrc/include/activation.h	Updates C++ API signature for `silu_and_mul` with a defaulted `limit`.
aiter/ops/activation.py	Updates the compiled-op Python stub signature to include `limit`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    ]
+]
+df_md = df.to_markdown(index=False)
+aiter.logger.info("silu_and_mul with limit=30.0 summary (markdown):\n%s", df_md)


    m.def("silu_and_mul",                               \
          &aiter::silu_and_mul,                         \
          "Activation function used in SwiGLU.",        \
          py::arg("out"),                               \
-          py::arg("input"));                            \
+          py::arg("input"),                             \
+          py::arg("limit") = 0.0f);                    \


-                  const aiter_tensor_t& input) // [..., 2 * d]
+                  const aiter_tensor_t& input, // [..., 2 * d]
+                  float limit)
 {


- Add AITER_CHECK for negative limit values - Document limit parameter behavior in pybind11 docstring - Fix log message typo (limit=30.0 -> limit=10.0)

Replace the chunk + double torch.clamp + F.silu * up sequence in Expert.forward with a single aiter.silu_and_mul(out, combined, limit) call. The new limit parameter folds the swiglu_limit clamp (gate <= limit, up in [-limit, limit]) into the kernel via the v_med3_f32 intrinsic, removing several launch-bound ops on the per-token critical path. Requires aiter PR ROCm/aiter#3104 (already merged), which adds the limit parameter and HAS_LIMIT compile-time specialization to silu_and_mul. Verified on DeepSeek-V4-Pro tp=8 --level 0: GSM8K nshot=5 (AITER_BF16_FP8_MOE_BOUND=0 + ATOM_MOE_GU_ITLV=1): run 1: 0.9515 / 0.9522 (flexible / strict) run 2: 0.9522 / 0.9530 Matches V4-Pro baseline (0.9522 / 0.9530), within 1 sigma stderr.

add limit parameter to silu_and_mul for input clamping

de14bd6

Add compile-time `if constexpr (HAS_LIMIT)` specialization to avoid runtime branch overhead. Use `v_med3_f32` intrinsic for efficient y-clamping. Tested on MI300X with all paths passing.

yzhou103 requested review from a team and Copilot May 9, 2026 08:57

Copilot started reviewing on behalf of yzhou103 May 9, 2026 08:58 View session

Copilot AI reviewed May 9, 2026

View reviewed changes

yzhou103 added 2 commits May 9, 2026 17:08

Merge branch 'main' into feature/silu_and_mul_limit

74aab8f

address review: add limit validation and docstring

101e9d6

- Add AITER_CHECK for negative limit values - Document limit parameter behavior in pybind11 docstring - Fix log message typo (limit=30.0 -> limit=10.0)

valarLip approved these changes May 9, 2026

View reviewed changes

valarLip merged commit b9a477c into main May 9, 2026
29 checks passed

valarLip deleted the feature/silu_and_mul_limit branch May 9, 2026 14:32

valarLip mentioned this pull request May 9, 2026

perf(deepseek_v4): use aiter silu_and_mul with in-kernel clamp ROCm/ATOM#731

Merged

3 tasks

bingxche mentioned this pull request May 14, 2026

[AITER-Upgrade] PR readiness sgl-project/sglang#21302

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add limit parameter to silu_and_mul for input clamping#3104

add limit parameter to silu_and_mul for input clamping#3104
valarLip merged 3 commits into
mainfrom
feature/silu_and_mul_limit

yzhou103 commented May 9, 2026

Uh oh!

github-actions Bot commented May 9, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yzhou103 commented May 9, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

github-actions Bot commented May 9, 2026

🏷️ CI Guide

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants