Skip to content

add limit parameter to silu_and_mul for input clamping#3104

Merged
valarLip merged 3 commits into
mainfrom
feature/silu_and_mul_limit
May 9, 2026
Merged

add limit parameter to silu_and_mul for input clamping#3104
valarLip merged 3 commits into
mainfrom
feature/silu_and_mul_limit

Conversation

@yzhou103
Copy link
Copy Markdown
Contributor

@yzhou103 yzhou103 commented May 9, 2026

Add compile-time if constexpr (HAS_LIMIT) specialization to avoid runtime branch overhead. Use v_med3_f32 intrinsic for efficient y-clamping. Tested on MI300X with all paths passing.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Add compile-time `if constexpr (HAS_LIMIT)` specialization to
avoid runtime branch overhead. Use `v_med3_f32` intrinsic for
efficient y-clamping. Tested on MI300X with all paths passing.
@yzhou103 yzhou103 requested review from a team and Copilot May 9, 2026 08:57
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label Tests
ci:triton-300x Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
ci:atom ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
ci:vllm vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
ci:all All of the above

Add labels via the sidebar or gh pr edit 3104 --add-label <label>

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the silu_and_mul activation/gating op with an optional limit parameter to clamp inputs, using a compile-time if constexpr (HAS_LIMIT) specialization in the GPU kernel to avoid per-element runtime branching.

Changes:

  • Add limit parameter to silu_and_mul across the C++ kernel, C++ header, pybind interface, and Python stub.
  • Implement a specialized kernel path (HAS_LIMIT=true) that clamps x (max) and y (to [-limit, limit]) using AMDGCN intrinsics.
  • Extend op_tests/test_activation.py to run/record an additional benchmark case with limit > 0.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
op_tests/test_activation.py Updates reference implementation and test harness to pass/record the new limit parameter and adds a limited benchmark run.
csrc/kernels/activation_kernels.cu Adds HAS_LIMIT specialization and host-side dispatch for limited vs non-limited silu_and_mul.
csrc/include/rocm_ops.hpp Exposes limit to Python via pybind with a default value.
csrc/include/activation.h Updates C++ API signature for silu_and_mul with a defaulted limit.
aiter/ops/activation.py Updates the compiled-op Python stub signature to include limit.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread op_tests/test_activation.py Outdated
]
]
df_md = df.to_markdown(index=False)
aiter.logger.info("silu_and_mul with limit=30.0 summary (markdown):\n%s", df_md)
Comment thread csrc/include/rocm_ops.hpp
Comment on lines 68 to +73
m.def("silu_and_mul", \
&aiter::silu_and_mul, \
"Activation function used in SwiGLU.", \
py::arg("out"), \
py::arg("input")); \
py::arg("input"), \
py::arg("limit") = 0.0f); \
const aiter_tensor_t& input) // [..., 2 * d]
const aiter_tensor_t& input, // [..., 2 * d]
float limit)
{
yzhou103 added 2 commits May 9, 2026 17:08
- Add AITER_CHECK for negative limit values
- Document limit parameter behavior in pybind11 docstring
- Fix log message typo (limit=30.0 -> limit=10.0)
@valarLip valarLip merged commit b9a477c into main May 9, 2026
29 checks passed
@valarLip valarLip deleted the feature/silu_and_mul_limit branch May 9, 2026 14:32
valarLip added a commit to ROCm/ATOM that referenced this pull request May 9, 2026
Replace the chunk + double torch.clamp + F.silu * up sequence in
Expert.forward with a single aiter.silu_and_mul(out, combined, limit)
call. The new limit parameter folds the swiglu_limit clamp (gate <=
limit, up in [-limit, limit]) into the kernel via the v_med3_f32
intrinsic, removing several launch-bound ops on the per-token critical
path.

Requires aiter PR ROCm/aiter#3104 (already merged), which adds the
limit parameter and HAS_LIMIT compile-time specialization to
silu_and_mul.

Verified on DeepSeek-V4-Pro tp=8 --level 0:
  GSM8K nshot=5 (AITER_BF16_FP8_MOE_BOUND=0 + ATOM_MOE_GU_ITLV=1):
    run 1: 0.9515 / 0.9522 (flexible / strict)
    run 2: 0.9522 / 0.9530
  Matches V4-Pro baseline (0.9522 / 0.9530), within 1 sigma stderr.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants