add limit parameter to silu_and_mul for input clamping #3104
Merged
Add compile-time `if constexpr (HAS_LIMIT)` specialization to avoid runtime branch overhead. Use `v_med3_f32` intrinsic for efficient y-clamping. Tested on MI300X with all paths passing.
Pull request overview
This PR extends the `silu_and_mul` activation/gating op with an optional `limit` parameter to clamp inputs, using a compile-time `if constexpr (HAS_LIMIT)` specialization in the GPU kernel to avoid per-element runtime branching.
Changes:
- Add a `limit` parameter to `silu_and_mul` across the C++ kernel, C++ header, pybind interface, and Python stub.
- Implement a specialized kernel path (`HAS_LIMIT=true`) that clamps `x` (upper bound only) and `y` (to `[-limit, limit]`) using AMDGCN intrinsics.
- Extend `op_tests/test_activation.py` to run/record an additional benchmark case with `limit > 0`.
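The specialized path described in the list above can be sketched on the host in plain C++17. This is only an illustration under stated assumptions: `med3` stands in for the `v_med3_f32` instruction, and `silu_and_mul_elem` is a hypothetical per-element helper, not the function in `csrc/kernels/activation_kernels.cu`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Median of three values. Host-side stand-in (assumption) for what the
// v_med3_f32 instruction computes on the GPU.
inline float med3(float a, float b, float c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Hypothetical per-element body. HAS_LIMIT is a template parameter, so the
// clamping statements are discarded at compile time when it is false and
// the unlimited path carries no runtime branch.
template <bool HAS_LIMIT>
inline float silu_and_mul_elem(float x, float y, float limit) {
    if constexpr (HAS_LIMIT) {
        x = std::min(x, limit);      // gate: upper bound only
        y = med3(y, -limit, limit);  // up: clamp to [-limit, limit]
    }
    return silu(x) * y;
}
```

For `limit > 0`, the median of `y`, `-limit`, and `limit` is exactly `y` clamped to `[-limit, limit]`, which is why a single median-of-three operation can replace a min/max pair.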
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| op_tests/test_activation.py | Updates reference implementation and test harness to pass/record the new limit parameter and adds a limited benchmark run. |
| csrc/kernels/activation_kernels.cu | Adds HAS_LIMIT specialization and host-side dispatch for limited vs non-limited silu_and_mul. |
| csrc/include/rocm_ops.hpp | Exposes limit to Python via pybind with a default value. |
| csrc/include/activation.h | Updates C++ API signature for silu_and_mul with a defaulted limit. |
| aiter/ops/activation.py | Updates the compiled-op Python stub signature to include limit. |
        ]
    ]
    df_md = df.to_markdown(index=False)
    aiter.logger.info("silu_and_mul with limit=30.0 summary (markdown):\n%s", df_md)
Comment on lines 68 to +73
      m.def("silu_and_mul", \
            &aiter::silu_and_mul, \
            "Activation function used in SwiGLU.", \
            py::arg("out"), \
    -       py::arg("input")); \
    +       py::arg("input"), \
    +       py::arg("limit") = 0.0f); \
    -     const aiter_tensor_t& input) // [..., 2 * d]
    +     const aiter_tensor_t& input, // [..., 2 * d]
    +     float limit)
      {
- Add AITER_CHECK for negative limit values
- Document limit parameter behavior in pybind11 docstring
- Fix log message typo (limit=30.0 -> limit=10.0)
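The negative-limit check and the host-side choice between the limited and non-limited kernel can be sketched as follows. Everything here is hypothetical: `Path`, `launch_stub`, and `dispatch_silu_and_mul` are illustration-only names, a standard exception stands in for `AITER_CHECK`, and treating `limit == 0` (the pybind default) as "no clamping" is an assumption based on the default value and the `limit > 0` test case.

```cpp
#include <cassert>
#include <stdexcept>

// Which specialization was selected; used here only to make the choice observable.
enum class Path { NoLimit, WithLimit };

// Hypothetical stand-in for launching the kernel instantiated with HAS_LIMIT.
template <bool HAS_LIMIT>
Path launch_stub() {
    return HAS_LIMIT ? Path::WithLimit : Path::NoLimit;
}

// Host-side dispatch sketch: validate once, then pick the instantiation once
// per call, so no per-element branch on `limit` is needed inside the kernel.
inline Path dispatch_silu_and_mul(float limit) {
    if (limit < 0.0f) {
        // Stand-in (assumption) for the AITER_CHECK on negative limits.
        throw std::invalid_argument("limit must be non-negative");
    }
    // Assumption: limit == 0 means clamping is disabled.
    return limit > 0.0f ? launch_stub<true>() : launch_stub<false>();
}
```

With this shape, `dispatch_silu_and_mul(0.0f)` takes the unlimited path, `dispatch_silu_and_mul(10.0f)` takes the limited one, and a negative value is rejected before any launch.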
valarLip approved these changes on May 9, 2026
valarLip added a commit to ROCm/ATOM that referenced this pull request on May 9, 2026
Replace the chunk + double torch.clamp + F.silu * up sequence in Expert.forward with a single aiter.silu_and_mul(out, combined, limit) call. The new limit parameter folds the swiglu_limit clamp (gate <= limit, up in [-limit, limit]) into the kernel via the v_med3_f32 intrinsic, removing several launch-bound ops on the per-token critical path.

Requires aiter PR ROCm/aiter#3104 (already merged), which adds the limit parameter and HAS_LIMIT compile-time specialization to silu_and_mul.

Verified on DeepSeek-V4-Pro tp=8 --level 0:

GSM8K nshot=5 (AITER_BF16_FP8_MOE_BOUND=0 + ATOM_MOE_GU_ITLV=1):
- run 1: 0.9515 / 0.9522 (flexible / strict)
- run 2: 0.9522 / 0.9530

Matches V4-Pro baseline (0.9522 / 0.9530), within 1 sigma stderr.
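A host-side C++17 sketch of why the fused call can replace the chunk + clamp + SiLU sequence described in the commit message above. It assumes the first half of each `[..., 2 * d]` row is the gate and the second half is the up projection (the interleaved layout implied by ATOM_MOE_GU_ITLV is not modeled), and `swiglu_unfused` / `swiglu_fused` are hypothetical names, not ATOM or aiter functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU(x) = x / (1 + exp(-x)).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Unfused reference: split the row, clamp each half in its own pass,
// then compute SiLU(gate) * up. Mirrors the chunk + clamp + silu sequence.
std::vector<float> swiglu_unfused(const std::vector<float>& combined, float limit) {
    const std::size_t d = combined.size() / 2;
    const auto mid = combined.begin() + static_cast<std::ptrdiff_t>(d);
    std::vector<float> gate(combined.begin(), mid);
    std::vector<float> up(mid, combined.end());
    for (float& g : gate) g = std::min(g, limit);          // gate <= limit
    for (float& u : up) u = std::clamp(u, -limit, limit);  // up in [-limit, limit]
    std::vector<float> out(d);
    for (std::size_t i = 0; i < d; ++i) out[i] = silu(gate[i]) * up[i];
    return out;
}

// Fused sketch: one pass over the row, clamping as each element is read,
// with no intermediate buffers.
std::vector<float> swiglu_fused(const std::vector<float>& combined, float limit) {
    const std::size_t d = combined.size() / 2;
    std::vector<float> out(d);
    for (std::size_t i = 0; i < d; ++i) {
        const float g = std::min(combined[i], limit);
        const float u = std::clamp(combined[d + i], -limit, limit);
        out[i] = silu(g) * u;
    }
    return out;
}
```

Both functions perform the same arithmetic per element; the fused form only removes the intermediate buffers and separate passes, which is the CPU analogue of collapsing several small GPU launches into one kernel.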