[#11094][feat] AutoDeploy transform to fuse silu+mul #12497
MrGeva merged 5 commits into NVIDIA:main from
Conversation
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
📝 Walkthrough
Adds a new FuseSiluMul transform and a silu_and_mul custom op.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Config as Configuration System
    participant Model as Model/Export
    participant Optimizer as InferenceOptimizer
    participant Transform as FuseSiluMul Transform
    participant Graph as FX Graph Module
    participant CustomOp as Custom silu_and_mul Op
    Config->>Optimizer: Load fuse_silu_mul config (enabled)
    Model->>Optimizer: Export model to FX graph
    Optimizer->>Transform: Apply FuseSiluMul transform
    Transform->>Graph: Scan for mul nodes
    Graph-->>Transform: Identify mul(silu(narrow), narrow) pattern
    Transform->>CustomOp: Replace with silu_and_mul custom op
    CustomOp-->>Graph: Fused operation inserted
    Transform->>Graph: eliminate_dead_code() & recompile()
    Graph-->>Optimizer: Optimized FX graph returned
    Optimizer-->>Model: Model ready for inference
```
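The "scan for mul nodes" step in the diagram can be sketched without torch: the dict-based nodes and the `find_silu_mul` helper below are hypothetical stand-ins for `torch.fx.Node` objects and for the transform's real matching code, not the actual implementation.

```python
# Hypothetical, torch-free sketch of the pattern scan from the diagram above.
# Each node is a dict {"op": str, "args": [child nodes]}, standing in for
# torch.fx.Node; the real FuseSiluMul transform walks an FX graph instead.
def find_silu_mul(nodes):
    """Return every mul node of the form mul(silu(narrow(x)), narrow(x))
    where both narrow ops read the same parent tensor x."""
    matches = []
    for node in nodes:
        if node["op"] != "mul":
            continue
        lhs, rhs = node["args"]
        if (
            lhs["op"] == "silu"
            and lhs["args"][0]["op"] == "narrow"
            and rhs["op"] == "narrow"
            and lhs["args"][0]["args"][0] is rhs["args"][0]  # same parent tensor
        ):
            matches.append(node)
    return matches


# Tiny example graph: x = linear(...); a = narrow(x); b = narrow(x);
# out = mul(silu(a), b) -- the exact shape the transform rewrites.
parent = {"op": "linear", "args": []}
a = {"op": "narrow", "args": [parent]}
b = {"op": "narrow", "args": [parent]}
s = {"op": "silu", "args": [a]}
out = {"op": "mul", "args": [s, b]}
print(len(find_silu_mul([parent, a, b, s, out])))  # prints 1
```

In the real transform, each match is then replaced by a call to the `silu_and_mul` custom op, followed by `eliminate_dead_code()` and `recompile()`.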
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ❌ 2 failed (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 2
🧹 Nitpick comments (3)
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_silu_mul.py (1)
120-140: Prefix unused variable with underscore. The `half_size` returned from `_try_fuse_mul` is unpacked but not used.

♻️ Proposed fix

- fused_parent, half_size = result
+ fused_parent, _half_size = result

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tensorrt_llm/_torch/auto_deploy/transform/library/fuse_silu_mul.py` around lines 120-140, the unpacked variable `half_size` from `_try_fuse_mul` is unused; change the unpacking in the loop where `result` is assigned (currently `fused_parent, half_size = result`) to use a prefixed underscore (e.g., `fused_parent, _half_size = result` or simply `fused_parent, _ = result`) so the no-unused-variable lint issue is resolved while keeping the intent of the calling function's logic.

tensorrt_llm/_torch/auto_deploy/custom_ops/linear/silu_mul.py (1)
45-50: Consider using unpacking for the output shape construction. The static analysis tool suggests a minor improvement for readability.

♻️ Optional refactor

  @silu_and_mul.register_fake
  def _(x: torch.Tensor) -> torch.Tensor:
      """Fake implementation for tracing."""
      half_size = x.shape[-1] // 2
-     output_shape = list(x.shape[:-1]) + [half_size]
+     output_shape = [*x.shape[:-1], half_size]
      return x.new_empty(output_shape, dtype=x.dtype)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tensorrt_llm/_torch/auto_deploy/custom_ops/linear/silu_mul.py` around lines 45-50, the fake implementation registered with `silu_and_mul` builds `output_shape` by concatenating `list(x.shape[:-1])` and `[half_size]`; change this to use Python unpacking for clarity and brevity (e.g., `[*x.shape[:-1], half_size]`) while keeping `half_size = x.shape[-1] // 2` and returning `x.new_empty(output_shape, dtype=x.dtype)`.

tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_silu_mul.py (1)
128-144: Unused variable and potential missing assertion. The variable `x` from `_export_model` is unpacked but never used in `test_fuse_silu_mul_disabled`. Either prefix it with an underscore or remove it. Also, consider adding a correctness check similar to the other tests for completeness.

♻️ Proposed fix

  @pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")
  def test_fuse_silu_mul_disabled():
      """Test that fusion is skipped when disabled."""
      model = TestModel()
-     gm, x = _export_model(model)
+     gm, _ = _export_model(model)
      transforms = {
          "fuse_gemms_mixed_children": {"stage": "post_load_fusion", "enabled": True},
          "fuse_silu_mul": {"stage": "post_load_fusion", "enabled": False},
      }
      io = InferenceOptimizer(transforms)
      gm = io.run(gm)
      # silu_and_mul should NOT be present when disabled
      assert _count_ops(gm, torch.ops.auto_deploy.silu_and_mul.default) == 0
      # Original ops should still be there
      assert _count_ops(gm, torch.ops.aten.silu.default) >= 1

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_silu_mul.py` around lines 128-144, the test unpacks `x` from `_export_model` but never uses it; rename `x` to `_x` (or remove it) to avoid the unused-variable warning, and add a correctness assertion: before calling `InferenceOptimizer.run` save `original_gm = gm`, then after `io.run(gm)` run both `original_gm` and `gm` on the example input and assert their outputs are equal (e.g., with `torch.allclose`); keep the existing op-count assertions using `_count_ops` and the transform names (`fuse_silu_mul`, `fuse_gemms_mixed_children`) intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_silu_mul.py`:
- Around lines 1-9: Add the NVIDIA Apache-2.0 license header to the top of this test module (test_fuse_silu_mul.py) so it matches project licensing requirements; insert the standard NVIDIA Apache-2.0 copyright/license block as the first lines of the file before any imports (the file currently imports pytest, torch, Dim, and modules like tensorrt_llm._torch.auto_deploy.custom_ops.linear.silu_mul), ensuring the exact header text used elsewhere in the repo is copied verbatim.
- Line 5: Replace the wildcard import from tensorrt_llm._torch.auto_deploy.custom_ops.linear.silu_mul with an explicit module import that preserves the namespace and still triggers registration (e.g., import tensorrt_llm._torch.auto_deploy.custom_ops.linear.silu_mul as silu_mul); update the test to reference the module if needed and remove the wildcard import to satisfy linting and the coding guideline that modules, not wildcard names, should be imported.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9c74f8e1-89a1-412c-8620-420fade508c6
📒 Files selected for processing (6)
examples/auto_deploy/model_registry/configs/llama3_1_8b.yaml
tensorrt_llm/_torch/auto_deploy/config/default.yaml
tensorrt_llm/_torch/auto_deploy/custom_ops/linear/__init__.py
tensorrt_llm/_torch/auto_deploy/custom_ops/linear/silu_mul.py
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_silu_mul.py
tests/unittest/auto_deploy/singlegpu/transformations/library/test_fuse_silu_mul.py
PR_Github #40107 [ run ] triggered by Bot. Commit:
PR_Github #40107 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #40333 [ run ] triggered by Bot. Commit:
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #40349 [ run ] triggered by Bot. Commit:
PR_Github #40333 [ run ] completed with state
PR_Github #40349 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #40705 [ run ] triggered by Bot. Commit:
PR_Github #40705 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #40766 [ run ] triggered by Bot. Commit:
PR_Github #40766 [ run ] completed with state
PR_Github #40949 [ run ] triggered by Bot. Commit:
PR_Github #40940 [ run ] completed with state
PR_Github #40949 [ run ] completed with state
Force-pushed 7e9299b to f16676f
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #41877 [ run ] triggered by Bot. Commit:
PR_Github #41877 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #41905 [ run ] triggered by Bot. Commit:
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>

InferenceOptimizer API changed to require factory and config args. Update tests to apply transforms directly via TransformRegistry, and replace wildcard import with explicit module import.
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>

…8 support
- Custom op: only register when FlashInfer is available (no fallback)
- Transform: use ADPatternMatcherPass for the narrow+silu+mul variant, direct graph walk for the quantized getitem+silu+mul variant
- Move fuse_silu_mul before fuse_fp8_linear in default.yaml and enable by default; remove the redundant override from the llama3_1_8b config
- Test: skip when FlashInfer is unavailable, use TransformRegistry directly
Benchmarked on Llama-3.1-8B-Instruct-FP8 (1k ISL, 2k OSL, 64 reqs): +2.7% token throughput (10669 vs 10383 tok/s)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>

- Add test for Variant 2 (getitem+silu+mul from contiguous-split GEMM fusion)
- Skip run_shape_prop when no torch.narrow ops exist (avoids ~1s overhead on FP8-only configs where only Variant 2 matches)
- Add even-size assertion in register_fake for the silu_and_mul custom op
- Add NVIDIA copyright header to the test file
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
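The even-size assertion added to `register_fake` can be illustrated with a torch-free sketch of the shape rule such a meta function computes; the function name below is hypothetical and stands in for the registered fake implementation.

```python
def silu_and_mul_fake_shape(shape):
    """Hypothetical, torch-free sketch of the shape rule behind the
    register_fake meta function: the input's last dimension must be even
    (it is split in half for the SwiGLU gate), and the output halves it."""
    *lead, last = shape
    assert last % 2 == 0, "silu_and_mul requires an even last dimension"
    return [*lead, last // 2]  # unpacking style suggested in the review


print(silu_and_mul_fake_shape([2, 16, 8192]))  # [2, 16, 4096]
```

The real fake implementation returns `x.new_empty(...)` with this shape so tracing can propagate shapes without running the kernel.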
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #41918 [ run ] triggered by Bot. Commit:
PR_Github #41918 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #41939 [ run ] triggered by Bot. Commit:
PR_Github #41939 [ run ] completed with state
The fusion replaces the separate narrow → silu and narrow → mul ops with a single silu_and_mul kernel (using FlashInfer's fused implementation when available).

Llama 8B FP8 1k/2k/64 H100 CW perf: +3.3% output throughput (7,626 → 7,880 tok/s) and -3.2% latency (8,702 → 8,420 ms avg) with the SiLU+Mul fusion enabled across all 32 Llama layers.
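For reference, the fused kernel's semantics can be written as a minimal, torch-free sketch over a flat list (this is not the FlashInfer implementation): the last dimension is split in half, the first half goes through SiLU, and the result gates the second half.

```python
import math


def silu(v: float) -> float:
    # SiLU (a.k.a. swish): v * sigmoid(v) = v / (1 + e^-v)
    return v / (1.0 + math.exp(-v))


def silu_and_mul(x: list) -> list:
    """Reference semantics of the fused op on a 1-D input:
    silu(x[:h]) * x[h:], where h is half the (even) length."""
    h = len(x) // 2
    assert len(x) == 2 * h, "silu_and_mul requires an even size"
    return [silu(a) * b for a, b in zip(x[:h], x[h:])]


# silu(0) = 0, silu(1) ≈ 0.7311, so the gated products are 0.0 and ≈1.4621:
print(silu_and_mul([0.0, 1.0, 3.0, 2.0]))
```

The real op applies the same rule elementwise over the last tensor dimension, which is why the output's last dimension is half the input's.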
Summary by CodeRabbit
New Features
Chores
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.