[None][perf] Fuse add + norm + fp8 quant pattern #12674

Merged
amukkara merged 2 commits into NVIDIA:main from amukkara:rms-quant-fusion on May 4, 2026

Conversation

@amukkara
Collaborator

@amukkara amukkara commented Apr 2, 2026

Summary by CodeRabbit

Release Notes

  • New Features
    • Added a quantization-aware fusion optimization for residual-add and RMSNorm operations, improving inference performance through fused kernel execution. Supports float16 and bfloat16.

Description

Add an additional PatternMatcherPass() in the compilation backend to fuse the (residual_add, rms_norm, fp8 static quantization) pattern. The matched subgraph is replaced with a fused kernel from FlashInfer.
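
A minimal sketch of how such a pass can be registered, assuming torch._inductor's pattern-matcher API; the tensor math below stands in for the traced pattern, and the fused-op signature is illustrative rather than the exact code in residual_add_norm.py:

```python
# Hedged sketch only: assumes torch._inductor.pattern_matcher's
# register_replacement API; the fused-op signature is an assumption.
import torch
from torch._inductor.pattern_matcher import (PatternMatcherPass, fwd_only,
                                             register_replacement)

add_norm_quant_pass = PatternMatcherPass()

def pattern(x, residual, weight, scale):
    # Unfused chain: residual add -> RMSNorm -> FP8 static quantization.
    hidden = x + residual
    rms = torch.rsqrt(hidden.pow(2).mean(-1, keepdim=True) + 1e-6)
    normed = hidden * rms * weight
    return (normed / scale).to(torch.float8_e4m3fn), hidden

def replacement(x, residual, weight, scale):
    # Single FlashInfer-backed kernel replaces the whole chain; it writes
    # the FP8 result into `out` and the post-add hidden state into `residual`.
    out = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    torch.ops.trtllm.flashinfer_fused_add_rmsnorm_quant(
        out, x, residual, weight, scale, 1e-6)
    return out, residual

example_inputs = [
    torch.randn(2, 64, dtype=torch.float16, device="cuda"),  # x
    torch.randn(2, 64, dtype=torch.float16, device="cuda"),  # residual
    torch.randn(64, dtype=torch.float16, device="cuda"),     # weight
    torch.tensor(0.5, device="cuda"),                        # scale
]
register_replacement(pattern, replacement, example_inputs, fwd_only,
                     add_norm_quant_pass)
```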

~8% speedup for the Qwen3-4B-FP8 checkpoint.

ISL=OSL=1000, concurrency=1, requests=10

| Config | req/sec | P50 e2e (ms) |
|--------|--------:|-------------:|
| Before |    0.31 |         3255 |
| After  |    0.33 |         3020 |

ISL=OSL=1000, concurrency=8, requests=80

| Config | req/sec | P50 e2e (ms) |
|--------|--------:|-------------:|
| Before |    2.09 |         3824 |
| After  |    2.26 |         3542 |

Test Coverage

New unit test in tests/unittest/_torch/compilation/test_add_norm_quant.py (a hedged sketch of its structure follows).
Existing integration tests cover e2e FP8 accuracy.
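
A hedged sketch of the unit test's likely shape, stitched from the review excerpts later in this thread (backend.match_count, the assert_close tolerances); the Backend constructor, shapes, and reference math are assumptions:

```python
import pytest
import torch
import torch.nn.functional as F

from tensorrt_llm._torch.compilation.backend import Backend  # class name assumed

@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
def test_add_norm_quant(dtype):
    hidden = 4096
    x = torch.randn(8, hidden, dtype=dtype, device="cuda")
    residual = torch.randn_like(x)
    weight = torch.randn(hidden, dtype=dtype, device="cuda")
    scale = torch.tensor(0.5, device="cuda")

    def func(x, residual):
        # Unfused reference: residual add -> RMSNorm -> FP8 static quant.
        inter = x + residual
        normed = F.rms_norm(inter, (hidden,), weight, eps=1e-6)
        return (normed / scale).to(torch.float8_e4m3fn), inter

    backend = Backend()  # hypothetical construction; the real test may pass options
    compiled = torch.compile(func, backend=backend)
    actual_out, actual_inter = compiled(x.clone(), residual.clone())
    ref_out, ref_inter = func(x, residual)

    # Exactly one fusion should have fired.
    assert backend.match_count[0] == 1, "Pattern Matching Failed"
    torch.testing.assert_close(actual_inter.float(), ref_inter.float(),
                               rtol=0.05, atol=0.15)
```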

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@amukkara amukkara changed the title [none][perf] Fuse add + rmsnorm + fp8 quant pattern [none][perf] Fuse add + norm + fp8 quant pattern Apr 2, 2026
@amukkara amukkara changed the title [none][perf] Fuse add + norm + fp8 quant pattern [None][perf] Fuse add + norm + fp8 quant pattern Apr 2, 2026
@amukkara amukkara marked this pull request as ready for review April 2, 2026 20:00
@amukkara amukkara requested review from a team as code owners April 2, 2026 20:00
@amukkara amukkara requested a review from liji-nv April 2, 2026 20:00
@coderabbitai
Contributor

coderabbitai Bot commented Apr 2, 2026

📝 Walkthrough


This pull request introduces a fused add+RMSNorm+quantization operation integrated into the TensorRT-LLM compilation pipeline. The changes add a new custom operator, pattern matcher registration, backend integration, utility mappings, and tests to support flashinfer's fused quantized normalization kernel for single-process execution.

Changes

  • Dependency Update (requirements.txt): updated flashinfer-python from 0.6.6 to 0.6.7.
  • Custom Operator (tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py): added the custom operator trtllm::flashinfer_fused_add_rmsnorm_quant, which wraps flashinfer's fused kernel, mutates its out and residual arguments, and provides a no-op register_fake implementation (a hedged sketch follows this list).
  • Pattern Matcher (tensorrt_llm/_torch/compilation/patterns/residual_add_norm.py): introduced register_add_norm_quant(), which registers a pattern to fuse add → rmsnorm → static_quantize chains into the new fused operator, including an extra structural check to ensure in-place mutation safety.
  • Compilation Integration (tensorrt_llm/_torch/compilation/backend.py, tensorrt_llm/_torch/compilation/utils.py): modified the backend's custom pass registration to insert the quantization pattern only for single-process cases (world_size <= 1); updated the inplace_info() mapping to track the new operator's mutated arguments (see the second sketch below).
  • Test Coverage (tests/unittest/_torch/compilation/test_add_norm_quant.py): added a parametrized GPU test verifying that the fusion pattern matches and produces correct quantized and residual outputs across float16/bfloat16, with and without the inductor backend.
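
The custom-operator item above maps naturally onto torch.library.custom_op with mutates_args. A hedged sketch: the op name and the mutated out/residual arguments come from this PR, but the parameter list and the in-body reference math (standing in for the actual FlashInfer dispatch) are assumptions.

```python
import torch

@torch.library.custom_op("trtllm::flashinfer_fused_add_rmsnorm_quant",
                         mutates_args=("out", "residual"))
def fused_add_rmsnorm_quant(out: torch.Tensor, x: torch.Tensor,
                            residual: torch.Tensor, weight: torch.Tensor,
                            scale: torch.Tensor, eps: float) -> None:
    # Reference math standing in for the FlashInfer kernel call.
    hidden = x + residual
    residual.copy_(hidden)  # post-add hidden state written back in place
    rms = torch.rsqrt(hidden.float().pow(2).mean(-1, keepdim=True) + eps)
    normed = hidden.float() * rms * weight.float()
    out.copy_((normed / scale).to(out.dtype))  # FP8 result written in place

@fused_add_rmsnorm_quant.register_fake
def _(out, x, residual, weight, scale, eps) -> None:
    # No-op fake: the op returns nothing and only mutates its arguments.
    pass
```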

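The compilation-integration item suggests a world-size gate when assembling passes, plus an in-place bookkeeping entry. Purely illustrative wiring, assuming register_add_norm_quant/register_add_norm take a pass object and that inplace_info() returns a dict keyed by op:

```python
import torch
from torch._inductor.pattern_matcher import PatternMatcherPass

# Both functions are named in this PR; their exact signatures are assumed.
from tensorrt_llm._torch.compilation.patterns.residual_add_norm import (
    register_add_norm, register_add_norm_quant)

def get_custom_passes(world_size: int):
    passes = []
    if world_size <= 1:
        # The quantization fusion is only registered for single-process runs.
        quant_pass = PatternMatcherPass()
        register_add_norm_quant(quant_pass)  # new in this PR
        passes.append(quant_pass)
    norm_pass = PatternMatcherPass()
    register_add_norm(norm_pass)  # pre-existing add+norm fusion
    passes.append(norm_pass)
    return passes

def inplace_info():
    # Records which arguments the fused op mutates so later passes preserve
    # mutation ordering; the exact structure in utils.py is assumed.
    return {
        torch.ops.trtllm.flashinfer_fused_add_rmsnorm_quant.default:
            {"out", "residual"},
    }
```
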
Sequence Diagram

```mermaid
sequenceDiagram
    participant torch as torch.compile
    participant backend as Backend
    participant matcher as Pattern Matcher
    participant op as Custom Operator
    participant kernel as Flashinfer Kernel

    torch->>backend: get_custom_pass(world_size=1)
    backend->>matcher: register_add_norm_quant()
    matcher->>matcher: Match: add → rmsnorm → quantize
    backend->>matcher: register_add_norm()
    torch->>matcher: Apply pattern matching
    matcher->>op: Detected fusion pattern
    op->>kernel: Dispatch to fused_add_rmsnorm_quant
    kernel-->>op: fp8_out, updated_residual
    op-->>torch: Return fused result
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: fusing the add + norm + fp8 quant pattern for performance. |
| Description check | ✅ Passed | The PR description includes all major required sections: a clear title in ticket/type format, a detailed description of the changes with performance metrics, a test coverage explanation, and a completed checklist. |


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
tests/unittest/_torch/compilation/test_add_norm_quant.py (2)

53-53: Assertion message could be more descriptive.

Consider including the actual match count in the assertion message for easier debugging when the test fails.

💡 Suggested improvement

```diff
-    assert backend.match_count[0] == 1, "Pattern Matching Failed"
+    assert backend.match_count[0] == 1, f"Pattern Matching Failed: expected 1 match, got {backend.match_count[0]}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/compilation/test_add_norm_quant.py` at line 53, The
assertion at backend.match_count[0] in test_add_norm_quant.py uses a generic
message; update it to include the actual match count and expected value for
easier debugging (e.g., reference backend.match_count[0] and the expected 1 in
the message). Modify the assertion in the test function that contains "assert
backend.match_count[0] == 1" so the failure message interpolates
backend.match_count[0] (and optionally the expected 1) into the string to show
both actual and expected counts.

50-67: Consider tightening tolerances for inter_out comparison.

The inter_out result (the residual after the add operation) should be numerically identical between the fused and unfused paths since both compute x + residual. The current tolerances (rtol=0.05, atol=0.15) are quite loose for a simple element-wise addition.

If the fused kernel produces bit-identical results for the residual update, consider using stricter tolerances (e.g., rtol=1e-5, atol=1e-5 for float16/bfloat16). If there are known numerical differences due to the kernel implementation, documenting this in a comment would be helpful.
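
A compact form of that suggestion, with the stricter values the comment proposes (variable names follow the review excerpt):

```python
# The residual add should match (near) bit-exactly between the fused and
# unfused paths, so a tight tolerance is defensible for inter_out.
torch.testing.assert_close(actual_inter, ref_inter, rtol=1e-5, atol=1e-5)
```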

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/compilation/test_add_norm_quant.py` around lines 50 -
67, The inter_out comparison is too loose—tighten the tolerances for
actual_inter vs ref_inter in this test: replace
torch.testing.assert_close(actual_inter, ref_inter, rtol=0.05, atol=0.15) with
much stricter tolerances (e.g., rtol=1e-5, atol=1e-5) for the elementwise add
check, or if dtype-specific differences exist, set dtype-aware tolerances (check
dtype and use 1e-5 for float16/bfloat16, else appropriate tighter values) and
add a brief comment above the assertion explaining why a relaxed tolerance would
be used only for known kernel-induced differences; keep references to
ref_func/func/actual_inter/ref_inter and ensure backend.match_count[0] assertion
remains.

ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cbc06d65-16ae-4ef2-8e9b-1ac4cab2c992

📥 Commits

Reviewing files that changed from the base of the PR and between 4c97a03 and 81e4fbe.

📒 Files selected for processing (6)
  • requirements.txt
  • tensorrt_llm/_torch/compilation/backend.py
  • tensorrt_llm/_torch/compilation/patterns/residual_add_norm.py
  • tensorrt_llm/_torch/compilation/utils.py
  • tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py
  • tests/unittest/_torch/compilation/test_add_norm_quant.py

@amukkara
Collaborator Author

amukkara commented Apr 3, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41577 [ run ] triggered by Bot. Commit: 37befd4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41577 [ run ] completed with state SUCCESS. Commit: 37befd4
/LLM/main/L0_MergeRequest_PR pipeline #32488 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@amukkara
Collaborator Author

amukkara commented Apr 3, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41717 [ run ] triggered by Bot. Commit: 37befd4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41717 [ run ] completed with state SUCCESS. Commit: 37befd4
/LLM/main/L0_MergeRequest_PR pipeline #32619 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@amukkara amukkara requested a review from a team as a code owner April 3, 2026 20:27
@amukkara
Collaborator Author

amukkara commented Apr 3, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41738 [ run ] triggered by Bot. Commit: c3910b9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41738 [ run ] completed with state SUCCESS. Commit: c3910b9
/LLM/main/L0_MergeRequest_PR pipeline #32639 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@amukkara
Collaborator Author

amukkara commented Apr 4, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41767 [ run ] triggered by Bot. Commit: 4554a23 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41767 [ run ] completed with state SUCCESS. Commit: 4554a23
/LLM/main/L0_MergeRequest_PR pipeline #32664 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@amukkara
Collaborator Author

amukkara commented Apr 6, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41959 [ run ] triggered by Bot. Commit: 4554a23 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41959 [ run ] completed with state SUCCESS. Commit: 4554a23
/LLM/main/L0_MergeRequest_PR pipeline #32813 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@amukkara amukkara enabled auto-merge (squash) April 30, 2026 17:13
@amukkara
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46430 [ run ] triggered by Bot. Commit: a4194eb Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46430 [ run ] completed with state SUCCESS. Commit: a4194eb
/LLM/main/L0_MergeRequest_PR pipeline #36501 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

amukkara added 2 commits May 1, 2026 10:13
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
@amukkara amukkara force-pushed the rms-quant-fusion branch from a4194eb to 504b5b9 on May 1, 2026 17:23
@amukkara
Collaborator Author

amukkara commented May 1, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46516 [ run ] triggered by Bot. Commit: 504b5b9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46516 [ run ] completed with state SUCCESS. Commit: 504b5b9
/LLM/main/L0_MergeRequest_PR pipeline #36575 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@amukkara
Collaborator Author

amukkara commented May 1, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46528 [ run ] triggered by Bot. Commit: 504b5b9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46528 [ run ] completed with state FAILURE. Commit: 504b5b9
/LLM/main/L0_MergeRequest_PR pipeline #36587 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@amukkara
Collaborator Author

amukkara commented May 4, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46669 [ run ] triggered by Bot. Commit: 504b5b9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46669 [ run ] completed with state SUCCESS. Commit: 504b5b9
/LLM/main/L0_MergeRequest_PR pipeline #36709 completed with status: 'SUCCESS'

CI Report

Link to invocation

@amukkara amukkara merged commit 72cd7d8 into NVIDIA:main May 4, 2026
6 checks passed