[None][perf] Fuse add + norm + fp8 quant pattern #12674

Merged
amukkara merged 2 commits into NVIDIA:main from amukkara:rms-quant-fusion on May 4, 2026

Conversation

@amukkara
Collaborator

@amukkara amukkara commented Apr 2, 2026

Summary by CodeRabbit

Release Notes

  • New Features
    • Added a quantization-aware fusion optimization for residual-add and RMSNorm operations, improving inference performance through fused kernel execution. Supports float16 and bfloat16.

Description

Add an additional PatternMatcherPass() in the compilation backend to fuse the (residual_add, rms_norm, fp8 static quantization) pattern. The matched subgraph is replaced with a fused kernel from FlashInfer.
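
A minimal sketch of how such a pass can be registered, assuming torch._inductor's pattern-matcher API; the tensor math below stands in for the traced pattern, and the fused-op signature is illustrative rather than the exact code in residual_add_norm.py:

```python
# Hedged sketch only: assumes torch._inductor.pattern_matcher's
# register_replacement API; the fused-op signature is an assumption.
import torch
from torch._inductor.pattern_matcher import (PatternMatcherPass, fwd_only,
                                             register_replacement)

add_norm_quant_pass = PatternMatcherPass()

def pattern(x, residual, weight, scale):
    # Unfused chain: residual add -> RMSNorm -> FP8 static quantization.
    hidden = x + residual
    rms = torch.rsqrt(hidden.pow(2).mean(-1, keepdim=True) + 1e-6)
    normed = hidden * rms * weight
    return (normed / scale).to(torch.float8_e4m3fn), hidden

def replacement(x, residual, weight, scale):
    # Single FlashInfer-backed kernel replaces the whole chain; it writes
    # the FP8 result into `out` and the post-add hidden state into `residual`.
    out = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    torch.ops.trtllm.flashinfer_fused_add_rmsnorm_quant(
        out, x, residual, weight, scale, 1e-6)
    return out, residual

example_inputs = [
    torch.randn(2, 64, dtype=torch.float16, device="cuda"),  # x
    torch.randn(2, 64, dtype=torch.float16, device="cuda"),  # residual
    torch.randn(64, dtype=torch.float16, device="cuda"),     # weight
    torch.tensor(0.5, device="cuda"),                        # scale
]
register_replacement(pattern, replacement, example_inputs, fwd_only,
                     add_norm_quant_pass)
```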

~8% speedup for the Qwen3-4B-FP8 checkpoint.

ISL=OSL=1000, concurrency=1, requests=10

| Config | req/sec | P50 e2e (ms) |
|--------|--------:|-------------:|
| Before |    0.31 |         3255 |
| After  |    0.33 |         3020 |

ISL=OSL=1000, concurrency=8, requests=80

| Config | req/sec | P50 e2e (ms) |
|--------|--------:|-------------:|
| Before |    2.09 |         3824 |
| After  |    2.26 |         3542 |

Test Coverage

New unit test in tests/unittest/_torch/compilation/test_add_norm_quant.py (a hedged sketch of its structure follows).
Existing integration tests cover e2e FP8 accuracy.
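
A hedged sketch of the unit test's likely shape, stitched from the review excerpts later in this thread (backend.match_count, the assert_close tolerances); the Backend constructor, shapes, and reference math are assumptions:

```python
import pytest
import torch
import torch.nn.functional as F

from tensorrt_llm._torch.compilation.backend import Backend  # class name assumed

@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
def test_add_norm_quant(dtype):
    hidden = 4096
    x = torch.randn(8, hidden, dtype=dtype, device="cuda")
    residual = torch.randn_like(x)
    weight = torch.randn(hidden, dtype=dtype, device="cuda")
    scale = torch.tensor(0.5, device="cuda")

    def func(x, residual):
        # Unfused reference: residual add -> RMSNorm -> FP8 static quant.
        inter = x + residual
        normed = F.rms_norm(inter, (hidden,), weight, eps=1e-6)
        return (normed / scale).to(torch.float8_e4m3fn), inter

    backend = Backend()  # hypothetical construction; the real test may pass options
    compiled = torch.compile(func, backend=backend)
    actual_out, actual_inter = compiled(x.clone(), residual.clone())
    ref_out, ref_inter = func(x, residual)

    # Exactly one fusion should have fired.
    assert backend.match_count[0] == 1, "Pattern Matching Failed"
    torch.testing.assert_close(actual_inter.float(), ref_inter.float(),
                               rtol=0.05, atol=0.15)
```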

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@amukkara amukkara changed the title [none][perf] Fuse add + rmsnorm + fp8 quant pattern [none][perf] Fuse add + norm + fp8 quant pattern Apr 2, 2026
@amukkara amukkara changed the title [none][perf] Fuse add + norm + fp8 quant pattern [None][perf] Fuse add + norm + fp8 quant pattern Apr 2, 2026
@amukkara amukkara marked this pull request as ready for review April 2, 2026 20:00
@amukkara amukkara requested review from a team as code owners April 2, 2026 20:00
@amukkara amukkara requested a review from liji-nv April 2, 2026 20:00
@coderabbitai
Contributor

coderabbitai Bot commented Apr 2, 2026

📝 Walkthrough


This pull request introduces a fused add+RMSNorm+quantization operation integrated into the TensorRT-LLM compilation pipeline. The changes add a new custom operator, pattern matcher registration, backend integration, utility mappings, and tests to support flashinfer's fused quantized normalization kernel for single-process execution.

Changes

  • Dependency Update (requirements.txt): updated flashinfer-python from 0.6.6 to 0.6.7.
  • Custom Operator (tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py): added the custom operator trtllm::flashinfer_fused_add_rmsnorm_quant, which wraps flashinfer's fused kernel, mutates its out and residual arguments, and provides a no-op register_fake implementation (a hedged sketch follows this list).
  • Pattern Matcher (tensorrt_llm/_torch/compilation/patterns/residual_add_norm.py): introduced register_add_norm_quant(), which registers a pattern to fuse add → rmsnorm → static_quantize chains into the new fused operator, including an extra structural check to ensure in-place mutation safety.
  • Compilation Integration (tensorrt_llm/_torch/compilation/backend.py, tensorrt_llm/_torch/compilation/utils.py): modified the backend's custom pass registration to insert the quantization pattern only for single-process cases (world_size <= 1); updated the inplace_info() mapping to track the new operator's mutated arguments (see the second sketch below).
  • Test Coverage (tests/unittest/_torch/compilation/test_add_norm_quant.py): added a parametrized GPU test verifying that the fusion pattern matches and produces correct quantized and residual outputs across float16/bfloat16, with and without the inductor backend.
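
The custom-operator item above maps naturally onto torch.library.custom_op with mutates_args. A hedged sketch: the op name and the mutated out/residual arguments come from this PR, but the parameter list and the in-body reference math (standing in for the actual FlashInfer dispatch) are assumptions.

```python
import torch

@torch.library.custom_op("trtllm::flashinfer_fused_add_rmsnorm_quant",
                         mutates_args=("out", "residual"))
def fused_add_rmsnorm_quant(out: torch.Tensor, x: torch.Tensor,
                            residual: torch.Tensor, weight: torch.Tensor,
                            scale: torch.Tensor, eps: float) -> None:
    # Reference math standing in for the FlashInfer kernel call.
    hidden = x + residual
    residual.copy_(hidden)  # post-add hidden state written back in place
    rms = torch.rsqrt(hidden.float().pow(2).mean(-1, keepdim=True) + eps)
    normed = hidden.float() * rms * weight.float()
    out.copy_((normed / scale).to(out.dtype))  # FP8 result written in place

@fused_add_rmsnorm_quant.register_fake
def _(out, x, residual, weight, scale, eps) -> None:
    # No-op fake: the op returns nothing and only mutates its arguments.
    pass
```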

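The compilation-integration item suggests a world-size gate when assembling passes, plus an in-place bookkeeping entry. Purely illustrative wiring, assuming register_add_norm_quant/register_add_norm take a pass object and that inplace_info() returns a dict keyed by op:

```python
import torch
from torch._inductor.pattern_matcher import PatternMatcherPass

# Both functions are named in this PR; their exact signatures are assumed.
from tensorrt_llm._torch.compilation.patterns.residual_add_norm import (
    register_add_norm, register_add_norm_quant)

def get_custom_passes(world_size: int):
    passes = []
    if world_size <= 1:
        # The quantization fusion is only registered for single-process runs.
        quant_pass = PatternMatcherPass()
        register_add_norm_quant(quant_pass)  # new in this PR
        passes.append(quant_pass)
    norm_pass = PatternMatcherPass()
    register_add_norm(norm_pass)  # pre-existing add+norm fusion
    passes.append(norm_pass)
    return passes

def inplace_info():
    # Records which arguments the fused op mutates so later passes preserve
    # mutation ordering; the exact structure in utils.py is assumed.
    return {
        torch.ops.trtllm.flashinfer_fused_add_rmsnorm_quant.default:
            {"out", "residual"},
    }
```
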
Sequence Diagram

```mermaid
sequenceDiagram
    participant torch as torch.compile
    participant backend as Backend
    participant matcher as Pattern Matcher
    participant op as Custom Operator
    participant kernel as Flashinfer Kernel

    torch->>backend: get_custom_pass(world_size=1)
    backend->>matcher: register_add_norm_quant()
    matcher->>matcher: Match: add → rmsnorm → quantize
    backend->>matcher: register_add_norm()
    torch->>matcher: Apply pattern matching
    matcher->>op: Detected fusion pattern
    op->>kernel: Dispatch to fused_add_rmsnorm_quant
    kernel-->>op: fp8_out, updated_residual
    op-->>torch: Return fused result
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: fusing the add + norm + fp8 quant pattern for performance. |
| Description check | ✅ Passed | The PR description includes all major required sections: a clear title in ticket/type format, a detailed description of the changes with performance metrics, a test coverage explanation, and a completed checklist. |


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
tests/unittest/_torch/compilation/test_add_norm_quant.py (2)

53-53: Assertion message could be more descriptive.

Consider including the actual match count in the assertion message for easier debugging when the test fails.

💡 Suggested improvement

```diff
-    assert backend.match_count[0] == 1, "Pattern Matching Failed"
+    assert backend.match_count[0] == 1, f"Pattern Matching Failed: expected 1 match, got {backend.match_count[0]}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/compilation/test_add_norm_quant.py` at line 53, The
assertion at backend.match_count[0] in test_add_norm_quant.py uses a generic
message; update it to include the actual match count and expected value for
easier debugging (e.g., reference backend.match_count[0] and the expected 1 in
the message). Modify the assertion in the test function that contains "assert
backend.match_count[0] == 1" so the failure message interpolates
backend.match_count[0] (and optionally the expected 1) into the string to show
both actual and expected counts.

50-67: Consider tightening tolerances for inter_out comparison.

The inter_out result (the residual after the add operation) should be numerically identical between the fused and unfused paths since both compute x + residual. The current tolerances (rtol=0.05, atol=0.15) are quite loose for a simple element-wise addition.

If the fused kernel produces bit-identical results for the residual update, consider using stricter tolerances (e.g., rtol=1e-5, atol=1e-5 for float16/bfloat16). If there are known numerical differences due to the kernel implementation, documenting this in a comment would be helpful.
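
A compact form of that suggestion, with the stricter values the comment proposes (variable names follow the review excerpt):

```python
# The residual add should match (near) bit-exactly between the fused and
# unfused paths, so a tight tolerance is defensible for inter_out.
torch.testing.assert_close(actual_inter, ref_inter, rtol=1e-5, atol=1e-5)
```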

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/compilation/test_add_norm_quant.py` around lines 50 -
67, The inter_out comparison is too loose—tighten the tolerances for
actual_inter vs ref_inter in this test: replace
torch.testing.assert_close(actual_inter, ref_inter, rtol=0.05, atol=0.15) with
much stricter tolerances (e.g., rtol=1e-5, atol=1e-5) for the elementwise add
check, or if dtype-specific differences exist, set dtype-aware tolerances (check
dtype and use 1e-5 for float16/bfloat16, else appropriate tighter values) and
add a brief comment above the assertion explaining why a relaxed tolerance would
be used only for known kernel-induced differences; keep references to
ref_func/func/actual_inter/ref_inter and ensure backend.match_count[0] assertion
remains.

ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cbc06d65-16ae-4ef2-8e9b-1ac4cab2c992

📥 Commits

Reviewing files that changed from the base of the PR and between 4c97a03 and 81e4fbe.

📒 Files selected for processing (6)
  • requirements.txt
  • tensorrt_llm/_torch/compilation/backend.py
  • tensorrt_llm/_torch/compilation/patterns/residual_add_norm.py
  • tensorrt_llm/_torch/compilation/utils.py
  • tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py
  • tests/unittest/_torch/compilation/test_add_norm_quant.py

@amukkara
Collaborator Author

amukkara commented Apr 3, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41577 [ run ] triggered by Bot. Commit: 37befd4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41577 [ run ] completed with state SUCCESS. Commit: 37befd4
/LLM/main/L0_MergeRequest_PR pipeline #32488 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@amukkara
Collaborator Author

amukkara commented Apr 3, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41717 [ run ] triggered by Bot. Commit: 37befd4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41717 [ run ] completed with state SUCCESS. Commit: 37befd4
/LLM/main/L0_MergeRequest_PR pipeline #32619 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@amukkara amukkara requested a review from a team as a code owner April 3, 2026 20:27
@amukkara
Collaborator Author

amukkara commented Apr 3, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41738 [ run ] triggered by Bot. Commit: c3910b9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41738 [ run ] completed with state SUCCESS. Commit: c3910b9
/LLM/main/L0_MergeRequest_PR pipeline #32639 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@amukkara
Collaborator Author

amukkara commented Apr 4, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41767 [ run ] triggered by Bot. Commit: 4554a23 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41767 [ run ] completed with state SUCCESS. Commit: 4554a23
/LLM/main/L0_MergeRequest_PR pipeline #32664 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@amukkara
Collaborator Author

amukkara commented Apr 6, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41959 [ run ] triggered by Bot. Commit: 4554a23 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41959 [ run ] completed with state SUCCESS. Commit: 4554a23
/LLM/main/L0_MergeRequest_PR pipeline #32813 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@amukkara amukkara enabled auto-merge (squash) April 30, 2026 17:13
@amukkara
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46430 [ run ] triggered by Bot. Commit: a4194eb Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46430 [ run ] completed with state SUCCESS. Commit: a4194eb
/LLM/main/L0_MergeRequest_PR pipeline #36501 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

amukkara added 2 commits May 1, 2026 10:13
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
@amukkara amukkara force-pushed the rms-quant-fusion branch from a4194eb to 504b5b9 on May 1, 2026 17:23
@amukkara
Collaborator Author

amukkara commented May 1, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46516 [ run ] triggered by Bot. Commit: 504b5b9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46516 [ run ] completed with state SUCCESS. Commit: 504b5b9
/LLM/main/L0_MergeRequest_PR pipeline #36575 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@amukkara
Collaborator Author

amukkara commented May 1, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46528 [ run ] triggered by Bot. Commit: 504b5b9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46528 [ run ] completed with state FAILURE. Commit: 504b5b9
/LLM/main/L0_MergeRequest_PR pipeline #36587 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@amukkara
Collaborator Author

amukkara commented May 4, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46669 [ run ] triggered by Bot. Commit: 504b5b9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46669 [ run ] completed with state SUCCESS. Commit: 504b5b9
/LLM/main/L0_MergeRequest_PR pipeline #36709 completed with status: 'SUCCESS'

CI Report

Link to invocation

@amukkara amukkara merged commit 72cd7d8 into NVIDIA:main May 4, 2026
6 checks passed