[None][perf] Add CUDA q_b norm for DeepSeek V4 by mingyangHao · Pull Request #13975 · NVIDIA/TensorRT-LLM

mingyangHao · 2026-05-11T07:23:23Z

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

lfr-0531 · 2026-05-11T08:18:49Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-05-11T08:26:27Z

PR_Github #47698 [ run ] triggered by Bot. Commit: 2b6494d Link to invocation

tensorrt-cicd · 2026-05-11T11:23:39Z

PR_Github #47698 [ run ] completed with state SUCCESS. Commit: 2b6494d
/LLM/main/L0_MergeRequest_PR pipeline #37594 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

qiaoxj07 · 2026-05-11T16:03:21Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-05-11T16:10:30Z

PR_Github #47771 [ run ] triggered by Bot. Commit: 2b6494d Link to invocation

tensorrt-cicd · 2026-05-11T17:53:38Z

PR_Github #47771 [ run ] completed with state SUCCESS. Commit: 2b6494d
/LLM/main/L0_MergeRequest_PR pipeline #37663 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

liji-nv · 2026-05-12T08:58:40Z

+        assert q.dim() == 2 and q.shape[
+            1] == self.num_heads_tp * self.qk_head_dim
+        total_rows = q.shape[0] * self.num_heads_tp
+        if (q_norm_op is not None and q.is_cuda and q.is_contiguous()


Dispatching on q.is_cuda may cause problems for Dynamo. The whole DeepSeek V4 op currently sits under a custom op and is not visible to Dynamo, but we had still better not introduce this branch, so that future improvements such as extending the piecewise CUDA graph range remain possible.

Fixed: only the new norm kernel is enabled, and all of those branches have been removed.
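The reviewed hunk asserts that q is 2-D with width num_heads_tp * qk_head_dim and counts q.shape[0] * num_heads_tp rows, which suggests that each head's slice is normalized on its own. A minimal pure-Python sketch of that per-head RMS normalization follows. The helper name, the absence of a learned weight, and eps=1e-6 are assumptions made for illustration; they are not taken from the PR's CUDA kernel.

```python
import math


def per_head_rms_norm(q, num_heads, head_dim, eps=1e-6):
    """Scale each head_dim-wide slice of every token row by 1 / RMS(slice).

    q is a list of token rows, each of length num_heads * head_dim,
    mirroring the 2-D layout asserted in the reviewed hunk.
    eps and the unweighted form are assumptions of this sketch.
    """
    out = []
    for row in q:
        assert len(row) == num_heads * head_dim
        normed = []
        for h in range(num_heads):
            head = row[h * head_dim:(h + 1) * head_dim]
            mean_sq = sum(x * x for x in head) / head_dim
            inv_rms = 1.0 / math.sqrt(mean_sq + eps)
            normed.extend(x * inv_rms for x in head)
        out.append(normed)
    return out


# Two tokens, two heads, head_dim = 2.
q = [[3.0, 4.0, 0.0, 2.0], [1.0, 1.0, 5.0, 12.0]]
out = per_head_rms_norm(q, num_heads=2, head_dim=2)

# Every (token, head) pair is an independent row for the norm.
total_rows = len(q) * 2
```

After normalization each head slice has a mean square close to 1, regardless of the scale of the other heads in the same token row.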

lfr-0531 · 2026-05-14T12:27:17Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-14T12:33:18Z

PR_Github #48363 [ run ] triggered by Bot. Commit: a53999f Link to invocation

tensorrt-cicd · 2026-05-14T19:58:46Z

PR_Github #48363 [ run ] completed with state SUCCESS. Commit: a53999f
/LLM/main/L0_MergeRequest_PR pipeline #38169 completed with status: 'SUCCESS'

CI Report

Link to invocation

lfr-0531 · 2026-05-15T03:40:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-15T03:48:17Z

PR_Github #48508 [ run ] triggered by Bot. Commit: 7099307 Link to invocation

tensorrt-cicd · 2026-05-15T08:11:58Z

PR_Github #48508 [ run ] completed with state SUCCESS. Commit: 7099307
/LLM/main/L0_MergeRequest_PR pipeline #38303 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

lfr-0531 · 2026-05-15T08:36:49Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-15T08:42:45Z

PR_Github #48566 [ run ] triggered by Bot. Commit: 7099307 Link to invocation

tensorrt-cicd · 2026-05-15T11:47:38Z

PR_Github #48566 [ run ] completed with state SUCCESS. Commit: 7099307
/LLM/main/L0_MergeRequest_PR pipeline #38354 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

lfr-0531 · 2026-05-17T05:07:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-17T05:15:27Z

PR_Github #48737 [ run ] triggered by Bot. Commit: 7099307 Link to invocation

tensorrt-cicd · 2026-05-17T06:11:58Z

PR_Github #48737 [ run ] completed with state SUCCESS. Commit: 7099307
/LLM/main/L0_MergeRequest_PR pipeline #38503 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

lfr-0531 · 2026-05-17T16:00:46Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-17T16:07:54Z

PR_Github #48774 [ run ] triggered by Bot. Commit: 7099307 Link to invocation

tensorrt-cicd · 2026-05-17T16:32:50Z

PR_Github #48774 [ run ] completed with state FAILURE. Commit: 7099307
/LLM/main/L0_MergeRequest_PR pipeline #38537 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

mingyangHao · 2026-05-18T01:35:49Z

/bot run --disable-fail-fast

Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com> Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>

lfr-0531 · 2026-05-18T03:30:36Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-18T03:36:55Z

PR_Github #48826 [ run ] triggered by Bot. Commit: 2145cdc Link to invocation

tensorrt-cicd · 2026-05-18T07:00:08Z

PR_Github #48826 [ run ] completed with state SUCCESS. Commit: 2145cdc
/LLM/main/L0_MergeRequest_PR pipeline #38586 completed with status: 'SUCCESS'

CI Report

Link to invocation

…uashed)

Enables by default:
- TRTLLM_FUSED_FP8_QUANT_PACK
- TRTLLM_MLA_EXTRA_OVERLAP
- use_cute_dsl_blockscaling_bmm

Conflicts resolved by keeping pr14120's new defaults but preserving the fused CUDA q-norm path from NVIDIA#13975 (the older inline reshape path is gone).

Source: NVIDIA#14120 (open PR)

Falls back to per-head RMS reshape when torch.ops.trtllm.deepseek_v4_q_norm is not registered (e.g., a Python-only patched image without a C++ rebuild). This lets us deploy PRs NVIDIA#14074 + NVIDIA#13975 via build_python_changes.sh without a full Docker rebuild. A later C++ rebuild will pick up the perf benefit of the fused CUDA kernel automatically.
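The fallback described in this commit message can be sketched as a one-time lookup. This is a minimal sketch, not the code from the commit: resolve_q_norm and both stand-in namespaces are hypothetical, and a plain SimpleNamespace mimics the op namespace so the example runs without PyTorch or a TensorRT-LLM build.

```python
from types import SimpleNamespace


def resolve_q_norm(op_namespace, fallback):
    """Return the fused op if it is registered on op_namespace, else fallback.

    op_namespace stands in for an op registry such as torch.ops.trtllm;
    the attribute name comes from the commit message, everything else
    here is a hypothetical illustration.
    """
    fused = getattr(op_namespace, "deepseek_v4_q_norm", None)
    return fused if fused is not None else fallback


def reshape_fallback(q):
    # Placeholder for the per-head RMS reshape path (assumed, not shown).
    return "fallback"


# Python-only patched image: the C++ op was never registered.
python_only = SimpleNamespace()
# Image with a C++ rebuild: the fused op is present.
rebuilt = SimpleNamespace(deepseek_v4_q_norm=lambda q: "fused")

q_norm_python_only = resolve_q_norm(python_only, reshape_fallback)
q_norm_rebuilt = resolve_q_norm(rebuilt, reshape_fallback)
```

Resolving the callable once, for example at module construction, keeps the per-call path free of branches, which fits the review request above to avoid dispatching inside the forward pass.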

Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com> Signed-off-by: Mingyang Hao <mingyangh@nvidia.com> Co-authored-by: Mingyang Hao <mingyangHao@users.noreply.github.com> (cherry picked from commit f833ad7) Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

github-actions Bot assigned mingyangHao May 11, 2026

mingyangHao added the deepseek-v4 label May 11, 2026

mingyangHao force-pushed the mingyangh/deepseek-v4-qnorm-cuda branch 2 times, most recently from 0ea44df to 2b6494d Compare May 11, 2026 07:32

mingyangHao marked this pull request as ready for review May 11, 2026 07:34

mingyangHao requested review from a team as code owners May 11, 2026 07:34

mingyangHao requested review from QiJune, lfr-0531 and liji-nv and removed request for a team, QiJune and liji-nv May 11, 2026 07:34

liji-nv reviewed May 12, 2026


mingyangHao force-pushed the mingyangh/deepseek-v4-qnorm-cuda branch from 2b6494d to 2df3924 Compare May 13, 2026 06:23

lfr-0531 force-pushed the feat/deepseek_v4 branch from 0a93d10 to 118e7a5 Compare May 14, 2026 07:44

lfr-0531 requested review from a team as code owners May 14, 2026 07:44

lfr-0531 requested review from mzweilz and yiqingy0 and removed request for a team May 14, 2026 07:44

mingyangHao force-pushed the mingyangh/deepseek-v4-qnorm-cuda branch 2 times, most recently from be8c3f1 to a53999f Compare May 14, 2026 08:51

mingyangHao force-pushed the mingyangh/deepseek-v4-qnorm-cuda branch from a53999f to 7099307 Compare May 15, 2026 03:27

mingyangHao added 2 commits May 17, 2026 20:17

[None][perf] Add CUDA q norm for DeepSeek V4

487e680

Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com> Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>

[perf] Always use DeepSeek V4 q norm op

2145cdc

Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com> Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>

mingyangHao force-pushed the mingyangh/deepseek-v4-qnorm-cuda branch from 7099307 to 2145cdc Compare May 18, 2026 03:28

lfr-0531 approved these changes May 18, 2026


lfr-0531 merged commit f833ad7 into NVIDIA:feat/deepseek_v4 May 18, 2026
6 checks passed


mingyangHao May 14, 2026


5 participants
