[None][perf] Add CUDA q_b norm for DeepSeek V4 #13975

Merged
lfr-0531 merged 2 commits into NVIDIA:feat/deepseek_v4 from mingyangHao:mingyangh/deepseek-v4-qnorm-cuda
May 18, 2026

@mingyangHao
Collaborator

@mingyangHao mingyangHao commented May 11, 2026

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@mingyangHao mingyangHao force-pushed the mingyangh/deepseek-v4-qnorm-cuda branch 2 times, most recently from 0ea44df to 2b6494d Compare May 11, 2026 07:32
@mingyangHao mingyangHao marked this pull request as ready for review May 11, 2026 07:34
@mingyangHao mingyangHao requested review from a team as code owners May 11, 2026 07:34
@mingyangHao mingyangHao requested review from QiJune, lfr-0531 and liji-nv and removed request for a team, QiJune and liji-nv May 11, 2026 07:34
@lfr-0531
Collaborator

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47698 [ run ] triggered by Bot. Commit: 2b6494d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47698 [ run ] completed with state SUCCESS. Commit: 2b6494d
/LLM/main/L0_MergeRequest_PR pipeline #37594 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@qiaoxj07
Collaborator

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47771 [ run ] triggered by Bot. Commit: 2b6494d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47771 [ run ] completed with state SUCCESS. Commit: 2b6494d
/LLM/main/L0_MergeRequest_PR pipeline #37663 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

assert q.dim() == 2 and q.shape[
1] == self.num_heads_tp * self.qk_head_dim
total_rows = q.shape[0] * self.num_heads_tp
if (q_norm_op is not None and q.is_cuda and q.is_contiguous()
Collaborator

Dispatching on q.is_cuda may cause issues for Dynamo. The whole DSv4 op currently sits under a custom op and is invisible to Dynamo, but it is still better not to introduce this branch, so that future improvements such as extending the piecewise CUDA graph range remain possible.

Collaborator Author

Fixed: only the new norm kernel is enabled now, and all of those branches have been removed.
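
For readers following the hunk above, here is a minimal sketch of what a per-head RMS norm over that layout computes. It assumes q holds one row per token with num_heads_tp heads of width qk_head_dim laid back to back, which is what the assert checks. It is written in plain Python so the arithmetic is visible without a GPU. The function name, the eps value, and the absence of a learned weight are assumptions for illustration and are not taken from the PR.

```python
import math
from typing import List


def per_head_rms_norm(q: List[List[float]], num_heads: int, head_dim: int,
                      eps: float = 1e-6) -> List[List[float]]:
    """Scale each head_dim-wide slice of every row by its own inverse RMS."""
    out = []
    for row in q:
        # Same shape contract as the assert in the reviewed hunk.
        assert len(row) == num_heads * head_dim
        new_row: List[float] = []
        for h in range(num_heads):
            head = row[h * head_dim:(h + 1) * head_dim]
            mean_sq = sum(x * x for x in head) / head_dim
            inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps is an assumed value
            new_row.extend(x * inv_rms for x in head)
        out.append(new_row)
    return out


# One token, two heads of width 2: each head is normalized independently,
# so there are tokens * num_heads independent rows of work in total.
normed = per_head_rms_norm([[3.0, 4.0, 0.0, 0.0]], num_heads=2, head_dim=2)
```

Treating the input as tokens * num_heads independent rows matches the total_rows expression in the hunk. A fused kernel would presumably do the same work in one launch instead of a reshape followed by a generic norm, though the thread does not show the kernel itself.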

@mingyangHao mingyangHao force-pushed the mingyangh/deepseek-v4-qnorm-cuda branch from 2b6494d to 2df3924 Compare May 13, 2026 06:23
@lfr-0531 lfr-0531 force-pushed the feat/deepseek_v4 branch from 0a93d10 to 118e7a5 Compare May 14, 2026 07:44
@lfr-0531 lfr-0531 requested review from a team as code owners May 14, 2026 07:44
@lfr-0531 lfr-0531 requested review from mzweilz and yiqingy0 and removed request for a team May 14, 2026 07:44
@mingyangHao mingyangHao force-pushed the mingyangh/deepseek-v4-qnorm-cuda branch 2 times, most recently from be8c3f1 to a53999f Compare May 14, 2026 08:51
@lfr-0531
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48363 [ run ] triggered by Bot. Commit: a53999f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48363 [ run ] completed with state SUCCESS. Commit: a53999f
/LLM/main/L0_MergeRequest_PR pipeline #38169 completed with status: 'SUCCESS'

CI Report

Link to invocation

@mingyangHao mingyangHao force-pushed the mingyangh/deepseek-v4-qnorm-cuda branch from a53999f to 7099307 Compare May 15, 2026 03:27
@lfr-0531
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48508 [ run ] triggered by Bot. Commit: 7099307 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48508 [ run ] completed with state SUCCESS. Commit: 7099307
/LLM/main/L0_MergeRequest_PR pipeline #38303 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lfr-0531
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48566 [ run ] triggered by Bot. Commit: 7099307 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48566 [ run ] completed with state SUCCESS. Commit: 7099307
/LLM/main/L0_MergeRequest_PR pipeline #38354 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lfr-0531
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48737 [ run ] triggered by Bot. Commit: 7099307 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48737 [ run ] completed with state SUCCESS. Commit: 7099307
/LLM/main/L0_MergeRequest_PR pipeline #38503 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lfr-0531
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48774 [ run ] triggered by Bot. Commit: 7099307 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48774 [ run ] completed with state FAILURE. Commit: 7099307
/LLM/main/L0_MergeRequest_PR pipeline #38537 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@mingyangHao
Collaborator Author

/bot run --disable-fail-fast

Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com>
Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com>
Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
@mingyangHao mingyangHao force-pushed the mingyangh/deepseek-v4-qnorm-cuda branch from 7099307 to 2145cdc Compare May 18, 2026 03:28
@lfr-0531
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48826 [ run ] triggered by Bot. Commit: 2145cdc Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48826 [ run ] completed with state SUCCESS. Commit: 2145cdc
/LLM/main/L0_MergeRequest_PR pipeline #38586 completed with status: 'SUCCESS'

CI Report

Link to invocation

@lfr-0531 lfr-0531 merged commit f833ad7 into NVIDIA:feat/deepseek_v4 May 18, 2026
6 checks passed
Thachnh added a commit to deepinfra/TensorRT-LLM that referenced this pull request May 19, 2026
…uashed)

Enables by default:
- TRTLLM_FUSED_FP8_QUANT_PACK
- TRTLLM_MLA_EXTRA_OVERLAP
- use_cute_dsl_blockscaling_bmm

Conflicts resolved by keeping pr14120's new defaults but preserving the
fused CUDA q-norm path from NVIDIA#13975 (the older inline reshape path is
gone).

Source: NVIDIA#14120 (open PR)
Thachnh added a commit to deepinfra/TensorRT-LLM that referenced this pull request May 19, 2026
Falls back to per-head RMS reshape when torch.ops.trtllm.deepseek_v4_q_norm
is not registered (e.g., Python-only patched image w/o C++ rebuild).
Lets us deploy PRs NVIDIA#14074 + NVIDIA#13975 via build_python_changes.sh without
a full Docker rebuild. C++ rebuild later will pick up the perf benefit
of the fused CUDA kernel automatically.
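
The fallback described in this commit message can be sketched as a one-time lookup: use the compiled op if it is registered, otherwise use a Python path. The sketch below is illustrative only. resolve_op, reshape_fallback, and the SimpleNamespace stand-ins are hypothetical names, and the real signature of deepseek_v4_q_norm is not shown in this thread. It also assumes the op namespace raises AttributeError for a missing op, which is what the getattr default relies on.

```python
from types import SimpleNamespace
from typing import Callable, Optional


def resolve_op(namespace: object, name: str, fallback: Callable) -> Callable:
    """Return namespace.<name> if it is present, otherwise the fallback."""
    op: Optional[Callable] = getattr(namespace, name, None)
    return op if op is not None else fallback


def reshape_fallback(q):
    # Placeholder for the slower per-head RMS reshape path (hypothetical).
    return ("fallback", q)


# Stand-ins for an op namespace without and with the compiled kernel.
python_only_image = SimpleNamespace()
rebuilt_image = SimpleNamespace(deepseek_v4_q_norm=lambda q: ("fused", q))

# Resolve once at setup time rather than branching on tensor properties
# on every call.
q_norm_py = resolve_op(python_only_image, "deepseek_v4_q_norm", reshape_fallback)
q_norm_cuda = resolve_op(rebuilt_image, "deepseek_v4_q_norm", reshape_fallback)
```

Because the lookup happens once, a later C++ rebuild that registers the op switches the callable to the fused path without any further Python change, which is the behavior the commit message describes.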
lfr-0531 pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request May 29, 2026
Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com>
Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
Co-authored-by: Mingyang Hao <mingyangHao@users.noreply.github.com>
(cherry picked from commit f833ad7)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

5 participants