[None][perf] Add CUDA q_b norm for DeepSeek V4 #13975
0ea44df to 2b6494d
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #47698 [ run ] triggered by Bot. Commit:
PR_Github #47698 [ run ] completed with state
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #47771 [ run ] triggered by Bot. Commit:
PR_Github #47771 [ run ] completed with state
    assert q.dim() == 2 and q.shape[
        1] == self.num_heads_tp * self.qk_head_dim
    total_rows = q.shape[0] * self.num_heads_tp
    if (q_norm_op is not None and q.is_cuda and q.is_contiguous()
Dispatching on q.is_cuda may cause problems for Dynamo. Although the whole DeepSeek V4 op currently sits under a custom op that Dynamo cannot see, we should still avoid introducing this, to leave room for future improvements such as extending the piecewise CUDA graph range.
Fixed: only the new norm kernel is enabled now, and all of those branches have been removed.
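For readers following the shape check in the snippet above, here is a minimal pure-Python sketch of what a per-head RMS norm over a `[num_tokens, num_heads * head_dim]` input computes. It is a reference illustration only, not the CUDA kernel from this PR: the function name `per_head_rms_norm`, the absence of a learned scale, and the default `eps` are all assumptions made for the example.

```python
import math
from typing import List


def per_head_rms_norm(q: List[List[float]], num_heads: int, head_dim: int,
                      eps: float = 1e-6) -> List[List[float]]:
    """RMS-normalize every head slice of every row independently.

    q has num_tokens rows, each of length num_heads * head_dim, which
    mirrors the 2-D shape asserted in the diff. Conceptually this is
    num_tokens * num_heads rows of length head_dim.
    NOTE: no learned weight is applied and eps is a hypothetical default.
    """
    out = []
    for row in q:
        assert len(row) == num_heads * head_dim
        new_row = []
        for h in range(num_heads):
            head = row[h * head_dim:(h + 1) * head_dim]
            # Root mean square over this head's elements only.
            rms = math.sqrt(sum(x * x for x in head) / head_dim + eps)
            new_row.extend(x / rms for x in head)
        out.append(new_row)
    return out


# One token, two heads of width two: each head is scaled on its own.
normed = per_head_rms_norm([[3.0, 4.0, 0.0, 2.0]], num_heads=2, head_dim=2,
                           eps=0.0)
```

With `eps=0.0`, each head slice of `normed` has a mean square of 1, regardless of the magnitude of the other head in the same row.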
2b6494d to 2df3924
0a93d10 to 118e7a5
be8c3f1 to a53999f
/bot run --disable-fail-fast
PR_Github #48363 [ run ] triggered by Bot. Commit:
PR_Github #48363 [ run ] completed with state
a53999f to 7099307
/bot run --disable-fail-fast
PR_Github #48508 [ run ] triggered by Bot. Commit:
PR_Github #48508 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48566 [ run ] triggered by Bot. Commit:
PR_Github #48566 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48737 [ run ] triggered by Bot. Commit:
PR_Github #48737 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48774 [ run ] triggered by Bot. Commit:
PR_Github #48774 [ run ] completed with state
/bot run --disable-fail-fast
Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com>
Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
7099307 to 2145cdc
/bot run --disable-fail-fast
PR_Github #48826 [ run ] triggered by Bot. Commit:
PR_Github #48826 [ run ] completed with state
…uashed)
Enables by default:
- TRTLLM_FUSED_FP8_QUANT_PACK
- TRTLLM_MLA_EXTRA_OVERLAP
- use_cute_dsl_blockscaling_bmm
Conflicts resolved by keeping pr14120's new defaults but preserving the fused CUDA q-norm path from NVIDIA#13975 (the older inline reshape path is gone).
Source: NVIDIA#14120 (open PR)
Falls back to the per-head RMS reshape path when torch.ops.trtllm.deepseek_v4_q_norm is not registered (e.g., a Python-only patched image without a C++ rebuild). This lets us deploy PRs NVIDIA#14074 + NVIDIA#13975 via build_python_changes.sh without a full Docker rebuild. A later C++ rebuild will pick up the perf benefit of the fused CUDA kernel automatically.
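The fallback described in this commit message can be sketched as a one-time lookup by op name, which keeps the choice independent of any per-tensor property. This is a hedged illustration, not the code from the commit: `resolve_q_norm`, `python_rms_rows`, `fake_fused_kernel`, and the `SimpleNamespace` objects are hypothetical stand-ins, with the namespaces playing the role of `torch.ops.trtllm` so the example runs without PyTorch.

```python
import math
from types import SimpleNamespace


def python_rms_rows(rows, eps=1e-6):
    """Slow-path stand-in: RMS-normalize each row independently."""
    out = []
    for row in rows:
        rms = math.sqrt(sum(x * x for x in row) / len(row) + eps)
        out.append([x / rms for x in row])
    return out


def fake_fused_kernel(rows):
    """Hypothetical placeholder for a compiled op; reuses the Python math."""
    return python_rms_rows(rows)


def resolve_q_norm(ops_namespace, op_name="deepseek_v4_q_norm"):
    """Return the registered op if present, otherwise the Python fallback.

    The decision depends only on whether the op exists, so it can be made
    once (for example at module construction) instead of on every call.
    """
    fused = getattr(ops_namespace, op_name, None)
    return fused if fused is not None else python_rms_rows


# Hypothetical namespaces: one without the op, one with it registered.
python_only_image = SimpleNamespace()
rebuilt_image = SimpleNamespace(deepseek_v4_q_norm=fake_fused_kernel)

slow_path = resolve_q_norm(python_only_image)
fast_path = resolve_q_norm(rebuilt_image)
```

Here `slow_path` is the Python fallback and `fast_path` is the registered stand-in, so a rebuilt image would switch implementations without any change to the calling code.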
Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com>
Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
Co-authored-by: Mingyang Hao <mingyangHao@users.noreply.github.com>
(cherry picked from commit f833ad7)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>