[None][feat] Enable 2 DSv4 perf optimizations by default #14120
/bot run --add-multi-gpu-test
PR_Github #48336 [ run ] triggered by Bot. Commit:
PR_Github #48336 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48399 [ run ] triggered by Bot. Commit:
PR_Github #48399 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48471 [ run ] triggered by Bot. Commit:
PR_Github #48471 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48552 [ run ] triggered by Bot. Commit:
PR_Github #48552 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48724 [ run ] triggered by Bot. Commit:
PR_Github #48724 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48735 [ run ] triggered by Bot. Commit:
PR_Github #48735 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48773 [ run ] triggered by Bot. Commit:
PR_Github #48773 [ run ] completed with state
Switch defaults so DeepSeek-V4 inference paths run with two previously opt-in optimizations enabled out of the box. Both remain user-disableable via the same env vars.

1. PR NVIDIA#13628 fused FP8 1x128 quantize + UE8M0 pack on SM100
   - tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
   - Env: TRTLLM_FUSED_FP8_QUANT_PACK (default '0' -> '1')
   - Disable: TRTLLM_FUSED_FP8_QUANT_PACK=0

2. PR NVIDIA#13629 MLA dependency-aware overlap on DSv4
   - tensorrt_llm/_torch/modules/attention.py
   - Env: TRTLLM_MLA_EXTRA_OVERLAP (default '0' -> '1')
   - Disable: TRTLLM_MLA_EXTRA_OVERLAP=0

The third originally-proposed flip (use_cute_dsl_blockscaling_bmm) is dropped from this PR. The cute_dsl FP8 BMM path is also invoked for DSv3 K/V absorption BMMs on Blackwell + FP8 block-scales (the fp8_block_scaling_bmm_out dispatcher at attention.py:1161 is not gated on is_deepseek_v4), so flipping the default would change DSv3 perf behavior silently. Defer that flip until a DSv3 Blackwell-FP8 smoke confirms no regression.

Signed-off-by: Shicheng Li <shicli@nvidia.com>
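The reason for deferring the third flip can be illustrated with a toy dispatcher. This is only a sketch under stated assumptions: `ToyConfig`, `pick_bmm_backend`, `pick_bmm_backend_gated`, and the backend label strings are hypothetical names invented for illustration, not code from attention.py. It shows how a flag that is not gated on is_deepseek_v4 changes the backend for a DSv3-style config as well.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    # Hypothetical stand-ins for the real flags; not TensorRT-LLM's config classes.
    is_deepseek_v4: bool
    use_cute_dsl_blockscaling_bmm: bool = False


def pick_bmm_backend(cfg: ToyConfig) -> str:
    """Ungated dispatch: the flag alone selects the backend, for any model."""
    if cfg.use_cute_dsl_blockscaling_bmm:
        return "cute_dsl_fp8_bmm"
    return "torch_bmm_bf16"


def pick_bmm_backend_gated(cfg: ToyConfig) -> str:
    """Gated variant, shown only for contrast: the model check guards the flag."""
    if cfg.is_deepseek_v4 and cfg.use_cute_dsl_blockscaling_bmm:
        return "cute_dsl_fp8_bmm"
    return "torch_bmm_bf16"


# A DSv3-style config with the flag flipped on, as a default flip would do.
dsv3 = ToyConfig(is_deepseek_v4=False, use_cute_dsl_blockscaling_bmm=True)
print(pick_bmm_backend(dsv3))        # ungated: DSv3 also takes the cute_dsl path
print(pick_bmm_backend_gated(dsv3))  # gated: DSv3 keeps its previous backend
```

The gated variant is not something this PR proposes; it is included only to make the difference visible.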
…uashed)

Enables by default:
- TRTLLM_FUSED_FP8_QUANT_PACK
- TRTLLM_MLA_EXTRA_OVERLAP
- use_cute_dsl_blockscaling_bmm

Conflicts resolved by keeping pr14120's new defaults but preserving the fused CUDA q-norm path from NVIDIA#13975 (the older inline reshape path is gone).

Source: NVIDIA#14120 (open PR)

Signed-off-by: Shicheng Li <shicli@nvidia.com>
(cherry picked from commit d7d9036)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Description
Flip the default of two previously opt-in DSv4 perf flags so users get them out of the box. Both remain user-disableable via the same env vars they always had.
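As a rough illustration of what flipping an env-var default means in practice, here is a minimal sketch. The helper `env_flag` and the two wrapper functions are hypothetical names, and the actual parsing in torch_custom_ops.py and attention.py may differ (for example in when the variable is read or which values count as enabled). The sketch assumes a flag is on only when the variable equals "1".

```python
import os


def env_flag(name: str, default: str) -> bool:
    """Hypothetical helper: the flag is on only when the value is exactly "1"."""
    return os.environ.get(name, default) == "1"


def fused_fp8_quant_pack_enabled() -> bool:
    # New default is "1" (previously "0"); an explicit "0" still opts out.
    return env_flag("TRTLLM_FUSED_FP8_QUANT_PACK", "1")


def mla_extra_overlap_enabled() -> bool:
    # Same pattern for the MLA overlap flag.
    return env_flag("TRTLLM_MLA_EXTRA_OVERLAP", "1")
```

Under this sketch, leaving the variable unset enables the optimization, while exporting `TRTLLM_FUSED_FP8_QUANT_PACK=0` or `TRTLLM_MLA_EXTRA_OVERLAP=0` before launch restores the previous behavior.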
| Env var | Default | File | Disable with |
| --- | --- | --- | --- |
| TRTLLM_FUSED_FP8_QUANT_PACK | "0" → "1" | tensorrt_llm/_torch/custom_ops/torch_custom_ops.py | TRTLLM_FUSED_FP8_QUANT_PACK=0 |
| TRTLLM_MLA_EXTRA_OVERLAP | "0" → "1" | tensorrt_llm/_torch/modules/attention.py | TRTLLM_MLA_EXTRA_OVERLAP=0 |

A third originally-proposed flip (use_cute_dsl_blockscaling_bmm from False to True) is dropped from this PR and deferred. The cute_dsl FP8 BMM path also fires for DSv3 K/V absorption BMMs on Blackwell + FP8 block-scales: the fp8_block_scaling_bmm_out dispatcher at tensorrt_llm/_torch/modules/attention.py:1161 is not gated on is_deepseek_v4, so flipping the default would silently change DSv3 perf behavior (the K/V BMMs switch from torch.bmm against pre-dequanted bf16 weights to cute_dsl_fp8_bmm_blackwell consuming native FP8). DSv3 Blackwell + FP8 block-scales hasn't been re-benched, so a separate PR will re-propose the flip after a DSv3 smoke confirms no regression.

Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.