[None][perf] DSV4 compressor: enable 4-warp Phase 3 reduction for MTP… by mingyangHao · Pull Request #14124 · NVIDIA/TensorRT-LLM

mingyangHao · 2026-05-14T07:31:36Z

Description

DSV4 GEN (decode) with MTP-3 runs pagedKvCompressKernel with NEXT_N=4. The current dispatch only enables the 4-warp Phase 3 reduction for NEXT_N <= 2, so the NEXT_N=3..4 path falls back to a single warp serially issuing 128 paged loads. Because the kernel is launched on a tiny (batch, HEAD_BLOCKS=4) grid with 1 warp/block, each SM holds fewer than 14 resident warps, so the 300-cycle DRAM latency on every load is fully exposed and slow-path latency balloons to ~55-60 µs/call at batch ≥ 256.
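As a rough back-of-envelope check on the paragraph above, the sketch below models the single-warp path as 128 fully serialized loads at 300 cycles each, and the 4-warp path as four concurrent warps that each handle a quarter of the loads. The function name and the 1.9 GHz SM clock are assumptions for illustration only (the PR states neither), and the model ignores compute, launch overhead, and any overlap between loads.

```cpp
// Crude latency model: `loads` dependent loads, each exposing `latencyCycles`,
// split evenly across `warps` warps that run concurrently.
// Returns the estimated exposed latency in microseconds at `clockGHz`.
constexpr double exposedLatencyUs(int loads, int warps, double latencyCycles, double clockGHz)
{
    int const loadsPerWarp = (loads + warps - 1) / warps; // ceil-divide the loads across warps
    double const cycles = loadsPerWarp * latencyCycles;   // loads within one warp are serialized
    return cycles / (clockGHz * 1000.0);                  // 1 GHz = 1000 cycles per microsecond
}

constexpr int kLoads = 128;             // from the PR description
constexpr double kLatencyCycles = 300;  // from the PR description
constexpr double kClockGHz = 1.9;       // ASSUMPTION: not stated in the PR

constexpr double kOneWarpUs = exposedLatencyUs(kLoads, 1, kLatencyCycles, kClockGHz);
constexpr double kFourWarpUs = exposedLatencyUs(kLoads, 4, kLatencyCycles, kClockGHz);
```

Under these assumptions the model gives about 20.2 µs for one warp and about 5.1 µs for four, which is the same order of magnitude as the batch=16 measurements reported in this PR. It is a sanity check on the explanation, not a substitute for the microbenchmark.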

Relaxing the dispatch condition to next_n <= 4 and instantiating the existing multi-warp template for NEXT_N=3,4 across both head_dim variants cuts slow-path latency by 2-3×:

| batch | NEXT_N | before | after | speedup |
|---|---|---|---|---|
| 16 (production per-rank) | 4 | 20.9 µs | 7.2 µs | 2.90× |
| 64 | 4 | 23.2 µs | 9.9 µs | 2.34× |
| 128 | 4 | 40.1 µs | 13.9 µs | 2.88× |
| 256 | 4 | 51.1 µs | 25.6 µs | 2.00× |

On the source nsys trace (DSV4-Pro concurrency=512, MTP-3, batch=16 per rank), this kernel goes from 2.1% of GEN GPU time → ~0.6%.
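To put the share change in absolute terms, the sketch below converts the before/after percentages into the fraction of the original GEN GPU time that is saved. It assumes everything outside this kernel is unchanged, the function name is hypothetical, and the inputs are the rounded figures quoted above, so the result is only approximate.

```cpp
// Given a kernel's share of total time before and after an optimization,
// and assuming the rest of the workload is unchanged, return the fraction
// of the ORIGINAL total time that was saved.
constexpr double fractionSaved(double shareBefore, double shareAfter)
{
    // Normalize the original total to 1, so the non-kernel time is:
    double const other = 1.0 - shareBefore;
    // Solve kernelAfter / (other + kernelAfter) = shareAfter for kernelAfter.
    double const kernelAfter = other * shareAfter / (1.0 - shareAfter);
    return shareBefore - kernelAfter;
}

// 2.1% -> ~0.6% of GEN GPU time, as quoted for the nsys trace.
constexpr double kSaved = fractionSaved(0.021, 0.006);
```

With the quoted shares this works out to roughly 1.5% of the original GEN GPU time.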

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

… NEXT_N=3..4

For DSV4 GEN (decode) with MTP-3, each step accepts up to NEXT_N=4 tokens per request, so pagedKvCompressKernel runs with NEXT_N=4. The current dispatch only enables the 4-warp Phase 3 reduction for NEXT_N<=2, leaving the NEXT_N=3..4 path as a single warp doing 128 serial paged loads. With the kernel launched on a tiny (batch, HEAD_BLOCKS=4) grid at 1 warp/block, per-SM resident warps are <14 and the 300-cycle DRAM latency on every load is fully exposed; the slow path balloons to ~55-60 us per call at batch>=256.

Relaxing the dispatch condition from `next_n<=2` to `next_n<=4` and instantiating the existing multi-warp template for NEXT_N=3,4 across both head_dim variants cuts slow-path latency ~3x at the production batch=16 per-rank shape (20.9us -> 7.2us) and ~2.9x at batch=128 (40.1us -> 13.9us). End-to-end this kernel goes from 2.1% of GEN GPU time to ~0.6% on the user's DSV4-Pro concurrency=512 trace.

Phase 3 multi-warp merge logic is per-c (per emitted compressed token) and already handles NEXT_N >= 2 generically, so there is no algorithmic change.

Tested:
- tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_kernel.py::test_decode_mtp passes all 4 MTP_CONFIGS (overlap_hd128_next4, overlap_hd512_next3, basic_hd128_multi_batch_next4, basic_hd512_next4) -> bit-identical numerical results.
- Microbench (HD=512, KV=bf16, STATE=fp32, R=128) emit-path latency:
  - batch=16 NEXT_N=4: 20.9us -> 7.2us (2.90x)
  - batch=128 NEXT_N=4: 40.1us -> 13.9us (2.88x)
  - batch=256 NEXT_N=4: 51.1us -> 25.6us (2.00x)

Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
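The dispatch change described in this commit message can be pictured as a one-line predicate. The sketch below is illustrative only: `selectReduceWarps` and its constants are hypothetical names rather than the launcher's actual code, and it assumes that any `next_n` above the threshold stays on the single-warp fallback.

```cpp
// Number of warps used for the Phase 3 reduction as a function of NEXT_N.
// `maxMultiWarpNextN` is the dispatch threshold: 2 before this change, 4 after.
constexpr int selectReduceWarps(int nextN, int maxMultiWarpNextN)
{
    constexpr int kMultiWarp = 4;  // 4-warp Phase 3 reduction
    constexpr int kSingleWarp = 1; // fallback: one warp issues all loads serially
    return (nextN <= maxMultiWarpNextN) ? kMultiWarp : kSingleWarp;
}

// Old threshold: the MTP-3 case (NEXT_N=4) falls back to a single warp.
static_assert(selectReduceWarps(4, 2) == 1, "NEXT_N=4 was single-warp");
// New threshold: the same case takes the 4-warp path.
static_assert(selectReduceWarps(4, 4) == 4, "NEXT_N=4 is now multi-warp");
```

Because the multi-warp kernel is a template, widening the predicate only helps if the NEXT_N=3 and NEXT_N=4 instantiations also exist, which is why the commit pairs the condition change with new template instantiations.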

mingyangHao · 2026-05-14T08:38:16Z

/bot run --add-multi-gpu-test --disable-fail-fast

…-macro

The decode-kernel side of compressorKernels.cu had grown 5 layered macros (INST_DECODE / INST_DECODE_NN / INST_DECODE_DTYPES / LAUNCH_DECODE / LAUNCH_DECODE_MW / DISPATCH_NN / DISPATCH_NN_MW / DISPATCH_DTYPE) plus a 7-arm if-else cascade in the launcher. Adding a new (HD, KV_EB, STATE_EB, CR, NN, NRW) config required touching 5 places, and the multi-warp NEXT_N=3..4 patch from the previous commit had to hand-write 16 INST_DECODE lines because the layered macros could not express "this dtype combo exists at NRW=4 only".

Replace the layered dispatch with a single X-macro FOREACH_DECODE_CONFIG(F) listing every valid tuple once. Both the explicit template instantiations and the runtime dispatcher walk the same list, so adding or removing a config is a one-line edit, and instantiation and dispatch can never drift out of sync. Two small fan-out helpers (FOREACH_DECODE_NN, FOREACH_DECODE_DTYPE) keep the master list one row per (HD, CR, NRW) bucket.

Net: -141 / +63 lines (78 fewer); the launcher dispatch shrinks from ~90 lines of nested switch/if-else to a single TRY_LAUNCH walk over the config list. No semantic change.

Verified:
- libtensorrt_llm.so links cleanly (compile + relink with sm_100f).
- test_compressor_kernel.py full suite: 63 passed, 22 skipped (no regressions).
- test_decode_mtp 4/4 passes for all MTP_CONFIGS, bit-identical to pre-refactor.

Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
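For readers unfamiliar with the X-macro pattern this commit adopts, here is a minimal host-side sketch. The macro names FOREACH_DECODE_CONFIG and TRY_LAUNCH come from the commit message; everything else is an assumption for illustration: the tuple is shortened to (HD, NN, NRW), the rows are made up, and `launchDecode` is a stand-in that returns an encoded id instead of launching a kernel.

```cpp
// Stand-in for the decode kernel launcher template. The real kernel has more
// template parameters (KV_EB, STATE_EB, CR); this one only encodes its tuple
// so callers can check which instance was selected.
template <int HD, int NN, int NRW>
constexpr int launchDecode()
{
    return HD * 100 + NN * 10 + NRW;
}

// Master list: every valid (HD, NN, NRW) tuple appears exactly once.
// ASSUMPTION: these rows are illustrative, not the PR's actual config list.
#define FOREACH_DECODE_CONFIG(F) \
    F(128, 1, 4)                 \
    F(128, 4, 4)                 \
    F(512, 3, 4)                 \
    F(512, 4, 4)

// Walk 1: explicit template instantiations generated from the list.
#define INSTANTIATE(HD, NN, NRW) template int launchDecode<HD, NN, NRW>();
FOREACH_DECODE_CONFIG(INSTANTIATE)
#undef INSTANTIATE

// Walk 2: runtime dispatch generated from the same list.
// Returns -1 when no listed config matches the runtime arguments.
constexpr int dispatchDecode(int hd, int nn, int nrw)
{
#define TRY_LAUNCH(HD, NN, NRW)             \
    if (hd == HD && nn == NN && nrw == NRW) \
    {                                       \
        return launchDecode<HD, NN, NRW>(); \
    }
    FOREACH_DECODE_CONFIG(TRY_LAUNCH)
#undef TRY_LAUNCH
    return -1;
}
```

Adding a row to FOREACH_DECODE_CONFIG updates both walks at once, which is the property the commit relies on to keep instantiation and dispatch in sync. A tuple that is not in the list, such as `dispatchDecode(512, 2, 1)` here, falls through to the `-1` path instead of producing a link error.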

lfr-0531 · 2026-05-14T12:26:36Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-14T12:34:00Z

PR_Github #48365 [ run ] triggered by Bot. Commit: 8c36875 Link to invocation

tensorrt-cicd · 2026-05-14T19:30:04Z

PR_Github #48365 [ run ] completed with state SUCCESS. Commit: 8c36875
/LLM/main/L0_MergeRequest_PR pipeline #38170 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again


lfr-0531 · 2026-05-15T01:40:45Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-15T01:47:22Z

PR_Github #48479 [ run ] triggered by Bot. Commit: 8c36875 Link to invocation

tensorrt-cicd · 2026-05-15T03:27:34Z

PR_Github #48479 [ run ] completed with state SUCCESS. Commit: 8c36875
/LLM/main/L0_MergeRequest_PR pipeline #38275 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again


lfr-0531 · 2026-05-15T04:53:14Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-15T04:59:24Z

PR_Github #48525 [ run ] triggered by Bot. Commit: 8c36875 Link to invocation

tensorrt-cicd · 2026-05-15T06:17:20Z

PR_Github #48525 [ run ] completed with state SUCCESS. Commit: 8c36875
/LLM/main/L0_MergeRequest_PR pipeline #38318 completed with status: 'SUCCESS'


NVIDIA#14124) Signed-off-by: Mingyang Hao <mingyangh@nvidia.com> (cherry picked from commit 82ebfc7) Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

github-actions Bot assigned mingyangHao May 14, 2026

lfr-0531 force-pushed the feat/deepseek_v4 branch from 0a93d10 to 118e7a5 Compare May 14, 2026 07:44

lfr-0531 requested review from a team as code owners May 14, 2026 07:44

lfr-0531 requested review from mzweilz and yiqingy0 and removed request for a team May 14, 2026 07:44

mingyangHao force-pushed the mingyangh/v4-compressor-multiwarp-mtp branch from 07295fb to f170cf1 Compare May 14, 2026 08:35

mingyangHao requested review from heyuhhh and removed request for a team, mzweilz and yiqingy0 May 14, 2026 08:36

mingyangHao added the deepseek-v4 label May 14, 2026

mingyangHao force-pushed the mingyangh/v4-compressor-multiwarp-mtp branch from f170cf1 to 8c36875 Compare May 14, 2026 08:57

lfr-0531 approved these changes May 15, 2026

View reviewed changes

lfr-0531 merged commit 82ebfc7 into NVIDIA:feat/deepseek_v4 May 15, 2026
6 checks passed

lfr-0531 merged 2 commits into NVIDIA:feat/deepseek_v4 from mingyangHao:mingyangh/v4-compressor-multiwarp-mtp


3 participants
