[None][perf] DSV4 compressor: enable 4-warp Phase 3 reduction for MTP…#14124

Merged
lfr-0531 merged 2 commits into NVIDIA:feat/deepseek_v4 from mingyangHao:mingyangh/v4-compressor-multiwarp-mtp on May 15, 2026
@mingyangHao
Collaborator

@mingyangHao mingyangHao commented May 14, 2026

@coderabbitai summary

Description

DSV4 GEN (decode) with MTP-3 runs pagedKvCompressKernel with NEXT_N=4. The current dispatch only enables the 4-warp Phase 3 reduction for NEXT_N <= 2, so the NEXT_N=3..4 path falls back to a single warp serially issuing 128 paged loads. With the kernel launched on a tiny (batch, HEAD_BLOCKS=4) grid at 1 warp/block, fewer than 14 warps are resident per SM, so the 300-cycle DRAM latency of every load is fully exposed: slow-path latency balloons to ~55-60 µs/call at batch ≥ 256.

Relaxing the dispatch condition to next_n <= 4 and instantiating the existing multi-warp template for NEXT_N=3,4 across both head_dim variants cuts slow-path latency by 2-3×:

| batch | NEXT_N | before | after | speedup |
|---|---|---|---|---|
| 16 (production per-rank) | 4 | 20.9 µs | 7.2 µs | 2.90× |
| 64 | 4 | 23.2 µs | 9.9 µs | 2.34× |
| 128 | 4 | 40.1 µs | 13.9 µs | 2.88× |
| 256 | 4 | 51.1 µs | 25.6 µs | 2.00× |

On the source nsys trace (DSV4-Pro concurrency=512, MTP-3, batch=16 per rank), this kernel goes from 2.1% of GEN GPU time → ~0.6%.
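The latency argument above can be checked with a back-of-the-envelope model. The sketch below is not taken from the PR: the helper names `exposedLoadCycles` and `idealSpeedup` are hypothetical, and it assumes the 128 paged loads are split evenly across the reduction warps, that loads within a warp do not overlap, and that loads across warps overlap perfectly.

```cpp
#include <cstdint>

// Hypothetical model, not code from the PR.
// Cycles of exposed load latency when `totalLoads` loads are split evenly
// across `warps` warps and each warp issues its share strictly serially.
// Assumes no overlap inside a warp and perfect overlap across warps.
constexpr std::int64_t exposedLoadCycles(
    std::int64_t totalLoads, std::int64_t latencyCycles, std::int64_t warps)
{
    // Ceil division: the slowest warp determines the critical path.
    std::int64_t loadsPerWarp = (totalLoads + warps - 1) / warps;
    return loadsPerWarp * latencyCycles;
}

// Upper bound on the speedup from going from 1 warp to `warps` warps
// under the same assumptions (ignores the cross-warp merge step).
constexpr double idealSpeedup(
    std::int64_t totalLoads, std::int64_t latencyCycles, std::int64_t warps)
{
    return static_cast<double>(exposedLoadCycles(totalLoads, latencyCycles, 1))
        / static_cast<double>(exposedLoadCycles(totalLoads, latencyCycles, warps));
}
```

With the figures quoted in the description (128 loads, 300 cycles each), the single-warp path exposes 38,400 cycles and a 4-warp split exposes 9,600, an ideal bound of 4×. The measured 2.0-2.9× sits below that bound, which would be consistent with costs the model ignores, such as the cross-warp merge and launch overhead.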

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@lfr-0531 lfr-0531 force-pushed the feat/deepseek_v4 branch from 0a93d10 to 118e7a5 Compare May 14, 2026 07:44
@lfr-0531 lfr-0531 requested review from a team as code owners May 14, 2026 07:44
@lfr-0531 lfr-0531 requested review from mzweilz and yiqingy0 and removed request for a team May 14, 2026 07:44
… NEXT_N=3..4

For DSV4 GEN (decode) with MTP-3, each step accepts up to NEXT_N=4 tokens
per request, so pagedKvCompressKernel runs with NEXT_N=4. The current
dispatch only enables the 4-warp Phase 3 reduction for NEXT_N<=2, leaving
the NEXT_N=3..4 path as a single warp doing 128 serial paged loads. With
the kernel launched at (batch, HEAD_BLOCKS=4)=tiny grid + 1 warp/block,
per-SM resident warps are <14 and the 300-cycle DRAM latency on every
load is fully exposed -- slow path balloons to ~55-60 us per call at
batch>=256.

Relaxing the dispatch condition from `next_n<=2` to `next_n<=4` and
instantiating the existing multi-warp template for NEXT_N=3,4 across
both head_dim variants drops the slow path ~2.9x at both the production
batch=16 per-rank shape (20.9us -> 7.2us) and batch=128
(40.1us -> 13.9us), and 2x at batch=256 (51.1us -> 25.6us). End-to-end
this kernel goes from 2.1% of GEN GPU time to ~0.6% on the user's
DSV4-Pro concurrency=512 trace.

Phase 3 multi-warp merge logic is per-c (per emitted compressed token)
and already handles NEXT_N >= 2 generically -- no algorithmic change.
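The dispatch change described in this commit message amounts to moving one threshold. A minimal sketch follows, assuming the launcher picks between a 1-warp and a 4-warp Phase 3 variant based on `next_n`; the names `phase3Warps`, `kOldMaxMultiWarpNextN`, and `kNewMaxMultiWarpNextN` are hypothetical and do not appear in the PR.

```cpp
// Hypothetical sketch of the dispatch threshold, not the launcher's actual code.
constexpr int kMultiWarpCount = 4;        // warps used by the multi-warp Phase 3 reduction
constexpr int kOldMaxMultiWarpNextN = 2;  // condition before this PR: next_n <= 2
constexpr int kNewMaxMultiWarpNextN = 4;  // condition after this PR:  next_n <= 4

// Returns how many warps the Phase 3 reduction would use for a given next_n
// when multi-warp variants are available up to `maxMultiWarpNextN`.
constexpr int phase3Warps(int nextN, int maxMultiWarpNextN)
{
    return (nextN >= 1 && nextN <= maxMultiWarpNextN) ? kMultiWarpCount : 1;
}
```

Under the old threshold, `phase3Warps(4, kOldMaxMultiWarpNextN)` falls back to a single warp; under the new one, the same call selects the 4-warp path. The real launcher also needs a matching template instantiation for each newly admitted NEXT_N.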

Tested:
- tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_kernel.py::test_decode_mtp
  passes all 4 MTP_CONFIGS (overlap_hd128_next4, overlap_hd512_next3,
  basic_hd128_multi_batch_next4, basic_hd512_next4) -> bit-identical
  numerical results.
- Microbench (HD=512, KV=bf16, STATE=fp32, R=128) emit-path latency:
    batch=16  NEXT_N=4: 20.9us -> 7.2us   (2.90x)
    batch=128 NEXT_N=4: 40.1us -> 13.9us  (2.88x)
    batch=256 NEXT_N=4: 51.1us -> 25.6us  (2.00x)

Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
@mingyangHao mingyangHao force-pushed the mingyangh/v4-compressor-multiwarp-mtp branch from 07295fb to f170cf1 Compare May 14, 2026 08:35
@mingyangHao mingyangHao requested review from heyuhhh and removed request for a team, mzweilz and yiqingy0 May 14, 2026 08:36
@mingyangHao
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

…-macro

The decode-kernel side of compressorKernels.cu had grown eight layered macros
(INST_DECODE / INST_DECODE_NN / INST_DECODE_DTYPES /
LAUNCH_DECODE / LAUNCH_DECODE_MW / DISPATCH_NN / DISPATCH_NN_MW /
DISPATCH_DTYPE) plus a 7-arm if-else cascade in the launcher. Adding a
new (HD, KV_EB, STATE_EB, CR, NN, NRW) config required touching 5 places,
and the multi-warp NEXT_N=3..4 patch from the previous commit had to
hand-write 16 INST_DECODE lines because the layered macros could not
express "this dtype combo exists at NRW=4 only".

Replace the layered dispatch with a single X-macro
FOREACH_DECODE_CONFIG(F) listing every valid tuple once. Both the
explicit template instantiations and the runtime dispatcher walk the
same list -- adding/removing a config is a one-line edit, and
instantiation/dispatch can never drift out of sync. Two small
fan-out helpers (FOREACH_DECODE_NN, FOREACH_DECODE_DTYPE) keep the
master list one row per (HD, CR, NRW) bucket.

Net: -141 / +63 lines (78 fewer); the launcher dispatch shrinks from
~90 lines of nested switch/if-else to a single TRY_LAUNCH walk over
the config list.
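The X-macro pattern described in this commit can be illustrated in a few lines of host-side C++. This is a simplified sketch, not the contents of compressorKernels.cu: it keeps the `FOREACH_DECODE_CONFIG` and `TRY_LAUNCH` names from the commit message, but reduces the tuple to (HD, NN, NRW), uses made-up rows, and replaces the kernel launch with a hypothetical `launchDecodeStub` that only reports which configuration was selected.

```cpp
// Hypothetical stand-in for the templated decode launcher. It encodes the
// chosen configuration in its return value so the dispatch can be observed.
template <int HD, int NN, int NRW>
int launchDecodeStub()
{
    return HD * 100 + NN * 10 + NRW;
}

// Master list: every valid (HD, NN, NRW) tuple appears exactly once.
// These rows are illustrative only, not the PR's real config list.
#define FOREACH_DECODE_CONFIG(F) \
    F(128, 1, 4)                 \
    F(128, 4, 4)                 \
    F(512, 3, 4)                 \
    F(512, 4, 4)

// Use 1: explicit template instantiations generated from the list.
#define INSTANTIATE_DECODE(HD, NN, NRW) template int launchDecodeStub<HD, NN, NRW>();
FOREACH_DECODE_CONFIG(INSTANTIATE_DECODE)
#undef INSTANTIATE_DECODE

// Use 2: runtime dispatcher generated from the same list, so a config
// cannot be dispatched without also being instantiated.
int dispatchDecode(int hd, int nn, int nrw)
{
#define TRY_LAUNCH(HD, NN, NRW) \
    if (hd == HD && nn == NN && nrw == NRW) return launchDecodeStub<HD, NN, NRW>();
    FOREACH_DECODE_CONFIG(TRY_LAUNCH)
#undef TRY_LAUNCH
    return -1; // no matching configuration in the list
}
```

Adding a row to `FOREACH_DECODE_CONFIG` updates both the instantiation set and the dispatcher at once, which is the property the commit message relies on. In this sketch an unsupported tuple returns -1; how the real launcher reports an unsupported configuration is not stated in the PR.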

No semantic change. Verified:
- libtensorrt_llm.so links cleanly (compile + relink with sm_100f).
- test_compressor_kernel.py full suite: 63 passed, 22 skipped (no regressions).
- test_decode_mtp 4/4 passes for all MTP_CONFIGS -- bit-identical to
  pre-refactor.

Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
@mingyangHao mingyangHao force-pushed the mingyangh/v4-compressor-multiwarp-mtp branch from f170cf1 to 8c36875 Compare May 14, 2026 08:57
@lfr-0531
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48365 [ run ] triggered by Bot. Commit: 8c36875 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48365 [ run ] completed with state SUCCESS. Commit: 8c36875
/LLM/main/L0_MergeRequest_PR pipeline #38170 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lfr-0531
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48479 [ run ] triggered by Bot. Commit: 8c36875 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48479 [ run ] completed with state SUCCESS. Commit: 8c36875
/LLM/main/L0_MergeRequest_PR pipeline #38275 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lfr-0531
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48525 [ run ] triggered by Bot. Commit: 8c36875 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48525 [ run ] completed with state SUCCESS. Commit: 8c36875
/LLM/main/L0_MergeRequest_PR pipeline #38318 completed with status: 'SUCCESS'

CI Report

Link to invocation

@lfr-0531 lfr-0531 merged commit 82ebfc7 into NVIDIA:feat/deepseek_v4 May 15, 2026
6 checks passed
lfr-0531 pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request May 29, 2026
NVIDIA#14124)

Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
(cherry picked from commit 82ebfc7)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>