[None][perf] DSV4 compressor: enable 4-warp Phase 3 reduction for MTP…#14124
0a93d10 to
118e7a5
Compare
… NEXT_N=3..4
For DSV4 GEN (decode) with MTP-3, each step accepts up to NEXT_N=4 tokens
per request, so pagedKvCompressKernel runs with NEXT_N=4. The current
dispatch only enables the 4-warp Phase 3 reduction for NEXT_N<=2, leaving
the NEXT_N=3..4 path as a single warp doing 128 serial paged loads. With
the kernel launched at (batch, HEAD_BLOCKS=4)=tiny grid + 1 warp/block,
per-SM resident warps are <14 and the 300-cycle DRAM latency on every
load is fully exposed -- slow path balloons to ~55-60 us per call at
batch>=256.
Relaxing the dispatch condition from `next_n<=2` to `next_n<=4` and
instantiating the existing multi-warp template for NEXT_N=3,4 across
both head_dim variants drops the slow path ~2.9x at the production
batch=16 per-rank shape (20.9us -> 7.2us) and at batch=128
(40.1us -> 13.9us), and 2x at batch=256 (51.1us -> 25.6us). End-to-end
this kernel goes from 2.1% of GEN GPU time to ~0.6% on the user's
DSV4-Pro concurrency=512 trace.
Phase 3 multi-warp merge logic is per-c (per emitted compressed token)
and already handles NEXT_N >= 2 generically -- no algorithmic change.
Tested:
- tests/unittest/_torch/attention/sparse/deepseek_v4/test_compressor_kernel.py::test_decode_mtp
  passes all 4 MTP_CONFIGS (overlap_hd128_next4, overlap_hd512_next3,
  basic_hd128_multi_batch_next4, basic_hd512_next4) -> bit-identical
  numerical results.
- Microbench (HD=512, KV=bf16, STATE=fp32, R=128) emit-path latency:
    batch=16  NEXT_N=4: 20.9us ->  7.2us (2.90x)
    batch=128 NEXT_N=4: 40.1us -> 13.9us (2.88x)
    batch=256 NEXT_N=4: 51.1us -> 25.6us (2.00x)
Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
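The dispatch change described in this commit can be pictured with a minimal host-side sketch. The function names, the constants, and the single-warp fallback value below are illustrative assumptions, not the identifiers used in compressorKernels.cu, and any other conditions the real dispatcher checks are omitted.

```cpp
#include <cassert>

// Illustrative constants (assumed): the multi-warp path uses 4 warps for the
// Phase 3 reduction, the fallback path uses a single warp.
constexpr int kMultiWarps = 4;
constexpr int kSingleWarp = 1;

// Dispatch rule before the change: multi-warp only for NEXT_N <= 2.
constexpr int reduceWarpsBefore(int nextN)
{
    return nextN <= 2 ? kMultiWarps : kSingleWarp;
}

// Dispatch rule after the change: multi-warp for NEXT_N <= 4.
constexpr int reduceWarpsAfter(int nextN)
{
    return nextN <= 4 ? kMultiWarps : kSingleWarp;
}

// MTP-3 decode runs with NEXT_N=4, which moves from the serial path
// to the 4-warp path.
static_assert(reduceWarpsBefore(4) == kSingleWarp, "old rule: serial path");
static_assert(reduceWarpsAfter(4) == kMultiWarps, "new rule: 4-warp path");
```

Under this sketch only the threshold moves; the per-c merge logic itself is untouched, which is consistent with the bit-identical test results reported above.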
07295fb to
f170cf1
Compare
/bot run --add-multi-gpu-test --disable-fail-fast
…-macro

The decode-kernel side of compressorKernels.cu had grown 5 layered macros (INST_DECODE / INST_DECODE_NN / INST_DECODE_DTYPES / LAUNCH_DECODE / LAUNCH_DECODE_MW / DISPATCH_NN / DISPATCH_NN_MW / DISPATCH_DTYPE) plus a 7-arm if-else cascade in the launcher. Adding a new (HD, KV_EB, STATE_EB, CR, NN, NRW) config required touching 5 places, and the multi-warp NEXT_N=3..4 patch from the previous commit had to hand-write 16 INST_DECODE lines because the layered macros could not express "this dtype combo exists at NRW=4 only".

Replace the layered dispatch with a single X-macro FOREACH_DECODE_CONFIG(F) listing every valid tuple once. Both the explicit template instantiations and the runtime dispatcher walk the same list -- adding/removing a config is a one-line edit, and instantiation/dispatch can never drift out of sync. Two small fan-out helpers (FOREACH_DECODE_NN, FOREACH_DECODE_DTYPE) keep the master list one row per (HD, CR, NRW) bucket.

Net: -141 / +63 lines (78 fewer); the launcher dispatch shrinks from ~90 lines of nested switch/if-else to a single TRY_LAUNCH walk over the config list. No semantic change.

Verified:
- libtensorrt_llm.so links cleanly (compile + relink with sm_100f).
- test_compressor_kernel.py full suite: 63 passed, 22 skipped (no regressions).
- test_decode_mtp 4/4 passes for all MTP_CONFIGS -- bit-identical to pre-refactor.

Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
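The X-macro pattern this commit describes can be sketched in plain C++ as follows. Everything here is hypothetical: `FOREACH_DEMO_CONFIG`, `launchDemo`, `dispatchDemo`, and the four (HD, NN, NRW) rows are stand-ins chosen for illustration, and the real list carries more fields (KV_EB, STATE_EB, CR) and launches a kernel rather than returning an integer.

```cpp
#include <cassert>

// Single source of truth (hypothetical rows): every valid (HD, NN, NRW) tuple
// appears exactly once.
#define FOREACH_DEMO_CONFIG(F) \
    F(128, 2, 4)               \
    F(128, 4, 4)               \
    F(512, 3, 4)               \
    F(512, 4, 4)

// Stand-in for a templated kernel launcher; it returns a tag so the selected
// specialization can be observed.
template <int HD, int NN, int NRW>
int launchDemo()
{
    return HD * 100 + NN * 10 + NRW;
}

// Walk 1: explicit template instantiations generated from the list.
#define INSTANTIATE_DEMO(HD, NN, NRW) template int launchDemo<HD, NN, NRW>();
FOREACH_DEMO_CONFIG(INSTANTIATE_DEMO)
#undef INSTANTIATE_DEMO

// Walk 2: runtime dispatch generated from the same list, so the two walks
// cannot drift apart. Returns -1 when no config matches.
inline int dispatchDemo(int hd, int nn, int nrw)
{
#define TRY_LAUNCH_DEMO(HD, NN, NRW)          \
    if (hd == HD && nn == NN && nrw == NRW)   \
        return launchDemo<HD, NN, NRW>();
    FOREACH_DEMO_CONFIG(TRY_LAUNCH_DEMO)
#undef TRY_LAUNCH_DEMO
    return -1;
}
```

In this sketch, adding a config means adding one `F(...)` row; both the instantiation and the dispatch arm appear automatically. A tuple that is not listed, such as `dispatchDemo(512, 2, 4)`, falls through to the `-1` sentinel instead of linking against a missing instantiation.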
f170cf1 to
8c36875
Compare
/bot run --disable-fail-fast
PR_Github #48365 [ run ] triggered by Bot. Commit:
PR_Github #48365 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48479 [ run ] triggered by Bot. Commit:
PR_Github #48479 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48525 [ run ] triggered by Bot. Commit:
PR_Github #48525 [ run ] completed with state
NVIDIA#14124) Signed-off-by: Mingyang Hao <mingyangh@nvidia.com> (cherry picked from commit 82ebfc7) Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
@coderabbitai summary
Description
DSV4 GEN (decode) with MTP-3 runs `pagedKvCompressKernel` with `NEXT_N=4`. The current dispatch only enables the 4-warp Phase 3 reduction for `NEXT_N <= 2`, so the `NEXT_N=3..4` path falls back to a single warp serially issuing 128 paged loads. With the kernel launched at `(batch, HEAD_BLOCKS=4)` = tiny grid with 1 warp/block, per-SM resident warps are <14 and the 300-cycle DRAM latency on every load is fully exposed; slow-path latency balloons to ~55-60 µs/call at batch ≥ 256.
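A back-of-envelope model of the exposed latency, using only the 128 loads and 300-cycle figure quoted above. The helper names are hypothetical, and the even split of loads across 4 warps is an assumption for illustration rather than a statement about how the kernel partitions work.

```cpp
#include <cassert>

// Cycles spent waiting if every load's latency is fully exposed, i.e. loads
// are issued one after another with nothing else to hide the stall.
constexpr long exposedCycles(int serialLoads, int latencyCycles)
{
    return static_cast<long>(serialLoads) * latencyCycles;
}

// Loads each warp issues serially if the total is split evenly (assumption).
constexpr int loadsPerWarp(int totalLoads, int warps)
{
    return (totalLoads + warps - 1) / warps;
}

// Single warp: 128 serial loads at 300 cycles each.
static_assert(exposedCycles(128, 300) == 38400, "1-warp serial chain");

// Four warps: each warp's serial chain shrinks to 32 loads.
static_assert(exposedCycles(loadsPerWarp(128, 4), 300) == 9600, "4-warp chain");
```

This simple model gives a 4× upper bound on the improvement from shortening the serial chain; the measured gains reported for this change are lower (about 2× to 2.9×), which is expected since the model ignores everything except load latency.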
Relaxing the dispatch condition to `next_n <= 4` and instantiating the existing multi-warp template for `NEXT_N=3,4` across both head_dim variants drops the slow path 2-3×.
On the source nsys trace (DSV4-Pro concurrency=512, MTP-3, batch=16 per rank), this kernel goes from 2.1% of GEN GPU time → ~0.6%.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either `api-compatible` or `api-breaking`. For `api-breaking`, include `BREAKING` in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.