[https://nvbugs/5983390][perf] Reduce host overhead in DSA MLA attention path #12691
Merged
liji-nv merged 4 commits into NVIDIA:feat/bench_y on Apr 3, 2026
Conversation
…ion path

Pass pre-computed num_contexts/num_ctx_tokens to thop::attention and trtllm_gen_attention to eliminate per-layer sum().item() calls that recompute the batch structure from host_request_types/host_context_lengths. Move the view/slice/reinterpret ops from the Python _update_k_cache into the C++ indexer_k_cache_scatter_op kernel: accept the original k_fp8 (FP8) and k_scale (float32) tensors directly along with num_tokens, avoiding per-layer torch.empty, view, as_strided, and slice overhead on the host.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
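For context, a minimal sketch of the pattern this commit removes, assuming the usual TRT-LLM convention that host_request_types stores 0 for context requests and 1 for generation requests; `attention_layer` and `NUM_LAYERS` are illustrative stand-ins, not the real API:

```python
import torch

# Hypothetical stub standing in for the per-layer thop.attention call.
def attention_layer(layer_idx: int, num_contexts: int, num_ctx_tokens: int) -> None:
    pass  # kernel launch elided

NUM_LAYERS = 4

# Assumed convention: 0 = context request, 1 = generation request
# (values below are illustrative).
host_request_types = torch.tensor([0, 0, 1, 1], dtype=torch.int32)
host_context_lengths = torch.tensor([128, 64, 1, 1], dtype=torch.int32)

def forward_before():
    # Each layer re-derives the batch structure; every sum().item()
    # pair adds host-side work that scales with the layer count.
    for layer_idx in range(NUM_LAYERS):
        num_contexts = (host_request_types == 0).sum().item()
        num_ctx_tokens = host_context_lengths[:num_contexts].sum().item()
        attention_layer(layer_idx, num_contexts, num_ctx_tokens)

def forward_after():
    # Compute once per forward pass, then pass plain Python ints down.
    num_contexts = (host_request_types == 0).sum().item()
    num_ctx_tokens = host_context_lengths[:num_contexts].sum().item()
    for layer_idx in range(NUM_LAYERS):
        attention_layer(layer_idx, num_contexts, num_ctx_tokens)
```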
Remove the default values from the num_contexts, num_generations, and num_ctx_tokens parameters of trtllm_gen_attention() so callers cannot silently omit them, and move them before the optional global_layer_idx parameter.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
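Roughly, the resulting signature looks like the sketch below; all parameter names other than num_contexts, num_generations, num_ctx_tokens, and global_layer_idx are placeholders, not the real interface:

```python
def trtllm_gen_attention(
    q,                          # placeholder tensor arguments
    kv_cache,
    host_request_types,
    num_contexts: int,          # required: no default to fall back on
    num_generations: int,       # required here (removed entirely in a later commit)
    num_ctx_tokens: int,        # required, ordered before the optional params
    global_layer_idx: int = 0,  # optional params stay at the end
):
    ...
```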
- Remove std::optional from num_contexts/num_ctx_tokens in the attention op (always passed by the caller, never null)
- Remove the num_generations param from the trtllm_gen_attention interface; compute it internally as host_request_types.size(0) - num_contexts
- Remove the unused _parse_request_types function
- Restore tensor dimension/dtype validation in IndexerKCacheScatterOp with element_size checks to guard reinterpret_cast safety

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
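A hedged Python rendering of the second and fourth bullets above; the real logic lives in the C++ op, and the exact shapes/dtypes checked are assumptions modeled on the commit text:

```python
import torch

def derive_num_generations(host_request_types: torch.Tensor,
                           num_contexts: int) -> int:
    # num_generations is now computed inside the op instead of being
    # a separate interface parameter.
    return host_request_types.size(0) - num_contexts

def check_scatter_inputs(k_fp8: torch.Tensor, k_scale: torch.Tensor) -> None:
    # Mirror of the restored validation: wrong dtypes or element sizes
    # would make the kernel's reinterpret_cast read garbage.
    assert k_fp8.element_size() == 1, "k_fp8 must be a 1-byte FP8 dtype"
    assert k_scale.dtype == torch.float32, "k_scale must be float32"
    assert k_scale.element_size() == 4
```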
liji-nv (Collaborator, Author)
Please hold the merge: I haven't run the E2E tests yet after addressing some comments from the previous PR.
…ttention call Store num_contexts and num_ctx_tokens in _TrtllmPlanner during plan_host and pass them as keyword args to thop.attention, matching the updated C++ binding signature. Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
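A sketch of the planner change, assuming plan_host already receives the host-side batch metadata; everything other than _TrtllmPlanner, plan_host, num_contexts, and num_ctx_tokens is illustrative:

```python
import torch

class _TrtllmPlanner:
    def plan_host(self, host_request_types: torch.Tensor,
                  host_context_lengths: torch.Tensor) -> None:
        # Derive the batch structure once and cache it on the planner.
        self.num_contexts = int((host_request_types == 0).sum())
        self.num_ctx_tokens = int(
            host_context_lengths[: self.num_contexts].sum())

    def run(self, thop_attention, *tensor_args):
        # Keyword args match the updated C++ binding signature.
        return thop_attention(
            *tensor_args,
            num_contexts=self.num_contexts,
            num_ctx_tokens=self.num_ctx_tokens,
        )
```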
liji-nv (Collaborator, Author)
The previously failing autodeploy test passed. Merging now.
SimengLiu-nv pushed a commit to SimengLiu-nv/TensorRT-LLM that referenced this pull request on Apr 6, 2026:
…ion path (NVIDIA#12691) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
dongfengy pushed a commit to dongfengy/TensorRT-LLM that referenced this pull request on Apr 8, 2026:
…ion path (NVIDIA#12691) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
dongfengy pushed a commit to dongfengy/TensorRT-LLM that referenced this pull request on Apr 10, 2026:
…ion path (NVIDIA#12691) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Description

Pass pre-computed num_contexts/num_ctx_tokens to thop::attention and trtllm_gen_attention to eliminate per-layer sum().item() calls that recompute the batch structure from host_request_types/host_context_lengths.

Move the view/slice/reinterpret ops from the Python _update_k_cache into the C++ indexer_k_cache_scatter_op kernel: accept the original k_fp8 (FP8) and k_scale (float32) tensors directly along with num_tokens, avoiding per-layer torch.empty, view, as_strided, and slice overhead on the host.
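To make the second change concrete, a before/after sketch of the k-cache update path; tensor shapes, the staging layout, and the op's argument list are assumptions based on the description above, not the actual code:

```python
import torch

def update_k_cache_before(k_fp8: torch.Tensor, k_scale: torch.Tensor,
                          scatter_op) -> None:
    # Old path: stage FP8 values and per-token scales into one byte
    # buffer, paying a fresh allocation plus view/slice work on the
    # host for every layer.
    num_tokens, head_dim = k_fp8.shape
    staged = torch.empty(num_tokens, head_dim + 4,
                         dtype=torch.uint8, device=k_fp8.device)
    staged[:, :head_dim] = k_fp8.view(torch.uint8)
    staged[:, head_dim:] = k_scale.view(torch.uint8).reshape(num_tokens, 4)
    scatter_op(staged)

def update_k_cache_after(k_fp8: torch.Tensor, k_scale: torch.Tensor,
                         scatter_op) -> None:
    # New path: hand the original FP8 and float32 tensors straight to
    # the C++ op, which reinterprets and scatters them internally.
    scatter_op(k_fp8, k_scale, num_tokens=k_fp8.size(0))
```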
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update the tava architecture diagram if there is a significant design change in the PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.