Skip to content

[None][feat] Support tool-calling in KvCacheAwareRouter for disagg serving (feat/deepseek_v4)#14611

Merged
lfr-0531 merged 1 commit into
NVIDIA:feat/deepseek_v4from
lishicheng1996-nv:feat/kv-aware-router-tools-dsv4
May 28, 2026
Merged

[None][feat] Support tool-calling in KvCacheAwareRouter for disagg serving (feat/deepseek_v4)#14611
lfr-0531 merged 1 commit into
NVIDIA:feat/deepseek_v4from
lishicheng1996-nv:feat/kv-aware-router-tools-dsv4

Conversation

@lishicheng1996-nv
Copy link
Copy Markdown
Collaborator

Description

Cherry-pick of #13232 onto feat/deepseek_v4.

KvCacheAwareRouter._tokenize pre-tokenizes ChatCompletionRequests to compute routing block-hashes, and sets request.prompt_token_ids so the worker server skips re-tokenization. Today it only forwards request.messages to apply_chat_template, ignoring request.tools and request.chat_template_kwargs.

For models whose chat templates materialize tools into the prompt (DeepSeek-V3.2 inlines a tool-schema system block via its custom template), this silently drops the entire tool block from the final prompt — even though the client provided tools in the API request. Tool-calling requests are then run against a prompt that never saw the tools, which both degrades tool-use behavior and produces incorrect cache-aware routing decisions (block hashes computed over a truncated prompt).

Change

Forward request.tools (unwrapped via model_dump()) and request.chat_template_kwargs into apply_chat_template, so router-side pre-tokenization produces the same prompt token IDs the worker would have produced on its own — tools and template flags included. No behavior change for requests without tools or chat_template_kwargs.

Test Coverage

Reproduced on DeepSeek-V3.2 1P1D (DEP8 CTX + TEP8 GEN) with an agentic coding benchmark (12 tools per request, chat_template_kwargs={"thinking": true}):

  • cache hit rate: 95.2% → 95.9%

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…rving

KvCacheAwareRouter._tokenize pre-tokenizes ChatCompletionRequests to
compute block hashes for cache-aware routing and sets
request.prompt_token_ids so the worker skips re-tokenization. The
current implementation only forwards 'messages' to
tokenizer.apply_chat_template, ignoring 'tools' and
chat_template_kwargs.

For models whose chat templates render tools into the prompt
(DeepSeek-V3.2 inlines a tool-schema system block via its custom
template), this silently drops the whole tool block from the final
prompt. Tool-calling requests are then run against a prompt that
never saw the tools, which both degrades tool-use behavior and
produces incorrect cache-aware routing decisions because block
hashes are computed over a truncated prompt.

Forward request.tools (unwrapped via model_dump) and
request.chat_template_kwargs to apply_chat_template so router-side
pre-tokenization matches what the worker would have produced.

Signed-off-by: Shicheng Li <shicli@nvidia.com>
@lishicheng1996-nv lishicheng1996-nv requested a review from a team as a code owner May 27, 2026 03:25
@lishicheng1996-nv lishicheng1996-nv requested review from JunyiXu-nv and removed request for a team May 27, 2026 03:25
@lishicheng1996-nv
Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50446 [ run ] triggered by Bot. Commit: 06adbed Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50446 [ run ] completed with state SUCCESS. Commit: 06adbed
/LLM/main/L0_MergeRequest_PR pipeline #39965 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lishicheng1996-nv
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50566 [ run ] triggered by Bot. Commit: 06adbed Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50566 [ run ] completed with state SUCCESS. Commit: 06adbed
/LLM/main/L0_MergeRequest_PR pipeline #40067 completed with status: 'SUCCESS'

CI Report

Link to invocation

@lfr-0531 lfr-0531 merged commit 5cd9ec9 into NVIDIA:feat/deepseek_v4 May 28, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants