[None][feat] use multi thread for kv transfer by chuangz0 · Pull Request #13075 · NVIDIA/TensorRT-LLM

chuangz0 · 2026-04-15T08:08:04Z

…one thread

Summary by CodeRabbit

Bug Fixes
- Enhanced thread-safe protection for completion counter updates in concurrent KV cache transfer delivery paths.
Performance & Stability
- Optimized task routing strategy to better utilize multi-threaded execution while maintaining per-peer transaction ordering in distributed inference scenarios.

Description

ctx1dep4_gen4_tep8_deepseek r1

Gen-side KV transfer time (ms)
┌────────┬───────────┬──────────┬─────────────────┬──────────────────┐
│ Metric │ 4 threads │ 1 thread │ Difference (ms) │ Perf improvement │
├────────┼───────────┼──────────┼─────────────────┼──────────────────┤
│ mean   │ 12.772    │ 15.229   │ +2.457          │ +19.24%          │
├────────┼───────────┼──────────┼─────────────────┼──────────────────┤
│ median │ 12.055    │ 14.631   │ +2.576          │ +21.37%          │
├────────┼───────────┼──────────┼─────────────────┼──────────────────┤
│ p90    │ 14.200    │ 17.535   │ +3.335          │ +23.49%          │
├────────┼───────────┼──────────┼─────────────────┼──────────────────┤
│ p95    │ 18.809    │ 20.630   │ +1.821          │ +9.68%           │
├────────┼───────────┼──────────┼─────────────────┼──────────────────┤
│ p99    │ 23.312    │ 29.193   │ +5.881          │ +25.23%          │
├────────┼───────────┼──────────┼─────────────────┼──────────────────┤
│ max    │ 30.318    │ 47.499   │ +17.181         │ +56.67%          │
└────────┴───────────┴──────────┴─────────────────┴──────────────────┘
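
The last two columns can be re-derived from the first two. The reported numbers are consistent with the difference being the 1-thread time minus the 4-thread time, and the percentage being that difference divided by the 4-thread time. A minimal sketch that recomputes them (the helper name `improvement` is made up for illustration and is not part of the PR):

```python
# Reported gen-side KV transfer times in ms, as (4 threads, 1 thread).
timings_ms = {
    "mean": (12.772, 15.229),
    "median": (12.055, 14.631),
    "p90": (14.200, 17.535),
    "p95": (18.809, 20.630),
    "p99": (23.312, 29.193),
    "max": (30.318, 47.499),
}


def improvement(four_threads: float, one_thread: float) -> tuple[float, float]:
    """Return (difference in ms, difference as a percentage of the 4-thread time)."""
    diff = one_thread - four_threads
    return round(diff, 3), round(100.0 * diff / four_threads, 2)


for metric, (t4, t1) in timings_ms.items():
    diff, pct = improvement(t4, t1)
    print(f"{metric:>6}: +{diff:.3f} ms  +{pct:.2f}%")
```

Note that the percentage baseline is the faster (4-thread) time, so the figures read as "the single-thread path is this much slower" rather than as a reduction relative to the single-thread time.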

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

chuangz0 · 2026-04-15T08:08:31Z

/bot run

tensorrt-cicd · 2026-04-15T08:14:44Z

PR_Github #43449 [ run ] triggered by Bot. Commit: a6f6c00 Link to invocation

coderabbitai · 2026-04-15T08:18:08Z

📝 Walkthrough

Walkthrough

This PR adds thread-safe synchronization for concurrent KV transfer task completion counters using per-task locks and modifies task queue routing to consider both session ID and peer rank rather than session ID alone. Test files are updated to exercise multi-threaded code paths.

Changes

Cohort / File(s)	Summary
Core KV Transfer Logic `tensorrt_llm/_torch/disaggregation/native/transfer.py`	Added per-task `threading.Lock` instances to protect completion counter updates in `_deliver_kv_to_agent` and `_deliver_aux_to_agent` methods. Modified task queue routing in `_enqueue` from session-only-based (`unique_rid % num_threads`) to peer-aware (`hash((unique_rid, peer_rank)) % num_threads`).
Test Configuration `tests/unittest/disaggregated/test_kv_transfer.py`, `tests/unittest/disaggregated/test_kv_transfer_mp.py`	Added monkeypatch fixture and environment variable configuration to set KV transfer thread count to 4 during test execution, activating multi-threaded code paths.
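
To illustrate the routing change summarized above, here is a minimal sketch of the two queue-selection formulas quoted in the table. Only the expressions `unique_rid % num_threads` and `hash((unique_rid, peer_rank)) % num_threads` come from the summary; the function names and sample values are assumptions for illustration, not the code in `transfer.py`.

```python
def route_by_session(unique_rid: int, num_threads: int) -> int:
    """Old scheme: every transfer of a session lands on one worker queue."""
    return unique_rid % num_threads


def route_by_session_and_peer(unique_rid: int, peer_rank: int, num_threads: int) -> int:
    """New scheme: the (session, peer) pair selects the worker queue."""
    # Python's % returns a non-negative index for a positive modulus,
    # even when hash() is negative.
    return hash((unique_rid, peer_rank)) % num_threads


num_threads = 4
unique_rid = 1234  # hypothetical session/request id
peer_ranks = range(8)  # hypothetical peer ranks

old_threads = {route_by_session(unique_rid, num_threads) for _ in peer_ranks}
new_threads = {route_by_session_and_peer(unique_rid, p, num_threads) for p in peer_ranks}

print("session-only routing uses threads:", sorted(old_threads))
print("peer-aware routing uses threads:  ", sorted(new_threads))
```

With session-only routing, all peers of one request share a single worker, so transfers to different peers serialize. With the peer-aware key, a given (request, peer) pair still always maps to the same worker, which keeps per-peer ordering, while different peers of the same request can be served by different workers.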

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	⚠️ Warning	The PR description is largely incomplete, lacking clear explanation of the issue, solution rationale, test coverage details, and checklist reasoning.	Complete the Description section with a clear explanation of why multi-threading is needed and how the request_id+peer_rank binding solves the problem. Add specific test coverage details and explain how each test validates the changes.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main change: enabling multi-threading for KV transfer.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/disaggregation/native/transfer.py (1)
420-435: ⚠️ Potential issue | 🟠 Major

Make the count-to-future transition atomic.

The new lock only protects the increment. done() / set_result() / status updates still happen after releasing it, so another worker can call session.set_exception() in between and turn this into an InvalidStateError or a nondeterministic ERROR/TRANSFERRED outcome. The completion check and future/status resolution need to stay under the same per-task lock in both paths.

Also applies to: 483-495
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/native/transfer.py` around lines 420 -
435, The increment of task.transferred_count is raced with the future/status
checks because only the increment is inside with task.lock; move the entire
completion logic (the count comparison, both session.set_exception calls, the
write_meta.task_future.done() check,
write_meta.task_future.set_result(AgentResult.SUCCESS), and task.status =
TaskStatus.ERROR) inside the same with task.lock so the transition from counting
-> error/success is atomic; apply the same change to the other occurrence around
the code referenced (the block at ~483-495) to ensure both paths use the
per-task lock for checking and resolving the future.
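
The pattern this comment asks for can be sketched as follows. This is a simplified stand-in under stated assumptions, not the code in `transfer.py`: `SendTask`, `deliver_one`, and the `"SUCCESS"` string are hypothetical, and a plain `concurrent.futures.Future` replaces the real task future and session objects. The point is that the increment, the completion check, and the future resolution all happen while the same per-task lock is held.

```python
import threading
from concurrent.futures import Future


class SendTask:
    """Hypothetical stand-in for a send task; not the real SendTaskBase."""

    def __init__(self, expected_count: int) -> None:
        self.expected_count = expected_count
        self.transferred_count = 0
        self.future: Future = Future()
        self.lock = threading.Lock()


def deliver_one(task: SendTask, ok: bool = True) -> None:
    """Record one finished transfer and resolve the future atomically."""
    with task.lock:
        # Another worker may already have resolved the task (success or error).
        if task.future.done():
            return
        if not ok:
            task.future.set_exception(RuntimeError("transfer failed"))
            return
        task.transferred_count += 1
        # The check and set_result stay under the lock, so no other worker
        # can resolve the future between them.
        if task.transferred_count == task.expected_count:
            task.future.set_result("SUCCESS")


task = SendTask(expected_count=64)
workers = [threading.Thread(target=deliver_one, args=(task,)) for _ in range(64)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(task.future.result(timeout=1))
```

If only the increment were locked, two workers could both observe an unresolved future and both try to resolve it, and the second call would raise `InvalidStateError`.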

🧹 Nitpick comments (1)

tensorrt_llm/_torch/disaggregation/native/transfer.py (1)
153-157: Prefix this lock as internal.

SendTaskBase.lock is only consumed inside this module, so exposing it as a public-looking attribute unnecessarily widens the class surface.

As per coding guidelines, "Variables and functions not part of a class's or module's public interface should be prefixed with an underscore".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/native/transfer.py` around lines 153 -
157, The SendTaskBase exposes a public-looking lock attribute; rename
SendTaskBase.lock to a private attribute (e.g., SendTaskBase._lock) and update
all references in this module to use the new name. Change the constructor to
assign to self._lock and replace any usage sites that call or access self.lock
(including methods or external functions in this file that manipulate the lock)
to use self._lock so the attribute is treated as internal only.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 289-294: The current _enqueue routes by (unique_rid, peer_rank)
causing the same peer_endpoint to hit different threads while
_get_or_connect_dealer() shares a process-global ZMQMessenger per endpoint;
change routing so the same peer endpoint always maps to the same thread (e.g.,
compute thread_idx = hash(write_meta.peer_endpoint) % self._num_threads in
_enqueue) OR make the dealer cache thread-local (store ZMQMessenger cache per
worker thread keyed by endpoint, including thread_idx in the cache key used by
_get_or_connect_dealer()); update references to _send_task_queues, _num_threads,
_enqueue, and _get_or_connect_dealer to use the chosen approach so a DEALER
socket is never shared across threads when KV_TRANSFER_NUM_THREADS > 1.
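
The second option in this comment, a dealer cache that is local to each worker thread, can be sketched with `threading.local`. Everything here is an illustrative assumption: `FakeDealer` stands in for the real messenger object, the endpoint string is made up, and `get_or_connect_dealer` is a simplified free function rather than the actual `_get_or_connect_dealer` method.

```python
import threading


class FakeDealer:
    """Hypothetical stand-in for a per-endpoint messenger wrapping a DEALER socket."""

    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint
        self.owner_thread = threading.get_ident()


# Each thread sees its own attributes on this object.
_thread_state = threading.local()


def get_or_connect_dealer(endpoint: str) -> FakeDealer:
    """Return this thread's dealer for the endpoint, creating it on first use."""
    cache = getattr(_thread_state, "dealers", None)
    if cache is None:
        cache = _thread_state.dealers = {}
    dealer = cache.get(endpoint)
    if dealer is None:
        dealer = cache[endpoint] = FakeDealer(endpoint)
    return dealer


results = {}


def worker(index: int) -> None:
    endpoint = "tcp://peer-0:5555"  # hypothetical endpoint
    # Two lookups in the same thread should return the same object.
    results[index] = (get_or_connect_dealer(endpoint), get_or_connect_dealer(endpoint))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0][0] is results[0][1], results[0][0] is results[1][0])
```

Because the cache lives in thread-local storage, two workers talking to the same endpoint each get their own dealer, and no socket is touched from more than one thread. The cost is one connection per (thread, endpoint) pair instead of one per endpoint.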


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f5daf00b-4bd8-4e3f-b25a-b8799816e863

📥 Commits

Reviewing files that changed from the base of the PR and between 4825da7 and a6f6c00.

📒 Files selected for processing (3)

tensorrt_llm/_torch/disaggregation/native/transfer.py
tests/unittest/disaggregated/test_kv_transfer.py
tests/unittest/disaggregated/test_kv_transfer_mp.py

tensorrt-cicd · 2026-04-15T11:18:58Z

PR_Github #43449 [ run ] completed with state SUCCESS. Commit: a6f6c00
/LLM/main/L0_MergeRequest_PR pipeline #33974 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

chuangz0 · 2026-04-16T01:56:31Z

/bot run

tensorrt-cicd · 2026-04-16T02:02:38Z

PR_Github #43605 [ run ] triggered by Bot. Commit: d8eef88 Link to invocation

tensorrt-cicd · 2026-04-16T04:52:49Z

PR_Github #43605 [ run ] completed with state SUCCESS. Commit: d8eef88
/LLM/main/L0_MergeRequest_PR pipeline #34097 completed with status: 'FAILURE'

Link to invocation

chuangz0 · 2026-04-17T06:36:39Z

/bot run

tensorrt-cicd · 2026-04-17T06:42:00Z

PR_Github #43989 [ run ] triggered by Bot. Commit: d58455e Link to invocation

tensorrt-cicd · 2026-04-18T05:30:24Z

PR_Github #43989 [ run ] completed with state SUCCESS. Commit: d58455e
/LLM/main/L0_MergeRequest_PR pipeline #34428 completed with status: 'FAILURE'

Link to invocation

chuangz0 · 2026-04-20T06:05:12Z

/bot run

tensorrt-cicd · 2026-04-20T06:13:52Z

PR_Github #44346 [ run ] triggered by Bot. Commit: 56b8de6 Link to invocation

tensorrt-cicd · 2026-04-20T06:24:36Z

PR_Github #44346 [ run ] completed with state FAILURE. Commit: 56b8de6
/LLM/main/L0_MergeRequest_PR pipeline #34764 completed with status: 'FAILURE'

Link to invocation

chuangz0 · 2026-04-20T06:27:40Z

/bot run

chuangz0 · 2026-04-21T01:19:00Z

/bot run

tensorrt-cicd · 2026-04-21T01:24:43Z

PR_Github #44547 [ run ] triggered by Bot. Commit: 0020d55 Link to invocation

tensorrt-cicd · 2026-05-07T03:38:43Z

PR_Github #47100 [ run ] triggered by Bot. Commit: 51684f0 Link to invocation

tensorrt-cicd · 2026-05-07T12:44:24Z

PR_Github #47100 [ run ] completed with state SUCCESS. Commit: 51684f0
/LLM/main/L0_MergeRequest_PR pipeline #37069 completed with status: 'FAILURE'

CI Agent Failure Analysis

Link to invocation

chuangz0 · 2026-05-09T05:08:38Z

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2, B300-PyTorch-1, DGX_B200-PyTorch-1, DGX_B200-PyTorch-3"

tensorrt-cicd · 2026-05-09T05:14:41Z

PR_Github #47475 [ run ] triggered by Bot. Commit: ccbda89 Link to invocation

tensorrt-cicd · 2026-05-09T06:05:47Z

PR_Github #47475 [ run ] completed with state FAILURE. Commit: ccbda89
/LLM/main/L0_MergeRequest_PR pipeline #37396 (Partly Tested) completed with status: 'FAILURE'

CI Agent Failure Analysis

Link to invocation

chuangz0 · 2026-05-11T01:31:35Z

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2, B300-PyTorch-1, DGX_B200-PyTorch-1, DGX_B200-PyTorch-3"

tensorrt-cicd · 2026-05-11T01:37:47Z

PR_Github #47632 [ run ] triggered by Bot. Commit: 66b7806 Link to invocation

tensorrt-cicd · 2026-05-11T04:35:16Z

PR_Github #47632 [ run ] completed with state SUCCESS. Commit: 66b7806
/LLM/main/L0_MergeRequest_PR pipeline #37537 (Partly Tested) completed with status: 'FAILURE'

Link to invocation

chuangz0 · 2026-05-11T07:18:07Z

/bot run --stage-list "B300-PyTorch-1, DGX_B200-PyTorch-1, DGX_B200-PyTorch-3"

github-actions · 2026-05-11T07:23:02Z

👎 Promotion blocked, new vulnerability found

Vulnerability report

Component	Vulnerability	Description	Severity
python-multipart	CVE-2024-53981	python-multipart is a streaming multipart parser for Python. When parsing form data, python-multipart skips line breaks (CR \r or LF \n) in front of the first boundary and any tailing bytes after the last boundary. This happens one byte at a time and emits a log event each time, which may cause excessive logging for certain inputs. An attacker could abuse this by sending a malicious request with lots of data before the first or after the last boundary, causing high CPU load and stalling the processing thread for a significant amount of time. In case of ASGI application, this could stall the event loop and prevent other requests from being processed, resulting in a denial of service (DoS). This vulnerability is fixed in 0.0.18.	HIGH

chuangz0 · 2026-05-12T08:15:46Z

/bot run --stage-list "B300-PyTorch-1, DGX_B200-PyTorch-1, DGX_B200-PyTorch-3"

tensorrt-cicd · 2026-05-12T08:21:35Z

PR_Github #47939 [ run ] triggered by Bot. Commit: 6a3c985 Link to invocation

tensorrt-cicd · 2026-05-12T12:43:49Z

PR_Github #47939 [ run ] completed with state SUCCESS. Commit: 6a3c985
/LLM/main/L0_MergeRequest_PR pipeline #37785 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

…one thread Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

chuangz0 · 2026-05-13T03:39:26Z

/bot skip --comment "all test passed"

tensorrt-cicd · 2026-05-13T03:46:29Z

PR_Github #48094 [ skip ] triggered by Bot. Commit: af0a1ad Link to invocation

tensorrt-cicd · 2026-05-13T03:52:46Z

PR_Github #48094 [ skip ] completed with state SUCCESS. Commit: af0a1ad
Skipping testing for commit af0a1ad

Link to invocation

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

chuangz0 requested a review from a team as a code owner April 15, 2026 08:08

chuangz0 requested review from HuiGao-NV and leslie-fang25 April 15, 2026 08:08

github-actions Bot assigned chuangz0 Apr 15, 2026

chuangz0 requested a review from Shixiaowei02 April 15, 2026 08:08

coderabbitai Bot reviewed Apr 15, 2026

Comment thread tensorrt_llm/_torch/disaggregation/native/transfer.py

chuangz0 changed the title ~~[None][feat] use multi thread for kv transfer and binding request_id+peer_rank in …~~ [None][feat] use multi thread for kv transfer Apr 15, 2026

chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from a6f6c00 to d8eef88 Compare April 16, 2026 01:55

chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from c7e107e to d58455e Compare April 17, 2026 06:25

chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch 2 times, most recently from 601220e to 56b8de6 Compare April 20, 2026 06:04

chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 56b8de6 to 23a3410 Compare April 20, 2026 06:26

chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 23a3410 to 0020d55 Compare April 21, 2026 01:18

chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 0020d55 to c22eb68 Compare April 21, 2026 02:59

chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch 2 times, most recently from 0a1709b to ccbda89 Compare May 9, 2026 05:04

chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from ccbda89 to 66b7806 Compare May 11, 2026 01:31

chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 66b7806 to 1357b80 Compare May 11, 2026 07:16

chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 1357b80 to 6a3c985 Compare May 12, 2026 08:13

chuangz0 added 2 commits May 13, 2026 11:39

use multi thread for kv transfer and binding request_id+peer_rank in …

d09ac95

…one thread Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

zmq socker per thread

af0a1ad

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 6a3c985 to af0a1ad Compare May 13, 2026 03:39

chuangz0 requested a review from lfr-0531 May 13, 2026 03:39

chuangz0 enabled auto-merge (squash) May 13, 2026 03:40

Shixiaowei02 approved these changes May 13, 2026

lfr-0531 approved these changes May 13, 2026

chuangz0 merged commit e60f910 into NVIDIA:main May 13, 2026
7 checks passed

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

[None][feat] use multi thread for kv transfer (NVIDIA#13075)

f30b6bc

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
