[None][feat] use multi thread for kv transfer#13075

Merged
chuangz0 merged 2 commits into NVIDIA:main from chuangz0:multi_thread_for_multi_rank_in_py_cache_transceiver on May 13, 2026

Conversation

@chuangz0
Collaborator

@chuangz0 chuangz0 commented Apr 15, 2026

…one thread

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced thread-safe protection for completion counter updates in concurrent KV cache transfer delivery paths.
  • Performance & Stability

    • Optimized task routing strategy to better utilize multi-threaded execution while maintaining per-peer transaction ordering in distributed inference scenarios.

Description

ctx1dep4_gen4_tep8_deepseek r1

Gen-side KV transfer time (ms)
┌────────┬───────────┬──────────┬────────────┬──────────────────┐
│ Metric │ 4 threads │ 1 thread │ Delta (ms) │ Perf improvement │
├────────┼───────────┼──────────┼────────────┼──────────────────┤
│ mean   │ 12.772    │ 15.229   │ +2.457     │ +19.24%          │
├────────┼───────────┼──────────┼────────────┼──────────────────┤
│ median │ 12.055    │ 14.631   │ +2.576     │ +21.37%          │
├────────┼───────────┼──────────┼────────────┼──────────────────┤
│ p90    │ 14.200    │ 17.535   │ +3.335     │ +23.49%          │
├────────┼───────────┼──────────┼────────────┼──────────────────┤
│ p95    │ 18.809    │ 20.630   │ +1.821     │ +9.68%           │
├────────┼───────────┼──────────┼────────────┼──────────────────┤
│ p99    │ 23.312    │ 29.193   │ +5.881     │ +25.23%          │
├────────┼───────────┼──────────┼────────────┼──────────────────┤
│ max    │ 30.318    │ 47.499   │ +17.181    │ +56.67%          │
└────────┴───────────┴──────────┴────────────┴──────────────────┘
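The last two columns of the table can be reproduced from the two timing columns. The numbers are consistent with the difference being (1 thread minus 4 threads) and the percentage being that difference divided by the 4-thread time. The short sketch below only re-derives the reported figures; the function name `improvement` is made up for illustration.

```python
# Timings copied from the table above: (4 threads, 1 thread), in milliseconds.
timings_ms = {
    "mean": (12.772, 15.229),
    "median": (12.055, 14.631),
    "p90": (14.200, 17.535),
    "p95": (18.809, 20.630),
    "p99": (23.312, 29.193),
    "max": (30.318, 47.499),
}


def improvement(four_threads_ms, one_thread_ms):
    """Return (delta in ms, delta as a percentage of the 4-thread time)."""
    delta = one_thread_ms - four_threads_ms
    return round(delta, 3), round(delta / four_threads_ms * 100, 2)


for metric, (t4, t1) in timings_ms.items():
    delta, pct = improvement(t4, t1)
    print(f"{metric:>6}: +{delta:.3f} ms (+{pct:.2f}%)")
```

Note that the percentage uses the faster (4-thread) time as its base, so it reads as "1 thread is this much slower" rather than "4 threads save this fraction of the 1-thread time".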

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@chuangz0 chuangz0 requested a review from a team as a code owner April 15, 2026 08:08
@chuangz0 chuangz0 requested a review from Shixiaowei02 April 15, 2026 08:08
@chuangz0
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43449 [ run ] triggered by Bot. Commit: a6f6c00 Link to invocation

@coderabbitai
Contributor

coderabbitai Bot commented Apr 15, 2026

📝 Walkthrough

Walkthrough

This PR adds thread-safe synchronization for concurrent KV transfer task completion counters using per-task locks and modifies task queue routing to consider both session ID and peer rank rather than session ID alone. Test files are updated to exercise multi-threaded code paths.
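As a rough illustration of the per-task lock described in the walkthrough, the sketch below guards a completion counter that several worker threads update. It is not the code from `transfer.py`: the class `SendTask`, the field `expected_count`, and the helpers `deliver` and `run_workers` are hypothetical stand-ins; only the `transferred_count` and `lock` names come from the review text.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class SendTask:
    """Hypothetical stand-in for a send task with a completion counter."""
    expected_count: int
    transferred_count: int = 0
    # One lock per task, so unrelated tasks never contend with each other.
    lock: threading.Lock = field(default_factory=threading.Lock)


def deliver(task: SendTask) -> bool:
    """Record one finished transfer; True only for the call that completes the task."""
    with task.lock:
        task.transferred_count += 1
        return task.transferred_count == task.expected_count


def run_workers(task: SendTask, num_threads: int, per_thread: int) -> int:
    """Drive `deliver` from several threads; return how many calls saw completion."""
    completions = []

    def worker():
        for _ in range(per_thread):
            if deliver(task):
                completions.append(True)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(completions)


task = SendTask(expected_count=4000)
finished = run_workers(task, num_threads=4, per_thread=1000)
print(task.transferred_count, finished)
```

Because the increment and the comparison happen under the same lock, exactly one call observes the counter reaching `expected_count`, whichever thread gets there last.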

Changes

Cohort: Core KV Transfer Logic
File(s): tensorrt_llm/_torch/disaggregation/native/transfer.py
Summary: Added per-task threading.Lock instances to protect completion counter updates in the _deliver_kv_to_agent and _deliver_aux_to_agent methods. Changed task queue routing in _enqueue from session-only (unique_rid % num_threads) to peer-aware (hash((unique_rid, peer_rank)) % num_threads).

Cohort: Test Configuration
File(s): tests/unittest/disaggregated/test_kv_transfer.py, tests/unittest/disaggregated/test_kv_transfer_mp.py
Summary: Added a monkeypatch fixture and environment variable configuration that set the KV transfer thread count to 4 during test execution, activating the multi-threaded code paths.
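The routing change summarized above can be sketched in isolation. The two expressions are the ones quoted in the summary; the wrapper functions, the sample `unique_rid`, and the peer-rank range are illustrative assumptions, and the sketch assumes each worker thread drains its own FIFO queue.

```python
def route_by_session(unique_rid: int, num_threads: int) -> int:
    """Old routing: every task of a session lands on the same worker thread."""
    return unique_rid % num_threads


def route_by_session_and_peer(unique_rid: int, peer_rank: int, num_threads: int) -> int:
    """New routing: tasks of one session may spread out, keyed by peer rank."""
    # Python's % returns a non-negative result for a positive modulus,
    # so a negative hash value still yields a valid queue index.
    return hash((unique_rid, peer_rank)) % num_threads


num_threads = 4
unique_rid = 12345  # illustrative request id
peer_ranks = range(8)  # illustrative peer ranks

old_threads = {route_by_session(unique_rid, num_threads) for _ in peer_ranks}
new_threads = {
    rank: route_by_session_and_peer(unique_rid, rank, num_threads)
    for rank in peer_ranks
}

print("session-only routing uses threads:", sorted(old_threads))
print("peer-aware routing per rank:", new_threads)
```

Under session-only routing, all transfers of one request serialize on a single thread no matter how many peers are involved. With the peer-aware key, a given (unique_rid, peer_rank) pair still always maps to the same queue, which is what preserves per-peer ordering, while different peers of the same request can proceed on different threads.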

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 12.50%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check (⚠️ Warning): The PR description is largely incomplete, lacking a clear explanation of the issue, solution rationale, test coverage details, and checklist reasoning. Resolution: complete the Description section with a clear explanation of why multi-threading is needed and how the request_id+peer_rank binding solves the problem. Add specific test coverage details and explain how each test validates the changes.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title clearly and specifically summarizes the main change: enabling multi-threading for KV transfer.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/disaggregation/native/transfer.py (1)

420-435: ⚠️ Potential issue | 🟠 Major

Make the count-to-future transition atomic.

The new lock only protects the increment. done() / set_result() / status updates still happen after releasing it, so another worker can call session.set_exception() in between and turn this into an InvalidStateError or a nondeterministic ERROR/TRANSFERRED outcome. The completion check and future/status resolution need to stay under the same per-task lock in both paths.

Also applies to: 483-495
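The fix the reviewer asks for can be sketched as follows: the count check and the future resolution stay inside the same per-task lock, and the failure path takes that lock too. This is a minimal sketch under assumptions, not the PR's code: the `Task` class, `expected_count`, and the two callback names are hypothetical, the string "SUCCESS" stands in for `AgentResult.SUCCESS`, and it assumes the error path can acquire the same lock.

```python
import threading
from concurrent.futures import Future


class Task:
    """Hypothetical task holding a counter, a lock, and a completion future."""

    def __init__(self, expected_count: int):
        self.expected_count = expected_count
        self.transferred_count = 0
        self.lock = threading.Lock()
        self.future = Future()


def on_transfer_done(task: Task) -> None:
    """Success path: count and resolve in one critical section."""
    with task.lock:
        task.transferred_count += 1
        if task.transferred_count == task.expected_count and not task.future.done():
            task.future.set_result("SUCCESS")


def on_transfer_failed(task: Task, exc: Exception) -> None:
    """Error path: resolve under the same lock so the two paths cannot interleave."""
    with task.lock:
        if not task.future.done():
            task.future.set_exception(exc)


# A failure arriving between two successful transfers wins, and the later
# success does not raise InvalidStateError because done() is checked under the lock.
task = Task(expected_count=2)
on_transfer_done(task)
on_transfer_failed(task, RuntimeError("transfer failed"))
on_transfer_done(task)
print(type(task.future.exception()).__name__)
```

If only the increment were locked, another thread could resolve the future between the `done()` check and `set_result()`, which is the `InvalidStateError` window the comment describes.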

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/native/transfer.py` around lines 420 -
435, The increment of task.transferred_count is raced with the future/status
checks because only the increment is inside with task.lock; move the entire
completion logic (the count comparison, both session.set_exception calls, the
write_meta.task_future.done() check,
write_meta.task_future.set_result(AgentResult.SUCCESS), and task.status =
TaskStatus.ERROR) inside the same with task.lock so the transition from counting
-> error/success is atomic; apply the same change to the other occurrence around
the code referenced (the block at ~483-495) to ensure both paths use the
per-task lock for checking and resolving the future.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/disaggregation/native/transfer.py (1)

153-157: Prefix this lock as internal.

SendTaskBase.lock is only consumed inside this module, so exposing it as a public-looking attribute unnecessarily widens the class surface.

As per coding guidelines, "Variables and functions not part of a class's or module's public interface should be prefixed with an underscore".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/native/transfer.py` around lines 153 -
157, The SendTaskBase exposes a public-looking lock attribute; rename
SendTaskBase.lock to a private attribute (e.g., SendTaskBase._lock) and update
all references in this module to use the new name. Change the constructor to
assign to self._lock and replace any usage sites that call or access self.lock
(including methods or external functions in this file that manipulate the lock)
to use self._lock so the attribute is treated as internal only.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 289-294: The current _enqueue routes by (unique_rid, peer_rank)
causing the same peer_endpoint to hit different threads while
_get_or_connect_dealer() shares a process-global ZMQMessenger per endpoint;
change routing so the same peer endpoint always maps to the same thread (e.g.,
compute thread_idx = hash(write_meta.peer_endpoint) % self._num_threads in
_enqueue) OR make the dealer cache thread-local (store ZMQMessenger cache per
worker thread keyed by endpoint, including thread_idx in the cache key used by
_get_or_connect_dealer()); update references to _send_task_queues, _num_threads,
_enqueue, and _get_or_connect_dealer to use the chosen approach so a DEALER
socket is never shared across threads when KV_TRANSFER_NUM_THREADS > 1.

---

Outside diff comments:
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 420-435: The increment of task.transferred_count is raced with the
future/status checks because only the increment is inside with task.lock; move
the entire completion logic (the count comparison, both session.set_exception
calls, the write_meta.task_future.done() check,
write_meta.task_future.set_result(AgentResult.SUCCESS), and task.status =
TaskStatus.ERROR) inside the same with task.lock so the transition from counting
-> error/success is atomic; apply the same change to the other occurrence around
the code referenced (the block at ~483-495) to ensure both paths use the
per-task lock for checking and resolving the future.

---

Nitpick comments:
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 153-157: The SendTaskBase exposes a public-looking lock attribute;
rename SendTaskBase.lock to a private attribute (e.g., SendTaskBase._lock) and
update all references in this module to use the new name. Change the constructor
to assign to self._lock and replace any usage sites that call or access
self.lock (including methods or external functions in this file that manipulate
the lock) to use self._lock so the attribute is treated as internal only.
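The two options from the inline comment on `_enqueue` can be sketched side by side. This is illustrative only, and the thread does not show whether the merged change adopted either option. `route_by_endpoint`, `get_or_connect_dealer` as written here, and the `connect` callback are hypothetical; `zlib.crc32` is swapped in for the built-in `hash` the comment suggests, purely so the mapping is reproducible across processes.

```python
import threading
import zlib


def route_by_endpoint(peer_endpoint: str, num_threads: int) -> int:
    """Option 1: pin every task for one endpoint to a single worker thread."""
    return zlib.crc32(peer_endpoint.encode("utf-8")) % num_threads


_thread_local = threading.local()


def get_or_connect_dealer(peer_endpoint: str, connect):
    """Option 2: cache one dealer per (thread, endpoint) instead of per endpoint."""
    cache = getattr(_thread_local, "dealers", None)
    if cache is None:
        # First use on this thread: create a private cache.
        cache = _thread_local.dealers = {}
    if peer_endpoint not in cache:
        cache[peer_endpoint] = connect(peer_endpoint)
    return cache[peer_endpoint]


endpoint = "tcp://10.0.0.1:5555"  # illustrative endpoint
print("thread index for endpoint:", route_by_endpoint(endpoint, 4))

# Each thread gets its own dealer object for the same endpoint.
main_dealer = get_or_connect_dealer(endpoint, lambda ep: object())
seen = []
worker = threading.Thread(
    target=lambda: seen.append(get_or_connect_dealer(endpoint, lambda ep: object()))
)
worker.start()
worker.join()
print("dealer shared across threads:", seen[0] is main_dealer)
```

Option 1 keeps a single socket per endpoint but gives up parallelism across requests that target the same endpoint. Option 2 keeps the (unique_rid, peer_rank) routing at the cost of one connection per thread per endpoint.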

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f5daf00b-4bd8-4e3f-b25a-b8799816e863

📥 Commits

Reviewing files that changed from the base of the PR and between 4825da7 and a6f6c00.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/disaggregation/native/transfer.py
  • tests/unittest/disaggregated/test_kv_transfer.py
  • tests/unittest/disaggregated/test_kv_transfer_mp.py

Comment thread tensorrt_llm/_torch/disaggregation/native/transfer.py
@chuangz0 chuangz0 changed the title from "[None][feat] use multi thread for kv transfer and binding request_id+peer_rank in …" to "[None][feat] use multi thread for kv transfer" on Apr 15, 2026
@tensorrt-cicd
Collaborator

PR_Github #43449 [ run ] completed with state SUCCESS. Commit: a6f6c00
/LLM/main/L0_MergeRequest_PR pipeline #33974 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chuangz0 chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from a6f6c00 to d8eef88 Compare April 16, 2026 01:55
@chuangz0
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43605 [ run ] triggered by Bot. Commit: d8eef88 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43605 [ run ] completed with state SUCCESS. Commit: d8eef88
/LLM/main/L0_MergeRequest_PR pipeline #34097 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chuangz0 chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from c7e107e to d58455e Compare April 17, 2026 06:25
@chuangz0
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43989 [ run ] triggered by Bot. Commit: d58455e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43989 [ run ] completed with state SUCCESS. Commit: d58455e
/LLM/main/L0_MergeRequest_PR pipeline #34428 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chuangz0 chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch 2 times, most recently from 601220e to 56b8de6 Compare April 20, 2026 06:04
@chuangz0
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44346 [ run ] triggered by Bot. Commit: 56b8de6 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44346 [ run ] completed with state FAILURE. Commit: 56b8de6
/LLM/main/L0_MergeRequest_PR pipeline #34764 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chuangz0 chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 56b8de6 to 23a3410 Compare April 20, 2026 06:26
@chuangz0
Collaborator Author

/bot run

@chuangz0 chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 23a3410 to 0020d55 Compare April 21, 2026 01:18
@chuangz0
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44547 [ run ] triggered by Bot. Commit: 0020d55 Link to invocation

@chuangz0 chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 0020d55 to c22eb68 Compare April 21, 2026 02:59
@tensorrt-cicd
Collaborator

PR_Github #47100 [ run ] triggered by Bot. Commit: 51684f0 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47100 [ run ] completed with state SUCCESS. Commit: 51684f0
/LLM/main/L0_MergeRequest_PR pipeline #37069 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chuangz0 chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch 2 times, most recently from 0a1709b to ccbda89 Compare May 9, 2026 05:04
@chuangz0
Collaborator Author

chuangz0 commented May 9, 2026

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2, B300-PyTorch-1, DGX_B200-PyTorch-1, DGX_B200-PyTorch-3"

@tensorrt-cicd
Collaborator

PR_Github #47475 [ run ] triggered by Bot. Commit: ccbda89 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47475 [ run ] completed with state FAILURE. Commit: ccbda89
/LLM/main/L0_MergeRequest_PR pipeline #37396 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chuangz0
Collaborator Author

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2, B300-PyTorch-1, DGX_B200-PyTorch-1, DGX_B200-PyTorch-3"

@chuangz0 chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from ccbda89 to 66b7806 Compare May 11, 2026 01:31
@tensorrt-cicd
Collaborator

PR_Github #47632 [ run ] triggered by Bot. Commit: 66b7806 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47632 [ run ] completed with state SUCCESS. Commit: 66b7806
/LLM/main/L0_MergeRequest_PR pipeline #37537 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chuangz0 chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 66b7806 to 1357b80 Compare May 11, 2026 07:16
@chuangz0
Collaborator Author

/bot run --stage-list "B300-PyTorch-1, DGX_B200-PyTorch-1, DGX_B200-PyTorch-3"

@github-actions

👎 Promotion blocked, new vulnerability found

Vulnerability report

Component: python-multipart
Vulnerability: CVE-2024-53981
Severity: HIGH
Description: python-multipart is a streaming multipart parser for Python. When parsing form data, python-multipart skips line breaks (CR \r or LF \n) in front of the first boundary and any trailing bytes after the last boundary. This happens one byte at a time and emits a log event each time, which may cause excessive logging for certain inputs. An attacker could abuse this by sending a malicious request with lots of data before the first or after the last boundary, causing high CPU load and stalling the processing thread for a significant amount of time. In an ASGI application, this could stall the event loop and prevent other requests from being processed, resulting in a denial of service (DoS). This vulnerability is fixed in 0.0.18.

@chuangz0 chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 1357b80 to 6a3c985 Compare May 12, 2026 08:13
@chuangz0
Collaborator Author

/bot run --stage-list "B300-PyTorch-1, DGX_B200-PyTorch-1, DGX_B200-PyTorch-3"

@tensorrt-cicd
Collaborator

PR_Github #47939 [ run ] triggered by Bot. Commit: 6a3c985 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47939 [ run ] completed with state SUCCESS. Commit: 6a3c985
/LLM/main/L0_MergeRequest_PR pipeline #37785 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

chuangz0 added 2 commits May 13, 2026 11:39
…one thread

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
@chuangz0 chuangz0 force-pushed the multi_thread_for_multi_rank_in_py_cache_transceiver branch from 6a3c985 to af0a1ad Compare May 13, 2026 03:39
@chuangz0
Collaborator Author

/bot skip --comment "all test passed"

@chuangz0 chuangz0 requested a review from lfr-0531 May 13, 2026 03:39
@chuangz0 chuangz0 enabled auto-merge (squash) May 13, 2026 03:40
@tensorrt-cicd
Collaborator

PR_Github #48094 [ skip ] triggered by Bot. Commit: af0a1ad Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48094 [ skip ] completed with state SUCCESS. Commit: af0a1ad
Skipping testing for commit af0a1ad

Link to invocation

@chuangz0 chuangz0 merged commit e60f910 into NVIDIA:main May 13, 2026
7 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

4 participants