[https://nvbugs/6104831][fix] Enforce request and buffer index lifecycle integrity#14768
[https://nvbugs/6104831][fix] Enforce request and buffer index lifecycle integrity#14768chienchunhung wants to merge 5 commits into
Conversation
…exists Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
…ctx-side checkContextTransferStatus Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
…cle integrity. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
📝 WalkthroughWalkthroughThis PR migrates the cache transceiver request methods from raw pointer arguments to ChangesCache Transceiver and Disaggregated Safety
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`:
- Around line 555-565: The timeout logging is computed using now - start before
verifying the transfer future completed, causing polling delay to be reported as
a timeout; update the polling/inspection logic (the loop that reads the transfer
future and emits WARNs) so you first check the transfer's completion/readiness
(e.g., future::wait_for(0) or equivalent) and skip timeout calculation/logging
if the future is already ready, then only compute elapsed = now - start and
consult kvTransferTimeoutMs and mTimedOutSenderIds when the transfer remains
incomplete; apply the same fix to the other similar blocks referenced around the
622-640 and 808-834 regions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8433b8e6-f940-404e-99bb-f9f6e4f6d3ae
📒 Files selected for processing (8)
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.hcpp/tensorrt_llm/batch_manager/baseTransBuffer.cppcpp/tensorrt_llm/batch_manager/baseTransBuffer.hcpp/tensorrt_llm/batch_manager/cacheTransceiver.cppcpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cppcpp/tensorrt_llm/executor/cache_transmission/nixl_utils/agentBindings.cppcpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpptensorrt_llm/_torch/pyexecutor/py_executor.py
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
…ve entry Restores the pre-A9/A10 behavior in _check_disagg_gen_transfer_status, _prepare_disagg_gen_init, _executor_loop_pp, and _executor_loop to match upstream/main. Reduces this PR's delta vs main and isolates the rank-symmetric collective-entry changes for separate verification. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
|
/bot run --disable-fail-fast |
|
PR_Github #51140 [ run ] triggered by Bot. Commit: |
|
PR_Github #51140 [ run ] completed with state
|
Summary
First step in landing the disaggregated KV cache transfer hardening work that has been iterating in PR #13713. This PR cherry-picks the subset of those changes that close pre-existing correctness and reliability hazards which exist today independent of in-flight cancellation. The cancellation surface and the related KV-block use-after-free fix are deferred to a follow-up PR.
This split lets the baseline land cleanly while the cancel-surface (a much larger blast radius) goes through review on top of a settled foundation.
What this PR changes and why
All changes are direct ports from PR #13713. Each addresses an issue that exists today regardless of whether in-flight cancellation is enabled.
C++
mSenderFutures/mRequesterFuturesholdstd::shared_ptr<LlmRequest>instead of rawLlmRequest*.Issue resolved: the async KV-transfer worker can outlive the request's scheduling lifetime. When the executor terminates the request and frees its
LlmRequestwhile the worker thread is still mid-transfer, the raw pointer dereferences freed memory — a use-after-free that manifests as rare crashes or KV-cache corruption under load. Shared ownership keeps the request alive until both the scheduler and the worker are done with it. Affects the three async entry points (respondAndSendAsync,requestAndReceiveSync,requestAndReceiveAsync) pluscancelRequest.NIXL agent kept alive while its
TransferStatusexists.Issue resolved: the NIXL agent can be torn down (e.g., during executor recycle or shutdown) while a pending
TransferStatusstill holds a non-owning reference to it. Subsequent operations on theTransferStatusdereference freed agent memory. Shared-ownership extension of the agent's lifetime closes the hazard.New
BufferIndexHolderRAII class inbaseTransBuffer.{h,cpp}, wired at the four buffer-index acquisition sites incacheFormatter.cppandmlaCacheFormatter.cpp(send + recv).Issue resolved: transfer buffer-slot indices acquired during setup are not released when a transfer throws partway through (transient I/O error, peer disconnect, allocation failure). Each leaked slot reduces the pool's usable capacity until the pool is effectively exhausted and new transfers refuse or hang. The RAII holder releases held indices on destruction regardless of normal-return or throw. The explicit
freeBufferIndexFor{Send,Recv}calls are replaced withholder.release()on the happy path; the destructor covers all other exit paths.Observe-only KV-transfer timeout WARN in
checkContextTransferStatus/checkGenTransferStatus, plus themTimedOutSenderIds/mTimedOutRequesterIdsdedup sets that suppress duplicate emissions.Issue resolved: today a wedged disagg KV transfer (peer crash, network partition, slow disk) is silent — the request blocks indefinitely with no operator-visible signal identifying which request is stuck or for how long. The new WARN gives operators a triage anchor. Strictly diagnostic: no state transition, no eviction, no cancellation triggered. Only emitted when
kv_transfer_timeout_msis configured by the user.What this PR explicitly excludes (deferred to follow-up)
No in-flight cancellation logic of any kind. Specifically excluded from PR #13713 to keep this PR's blast radius small:
TRTLLM_DISAGG_ENABLE_INFLIGHT_CANCELenv var,is_disagg_inflight_cancel_enabled(),getEnvDisaggEnableInflightCancel(), and every code path they gated.cancel_requestmid-flight surface inkv_cache_transceiver.py.sendHolder.poison(),mInFlightCancelFlags,has_poisoned_transfer_buffer, per-request cancel-flag propagation throughAgentConnection::send/recvandsendRequestInfo.mSenderFutures/mRequesterFutures._can_terminate_request_now,_handle_errorsdeferred-termination).catch (std::future_error const&)aroundmPromise->set_exception)._disagg_gen_init_prepared_ids,_disagg_gen_kv_recv_started_ids). Those exist to make recv idempotent under the cancel-throw retry pattern; without that pattern in scope, the scheduler's state-based filter is sufficient.setState/receiveAsync/emplaceinrequestAndReceiveAsync(kept upstream ordering:receiveAsync → emplace → setState).setKvCacheTransferStart(LlmRequest::getSteadyClockNow())scaffolding inrequestAndReceiveAsync.Also excluded for now: rank-symmetric collective entry on the gen and ctx sides. PR #13713 removed the
if need_check:/if not recv_reqs: returnearly-exits in_check_disagg_gen_transfer_statusand theif num_fitting_reqs == 0 ...:gate in_executor_loop_pp/_executor_loopto ensure every rank enters the downstream C++gatherRequestIdsAllgather symmetrically. Reverted in this PR to reduce the delta against upstream/main and to isolate any cross-rank divergence introduced by those changes for separate verification. Will be reintroduced in a follow-up if CI shows they are needed.Exclusion verification (zero matches inside this PR's diff):
Follow-up
A subsequent PR will introduce the in-flight cancellation surface end-to-end, gated behind the
TRTLLM_DISAGG_ENABLE_INFLIGHT_CANCELopt-in, including:Splitting the work this way isolates regression risk for the cancel surface to a single opt-in change on top of a settled baseline.