[https://nvbugs/6104831][fix] Disaggregated KV transfer: lifecycle, cancellation, and quiescence hardening #13713
Draft
chienchunhung wants to merge 8 commits into NVIDIA:main from
Conversation
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 3, 2026
Adds 00-tldr.md as the front door to the investigation. Targets a 10-minute read with two inline Mermaid figures:
- The disagg architecture + cancellation flow (where the bugs live).
- The L1-L8 → combo PR NVIDIA#13713 fix mapping (which piece closes which layer).

Written under the assumption that PR NVIDIA#13713 (combo: PR NVIDIA#13056 + PR NVIDIA#13495 + eval-order fix + Python idempotency guards) is the chosen final solution. Doesn't compare approaches A/B/C — that comparison lives in 06-fix-approaches/README.md for readers who want depth on the landing decision.

Section structure:
1. What is broken (the wedge symptom).
2. Where the bugs live (architecture figure).
3. Root cause: eight invariant gaps (L1-L8 table).
4. The fix (combo figure + per-piece breakdown).
5. Does it work (empirical recovery table).
6. Why this took 8 days to find (cascade pattern).
7. What is left to do.
8. Where to go from here (reader paths to deeper files).

README.md updated to put 00-tldr.md at the top of the navigation table and add a 10-minute reading path entry.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
631975c to 262796c
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 5, 2026
…rt, and L9 invariant

Update the NVBug 6104831 investigation report to reflect the current state of the combo fix path (PR NVIDIA#13713):
- Fold PR NVIDIA#13728 (fail-closed on unquiesced disagg KV transfer) into the combo as a fifth piece, plus the MLA-formatter port that closes the same hazard for DeepSeek-style models.
- Add L9 (transport quiescence on unsafe exit) to the defect-class stack as a defense-in-depth memory-safety invariant. L1-L8 are wedge-prevention; L9 is the rip-cord that prevents silent buffer-pool corruption on cancel/exception when transport quiescence is unknown.
- Update the empirical results tables across README.md, 00-tldr.md, 06-fix-approaches/D-combo.md, and 06-fix-approaches/README.md to show CONC=128 (3-pair) review-fix-v3 5/5 PASS post-fold-in plus the prior pre-fold-in stress runs through CONC=256.
- Add MLA-stress validation as a follow-up item and motivate the RAII-audit task with the MLA-formatter finding.
- Update the PR map to include NVIDIA#13728 and the chained sig regression PRs that the combo retains as test scaffolding.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
yifjiang added a commit to yifjiang/TensorRT-LLM that referenced this pull request on May 6, 2026
…r queues

Follow-up to the prior commit. The worker queues inside CacheSender::Impl and CacheReceiver::Impl still held raw LlmRequest* pointers — meaning the executor's mSenderFutures / mRequesterFutures pinned the lifetime but the worker thread that actually dereferences the request had only a raw observer. If the executor erased its tracking entry while the worker was mid-flight, the LlmRequest could be freed under the worker. This commit closes that surface:
* CacheSender::Impl::Response::mRequest → std::shared_ptr<LlmRequest>.
* CacheReceiver::Impl::RequestAndPromise::mRequest → std::shared_ptr<LlmRequest>. Move/copy semantics simplified now that the field is a smart pointer.
* CacheSender::sendAsync(LlmRequest&) → sendAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::receiveAsync(LlmRequest&) → receiveAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::Impl::requestAndReceiveAsyncMultiThreads similarly.
* CacheReceiver::Impl::receiveAsync now captures the shared_ptr by value in the std::async lambda so the worker thread pins the LlmRequest independently of the caller.
* CacheTransceiver::respondAndSendAsync/-LayerWise/requestAndReceive* pass the shared_ptr (no longer .get()-strip it) and move into the TrackedFuture entry where appropriate.

Also fixes the eval-order UAF in CacheSender::Impl::handleAsyncSend that PR NVIDIA#13713 / PR NVIDIA#13728 called out: once Response::mRequest is a shared_ptr, the one-liner sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp)); becomes undefined behaviour because C++ argument evaluation order is unspecified — the compiler may evaluate std::move(resp) first, leaving resp.mRequest empty when reading mRequestId. Materialise the id into a local before the move.

The PyCacheTransceiver trampoline's bool cancelRequest override is updated to match the new shared_ptr signature.

The doc at docs/source/features/disagg-kv-transfer-session-lifecycle.md spells out the full ownership chain: executor tracking + worker queue both hold shared_ptr, and TransferSession::mRequest is an ephemeral observer used only inside the worker frame where the shared_ptr is already held.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/TensorRT-LLM that referenced this pull request on May 6, 2026
…g, structured cancel, bounded quarantine

The C++ cache transceiver and Python executor cooperate on disaggregated KV transfer cancellation, timeout, and process health. Earlier attempts (PR NVIDIA#13706, PR NVIDIA#13713, PR NVIDIA#13728) each fixed one slice of the problem but still hung in the field:
* PR NVIDIA#13706 with Python changes restarted workers on every routine per-request timeout (Python fail-close was too aggressive).
* PR NVIDIA#13713 / PR NVIDIA#13728 hung even with Python cleanup removed because the C++ status polling could still call future.get() on an unready worker future, freezing the executor event loop while NIXL/UCX was wedged.

This PR keeps each layer's responsibility separate, as described in docs/source/features/disagg-kv-transfer-hang-restart-analysis.md and the new docs/source/features/disagg-kv-transfer-session-lifecycle.md:
* checkContextTransferStatus / checkGenTransferStatus are now strictly non-blocking. They poll with wait_for(0ms) and only call future.get() when the future is already ready. atLeastRequestNum > 0 still admits additional ready entries but never selects an unready one to satisfy the count. drainContextTransferStatus / drainGenTransferStatus are the only blocking variants and are intended for shutdown drain.
* cancelRequestStructured returns a TransferCancelResult enum with six outcomes (NotFound, AlreadyComplete, CancelledBeforeAdvertise, CancelRequestedInFlight, BackendUnhealthy, NotCancellable). Only the first three permit Python to free request resources; the others keep the request owned by C++ until the worker future reaches a final state. The historical bool cancelRequest is preserved as a backward-compatible wrapper.
* The transceiver maintains a bounded quarantine counter and a global progress deadline. A per-request timeout marks the entry quarantined but keeps the future pinned so NIXL/UCX cannot write into freed memory. If quarantined entries exceed mQuarantineBudget or no worker has reached a final state for longer than mGlobalProgressDeadlineMs, isHealthy() flips false and getHealth() surfaces the snapshot for orchestration.
* PyExecutor adds _can_terminate_request_now / _inflight_cancel_requested_ids so _do_terminate_request defers freeing resources for any request that is still in disagg transfer state or whose cancel is in flight. Per-request timeouts no longer clear active_requests, the waiting queue, or set is_shutdown — that was the PR NVIDIA#13706 restart-loop trigger and is explicitly forbidden by the analysis doc.
* The ADP synchronized pending-response flush from PR NVIDIA#13112 is ported here (the base commit precedes that merge). Transfer-completion responses created in _end_transfer_and_maybe_terminate are buffered and flushed at synchronised loop points so every DP rank participates in the tp_gather collective.

Test coverage:
* tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py — 14 unit tests for the deferred-terminate guard, the structured-cancel decision tree, and the ADP flush symmetry. They run without GPU or MPI so they fail fast in pre-merge CI.

References:
* analysis: docs/source/features/disagg-kv-transfer-hang-restart-analysis.md
* contract: docs/source/features/disagg-kv-transfer-session-lifecycle.md

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
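The strictly non-blocking polling contract this commit describes can be sketched in Python with `concurrent.futures` (the names `check_transfer_status` and `at_least_num` are illustrative stand-ins, not the real C++ API): only futures that are already done are harvested, and the minimum count never forces a wait on an unready future.

```python
from concurrent.futures import Future
from typing import Dict, List

def check_transfer_status(futures: Dict[int, Future], at_least_num: int) -> List[int]:
    """Return request ids whose transfer future is terminal, without blocking.

    at_least_num is a target, not a guarantee: if fewer than at_least_num
    futures are ready, we return what we have rather than block on an
    unready future (blocking here is the bug this change fixes).
    """
    done = [req_id for req_id, fut in futures.items() if fut.done()]
    for req_id in done:
        futures.pop(req_id).result()  # safe: already done, result() cannot block
    return done

f1, f2 = Future(), Future()
f1.set_result(None)                  # f1 is ready
pending = {1: f1, 2: f2}             # f2 is still in flight
assert check_transfer_status(pending, at_least_num=2) == [1]
assert 2 in pending                  # the unready future stays pinned, untouched
```

Under this contract a wedged NIXL/UCX worker leaves its future permanently unready, but the executor loop keeps spinning; timeout handling, not the poll, is responsible for escalating.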
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 7, 2026
…sig NVIDIA#8, Phase 15

The combo (PR NVIDIA#13713 + NVIDIA#13728 fold + MLA port) regressed when applied to rc13: with rc13's default-on block reuse, `_handle_responses`'s early-termination branch and `_end_transfer_and_maybe_terminate` can each refuse termination under the right timing, leaving the request with no cleanup owner. Server hangs on scenarios that succeeded on rc11.

Captures this in the investigation report:
- 03-defect-class-stack.md: add L10 (redundant block-reuse cleanup mechanism on the disagg path) with code sites, customer-visible symptom, and the latent symptoms the Phase 1 stop-gap leaves open (pin leak on cancel/timeout, PP > 1 disagg without block reuse, eviction race in the unpin → release window, regression risk on adjacent code).
- 03-defect-class-stack.md: extend the layer-to-signature mermaid with L10 → sig NVIDIA#8 plus a dotted edge to the latent-symptom set.
- 02-failure-signatures.md: add sig NVIDIA#8 (rc13 server hang under disagg + block reuse + in-flight cancel) with full root-cause, short-term stop-gap, and medium-term Phase 2 plan.
- 05-investigation-timeline.md: add Phase 15 documenting the rc13 regression discovery, the two competing fix proposals, and the recommended staged plan.
- 08-next-steps-and-pr-map.md: split item 2 into "land combo with rc13 stop-gap" and a new item 2a "land Phase 2 of the block-reuse-overlap-scheduler design".
- README.md: add the rc13 caveat callout in the status section; update navigation language from L1-L8/seven-sig to L1-L10/eight-sig.

Adds incentive cross-reference in the existing design doc:
- docs/design/block-reuse-overlap-scheduler/README.md: promote Phase 2 status from "deprioritised" to "load-bearing for stable disagg block reuse" with cross-link to the rc13 evidence.
- docs/design/block-reuse-overlap-scheduler/phase2-unify-reuse-mechanisms.md: add an "Empirical confirmation: the rc13 regression" section at the top, framing Phase 2 as the architectural answer the rc13 bug predicted.
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 7, 2026
Adds 09-executive-summary-rc11-to-rc13.md as the 15-minute read for someone briefing the full journey: original rc11 wedge, how PR NVIDIA#13713 solved it, why it regressed on rc13 (block reuse → L10 dual-path), how the short-term stop-gap unblocks rc13, and why the design doc's Phase 2 is the right long-term answer.

Five Mermaid figures:
- L1-L9 layer stack as the rc11 root cause framework
- PR NVIDIA#13713's four-piece composition with L1-L9 mapping
- The L10 dual-path: both cleanup paths refusing termination on rc13
- Stop-gap coverage vs latent symptoms it leaves open
- Add-coordination vs delete-redundancy comparison + staged plan

Updates README.md navigation table and reading paths to point at the new file with a "15 minutes / brief someone on the full journey" entry. The file extends 00-tldr.md (10-minute read on rc11 only). Where 00-tldr stops at "the rc11 wedge is fixed", 09 picks up at the rc13 regression discovery and walks through to the architectural follow-up. Designed for an exec briefing or a teammate joining mid-investigation.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 7, 2026
…cancellation/cleanup invariants
Two changes to 09-executive-summary-rc11-to-rc13.md per review feedback:
1. Treat the file as an independent report. Removed the "extends
00-tldr.md with the rc13 chapter" framing; removed the inline
"> Detail: <other-file>.md" pointers from each section; restated
the ten invariants in-place rather than referring readers to
03-defect-class-stack.md to follow along; moved the cross-file
pointers to a clearly-marked "Optional deep-dive pointers"
appendix at the end. A reader can now walk away with a complete
understanding of the bug class, the fix, the regression, and the
plans without needing to consult any other file.
2. Add a new section 2 "The invariants for correct request
cancellation and cleanup" that names the ten invariants the bug
class violates, organised into four architectural categories:
- Lifetime invariants (L2/L7/L9): what stays alive while
transfers are in flight.
- Resource invariants (L5/L6): every acquired resource must
release on every exit.
- Synchronization invariants (L1/L3/L4): no thread waits forever.
- Coordination invariants (L8/L10): exactly one owner, no
implicit handoffs.
Each invariant is named, defined, and explained in terms of why it
matters (with the rc11/rc13 evidence that violated it). Section 3
onwards then describes PR NVIDIA#13713, the rc13 regression, the stop-gap,
and the architectural fix in terms of which invariants they enforce
or fail to enforce.
A new Mermaid figure visualises the four invariant categories so a
reader can see the architectural hierarchy at a glance: lifetime
governs who is alive, resource governs what is held, synchronization
governs who is waiting, coordination governs who is responsible.
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 7, 2026
…p-dive pointers

Per review feedback:
- Title now just mentions the NVBug number: "09 — NVBug 6104831 Executive Summary".
- Preface trimmed to one sentence that briefly mentions PR NVIDIA#13713 and the rc11/rc13 contrast. Audience block, reading-time block, and self-contained note removed.
- "Optional deep-dive pointers" appendix removed entirely.

The file is now strictly the rc11 → rc13 narrative with the four-category invariant model, no meta-framing or cross-reference appendix.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
e4c232d to 87c4bb8
87c4bb8 to 484655e
Collaborator (Author): /bot run --disable-fail-fast --add-multi-gpu-test
Collaborator: PR_Github #47132 [ run ] triggered by Bot. Commit:
484655e to 5719008
tl;dr
This PR fixes a permanent disaggregated-serving wedge triggered by long-prompt, cancellation-heavy traffic in the PyTorch backend. The root cause is a stack of cleanup/lifetime invariant gaps across C++ KV transfer, Python request termination, NIXL cancellation, and block reuse. The final solution makes disaggregated KV cleanup transport-aware and block-reuse-safe: cancelled requests stop scheduling immediately, but KV resources are only released after C++ transfer status becomes terminal, and deferred termination is retried exactly once when it becomes safe.
Background
Disaggregated serving splits a request across a context (prefill) worker that produces the KV cache and a generation (decode) worker that consumes it, with the KV cache transferred between them over NIXL/UCX.
Under long-prompt bursts with client-side cancellations, the old cleanup path could leave the deployment alive but permanently non-responsive: workers stayed up, health endpoints could still return 200, but every post-burst request timed out.
The investigation found that this was not one bug. It was a stack of independent lifecycle gaps. Fixing one exposed the next.
Failure signatures observed
- `std::future_error: Broken promise` — `checkGenTransferStatus(atLeastNum=1)` could call `future.get()` on an unready future
- `std::future_error: Broken promise` — the `requestSync()` early-return path leaked a recv-buffer slot
- Use-after-free on `LlmRequest` — closed by moving async send/receive state to `shared_ptr<LlmRequest>`
Correctness model
The central invariant is:
Python request state alone is not sufficient. With block reuse enabled, a cancelled request may still have KV blocks referenced by the reuse tree or pinned by an in-flight context transfer. Freeing those blocks too early risks memory corruption; forgetting to free them later causes a permanent wedge.
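The correctness model above can be condensed into a single predicate. This is a minimal illustrative sketch (the type and field names are invented for this example, not TRT-LLM's actual API): KV resources may be freed only when every party is provably done with them.

```python
from dataclasses import dataclass

@dataclass
class RequestState:
    python_terminated: bool      # Python-side request state says "done"
    cpp_transfer_terminal: bool  # C++ transfer status reached a final state
    reuse_pinned: bool           # KV blocks still referenced by the reuse tree

def safe_to_free_kv(state: RequestState) -> bool:
    # Python state alone is not sufficient (the bug class): all three must hold.
    return (state.python_terminated
            and state.cpp_transfer_terminal
            and not state.reuse_pinned)

# A cancelled request whose context transfer is still in flight must NOT be freed.
assert not safe_to_free_kv(RequestState(True, False, False))
# Only once C++ reports terminal and no reuse pin remains is freeing safe.
assert safe_to_free_kv(RequestState(True, True, False))
```

Freeing when the predicate is false risks corruption; never re-evaluating it after it becomes true is the permanent wedge. The changes below enforce both directions.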
What this PR changes
1. C++ transfer request lifetime
The C++ transceiver now uses `std::shared_ptr<LlmRequest>` across async send/receive state, so Python-side termination cannot destroy a request object while C++ futures/workers still dereference it. This closes the UAF class that previously manifested as corrupted request IDs and first-request crashes.
2. Sender/receiver cancellation bookkeeping
Cancellation paths now fulfill promises before dropping queued/ready work. This prevents `std::future_error: Broken promise` from escaping on both sender and receiver sides.
3. Non-blocking generation transfer polling
`checkGenTransferStatus(atLeastNum=1)` no longer blocks indefinitely on an unready future. It skips unready entries during non-blocking polling and lets timeout handling drive cancellation/error state.
4. NIXL transfer cancellation and release
The NIXL path now releases transfer handles on cancellation (`TransferStatus::release()` → `nixlAgent::releaseXferReq()`), so cancelled requests do not leave backend transfer state stranded.
5. Eval-order fix after `shared_ptr`
After moving `Response::mRequest` to `shared_ptr`, the one-liner `sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp));` became unsafe: C++ does not specify function-argument evaluation order, so `std::move(resp)` could run before `resp.mRequest->mRequestId` is read. This PR materializes `reqId` into a local before the move.
6. Python idempotency guards
Generation-side disagg initialization and KV receive setup are now guarded by `py_request_id`, preventing repeated side effects such as duplicate `KVCacheManager::addSequence()` or duplicate `request_and_receive_async()` calls for the same request.
7. Fail-closed behavior for unknown transfer quiescence
If TRT-LLM cannot prove transport quiescence, it fails closed instead of returning buffers to the pool. This includes:
- `BufferIndexHolder::poison()` keeps a suspect recv-buffer slot out of the pool
- every transfer state is handled explicitly (ready, not-ready, cancelled/unknown)
- applied in both `cacheFormatter.cpp` and `mlaCacheFormatter.cpp`
This protects against silent buffer-pool corruption under cancellation races.
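A hedged sketch of the fail-closed idea in change 7, with a class that loosely mirrors `BufferIndexHolder` (the Python names and the string-valued transfer state are illustrative only): when quiescence cannot be proven, the slot is poisoned, a deliberate bounded leak, rather than returned to a pool the transport might still write into.

```python
class BufferSlotHolder:
    """Illustrative recv-buffer slot guard; not the real C++ class."""

    def __init__(self, pool: list, index: int):
        self.pool, self.index = pool, index
        self.poisoned = False

    def poison(self) -> None:
        # Fail closed: never recycle this slot. A bounded leak beats letting
        # NIXL/UCX write into memory that was handed to another request.
        self.poisoned = True

    def release(self, transfer_state: str) -> None:
        if transfer_state != "ready":        # not-ready / cancelled / unknown
            self.poison()
            return
        self.pool.append(self.index)         # provably quiesced: safe to recycle

pool = []
ok, bad = BufferSlotHolder(pool, 0), BufferSlotHolder(pool, 1)
ok.release("ready")
bad.release("cancelled/unknown")
assert pool == [0] and bad.poisoned
```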
8. Deferred cleanup for timed-out context/generation transfer
A successful `cancel_request()` is not a quiescence proof. The PR therefore defers Python-side resource cleanup after timeout cancellation until C++ transfer status reports the request terminal. This preserves memory safety: resources are not freed while C++/NIXL may still reference transfer buffers or KV blocks.
9. Block-reuse-safe deferred termination
rc13 enabled block reuse with the overlap scheduler, which exposed a new liveness hole.
Before the final fix:
This PR records whether termination was requested and whether resources were actually freed:
`AsyncTransferManager.end_transfer()` now returns:
When the transfer becomes terminal, `_end_transfer_and_maybe_terminate()` retries termination if an earlier attempt was deferred. This preserves both invariants:
_end_transfer_and_maybe_terminate()retries termination if an earlier attempt was deferred. This preserves both invariants:Why deferring cleanup is intentional
This PR intentionally defers cleanup in some cancellation paths.
That is the correct behavior because `cancel_request()` only requests cancellation. It does not prove that C++/NIXL has stopped touching transfer buffers or KV blocks.
The final behavior is:
So the PR does not mean "never clean up." It means "do not clean up until it is safe, and then guarantee cleanup happens."
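The "defer, then guarantee" contract of changes 8-9 can be sketched as follows. This is an illustrative model, not the real PyExecutor code: the manager class, its fields, and `request_terminate`'s signature are invented for the example, while `EndTransferResult.needs_termination` mirrors the field named above.

```python
from dataclasses import dataclass

@dataclass
class EndTransferResult:
    needs_termination: bool  # True if a deferred terminate must now be retried

class AsyncTransferManagerSketch:
    """Illustrative model of the deferred-termination bookkeeping."""

    def __init__(self):
        self.termination_requested = False
        self.resources_freed = False

    def request_terminate(self, transfer_terminal: bool) -> None:
        # Record intent unconditionally; free only if it is already safe.
        self.termination_requested = True
        if transfer_terminal:
            self.resources_freed = True   # safe now
        # else: defer until the transfer reaches a terminal state

    def end_transfer(self) -> EndTransferResult:
        # Called when C++ reports the transfer terminal: surface the debt so
        # the caller retries termination exactly once if it was deferred.
        return EndTransferResult(self.termination_requested and not self.resources_freed)

mgr = AsyncTransferManagerSketch()
mgr.request_terminate(transfer_terminal=False)   # cancel arrives mid-transfer: deferred
assert not mgr.resources_freed                   # nothing freed while unsafe
assert mgr.end_transfer().needs_termination      # the retry happens here, once
```

The two recorded bits keep the invariants independent: `termination_requested` guarantees cleanup is never forgotten, and `resources_freed` guarantees it never happens twice or too early.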
Tests
New/updated unit coverage includes:
- `checkGenTransferStatus(atLeastNum=1)` non-blocking behavior
- `EndTransferResult.needs_termination` coverage
Local test runs:
Notes / limitations
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.