[https://nvbugs/6104831][fix] Disagg KV transfer: non-blocking polling, structured cancel, bounded quarantine #13796
Draft
yifjiang wants to merge 5 commits into NVIDIA:main from
Conversation
…g, structured cancel, bounded quarantine

The C++ cache transceiver and Python executor cooperate on disaggregated KV transfer cancellation, timeout, and process health. Earlier attempts (PR NVIDIA#13706, PR NVIDIA#13713, PR NVIDIA#13728) each fixed one slice of the problem but still hung in the field:

* PR NVIDIA#13706 with Python changes restarted workers on every routine per-request timeout (Python fail-close was too aggressive).
* PR NVIDIA#13713 / PR NVIDIA#13728 hung even with Python cleanup removed because the C++ status polling could still call future.get() on an unready worker future, freezing the executor event loop while NIXL/UCX was wedged.

This PR keeps each layer's responsibility separate, as described in docs/source/features/disagg-kv-transfer-hang-restart-analysis.md and the new docs/source/features/disagg-kv-transfer-session-lifecycle.md:

* checkContextTransferStatus / checkGenTransferStatus are now strictly non-blocking. They poll with wait_for(0ms) and only call future.get() when the future is already ready. atLeastRequestNum > 0 still admits additional ready entries but never selects an unready one to satisfy the count. drainContextTransferStatus / drainGenTransferStatus are the only blocking variants and are intended for shutdown drain.
* cancelRequestStructured returns a TransferCancelResult enum with six outcomes (NotFound, AlreadyComplete, CancelledBeforeAdvertise, CancelRequestedInFlight, BackendUnhealthy, NotCancellable). Only the first three permit Python to free request resources; the others keep the request owned by C++ until the worker future reaches a final state. The historical bool cancelRequest is preserved as a backward-compatible wrapper.
* The transceiver maintains a bounded quarantine counter and a global progress deadline. A per-request timeout marks the entry quarantined but keeps the future pinned so NIXL/UCX cannot write into freed memory. If quarantined entries exceed mQuarantineBudget or no worker has reached a final state for longer than mGlobalProgressDeadlineMs, isHealthy() flips false and getHealth() surfaces the snapshot for orchestration.
* PyExecutor adds _can_terminate_request_now / _inflight_cancel_requested_ids so _do_terminate_request defers freeing resources for any request that is still in disagg transfer state or whose cancel is in flight. Per-request timeouts no longer clear active_requests, the waiting queue, or set is_shutdown — that was the PR NVIDIA#13706 restart-loop trigger and is explicitly forbidden by the analysis doc.
* The ADP synchronized pending-response flush from PR NVIDIA#13112 is ported here (the base commit precedes that merge). Transfer-completion responses created in _end_transfer_and_maybe_terminate are buffered and flushed at synchronised loop points so every DP rank participates in the tp_gather collective.

Test coverage:

* tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py — 14 unit tests for the deferred-terminate guard, the structured-cancel decision tree, and the ADP flush symmetry. They run without GPU or MPI so they fail fast in pre-merge CI.

References:

* analysis: docs/source/features/disagg-kv-transfer-hang-restart-analysis.md
* contract: docs/source/features/disagg-kv-transfer-session-lifecycle.md

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…the cache transceiver

Follow-up to the previous commit on this branch. Changes the BaseCacheTransceiver / CacheTransceiver public API and the tracked-future vectors from raw LlmRequest* to std::shared_ptr<LlmRequest> so that the C++ object outlives every worker thread access regardless of when Python's _terminate_request drops its pybind reference. This closes the historical use-after-free class on raw LlmRequest* that showed up in field traces as mRequestId == 0x5555555555555555.

Specifically:

* respondAndSendAsync, requestAndReceiveSync, requestAndReceiveAsync, cancelRequest, and cancelRequestStructured all take std::shared_ptr<LlmRequest>.
* TrackedFuture::request is now std::shared_ptr<LlmRequest>; the shared_ptr in mSenderFutures / mRequesterFutures keeps the LlmRequest pinned for as long as the transceiver tracks the worker future.
* trtGptModelInflightBatching.cpp now passes the shared_ptr directly instead of stripping it via .get().
* The nanobind trampoline (PyCacheTransceiver) signatures match. The binding relies on <nanobind/stl/shared_ptr.h> (already included) to provide a Python-aware shared_ptr that pins the wrapper alive while C++ holds it.

The Python-side _can_terminate_request_now guard stays in place. Its job is no longer about lifetime — the shared_ptr handles that — but about resource quiescence: free_resources() releases KV blocks back to the pool, and the transport may still be writing into them even while the LlmRequest object still exists. Termination must still wait for the structured cancel result to signal safety.

The doc at docs/source/features/disagg-kv-transfer-session-lifecycle.md is updated with a new "Object lifetime" section.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…r queues

Follow-up to the prior commit. The worker queues inside CacheSender::Impl and CacheReceiver::Impl still held raw LlmRequest* pointers — meaning the executor's mSenderFutures / mRequesterFutures pinned the lifetime, but the worker thread that actually dereferences the request had only a raw observer. If the executor erased its tracking entry while the worker was mid-flight, the LlmRequest could be freed under the worker.

This commit closes that surface:

* CacheSender::Impl::Response::mRequest → std::shared_ptr<LlmRequest>.
* CacheReceiver::Impl::RequestAndPromise::mRequest → std::shared_ptr<LlmRequest>. Move/copy semantics simplified now that the field is a smart pointer.
* CacheSender::sendAsync(LlmRequest&) → sendAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::receiveAsync(LlmRequest&) → receiveAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::Impl::requestAndReceiveAsyncMultiThreads similarly.
* CacheReceiver::Impl::receiveAsync now captures the shared_ptr by value in the std::async lambda so the worker thread pins the LlmRequest independently of the caller.
* CacheTransceiver::respondAndSendAsync/-LayerWise/requestAndReceive* pass the shared_ptr (no longer .get()-strip it) and move into the TrackedFuture entry where appropriate.

Also fixes the eval-order UAF in CacheSender::Impl::handleAsyncSend that PR NVIDIA#13713 / PR NVIDIA#13728 called out: once Response::mRequest is a shared_ptr, the one-liner sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp)); becomes undefined behaviour because C++ argument evaluation order is unspecified — the compiler may evaluate std::move(resp) first, leaving resp.mRequest empty when reading mRequestId. Materialise the id into a local before the move (illustrated in the sketch after this message).

The PyCacheTransceiver trampoline's bool cancelRequest override is updated to match the new shared_ptr signature.

The doc at docs/source/features/disagg-kv-transfer-session-lifecycle.md spells out the full ownership chain: executor tracking + worker queue both hold shared_ptr, and TransferSession::mRequest is an ephemeral observer used only inside the worker frame where the shared_ptr is already held.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
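A minimal self-contained illustration of the evaluation-order hazard and the fix described above, with Response reduced to the one relevant field and sendAndRemoveResponse stubbed out:

```cpp
#include <cstdint>
#include <memory>
#include <utility>

struct LlmRequest
{
    std::uint64_t mRequestId;
};

struct Response
{
    std::shared_ptr<LlmRequest> mRequest;
};

// Stub: the real function forwards the response and erases tracking state.
void sendAndRemoveResponse(std::uint64_t /*requestId*/, Response /*resp*/) {}

void handleAsyncSend(Response& resp)
{
    // UB: the two arguments are indeterminately sequenced, so the by-value
    // Response parameter may be move-constructed from `resp` BEFORE
    // resp.mRequest->mRequestId is read — dereferencing a null shared_ptr.
    // sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp));

    // Fix: materialise the id into a local before the move.
    auto const requestId = resp.mRequest->mRequestId;
    sendAndRemoveResponse(requestId, std::move(resp));
}
```

With the old raw LlmRequest*, moving the Response merely copied the pointer, which is why the one-liner happened to work before this refactor.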
…e size, test callsites

Three issues caught in self-review of the prior commits:

1. cancelRequestStructured returned BackendUnhealthy too eagerly — even for a request that was still queued pre-advertise. Pre-advertise release is always safe (no buffer was ever exposed to a peer), and freeing those resources during the unhealthy window reduces backend pressure rather than adding to it. Reorder: check the sender / receiver queues first and return CancelledBeforeAdvertise unconditionally if found there. Only consult isHealthy() for the in-flight branch, where freeing would race with NIXL/UCX writes.

2. NB_TRAMPOLINE size was 6 but the trampoline has 7 NB_OVERRIDE_PURE entries (a pre-existing under-allocation of nanobind's dispatch hash table). Bump to 7 since we touched the file.

3. cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp called mSender->sendAsync(*llmRequest) and mRequester->receiveAsync(*llmRequest), passing LlmRequest&. After the prior commit those signatures take std::shared_ptr<LlmRequest>. The test was already holding a shared_ptr (request->mLlmRequest), so just drop the dereference.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
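For reference, a condensed sketch of the nanobind trampoline pattern involved in issue 2 — only two of the seven pure overrides are shown, and BaseCacheTransceiver / LlmRequest come from the project headers:

```cpp
#include <memory>

#include <nanobind/nanobind.h>
#include <nanobind/stl/shared_ptr.h> // Python-aware shared_ptr caster

// NB_TRAMPOLINE's second argument sizes nanobind's dispatch table and must
// cover every NB_OVERRIDE_* below. The table was sized 6 while seven
// NB_OVERRIDE_PURE entries exist — a pre-existing under-allocation.
class PyCacheTransceiver : public BaseCacheTransceiver
{
public:
    NB_TRAMPOLINE(BaseCacheTransceiver, 7); // was 6: one slot short

    void respondAndSendAsync(std::shared_ptr<LlmRequest> llmRequest) override
    {
        NB_OVERRIDE_PURE(respondAndSendAsync, llmRequest);
    }

    bool cancelRequest(std::shared_ptr<LlmRequest> llmRequest) override
    {
        NB_OVERRIDE_PURE(cancelRequest, llmRequest);
    }

    // ...five more NB_OVERRIDE_PURE entries, one per remaining pure virtual.
};
```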
…V release + health-driven shutdown
The earlier commits surfaced the user-visible "exceeded timeout"
error and pinned the LlmRequest object via shared_ptr, but the
recovery story still had two gaps that an internal review caught:
1. When the C++ deadline-quarantine path marked a request as
DISAGG_TRANS_ERROR, _can_terminate_request_now no longer recognised
the request as "still in flight" (its state was no longer
DISAGG_*_TRANS_IN_PROGRESS), so _do_terminate_request happily
called free_resources() — releasing KV blocks back to the cache
pool while the C++ worker thread might still be writing into them
via NIXL/UCX. The shared_ptr lifetime fix protected the LlmRequest
object but not its KV memory.
2. C++ exposed isHealthy() / getHealth() but nothing on the Python
side consumed those signals. A worker whose backend was genuinely
wedged would silently accumulate quarantined transfers forever,
never triggering an orchestration restart.
This commit closes both:
* PyExecutor adds _pending_resource_release (a list of LlmRequest
whose KV cleanup is deferred) and _maybe_release_pending_resources
which polls cancel_request_structured each iteration. free_resources
runs only when the structured cancel result transitions to
AlreadyComplete / NotFound / CancelledBeforeAdvertise — the three
outcomes that prove the worker has reached a final state.
* The timeout error paths in _check_disagg_ctx_cache_transfer_status
and _handle_responses (gen side) call
_defer_resource_release_for_inflight_transfer() before _handle_errors
/ _end_transfer_and_maybe_terminate, so the user-visible error
response is still surfaced immediately while the actual
free_resources() call defers until C++ quiesces.
* PyExecutor adds _check_transceiver_health which monitors
transceiver.is_healthy(). It records the first-unhealthy timestamp
and, after a grace period (default 2 * global_progress_deadline_seconds,
falling back to 120s), sets self.is_shutdown so
orchestration's existing restart path takes the worker out of
service. Recovery from a transient unhealthy window does not
trigger a restart (see the sketch after this commit message).
* Both new methods plus the existing _check_kv_transfer_timeout are
called at every executor-loop tick that already checks for KV
transfer timeouts — five sites across PP / non-overlap / overlap
loops.
The doc gains a "Recovery model" section that spells out the three
timescales: user-visible error (immediate), KV resource release
(bounded by worker quiescence), worker restart on persistent wedge
(global deadline + grace).
Tests:
* Five new unit tests in test_kv_transfer_session_lifecycle.py:
- test_defer_resource_release_holds_until_quiesced (full sequence
timeout → in-flight → still in-flight → quiesced → freed).
- test_defer_resource_release_handles_not_found_as_safe.
- test_defer_resource_release_no_op_when_empty.
- test_check_transceiver_health_resets_on_recovery (transient
unhealthy must not shut down).
- test_check_transceiver_health_triggers_shutdown_after_grace
(sustained unhealthy DOES shut down).
* All 19 tests pass without GPU/MPI.
Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
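The grace-period decision in _check_transceiver_health is language-agnostic; here is a minimal sketch of the same policy in C++ (HealthWatchdog and its members are illustrative names, not the PyExecutor implementation):

```cpp
#include <chrono>
#include <optional>

using Clock = std::chrono::steady_clock;

// Shut down only after the transceiver has been continuously unhealthy
// for a full grace period; any recovery in between resets the timer.
class HealthWatchdog
{
public:
    explicit HealthWatchdog(std::chrono::seconds grace)
        : mGrace(grace)
    {
    }

    // Called once per executor-loop tick with transceiver.is_healthy().
    // Returns true when orchestration should take the worker out of service.
    bool shouldShutdown(bool transceiverHealthy)
    {
        if (transceiverHealthy)
        {
            mFirstUnhealthy.reset(); // transient unhealthy window: no restart
            return false;
        }
        auto const now = Clock::now();
        if (!mFirstUnhealthy)
        {
            mFirstUnhealthy = now; // record the first-unhealthy timestamp
        }
        return now - *mFirstUnhealthy >= mGrace;
    }

private:
    std::chrono::seconds mGrace;
    std::optional<Clock::time_point> mFirstUnhealthy;
};
```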
Summary

Disaggregated KV transfer cancellation, timeout, and process-health contract — a layered fix on top of the analysis in docs/source/features/disagg-kv-transfer-hang-restart-analysis.md.

Earlier attempts (#13706, #13713, #13728) each fixed one slice of the problem but still hung in the field. Concretely:

* #13706 restarted workers on every routine per-request timeout — the Python fail-close policy was too aggressive.
* #13713 / #13728 hung even with Python cleanup removed because the C++ status polling could call future.get() on an unready worker future, freezing the executor event loop while NIXL/UCX was wedged.

This PR separates the four concerns that the earlier PRs mixed:

* Status polling is strictly non-blocking by default.
* Cancellation returns a structured TransferCancelResult instead of a bare bool.
* Lifetime: shared_ptr<LlmRequest> through the public API + tracked-future vectors.
* Python policy: _can_terminate_request_now defers free_resources() until safe; workers restart only on isHealthy() == false.

Base: b9ce4b69d12fe5ba65d13893111b1a2ea29413ee. Two commits.

Commit 1 — non-blocking polling, structured cancel, bounded quarantine
C++ — status polling is non-blocking by default

checkContextTransferStatus and checkGenTransferStatus poll worker futures with wait_for(0ms) and never call future.get() on an unready entry. When atLeastRequestNum > 0 is requested, the polling code admits additional ready futures but skips unready ones rather than blocking the executor thread. New explicit blocking variants exist for shutdown only:

* drainContextTransferStatus(bool markComplete)
* drainGenTransferStatus()

This is the actual hang fix — see the analysis doc's "Why the C++-only fix is not sufficient" section.
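A minimal sketch of the non-blocking poll, assuming each tracked entry pairs the request with its worker std::future — TrackedFuture here is a simplified stand-in for the real bookkeeping, and pollFinished for the shared core of checkContextTransferStatus / checkGenTransferStatus:

```cpp
#include <chrono>
#include <future>
#include <memory>
#include <vector>

struct LlmRequest; // project type, opaque here

struct TrackedFuture
{
    std::shared_ptr<LlmRequest> request; // pins the request (see commit 2)
    std::future<void> future;            // worker-thread completion
};

// Non-blocking poll: move every already-ready entry out of `tracked`.
// wait_for(0ms) only inspects readiness; future.get() is called solely on
// entries that are already ready, so this thread never blocks even when
// NIXL/UCX is wedged. Unready entries are skipped, never selected — the
// same rule that lets atLeastRequestNum admit extra READY entries only.
std::vector<std::shared_ptr<LlmRequest>> pollFinished(std::vector<TrackedFuture>& tracked)
{
    std::vector<std::shared_ptr<LlmRequest>> finished;
    for (auto it = tracked.begin(); it != tracked.end();)
    {
        if (it->future.wait_for(std::chrono::milliseconds(0)) == std::future_status::ready)
        {
            it->future.get(); // cannot block: already ready (may rethrow worker errors)
            finished.push_back(std::move(it->request));
            it = tracked.erase(it);
        }
        else
        {
            ++it; // unready: leave tracked; a drain* variant may wait at shutdown
        }
    }
    return finished;
}
```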
C++ — structured TransferCancelResult enum

cancelRequestStructured(LlmRequest*) returns a structured outcome:

* NotFound
* AlreadyComplete
* CancelledBeforeAdvertise
* CancelRequestedInFlight
* BackendUnhealthy
* NotCancellable

The historical bool cancelRequest is preserved as a backward-compatible wrapper that returns true only for the first three states.
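A sketch of the enum and the wrapper, with an illustrative isReleaseSafe helper encoding the "first three states" rule (the enumerator comments paraphrase this description, not the actual header):

```cpp
enum class TransferCancelResult
{
    NotFound,                 // request is not (or no longer) tracked
    AlreadyComplete,          // worker future already reached a final state
    CancelledBeforeAdvertise, // still queued; no buffer was exposed to a peer
    CancelRequestedInFlight,  // cancel recorded; worker still owns the buffers
    BackendUnhealthy,         // unhealthy window; freeing would race NIXL/UCX
    NotCancellable,           // cannot be cancelled in its current state
};

// Only these three outcomes permit Python to free request resources; the
// rest keep the request owned by C++ until the worker reaches a final state.
inline bool isReleaseSafe(TransferCancelResult result)
{
    return result == TransferCancelResult::NotFound
        || result == TransferCancelResult::AlreadyComplete
        || result == TransferCancelResult::CancelledBeforeAdvertise;
}

struct LlmRequest; // project type, opaque here
TransferCancelResult cancelRequestStructured(LlmRequest* llmRequest);

// The historical bool API, preserved as a thin wrapper.
inline bool cancelRequest(LlmRequest* llmRequest)
{
    return isReleaseSafe(cancelRequestStructured(llmRequest));
}
```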
C++ — bounded quarantine and explicit health signal

When a tracked transfer exceeds kvTransferTimeoutMs and the worker future is still not ready, the entry is flipped to quarantined: an error is surfaced to the caller, but the future stays pinned so that NIXL/UCX cannot write into freed memory. A counter increments. If quarantined entries exceed mQuarantineBudget or no worker has reached a final state for longer than mGlobalProgressDeadlineMs, isHealthy() flips false. getHealth() returns a TransceiverHealth snapshot for orchestration / metrics. Defaults: budget 16, deadline 60s.
Python — per-request policy, no fail-close

PyExecutor adds _can_terminate_request_now and _inflight_cancel_requested_ids. _do_terminate_request defers freeing resources for any request that is still in disagg transfer state or whose cancel is in flight. Per-request timeouts no longer clear active_requests, the waiting queue, or set is_shutdown — that was the #13706 restart-loop trigger and is explicitly forbidden by the analysis doc.

The timeout/cancel path now uses cancel_request_structured and only calls _handle_errors([request]) when the structured result is one of CancelledBeforeAdvertise / AlreadyComplete / NotFound. For CancelRequestedInFlight and BackendUnhealthy, the request stays in active_requests, the cancel is recorded, and the next iteration re-polls C++.
Python — ADP synchronized response flush (port from #13112)

The base commit predates #13112. Without it, calling _enqueue_responses inside _end_transfer_and_maybe_terminate deadlocks with ADP because only the owning DP rank reaches that point. Transfer-completion responses are now buffered in _pending_transfer_responses and flushed at synchronised executor-loop points where every DP rank participates in the tp_gather collective. Flush is symmetric: empty + enable_attention_dp=True still calls _enqueue_responses([]) so the collective completes.

Commit 2 — shared_ptr<LlmRequest> lifetime refactor

Equivalent to the lifetime hardening that #13713 applied. Pin the LlmRequest C++ object through the entire transceiver path so it outlives every worker thread access regardless of when Python's _terminate_request drops its pybind reference.

* respondAndSendAsync, requestAndReceiveSync, requestAndReceiveAsync, cancelRequest, and cancelRequestStructured all take std::shared_ptr<LlmRequest>.
* TrackedFuture::request is now shared_ptr<LlmRequest>; the shared_ptr in mSenderFutures / mRequesterFutures keeps the LlmRequest pinned for as long as the transceiver tracks the worker future.
* trtGptModelInflightBatching.cpp passes the shared_ptr directly instead of .get()-stripping it.
* The nanobind trampoline (PyCacheTransceiver) signatures match. <nanobind/stl/shared_ptr.h> (already included in the binding file) provides a Python-aware shared_ptr that pins the wrapper alive while C++ holds it.
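A condensed sketch of the pinning pattern (doReceive stands in for the actual transfer work): the worker lambda captures the shared_ptr by value, so the request outlives any Python-side release.

```cpp
#include <future>
#include <memory>

struct LlmRequest;            // project type, opaque here
void doReceive(LlmRequest&);  // stand-in for the actual transfer work

// The worker captures the shared_ptr by value. Even if Python drops its
// last reference and the executor erases its TrackedFuture entry, the
// LlmRequest stays alive until this lambda returns.
std::future<void> receiveAsync(std::shared_ptr<LlmRequest> llmRequest)
{
    return std::async(std::launch::async,
        [llmRequest]() // by-value capture pins the request for the worker
        {
            doReceive(*llmRequest);
        });
}
```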
The Python _can_terminate_request_now guard stays in place — but its job is no longer object lifetime (the shared_ptr handles that). It is now strictly about resource quiescence: free_resources() releases KV blocks back to the pool, and the transport may still be writing into them even while the LlmRequest C++ object still exists. Termination must still wait for the structured cancel result to signal safety.
Testing

* tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py — 14 unit tests covering the deferred-terminate guard, the structured-cancel decision tree (including the NotFound-after-final-state case), the unhealthy-backend pin, and ADP flush symmetry. They run without GPU or MPI so they fail fast in pre-merge CI.
* Existing cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp exercises the public API only and is not affected by the mSenderFutures / mRequesterFutures storage refactor.

Test plan
* pytest tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py
* bot run --extra-stage "<cache-transceiver multi-GPU stages>" (C++ multi-GPU tests)
* buffer reuse (the analysis-doc regression scenarios)
* (accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_gen_first)
* MALLOC_PERTURB_=85 to confirm the LlmRequest UAF class is closed by commit 2

Documentation
* docs/source/features/disagg-kv-transfer-hang-restart-analysis.md — the analysis that anchors this fix direction.
* docs/source/features/disagg-kv-transfer-session-lifecycle.md — the contract this PR establishes (status polling, cancel result, quarantine budget, ADP flush, Python policy, object lifetime).
What this PR is not

It does not add a per-buffer BufferIndexHolder::poison() API (#13728's approach). Quarantine is tracked at the transceiver level via a bounded counter + global progress deadline, which is sufficient to flip isHealthy() and let orchestration restart cleanly without per-buffer poison plumbing.