[https://nvbugs/6104831][fix] Disagg KV transfer: non-blocking polling, structured cancel, bounded quarantine #13796

Draft
yifjiang wants to merge 5 commits into NVIDIA:main from yifjiang:fix/disagg-kv-transfer-session

Conversation

yifjiang (Contributor) commented May 6, 2026

Summary

Disaggregated KV transfer cancellation, timeout, and process-health
contract — a layered fix on top of the analysis in
docs/source/features/disagg-kv-transfer-hang-restart-analysis.md.
Earlier attempts (#13706, #13713, #13728) each fixed one slice of
the problem but still hung in the field.

This PR separates the four concerns that the earlier PRs mixed:

| Responsibility | Owner | Mechanism |
| --- | --- | --- |
| Object lifetime | C++ transceiver | `shared_ptr<LlmRequest>` through the public API + tracked-future vectors. |
| Resource quiescence | Python executor + C++ transceiver | Worker future stays pinned until ready / error / quarantine; Python's `_can_terminate_request_now` defers `free_resources()` until safe. |
| Request failure | Python executor | Surface a request-level error to the user only after C++ reports a final state. |
| Process health | C++ transceiver | Bounded quarantine + global progress deadline → `isHealthy() == false`. |

Base: b9ce4b69d12fe5ba65d13893111b1a2ea29413ee. Two commits.

Commit 1 — non-blocking polling, structured cancel, bounded quarantine

C++ — status polling is non-blocking by default

checkContextTransferStatus and checkGenTransferStatus poll worker
futures with wait_for(0ms) and never call future.get() on an
unready entry. When atLeastRequestNum > 0 is requested, the polling
code admits additional ready futures but skips unready ones rather than
blocking the executor thread. New explicit blocking variants exist for
shutdown only:

  • drainContextTransferStatus(bool markComplete)
  • drainGenTransferStatus()

This is the actual hang fix — see the analysis doc's "Why the C++-only
fix is not sufficient" section.
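
For illustration, a minimal sketch of the polling discipline, using simplified stand-in types (the real checkContextTransferStatus / checkGenTransferStatus operate on the richer tracked-future entries, but follow the same rule: poll with wait_for(0ms), and only call get() on a future that is already ready):

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

struct TrackedFutureSketch
{
    std::future<void> future; // worker-side transfer completion
};

// Returns the indices of entries whose worker future is already final,
// without ever parking the calling (executor) thread.
std::vector<std::size_t> pollReadyEntries(std::vector<TrackedFutureSketch>& tracked)
{
    std::vector<std::size_t> ready;
    for (std::size_t i = 0; i < tracked.size(); ++i)
    {
        auto& entry = tracked[i];
        if (!entry.future.valid())
        {
            continue;
        }
        // wait_for(0ms) is a pure poll; only a ready future is ever get()'d.
        if (entry.future.wait_for(std::chrono::milliseconds(0)) == std::future_status::ready)
        {
            entry.future.get();
            ready.push_back(i);
        }
        // Unready entries are skipped even when atLeastRequestNum > 0 asks for
        // more completions; only the blocking drain*() variants may wait.
    }
    return ready;
}
```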

C++ — structured TransferCancelResult enum

cancelRequestStructured(LlmRequest*) returns a structured outcome:

| Value | Caller may release request resources? |
| --- | --- |
| `NotFound` | yes — C++ already reached a final state |
| `AlreadyComplete` | yes — worker future is ready |
| `CancelledBeforeAdvertise` | yes — no buffer was advertised to the peer |
| `CancelRequestedInFlight` | no — worker may still touch buffers |
| `BackendUnhealthy` | no — orchestration must restart |
| `NotCancellable` | no — retry later |

The historical bool cancelRequest is preserved as a backward-compatible
wrapper that returns true only for the first three states.
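
For reference, a sketch of the outcome enum and the bool mapping; the enum values are the ones named above, while the helper name is an illustrative stand-in for the preserved wrapper:

```cpp
enum class TransferCancelResult
{
    NotFound,                 // C++ already reached a final state
    AlreadyComplete,          // worker future is ready
    CancelledBeforeAdvertise, // no buffer was advertised to the peer
    CancelRequestedInFlight,  // worker may still touch buffers
    BackendUnhealthy,         // orchestration must restart
    NotCancellable            // retry later
};

// The legacy bool contract: true only for the three states in which the
// caller may release request resources immediately.
inline bool cancelResultPermitsRelease(TransferCancelResult result)
{
    return result == TransferCancelResult::NotFound
        || result == TransferCancelResult::AlreadyComplete
        || result == TransferCancelResult::CancelledBeforeAdvertise;
}
```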

C++ — bounded quarantine and explicit health signal

When a tracked transfer exceeds kvTransferTimeoutMs and the worker
future is still not ready, the entry is flipped to quarantined: an
error is surfaced to the caller, but the future stays pinned so that
NIXL/UCX cannot write into freed memory. A counter increments. If
quarantined entries exceed mQuarantineBudget or no worker has
reached a final state for longer than mGlobalProgressDeadlineMs,
isHealthy() flips false. getHealth() returns a TransceiverHealth
snapshot for orchestration / metrics. Defaults: budget 16, deadline 60s.
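
A rough sketch of how the two thresholds combine into the health signal; the tracker class and member names below are assumptions, only the defaults (budget 16, deadline 60s) and the isHealthy() / getHealth() surface come from this PR:

```cpp
#include <chrono>
#include <cstddef>

struct TransceiverHealthSketch
{
    std::size_t quarantinedCount;
    bool healthy;
};

class HealthTrackerSketch
{
public:
    // Called when a per-request timeout fires while the worker future is unready.
    void onQuarantine() { ++mQuarantinedCount; }

    // Any worker reaching a final state counts as global progress.
    void onWorkerFinalState() { mLastProgress = std::chrono::steady_clock::now(); }

    bool isHealthy() const
    {
        bool const overBudget = mQuarantinedCount > mQuarantineBudget;
        bool const stalled = std::chrono::steady_clock::now() - mLastProgress > mGlobalProgressDeadline;
        return !overBudget && !stalled;
    }

    TransceiverHealthSketch getHealth() const { return {mQuarantinedCount, isHealthy()}; }

private:
    std::size_t mQuarantineBudget{16};                        // default budget
    std::chrono::milliseconds mGlobalProgressDeadline{60000}; // default 60 s
    std::size_t mQuarantinedCount{0};
    std::chrono::steady_clock::time_point mLastProgress{std::chrono::steady_clock::now()};
};
```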

Python — per-request policy, no fail-close

PyExecutor adds _can_terminate_request_now and
_inflight_cancel_requested_ids. _do_terminate_request defers freeing
resources for any request that is still in disagg transfer state or
whose cancel is in flight. Per-request timeouts no longer clear
active_requests, the waiting queue, or set is_shutdown — that was
the #13706 restart-loop trigger and is explicitly forbidden by the
analysis doc.

The timeout/cancel path now uses cancel_request_structured and only
calls _handle_errors([request]) when the structured result is one of
CancelledBeforeAdvertise / AlreadyComplete / NotFound. For
CancelRequestedInFlight and BackendUnhealthy, the request stays in
active_requests, the cancel is recorded, and the next iteration
re-polls C++.

Python — ADP synchronized response flush (port from #13112)

The base commit predates #13112. Without it, calling _enqueue_responses
inside _end_transfer_and_maybe_terminate deadlocks with ADP because
only the owning DP rank reaches that point. Transfer-completion responses
are now buffered in _pending_transfer_responses and flushed at
synchronised executor-loop points where every DP rank participates in
the tp_gather collective. The flush is symmetric: even when the buffer is
empty and enable_attention_dp=True, _enqueue_responses([]) is still called
so the collective completes.

Commit 2 — shared_ptr<LlmRequest> lifetime refactor

Equivalent to the lifetime hardening that #13713 applied. Pin the
LlmRequest C++ object through the entire transceiver path so it
outlives every worker thread access regardless of when Python's
_terminate_request drops its pybind reference.

  • respondAndSendAsync, requestAndReceiveSync, requestAndReceiveAsync,
    cancelRequest, and cancelRequestStructured all take
    std::shared_ptr<LlmRequest>.
  • TrackedFuture::request is now shared_ptr<LlmRequest>; the
    shared_ptr in mSenderFutures / mRequesterFutures keeps the
    LlmRequest pinned for as long as the transceiver tracks the worker
    future.
  • trtGptModelInflightBatching.cpp passes the shared_ptr directly
    instead of .get()-stripping it.
  • The nanobind trampoline (PyCacheTransceiver) signatures match.
    <nanobind/stl/shared_ptr.h> (already included in the binding file)
    provides a Python-aware shared_ptr that pins the wrapper alive while
    C++ holds it.

The Python _can_terminate_request_now guard stays in place, but its
job is no longer object lifetime (the shared_ptr handles that). It is
now strictly about resource quiescence: free_resources() releases
KV blocks back to the pool, and the transport may still be writing
into them even while the LlmRequest C++ object is alive. Termination must
still wait for the structured cancel result to signal safety.
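
A simplified sketch of the ownership pattern; the class and member names are stand-ins, and only the shared_ptr-taking signatures and the tracked-future pinning come from this PR (the by-value lambda capture follows the worker-queue follow-up described in the commit messages below):

```cpp
#include <future>
#include <memory>
#include <utility>
#include <vector>

class LlmRequest; // stand-in for the real batch-manager class

struct TrackedFutureSketch
{
    std::shared_ptr<LlmRequest> request; // pins the request while the future is tracked
    std::future<void> future;            // worker-side transfer completion
};

class CacheTransceiverSketch
{
public:
    // The public API takes shared_ptr, so Python dropping its binding
    // reference can no longer free the C++ object under a worker thread.
    void respondAndSendAsync(std::shared_ptr<LlmRequest> request)
    {
        // By-value capture gives the worker thread its own pin, independent
        // of when the executor erases its tracking entry.
        auto future = std::async(std::launch::async, [request]() { /* transfer work */ });
        mSenderFutures.push_back({std::move(request), std::move(future)});
    }

private:
    std::vector<TrackedFutureSketch> mSenderFutures;
};
```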

Testing

  • tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py
    — 14 unit tests covering the deferred-terminate guard, the structured-
    cancel decision tree (including the NotFound-after-final-state
    case), the unhealthy-backend pin, and ADP flush symmetry. They run
    without GPU or MPI so they fail fast in pre-merge CI:

    $ python3 -m pytest tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py -v
    ============================== 14 passed in 0.07s ==============================
    
  • Existing cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
    exercises the public API only and is not affected by the
    mSenderFutures/mRequesterFutures storage refactor.

Test plan

  • pytest tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py
  • bot run --extra-stage "<cache-transceiver multi-GPU stages>" (C++ multi-GPU tests)
  • 1P1D cancel-burst harness — confirm no hang, no restart loop, no
    buffer reuse (the analysis-doc regression scenarios).
  • ADP gen-first regression test
    (accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_gen_first).
  • Field repro of NVBug 6104831 with MALLOC_PERTURB_=85 to confirm
    the LlmRequest UAF class is closed by commit 2.

Documentation

  • docs/source/features/disagg-kv-transfer-hang-restart-analysis.md
    — the analysis that anchors this fix direction.
  • docs/source/features/disagg-kv-transfer-session-lifecycle.md
    — the contract this PR establishes (status polling, cancel result,
    quarantine budget, ADP flush, Python policy, object lifetime).

What this PR is not

  • It does not add a BufferIndexHolder::poison() API (#13728's
    approach). Quarantine is tracked at the transceiver level via a
    bounded counter + global progress deadline, which is sufficient to
    flip isHealthy() and let orchestration restart cleanly without
    per-buffer poison plumbing.

…g, structured cancel, bounded quarantine

The C++ cache transceiver and Python executor cooperate on disaggregated
KV transfer cancellation, timeout, and process health. Earlier attempts
(PR NVIDIA#13706, PR NVIDIA#13713, PR NVIDIA#13728) each fixed one slice of the problem
but still hung in the field:

* PR NVIDIA#13706's Python changes restarted workers on every routine
  per-request timeout (the Python fail-close was too aggressive).
* PR NVIDIA#13713 / PR NVIDIA#13728 hung even with Python cleanup removed because
  the C++ status polling could still call future.get() on an unready
  worker future, freezing the executor event loop while NIXL/UCX was
  wedged.

This PR keeps each layer's responsibility separate, as described in
docs/source/features/disagg-kv-transfer-hang-restart-analysis.md and
the new docs/source/features/disagg-kv-transfer-session-lifecycle.md:

* checkContextTransferStatus / checkGenTransferStatus are now strictly
  non-blocking. They poll with wait_for(0ms) and only call future.get()
  when the future is already ready. atLeastRequestNum > 0 still admits
  additional ready entries but never selects an unready one to satisfy
  the count. drainContextTransferStatus / drainGenTransferStatus are
  the only blocking variants and are intended for shutdown drain.

* cancelRequestStructured returns a TransferCancelResult enum with
  six outcomes (NotFound, AlreadyComplete, CancelledBeforeAdvertise,
  CancelRequestedInFlight, BackendUnhealthy, NotCancellable). Only
  the first three permit Python to free request resources; the others
  keep the request owned by C++ until the worker future reaches a
  final state. The historical bool cancelRequest is preserved as a
  backward-compatible wrapper.

* The transceiver maintains a bounded quarantine counter and a global
  progress deadline. A per-request timeout marks the entry quarantined
  but keeps the future pinned so NIXL/UCX cannot write into freed
  memory. If quarantined entries exceed mQuarantineBudget or no
  worker has reached a final state for longer than
  mGlobalProgressDeadlineMs, isHealthy() flips false and getHealth()
  surfaces the snapshot for orchestration.

* PyExecutor adds _can_terminate_request_now / _inflight_cancel_requested_ids
  so _do_terminate_request defers freeing resources for
  any request that is still in disagg transfer state or whose cancel
  is in flight. Per-request timeouts no longer clear active_requests,
  the waiting queue, or set is_shutdown — that was the PR NVIDIA#13706
  restart-loop trigger and is explicitly forbidden by the analysis
  doc.

* The ADP synchronized pending-response flush from PR NVIDIA#13112 is
  ported here (the base commit precedes that merge). Transfer-
  completion responses created in _end_transfer_and_maybe_terminate
  are buffered and flushed at synchronised loop points so every DP
  rank participates in the tp_gather collective.

Test coverage:

* tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py
  — 14 unit tests for the deferred-terminate guard, the structured-
  cancel decision tree, and the ADP flush symmetry. They run without
  GPU or MPI so they fail fast in pre-merge CI.

References:
* analysis: docs/source/features/disagg-kv-transfer-hang-restart-analysis.md
* contract: docs/source/features/disagg-kv-transfer-session-lifecycle.md

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added 3 commits May 5, 2026 21:10
…the cache transceiver

Follow-up to the previous commit on this branch. Changes the
BaseCacheTransceiver / CacheTransceiver public API and the tracked-
future vectors from raw LlmRequest* to std::shared_ptr<LlmRequest> so
that the C++ object outlives every worker thread access regardless of
when Python's _terminate_request drops its pybind reference. This
closes the historical use-after-free class on raw LlmRequest* that
showed up in field traces as mRequestId == 0x5555555555555555.

Specifically:

* respondAndSendAsync, requestAndReceiveSync, requestAndReceiveAsync,
  cancelRequest, and cancelRequestStructured all take
  std::shared_ptr<LlmRequest>.
* TrackedFuture::request is now std::shared_ptr<LlmRequest>; the
  shared_ptr in mSenderFutures / mRequesterFutures keeps the LlmRequest
  pinned for as long as the transceiver tracks the worker future.
* trtGptModelInflightBatching.cpp now passes the shared_ptr directly
  instead of stripping it via .get().
* The nanobind trampoline (PyCacheTransceiver) signatures match. The
  binding relies on <nanobind/stl/shared_ptr.h> (already included) to
  provide a Python-aware shared_ptr that pins the wrapper alive while
  C++ holds it.

The Python-side _can_terminate_request_now guard stays in place. Its
job is no longer about lifetime — the shared_ptr handles that — but
about resource quiescence: free_resources() releases KV blocks back to
the pool, and the transport may still be writing into them even while the
LlmRequest object is alive. Termination must still wait for the
structured cancel result to signal safety.

The doc at docs/source/features/disagg-kv-transfer-session-lifecycle.md
is updated with a new "Object lifetime" section.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…r queues

Follow-up to the prior commit. The worker queues inside CacheSender::Impl
and CacheReceiver::Impl still held raw LlmRequest* pointers — meaning
the executor's mSenderFutures / mRequesterFutures pinned the lifetime
but the worker thread that actually dereferences the request had only
a raw observer. If the executor erased its tracking entry while the
worker was mid-flight, the LlmRequest could be freed under the worker.

This commit closes that surface:

* CacheSender::Impl::Response::mRequest → std::shared_ptr<LlmRequest>.
* CacheReceiver::Impl::RequestAndPromise::mRequest → std::shared_ptr<LlmRequest>.
  Move/copy semantics simplified now that the field is a smart pointer.
* CacheSender::sendAsync(LlmRequest&) → sendAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::receiveAsync(LlmRequest&) → receiveAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::Impl::requestAndReceiveAsyncMultiThreads similarly.
* CacheReceiver::Impl::receiveAsync now captures the shared_ptr by value
  in the std::async lambda so the worker thread pins the LlmRequest
  independently of the caller.
* CacheTransceiver::respondAndSendAsync/-LayerWise/requestAndReceive*
  pass the shared_ptr (no longer .get()-strip it) and move into the
  TrackedFuture entry where appropriate.

Also fixes the eval-order UAF in CacheSender::Impl::handleAsyncSend that
PR NVIDIA#13713 / PR NVIDIA#13728 called out: once Response::mRequest is a
shared_ptr, the one-liner

    sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp));

becomes undefined behaviour because C++ argument evaluation order is
unspecified: the Response argument may be move-constructed from resp
first, leaving resp.mRequest empty by the time mRequestId is read.
Materialise the id into a local before the move.
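
A minimal standalone illustration of that fix, using placeholder types (the real Response and sendAndRemoveResponse live in CacheSender::Impl); only the ordering problem and its remedy mirror the commit:

```cpp
#include <cstdint>
#include <memory>
#include <utility>

struct LlmRequestStub { std::uint64_t mRequestId; };
struct ResponseStub { std::shared_ptr<LlmRequestStub> mRequest; };

void sendAndRemoveResponse(std::uint64_t /*requestId*/, ResponseStub /*response*/) {}

void handleAsyncSendSketch(ResponseStub& resp)
{
    // Undefined behaviour: the ResponseStub argument may be move-constructed
    // from resp (emptying resp.mRequest) before resp.mRequest->mRequestId is
    // read, because function-argument evaluation order is unspecified:
    //     sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp));

    // Fix: materialise the id into a local before moving.
    auto const requestId = resp.mRequest->mRequestId;
    sendAndRemoveResponse(requestId, std::move(resp));
}
```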

The PyCacheTransceiver trampoline's bool cancelRequest override is
updated to match the new shared_ptr signature.

The doc at docs/source/features/disagg-kv-transfer-session-lifecycle.md
spells out the full ownership chain: executor tracking + worker queue
both hold shared_ptr, and TransferSession::mRequest is an ephemeral
observer used only inside the worker frame where the shared_ptr is
already held.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…e size, test callsites

Three issues caught in self-review of the prior commits:

1. cancelRequestStructured returned BackendUnhealthy too eagerly — even
   for a request that was still queued pre-advertise. Pre-advertise
   release is always safe (no buffer was ever exposed to a peer), and
   freeing those resources during the unhealthy window REDUCES backend
   pressure rather than adding to it. Reorder: check sender / receiver
   queues first, return CancelledBeforeAdvertise unconditionally if
   found there. Only consult isHealthy() for the in-flight branch,
   where freeing would race with NIXL/UCX writes.

2. NB_TRAMPOLINE size was 6 but the trampoline has 7 NB_OVERRIDE_PURE
   entries (pre-existing under-allocation of nanobind's dispatch hash
   table). Bump to 7 since we touched the file.

3. cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp called
   mSender->sendAsync(*llmRequest) and mRequester->receiveAsync(*llmRequest)
   passing LlmRequest&. After the prior commit those signatures take
   std::shared_ptr<LlmRequest>. The test was already holding a
   shared_ptr (request->mLlmRequest), so just drop the dereference.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…V release + health-driven shutdown

The earlier commits surfaced the user-visible "exceeded timeout"
error and pinned the LlmRequest object via shared_ptr, but the
recovery story still had two gaps that an internal review caught:

1. When the C++ deadline-quarantine path marked a request as
   DISAGG_TRANS_ERROR, _can_terminate_request_now no longer recognised
   the request as "still in flight" (its state was no longer
   DISAGG_*_TRANS_IN_PROGRESS), so _do_terminate_request happily
   called free_resources() — releasing KV blocks back to the cache
   pool while the C++ worker thread might still be writing into them
   via NIXL/UCX. The shared_ptr lifetime fix protected the LlmRequest
   object but not its KV memory.

2. C++ exposed isHealthy() / getHealth() but nothing on the Python
   side consumed those signals. A worker whose backend was genuinely
   wedged would silently accumulate quarantined transfers forever,
   never triggering an orchestration restart.

This commit closes both:

* PyExecutor adds _pending_resource_release (a list of LlmRequest
  whose KV cleanup is deferred) and _maybe_release_pending_resources
  which polls cancel_request_structured each iteration. free_resources
  runs only when the structured cancel result transitions to
  AlreadyComplete / NotFound / CancelledBeforeAdvertise — the three
  outcomes that prove the worker has reached a final state.

* The timeout error paths in _check_disagg_ctx_cache_transfer_status
  and _handle_responses (gen side) call
  _defer_resource_release_for_inflight_transfer() before _handle_errors
  / _end_transfer_and_maybe_terminate, so the user-visible error
  response is still surfaced immediately while the actual
  free_resources() call defers until C++ quiesces.

* PyExecutor adds _check_transceiver_health which monitors
  transceiver.is_healthy(). It records the first-unhealthy timestamp
  and, after a grace period (default 2 * global_progress_deadline_seconds,
  falling back to 120s), sets self.is_shutdown so
  orchestration's existing restart path takes the worker out of
  service. Recovery from a transient unhealthy window does not
  trigger restart.

* Both new methods plus the existing _check_kv_transfer_timeout are
  called at every executor-loop tick that already checks for KV
  transfer timeouts — five sites across PP / non-overlap / overlap
  loops.

The doc gains a "Recovery model" section that spells out the three
timescales: user-visible error (immediate), KV resource release
(bounded by worker quiescence), worker restart on persistent wedge
(global deadline + grace).

Tests:

* Five new unit tests in test_kv_transfer_session_lifecycle.py:
  - test_defer_resource_release_holds_until_quiesced (full sequence
    timeout → in-flight → still in-flight → quiesced → freed).
  - test_defer_resource_release_handles_not_found_as_safe.
  - test_defer_resource_release_no_op_when_empty.
  - test_check_transceiver_health_resets_on_recovery (transient
    unhealthy must not shut down).
  - test_check_transceiver_health_triggers_shutdown_after_grace
    (sustained unhealthy DOES shut down).
* All 19 tests pass without GPU/MPI.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>