[https://nvbugs/6104831][fix] Disagg KV transfer: non-blocking polling, structured cancel, bounded quarantine #13796

Draft
yifjiang wants to merge 5 commits into NVIDIA:main from yifjiang:fix/disagg-kv-transfer-session

Conversation

yifjiang (Contributor) commented May 6, 2026

Summary

Disaggregated KV transfer cancellation, timeout, and process-health
contract — a layered fix on top of the analysis in
docs/source/features/disagg-kv-transfer-hang-restart-analysis.md.
Earlier attempts (#13706, #13713, #13728) each fixed one slice of
the problem but still hung in the field.

This PR separates the four concerns that the earlier PRs mixed:

| Responsibility | Owner | Mechanism |
| --- | --- | --- |
| Object lifetime | C++ transceiver | `shared_ptr<LlmRequest>` through the public API + tracked-future vectors. |
| Resource quiescence | Python executor + C++ transceiver | Worker future stays pinned until ready / error / quarantine; Python's `_can_terminate_request_now` defers `free_resources()` until safe. |
| Request failure | Python executor | Surface a request-level error to the user only after C++ reports a final state. |
| Process health | C++ transceiver | Bounded quarantine + global progress deadline → `isHealthy() == false`. |

Base: b9ce4b69d12fe5ba65d13893111b1a2ea29413ee. Two commits.

Commit 1 — non-blocking polling, structured cancel, bounded quarantine

C++ — status polling is non-blocking by default

checkContextTransferStatus and checkGenTransferStatus poll worker
futures with wait_for(0ms) and never call future.get() on an
unready entry. When atLeastRequestNum > 0 is requested, the polling
code admits additional ready futures but skips unready ones rather than
blocking the executor thread. New explicit blocking variants exist for
shutdown only:

  • drainContextTransferStatus(bool markComplete)
  • drainGenTransferStatus()

This is the actual hang fix — see the analysis doc's "Why the C++-only
fix is not sufficient" section.
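
For illustration, a minimal sketch of the polling discipline, using simplified stand-in types (the real checkContextTransferStatus / checkGenTransferStatus operate on the richer tracked-future entries, but follow the same rule: poll with wait_for(0ms), and only call get() on a future that is already ready):

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

struct TrackedFutureSketch
{
    std::future<void> future; // worker-side transfer completion
};

// Returns the indices of entries whose worker future is already final,
// without ever parking the calling (executor) thread.
std::vector<std::size_t> pollReadyEntries(std::vector<TrackedFutureSketch>& tracked)
{
    std::vector<std::size_t> ready;
    for (std::size_t i = 0; i < tracked.size(); ++i)
    {
        auto& entry = tracked[i];
        if (!entry.future.valid())
        {
            continue;
        }
        // wait_for(0ms) is a pure poll; only a ready future is ever get()'d.
        if (entry.future.wait_for(std::chrono::milliseconds(0)) == std::future_status::ready)
        {
            entry.future.get();
            ready.push_back(i);
        }
        // Unready entries are skipped even when atLeastRequestNum > 0 asks for
        // more completions; only the blocking drain*() variants may wait.
    }
    return ready;
}
```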

C++ — structured TransferCancelResult enum

cancelRequestStructured(LlmRequest*) returns a structured outcome:

| Value | Caller may release request resources? |
| --- | --- |
| `NotFound` | yes — C++ already reached a final state |
| `AlreadyComplete` | yes — worker future is ready |
| `CancelledBeforeAdvertise` | yes — no buffer was advertised to the peer |
| `CancelRequestedInFlight` | no — worker may still touch buffers |
| `BackendUnhealthy` | no — orchestration must restart |
| `NotCancellable` | no — retry later |

The historical bool cancelRequest is preserved as a backward-compatible
wrapper that returns true only for the first three states.
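
For reference, a sketch of the outcome enum and the bool mapping; the enum values are the ones named above, while the helper name is an illustrative stand-in for the preserved wrapper:

```cpp
enum class TransferCancelResult
{
    NotFound,                 // C++ already reached a final state
    AlreadyComplete,          // worker future is ready
    CancelledBeforeAdvertise, // no buffer was advertised to the peer
    CancelRequestedInFlight,  // worker may still touch buffers
    BackendUnhealthy,         // orchestration must restart
    NotCancellable            // retry later
};

// The legacy bool contract: true only for the three states in which the
// caller may release request resources immediately.
inline bool cancelResultPermitsRelease(TransferCancelResult result)
{
    return result == TransferCancelResult::NotFound
        || result == TransferCancelResult::AlreadyComplete
        || result == TransferCancelResult::CancelledBeforeAdvertise;
}
```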

C++ — bounded quarantine and explicit health signal

When a tracked transfer exceeds kvTransferTimeoutMs and the worker
future is still not ready, the entry is flipped to quarantined: an
error is surfaced to the caller, but the future stays pinned so that
NIXL/UCX cannot write into freed memory. A counter increments. If
quarantined entries exceed mQuarantineBudget or no worker has
reached a final state for longer than mGlobalProgressDeadlineMs,
isHealthy() flips false. getHealth() returns a TransceiverHealth
snapshot for orchestration / metrics. Defaults: budget 16, deadline 60s.
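
A rough sketch of how the two thresholds combine into the health signal; the tracker class and member names below are assumptions, only the defaults (budget 16, deadline 60s) and the isHealthy() / getHealth() surface come from this PR:

```cpp
#include <chrono>
#include <cstddef>

struct TransceiverHealthSketch
{
    std::size_t quarantinedCount;
    bool healthy;
};

class HealthTrackerSketch
{
public:
    // Called when a per-request timeout fires while the worker future is unready.
    void onQuarantine() { ++mQuarantinedCount; }

    // Any worker reaching a final state counts as global progress.
    void onWorkerFinalState() { mLastProgress = std::chrono::steady_clock::now(); }

    bool isHealthy() const
    {
        bool const overBudget = mQuarantinedCount > mQuarantineBudget;
        bool const stalled = std::chrono::steady_clock::now() - mLastProgress > mGlobalProgressDeadline;
        return !overBudget && !stalled;
    }

    TransceiverHealthSketch getHealth() const { return {mQuarantinedCount, isHealthy()}; }

private:
    std::size_t mQuarantineBudget{16};                        // default budget
    std::chrono::milliseconds mGlobalProgressDeadline{60000}; // default 60 s
    std::size_t mQuarantinedCount{0};
    std::chrono::steady_clock::time_point mLastProgress{std::chrono::steady_clock::now()};
};
```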

Python — per-request policy, no fail-close

PyExecutor adds _can_terminate_request_now and
_inflight_cancel_requested_ids. _do_terminate_request defers freeing
resources for any request that is still in disagg transfer state or
whose cancel is in flight. Per-request timeouts no longer clear
active_requests, the waiting queue, or set is_shutdown — that was
the #13706 restart-loop trigger and is explicitly forbidden by the
analysis doc.

The timeout/cancel path now uses cancel_request_structured and only
calls _handle_errors([request]) when the structured result is one of
CancelledBeforeAdvertise / AlreadyComplete / NotFound. For
CancelRequestedInFlight and BackendUnhealthy, the request stays in
active_requests, the cancel is recorded, and the next iteration
re-polls C++.

Python — ADP synchronized response flush (port from #13112)

The base commit predates #13112. Without it, calling _enqueue_responses
inside _end_transfer_and_maybe_terminate deadlocks with ADP because
only the owning DP rank reaches that point. Transfer-completion responses
are now buffered in _pending_transfer_responses and flushed at
synchronised executor-loop points where every DP rank participates in
the tp_gather collective. The flush is symmetric: even when the buffer is
empty and enable_attention_dp=True, _enqueue_responses([]) is still called
so the collective completes.

Commit 2 — shared_ptr<LlmRequest> lifetime refactor

Equivalent to the lifetime hardening that #13713 applied. Pin the
LlmRequest C++ object through the entire transceiver path so it
outlives every worker thread access regardless of when Python's
_terminate_request drops its pybind reference.

  • respondAndSendAsync, requestAndReceiveSync, requestAndReceiveAsync,
    cancelRequest, and cancelRequestStructured all take
    std::shared_ptr<LlmRequest>.
  • TrackedFuture::request is now shared_ptr<LlmRequest>; the
    shared_ptr in mSenderFutures / mRequesterFutures keeps the
    LlmRequest pinned for as long as the transceiver tracks the worker
    future.
  • trtGptModelInflightBatching.cpp passes the shared_ptr directly
    instead of .get()-stripping it.
  • The nanobind trampoline (PyCacheTransceiver) signatures match.
    <nanobind/stl/shared_ptr.h> (already included in the binding file)
    provides a Python-aware shared_ptr that pins the wrapper alive while
    C++ holds it.

The Python _can_terminate_request_now guard stays in place, but its
job is no longer object lifetime (the shared_ptr handles that). It is
now strictly about resource quiescence: free_resources() releases
KV blocks back to the pool, and the transport may still be writing
into them even while the LlmRequest C++ object is alive. Termination must
still wait for the structured cancel result to signal safety.
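
A simplified sketch of the ownership pattern; the class and member names are stand-ins, and only the shared_ptr-taking signatures and the tracked-future pinning come from this PR (the by-value lambda capture follows the worker-queue follow-up described in the commit messages below):

```cpp
#include <future>
#include <memory>
#include <utility>
#include <vector>

class LlmRequest; // stand-in for the real batch-manager class

struct TrackedFutureSketch
{
    std::shared_ptr<LlmRequest> request; // pins the request while the future is tracked
    std::future<void> future;            // worker-side transfer completion
};

class CacheTransceiverSketch
{
public:
    // The public API takes shared_ptr, so Python dropping its binding
    // reference can no longer free the C++ object under a worker thread.
    void respondAndSendAsync(std::shared_ptr<LlmRequest> request)
    {
        // By-value capture gives the worker thread its own pin, independent
        // of when the executor erases its tracking entry.
        auto future = std::async(std::launch::async, [request]() { /* transfer work */ });
        mSenderFutures.push_back({std::move(request), std::move(future)});
    }

private:
    std::vector<TrackedFutureSketch> mSenderFutures;
};
```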

Testing

  • tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py
    — 14 unit tests covering the deferred-terminate guard, the structured-
    cancel decision tree (including the NotFound-after-final-state
    case), the unhealthy-backend pin, and ADP flush symmetry. They run
    without GPU or MPI so they fail fast in pre-merge CI:

    $ python3 -m pytest tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py -v
    ============================== 14 passed in 0.07s ==============================
    
  • Existing cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
    exercises the public API only and is not affected by the
    mSenderFutures/mRequesterFutures storage refactor.

Test plan

  • pytest tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py
  • bot run --extra-stage "<cache-transceiver multi-GPU stages>" (C++ multi-GPU tests)
  • 1P1D cancel-burst harness — confirm no hang, no restart loop, no
    buffer reuse (the analysis-doc regression scenarios).
  • ADP gen-first regression test
    (accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_gen_first).
  • Field repro of NVBug 6104831 with MALLOC_PERTURB_=85 to confirm
    the LlmRequest UAF class is closed by commit 2.

Documentation

  • docs/source/features/disagg-kv-transfer-hang-restart-analysis.md
    — the analysis that anchors this fix direction.
  • docs/source/features/disagg-kv-transfer-session-lifecycle.md
    — the contract this PR establishes (status polling, cancel result,
    quarantine budget, ADP flush, Python policy, object lifetime).

What this PR is not

  • It does not add a BufferIndexHolder::poison() API (#13728's
    approach). Quarantine is tracked at the transceiver level via a
    bounded counter + global progress deadline, which is sufficient to
    flip isHealthy() and let orchestration restart cleanly without
    per-buffer poison plumbing.

…g, structured cancel, bounded quarantine

The C++ cache transceiver and Python executor cooperate on disaggregated
KV transfer cancellation, timeout, and process health. Earlier attempts
(PR NVIDIA#13706, PR NVIDIA#13713, PR NVIDIA#13728) each fixed one slice of the problem
but still hung in the field:

* PR NVIDIA#13706's Python changes restarted workers on every routine
  per-request timeout (the Python fail-close was too aggressive).
* PR NVIDIA#13713 / PR NVIDIA#13728 hung even with Python cleanup removed because
  the C++ status polling could still call future.get() on an unready
  worker future, freezing the executor event loop while NIXL/UCX was
  wedged.

This PR keeps each layer's responsibility separate, as described in
docs/source/features/disagg-kv-transfer-hang-restart-analysis.md and
the new docs/source/features/disagg-kv-transfer-session-lifecycle.md:

* checkContextTransferStatus / checkGenTransferStatus are now strictly
  non-blocking. They poll with wait_for(0ms) and only call future.get()
  when the future is already ready. atLeastRequestNum > 0 still admits
  additional ready entries but never selects an unready one to satisfy
  the count. drainContextTransferStatus / drainGenTransferStatus are
  the only blocking variants and are intended for shutdown drain.

* cancelRequestStructured returns a TransferCancelResult enum with
  six outcomes (NotFound, AlreadyComplete, CancelledBeforeAdvertise,
  CancelRequestedInFlight, BackendUnhealthy, NotCancellable). Only
  the first three permit Python to free request resources; the others
  keep the request owned by C++ until the worker future reaches a
  final state. The historical bool cancelRequest is preserved as a
  backward-compatible wrapper.

* The transceiver maintains a bounded quarantine counter and a global
  progress deadline. A per-request timeout marks the entry quarantined
  but keeps the future pinned so NIXL/UCX cannot write into freed
  memory. If quarantined entries exceed mQuarantineBudget or no
  worker has reached a final state for longer than
  mGlobalProgressDeadlineMs, isHealthy() flips false and getHealth()
  surfaces the snapshot for orchestration.

* PyExecutor adds _can_terminate_request_now / _inflight_cancel_requested_ids
  so _do_terminate_request defers freeing resources for
  any request that is still in disagg transfer state or whose cancel
  is in flight. Per-request timeouts no longer clear active_requests,
  the waiting queue, or set is_shutdown — that was the PR NVIDIA#13706
  restart-loop trigger and is explicitly forbidden by the analysis
  doc.

* The ADP synchronized pending-response flush from PR NVIDIA#13112 is
  ported here (the base commit precedes that merge). Transfer-
  completion responses created in _end_transfer_and_maybe_terminate
  are buffered and flushed at synchronised loop points so every DP
  rank participates in the tp_gather collective.

Test coverage:

* tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py
  — 14 unit tests for the deferred-terminate guard, the structured-
  cancel decision tree, and the ADP flush symmetry. They run without
  GPU or MPI so they fail fast in pre-merge CI.

References:
* analysis: docs/source/features/disagg-kv-transfer-hang-restart-analysis.md
* contract: docs/source/features/disagg-kv-transfer-session-lifecycle.md

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added 3 commits May 5, 2026 21:10
…the cache transceiver

Follow-up to the previous commit on this branch. Changes the
BaseCacheTransceiver / CacheTransceiver public API and the tracked-
future vectors from raw LlmRequest* to std::shared_ptr<LlmRequest> so
that the C++ object outlives every worker thread access regardless of
when Python's _terminate_request drops its pybind reference. This
closes the historical use-after-free class on raw LlmRequest* that
showed up in field traces as mRequestId == 0x5555555555555555.

Specifically:

* respondAndSendAsync, requestAndReceiveSync, requestAndReceiveAsync,
  cancelRequest, and cancelRequestStructured all take
  std::shared_ptr<LlmRequest>.
* TrackedFuture::request is now std::shared_ptr<LlmRequest>; the
  shared_ptr in mSenderFutures / mRequesterFutures keeps the LlmRequest
  pinned for as long as the transceiver tracks the worker future.
* trtGptModelInflightBatching.cpp now passes the shared_ptr directly
  instead of stripping it via .get().
* The nanobind trampoline (PyCacheTransceiver) signatures match. The
  binding relies on <nanobind/stl/shared_ptr.h> (already included) to
  provide a Python-aware shared_ptr that pins the wrapper alive while
  C++ holds it.

The Python-side _can_terminate_request_now guard stays in place. Its
job is no longer about lifetime — the shared_ptr handles that — but
about resource quiescence: free_resources() releases KV blocks back to
the pool, and the transport may still be writing into them even while the
LlmRequest object is alive. Termination must still wait for the
structured cancel result to signal safety.

The doc at docs/source/features/disagg-kv-transfer-session-lifecycle.md
is updated with a new "Object lifetime" section.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…r queues

Follow-up to the prior commit. The worker queues inside CacheSender::Impl
and CacheReceiver::Impl still held raw LlmRequest* pointers — meaning
the executor's mSenderFutures / mRequesterFutures pinned the lifetime
but the worker thread that actually dereferences the request had only
a raw observer. If the executor erased its tracking entry while the
worker was mid-flight, the LlmRequest could be freed under the worker.

This commit closes that surface:

* CacheSender::Impl::Response::mRequest → std::shared_ptr<LlmRequest>.
* CacheReceiver::Impl::RequestAndPromise::mRequest → std::shared_ptr<LlmRequest>.
  Move/copy semantics simplified now that the field is a smart pointer.
* CacheSender::sendAsync(LlmRequest&) → sendAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::receiveAsync(LlmRequest&) → receiveAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::Impl::requestAndReceiveAsyncMultiThreads similarly.
* CacheReceiver::Impl::receiveAsync now captures the shared_ptr by value
  in the std::async lambda so the worker thread pins the LlmRequest
  independently of the caller.
* CacheTransceiver::respondAndSendAsync/-LayerWise/requestAndReceive*
  pass the shared_ptr (no longer .get()-strip it) and move into the
  TrackedFuture entry where appropriate.

Also fixes the eval-order UAF in CacheSender::Impl::handleAsyncSend that
PR NVIDIA#13713 / PR NVIDIA#13728 called out: once Response::mRequest is a
shared_ptr, the one-liner

    sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp));

becomes undefined behaviour because C++ argument evaluation order is
unspecified: the Response argument may be move-constructed from resp
first, leaving resp.mRequest empty by the time mRequestId is read.
Materialise the id into a local before the move.
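
A minimal standalone illustration of that fix, using placeholder types (the real Response and sendAndRemoveResponse live in CacheSender::Impl); only the ordering problem and its remedy mirror the commit:

```cpp
#include <cstdint>
#include <memory>
#include <utility>

struct LlmRequestStub { std::uint64_t mRequestId; };
struct ResponseStub { std::shared_ptr<LlmRequestStub> mRequest; };

void sendAndRemoveResponse(std::uint64_t /*requestId*/, ResponseStub /*response*/) {}

void handleAsyncSendSketch(ResponseStub& resp)
{
    // Undefined behaviour: the ResponseStub argument may be move-constructed
    // from resp (emptying resp.mRequest) before resp.mRequest->mRequestId is
    // read, because function-argument evaluation order is unspecified:
    //     sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp));

    // Fix: materialise the id into a local before moving.
    auto const requestId = resp.mRequest->mRequestId;
    sendAndRemoveResponse(requestId, std::move(resp));
}
```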

The PyCacheTransceiver trampoline's bool cancelRequest override is
updated to match the new shared_ptr signature.

The doc at docs/source/features/disagg-kv-transfer-session-lifecycle.md
spells out the full ownership chain: executor tracking + worker queue
both hold shared_ptr, and TransferSession::mRequest is an ephemeral
observer used only inside the worker frame where the shared_ptr is
already held.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…e size, test callsites

Three issues caught in self-review of the prior commits:

1. cancelRequestStructured returned BackendUnhealthy too eagerly — even
   for a request that was still queued pre-advertise. Pre-advertise
   release is always safe (no buffer was ever exposed to a peer), and
   freeing those resources during the unhealthy window REDUCES backend
   pressure rather than adding to it. Reorder: check sender / receiver
   queues first, return CancelledBeforeAdvertise unconditionally if
   found there. Only consult isHealthy() for the in-flight branch,
   where freeing would race with NIXL/UCX writes.

2. NB_TRAMPOLINE size was 6 but the trampoline has 7 NB_OVERRIDE_PURE
   entries (pre-existing under-allocation of nanobind's dispatch hash
   table). Bump to 7 since we touched the file.

3. cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp called
   mSender->sendAsync(*llmRequest) and mRequester->receiveAsync(*llmRequest)
   passing LlmRequest&. After the prior commit those signatures take
   std::shared_ptr<LlmRequest>. The test was already holding a
   shared_ptr (request->mLlmRequest), so just drop the dereference.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…V release + health-driven shutdown

The earlier commits surfaced the user-visible "exceeded timeout"
error and pinned the LlmRequest object via shared_ptr, but the
recovery story still had two gaps that an internal review caught:

1. When the C++ deadline-quarantine path marked a request as
   DISAGG_TRANS_ERROR, _can_terminate_request_now no longer recognised
   the request as "still in flight" (its state was no longer
   DISAGG_*_TRANS_IN_PROGRESS), so _do_terminate_request happily
   called free_resources() — releasing KV blocks back to the cache
   pool while the C++ worker thread might still be writing into them
   via NIXL/UCX. The shared_ptr lifetime fix protected the LlmRequest
   object but not its KV memory.

2. C++ exposed isHealthy() / getHealth() but nothing on the Python
   side consumed those signals. A worker whose backend was genuinely
   wedged would silently accumulate quarantined transfers forever,
   never triggering an orchestration restart.

This commit closes both:

* PyExecutor adds _pending_resource_release (a list of LlmRequest
  whose KV cleanup is deferred) and _maybe_release_pending_resources
  which polls cancel_request_structured each iteration. free_resources
  runs only when the structured cancel result transitions to
  AlreadyComplete / NotFound / CancelledBeforeAdvertise — the three
  outcomes that prove the worker has reached a final state.

* The timeout error paths in _check_disagg_ctx_cache_transfer_status
  and _handle_responses (gen side) call
  _defer_resource_release_for_inflight_transfer() before _handle_errors
  / _end_transfer_and_maybe_terminate, so the user-visible error
  response is still surfaced immediately while the actual
  free_resources() call defers until C++ quiesces.

* PyExecutor adds _check_transceiver_health which monitors
  transceiver.is_healthy(). It records the first-unhealthy timestamp
  and, after a grace period (default 2 * global_progress_deadline_seconds,
  falling back to 120s), sets self.is_shutdown so
  orchestration's existing restart path takes the worker out of
  service. Recovery from a transient unhealthy window does not
  trigger restart.

* Both new methods plus the existing _check_kv_transfer_timeout are
  called at every executor-loop tick that already checks for KV
  transfer timeouts — five sites across PP / non-overlap / overlap
  loops.

The doc gains a "Recovery model" section that spells out the three
timescales: user-visible error (immediate), KV resource release
(bounded by worker quiescence), worker restart on persistent wedge
(global deadline + grace).

Tests:

* Five new unit tests in test_kv_transfer_session_lifecycle.py:
  - test_defer_resource_release_holds_until_quiesced (full sequence
    timeout → in-flight → still in-flight → quiesced → freed).
  - test_defer_resource_release_handles_not_found_as_safe.
  - test_defer_resource_release_no_op_when_empty.
  - test_check_transceiver_health_resets_on_recovery (transient
    unhealthy must not shut down).
  - test_check_transceiver_health_triggers_shutdown_after_grace
    (sustained unhealthy DOES shut down).
* All 19 tests pass without GPU/MPI.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>