[https://nvbugs/6104831][fix] Disaggregated KV transfer: lifecycle, cancellation, and quiescence hardening #13713
Draft
chienchunhung wants to merge 8 commits into NVIDIA:main from
Conversation
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 3, 2026
Adds 00-tldr.md as the front door to the investigation. Targets a 10-minute read with two inline Mermaid figures:
- The disagg architecture + cancellation flow (where the bugs live).
- The L1-L8 → combo PR NVIDIA#13713 fix mapping (which piece closes which layer).

Written under the assumption that PR NVIDIA#13713 (combo: PR NVIDIA#13056 + PR NVIDIA#13495 + eval-order fix + Python idempotency guards) is the chosen final solution. Doesn't compare approaches A/B/C — that comparison lives in 06-fix-approaches/README.md for readers who want depth on the landing decision.

Section structure:
1. What is broken (the wedge symptom).
2. Where the bugs live (architecture figure).
3. Root cause: eight invariant gaps (L1-L8 table).
4. The fix (combo figure + per-piece breakdown).
5. Does it work (empirical recovery table).
6. Why this took 8 days to find (cascade pattern).
7. What is left to do.
8. Where to go from here (reader paths to deeper files).

README.md updated to put 00-tldr.md at the top of the navigation table and add a 10-minute reading path entry.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
631975c to 262796c
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 5, 2026
…rt, and L9 invariant

Update the NVBug 6104831 investigation report to reflect the current state of the combo fix path (PR NVIDIA#13713):
- Fold PR NVIDIA#13728 (fail-closed on unquiesced disagg KV transfer) into the combo as a fifth piece, plus the MLA-formatter port that closes the same hazard for DeepSeek-style models.
- Add L9 (transport quiescence on unsafe exit) to the defect-class stack as a defense-in-depth memory-safety invariant. L1-L8 are wedge-prevention; L9 is the rip-cord that prevents silent buffer-pool corruption on cancel/exception when transport quiescence is unknown.
- Update the empirical results tables across README.md, 00-tldr.md, 06-fix-approaches/D-combo.md, and 06-fix-approaches/README.md to show CONC=128 (3-pair) review-fix-v3 5/5 PASS post-fold-in plus the prior pre-fold-in stress runs through CONC=256.
- Add MLA-stress validation as a follow-up item and motivate the RAII-audit task with the MLA-formatter finding.
- Update the PR map to include NVIDIA#13728 and the chained sig regression PRs that the combo retains as test scaffolding.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
yifjiang added a commit to yifjiang/TensorRT-LLM that referenced this pull request on May 6, 2026
…r queues

Follow-up to the prior commit. The worker queues inside CacheSender::Impl and CacheReceiver::Impl still held raw LlmRequest* pointers — meaning the executor's mSenderFutures / mRequesterFutures pinned the lifetime but the worker thread that actually dereferences the request had only a raw observer. If the executor erased its tracking entry while the worker was mid-flight, the LlmRequest could be freed under the worker. This commit closes that surface:
* CacheSender::Impl::Response::mRequest → std::shared_ptr<LlmRequest>.
* CacheReceiver::Impl::RequestAndPromise::mRequest → std::shared_ptr<LlmRequest>. Move/copy semantics simplified now that the field is a smart pointer.
* CacheSender::sendAsync(LlmRequest&) → sendAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::receiveAsync(LlmRequest&) → receiveAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::Impl::requestAndReceiveAsyncMultiThreads similarly.
* CacheReceiver::Impl::receiveAsync now captures the shared_ptr by value in the std::async lambda so the worker thread pins the LlmRequest independently of the caller.
* CacheTransceiver::respondAndSendAsync/-LayerWise/requestAndReceive* pass the shared_ptr (no longer .get()-strip it) and move into the TrackedFuture entry where appropriate.

Also fixes the eval-order UAF in CacheSender::Impl::handleAsyncSend that PR NVIDIA#13713 / PR NVIDIA#13728 called out: once Response::mRequest is a shared_ptr, the one-liner sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp)); becomes undefined behaviour because C++ argument evaluation order is unspecified — the compiler may evaluate std::move(resp) first, leaving resp.mRequest empty when reading mRequestId. Materialise the id into a local before the move.

The PyCacheTransceiver trampoline's bool cancelRequest override is updated to match the new shared_ptr signature.

The doc at docs/source/features/disagg-kv-transfer-session-lifecycle.md spells out the full ownership chain: executor tracking + worker queue both hold shared_ptr, and TransferSession::mRequest is an ephemeral observer used only inside the worker frame where the shared_ptr is already held.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/TensorRT-LLM that referenced this pull request on May 6, 2026
…g, structured cancel, bounded quarantine

The C++ cache transceiver and Python executor cooperate on disaggregated KV transfer cancellation, timeout, and process health. Earlier attempts (PR NVIDIA#13706, PR NVIDIA#13713, PR NVIDIA#13728) each fixed one slice of the problem but still hung in the field:
* PR NVIDIA#13706 with Python changes restarted workers on every routine per-request timeout (Python fail-close was too aggressive).
* PR NVIDIA#13713 / PR NVIDIA#13728 hung even with Python cleanup removed because the C++ status polling could still call future.get() on an unready worker future, freezing the executor event loop while NIXL/UCX was wedged.

This PR keeps each layer's responsibility separate, as described in docs/source/features/disagg-kv-transfer-hang-restart-analysis.md and the new docs/source/features/disagg-kv-transfer-session-lifecycle.md:
* checkContextTransferStatus / checkGenTransferStatus are now strictly non-blocking. They poll with wait_for(0ms) and only call future.get() when the future is already ready. atLeastRequestNum > 0 still admits additional ready entries but never selects an unready one to satisfy the count. drainContextTransferStatus / drainGenTransferStatus are the only blocking variants and are intended for shutdown drain.
* cancelRequestStructured returns a TransferCancelResult enum with six outcomes (NotFound, AlreadyComplete, CancelledBeforeAdvertise, CancelRequestedInFlight, BackendUnhealthy, NotCancellable). Only the first three permit Python to free request resources; the others keep the request owned by C++ until the worker future reaches a final state. The historical bool cancelRequest is preserved as a backward-compatible wrapper.
* The transceiver maintains a bounded quarantine counter and a global progress deadline. A per-request timeout marks the entry quarantined but keeps the future pinned so NIXL/UCX cannot write into freed memory. If quarantined entries exceed mQuarantineBudget or no worker has reached a final state for longer than mGlobalProgressDeadlineMs, isHealthy() flips false and getHealth() surfaces the snapshot for orchestration.
* PyExecutor adds _can_terminate_request_now / _inflight_cancel_requested_ids so _do_terminate_request defers freeing resources for any request that is still in disagg transfer state or whose cancel is in flight. Per-request timeouts no longer clear active_requests, the waiting queue, or set is_shutdown — that was the PR NVIDIA#13706 restart-loop trigger and is explicitly forbidden by the analysis doc.
* The ADP synchronized pending-response flush from PR NVIDIA#13112 is ported here (the base commit precedes that merge). Transfer-completion responses created in _end_transfer_and_maybe_terminate are buffered and flushed at synchronised loop points so every DP rank participates in the tp_gather collective.

Test coverage:
* tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py — 14 unit tests for the deferred-terminate guard, the structured-cancel decision tree, and the ADP flush symmetry. They run without GPU or MPI so they fail fast in pre-merge CI.

References:
* analysis: docs/source/features/disagg-kv-transfer-hang-restart-analysis.md
* contract: docs/source/features/disagg-kv-transfer-session-lifecycle.md

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
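The strictly non-blocking polling contract this commit describes can be sketched in Python with `concurrent.futures` (the names `check_transfer_status` and `at_least_num` are illustrative stand-ins, not the real C++ API): only futures that are already done are harvested, and the minimum count never forces a wait on an unready future.

```python
from concurrent.futures import Future
from typing import Dict, List

def check_transfer_status(futures: Dict[int, Future], at_least_num: int) -> List[int]:
    """Return request ids whose transfer future is terminal, without blocking.

    at_least_num is a target, not a guarantee: if fewer than at_least_num
    futures are ready, we return what we have rather than block on an
    unready future (blocking here is the bug this change fixes).
    """
    done = [req_id for req_id, fut in futures.items() if fut.done()]
    for req_id in done:
        futures.pop(req_id).result()  # safe: already done, result() cannot block
    return done

f1, f2 = Future(), Future()
f1.set_result(None)                  # f1 is ready
pending = {1: f1, 2: f2}             # f2 is still in flight
assert check_transfer_status(pending, at_least_num=2) == [1]
assert 2 in pending                  # the unready future stays pinned, untouched
```

Under this contract a wedged NIXL/UCX worker leaves its future permanently unready, but the executor loop keeps spinning; timeout handling, not the poll, is responsible for escalating.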
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 7, 2026
…sig NVIDIA#8, Phase 15

The combo (PR NVIDIA#13713 + NVIDIA#13728 fold + MLA port) regressed when applied to rc13: with rc13's default-on block reuse, `_handle_responses`'s early-termination branch and `_end_transfer_and_maybe_terminate` can each refuse termination under the right timing, leaving the request with no cleanup owner. Server hangs on scenarios that succeeded on rc11.

Captures this in the investigation report:
- 03-defect-class-stack.md: add L10 (redundant block-reuse cleanup mechanism on the disagg path) with code sites, customer-visible symptom, and the latent symptoms the Phase 1 stop-gap leaves open (pin leak on cancel/timeout, PP > 1 disagg without block reuse, eviction race in the unpin → release window, regression risk on adjacent code).
- 03-defect-class-stack.md: extend the layer-to-signature mermaid with L10 → sig NVIDIA#8 plus a dotted edge to the latent-symptom set.
- 02-failure-signatures.md: add sig NVIDIA#8 (rc13 server hang under disagg + block reuse + in-flight cancel) with full root-cause, short-term stop-gap, and medium-term Phase 2 plan.
- 05-investigation-timeline.md: add Phase 15 documenting the rc13 regression discovery, the two competing fix proposals, and the recommended staged plan.
- 08-next-steps-and-pr-map.md: split item 2 into "land combo with rc13 stop-gap" and a new item 2a "land Phase 2 of the block-reuse-overlap-scheduler design".
- README.md: add the rc13 caveat callout in the status section; update navigation language from L1-L8/seven-sig to L1-L10/eight-sig.

Adds incentive cross-reference in the existing design doc:
- docs/design/block-reuse-overlap-scheduler/README.md: promote Phase 2 status from "deprioritised" to "load-bearing for stable disagg block reuse" with cross-link to the rc13 evidence.
- docs/design/block-reuse-overlap-scheduler/phase2-unify-reuse-mechanisms.md: add an "Empirical confirmation: the rc13 regression" section at the top, framing Phase 2 as the architectural answer the rc13 bug predicted.
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 7, 2026
Adds 09-executive-summary-rc11-to-rc13.md as the 15-minute read for someone briefing the full journey: original rc11 wedge, how PR NVIDIA#13713 solved it, why it regressed on rc13 (block reuse → L10 dual-path), how the short-term stop-gap unblocks rc13, and why the design doc's Phase 2 is the right long-term answer.

Five Mermaid figures:
- L1-L9 layer stack as the rc11 root cause framework
- PR NVIDIA#13713's four-piece composition with L1-L9 mapping
- The L10 dual-path: both cleanup paths refusing termination on rc13
- Stop-gap coverage vs latent symptoms it leaves open
- Add-coordination vs delete-redundancy comparison + staged plan

Updates README.md navigation table and reading paths to point at the new file with a "15 minutes / brief someone on the full journey" entry. The file extends 00-tldr.md (10-minute read on rc11 only). Where 00-tldr stops at "the rc11 wedge is fixed", 09 picks up at the rc13 regression discovery and walks through to the architectural follow-up. Designed for an exec briefing or a teammate joining mid-investigation.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 7, 2026
…cancellation/cleanup invariants
Two changes to 09-executive-summary-rc11-to-rc13.md per review feedback:
1. Treat the file as an independent report. Removed the "extends
00-tldr.md with the rc13 chapter" framing; removed the inline
"> Detail: <other-file>.md" pointers from each section; restated
the ten invariants in-place rather than referring readers to
03-defect-class-stack.md to follow along; moved the cross-file
pointers to a clearly-marked "Optional deep-dive pointers"
appendix at the end. A reader can now walk away with a complete
understanding of the bug class, the fix, the regression, and the
plans without needing to consult any other file.
2. Add a new section 2 "The invariants for correct request
cancellation and cleanup" that names the ten invariants the bug
class violates, organised into four architectural categories:
- Lifetime invariants (L2/L7/L9): what stays alive while
transfers are in flight.
- Resource invariants (L5/L6): every acquired resource must
release on every exit.
- Synchronization invariants (L1/L3/L4): no thread waits forever.
- Coordination invariants (L8/L10): exactly one owner, no
implicit handoffs.
Each invariant is named, defined, and explained in terms of why it
matters (with the rc11/rc13 evidence that violated it). Section 3
onwards then describes PR NVIDIA#13713, the rc13 regression, the stop-gap,
and the architectural fix in terms of which invariants they enforce
or fail to enforce.
A new Mermaid figure visualises the four invariant categories so a
reader can see the architectural hierarchy at a glance: lifetime
governs who is alive, resource governs what is held, synchronization
governs who is waiting, coordination governs who is responsible.
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request on May 7, 2026
…p-dive pointers

Per review feedback:
- Title now just mentions the NVBug number: "09 — NVBug 6104831 Executive Summary".
- Preface trimmed to one sentence that briefly mentions PR NVIDIA#13713 and the rc11/rc13 contrast. Audience block, reading-time block, and self-contained note removed.
- "Optional deep-dive pointers" appendix removed entirely.

The file is now strictly the rc11 → rc13 narrative with the four-category invariant model, no meta-framing or cross-reference appendix.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
e4c232d to 87c4bb8
87c4bb8 to 484655e
Collaborator (Author): /bot run --disable-fail-fast --add-multi-gpu-test
Collaborator: PR_Github #47132 [ run ] triggered by Bot. Commit:
484655e to 5719008
tl;dr
This PR fixes a permanent disaggregated-serving wedge triggered by long-prompt, cancellation-heavy traffic in the PyTorch backend. The root cause is a stack of cleanup/lifetime invariant gaps across C++ KV transfer, Python request termination, NIXL cancellation, and block reuse. The final solution makes disaggregated KV cleanup transport-aware and block-reuse-safe: cancelled requests stop scheduling immediately, but KV resources are only released after C++ transfer status becomes terminal, and deferred termination is retried exactly once when it becomes safe.
Background
Disaggregated serving splits a request across a context (prefill) worker that produces the KV cache and a generation (decode) worker that consumes it, with the KV cache transferred between them over NIXL/UCX.
Under long-prompt bursts with client-side cancellations, the old cleanup path could leave the deployment alive but permanently non-responsive: workers stayed up, health endpoints could still return 200, but every post-burst request timed out.
The investigation found that this was not one bug. It was a stack of independent lifecycle gaps. Fixing one exposed the next.
Failure signatures observed
- `std::future_error: Broken promise` — `checkGenTransferStatus(atLeastNum=1)` could call `future.get()` on an unready future
- `std::future_error: Broken promise` — the `requestSync()` early-return path leaked a recv-buffer slot
- Use-after-free on `LlmRequest` — closed by moving async send/receive state to `shared_ptr<LlmRequest>`
Correctness model
The central invariant is:
Python request state alone is not sufficient. With block reuse enabled, a cancelled request may still have KV blocks referenced by the reuse tree or pinned by an in-flight context transfer. Freeing those blocks too early risks memory corruption; forgetting to free them later causes a permanent wedge.
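The correctness model above can be condensed into a single predicate. This is a minimal illustrative sketch (the type and field names are invented for this example, not TRT-LLM's actual API): KV resources may be freed only when every party is provably done with them.

```python
from dataclasses import dataclass

@dataclass
class RequestState:
    python_terminated: bool      # Python-side request state says "done"
    cpp_transfer_terminal: bool  # C++ transfer status reached a final state
    reuse_pinned: bool           # KV blocks still referenced by the reuse tree

def safe_to_free_kv(state: RequestState) -> bool:
    # Python state alone is not sufficient (the bug class): all three must hold.
    return (state.python_terminated
            and state.cpp_transfer_terminal
            and not state.reuse_pinned)

# A cancelled request whose context transfer is still in flight must NOT be freed.
assert not safe_to_free_kv(RequestState(True, False, False))
# Only once C++ reports terminal and no reuse pin remains is freeing safe.
assert safe_to_free_kv(RequestState(True, True, False))
```

Freeing when the predicate is false risks corruption; never re-evaluating it after it becomes true is the permanent wedge. The changes below enforce both directions.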
What this PR changes
1. C++ transfer request lifetime
The C++ transceiver now uses `std::shared_ptr<LlmRequest>` across async send/receive state, so Python-side termination cannot destroy a request object while C++ futures/workers still dereference it. This closes the UAF class that previously manifested as corrupted request IDs and first-request crashes.
2. Sender/receiver cancellation bookkeeping
Cancellation paths now fulfill promises before dropping queued/ready work. This prevents `std::future_error: Broken promise` from escaping on both sender and receiver sides.
3. Non-blocking generation transfer polling
`checkGenTransferStatus(atLeastNum=1)` no longer blocks indefinitely on an unready future. It skips unready entries during non-blocking polling and lets timeout handling drive cancellation/error state.
4. NIXL transfer cancellation and release
The NIXL path now releases transfer handles on cancellation (`TransferStatus::release()` → `nixlAgent::releaseXferReq()`), so cancelled requests do not leave backend transfer state stranded.
5. Eval-order fix after `shared_ptr`
After moving `Response::mRequest` to `shared_ptr`, the one-liner `sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp));` became unsafe: C++ does not specify function-argument evaluation order, so `std::move(resp)` could run before `resp.mRequest->mRequestId` is read. This PR materializes `reqId` into a local before the move.
6. Python idempotency guards
Generation-side disagg initialization and KV receive setup are now guarded by `py_request_id`, preventing repeated side effects such as duplicate `KVCacheManager::addSequence()` or duplicate `request_and_receive_async()` calls for the same request.
7. Fail-closed behavior for unknown transfer quiescence
If TRT-LLM cannot prove transport quiescence, it fails closed instead of returning buffers to the pool. This includes:
- `BufferIndexHolder::poison()` keeps a suspect recv-buffer slot out of the pool
- every transfer state is handled explicitly (ready, not-ready, cancelled/unknown)
- applied in both `cacheFormatter.cpp` and `mlaCacheFormatter.cpp`
This protects against silent buffer-pool corruption under cancellation races.
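A hedged sketch of the fail-closed idea in change 7, with a class that loosely mirrors `BufferIndexHolder` (the Python names and the string-valued transfer state are illustrative only): when quiescence cannot be proven, the slot is poisoned, a deliberate bounded leak, rather than returned to a pool the transport might still write into.

```python
class BufferSlotHolder:
    """Illustrative recv-buffer slot guard; not the real C++ class."""

    def __init__(self, pool: list, index: int):
        self.pool, self.index = pool, index
        self.poisoned = False

    def poison(self) -> None:
        # Fail closed: never recycle this slot. A bounded leak beats letting
        # NIXL/UCX write into memory that was handed to another request.
        self.poisoned = True

    def release(self, transfer_state: str) -> None:
        if transfer_state != "ready":        # not-ready / cancelled / unknown
            self.poison()
            return
        self.pool.append(self.index)         # provably quiesced: safe to recycle

pool = []
ok, bad = BufferSlotHolder(pool, 0), BufferSlotHolder(pool, 1)
ok.release("ready")
bad.release("cancelled/unknown")
assert pool == [0] and bad.poisoned
```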
8. Deferred cleanup for timed-out context/generation transfer
A successful `cancel_request()` is not a quiescence proof. The PR therefore defers Python-side resource cleanup after timeout cancellation until C++ transfer status reports the request terminal. This preserves memory safety: resources are not freed while C++/NIXL may still reference transfer buffers or KV blocks.
9. Block-reuse-safe deferred termination
rc13 enabled block reuse with the overlap scheduler, which exposed a new liveness hole.
Before the final fix:
This PR records whether termination was requested and whether resources were actually freed:
`AsyncTransferManager.end_transfer()` now returns:
When the transfer becomes terminal, `_end_transfer_and_maybe_terminate()` retries termination if an earlier attempt was deferred. This preserves both invariants:
_end_transfer_and_maybe_terminate()retries termination if an earlier attempt was deferred. This preserves both invariants:Why deferring cleanup is intentional
This PR intentionally defers cleanup in some cancellation paths.
That is the correct behavior because `cancel_request()` only requests cancellation. It does not prove that C++/NIXL has stopped touching transfer buffers or KV blocks.
The final behavior is:
So the PR does not mean "never clean up." It means "do not clean up until it is safe, and then guarantee cleanup happens."
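The "defer, then guarantee" contract of changes 8-9 can be sketched as follows. This is an illustrative model, not the real PyExecutor code: the manager class, its fields, and `request_terminate`'s signature are invented for the example, while `EndTransferResult.needs_termination` mirrors the field named above.

```python
from dataclasses import dataclass

@dataclass
class EndTransferResult:
    needs_termination: bool  # True if a deferred terminate must now be retried

class AsyncTransferManagerSketch:
    """Illustrative model of the deferred-termination bookkeeping."""

    def __init__(self):
        self.termination_requested = False
        self.resources_freed = False

    def request_terminate(self, transfer_terminal: bool) -> None:
        # Record intent unconditionally; free only if it is already safe.
        self.termination_requested = True
        if transfer_terminal:
            self.resources_freed = True   # safe now
        # else: defer until the transfer reaches a terminal state

    def end_transfer(self) -> EndTransferResult:
        # Called when C++ reports the transfer terminal: surface the debt so
        # the caller retries termination exactly once if it was deferred.
        return EndTransferResult(self.termination_requested and not self.resources_freed)

mgr = AsyncTransferManagerSketch()
mgr.request_terminate(transfer_terminal=False)   # cancel arrives mid-transfer: deferred
assert not mgr.resources_freed                   # nothing freed while unsafe
assert mgr.end_transfer().needs_termination      # the retry happens here, once
```

The two recorded bits keep the invariants independent: `termination_requested` guarantees cleanup is never forgotten, and `resources_freed` guarantees it never happens twice or too early.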
Tests
New/updated unit coverage includes:
- `checkGenTransferStatus(atLeastNum=1)` non-blocking behavior
- `EndTransferResult.needs_termination` coverage
Local test runs:
Notes / limitations
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.