
[https://nvbugspro.nvidia.com/bug/6104831][fix] Disagg transceiver: shared_ptr<LlmRequest>, RAII buffer holders, kv_transfer_timeout_ms enforcement#13056

Open
yifjiang wants to merge 16 commits into NVIDIA:main from yifjiang:fix/prefill-sender-total-timeout

Conversation


@yifjiang yifjiang commented Apr 15, 2026

Summary

Fixes NVBug 6104831: disaggregated-serving wedges where prefill/decode workers either crash with std::future_error: Broken promise / bad_optional_access, or silently leak KV-cache blocks and buffer-pool slots until the pool saturates and no new transfers can start. Full investigation notes are in the debug doc.

16 coupled C++/Python commits covering: (a) closing a raw-pointer UAF class on the transceiver's futures maps, (b) enforcing kv_transfer_timeout_ms on both sender and receiver, (c) plumbing perRequestCancel through the data-phase paths so evicted requests actually release their resources, (d) RAII on the buffer-index pool, and (e) hardening the drain worker against non-std::exception escapes.

One-sentence version: shared_ptr ownership + enforced deadline + working cancel path + RAII on pool slots.

Why all 16 commits?

Commit 14 (649d1466b) is the UAF fix proper — it changes mSenderFutures / mRequesterFutures in CacheTransceiver to hold std::shared_ptr<LlmRequest> instead of raw pointers. Confirmed via MALLOC_PERTURB_=85 (observed mRequestId=0x5555555555555555). With that ownership change, the Python-side Path A/B guards that previously deferred cleanup to avoid the UAF are no longer needed; commit 14 removes them.
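The ownership change can be sketched in isolation. The snippet below is a minimal illustration, not the real TensorRT-LLM code: `LlmRequest` and the futures container are simplified stand-ins, and `firstRequestId` is a hypothetical helper standing in for the transceiver's status checks.

```cpp
// Minimal sketch of the ownership change in commit 14: the futures map holds
// std::shared_ptr<LlmRequest> instead of a raw LlmRequest*, so the entry
// itself keeps the request alive until it is erased. All types here are
// simplified stand-ins for the real TensorRT-LLM classes.
#include <cstdint>
#include <future>
#include <memory>
#include <utility>
#include <vector>

struct LlmRequest  // stand-in for batch_manager's LlmRequest
{
    std::uint64_t mRequestId = 0;
};

// Before: std::vector<std::pair<LlmRequest*, std::future<void>>> — the raw
// pointer dangles once the owner frees the request. After: shared ownership,
// so erasing the entry is the only way the transceiver drops its reference.
using SenderFutures = std::vector<std::pair<std::shared_ptr<LlmRequest>, std::future<void>>>;

std::uint64_t firstRequestId(SenderFutures const& futures)
{
    // Safe even if every external reference has been dropped: the shared_ptr
    // stored in the map keeps the LlmRequest alive.
    return futures.front().first->mRequestId;
}
```

With raw pointers, dropping the last Python-side reference while the entry is still in the map is exactly the window where the `0x55` poison pattern shows up; with `shared_ptr`, that read stays well-defined.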

But fixing the UAF alone doesn't give a working system. These are coupled fixes, not independent — each addresses a distinct failure mode on the same bug family, and removing any one leaves a different no-recovery path open:

| Failure mode | Which commits fix it | What happens if you keep only commit 14 |
| --- | --- | --- |
| UAF / Broken promise / bad_optional_access | 14 + 4 (43b518070) | fixed |
| Stuck transfer never evicted (no deadline) | 1–3, 5 (a31977564, a3bce0ce3, a8c27a45b, 3d0a6d183) | executor survives, but stuck senders pin KV blocks forever → pool exhausts → new requests hang |
| Receiver std::async task blocked in recvReadySignal even after deadline fires | 6 (6bb116dac) + 7–10 cancel-flag plumbing (62c5d24ae, 7e8016a2f, e7fcbab01, 2860ac50a) | deadline fires, Python drops its ref, but the worker task is still waiting for a signal that'll never arrive → NIXL/UCX per-request state accumulates → "no recovery" wedge even after load stops |
| Buffer-pool slot leak on non-happy exits | 11–13 RAII holders (39f5452ea, 28f836e19, 6f85ce895) | one early return in requestSync → index never released → size-1 buffer pool permanently wedged |
| Unhandled non-std::exception in drain worker | 15 (abfcb6d39) | a single std::bad_optional_access (or similar) still kills the thread |
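The last failure mode (commit 15) amounts to widening the worker loop's exception net. The sketch below uses illustrative names and generic tasks; the real drain worker processes transfer completions, but the `catch (...)` structure is the point:

```cpp
// Sketch of the drain-worker hardening in commit 15: the loop catches both
// std::exception and, via catch (...), anything else, so a stray
// std::bad_optional_access or a non-std::exception throw is counted and
// logged instead of killing the thread. Names here are illustrative.
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

// Runs every task; returns the number that threw. The loop itself never
// exits early because of an exception.
int drainLoop(std::vector<std::function<void()>> const& tasks)
{
    int failures = 0;
    for (auto const& task : tasks)
    {
        try
        {
            task();
        }
        catch (std::exception const&)
        {
            ++failures;  // the real code would log e.what() here
        }
        catch (...)  // commit 15: non-std::exception escapes no longer kill the thread
        {
            ++failures;
        }
    }
    return failures;
}
```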

Concretely, running a 1P1D disagg loadgen stress reproducer (single prefill worker, single decode worker, concurrent request burst at the saturation point, then idle and probe for recovery):

  • Commit 14 alone → crash signatures (Broken promise, bad_optional_access) disappear, but stuck transfers still never get evicted. Pool exhausts, behavior looks superficially similar to the pre-fix wedge (no crash marker, no forward progress). Still no recovery after a 180 s idle probe window.
  • 14 + deadline commits (1–5) → deadline fires, but receiver zombies accumulate: each timed-out receive leaves an std::async task spinning inside recvReadySignal, pinning NIXL state. After a saturation window, every new transfer deterministically waits kv_transfer_timeout_ms even if the system is otherwise idle.
  • 14 + 1–5 + 6–10 → recovery works in principle, but buffer-pool slots leak on error paths. Slow drift toward wedge under sustained error-path churn.
  • All 16 → crash gone, deadline fires, receivers unwind cleanly, buffer slots released; idle-then-probe reliably returns to serving within the probe window.
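The two-level timeout the deadline commits add can be sketched as a short per-iteration poll plus a total-elapsed check. Everything below is an illustrative reduction, not the real `checkContextTransferStatus`; the names `pollTransfer` and `TransferState` are invented for the sketch.

```cpp
// Sketch of the two-level timeout: a short wait_for() poll on the transfer
// future keeps the caller's loop live, and a total-elapsed check against
// kv_transfer_timeout_ms (measured from the transfer's start timestamp)
// evicts stuck senders. Without the second check, a stuck sender retries
// the poll forever, pinning KV blocks until the pool exhausts.
#include <chrono>
#include <future>

using Clock = std::chrono::steady_clock;

enum class TransferState { Pending, Done, TimedOut };

TransferState pollTransfer(std::future<void>& fut, Clock::time_point start,
    std::chrono::milliseconds pollTimeout, std::chrono::milliseconds totalTimeout)
{
    // Per-iteration poll: returns quickly whether or not the transfer is done.
    if (fut.wait_for(pollTimeout) == std::future_status::ready)
    {
        return TransferState::Done;
    }
    // Total-elapsed deadline: the enforcement this PR adds.
    if (Clock::now() - start > totalTimeout)
    {
        return TransferState::TimedOut;
    }
    return TransferState::Pending;
}
```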

Order also matters on the removal side: commit 14 only becomes safe after the earlier commits are in place, because the deadline + cancel plumbing they add is what the Path A/B guards were substituting for. Removing the guards without that plumbing in place would regress the orphan-induced KV-block leak the guards existed to mitigate.
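The perRequestCancel plumbing (commits 7–10) combines two ingredients: a per-request `std::atomic<bool>` shared with any blocked worker, and a bounded condition-variable wait whose predicate also observes that flag. The sketch below uses invented stand-in names (`BufferPool`, `acquireSlot`), loosely mirroring the bounded `assignBufferIndex` CV wait from commit 10:

```cpp
// Sketch of cancel-flag plumbing: an evicted request flips the atomic flag,
// and the bounded wait_for means a waiter can never park forever — it wakes
// on a free slot, on cancel, or on timeout. All names are illustrative.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

struct BufferPool
{
    std::mutex mtx;
    std::condition_variable cv;
    int freeSlots = 0;

    // Returns true if a slot was taken, false on cancel or timeout.
    bool acquireSlot(std::atomic<bool> const& cancelled, std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(mtx);
        bool ok = cv.wait_for(lock, timeout,
            [&] { return freeSlots > 0 || cancelled.load(); });
        if (!ok || cancelled.load())
        {
            return false;  // timed out or cancelled: caller unwinds instead of wedging
        }
        --freeSlots;
        return true;
    }
};
```

The key design point is that the cancel flag is checked inside the wait predicate, so a cancelling thread only has to set the flag and `notify_all()` — it never has to know what the waiter was waiting for.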

Commits (in applied order)

  1. a31977564 — Enforce kv_transfer_timeout_ms on prefill sender + guard against dangling pointers
  2. a3bce0ce3 — Report timed-out transfers to Python for full cleanup
  3. a8c27a45b — Apply kv_transfer_timeout_ms in blockAll mode
  4. 43b518070 — Close remaining UAF paths and enforce gen-side deadline
  5. 3d0a6d183 — Hoist kv_transfer_timeout_ms check above readiness gate
  6. 6bb116dac — Release receiver resources when gen-side deadline evicts
  7. 62c5d24ae — sendRequestInfo / receiveReadySignal / data-phase send honor perRequestCancel
  8. 7e8016a2f — Sender-side in-flight cancel parity + notifySyncMessage pre-notify check
  9. e7fcbab01 — Tie sender cancel-flag lifetime to session + port catch(...) to sender worker
  10. 2860ac50a — Bound assignBufferIndex CV wait + plumb cancel + diagnostic [buf] logs
  11. 39f5452ea — RAII BufferIndexHolder for requestSync — stop leaking pool slots on non-happy exits
  12. 28f836e19 — RAII BufferIndexHolder on sender formatters + reqId-tagged [buf] logs
  13. 6f85ce895 — Send-side RAII: atomic release(), cancel parity, FUTURE_WAIT/JOIN diagnostics
  14. 649d1466b — Carry std::shared_ptr<LlmRequest> through transceiver — close UAF, remove Path A/B guards, unwedge KV block leak
  15. abfcb6d39 — Harden drain worker loop against non-std::exception escapes
  16. c9777c4ac — Re-add is_disagg_context_complete_state guard to _handle_responses
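The RAII holder from commits 11–13 follows a standard pattern; this is a simplified stand-in (`BufferIndexHolder` over a plain `std::vector<int>` free-list), not the actual TensorRT-LLM class, and the idempotent `release()` here is single-threaded where commit 13's is atomic:

```cpp
// Sketch of the RAII buffer-index holder: the pool index is returned in the
// destructor, so every exit path — early return, exception, timeout — gives
// the slot back. release() is idempotent via std::exchange (the real code
// uses an atomic for thread safety per commit 13's "atomic release()").
#include <utility>
#include <vector>

class BufferIndexHolder
{
public:
    BufferIndexHolder(std::vector<int>& pool, int index)
        : mPool(&pool)
        , mIndex(index)
    {
    }

    BufferIndexHolder(BufferIndexHolder const&) = delete;
    BufferIndexHolder& operator=(BufferIndexHolder const&) = delete;

    ~BufferIndexHolder()
    {
        release();
    }

    // First call returns the index to the pool; later calls (including the
    // destructor after an explicit release) are no-ops.
    void release()
    {
        if (int idx = std::exchange(mIndex, -1); idx >= 0)
        {
            mPool->push_back(idx);
        }
    }

private:
    std::vector<int>* mPool;
    int mIndex;
};
```

This is exactly what closes the "one early return in requestSync → index never released" failure mode: the compiler, not the error-handling code, guarantees the slot goes back.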

Base

Branch is linear on top of v1.3.0rc11 (4e69c14f732a6e6afce4f71616db5b5cd2b10530) — chosen so it can be built + tested against the tensorrt-llm==1.3.0rc11 wheel baked into NVIDIA NGC's dynamo-trtllm images. 16 commits on top of that tag.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 15, 2026
yifjiang added a commit to yifjiang/TensorRT-LLM that referenced this pull request Apr 15, 2026
… side

Cherry-pick of PR NVIDIA#13056. Adds total elapsed time check against
kv_transfer_timeout_ms in checkContextTransferStatus. Without this,
stuck transfers retry the per-iteration timeout forever, holding KV
blocks indefinitely and exhausting the cache pool.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
@yifjiang yifjiang force-pushed the fix/prefill-sender-total-timeout branch from 408c1da to a0be6d6 Compare April 16, 2026 18:08
@yifjiang yifjiang changed the title [NVBug 5969206][fix] Enforce kv_transfer_timeout_ms on prefill sender side [NVBug 5969206][fix] Enforce kv_transfer_timeout_ms on prefill sender + guard against dangling pointers Apr 16, 2026
@yifjiang yifjiang force-pushed the fix/prefill-sender-total-timeout branch from a0be6d6 to 70c6cd6 Compare April 16, 2026 18:09
@yifjiang yifjiang changed the title [NVBug 5969206][fix] Enforce kv_transfer_timeout_ms on prefill sender + guard against dangling pointers [https://nvbugs/5969206][fix] Enforce kv_transfer_timeout_ms on prefill sender + guard against dangling pointers Apr 16, 2026
@yifjiang yifjiang force-pushed the fix/prefill-sender-total-timeout branch 2 times, most recently from a98ed73 to 8c6c8fc Compare April 16, 2026 18:27
@yifjiang yifjiang marked this pull request as ready for review April 16, 2026 18:27
@yifjiang yifjiang requested review from a team as code owners April 16, 2026 18:27
@yifjiang yifjiang force-pushed the fix/prefill-sender-total-timeout branch from 8c6c8fc to d7844fc Compare April 16, 2026 18:29

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough


Two files are modified to enhance KV cache transfer timeout handling and request state management. The C++ cache transceiver adds timeout configuration retrieval, debug logging, and timeout-based cleanup logic for pending transfers. The Python executor adds conditional state handling to defer termination of disaggregated context transmission requests.

Changes

  • KV Cache Transfer Timeout & Logging — cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp: Enhanced timeout handling with retrieval of the kvTransferTimeoutMs configuration. Added elapsed-time guards to detect and remove timed-out transfers. Introduced detailed debug logging of sender-futures state and completion statistics. Modified exception handling to log warnings without mutating request state, then remove affected entries from pending transfers.
  • Disaggregated Context Transmission State Handling — tensorrt_llm/_torch/pyexecutor/py_executor.py: Added an explicit conditional branch in _handle_responses to defer termination of requests in disaggregated context-transmission state. Prevents premature termination while allowing existing timeout checks to manage completion or timeout scenarios.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2)

  • Title check — the title accurately reflects the main changes: enforcing kv_transfer_timeout_ms and guarding against dangling pointers in the disaggregated transceiver path.
  • Description check — the PR description is comprehensive and thoroughly documents the fix, including detailed investigation notes, a 16-commit breakdown with coupling rationale, and test validation.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3


📥 Commits

Reviewing files that changed from the base of the PR and between 61cef21 and d7844fc.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py

Comment on lines +493 to +497:

```cpp
    if (!blockAll && mCacheTransceiverConfig.has_value())
    {
        senderFutureTimeoutMs = mCacheTransceiverConfig->getKvTransferSenderFutureTimeoutMs();
        kvTransferTimeoutMs = mCacheTransceiverConfig->getKvTransferTimeoutMs();
    }
```

⚠️ Potential issue | 🟠 Major

Load kvTransferTimeoutMs even when blockAll is true.

blockAll should disable the per-iteration poll timeout, not the overall transfer deadline. With the current guard, any caller that omits atLeastRequestNum still falls through to a blocking future.get() with no kv_transfer_timeout_ms enforcement, so a stuck sender can still hang forever.

Suggested fix:

```diff
-    if (!blockAll && mCacheTransceiverConfig.has_value())
+    if (mCacheTransceiverConfig.has_value())
     {
-        senderFutureTimeoutMs = mCacheTransceiverConfig->getKvTransferSenderFutureTimeoutMs();
         kvTransferTimeoutMs = mCacheTransceiverConfig->getKvTransferTimeoutMs();
+        if (!blockAll)
+        {
+            senderFutureTimeoutMs = mCacheTransceiverConfig->getKvTransferSenderFutureTimeoutMs();
+        }
     }
```

Comment on lines +499 to +513:

```cpp
    // Log mSenderFutures state for diagnosing dangling pointer issues.
    // Each entry's pointer address and request ID are logged so we can detect
    // when a pointer's underlying memory is freed (reqId changes to 0).
    if (!mSenderFutures.empty())
    {
        TLLM_LOG_DEBUG(
            "checkContextTransferStatus: mSenderFutures.size()=%zu, blockAll=%d, "
            "kvTransferTimeoutMs=%d",
            mSenderFutures.size(), blockAll ? 1 : 0, kvTransferTimeoutMs.value_or(-1));
        for (size_t i = 0; i < mSenderFutures.size(); ++i)
        {
            auto& [req, fut] = mSenderFutures[i];
            auto startTs = req->getKvCacheTransferStart().time_since_epoch().count();
            TLLM_LOG_DEBUG(" [%zu] ptr=%p reqId=%ld startTs=%ld", i, static_cast<void const*>(req), req->mRequestId,
                static_cast<long>(startTs));
```

⚠️ Potential issue | 🔴 Critical

Don't dereference req in the dangling-pointer diagnostics.

This loop reads req->mRequestId and req->getKvCacheTransferStart() from mSenderFutures, but that container is exactly where the raw pointer may already be stale. In the failure mode this PR is fixing, the debug path itself becomes the use-after-free.


Comment on lines +609 to +635:

```cpp
            if (shouldTimeout)
            {
                // IMPORTANT: Do NOT dereference request->setState() or call
                // mCacheSender->cancelRequest(*request) here. The LlmRequest*
                // in mSenderFutures is a raw pointer with no ownership — the
                // Python layer may have already freed the request. Dereferencing
                // a dangling pointer causes heap corruption (free(): invalid
                // next size). Just erase the stale entry from mSenderFutures.
                // The Python layer handles state transitions via
                // _end_transfer_and_maybe_terminate when it processes the
                // error/completion from check_context_transfer_status.
                if (startTimeValid)
                {
                    TLLM_LOG_WARNING(
                        "Context KV cache transfer timed out: elapsed %ld ms > limit %d ms. "
                        "Removing entry from mSenderFutures (ptr=%p).",
                        elapsedMs, kvTransferTimeoutMs.value(), static_cast<void const*>(request));
                }
                else
                {
                    TLLM_LOG_WARNING(
                        "Removing stale entry from mSenderFutures: transfer start time is "
                        "uninitialized (request pointer %p may be dangling).",
                        static_cast<void const*>(request));
                }
                it = mSenderFutures.erase(it);
                continue;
```

⚠️ Potential issue | 🟠 Major

Don't erase timed-out or failed entries without surfacing the request ID back to Python.

After these erase(it) calls, check_context_transfer_status() no longer returns the request ID, but tensorrt_llm/_torch/pyexecutor/py_executor.py:3244-3278 only calls _end_transfer_and_maybe_terminate() for IDs present in finished_requests + error_requests. That means the request can stay pinned in AsyncTransferManager indefinitely unless a later cancel_request() happens to succeed.

Also applies to: 658-662
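What the reviewer is asking for can be sketched as collecting the IDs of evicted entries into the error list that gets returned to Python, rather than erasing silently. `Entry` and `evictTimedOut` below are simplified stand-ins, not the real `mSenderFutures` types or the `check_context_transfer_status` API:

```cpp
// Sketch: erase timed-out entries AND report their request IDs, so the
// Python side (_end_transfer_and_maybe_terminate) can run its cleanup.
// Simplified stand-in types for illustration only.
#include <cstdint>
#include <vector>

struct Entry
{
    std::uint64_t requestId;
    bool timedOut;
};

std::vector<std::uint64_t> evictTimedOut(std::vector<Entry>& entries)
{
    std::vector<std::uint64_t> errorRequestIds;
    for (auto it = entries.begin(); it != entries.end();)
    {
        if (it->timedOut)
        {
            errorRequestIds.push_back(it->requestId);  // surface to Python
            it = entries.erase(it);
        }
        else
        {
            ++it;
        }
    }
    return errorRequestIds;
}
```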

@pcastonguay pcastonguay requested a review from Tabrizian April 16, 2026 18:57
@yifjiang yifjiang force-pushed the fix/prefill-sender-total-timeout branch from 891411d to b2b0a82 Compare April 16, 2026 22:44
@Tabrizian (Member) commented:

@yifjiang Is it possible to have a test that demonstrates the issues? Also, can you please provide a more concise summary of the issue and a minimal reproducer that reproduces the problem.

Also, in the PR description is it possible to only reference PRs that are in TRT-LLM main branch (do NOT reference PRs that are closed or not merged).


@yifjiang (Contributor, Author) replied:

@Tabrizian thanks for the review. Addressed in 88987cc:

  • Test: added test_context_transfer_times_out_on_stuck_sender in tests/unittest/others/test_kv_cache_transceiver.py. It enqueues a ctx transfer without starting a matching receiver, configures kv_transfer_timeout_ms=500 ms, and asserts that check_context_transfer_status returns the request in error_requests, sets state to DISAGG_TRANS_ERROR, and evicts the entry. This is a direct unit-level reproduction of the stuck-sender path that the C++ change fixes.
  • Description: rewritten to be concise — summary, root cause, and a minimal reproducer (Qwen3-0.6B 1P1D, concurrency 64, ISL 8000). Kept only the reference to [TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path #6348 (merged) and dropped all references to unmerged / closed PRs.
  • Dead link: removed the reference to the external rc11-crash-and-hang-proof.md — the reproducer is now inlined in the PR description so no external artifact is needed.


@Tabrizian Tabrizian left a comment


Is it possible to include the client command in the reproducing instructions too? Thank you.

Comment on lines +3755 to +3763:

```python
            elif request.is_disagg_context_transmission_state:
                # Request is still transferring KV cache to the decode worker.
                # Do NOT terminate — the C++ CacheTransceiver still holds a raw
                # pointer to this request in mSenderFutures. Terminating now
                # would free the LlmRequest and create a dangling pointer,
                # causing use-after-free (request ID reads as 0, transfer start
                # reads as epoch). Let the transfer complete or time out via
                # kv_transfer_timeout_ms in checkContextTransferStatus.
                pass
```
@Tabrizian (Member) commented:

The LlmRequest pointer is safe. It is kept alive in AsyncTransferManager until transfer is completed.

@yifjiang yifjiang Apr 17, 2026


Reviewing my earlier reply — I was sloppy. Let me split the answer by code path because your statement holds on the line you anchored on but not universally.

For this specific line (elif request.is_disagg_context_transmission_state: pass inside the force_terminate_for_partial_reuse branch of _handle_responses): you are correct. The pre-fix code would call _terminate_request → _do_terminate_request → resource_manager.free_resources(request). None of that touches async_transfer_manager; AsyncTransferManager._requests_in_transfer still holds the Python reference. So the LlmRequest itself is not destroyed. No classic LlmRequest-UAF on this path.

What the guard still prevents is some form of corruption that empirically fires under load (free(): invalid next size through LRUEvictionPolicy::claimBlock, heap metadata detected as invalid at a later unrelated free()). I previously claimed this was specifically "KV-cache-block corruption" — I retract that claim. I don't have ASan or similar verification for the actual mechanism; I was constructing a plausible story. Could be something else entirely. What I can say is that eliminating this termination reliably eliminates the crash under the overload workload.

For Path B (_check_disagg_ctx_cache_transfer_status, the py_kv_transfer_timed_out retry-cancel loop): the invariant "AsyncTransferManager keeps it alive" is not applicable. Path B calls _end_transfer_and_maybe_terminate, which calls AsyncTransferManager.end_transfer, which is precisely the code that pops _requests_in_transfer — i.e. that path itself is the "transfer is completed" signal that drops AsyncTransferManager's hold. A "Request X not found in transfer manager" warning observed once in an instrumented run is the direct fingerprint of this premature pop. Surviving Python references after Path B (active_requests membership, locals) are not load-bearing. So on Path B, LlmRequest lifetime after the pop is not guaranteed.

For Path A gen side (_handle_responses → py_kv_transfer_timed_out → cancel_request + _handle_errors, gen request in DISAGG_GENERATION_TRANS_IN_PROGRESS): gen requests never enter AsyncTransferManager._requests_in_transfer at all (only ctx requests go through async_transfer_manager.start_transfer). The only Python references are active_requests and locals. _handle_errors explicitly removes from active_requests then calls _terminate_request. No AsyncTransferManager holder exists to save us here — this guard is the only thing standing between _handle_errors and a classic UAF on mRequesterFutures.

I've pushed 568454fbd that updates the inline comments to reflect this split (and admits the force_terminate_for_partial_reuse crash mechanism is uncorroborated) rather than the blanket "dangling pointer / UAF" framing that was there before.

@Tabrizian (Member) commented:

This is not true, store_blocks_for_reuse has a pin argument to make sure the blocks are not deallocated even after the request is terminated.

@yifjiang (Contributor, Author) replied:

You're right — I hadn't accounted for the store_blocks_for_reuse(request, True) pin set at AsyncTransferManager.start_transfer (py_executor.py:207). I've confirmed:

  • start_transfer calls kv_cache_manager.store_blocks_for_reuse(request, True) → blocks are pinned.
  • _terminate_request → free_resources(request) calls remove_sequence(..., pin_on_release=False). Per your note, that does not drop the pin set by store_blocks_for_reuse, so the blocks remain live through a Python-side termination.
  • unpin_blocks_by_id is invoked only from AsyncTransferManager.end_transfer (py_executor.py:241), i.e. the transfer-completion signal.

So for the force_terminate_for_partial_reuse path (the line you anchored on), both safety nets apply: AsyncTransferManager holds the LlmRequest Python ref, and the pin keeps the KV blocks live. My earlier "KV-block corruption from free_resources" story is ruled out by the pin. I withdraw that framing.

That leaves me more confused, not less, about why the crash actually fires on that path. Empirically, with enable_partial_reuse_for_disagg=true on rc11 (no guard), 37× storeContextBlocks: Can not find sequence warnings and free(): invalid next size (fast) through LRUEvictionPolicy::claimBlock → __libc_free reproduce in ~45–65 s at conc 48, ISL 8000. With the is_disagg_context_transmission_state guard active, that crash disappears and the storeContextBlocks count drops to 1. Something about running _terminate_request on an in-transmission request is breaking the KV cache manager's invariants in a way that later surfaces during claimBlock — but with both LlmRequest and pinned blocks accounted for, I can't offer a mechanism. My best current guess is that the storeContextBlocks: Can not find sequence warnings are a symptom of remove_sequence running with a sequence the partial-reuse path expected to later storeContextBlocks for, leaving a dangling block-id / sequence-id mapping that claimBlock later trips on — but that's still speculation.

Do you have a theory on what's actually being corrupted? If not, I'd like to run this under MALLOC_CHECK_=3 or ASan to get a writer stack on the corruption site rather than leaving it as "empirical correlation, mechanism TBD." The fix is straightforward; I just don't want to ship a comment that keeps hand-waving about the cause.

For the other two paths the pin argument doesn't change the per-path verification I posted: _check_disagg_ctx_cache_transfer_status Path B explicitly calls end_transfer which unpins the blocks (py_executor.py:241) and pops _requests_in_transfer; _handle_responses Path A gen side has no AsyncTransferManager/pin path at all (receiver-side transfers go through request_and_receive_async, not async_transfer_manager.start_transfer). So those guards are still preventing real, nameable hazards. I'll update the code comment at 3791 to reflect your correction on the pin.

@yifjiang (Contributor, Author) replied:

Follow-up — got the forensic mechanism. Ran the same rc11 pre-fix image under MALLOC_PERTURB_=85 / MALLOC_CHECK_=3 with the same workload (conc 48, ISL 8000, 120 s) and caught the UAF directly:

[TensorRT-LLM][ERROR] Error occurred during context transfer for
  request 6148914691236517205: std::future_error: Broken promise

6148914691236517205 == 0x5555555555555555 — the glibc poison fill written into freed memory. That log line comes from checkContextTransferStatus's catch block doing LOG_ERROR("... %ld ...", it->first->mRequestId). Request IDs are monotonic 64-bit integers, so 0x55..55 is not a legal ID; the mRequestId field must have been read from a freed slot. Direct mechanistic proof the LlmRequest was freed out from under mSenderFutures.
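The identification of the logged ID with the poison fill can be checked directly; eight `0x55` bytes are exactly the value in the log line, and 85 is the `MALLOC_PERTURB_` byte:

```cpp
// Quick arithmetic check of the claim above: the logged request ID
// 6148914691236517205 is exactly eight 0x55 bytes — the fill pattern
// glibc writes into freed memory under MALLOC_PERTURB_=85 (85 == 0x55) —
// so the value must have been read from a freed slot.
#include <cstdint>

constexpr std::uint64_t kLoggedId = 6148914691236517205ULL;
static_assert(kLoggedId == 0x5555555555555555ULL, "eight poison bytes");
static_assert((kLoggedId & 0xFF) == 85, "each byte is the MALLOC_PERTURB_ value 85");
```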

Walking through why, given your pin correction:

  • AsyncTransferManager.start_transfer pins via store_blocks_for_reuse(request, True) — blocks stay live through _terminate_request (your correction, confirmed).
  • _terminate_request alone does not free the LlmRequest — AsyncTransferManager's strong ref survives. So the force_terminate_for_partial_reuse path in isolation does not produce the UAF; it only produces the storeContextBlocks: Can not find sequence warnings (observed 37x-102x) via remove_sequence.
  • The free happens on Path B in _check_disagg_ctx_cache_transfer_status. Path B calls _end_transfer_and_maybe_terminate, which calls AsyncTransferManager.end_transfer, which pops _requests_in_transfer — the drop you were pointing at as "until transfer is completed" is baked into this path's cancel sequence. After the pop, the iteration-local request is the only remaining strong ref; when the loop moves on, the pybind shared_ptr hits zero and the C++ LlmRequest is destroyed. The mSenderFutures entry's raw pointer is now dangling.
  • Under MALLOC_PERTURB_, the freed region is filled with 0x55; the catch-block read captures that pattern. Without MALLOC_PERTURB_ (original rc11 run), the same region gets reused by a later allocation; a subsequent C++ write (setState(kDISAGG_TRANS_ERROR) at 0x06 into what is now another object's mid-word) corrupts heap metadata, and the next unrelated free() in LRUEvictionPolicy::claimBlock aborts with invalid next size.

So the corrected mechanism for this PR's fix is:

  • Path B guard (in _check_disagg_ctx_cache_transfer_status): forensically confirmed to prevent a classic LlmRequest-UAF. This is the path that drops the AsyncTransferManager ref.
  • Path A gen-side guard (in _handle_responses): classic UAF by construction — gen requests never enter AsyncTransferManager, so _handle_errors is a direct free. No forensic run targeted gen-side separately in this investigation, but the argument is airtight from the code.
  • force_terminate_for_partial_reuse guard (the line you anchored on): neither the LlmRequest nor the pinned blocks are freed by this path alone. Skipping it still matters because in the real workload the same request is later picked up by Path B once py_kv_transfer_timed_out fires, and Path B is where the confirmed free happens. Whether this path alone can crash independent of Path B is not directly isolated.

Caveats on the heap-debug run:

  • The MALLOC_CHECK_=3 abort on invalid next size did not trigger in the 12-min window — a pure read of a poisoned slot doesn't flag the check. The poison read is the proof, not an abort.
  • An ASan rebuild would pin the first write through the dangling pointer with its stack. That's a source rebuild, so out of scope for this image-only investigation but a clean follow-up if more forensics are wanted.

Pushed 5779e5c53df42104b063a342f98a832407fec82d updating the inline comments to reflect this — the Path B guard comment now cites the 0x55..55 evidence; the Path A guard comment splits ctx vs gen accurately; the force_terminate_for_partial_reuse guard comment keeps your pin correction and frames the composite-with-Path-B story without overclaiming. The C++ catch-block comments get the same correction.


error_request_ids = []
# 20x the configured timeout is a generous wall-clock budget for CI jitter.
deadline = time.monotonic() + (kv_transfer_timeout_ms * 20) / 1000.0
Member


We already have a timeout at the py_executor level; this test doesn't exercise the py_executor-level changes, which could be why that timeout is not triggered.

Contributor Author

@yifjiang yifjiang Apr 17, 2026


Correct — this test exercises the C++ check_context_transfer_status API directly, bypassing py_executor. The py_executor-level _check_kv_transfer_timeout → cancel_request → _handle_errors path is intentionally not in the loop here; the point of the test is to verify the C++ kv_transfer_timeout_ms enforcement in isolation.

Why the C++-layer enforcement still matters even though py_executor already has a timeout: the py_executor-level path (_check_kv_transfer_timeout fires py_kv_transfer_timed_out → either _handle_responses Path A or _check_disagg_ctx_cache_transfer_status Path B) calls cancel_request + _handle_errors / _end_transfer_and_maybe_terminate on requests whose C++ transfer is still in flight. Under load, that reliably produces free(): invalid next size on ctx side, and creates a classic UAF window on gen side (gen requests have no AsyncTransferManager holder). See the split analysis on the py_executor.py thread above for per-path specifics.

The Python guards in this PR skip that py-executor cleanup for in-transmission requests and defer to the C++ deadline, which evicts the entry from mSenderFutures / mRequesterFutures before any Python-side free_resources can run. That's the enforcement path the test is directly exercising.
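To make the division of labor concrete, here is a minimal Python sketch of the C++-side eviction the test exercises. The function name and the dict-of-entries shape are illustrative stand-ins for `mSenderFutures` and `checkContextTransferStatus`, not the real API:

```python
def check_context_transfer_status(sender_futures, now_ms, timeout_ms=60_000):
    """Sketch of the C++ deadline enforcement: evict entries whose total
    transfer time exceeds kv_transfer_timeout_ms and report them back to
    Python (analogous to errorRequestIds), so Python-side cleanup only
    ever runs on requests the C++ layer has already released."""
    error_request_ids = []
    for req_id in list(sender_futures):
        entry = sender_futures[req_id]
        if entry["done"]:
            del sender_futures[req_id]          # completed normally
        elif now_ms - entry["start_ms"] >= timeout_ms:
            error_request_ids.append(req_id)    # would set kDISAGG_TRANS_ERROR
            del sender_futures[req_id]          # evicted before Python frees anything
    return error_request_ids
```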

I can add a heavier py_executor-level integration test too if you'd like — it would need a full LLM + 1P1D fixture. Let me know if that's worth it vs. the C++-API unit test.

@yifjiang
Contributor Author

@Tabrizian client command added to the reproducer section of the PR description — full loadgen script (asyncio, concurrency 64, ISL ~8000, OSL ~200, 180s load + 5 recovery probes) plus a single-shot curl sanity check. It talks to the OpenAI-compatible /v1/chat/completions endpoint on the frontend.

I also updated the "what actually breaks" paragraph to reflect the mechanism you pointed out in py_executor.py:3762 — the LlmRequest stays alive via AsyncTransferManager; the corruption is at the KV-cache-block level (free_resources returns blocks to LRUEvictionPolicy's free queue while the NIXL sender is still writing GPU bytes into them). Full reply on that thread too.

@yifjiang
Contributor Author

yifjiang commented Apr 17, 2026

Adding empirical backing for the two inline threads above. Numbers below, mechanism analysis further down. Update (2026-04-17): forensic confirmation via MALLOC_PERTURB_ at the bottom of this comment.

Setup — reproducible from a fresh TRT-LLM build (no prebuilt image assumed) with a small observability patch that instruments counters on the two Python paths in question. Deploy 1P1D disaggregated serving (one prefill + one decode worker) with:

  • Model: Qwen/Qwen3-0.6B
  • Backend: NIXL over TCP (UCX_TLS=tcp,cuda_copy,self)
  • cache_transceiver_config.kv_transfer_timeout_ms = 60000
  • cache_transceiver_config.kv_transfer_sender_future_timeout_ms = 10000
  • Both workers: free_gpu_memory_fraction: 0.3
  • Prefill: disable_overlap_scheduler: true
  • Decode: disable_overlap_scheduler: false

Workload — asyncio client hitting the frontend's /v1/chat/completions: concurrency 48, ISL ~Normal(8000, 2000) floored at 100, OSL ~Normal(200, 50) floored at 50, temperature=0, non-streaming, 120 s sustained load, then 5 recovery probes.

Run A — baseline (no PR #13056 fix), observability only. Counts how often each Python path operates on an in-transmission request:

| Event | Prefill | Decode |
| --- | --- | --- |
| `_check_disagg_ctx_cache_transfer_status` Path B — `cancel_request` returned True on a ctx request in `DISAGG_CONTEXT_TRANS_IN_PROGRESS`, then `_end_transfer_and_maybe_terminate` ran | 23 | 0 |
| `_handle_responses` Path A — `py_kv_transfer_timed_out` would have triggered `cancel_request` + `_handle_errors` on a gen request in `DISAGG_GENERATION_TRANS_IN_PROGRESS` | — | 12 |
| `CacheSender::cancelRequest` returning True on an in-transmission ctx request | 23 / 23 | — |
| "Request X not found in transfer manager" warning | 1 | 0 |

The 23/23 on CacheSender::cancelRequest shows this isn't a corner case — every ctx request whose Python-level timeout fires while C++ is still driving its transfer gets cancel_request == True (the request is still in mReadyResponses), and the pre-fix code proceeds into _handle_errors / _end_transfer_and_maybe_terminate. The single "not found" warning is the direct fingerprint that Path B did pop _requests_in_transfer on a request that C++ still considered in flight.

Run B — same setup and workload, with the Python guards in this PR applied (both Path A and Path B skip termination when the request is in a transmission state):

| Event | Prefill | Decode |
| --- | --- | --- |
| Termination skipped by the new guard (per firing, not unique requests) | 4322 | 12 |
| Heap crash (`free(): invalid next size`) | none | none |
| C++ `kv_transfer_timeout_ms` eviction fires (ctx side) | 63 | 0 |
| C++ `kv_transfer_timeout_ms` eviction fires (gen side) | — | 0 |

Recovery probes pass after sustained load: 0 / 5

So:

  • Prefill is healed by the Python guard + ctx-side C++ deadline. 63 stuck entries evicted cleanly, KV blocks unpinned via _end_transfer_and_maybe_terminate, no crash, no hang.
  • Decode still hangs even with the Python guard — the 12 would-be terminations are correctly skipped, but without a checkGenTransferStatus deadline, stuck receivers accumulate in mRequesterFutures, the gen KV pool exhausts, and the server doesn't recover. That's why daf70d403 adds the gen-side C++ deadline and b9450529b hoists the check above the toCompleteIdSet readiness gate (so entries that are never "ready" still get their deadline checked).

Mechanism — per-path

  • Path B guard (_check_disagg_ctx_cache_transfer_status retry-cancel): this path calls _end_transfer_and_maybe_terminate → AsyncTransferManager.end_transfer, which is literally the code that pops _requests_in_transfer. The "AsyncTransferManager keeps it alive" invariant does not apply here — this path is the completion signal that drops the hold. After the pop, surviving Python references (active_requests, loop locals) are transient; the pybind shared_ptr refcount drops to zero, the C++ LlmRequest is destroyed, and mSenderFutures's raw pointer is stale. Forensic confirmation in the follow-up run below.

  • Path A gen-side guard (_handle_responses → py_kv_transfer_timed_out for gen): gen requests never enter AsyncTransferManager. _handle_errors explicitly removes the request from active_requests and calls _terminate_request. No holder exists to save the LlmRequest — this guard is the only thing preventing a classic UAF on mRequesterFutures on the gen side.

  • force_terminate_for_partial_reuse guard (the line @Tabrizian anchored on): _terminate_request does not touch async_transfer_manager, and store_blocks_for_reuse(request, True) pins the KV blocks (thanks to @Tabrizian for that correction). So neither the LlmRequest nor its pinned blocks are freed by this path alone. The observable impact is the 37×–102× storeContextBlocks: Can not find sequence warnings from remove_sequence. In the real workload the same request is later picked up by Path B once py_kv_transfer_timed_out fires, and Path B is where the forensically confirmed free occurs. Whether this path could crash independent of Path B via a KVCacheManager-state inconsistency has not been isolated.
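The guard logic both Python paths now share reduces to a small sketch. The function name and the dict-shaped request below are simplified stand-ins for the py_executor code, not the actual implementation:

```python
IN_TRANSMISSION_STATES = {
    "DISAGG_CONTEXT_TRANS_IN_PROGRESS",     # ctx request, prefill/sender side
    "DISAGG_GENERATION_TRANS_IN_PROGRESS",  # gen request, decode/receiver side
}

def handle_timed_out_request(request, cancel_request, handle_errors):
    """Guard sketch: while the request is in a transmission state, skip
    Python-side cancel+terminate and defer to the C++ deadline, which
    evicts the C++ entry first; only then is termination safe."""
    if request["state"] in IN_TRANSMISSION_STATES:
        return "deferred_to_cpp_deadline"
    cancel_request(request)
    handle_errors(request)
    return "terminated"
```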


Update (2026-04-17): forensic confirmation of the Path B UAF via MALLOC_PERTURB_

Re-ran the same pre-fix image, same workload, with heap-debug env on both workers:

MALLOC_CHECK_: "3"
MALLOC_PERTURB_: "85"        # fills freed memory with 0x55
GLIBC_TUNABLES: "glibc.malloc.check=3"
LIBC_FATAL_STDERR_: "1"

After ~60 s of load, the prefill worker logged:

[TensorRT-LLM][ERROR] Error occurred during context transfer
  for request 6148914691236517205: std::future_error: Broken promise

6148914691236517205 == 0x5555555555555555 — the poison fill pattern glibc writes into freed memory under MALLOC_PERTURB_=85. The log line comes from checkContextTransferStatus's catch block reading it->first->mRequestId off mSenderFutures. Request IDs are monotonic 64-bit integers from the executor; 0x55…55 is not a legal ID. The only way for mRequestId to read as 0x55…55 is if the memory the field was stored in has been freed and poisoned.

This is direct, mechanical proof that the LlmRequest was freed out from under mSenderFutures.
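The identification is checkable by arithmetic alone:

```python
# MALLOC_PERTURB_=85: glibc fills freed memory with the byte 0x55 (85 decimal).
poison_word = int.from_bytes(bytes([85]) * 8, "little")  # a freed 8-byte mRequestId

assert poison_word == 0x5555555555555555
assert poison_word == 6148914691236517205  # the "request id" in the log line
```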

Sequence:

  1. Python timeout fires for a request in DISAGG_CONTEXT_TRANS_IN_PROGRESS. Path B runs: cancel_request → True, _end_transfer_and_maybe_terminate → AsyncTransferManager.end_transfer pops _requests_in_transfer (drops the AsyncTransferManager ref), then _terminate_request → free_resources → remove_sequence (explaining the 102 storeContextBlocks: Can not find sequence warnings in the same run).
  2. The iteration-local request in Path B is the only surviving strong ref. When the loop moves on, pybind shared_ptr refcount hits zero; the C++ LlmRequest is destroyed.
  3. Glibc either reuses the slot or fills it with 0x55.
  4. Either via broken-promise resolution or the next tick iterating mSenderFutures, the C++ code dereferences the stale raw pointer. The catch-block log above is the smoking gun.
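The ownership bug and the shared_ptr fix can be modeled in a few lines of Python, with a weakref standing in for the raw pointer held by mSenderFutures and a plain reference standing in for the shared_ptr (stub class, illustrative only):

```python
import weakref

class LlmRequestStub:
    """Stand-in for the C++ LlmRequest."""
    def __init__(self, request_id):
        self.request_id = request_id

req = LlmRequestStub(42)
raw_like = weakref.ref(req)    # ~ raw LlmRequest* in pre-fix mSenderFutures
shared_like = req              # ~ shared_ptr<LlmRequest> after the fix

del req                        # Python-side holders drop their references
assert raw_like() is not None  # still alive: the map's strong ref keeps it

shared_like = None             # without the fix there is no strong ref here...
assert raw_like() is None      # ...and the "pointer" now targets freed memory
```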

Caveats:

  • MALLOC_CHECK_=3 did not SIGABRT during the heap-debug window. A pure read of poisoned bytes doesn't flag invalid next size; it only fires on metadata damage at a later unrelated free. The poison read is the proof, not the abort.
  • The subsequent request->setState(kDISAGG_TRANS_ERROR) writes 0x06 into the freed region's mState offset. Whether that write trips the check depends on which glibc bin the slot was assigned to — timing-sensitive. To force a deterministic abort at the write site, the next step would be an ASan rebuild, which is out of scope for this image-only investigation.
  • Decode worker stayed entirely quiet (0 of every signature). The crash mechanism in this image is exclusively prefill/sender side — consistent with mSenderFutures being the container holding the freed LlmRequest.

Pushed 5779e5c53 updating the inline code comments with this forensic evidence (Path B guard comment cites the 0x55..55 read; Path A guard splits ctx vs gen accurately; force_terminate_for_partial_reuse guard acknowledges @Tabrizian's pin correction and frames the composite-with-Path-B story without overclaiming).

yifjiang and others added 2 commits April 21, 2026 16:33
… + guard against dangling pointers

Two fixes for disaggregated prefill/decode KV cache transfer:

1. **C++ total timeout enforcement** (`cacheTransceiver.cpp`):
   `checkContextTransferStatus` retried stuck transfers indefinitely using
   only the per-retry `kv_transfer_sender_future_timeout_ms`. The total
   `kv_transfer_timeout_ms` was plumbed through config but never enforced.
   Now checks total elapsed time after each per-retry timeout and removes
   entries that exceed the limit. All pointer dereferences are avoided in
   timeout and exception paths since `mSenderFutures` holds raw
   `LlmRequest*` that may be dangling.

2. **Python guard for in-flight transfers** (`py_executor.py`):
   The `force_terminate_for_partial_reuse` path terminated requests
   unconditionally, including those in `DISAGG_CONTEXT_TRANS_IN_PROGRESS`
   state. This freed the `LlmRequest` while C++ `mSenderFutures` still
   held a raw pointer, causing heap corruption (`free(): invalid next
   size`). Added `is_disagg_context_transmission_state` check before the
   termination path — requests in active transfer are skipped and cleaned
   up when the transfer completes or times out.
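A Python sketch of the two-timeout relationship the C++ fix enforces — per-retry wait slices accumulated against the overall deadline. The function name is illustrative; `concurrent.futures` stands in for the `std::future` wait loop:

```python
import concurrent.futures
import time

def wait_with_total_deadline(future, per_retry_ms, total_ms):
    """Each wait slice is bounded by the per-retry timeout
    (kv_transfer_sender_future_timeout_ms); total elapsed time is checked
    against the overall deadline (kv_transfer_timeout_ms) after each slice."""
    start = time.monotonic()
    while True:
        try:
            future.result(timeout=per_retry_ms / 1000.0)
            return "done"
        except concurrent.futures.TimeoutError:
            if (time.monotonic() - start) * 1000.0 >= total_ms:
                return "timed_out"  # evict + report, instead of retrying forever
```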

Bug origin: PR NVIDIA#6348 (2025-09-27) moved the `is_disagg_context_transmission_state`
guard into an else branch, leaving the block-reuse path unprotected.

Tested on nscale B200 cluster with Qwen3-0.6B 1P1D disagg:
- Without fix: crash at ~65s (conc 48, ISL 8000, free_gpu_mem=0.3)
- With Python guard only: no crash, 1 storeContextBlocks warning (was 37)
- With both fixes: no crash, stuck transfers cleaned up by C++ timeout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…eanup

With the Python-side is_disagg_context_transmission_state guard, ALL
LlmRequest* pointers in mSenderFutures are guaranteed valid — the
guard prevents Python from terminating requests during active transfer.

This simplifies the C++ timeout and exception paths:
- Timeout: dereference request->mRequestId, setState(kDISAGG_TRANS_ERROR),
  add to errorRequestIds → Python calls end_transfer → unpin_blocks
- Exception (Broken promise): same — report as error for cleanup

Previously, timed-out entries were silently erased without reporting to
Python, leaving entries in _requests_in_transfer with pinned KV blocks
that were never unpinned, causing KV pool exhaustion after overload.

Removed the startTimeValid heuristic — it was checking the pointer
to determine if the pointer is safe, which is itself UB on a
dangling pointer. With the Python guard, this check is unnecessary.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yifjiang and others added 13 commits April 21, 2026 16:34
Address CodeRabbit review on NVIDIA#13056: the overall transfer deadline was
only loaded when !blockAll, so callers that omit atLeastRequestNum
could still hang forever on a stuck sender via future.get().

- Always load kvTransferTimeoutMs from mCacheTransceiverConfig.
- In the timeout branch, enforce the overall deadline regardless of
  blockAll; only skip-and-retry when per-iter timeout is set.
- In blockAll mode, fall through to future.get() only when the overall
  deadline has not yet been exceeded.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…eadline

Empirical testing (v5-diag run on Qwen3-0.6B 1P1D, conc 48, ISL 8000)
surfaced three remaining paths that still hit use-after-free or hang:

1. Python `_handle_responses` — the `py_kv_transfer_timed_out` branch
   (line 3643) calls `cancel_request` + `_handle_errors` + `continue`,
   bypassing the existing `is_disagg_context_transmission_state` guard.
   `cancel_request` on the sender only drains `mReadyResponses`; the raw
   LlmRequest* stays in `mSenderFutures`, so `_handle_errors` frees the
   request and the next `checkContextTransferStatus` dereferences a
   dangling pointer. Same hazard on the receiver side with
   `mRequesterFutures`.

2. Python `_check_disagg_ctx_cache_transfer_status` — the retry-cancel
   loop (line 3267) iterates `requests_in_transfer`, every entry of
   which is in `DISAGG_CONTEXT_TRANS_IN_PROGRESS` by construction.
   Cancelling + terminating always frees an in-transmission request,
   identical to Path 1. In v5-diag this fired 23 times per 120 s run.

   Both Python paths now skip cancel+terminate when the request is in
   transmission state and defer to the C++ kv_transfer_timeout_ms path,
   which evicts the entry and reports it via errorRequestIds / ERROR
   state — at which point termination is safe.

3. C++ `checkGenTransferStatus` (receiver side) — unlike
   `checkContextTransferStatus`, it never enforced
   `kv_transfer_timeout_ms`: the happy path called `future.get()`
   unconditionally, so a stuck receiver hung forever and its entry
   stayed pinned in `mRequesterFutures`. v5f fixed the prefill UAF but
   the decode-side hang persisted (10/104 succeeded, probes failed)
   because no deadline evicted the stuck receivers.

   Mirror the sender restructure: load `kv_transfer_timeout_ms` (always)
   and `kv_transfer_sender_future_timeout_ms` (polling mode only); on
   `wait_for` timeout, check elapsed against the overall deadline and
   evict with `kDISAGG_TRANS_ERROR` when exceeded. The existing
   `_check_cache_transfer_errors` surfaces the ERROR state back to
   Python, which then safely unwinds the request (the C++ entry was
   already erased, so there is no dangling pointer).

Additional correctness fix: `CacheReceiver::receiveAsync()` spawns the
thread that calls `setKvCacheTransferStart` asynchronously. Until that
thread runs, `getKvCacheTransferStart()` returns the default epoch time
point and the new gen-side deadline check would falsely evict fresh
entries. Set the transfer start synchronously in
`CacheTransceiver::requestAndReceiveAsync` before pushing the future.

Debug logging: mirror `respondAndSendAsync`'s insertion log in
`requestAndReceiveAsync`, and the per-entry/summary scan in
`checkGenTransferStatus` to match `checkContextTransferStatus`. These
make future stuck-transfer investigations tractable without rebuilding.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…ss gate

The elapsed-time check added in daf70d4 was nested inside the
`status == future_status::timeout` branch of `wait_for(...)`, which is
itself gated by `blockAll || toCompleteIdSet.find(...) != end()`. In
polling mode (blockAll == false), `toCompleteIdSet` is populated from
the initial `wait_for(0)` sweep (consensus across ranks) plus an
insert-order fallback that only runs when `atLeastRequestNum > 0`.

A stuck sender/receiver whose future never becomes ready is not in the
consensus set, and callers of `_check_disagg_gen_cache_transfer_status`
typically pass `at_least_num = 0` (see py_executor.py:2930, where
`need_check_one` requires ALL non-gen-first requests to be in
transmission — narrow). With `atLeastRequestNum = 0`, such entries
escape the deadline check entirely and pin KV blocks forever.

Empirical evidence on 1P1D Qwen3-0.6B, conc 48, ISL 8000, 120 s:
  - mRequesterFutures.size() stayed at 22 across 384 checkGenTransferStatus
    polls (blockAll=0, kvTransferTimeoutMs=60000, senderFutureTimeoutMs=10000
    confirmed via TLLM_LOG_DEBUG)
  - 0 decode-side total-timeout evictions vs 63 on prefill (prefill is
    masked by `_check_disagg_ctx_cache_transfer_status(1)` callers that
    rotate entries through `toCompleteIdSet`)
  - server did not recover after phase-1 sustained load — all 5 probes
    timed out

Hoist the deadline check above the readiness gate in both
`checkContextTransferStatus` and `checkGenTransferStatus`. A stuck
entry is now evicted at `kv_transfer_timeout_ms` regardless of whether
the future is ready or whether the entry is in `toCompleteIdSet`. The
inner fallback checks in the `wait_for(...) == timeout` branch are
removed since the outer check has already evicted any entry whose
deadline has passed.

Also add a ctx-side regression test that polls with
`at_least_request_num=0` — the default scheduler cadence that would have
exposed this bug.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…e evicts

Addresses the post-saturation "no-recovery" bug documented in the
gen-side ghost-UCX investigation: with PR NVIDIA#13056 commits 1-6 applied,
the UAF/crash is fixed but every probe after a saturation event
deterministically returns HTTP 500 at exactly `kv_transfer_timeout_ms`,
indefinitely. Py-spy traces show Python blocked in
check_gen_transfer_status waiting on at_least_num=1; the scheduler is
alive and requests are being accepted, but every new gen transfer
never becomes "ready."

Root cause: the timeout-eviction in `checkGenTransferStatus`
(`daf70d4034` / `b9450529b`) erases the `std::future<void>` from
`mRequesterFutures`, but the underlying task spawned via
`CacheReceiver::Impl::requestAndReceiveAsyncMultiThreads` is still
running — blocked inside `recvReadySignal` waiting on a peer that
will never respond. These zombie tasks hold NIXL/UCX per-request
state (TransferSession, buffer registrations). Over a saturation
window ~100+ zombies accumulate and pin the receive pool, so new
receives queue behind a frozen pool and every new transfer waits the
60s deadline. Empirical signatures: no crash, server responds, thread
and FD counts identical to a fresh worker (damage is inside C++
lifetime state), killing only the decode pod recovers fully.

Fix: extend `CacheReceiver::Impl::cancelRequest` to cover in-flight
receives, and call it from `checkGenTransferStatus`'s timeout-eviction
path.

dataTransceiver.cpp (CacheReceiver::Impl):
- Add a per-request cancel-flag registry
  (mInFlightCancelMutex + mInFlightCancelFlags) keyed by request id.
- Register a fresh std::atomic<bool> in
  requestAndReceiveAsyncMultiThreads before enqueueing; attach the
  shared_ptr to RequestAndPromise so it survives through the worker.
- Worker in request() passes the flag to requestSync; unregisters
  after requestSync returns (success or exception).
- requestSync takes the flag, checks it at entry, passes it to
  receiveReadySignal, and re-checks after the ready-signal wait
  returns.
- receiveReadySignal gains a per-request-cancel overload that plumbs
  the flag as the DataContext's transferTerminate — the existing
  polling loop in AgentConnectionManager::waitForNotification then
  observes the flag via ctx.getTransferTerminate().load() and exits
  early with isReady=false.
- cancelRequest, when the request is not found in mRequestsQueue,
  now flips the registered flag (previously it just logged
  "Cannot cancel"). Returns true for both queued and in-flight cases.
- ~Impl flips all per-request flags alongside setting mTerminate so
  process shutdown still unblocks recvReadySignal polling loops.
- Impl::receiveAsync (currently unused via CacheReceiver public API)
  switched to a lambda to disambiguate the new requestSync overload.

cacheTransceiver.cpp:
- checkGenTransferStatus timeout-eviction branch now calls
  mCacheReceiver->cancelRequest(*request) before erase, signaling the
  blocked task to abort via the per-request flag.
- checkContextTransferStatus timeout-eviction branch calls
  mCacheSender->cancelRequest(*request) symmetrically for defense in
  depth; sender zombies empirically unwind on peer teardown but
  clearing the bookkeeping is cheap and consistent.
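The registry pattern above can be modeled in Python, with threading.Event standing in for std::atomic<bool>; the class and method names are illustrative, not the real API:

```python
import threading

class InFlightCancelRegistry:
    """Sketch of the per-request cancel-flag registry
    (mInFlightCancelMutex + mInFlightCancelFlags, keyed by request id)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._flags = {}

    def register(self, req_id):
        """Called before enqueueing; the worker polls the returned flag."""
        with self._lock:
            return self._flags.setdefault(req_id, threading.Event())

    def cancel(self, req_id):
        """Flip the flag for an in-flight request; False if unknown."""
        with self._lock:
            flag = self._flags.get(req_id)
        if flag is None:
            return False
        flag.set()
        return True

    def unregister(self, req_id):
        """Called after the worker returns (success or exception)."""
        with self._lock:
            self._flags.pop(req_id, None)

    def cancel_all(self):
        """Shutdown path: unblock every waiter, like ~Impl flipping all flags."""
        with self._lock:
            for flag in self._flags.values():
                flag.set()
```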

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…e send honor perRequestCancel

v11b ([reqSync] instrumentation commit e61a5fa) empirically pinned
the post-saturation no-recovery wedge to CacheReceiver::Impl::sendRequestInfo:
pre-sendRequestInfo fires for a reqId and has no matching
post-sendRequestInfo for 60 s+; after the overall kv_transfer_timeout_ms
cancelRequest flips the per-request flag and logs "Flipped in-flight
cancel flag", but the blocking send call inside sendRequestInfo never
observes the flag because the polling loops below requestSync don't
read ctx.getTransferTerminate(). The drain worker therefore never
returns to the CV wait and every subsequent enqueue's notify_all() is
delivered to a thread not waiting on it. Queue backs up. Every new
transfer at the 60 s deadline hits Path A silent erase.

Fix: plumb perRequestCancel through the send path so the blocking
calls below can observe the flag.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (CacheReceiver::Impl):
- sendRequestInfo(LlmRequest) -> overload now delegating to the new
  sendRequestInfo(LlmRequest, std::atomic<bool> const& perRequestCancel)
  that threads perRequestCancel through:
    * checks perRequestCancel at the top of the per-counterpart
      for-loop (throws to abort further peer notifications once a
      cancel has fired),
    * passes it to the non-agent sendRequestInfo(connection, info,
      perRequestCancel) helper,
    * uses it as the DataContext's transferTerminate in the
      TransferSession constructor, so the downstream data-phase
      operations (receiveSync -> unformat -> per-connection send/recv
      -> AgentConnection::send) also honor the cancel flag.
- sendRequestInfo(Connection*, RequestInfo const&, perRequestCancel):
  plumbs the flag into each of the three DataContext constructions
  (kID_TAG / kINFO_SIZE_TAG / kINFO_TAG) used in the non-agent path.
- requestSync(..., perRequestCancel): calls the new two-arg
  sendRequestInfo.
- receiveReadySignal(session, perRequestCancel): adds a
  between-iteration perRequestCancel check in the per-peer for-loop
  so the multi-rank wait sequence bails out on cancel.

cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
(AgentConnection::send):
- Replaces the unbounded status->wait() with a polling loop of
  status->wait(100ms) that checks ctx.getTransferTerminate() between
  polls. If cancel fires mid-send, throws TLLM_THROW so requestSync
  unwinds via RequestSpecificException; the drain worker's existing
  catch(std::exception) path handles the throw, sets the promise, and
  resumes. TransferStatus exposes no abort primitive, so this leaks
  at most one in-flight xfer descriptor per cancel event — preferable
  to an unrecoverable wedge.
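The wait-loop change in miniature, with threading.Event.wait standing in for the TransferStatus wait (illustrative sketch, not the real API):

```python
import threading

def poll_wait_with_cancel(done, cancel, slice_s=0.1):
    """Bounded-slice wait: instead of blocking forever, wake every slice
    and check the per-request cancel flag; on cancel, raise so the
    caller's existing exception path can unwind the worker."""
    while not done.wait(timeout=slice_s):
        if cancel.is_set():
            raise RuntimeError("transfer cancelled mid-send")
```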

The v11b evidence showed the wedge sub-call is sendRequestInfo. This
commit addresses that path (via DataContext plumbing into the
underlying AgentConnection::send poll loop) and extends the same
pattern to receiveReadySignal (for iteration-level cancel observance)
and to the data-phase transfers that go through the TransferSession's
DataContext (receiveSync's unformat -> AgentConnection::send).

Still out of scope for this commit (per docs/pr-13056-option-c.md):
per-sub-call cancel inside notifySyncMessage (NIXL backend internals
not reachable from this layer), and any sender-side in-flight flag
parity (CacheSender::Impl still lacks the flag mechanism entirely).

See docs/pr-13056-no-recovery-ghost-ucx.md for the saturation repro
and docs/pr-13056-option-c.md for the v11b evidence + decision tree.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…Message pre-notify check

Closes the two TBDs left open after Option C (commit 7381830):

1. Sender-side in-flight cancel flag (symmetric to CacheReceiver::Impl)
-----------------------------------------------------------------------
Pre-fix, CacheSender::cancelRequest could only abort requests still in
the mReadyResponses queue that were NOT the mCurrentRequest. In-flight
sends (request is mCurrentRequest, worker is inside sendSync's data
transfer) just logged "Cannot cancel request" and returned false —
leaving the send to block on the peer indefinitely when that peer's
receive side has already been evicted.

Add a per-request cancel-flag registry to CacheSender::Impl:
- mInFlightCancelMutex + mInFlightCancelFlags map (reqId -> shared_ptr<atomic<bool>>)
- getOrCreateInFlightCancelFlag(reqId) helper
- sendAsync registers a flag at enqueue time
- recvRequestInfo constructs the TransferSession's DataContext using
  the registered flag (via *cancelFlag) instead of mTerminate, so the
  downstream data-phase send/recv operations — specifically
  AgentConnection::send's poll-wait loop added in 7381830 —
  observe the flag via ctx.getTransferTerminate()
- cancelRequest flips the flag on the previously "Cannot cancel" miss
  path (returns true now)
- removeResponse and the sendResponse cancel-cleanup branch erase the
  flag after send completion / cancel
- terminate() flips every registered flag before joining the response
  worker so shutdown still interrupts in-flight sends (the DataContext
  references per-request flags, not mTerminate)

2. notifySyncMessage pre-notify cancellability
----------------------------------------------
notifySyncMessage drops into the NIXL backend and is not interruptible
from this layer. Two narrow windows where a cancel fired but the
notify still ran — fix both:

- AgentConnection::send: added a cancel check between the poll-wait's
  TransferState::kSUCCESS return and the notifySyncMessage call. If
  the cancel fires in that gap, throw instead of sending a notify the
  peer will never act on.
- AgentConnection::sendRequestAndBufferInfo: added an optional
  `std::atomic<bool> const* perRequestCancel = nullptr` parameter. When
  non-null and flipped true, throws before notifySyncMessage. Caller
  in CacheReceiver::Impl::sendRequestInfo passes &perRequestCancel.

Limitations still standing after this commit
----------------------------------------------
- notifySyncMessage itself is not interruptible mid-call. The pre-notify
  check only catches the gap between wait() returning and notify
  starting. If notifySyncMessage blocks internally on a stuck peer
  (unclear whether NIXL's genNotif is fully synchronous), the cancel
  can't unwind it from this layer. Full remediation would need a new
  cancellation API on BaseTransferAgent.
- TransferStatus exposes no abort primitive; AgentConnection::send's
  mid-poll throw leaks at most one in-flight xfer descriptor per
  cancel event (acceptable vs the alternative of a drain-worker wedge).
- Race window between sendAsync (sender registers flag) and
  recvRequestInfo on the sender side — if the control message from the
  peer arrives before sendAsync registers, getOrCreateInFlightCancelFlag
  creates the flag at recvRequestInfo time, so lifecycle is intact, but
  a cancel fired in that early window uses the getOrCreate flag as the
  authority — confirmed consistent.
- No sender-side equivalent of the catch(...) drain-loop hardening
  from commit f63172f; CacheSender's response() worker has a top-level
  catch(std::exception) but no catch(...). If a non-std throw escapes
  from sendSync, the sender worker thread dies and all queued sends
  stall. Separate commit.
- No regression test for sender-side in-flight cancel (same constraint
  as the Option C commit: requires a stalled-peer harness).

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…t catch(...) to sender worker

Two pre-e2e cleanups identified during code trace:

1. Sender-side mInFlightCancelFlags erase was unsafe
--------------------------------------------------------
In commit 3736d11 I erased the per-request cancel flag from
mInFlightCancelFlags inside removeResponse() and inside the
sendResponse cancel-cleanup else-branch. The session's DataContext
(built in recvRequestInfo) holds a reference (not ownership) to the
atomic managed by the shared_ptr in that map. Tracing:

- Success path: sendAndRemoveResponse calls release(id) which destroys
  the session, THEN sendResponse calls removeResponse. With the erase
  in removeResponse, the flag was dropped AFTER the session — safe.
- Cancel-throw path: sendSync throws (via the Option C cancel check),
  sendAndRemoveResponse's catch skips release(id), session remains in
  mRequestToSession. removeResponse still runs → flag erased while the
  session (with its DataContext ref) is still live. Dangling reference.
- mCancelledRequests cancel path (isReady=false): release not called;
  inline erase of the flag similarly dangles session's DataContext.

Nothing in the current code dereferences the session's DataContext
after sendSync returns, so this is a latent UAF rather than an
observed crash. But it's fragile and would trip the first time anyone
adds post-send session iteration.

Move the flag erase out of removeResponse and out of sendResponse's
cancel-cleanup into release(id), where the session is actually
destroyed. Cancel paths that skip release() intentionally leak the
flag shared_ptr until ~Impl / terminate(), matching the existing
session-leak behavior on those same error paths. release() orders
session-destroy before flag-drop so DataContext reference is never
dangling.

2. CacheSender::Impl::response() lacked catch(...)
----------------------------------------------------
Symmetric to the catch(...) added to CacheReceiver::Impl::request()
in commit f63172f. response() is noexcept, so a non-std::exception
escape (integer throw, custom type throw from NIXL/UCX backends) would
call std::terminate. Adding catch(...) resolves pending promises with
current_exception() so callers see a failure, then exits the worker
cleanly. Worker thread still exits after the catch — sender is dead
for this process, but fail-closed via promises is strictly better
than terminate().

No behavior change on the happy path. No new primitives. Trace is now
clean for both receiver and sender sides of the cancel plumbing.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
… diagnostic [buf] logs

Trace through the receiver-side sendRequestInfo path identified a
more likely wedge site than notifySyncMessage: the CV wait inside
BaseTransBufferManager::assignBufferIndex (baseTransBuffer.cpp:228
pre-fix) parks on `mBuffersCV.wait(lk, predicate)` with no timeout
and no awareness of the per-request cancel flag. Under saturation
with N in-flight wedged transfers all holding their receive-buffer
slots, every new call to sendRequestInfo parks here and never
observes perRequestCancel — exactly fitting v11b's [reqSync]
pre-sendRequestInfo-with-no-post-sendRequestInfo signature.

Unlike notifySyncMessage (a thin wrap over NIXL's genNotif with no
timeout API), the buffer-pool CV does have a natural wait_for
variant we can plumb.

Changes
-------

baseTransBuffer.h: extend public `assignBufferIndexForRecv` and
private `assignBufferIndex` with three optional parameters:
  - std::atomic<bool> const* perRequestCancel = nullptr
  - int64_t waitSliceMs = 100
  - std::optional<uint64_t> requestIdForLog = std::nullopt
Default values preserve existing behavior for unchanged callers;
the new capability is opt-in.

baseTransBuffer.cpp: replace the unconditional `mBuffersCV.wait`
with a bounded loop that:
  - re-checks the predicate every waitSliceMs via wait_for;
  - checks perRequestCancel between slices and TLLM_THROWs on cancel,
    so the drain worker can unwind via its existing catch(std::exception);
  - logs [buf] markers — ENTER (WAIT) when the pool is exhausted,
    STILL_WAITING every ~5 s of parking (50 × 100 ms), ACQUIRED on
    success, CANCEL on cancel — each tagged with the reqId so they
    cross-reference cleanly with [reqSync] and [drain] log trails.
The hot path (pool not exhausted) takes the no-wait fast-path with
zero log spam.

dataTransceiver.cpp (CacheReceiver::Impl::sendRequestInfo, agent
path): pass &perRequestCancel and the llmRequest's mRequestId into
each assignBufferIndexForRecv call. Wrap the loop with [reqSync]
pre-/post-assignBufferIndexForRecv markers so the v12 diagnostic
trail can distinguish pool-exhaustion wedges from deeper NIXL
notifySyncMessage wedges without ambiguity.

Expected behavior after this change on the v11b workload
--------------------------------------------------------

If the wedge was pool exhaustion (the most-likely hypothesis):
  - [buf] assignBufferIndex WAIT fires whenever saturation hits
  - Some mix of ACQUIRED (pool drains, wait wakes up) and CANCEL
    (kv_transfer_timeout_ms-driven cancelRequest unblocks the wait)
  - No more unmatched [reqSync] pre-sendRequestInfo; all requests
    either finish or are cancelled with a captured latency and
    pool-occupancy snapshot.

If the wedge was actually deeper (notifySyncMessage / NIXL genNotif):
  - [reqSync] pre-assignBufferIndexForRecv / post-assignBufferIndexForRecv
    pair cleanly (no [buf] WAIT → STILL_WAITING pattern)
  - [reqSync] post-sendRequestInfo is still missing for wedged reqIds
  - Diagnosis shifts to limitation 1 (notify backend) with the next
    fix direction being a supervisor thread or a NIXL API addition

Either outcome narrows the remaining problem surface.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…aking pool slots on non-happy exits

v12 image (PR HEAD with assignBufferIndexForRecv bounded-wait + cancel
plumbing) reproduced saturation with Phase 1 = 11/104 and probes 0/5.
Decode log showed the mechanism unambiguously:
  - 12 reqIds reached [reqSync] post-assignBufferIndexForRecv
    (buffer slot acquired from the size-1 pool)
  - 11 completed via the normal receiveSync → formatter →
    freeBufferIndexForRecv chain (slot released)
  - 1 (reqId=8584867770318859) took the `(not-ready or cancel after
    ready)` early-return branch because prefill's own ctx-timeout
    fired and signalled isReady=0 — receiveSync was skipped entirely,
    so the formatter never released the slot. That single leak
    permanently wedged the size-1 pool, and every subsequent
    assignBufferIndexForRecv parked on the CV wait forever.

Root cause pattern
------------------
CacheReceiver::Impl::requestSync has at least seven exit paths (normal,
early-cancel, not-ready, cancel-after-ready, cancel from inside
sendRequestInfo, cancel from inside receiveReadySignal, exception
escape). Pre-fix, only one of them — normal → receiveSync →
formatter — released the buffer index. Every other path leaked
it. v12 empirically proved the (not-ready) branch; the remaining paths
were latent but reachable under different workloads.

Fix: RAII holder
----------------
New BufferIndexHolder in baseTransBuffer.h — move-only scoped owner
for a buffer index, releases on destructor (including stack unwind
from exceptions). detach() hands off ownership when the formatter
inside receiveSync takes over on the happy path.

Changes:

cpp/tensorrt_llm/batch_manager/baseTransBuffer.h:
  - Added BufferIndexHolder class (inline, move-only, noexcept
    release).

cpp/tensorrt_llm/batch_manager/baseTransBuffer.cpp:
  - BufferIndexHolder::release() out-of-line so it can call
    freeBufferIndexForRecv/Send on the manager. Emits
    [buf] AUTO_RELEASE on each RAII release so post-saturation
    traces can attribute how often non-happy-path exits fire vs
    formatter releases.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (CacheReceiver::Impl):
  - requestSync now acquires each receive-buffer index directly
    (moving the acquisition up from sendRequestInfo), wraps each
    acquisition in a BufferIndexHolder, and passes the raw
    cacheBufferIds vector into sendRequestInfo as a new parameter.
  - sendRequestInfo(llmRequest, perRequestCancel) becomes a thin
    delegator to a new three-arg overload sendRequestInfo(
    llmRequest, perRequestCancel, preAcquiredCacheBufferIds). The
    three-arg variant skips the internal acquire when the caller
    provides ids (requestSync path), matching pre-RAII behavior
    when the vector is empty (legacy / public-wrapper path).
  - After receiveSync returns on the happy path, detach() every
    holder so the formatter's existing freeBufferIndexForRecv call
    doesn't double-release.
  - All other exits from requestSync (early-cancel, not-ready,
    cancel-after-ready, throws from sendRequestInfo /
    receiveReadySignal) unwind through ~BufferIndexHolder and
    release the slot. No branch-specific release calls needed.

Expected v13 behavior
---------------------
  - Phase 1 completion count substantially higher than 11/104 —
    pool no longer wedges after the first (not-ready) exit.
  - All 5 recovery probes return HTTP 200 within single-digit
    seconds (assuming limitations 1-3 don't trip).
  - [buf] AUTO_RELEASE fires on (not-ready) and cancel branches;
    count correlates with early-exit traffic.
  - [buf] WAIT → ACQUIRED pairs stay bounded; no permanent
    concurrency=C/C saturation.
  - The verification script in docs/pr-13056-raii-buffer-release.md
    (every post-assignBufferIndexForRecv reqId has a matching
    [reqSync] EXIT) runs clean.

Still not addressed here
------------------------
  - Sender-side parity (assignBufferIndexForSend release discipline
    on CacheSender's non-happy paths) — separate commit if observed
    to leak.
  - Pool size = 1 is architecturally small for conc=48 workloads.
    Throughput will be bounded by per-request buffer dwell time
    even without leaks. Separate tuning discussion.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…qId-tagged [buf] logs

Extends the receiver-side RAII pattern (30f9cb2) to the two sender-side
cache formatters so any non-happy exit between assignBufferIndexForSend and
the explicit freeBufferIndexForSend auto-releases the send-pool slot.
v13 saturation testing showed the same wedge class on the send side:
one early exit permanently wedged the size-1 pool and every subsequent
transfer parked forever on the CV wait.

Happy-path timing is unchanged: sendAllBuffers returning is still treated
as "transport done, slot safe to free immediately" (the pre-PR invariant
at cacheFormatter.cpp:568 / mlaCacheFormatter.cpp:383). The holder is
detached before the existing free call, so the destructor is a no-op on
the happy path; only non-happy exits fire AUTO_RELEASE.

Also closes three instrumentation gaps for diagnosing a wedged pool:

1. assignBufferIndexForSend gains an optional requestIdForLog parameter
   (parity with assignBufferIndexForRecv), forwarded into the internal
   bounded CV-wait so send-pool WAIT / ACQUIRED / STILL_WAITING / CANCEL
   lines can be attributed to a specific reqId.

2. BufferIndexHolder carries requestIdForLog, so the AUTO_RELEASE log on
   stack unwind names the responsible request. Without it, the v13-class
   bug surfaces as "something leaked" with no way to cross-reference the
   [reqSync] trail. Receiver-side call site now passes reqIdForLog too.

3. New SEND_ACQUIRED / SEND_RELEASE DEBUG markers at both formatter sites
   give per-request send-pool turnover without relying on AUTO_RELEASE
   firing only on error paths.

rnnCacheFormatter has the same bug class at its send/recv acquire sites
but is off the NIXL ghost-UCX path that drove this PR — flagged as a
follow-up rather than expanding scope here.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
… FUTURE_WAIT/JOIN diagnostics

Follow-up to the sender-side RAII commit. Addresses three residual risks
raised during review:

1) **Race between sendAllBuffers return and the explicit free.**
   The previous detach() + freeBufferIndexForSend() pair left a narrow
   window where a throw between the two calls (e.g. from setTime's
   std::map::at()) would trip the destructor fallback and emit a
   misleading AUTO_RELEASE for what is really a correctly-handled
   exception. Replaced with a single noexcept BufferIndexHolder::release()
   (now public) that frees and disarms atomically at the API level, and
   moved the call to immediately after sendAllBuffers returns. Post-
   release throws no longer leak OR pollute the AUTO_RELEASE leak signal.

   BufferIndexHolder now has three API modes:
   - release(): happy-path, silent free + disarm.
   - detach(): hand off ownership to a downstream owner (recv side uses
     this when receiveSync's formatter takes the release responsibility).
   - destructor fallback: any exit path that left mHeld=true — logs
     AUTO_RELEASE so the non-happy exit is visible in [buf] diagnostics.

2) **No cancel on send-pool CV wait.**
   assignBufferIndexForSend now accepts (perRequestCancel, waitSliceMs,
   requestIdForLog) — full parity with assignBufferIndexForRecv. Both
   formatter sites pass &session.getDataContext().getTransferTerminate()
   so a cancel-in-flight while the sender is parked on a saturated pool
   unblocks via the existing bounded-wait loop in assignBufferIndex.

3) **Wrong-hypothesis risk: wedge inside sendAllBuffers, not at formatter.**
   Added [send] diagnostic markers in the send loops so a wedge in the
   transport layer is distinguishable from a formatter-level short-circuit:
   - [send] FUTURE_WAIT reqId=X batch=K concurrency=C remain=R  (before future.get())
   - [send] FUTURE_JOIN reqId=X batch=K elapsedMs=E              (after future.get())
   - [send] START/PRE/DONE reqId=X connIdx=N ...                 (per send DEBUG)
   These pair with the existing [buf] WAIT/ACQUIRED/AUTO_RELEASE trail
   so a v13-class wedge can be attributed to (reqId, batch, connIdx) even
   if the cause turns out to be transport-layer rather than formatter-
   level. FUTURE_WAIT/JOIN are WARN-level (one per concurrency batch);
   per-send markers stay DEBUG to avoid hot-path spam.

Touches both cacheFormatter.cpp and mlaCacheFormatter.cpp symmetrically.
rnnCacheFormatter still uses the simple no-arg assignBufferIndexForSend();
listed as a follow-up (off the NIXL ghost-UCX path).

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…eiver — close UAF, remove Path A/B guards, unwedge KV block leak

The KV-block leak described in docs/pr-13056-next-step-asan-and-leak.md
Track B (decode pod wedged at 0/213 requests scheduled after the gen
transfer deadline fired) is a direct consequence of the Path A/B
guards in py_executor.py deferring Python-side _terminate_request
while a request is DISAGG_*_TRANS_IN_PROGRESS. Those guards exist to
avoid a forensically-confirmed UAF on CacheTransceiver's raw
LlmRequest* tracking (mRequestId observed as 0x5555555555555555 under
MALLOC_PERTURB_=85 — glibc's free-fill pattern read through a dangling
pointer). Whenever the C++ deadline eviction failed to reach an entry
— e.g. the receiver-side orphan class where Python tracks IN_PROGRESS
but the C++ mRequesterFutures has no future to iterate — the guards
kept skipping cleanup forever and the request pinned its KV blocks.

Instead of working around the UAF with a grace-period or
reconciliation backstop, this commit fixes the UAF at the source:

- BaseCacheTransceiver's respondAndSendAsync / requestAndReceiveSync /
  requestAndReceiveAsync / cancelRequest now take
  std::shared_ptr<LlmRequest> (not a raw pointer). Pybind already
  exposes LlmRequest via std::shared_ptr, so the binding auto-converts
  at the boundary with no Python-side changes.

- CacheTransceiver::mSenderFutures and mRequesterFutures now hold
  std::shared_ptr<LlmRequest> in the pair. Entries are still erased on
  the normal completion / deadline / exception paths, so the strong
  reference is dropped at the end of the transfer lifecycle — no leak.

- CacheReceiver::Impl::RequestAndPromise::mRequest and
  CacheSender::Impl::Response::mRequest now hold
  std::shared_ptr<LlmRequest>. The receive-async std::async lambda
  captures the shared_ptr by value, so Python-side _terminate_request
  can no longer drop the LlmRequest out from under an in-flight
  worker's dereferences. Same for the send-async path's Response.

- With the UAF closed, the Path A guard in _handle_responses (gen-side
  timeout-and-skip-if-IN_PROGRESS) and the Path B guard in
  _check_disagg_ctx_cache_transfer_status (ctx-side ditto) are no
  longer needed. Both are removed: when py_kv_transfer_timed_out
  fires, Python calls cancel_request + _handle_errors immediately,
  recovering the KV blocks even when the C++ tracker has orphaned the
  entry.

Lifecycle after this commit, for a non-orphan request:
  1. Python: request_and_receive_async → C++ enqueues, async worker
     spins up.
  2. Worker finishes (completion / peer ack / cancel) → future
     resolves.
  3. checkGenTransferStatus sees future ready → completeEntry sets
     state=COMPLETE, erases entry (drops its shared_ptr).
  4. Python observes state=COMPLETE → runs through normal completion
     path (no _terminate_request yet; request proceeds to
     generation).
  5. Eventually Python's active_requests drops the request → pybind
     shared_ptr refcount=0 → LlmRequest destroyed.

For an orphan (C++ lost the future but Python state still IN_PROGRESS):
  1. Python _check_kv_transfer_timeout sets py_kv_transfer_timed_out.
  2. _handle_responses sees the flag; with the guard removed, proceeds
     to cancel_request + _handle_errors.
  3. _handle_errors → _terminate_request → free_resources →
     remove_sequence → KV blocks released.
  4. LlmRequest still alive while any remaining C++ reference holds
     it (async worker if any); eventually worker drops, request
     destroyed. No UAF.

Also reverts 80474ce's grace-period fallback — it was a fragile
workaround (magic 2x multiplier, wall-clock vs monotonic skew, residual
UAF window) for the same underlying issue the shared_ptr refactor
cleanly fixes at the source. The [reqFut] instrumentation from
5904b9e stays (still useful for investigating the orphan formation
path itself; the leak is now decoupled from the orphan).

Scope: touches BaseCacheTransceiver's API boundary but since every
caller already holds std::shared_ptr<LlmRequest> (RequestVector is
std::vector<std::shared_ptr<LlmRequest>>; pybind auto-converts),
existing call sites drop their .get() and compile unchanged.
No behavior change for Python tests — they pass Python LlmRequest
objects which pybind converts to shared_ptr.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…ion escapes

The v9 in-flight cancellation primitive (267e813) assumes the
requestAndReceiveAsyncMultiThreads drain worker is alive and pulling
from mRequestsQueue. In the post-saturation "no recovery" repro the
worker has either exited or is no longer consuming the queue, so:

- New requests enqueue but are never dequeued.
- The in-flight cancel flag is registered before enqueue, so the flag
  map has entries for every pending request.
- When the 60s deadline fires, cancelRequest finds the request still
  in the queue (Path A), erases it silently, and returns true. The
  per-request-flag flip path (Path B) is never reached because Path
  A always succeeds first.

The fix must target the worker lifecycle. Two additions here:

1. Add a catch(...) block to CacheReceiver::Impl::request()'s drain
   loop. Today the only handlers are
   RequestSpecificException / std::exception, so any non-std throw
   from NIXL / UCX / C++ ABI escapes the loop, terminates the worker
   thread, and strands every queued receive. The catch-all sets the
   caller's promise (so the future resolves with an exception rather
   than hanging) and continues the loop.

2. Add TLLM_LOG_WARNING markers at ENTER / pre-requestSync /
   post-requestSync / EXIT / catch-all so the next image rebuild can
   distinguish between:
     - worker never launched (no ENTER)
     - worker died on non-std escape (EXIT with mTerminate=0 + a
       matching catch-all log)
     - worker deadlocked inside requestSync (pre-requestSync with no
       matching post-requestSync for >kv_transfer_timeout_ms)
     - worker healthy (many pre/post pairs, no EXIT)
   The decision tree in the ghost-UCX investigation doc maps each
   pattern to the next diagnostic step.

The catch(...) is both the diagnostic that would name the culprit
and a likely fix — keeping the drain loop robust to any escape has
no downside and closes the lifecycle hole that the per-request
cancel mechanism implicitly depends on.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
@yifjiang force-pushed the fix/prefill-sender-total-timeout branch from 33262c5 to 369d944 — April 22, 2026 23:12
@yifjiang changed the title from "[https://nvbugs/5969206][fix] Enforce kv_transfer_timeout_ms on prefill sender + guard against dangling pointers" to "[NVBug 5969206][draft] Subset of PR 13056 cherry-picked onto 4e69c14f73 (rc11 base) — 15 C++/Python fixes" — Apr 22, 2026
@yifjiang changed the title to "[NVBug 5969206][draft] Subset rebased onto 4e69c14f73 (v1.3.0rc11 tag) — 15 C++/Python fixes" — Apr 22, 2026
@yifjiang changed the title to "[https://nvbugs/5969206][fix] Disagg transceiver: shared_ptr<LlmRequest>, RAII buffer holders, kv_transfer_timeout_ms enforcement" — Apr 22, 2026
@yifjiang changed the title to "[https://nvbugspro.nvidia.com/bug/6104831][fix] Disagg transceiver: shared_ptr<LlmRequest>, RAII buffer holders, kv_transfer_timeout_ms enforcement" — Apr 22, 2026
…opped during d7844fc cherry-pick conflict resolution on the 4e69c14 base

PR 13056 commit d7844fc preserves an is_disagg_context_complete_state
branch that was introduced earlier in the PR's base. On the rc11 base
(4e69c14) that guard is not present, so the conflict resolution
during cherry-pick dropped it — which leaves _handle_responses vulnerable
to double-terminating a COMPLETE-state request that _end_transfer_and_maybe_terminate
left in active_requests (end_transfer returned False). Re-add the guard
to match PR 13056's intent. nvbug/5961736.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
@yifjiang force-pushed the fix/prefill-sender-total-timeout branch 2 times, most recently from 8217bcb to c9777c4 — April 22, 2026 23:33
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 1, 2026
…ifestations broaden sig #7 to a CacheSender::Impl::* bug class

run9 (rc11 + our fixes + UCX) and run10 (PR NVIDIA#13056 + UCX) both used a
gdb capture loop on the dataTransResp thread to pin down the exact mutex
behind sig #7. The wedge changed character in both runs:

- run9 ctx mpi worker SIGSEGVs at iter 92 of the burst inside
  _PyObject_GenericGetAttrWithDict (Python C-API), downstream of the
  sig #1 fix path firing cleanly.
- run10 ctx mpi worker SIGSEGVs synchronously inside
  CacheSender::Impl::handleAsyncSend(AsyncSendResource&) on the very
  first sanity-probe request - no concurrency, no burst, no
  cancellation. Zero UCX/NIXL frames in the stack; root cause is most
  plausibly a null shared_ptr<LlmRequest> deref at line 594 since
  PR NVIDIA#13056's ownership model is necessary but does not enforce
  Response::mRequest non-null at producer or consumer.

Both findings are cleanly TRT-LLM-internal and broaden sig #7 from a
single pthread_mutex_lock deadlock to a class of CacheSender::Impl::*
bugs with at least four observed manifestations across two transports
and three fix bundles.

Touched sections: Status block, caveat block, signature table (sig #7
row), sig #7 section variants/fix/status, Phase 12 cross-reference,
new Phase 13 section, Next Steps items 7 / 7a / 7b.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 3, 2026
…le structure

Split the 3193-line single-file report into 13 reader-friendly files
under docs/investigations/nvbug-6104831-disagg-permanent-wedge/:

- README.md: top-level orientation, status, navigation, suggested
  reading paths.
- 01-background.md: architecture diagrams, request lifecycle
  walkthrough, state machine, cancellation flow.
- 02-failure-signatures.md: the seven failure signatures (#1–#7).
- 03-defect-class-stack.md: NEW - the L1-L8 defect-class layering
  that frames the four-approach comparison.
- 04-reproduction.md: how to reproduce locally, load shape, run
  archive index.
- 05-investigation-timeline.md: chronological story (Phases 0-14).
- 06-fix-approaches/: side-by-side comparison plus one file per
  approach (A chained, B PR NVIDIA#13056, C PR NVIDIA#13495, D combo).
- 07-architectural-reflections.md: seven invariants + retrospective.
- 08-next-steps-and-pr-map.md: PR map, outstanding work, deadline
  effort estimate, run archive index.

The L1-L8 defect class stack is the new framework introduced here.
It re-frames the seven signatures as the visible faces of eight
underlying invariant gaps and is used to explain why the combo
approach (D) is the only stack that recovers cleanly across the
test matrix. The five Mermaid diagrams from the request walkthrough
move into 01-background.md as the prerequisite reading for the rest
of the investigation.

The single-file form was hard to navigate at this size; the split
follows the user's reading-path needs (cold reader, single-PR
reviewer, fix-path picker, retrospective reader) called out in the
README.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 3, 2026
Adds 00-tldr.md as the front door to the investigation. Targets a
10-minute read with two inline Mermaid figures:

- The disagg architecture + cancellation flow (where the bugs live).
- The L1-L8 → combo PR NVIDIA#13713 fix mapping (which piece closes which
  layer).

Written under the assumption that PR NVIDIA#13713 (combo: PR NVIDIA#13056 + PR
NVIDIA#13495 + eval-order fix + Python idempotency guards) is the chosen
final solution. Doesn't compare approaches A/B/C — that comparison
lives in 06-fix-approaches/README.md for readers who want depth on the
landing decision.

Section structure:
1. What is broken (the wedge symptom).
2. Where the bugs live (architecture figure).
3. Root cause: eight invariant gaps (L1-L8 table).
4. The fix (combo figure + per-piece breakdown).
5. Does it work (empirical recovery table).
6. Why this took 8 days to find (cascade pattern).
7. What is left to do.
8. Where to go from here (reader paths to deeper files).

README.md updated to put 00-tldr.md at the top of the navigation
table and add a 10-minute reading path entry.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 3, 2026
… colored fix components

Two improvements to 00-tldr.md based on review feedback:

1. Add a 7-row table after the architecture section that gives a
   one-line "where it lives" + "symptom" for each signature #1–#7.
   The TL;DR previously mentioned the seven signatures by number
   without describing them, forcing the reader to jump to
   02-failure-signatures.md to make sense of references like
   "sig #4" or "sig #7".

2. Color-code the four fix components in the combo diagram:
   - PR NVIDIA#13056: blue (lifetime + cancel-flag + RAII)
   - PR NVIDIA#13495: orange (NIXL TransferStatus::release)
   - eval-order fix: green
   - Python idempotency guards: purple

   Each L1-L8 layer node is tinted with the lighter shade of
   whichever fix closes it, so the "closes" arrows are reinforced
   by colour-matching. Makes the fix-to-layer mapping visually
   scannable at a glance.

Note added underneath the diagram explaining the colour scheme.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>