
[https://nvbugspro.nvidia.com/bug/6104831][fix] Disagg transceiver: shared_ptr<LlmRequest>, RAII buffer holders, kv_transfer_timeout_ms enforcement#13056

Open
yifjiang wants to merge 16 commits into NVIDIA:main from yifjiang:fix/prefill-sender-total-timeout

Conversation


@yifjiang yifjiang commented Apr 15, 2026

Summary

Fixes NVBug 6104831: disaggregated-serving wedges where prefill/decode workers either crash with std::future_error: Broken promise / bad_optional_access, or silently leak KV-cache blocks and buffer-pool slots until the pool saturates and no new transfers can start. Full investigation notes are in the debug doc.

16 coupled C++/Python commits covering: (a) closing a raw-pointer UAF class on the transceiver's futures maps, (b) enforcing kv_transfer_timeout_ms on both sender and receiver, (c) plumbing perRequestCancel through the data-phase paths so evicted requests actually release their resources, (d) RAII on the buffer-index pool, and (e) hardening the drain worker against non-std::exception escapes.

One-sentence version: shared_ptr ownership + enforced deadline + working cancel path + RAII on pool slots.

Why all 16 commits?

Commit 14 (649d1466b) is the UAF fix proper — it changes mSenderFutures / mRequesterFutures in CacheTransceiver to hold std::shared_ptr<LlmRequest> instead of raw pointers. Confirmed via MALLOC_PERTURB_=85 (observed mRequestId=0x5555555555555555). With that ownership change, the Python-side Path A/B guards that previously deferred cleanup to avoid the UAF are no longer needed; commit 14 removes them.
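The ownership change can be sketched in isolation. The snippet below is a minimal illustration, not the real TensorRT-LLM code: `LlmRequest` and the futures container are simplified stand-ins, and `firstRequestId` is a hypothetical helper standing in for the transceiver's status checks.

```cpp
// Minimal sketch of the ownership change in commit 14: the futures map holds
// std::shared_ptr<LlmRequest> instead of a raw LlmRequest*, so the entry
// itself keeps the request alive until it is erased. All types here are
// simplified stand-ins for the real TensorRT-LLM classes.
#include <cstdint>
#include <future>
#include <memory>
#include <utility>
#include <vector>

struct LlmRequest  // stand-in for batch_manager's LlmRequest
{
    std::uint64_t mRequestId = 0;
};

// Before: std::vector<std::pair<LlmRequest*, std::future<void>>> — the raw
// pointer dangles once the owner frees the request. After: shared ownership,
// so erasing the entry is the only way the transceiver drops its reference.
using SenderFutures = std::vector<std::pair<std::shared_ptr<LlmRequest>, std::future<void>>>;

std::uint64_t firstRequestId(SenderFutures const& futures)
{
    // Safe even if every external reference has been dropped: the shared_ptr
    // stored in the map keeps the LlmRequest alive.
    return futures.front().first->mRequestId;
}
```

With raw pointers, dropping the last Python-side reference while the entry is still in the map is exactly the window where the `0x55` poison pattern shows up; with `shared_ptr`, that read stays well-defined.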

But fixing the UAF alone doesn't give a working system. These are coupled fixes, not independent — each addresses a distinct failure mode on the same bug family, and removing any one leaves a different no-recovery path open:

| Failure mode | Which commits fix it | What happens if you keep only commit 14 |
| --- | --- | --- |
| UAF / Broken promise / bad_optional_access | 14 + 4 (43b518070) | fixed |
| Stuck transfer never evicted (no deadline) | 1–3, 5 (a31977564, a3bce0ce3, a8c27a45b, 3d0a6d183) | executor survives, but stuck senders pin KV blocks forever → pool exhausts → new requests hang |
| Receiver std::async task blocked in recvReadySignal even after deadline fires | 6 (6bb116dac) + 7–10 cancel-flag plumbing (62c5d24ae, 7e8016a2f, e7fcbab01, 2860ac50a) | deadline fires, Python drops its ref, but the worker task is still waiting for a signal that'll never arrive → NIXL/UCX per-request state accumulates → "no recovery" wedge even after load stops |
| Buffer-pool slot leak on non-happy exits | 11–13 RAII holders (39f5452ea, 28f836e19, 6f85ce895) | one early return in requestSync → index never released → size-1 buffer pool permanently wedged |
| Unhandled non-std::exception in drain worker | 15 (abfcb6d39) | a single std::bad_optional_access (or similar) still kills the thread |
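The last failure mode (commit 15) amounts to widening the worker loop's exception net. The sketch below uses illustrative names and generic tasks; the real drain worker processes transfer completions, but the `catch (...)` structure is the point:

```cpp
// Sketch of the drain-worker hardening in commit 15: the loop catches both
// std::exception and, via catch (...), anything else, so a stray
// std::bad_optional_access or a non-std::exception throw is counted and
// logged instead of killing the thread. Names here are illustrative.
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

// Runs every task; returns the number that threw. The loop itself never
// exits early because of an exception.
int drainLoop(std::vector<std::function<void()>> const& tasks)
{
    int failures = 0;
    for (auto const& task : tasks)
    {
        try
        {
            task();
        }
        catch (std::exception const&)
        {
            ++failures;  // the real code would log e.what() here
        }
        catch (...)  // commit 15: non-std::exception escapes no longer kill the thread
        {
            ++failures;
        }
    }
    return failures;
}
```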

Concretely, running a 1P1D disagg loadgen stress reproducer (single prefill worker, single decode worker, concurrent request burst at the saturation point, then idle and probe for recovery):

  • Commit 14 alone → crash signatures (Broken promise, bad_optional_access) disappear, but stuck transfers still never get evicted. Pool exhausts, behavior looks superficially similar to the pre-fix wedge (no crash marker, no forward progress). Still no recovery after a 180 s idle probe window.
  • 14 + deadline commits (1–5) → deadline fires, but receiver zombies accumulate: each timed-out receive leaves an std::async task spinning inside recvReadySignal, pinning NIXL state. After a saturation window, every new transfer deterministically waits kv_transfer_timeout_ms even if the system is otherwise idle.
  • 14 + 1–5 + 6–10 → recovery works in principle, but buffer-pool slots leak on error paths. Slow drift toward wedge under sustained error-path churn.
  • All 16 → crash gone, deadline fires, receivers unwind cleanly, buffer slots released; idle-then-probe reliably returns to serving within the probe window.
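The two-level timeout the deadline commits add can be sketched as a short per-iteration poll plus a total-elapsed check. Everything below is an illustrative reduction, not the real `checkContextTransferStatus`; the names `pollTransfer` and `TransferState` are invented for the sketch.

```cpp
// Sketch of the two-level timeout: a short wait_for() poll on the transfer
// future keeps the caller's loop live, and a total-elapsed check against
// kv_transfer_timeout_ms (measured from the transfer's start timestamp)
// evicts stuck senders. Without the second check, a stuck sender retries
// the poll forever, pinning KV blocks until the pool exhausts.
#include <chrono>
#include <future>

using Clock = std::chrono::steady_clock;

enum class TransferState { Pending, Done, TimedOut };

TransferState pollTransfer(std::future<void>& fut, Clock::time_point start,
    std::chrono::milliseconds pollTimeout, std::chrono::milliseconds totalTimeout)
{
    // Per-iteration poll: returns quickly whether or not the transfer is done.
    if (fut.wait_for(pollTimeout) == std::future_status::ready)
    {
        return TransferState::Done;
    }
    // Total-elapsed deadline: the enforcement this PR adds.
    if (Clock::now() - start > totalTimeout)
    {
        return TransferState::TimedOut;
    }
    return TransferState::Pending;
}
```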

Order also matters on the removal side: commit 14 only becomes safe after the earlier commits are in place, because the deadline + cancel plumbing they add is what the Path A/B guards were substituting for. Removing the guards without that plumbing in place would regress the orphan-induced KV-block leak the guards existed to mitigate.
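The perRequestCancel plumbing (commits 7–10) combines two ingredients: a per-request `std::atomic<bool>` shared with any blocked worker, and a bounded condition-variable wait whose predicate also observes that flag. The sketch below uses invented stand-in names (`BufferPool`, `acquireSlot`), loosely mirroring the bounded `assignBufferIndex` CV wait from commit 10:

```cpp
// Sketch of cancel-flag plumbing: an evicted request flips the atomic flag,
// and the bounded wait_for means a waiter can never park forever — it wakes
// on a free slot, on cancel, or on timeout. All names are illustrative.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

struct BufferPool
{
    std::mutex mtx;
    std::condition_variable cv;
    int freeSlots = 0;

    // Returns true if a slot was taken, false on cancel or timeout.
    bool acquireSlot(std::atomic<bool> const& cancelled, std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(mtx);
        bool ok = cv.wait_for(lock, timeout,
            [&] { return freeSlots > 0 || cancelled.load(); });
        if (!ok || cancelled.load())
        {
            return false;  // timed out or cancelled: caller unwinds instead of wedging
        }
        --freeSlots;
        return true;
    }
};
```

The key design point is that the cancel flag is checked inside the wait predicate, so a cancelling thread only has to set the flag and `notify_all()` — it never has to know what the waiter was waiting for.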

Commits (in applied order)

  1. a31977564 — Enforce kv_transfer_timeout_ms on prefill sender + guard against dangling pointers
  2. a3bce0ce3 — Report timed-out transfers to Python for full cleanup
  3. a8c27a45b — Apply kv_transfer_timeout_ms in blockAll mode
  4. 43b518070 — Close remaining UAF paths and enforce gen-side deadline
  5. 3d0a6d183 — Hoist kv_transfer_timeout_ms check above readiness gate
  6. 6bb116dac — Release receiver resources when gen-side deadline evicts
  7. 62c5d24ae — sendRequestInfo / receiveReadySignal / data-phase send honor perRequestCancel
  8. 7e8016a2f — Sender-side in-flight cancel parity + notifySyncMessage pre-notify check
  9. e7fcbab01 — Tie sender cancel-flag lifetime to session + port catch(...) to sender worker
  10. 2860ac50a — Bound assignBufferIndex CV wait + plumb cancel + diagnostic [buf] logs
  11. 39f5452ea — RAII BufferIndexHolder for requestSync — stop leaking pool slots on non-happy exits
  12. 28f836e19 — RAII BufferIndexHolder on sender formatters + reqId-tagged [buf] logs
  13. 6f85ce895 — Send-side RAII: atomic release(), cancel parity, FUTURE_WAIT/JOIN diagnostics
  14. 649d1466b — Carry std::shared_ptr<LlmRequest> through transceiver — close UAF, remove Path A/B guards, unwedge KV block leak
  15. abfcb6d39 — Harden drain worker loop against non-std::exception escapes
  16. c9777c4ac — Re-add is_disagg_context_complete_state guard to _handle_responses
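The RAII holder from commits 11–13 follows a standard pattern; this is a simplified stand-in (`BufferIndexHolder` over a plain `std::vector<int>` free-list), not the actual TensorRT-LLM class, and the idempotent `release()` here is single-threaded where commit 13's is atomic:

```cpp
// Sketch of the RAII buffer-index holder: the pool index is returned in the
// destructor, so every exit path — early return, exception, timeout — gives
// the slot back. release() is idempotent via std::exchange (the real code
// uses an atomic for thread safety per commit 13's "atomic release()").
#include <utility>
#include <vector>

class BufferIndexHolder
{
public:
    BufferIndexHolder(std::vector<int>& pool, int index)
        : mPool(&pool)
        , mIndex(index)
    {
    }

    BufferIndexHolder(BufferIndexHolder const&) = delete;
    BufferIndexHolder& operator=(BufferIndexHolder const&) = delete;

    ~BufferIndexHolder()
    {
        release();
    }

    // First call returns the index to the pool; later calls (including the
    // destructor after an explicit release) are no-ops.
    void release()
    {
        if (int idx = std::exchange(mIndex, -1); idx >= 0)
        {
            mPool->push_back(idx);
        }
    }

private:
    std::vector<int>* mPool;
    int mIndex;
};
```

This is exactly what closes the "one early return in requestSync → index never released" failure mode: the compiler, not the error-handling code, guarantees the slot goes back.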

Base

Branch is linear on top of v1.3.0rc11 (4e69c14f732a6e6afce4f71616db5b5cd2b10530) — chosen so it can be built + tested against the tensorrt-llm==1.3.0rc11 wheel baked into NVIDIA NGC's dynamo-trtllm images. 16 commits on top of that tag.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 15, 2026
yifjiang added a commit to yifjiang/TensorRT-LLM that referenced this pull request Apr 15, 2026
… side

Cherry-pick of PR NVIDIA#13056. Adds total elapsed time check against
kv_transfer_timeout_ms in checkContextTransferStatus. Without this,
stuck transfers retry the per-iteration timeout forever, holding KV
blocks indefinitely and exhausting the cache pool.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
@yifjiang yifjiang force-pushed the fix/prefill-sender-total-timeout branch from 408c1da to a0be6d6 Compare April 16, 2026 18:08
@yifjiang yifjiang changed the title [NVBug 5969206][fix] Enforce kv_transfer_timeout_ms on prefill sender side [NVBug 5969206][fix] Enforce kv_transfer_timeout_ms on prefill sender + guard against dangling pointers Apr 16, 2026
@yifjiang yifjiang force-pushed the fix/prefill-sender-total-timeout branch from a0be6d6 to 70c6cd6 Compare April 16, 2026 18:09
@yifjiang yifjiang changed the title [NVBug 5969206][fix] Enforce kv_transfer_timeout_ms on prefill sender + guard against dangling pointers [https://nvbugs/5969206][fix] Enforce kv_transfer_timeout_ms on prefill sender + guard against dangling pointers Apr 16, 2026
@yifjiang yifjiang force-pushed the fix/prefill-sender-total-timeout branch 2 times, most recently from a98ed73 to 8c6c8fc Compare April 16, 2026 18:27
@yifjiang yifjiang marked this pull request as ready for review April 16, 2026 18:27
@yifjiang yifjiang requested review from a team as code owners April 16, 2026 18:27
@yifjiang yifjiang force-pushed the fix/prefill-sender-total-timeout branch from 8c6c8fc to d7844fc Compare April 16, 2026 18:29

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough


Two files are modified to enhance KV cache transfer timeout handling and request state management. The C++ cache transceiver adds timeout configuration retrieval, debug logging, and timeout-based cleanup logic for pending transfers. The Python executor adds conditional state handling to defer termination of disaggregated context transmission requests.

Changes

  • KV Cache Transfer Timeout & Logging — cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp: Enhanced timeout handling with retrieval of the kvTransferTimeoutMs configuration. Added elapsed-time guards to detect and remove timed-out transfers. Introduced detailed debug logging of sender-futures state and completion statistics. Modified exception handling to log warnings without mutating request state, then remove affected entries from pending transfers.
  • Disaggregated Context Transmission State Handling — tensorrt_llm/_torch/pyexecutor/py_executor.py: Added an explicit conditional branch in _handle_responses to defer termination of requests in disaggregated context-transmission state. Prevents premature termination while allowing existing timeout checks to manage completion or timeout scenarios.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2)

  • Title check — the title accurately reflects the main changes: enforcing kv_transfer_timeout_ms and guarding against dangling pointers in the disaggregated transceiver path.
  • Description check — the PR description is comprehensive and thoroughly documents the fix, including detailed investigation notes, a 16-commit breakdown with coupling rationale, and test validation.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3


📥 Commits

Reviewing files that changed from the base of the PR and between 61cef21 and d7844fc.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py

Comment on lines +493 to +497:

```cpp
    if (!blockAll && mCacheTransceiverConfig.has_value())
    {
        senderFutureTimeoutMs = mCacheTransceiverConfig->getKvTransferSenderFutureTimeoutMs();
        kvTransferTimeoutMs = mCacheTransceiverConfig->getKvTransferTimeoutMs();
    }
```

⚠️ Potential issue | 🟠 Major

Load kvTransferTimeoutMs even when blockAll is true.

blockAll should disable the per-iteration poll timeout, not the overall transfer deadline. With the current guard, any caller that omits atLeastRequestNum still falls through to a blocking future.get() with no kv_transfer_timeout_ms enforcement, so a stuck sender can still hang forever.

Suggested fix:

```diff
-    if (!blockAll && mCacheTransceiverConfig.has_value())
+    if (mCacheTransceiverConfig.has_value())
     {
-        senderFutureTimeoutMs = mCacheTransceiverConfig->getKvTransferSenderFutureTimeoutMs();
         kvTransferTimeoutMs = mCacheTransceiverConfig->getKvTransferTimeoutMs();
+        if (!blockAll)
+        {
+            senderFutureTimeoutMs = mCacheTransceiverConfig->getKvTransferSenderFutureTimeoutMs();
+        }
     }
```

Comment on lines +499 to +513:

```cpp
    // Log mSenderFutures state for diagnosing dangling pointer issues.
    // Each entry's pointer address and request ID are logged so we can detect
    // when a pointer's underlying memory is freed (reqId changes to 0).
    if (!mSenderFutures.empty())
    {
        TLLM_LOG_DEBUG(
            "checkContextTransferStatus: mSenderFutures.size()=%zu, blockAll=%d, "
            "kvTransferTimeoutMs=%d",
            mSenderFutures.size(), blockAll ? 1 : 0, kvTransferTimeoutMs.value_or(-1));
        for (size_t i = 0; i < mSenderFutures.size(); ++i)
        {
            auto& [req, fut] = mSenderFutures[i];
            auto startTs = req->getKvCacheTransferStart().time_since_epoch().count();
            TLLM_LOG_DEBUG(" [%zu] ptr=%p reqId=%ld startTs=%ld", i, static_cast<void const*>(req), req->mRequestId,
                static_cast<long>(startTs));
```

⚠️ Potential issue | 🔴 Critical

Don't dereference req in the dangling-pointer diagnostics.

This loop reads req->mRequestId and req->getKvCacheTransferStart() from mSenderFutures, but that container is exactly where the raw pointer may already be stale. In the failure mode this PR is fixing, the debug path itself becomes the use-after-free.


Comment on lines +609 to +635:

```cpp
            if (shouldTimeout)
            {
                // IMPORTANT: Do NOT dereference request->setState() or call
                // mCacheSender->cancelRequest(*request) here. The LlmRequest*
                // in mSenderFutures is a raw pointer with no ownership — the
                // Python layer may have already freed the request. Dereferencing
                // a dangling pointer causes heap corruption (free(): invalid
                // next size). Just erase the stale entry from mSenderFutures.
                // The Python layer handles state transitions via
                // _end_transfer_and_maybe_terminate when it processes the
                // error/completion from check_context_transfer_status.
                if (startTimeValid)
                {
                    TLLM_LOG_WARNING(
                        "Context KV cache transfer timed out: elapsed %ld ms > limit %d ms. "
                        "Removing entry from mSenderFutures (ptr=%p).",
                        elapsedMs, kvTransferTimeoutMs.value(), static_cast<void const*>(request));
                }
                else
                {
                    TLLM_LOG_WARNING(
                        "Removing stale entry from mSenderFutures: transfer start time is "
                        "uninitialized (request pointer %p may be dangling).",
                        static_cast<void const*>(request));
                }
                it = mSenderFutures.erase(it);
                continue;
```

⚠️ Potential issue | 🟠 Major

Don't erase timed-out or failed entries without surfacing the request ID back to Python.

After these erase(it) calls, check_context_transfer_status() no longer returns the request ID, but tensorrt_llm/_torch/pyexecutor/py_executor.py:3244-3278 only calls _end_transfer_and_maybe_terminate() for IDs present in finished_requests + error_requests. That means the request can stay pinned in AsyncTransferManager indefinitely unless a later cancel_request() happens to succeed.

Also applies to: 658-662
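What the reviewer is asking for can be sketched as collecting the IDs of evicted entries into the error list that gets returned to Python, rather than erasing silently. `Entry` and `evictTimedOut` below are simplified stand-ins, not the real `mSenderFutures` types or the `check_context_transfer_status` API:

```cpp
// Sketch: erase timed-out entries AND report their request IDs, so the
// Python side (_end_transfer_and_maybe_terminate) can run its cleanup.
// Simplified stand-in types for illustration only.
#include <cstdint>
#include <vector>

struct Entry
{
    std::uint64_t requestId;
    bool timedOut;
};

std::vector<std::uint64_t> evictTimedOut(std::vector<Entry>& entries)
{
    std::vector<std::uint64_t> errorRequestIds;
    for (auto it = entries.begin(); it != entries.end();)
    {
        if (it->timedOut)
        {
            errorRequestIds.push_back(it->requestId);  // surface to Python
            it = entries.erase(it);
        }
        else
        {
            ++it;
        }
    }
    return errorRequestIds;
}
```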

@pcastonguay pcastonguay requested a review from Tabrizian April 16, 2026 18:57
@yifjiang yifjiang force-pushed the fix/prefill-sender-total-timeout branch from 891411d to b2b0a82 Compare April 16, 2026 22:44
@Tabrizian (Member) commented:

@yifjiang Is it possible to have a test that demonstrates the issues? Also, can you please provide a more concise summary of the issue and a minimal reproducer that reproduces the problem.

Also, in the PR description is it possible to only reference PRs that are in TRT-LLM main branch (do NOT reference PRs that are closed or not merged).


@yifjiang (Contributor, Author) replied:

@Tabrizian thanks for the review. Addressed in 88987cc:

  • Test: added test_context_transfer_times_out_on_stuck_sender in tests/unittest/others/test_kv_cache_transceiver.py. It enqueues a ctx transfer without starting a matching receiver, configures kv_transfer_timeout_ms=500 ms, and asserts that check_context_transfer_status returns the request in error_requests, sets state to DISAGG_TRANS_ERROR, and evicts the entry. This is a direct unit-level reproduction of the stuck-sender path that the C++ change fixes.
  • Description: rewritten to be concise — summary, root cause, and a minimal reproducer (Qwen3-0.6B 1P1D, concurrency 64, ISL 8000). Kept only the reference to [TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path #6348 (merged) and dropped all references to unmerged / closed PRs.
  • Dead link: removed the reference to the external rc11-crash-and-hang-proof.md — the reproducer is now inlined in the PR description so no external artifact is needed.


@Tabrizian Tabrizian left a comment


Is it possible to include the client command in the reproducing instructions too? Thank you.

Comment on lines +3755 to +3763:

```python
            elif request.is_disagg_context_transmission_state:
                # Request is still transferring KV cache to the decode worker.
                # Do NOT terminate — the C++ CacheTransceiver still holds a raw
                # pointer to this request in mSenderFutures. Terminating now
                # would free the LlmRequest and create a dangling pointer,
                # causing use-after-free (request ID reads as 0, transfer start
                # reads as epoch). Let the transfer complete or time out via
                # kv_transfer_timeout_ms in checkContextTransferStatus.
                pass
```
@Tabrizian (Member) commented:

The LlmRequest pointer is safe. It is kept alive in AsyncTransferManager until transfer is completed.

@yifjiang yifjiang Apr 17, 2026


Reviewing my earlier reply — I was sloppy. Let me split the answer by code path because your statement holds on the line you anchored on but not universally.

For this specific line (elif request.is_disagg_context_transmission_state: pass inside the force_terminate_for_partial_reuse branch of _handle_responses): you are correct. The pre-fix code would call _terminate_request → _do_terminate_request → resource_manager.free_resources(request). None of that touches async_transfer_manager; AsyncTransferManager._requests_in_transfer still holds the Python reference. So the LlmRequest itself is not destroyed. No classic LlmRequest-UAF on this path.

What the guard still prevents is some form of corruption that empirically fires under load (free(): invalid next size through LRUEvictionPolicy::claimBlock, heap metadata detected as invalid at a later unrelated free()). I previously claimed this was specifically "KV-cache-block corruption" — I retract that claim. I don't have ASan or similar verification for the actual mechanism; I was constructing a plausible story. Could be something else entirely. What I can say is that eliminating this termination reliably eliminates the crash under the overload workload.

For Path B (_check_disagg_ctx_cache_transfer_status, the py_kv_transfer_timed_out retry-cancel loop): the invariant "AsyncTransferManager keeps it alive" is not applicable. Path B calls _end_transfer_and_maybe_terminate, which calls AsyncTransferManager.end_transfer, which is precisely the code that pops _requests_in_transfer — i.e. that path itself is the "transfer is completed" signal that drops AsyncTransferManager's hold. A "Request X not found in transfer manager" warning observed once in an instrumented run is the direct fingerprint of this premature pop. Surviving Python references after Path B (active_requests membership, locals) are not load-bearing. So on Path B, LlmRequest lifetime after the pop is not guaranteed.

For Path A gen side (_handle_responses → py_kv_transfer_timed_out → cancel_request + _handle_errors, gen request in DISAGG_GENERATION_TRANS_IN_PROGRESS): gen requests never enter AsyncTransferManager._requests_in_transfer at all (only ctx requests go through async_transfer_manager.start_transfer). The only Python references are active_requests and locals. _handle_errors explicitly removes from active_requests then calls _terminate_request. No AsyncTransferManager holder exists to save us here — this guard is the only thing standing between _handle_errors and a classic UAF on mRequesterFutures.

I've pushed 568454fbd that updates the inline comments to reflect this split (and admits the force_terminate_for_partial_reuse crash mechanism is uncorroborated) rather than the blanket "dangling pointer / UAF" framing that was there before.

@Tabrizian (Member) commented:

This is not true, store_blocks_for_reuse has a pin argument to make sure the blocks are not deallocated even after the request is terminated.

@yifjiang (Contributor, Author) replied:

You're right — I hadn't accounted for the store_blocks_for_reuse(request, True) pin set at AsyncTransferManager.start_transfer (py_executor.py:207). I've confirmed:

  • start_transfer calls kv_cache_manager.store_blocks_for_reuse(request, True) → blocks are pinned.
  • _terminate_request → free_resources(request) calls remove_sequence(..., pin_on_release=False). Per your note, that does not drop the pin set by store_blocks_for_reuse, so the blocks remain live through a Python-side termination.
  • unpin_blocks_by_id is invoked only from AsyncTransferManager.end_transfer (py_executor.py:241), i.e. the transfer-completion signal.

So for the force_terminate_for_partial_reuse path (the line you anchored on), both safety nets apply: AsyncTransferManager holds the LlmRequest Python ref, and the pin keeps the KV blocks live. My earlier "KV-block corruption from free_resources" story is ruled out by the pin. I withdraw that framing.

That leaves me more confused, not less, about why the crash actually fires on that path. Empirically, with enable_partial_reuse_for_disagg=true on rc11 (no guard), 37× storeContextBlocks: Can not find sequence warnings and free(): invalid next size (fast) through LRUEvictionPolicy::claimBlock → __libc_free reproduce in ~45–65 s at conc 48, ISL 8000. With the is_disagg_context_transmission_state guard active, that crash disappears and the storeContextBlocks count drops to 1. Something about running _terminate_request on an in-transmission request is breaking the KV cache manager's invariants in a way that later surfaces during claimBlock — but with both LlmRequest and pinned blocks accounted for, I can't offer a mechanism. My best current guess is that the storeContextBlocks: Can not find sequence warnings are a symptom of remove_sequence running with a sequence the partial-reuse path expected to later storeContextBlocks for, leaving a dangling block-id / sequence-id mapping that claimBlock later trips on — but that's still speculation.

Do you have a theory on what's actually being corrupted? If not, I'd like to run this under MALLOC_CHECK_=3 or ASan to get a writer stack on the corruption site rather than leaving it as "empirical correlation, mechanism TBD." The fix is straightforward; I just don't want to ship a comment that keeps hand-waving about the cause.

For the other two paths the pin argument doesn't change the per-path verification I posted: _check_disagg_ctx_cache_transfer_status Path B explicitly calls end_transfer which unpins the blocks (py_executor.py:241) and pops _requests_in_transfer; _handle_responses Path A gen side has no AsyncTransferManager/pin path at all (receiver-side transfers go through request_and_receive_async, not async_transfer_manager.start_transfer). So those guards are still preventing real, nameable hazards. I'll update the code comment at 3791 to reflect your correction on the pin.

@yifjiang (Contributor, Author) replied:

Follow-up — got the forensic mechanism. Ran the same rc11 pre-fix image under MALLOC_PERTURB_=85 / MALLOC_CHECK_=3 with the same workload (conc 48, ISL 8000, 120 s) and caught the UAF directly:

[TensorRT-LLM][ERROR] Error occurred during context transfer for
  request 6148914691236517205: std::future_error: Broken promise

6148914691236517205 == 0x5555555555555555 — the glibc poison fill written into freed memory. That log line comes from checkContextTransferStatus's catch block doing LOG_ERROR("... %ld ...", it->first->mRequestId). Request IDs are monotonic 64-bit integers, so 0x55..55 is not a legal ID; the mRequestId field must have been read from a freed slot. Direct mechanistic proof the LlmRequest was freed out from under mSenderFutures.
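The identification of the logged ID with the poison fill can be checked directly; eight `0x55` bytes are exactly the value in the log line, and 85 is the `MALLOC_PERTURB_` byte:

```cpp
// Quick arithmetic check of the claim above: the logged request ID
// 6148914691236517205 is exactly eight 0x55 bytes — the fill pattern
// glibc writes into freed memory under MALLOC_PERTURB_=85 (85 == 0x55) —
// so the value must have been read from a freed slot.
#include <cstdint>

constexpr std::uint64_t kLoggedId = 6148914691236517205ULL;
static_assert(kLoggedId == 0x5555555555555555ULL, "eight poison bytes");
static_assert((kLoggedId & 0xFF) == 85, "each byte is the MALLOC_PERTURB_ value 85");
```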

Walking through why, given your pin correction:

  • AsyncTransferManager.start_transfer pins via store_blocks_for_reuse(request, True) — blocks stay live through _terminate_request (your correction, confirmed).
  • _terminate_request alone does not free the LlmRequest — AsyncTransferManager's strong ref survives. So the force_terminate_for_partial_reuse path in isolation does not produce the UAF; it only produces the storeContextBlocks: Can not find sequence warnings (observed 37x-102x) via remove_sequence.
  • The free happens on Path B in _check_disagg_ctx_cache_transfer_status. Path B calls _end_transfer_and_maybe_terminate, which calls AsyncTransferManager.end_transfer, which pops _requests_in_transfer — the drop you were pointing at as "until transfer is completed" is baked into this path's cancel sequence. After the pop, the iteration-local request is the only remaining strong ref; when the loop moves on, the pybind shared_ptr hits zero and the C++ LlmRequest is destroyed. The mSenderFutures entry's raw pointer is now dangling.
  • Under MALLOC_PERTURB_, the freed region is filled with 0x55; the catch-block read captures that pattern. Without MALLOC_PERTURB_ (original rc11 run), the same region gets reused by a later allocation; a subsequent C++ write (setState(kDISAGG_TRANS_ERROR) at 0x06 into what is now another object's mid-word) corrupts heap metadata, and the next unrelated free() in LRUEvictionPolicy::claimBlock aborts with invalid next size.

So the corrected mechanism for this PR's fix is:

  • Path B guard (in _check_disagg_ctx_cache_transfer_status): forensically confirmed to prevent a classic LlmRequest-UAF. This is the path that drops the AsyncTransferManager ref.
  • Path A gen-side guard (in _handle_responses): classic UAF by construction — gen requests never enter AsyncTransferManager, so _handle_errors is a direct free. No forensic run targeted gen-side separately in this investigation, but the argument is airtight from the code.
  • force_terminate_for_partial_reuse guard (the line you anchored on): neither the LlmRequest nor the pinned blocks are freed by this path alone. Skipping it still matters because in the real workload the same request is later picked up by Path B once py_kv_transfer_timed_out fires, and Path B is where the confirmed free happens. Whether this path alone can crash independent of Path B is not directly isolated.

Caveats on the heap-debug run:

  • The MALLOC_CHECK_=3 abort on invalid next size did not trigger in the 12-min window — a pure read of a poisoned slot doesn't flag the check. The poison read is the proof, not an abort.
  • An ASan rebuild would pin the first write through the dangling pointer with its stack. That's a source rebuild, so out of scope for this image-only investigation but a clean follow-up if more forensics are wanted.

Pushed 5779e5c53df42104b063a342f98a832407fec82d updating the inline comments to reflect this — the Path B guard comment now cites the 0x55..55 evidence; the Path A guard comment splits ctx vs gen accurately; the force_terminate_for_partial_reuse guard comment keeps your pin correction and frames the composite-with-Path-B story without overclaiming. The C++ catch-block comments get the same correction.


error_request_ids = []
# 20x the configured timeout is a generous wall-clock budget for CI jitter.
deadline = time.monotonic() + (kv_transfer_timeout_ms * 20) / 1000.0
Member


We already have a timeout at the py_executor level; this test doesn't exercise the py_executor-level changes, which could be why that timeout is not triggered.

Contributor Author

@yifjiang yifjiang Apr 17, 2026


Correct — this test exercises the C++ check_context_transfer_status API directly, bypassing py_executor. The py_executor-level _check_kv_transfer_timeout → cancel_request → _handle_errors path is intentionally not in the loop here; the point of the test is to verify the C++ kv_transfer_timeout_ms enforcement in isolation.

Why the C++-layer enforcement still matters even though py_executor already has a timeout: the py_executor-level path (_check_kv_transfer_timeout fires py_kv_transfer_timed_out → either _handle_responses Path A or _check_disagg_ctx_cache_transfer_status Path B) calls cancel_request + _handle_errors / _end_transfer_and_maybe_terminate on requests whose C++ transfer is still in flight. Under load, that reliably produces free(): invalid next size on ctx side, and creates a classic UAF window on gen side (gen requests have no AsyncTransferManager holder). See the split analysis on the py_executor.py thread above for per-path specifics.

The Python guards in this PR skip that py-executor cleanup for in-transmission requests and defer to the C++ deadline, which evicts the entry from mSenderFutures / mRequesterFutures before any Python-side free_resources can run. That's the enforcement path the test is directly exercising.
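To make the division of labor concrete, here is a minimal Python sketch of the C++-side eviction the test exercises. The function name and the dict-of-entries shape are illustrative stand-ins for `mSenderFutures` and `checkContextTransferStatus`, not the real API:

```python
def check_context_transfer_status(sender_futures, now_ms, timeout_ms=60_000):
    """Sketch of the C++ deadline enforcement: evict entries whose total
    transfer time exceeds kv_transfer_timeout_ms and report them back to
    Python (analogous to errorRequestIds), so Python-side cleanup only
    ever runs on requests the C++ layer has already released."""
    error_request_ids = []
    for req_id in list(sender_futures):
        entry = sender_futures[req_id]
        if entry["done"]:
            del sender_futures[req_id]          # completed normally
        elif now_ms - entry["start_ms"] >= timeout_ms:
            error_request_ids.append(req_id)    # would set kDISAGG_TRANS_ERROR
            del sender_futures[req_id]          # evicted before Python frees anything
    return error_request_ids
```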

I can add a heavier py_executor-level integration test too if you'd like — it would need a full LLM + 1P1D fixture. Let me know if that's worth it vs. the C++-API unit test.

@yifjiang
Contributor Author

@Tabrizian client command added to the reproducer section of the PR description — full loadgen script (asyncio, concurrency 64, ISL ~8000, OSL ~200, 180s load + 5 recovery probes) plus a single-shot curl sanity check. It talks to the OpenAI-compatible /v1/chat/completions endpoint on the frontend.

I also updated the "what actually breaks" paragraph to reflect the mechanism you pointed out in py_executor.py:3762 — the LlmRequest stays alive via AsyncTransferManager; the corruption is at the KV-cache-block level (free_resources returns blocks to LRUEvictionPolicy's free queue while the NIXL sender is still writing GPU bytes into them). Full reply on that thread too.

@yifjiang
Contributor Author

yifjiang commented Apr 17, 2026

Adding empirical backing for the two inline threads above. Numbers below, mechanism analysis further down. Update (2026-04-17): forensic confirmation via MALLOC_PERTURB_ at the bottom of this comment.

Setup — reproducible from a fresh TRT-LLM build (no prebuilt image assumed) with a small observability patch that instruments counters on the two Python paths in question. Deploy 1P1D disaggregated serving (one prefill + one decode worker) with:

  • Model: Qwen/Qwen3-0.6B
  • Backend: NIXL over TCP (UCX_TLS=tcp,cuda_copy,self)
  • cache_transceiver_config.kv_transfer_timeout_ms = 60000
  • cache_transceiver_config.kv_transfer_sender_future_timeout_ms = 10000
  • Both workers: free_gpu_memory_fraction: 0.3
  • Prefill: disable_overlap_scheduler: true
  • Decode: disable_overlap_scheduler: false

Workload — asyncio client hitting the frontend's /v1/chat/completions: concurrency 48, ISL ~Normal(8000, 2000) floored at 100, OSL ~Normal(200, 50) floored at 50, temperature=0, non-streaming, 120 s sustained load, then 5 recovery probes.

Run A — baseline (no PR #13056 fix), observability only. Counts how often each Python path operates on an in-transmission request:

| Event | Prefill | Decode |
| --- | --- | --- |
| `_check_disagg_ctx_cache_transfer_status` Path B — `cancel_request` returned True on a ctx request in `DISAGG_CONTEXT_TRANS_IN_PROGRESS`, then `_end_transfer_and_maybe_terminate` ran | 23 | 0 |
| `_handle_responses` Path A — `py_kv_transfer_timed_out` would have triggered `cancel_request` + `_handle_errors` on a gen request in `DISAGG_GENERATION_TRANS_IN_PROGRESS` | — | 12 |
| `CacheSender::cancelRequest` returning True on an in-transmission ctx request | 23 / 23 | — |
| "Request X not found in transfer manager" warning | 1 | 0 |

The 23/23 on CacheSender::cancelRequest shows this isn't a corner case — every ctx request whose Python-level timeout fires while C++ is still driving its transfer gets cancel_request == True (the request is still in mReadyResponses), and the pre-fix code proceeds into _handle_errors / _end_transfer_and_maybe_terminate. The single "not found" warning is the direct fingerprint that Path B did pop _requests_in_transfer on a request that C++ still considered in flight.

Run B — same setup and workload, with the Python guards in this PR applied (both Path A and Path B skip termination when the request is in a transmission state):

| Event | Prefill | Decode |
| --- | --- | --- |
| Termination skipped by the new guard (per firing, not unique requests) | 4322 | 12 |
| Heap crash (`free(): invalid next size`) | none | none |
| C++ `kv_transfer_timeout_ms` eviction fires (ctx side) | 63 | 0 |
| C++ `kv_transfer_timeout_ms` eviction fires (gen side) | — | 0 |

Recovery probes pass after sustained load: 0 / 5

So:

  • Prefill is healed by the Python guard + ctx-side C++ deadline. 63 stuck entries evicted cleanly, KV blocks unpinned via _end_transfer_and_maybe_terminate, no crash, no hang.
  • Decode still hangs even with the Python guard — the 12 would-be terminations are correctly skipped, but without a checkGenTransferStatus deadline, stuck receivers accumulate in mRequesterFutures, the gen KV pool exhausts, and the server doesn't recover. That's why daf70d403 adds the gen-side C++ deadline and b9450529b hoists the check above the toCompleteIdSet readiness gate (so entries that are never "ready" still get their deadline checked).

Mechanism — per-path

  • Path B guard (_check_disagg_ctx_cache_transfer_status retry-cancel): this path calls _end_transfer_and_maybe_terminate → AsyncTransferManager.end_transfer, which is literally the code that pops _requests_in_transfer. The "AsyncTransferManager keeps it alive" invariant does not apply here — this path is the completion signal that drops the hold. After the pop, surviving Python references (active_requests, loop locals) are transient; the pybind shared_ptr refcount drops to zero, the C++ LlmRequest is destroyed, and mSenderFutures's raw pointer is stale. Forensic confirmation in the follow-up run below.

  • Path A gen-side guard (_handle_responses → py_kv_transfer_timed_out for gen): gen requests never enter AsyncTransferManager. _handle_errors explicitly removes the request from active_requests and calls _terminate_request. No holder exists to save the LlmRequest — this guard is the only thing preventing a classic UAF on mRequesterFutures on the gen side.

  • force_terminate_for_partial_reuse guard (the line @Tabrizian anchored on): _terminate_request does not touch async_transfer_manager, and store_blocks_for_reuse(request, True) pins the KV blocks (thanks to @Tabrizian for that correction). So neither the LlmRequest nor its pinned blocks are freed by this path alone. The observable impact is the 37×–102× storeContextBlocks: Can not find sequence warnings from remove_sequence. In the real workload the same request is later picked up by Path B once py_kv_transfer_timed_out fires, and Path B is where the forensically confirmed free occurs. Whether this path could crash independent of Path B via a KVCacheManager-state inconsistency has not been isolated.
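The guard logic both Python paths now share reduces to a small sketch. The function name and the dict-shaped request below are simplified stand-ins for the py_executor code, not the actual implementation:

```python
IN_TRANSMISSION_STATES = {
    "DISAGG_CONTEXT_TRANS_IN_PROGRESS",     # ctx request, prefill/sender side
    "DISAGG_GENERATION_TRANS_IN_PROGRESS",  # gen request, decode/receiver side
}

def handle_timed_out_request(request, cancel_request, handle_errors):
    """Guard sketch: while the request is in a transmission state, skip
    Python-side cancel+terminate and defer to the C++ deadline, which
    evicts the C++ entry first; only then is termination safe."""
    if request["state"] in IN_TRANSMISSION_STATES:
        return "deferred_to_cpp_deadline"
    cancel_request(request)
    handle_errors(request)
    return "terminated"
```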


Update (2026-04-17): forensic confirmation of the Path B UAF via MALLOC_PERTURB_

Re-ran the same pre-fix image, same workload, with heap-debug env on both workers:

MALLOC_CHECK_: "3"
MALLOC_PERTURB_: "85"        # fills freed memory with 0x55
GLIBC_TUNABLES: "glibc.malloc.check=3"
LIBC_FATAL_STDERR_: "1"

After ~60 s of load, the prefill worker logged:

[TensorRT-LLM][ERROR] Error occurred during context transfer
  for request 6148914691236517205: std::future_error: Broken promise

6148914691236517205 == 0x5555555555555555 — the poison fill pattern glibc writes into freed memory under MALLOC_PERTURB_=85. The log line comes from checkContextTransferStatus's catch block reading it->first->mRequestId off mSenderFutures. Request IDs are monotonic 64-bit integers from the executor; 0x55…55 is not a legal ID. The only way for mRequestId to read as 0x55…55 is if the memory the field was stored in has been freed and poisoned.

This is direct, mechanical proof that the LlmRequest was freed out from under mSenderFutures.
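The identification is checkable by arithmetic alone:

```python
# MALLOC_PERTURB_=85: glibc fills freed memory with the byte 0x55 (85 decimal).
poison_word = int.from_bytes(bytes([85]) * 8, "little")  # a freed 8-byte mRequestId

assert poison_word == 0x5555555555555555
assert poison_word == 6148914691236517205  # the "request id" in the log line
```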

Sequence:

  1. Python timeout fires for a request in DISAGG_CONTEXT_TRANS_IN_PROGRESS. Path B runs: cancel_request → True, _end_transfer_and_maybe_terminate → AsyncTransferManager.end_transfer pops _requests_in_transfer (drops the AsyncTransferManager ref), then _terminate_request → free_resources → remove_sequence (explaining the 102 storeContextBlocks: Can not find sequence warnings in the same run).
  2. The iteration-local request in Path B is the only surviving strong ref. When the loop moves on, pybind shared_ptr refcount hits zero; the C++ LlmRequest is destroyed.
  3. Glibc either reuses the slot or fills it with 0x55.
  4. Either via broken-promise resolution or the next tick iterating mSenderFutures, the C++ code dereferences the stale raw pointer. The catch-block log above is the smoking gun.
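The ownership bug and the shared_ptr fix can be modeled in a few lines of Python, with a weakref standing in for the raw pointer held by mSenderFutures and a plain reference standing in for the shared_ptr (stub class, illustrative only):

```python
import weakref

class LlmRequestStub:
    """Stand-in for the C++ LlmRequest."""
    def __init__(self, request_id):
        self.request_id = request_id

req = LlmRequestStub(42)
raw_like = weakref.ref(req)    # ~ raw LlmRequest* in pre-fix mSenderFutures
shared_like = req              # ~ shared_ptr<LlmRequest> after the fix

del req                        # Python-side holders drop their references
assert raw_like() is not None  # still alive: the map's strong ref keeps it

shared_like = None             # without the fix there is no strong ref here...
assert raw_like() is None      # ...and the "pointer" now targets freed memory
```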

Caveats:

  • MALLOC_CHECK_=3 did not SIGABRT during the heap-debug window. A pure read of poisoned bytes doesn't flag invalid next size; it only fires on metadata damage at a later unrelated free. The poison read is the proof, not the abort.
  • The subsequent request->setState(kDISAGG_TRANS_ERROR) writes 0x06 into the freed region's mState offset. Whether that write trips the check depends on which glibc bin the slot was assigned to — timing-sensitive. To force a deterministic abort at the write site, the next step would be an ASan rebuild, which is out of scope for this image-only investigation.
  • Decode worker stayed entirely quiet (0 of every signature). The crash mechanism in this image is exclusively prefill/sender side — consistent with mSenderFutures being the container holding the freed LlmRequest.

Pushed 5779e5c53 updating the inline code comments with this forensic evidence (Path B guard comment cites the 0x55..55 read; Path A guard splits ctx vs gen accurately; force_terminate_for_partial_reuse guard acknowledges @Tabrizian's pin correction and frames the composite-with-Path-B story without overclaiming).

yifjiang and others added 2 commits April 21, 2026 16:33
… + guard against dangling pointers

Two fixes for disaggregated prefill/decode KV cache transfer:

1. **C++ total timeout enforcement** (`cacheTransceiver.cpp`):
   `checkContextTransferStatus` retried stuck transfers indefinitely using
   only the per-retry `kv_transfer_sender_future_timeout_ms`. The total
   `kv_transfer_timeout_ms` was plumbed through config but never enforced.
   Now checks total elapsed time after each per-retry timeout and removes
   entries that exceed the limit. All pointer dereferences are avoided in
   timeout and exception paths since `mSenderFutures` holds raw
   `LlmRequest*` that may be dangling.

2. **Python guard for in-flight transfers** (`py_executor.py`):
   The `force_terminate_for_partial_reuse` path terminated requests
   unconditionally, including those in `DISAGG_CONTEXT_TRANS_IN_PROGRESS`
   state. This freed the `LlmRequest` while C++ `mSenderFutures` still
   held a raw pointer, causing heap corruption (`free(): invalid next
   size`). Added `is_disagg_context_transmission_state` check before the
   termination path — requests in active transfer are skipped and cleaned
   up when the transfer completes or times out.
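A Python sketch of the two-timeout relationship the C++ fix enforces — per-retry wait slices accumulated against the overall deadline. The function name is illustrative; `concurrent.futures` stands in for the `std::future` wait loop:

```python
import concurrent.futures
import time

def wait_with_total_deadline(future, per_retry_ms, total_ms):
    """Each wait slice is bounded by the per-retry timeout
    (kv_transfer_sender_future_timeout_ms); total elapsed time is checked
    against the overall deadline (kv_transfer_timeout_ms) after each slice."""
    start = time.monotonic()
    while True:
        try:
            future.result(timeout=per_retry_ms / 1000.0)
            return "done"
        except concurrent.futures.TimeoutError:
            if (time.monotonic() - start) * 1000.0 >= total_ms:
                return "timed_out"  # evict + report, instead of retrying forever
```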

Bug origin: PR NVIDIA#6348 (2025-09-27) moved the `is_disagg_context_transmission_state`
guard into an else branch, leaving the block-reuse path unprotected.

Tested on nscale B200 cluster with Qwen3-0.6B 1P1D disagg:
- Without fix: crash at ~65s (conc 48, ISL 8000, free_gpu_mem=0.3)
- With Python guard only: no crash, 1 storeContextBlocks warning (was 37)
- With both fixes: no crash, stuck transfers cleaned up by C++ timeout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…eanup

With the Python-side is_disagg_context_transmission_state guard, ALL
LlmRequest* pointers in mSenderFutures are guaranteed valid — the
guard prevents Python from terminating requests during active transfer.

This simplifies the C++ timeout and exception paths:
- Timeout: dereference request->mRequestId, setState(kDISAGG_TRANS_ERROR),
  add to errorRequestIds → Python calls end_transfer → unpin_blocks
- Exception (Broken promise): same — report as error for cleanup

Previously, timed-out entries were silently erased without reporting to
Python, leaving entries in _requests_in_transfer with pinned KV blocks
that were never unpinned, causing KV pool exhaustion after overload.

Removed the startTimeValid heuristic — it was checking the pointer
to determine if the pointer is safe, which is itself UB on a
dangling pointer. With the Python guard, this check is unnecessary.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yifjiang and others added 13 commits April 21, 2026 16:34
Address CodeRabbit review on NVIDIA#13056: the overall transfer deadline was
only loaded when !blockAll, so callers that omit atLeastRequestNum
could still hang forever on a stuck sender via future.get().

- Always load kvTransferTimeoutMs from mCacheTransceiverConfig.
- In the timeout branch, enforce the overall deadline regardless of
  blockAll; only skip-and-retry when per-iter timeout is set.
- In blockAll mode, fall through to future.get() only when the overall
  deadline has not yet been exceeded.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…eadline

Empirical testing (v5-diag run on Qwen3-0.6B 1P1D, conc 48, ISL 8000)
surfaced three remaining paths that still hit use-after-free or hang:

1. Python `_handle_responses` — the `py_kv_transfer_timed_out` branch
   (line 3643) calls `cancel_request` + `_handle_errors` + `continue`,
   bypassing the existing `is_disagg_context_transmission_state` guard.
   `cancel_request` on the sender only drains `mReadyResponses`; the raw
   LlmRequest* stays in `mSenderFutures`, so `_handle_errors` frees the
   request and the next `checkContextTransferStatus` dereferences a
   dangling pointer. Same hazard on the receiver side with
   `mRequesterFutures`.

2. Python `_check_disagg_ctx_cache_transfer_status` — the retry-cancel
   loop (line 3267) iterates `requests_in_transfer`, every entry of
   which is in `DISAGG_CONTEXT_TRANS_IN_PROGRESS` by construction.
   Cancelling + terminating always frees an in-transmission request,
   identical to Path 1. In v5-diag this fired 23 times per 120 s run.

   Both Python paths now skip cancel+terminate when the request is in
   transmission state and defer to the C++ kv_transfer_timeout_ms path,
   which evicts the entry and reports it via errorRequestIds / ERROR
   state — at which point termination is safe.

3. C++ `checkGenTransferStatus` (receiver side) — unlike
   `checkContextTransferStatus`, it never enforced
   `kv_transfer_timeout_ms`: the happy path called `future.get()`
   unconditionally, so a stuck receiver hung forever and its entry
   stayed pinned in `mRequesterFutures`. v5f fixed the prefill UAF but
   the decode-side hang persisted (10/104 succeeded, probes failed)
   because no deadline evicted the stuck receivers.

   Mirror the sender restructure: load `kv_transfer_timeout_ms` (always)
   and `kv_transfer_sender_future_timeout_ms` (polling mode only); on
   `wait_for` timeout, check elapsed against the overall deadline and
   evict with `kDISAGG_TRANS_ERROR` when exceeded. The existing
   `_check_cache_transfer_errors` surfaces the ERROR state back to
   Python, which then safely unwinds the request (the C++ entry was
   already erased, so there is no dangling pointer).

Additional correctness fix: `CacheReceiver::receiveAsync()` spawns the
thread that calls `setKvCacheTransferStart` asynchronously. Until that
thread runs, `getKvCacheTransferStart()` returns the default epoch time
point and the new gen-side deadline check would falsely evict fresh
entries. Set the transfer start synchronously in
`CacheTransceiver::requestAndReceiveAsync` before pushing the future.

Debug logging: mirror `respondAndSendAsync`'s insertion log in
`requestAndReceiveAsync`, and the per-entry/summary scan in
`checkGenTransferStatus` to match `checkContextTransferStatus`. These
make future stuck-transfer investigations tractable without rebuilding.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…ss gate

The elapsed-time check added in daf70d4 was nested inside the
`status == future_status::timeout` branch of `wait_for(...)`, which is
itself gated by `blockAll || toCompleteIdSet.find(...) != end()`. In
polling mode (blockAll == false), `toCompleteIdSet` is populated from
the initial `wait_for(0)` sweep (consensus across ranks) plus an
insert-order fallback that only runs when `atLeastRequestNum > 0`.

A stuck sender/receiver whose future never becomes ready is not in the
consensus set, and callers of `_check_disagg_gen_cache_transfer_status`
typically pass `at_least_num = 0` (see py_executor.py:2930, where
`need_check_one` requires ALL non-gen-first requests to be in
transmission — narrow). With `atLeastRequestNum = 0`, such entries
escape the deadline check entirely and pin KV blocks forever.

Empirical evidence on 1P1D Qwen3-0.6B, conc 48, ISL 8000, 120 s:
  - mRequesterFutures.size() stayed at 22 across 384 checkGenTransferStatus
    polls (blockAll=0, kvTransferTimeoutMs=60000, senderFutureTimeoutMs=10000
    confirmed via TLLM_LOG_DEBUG)
  - 0 decode-side total-timeout evictions vs 63 on prefill (prefill is
    masked by `_check_disagg_ctx_cache_transfer_status(1)` callers that
    rotate entries through `toCompleteIdSet`)
  - server did not recover after phase-1 sustained load — all 5 probes
    timed out

Hoist the deadline check above the readiness gate in both
`checkContextTransferStatus` and `checkGenTransferStatus`. A stuck
entry is now evicted at `kv_transfer_timeout_ms` regardless of whether
the future is ready or whether the entry is in `toCompleteIdSet`. The
inner fallback checks in the `wait_for(...) == timeout` branch are
removed since the outer check has already evicted any entry whose
deadline has passed.

Also add a ctx-side regression test that polls with
`at_least_request_num=0` — the default scheduler cadence that would have
exposed this bug.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…e evicts

Addresses the post-saturation "no-recovery" bug documented in the
gen-side ghost-UCX investigation: with PR NVIDIA#13056 commits 1-6 applied,
the UAF/crash is fixed but every probe after a saturation event
deterministically returns HTTP 500 at exactly `kv_transfer_timeout_ms`,
indefinitely. Py-spy traces show Python blocked in
check_gen_transfer_status waiting on at_least_num=1; the scheduler is
alive and requests are being accepted, but every new gen transfer
never becomes "ready."

Root cause: the timeout-eviction in `checkGenTransferStatus`
(`daf70d4034` / `b9450529b`) erases the `std::future<void>` from
`mRequesterFutures`, but the underlying task spawned via
`CacheReceiver::Impl::requestAndReceiveAsyncMultiThreads` is still
running — blocked inside `recvReadySignal` waiting on a peer that
will never respond. These zombie tasks hold NIXL/UCX per-request
state (TransferSession, buffer registrations). Over a saturation
window ~100+ zombies accumulate and pin the receive pool, so new
receives queue behind a frozen pool and every new transfer waits the
60s deadline. Empirical signatures: no crash, server responds, thread
and FD counts identical to a fresh worker (damage is inside C++
lifetime state), killing only the decode pod recovers fully.

Fix: extend `CacheReceiver::Impl::cancelRequest` to cover in-flight
receives, and call it from `checkGenTransferStatus`'s timeout-eviction
path.

dataTransceiver.cpp (CacheReceiver::Impl):
- Add a per-request cancel-flag registry
  (mInFlightCancelMutex + mInFlightCancelFlags) keyed by request id.
- Register a fresh std::atomic<bool> in
  requestAndReceiveAsyncMultiThreads before enqueueing; attach the
  shared_ptr to RequestAndPromise so it survives through the worker.
- Worker in request() passes the flag to requestSync; unregisters
  after requestSync returns (success or exception).
- requestSync takes the flag, checks it at entry, passes it to
  receiveReadySignal, and re-checks after the ready-signal wait
  returns.
- receiveReadySignal gains a per-request-cancel overload that plumbs
  the flag as the DataContext's transferTerminate — the existing
  polling loop in AgentConnectionManager::waitForNotification then
  observes the flag via ctx.getTransferTerminate().load() and exits
  early with isReady=false.
- cancelRequest, when the request is not found in mRequestsQueue,
  now flips the registered flag (previously it just logged
  "Cannot cancel"). Returns true for both queued and in-flight cases.
- ~Impl flips all per-request flags alongside setting mTerminate so
  process shutdown still unblocks recvReadySignal polling loops.
- Impl::receiveAsync (currently unused via CacheReceiver public API)
  switched to a lambda to disambiguate the new requestSync overload.

cacheTransceiver.cpp:
- checkGenTransferStatus timeout-eviction branch now calls
  mCacheReceiver->cancelRequest(*request) before erase, signaling the
  blocked task to abort via the per-request flag.
- checkContextTransferStatus timeout-eviction branch calls
  mCacheSender->cancelRequest(*request) symmetrically for defense in
  depth; sender zombies empirically unwind on peer teardown but
  clearing the bookkeeping is cheap and consistent.
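The registry pattern above can be modeled in Python, with threading.Event standing in for std::atomic<bool>; the class and method names are illustrative, not the real API:

```python
import threading

class InFlightCancelRegistry:
    """Sketch of the per-request cancel-flag registry
    (mInFlightCancelMutex + mInFlightCancelFlags, keyed by request id)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._flags = {}

    def register(self, req_id):
        """Called before enqueueing; the worker polls the returned flag."""
        with self._lock:
            return self._flags.setdefault(req_id, threading.Event())

    def cancel(self, req_id):
        """Flip the flag for an in-flight request; False if unknown."""
        with self._lock:
            flag = self._flags.get(req_id)
        if flag is None:
            return False
        flag.set()
        return True

    def unregister(self, req_id):
        """Called after the worker returns (success or exception)."""
        with self._lock:
            self._flags.pop(req_id, None)

    def cancel_all(self):
        """Shutdown path: unblock every waiter, like ~Impl flipping all flags."""
        with self._lock:
            for flag in self._flags.values():
                flag.set()
```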

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…e send honor perRequestCancel

v11b ([reqSync] instrumentation commit e61a5fa) empirically pinned
the post-saturation no-recovery wedge to CacheReceiver::Impl::sendRequestInfo:
pre-sendRequestInfo fires for a reqId and has no matching
post-sendRequestInfo for 60 s+; after the overall kv_transfer_timeout_ms
cancelRequest flips the per-request flag and logs "Flipped in-flight
cancel flag", but the blocking send call inside sendRequestInfo never
observes the flag because the polling loops below requestSync don't
read ctx.getTransferTerminate(). The drain worker therefore never
returns to the CV wait and every subsequent enqueue's notify_all() is
delivered to a thread not waiting on it. Queue backs up. Every new
transfer at the 60 s deadline hits Path A silent erase.

Fix: plumb perRequestCancel through the send path so the blocking
calls below can observe the flag.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (CacheReceiver::Impl):
- sendRequestInfo(LlmRequest) -> overload now delegating to the new
  sendRequestInfo(LlmRequest, std::atomic<bool> const& perRequestCancel)
  that threads perRequestCancel through:
    * checks perRequestCancel at the top of the per-counterpart
      for-loop (throws to abort further peer notifications once a
      cancel has fired),
    * passes it to the non-agent sendRequestInfo(connection, info,
      perRequestCancel) helper,
    * uses it as the DataContext's transferTerminate in the
      TransferSession constructor, so the downstream data-phase
      operations (receiveSync -> unformat -> per-connection send/recv
      -> AgentConnection::send) also honor the cancel flag.
- sendRequestInfo(Connection*, RequestInfo const&, perRequestCancel):
  plumbs the flag into each of the three DataContext constructions
  (kID_TAG / kINFO_SIZE_TAG / kINFO_TAG) used in the non-agent path.
- requestSync(..., perRequestCancel): calls the new two-arg
  sendRequestInfo.
- receiveReadySignal(session, perRequestCancel): adds a
  between-iteration perRequestCancel check in the per-peer for-loop
  so the multi-rank wait sequence bails out on cancel.

cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
(AgentConnection::send):
- Replaces the unbounded status->wait() with a polling loop of
  status->wait(100ms) that checks ctx.getTransferTerminate() between
  polls. If cancel fires mid-send, throws TLLM_THROW so requestSync
  unwinds via RequestSpecificException; the drain worker's existing
  catch(std::exception) path handles the throw, sets the promise, and
  resumes. TransferStatus exposes no abort primitive, so this leaks
  at most one in-flight xfer descriptor per cancel event — preferable
  to an unrecoverable wedge.
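The wait-loop change in miniature, with threading.Event.wait standing in for the TransferStatus wait (illustrative sketch, not the real API):

```python
import threading

def poll_wait_with_cancel(done, cancel, slice_s=0.1):
    """Bounded-slice wait: instead of blocking forever, wake every slice
    and check the per-request cancel flag; on cancel, raise so the
    caller's existing exception path can unwind the worker."""
    while not done.wait(timeout=slice_s):
        if cancel.is_set():
            raise RuntimeError("transfer cancelled mid-send")
```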

The v11b evidence showed the wedge sub-call is sendRequestInfo. This
commit addresses that path (via DataContext plumbing into the
underlying AgentConnection::send poll loop) and extends the same
pattern to receiveReadySignal (for iteration-level cancel observance)
and to the data-phase transfers that go through the TransferSession's
DataContext (receiveSync's unformat -> AgentConnection::send).

Still out of scope for this commit (per docs/pr-13056-option-c.md):
per-sub-call cancel inside notifySyncMessage (NIXL backend internals
not reachable from this layer), and any sender-side in-flight flag
parity (CacheSender::Impl still lacks the flag mechanism entirely).

See docs/pr-13056-no-recovery-ghost-ucx.md for the saturation repro
and docs/pr-13056-option-c.md for the v11b evidence + decision tree.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…Message pre-notify check

Closes the two TBDs left open after Option C (commit 7381830):

1. Sender-side in-flight cancel flag (symmetric to CacheReceiver::Impl)
-----------------------------------------------------------------------
Pre-fix, CacheSender::cancelRequest could only abort requests still in
the mReadyResponses queue that were NOT the mCurrentRequest. In-flight
sends (request is mCurrentRequest, worker is inside sendSync's data
transfer) just logged "Cannot cancel request" and returned false —
leaving the send to block on the peer indefinitely when that peer's
receive side has already been evicted.

Add a per-request cancel-flag registry to CacheSender::Impl:
- mInFlightCancelMutex + mInFlightCancelFlags map (reqId -> shared_ptr<atomic<bool>>)
- getOrCreateInFlightCancelFlag(reqId) helper
- sendAsync registers a flag at enqueue time
- recvRequestInfo constructs the TransferSession's DataContext using
  the registered flag (via *cancelFlag) instead of mTerminate, so the
  downstream data-phase send/recv operations — specifically
  AgentConnection::send's poll-wait loop added in 7381830 —
  observe the flag via ctx.getTransferTerminate()
- cancelRequest flips the flag on the previously "Cannot cancel" miss
  path (returns true now)
- removeResponse and the sendResponse cancel-cleanup branch erase the
  flag after send completion / cancel
- terminate() flips every registered flag before joining the response
  worker so shutdown still interrupts in-flight sends (the DataContext
  references per-request flags, not mTerminate)

2. notifySyncMessage pre-notify cancellability
----------------------------------------------
notifySyncMessage drops into the NIXL backend and is not interruptible
from this layer. Two narrow windows where a cancel fired but the
notify still ran — fix both:

- AgentConnection::send: added a cancel check between the poll-wait's
  TransferState::kSUCCESS return and the notifySyncMessage call. If
  the cancel fires in that gap, throw instead of sending a notify the
  peer will never act on.
- AgentConnection::sendRequestAndBufferInfo: added an optional
  `std::atomic<bool> const* perRequestCancel = nullptr` parameter. When
  non-null and flipped true, throws before notifySyncMessage. Caller
  in CacheReceiver::Impl::sendRequestInfo passes &perRequestCancel.

Limitations still standing after this commit
----------------------------------------------
- notifySyncMessage itself is not interruptible mid-call. The pre-notify
  check only catches the gap between wait() returning and notify
  starting. If notifySyncMessage blocks internally on a stuck peer
  (unclear whether NIXL's genNotif is fully synchronous), the cancel
  can't unwind it from this layer. Full remediation would need a new
  cancellation API on BaseTransferAgent.
- TransferStatus exposes no abort primitive; AgentConnection::send's
  mid-poll throw leaks at most one in-flight xfer descriptor per
  cancel event (acceptable vs the alternative of a drain-worker wedge).
- Race window between sendAsync (sender registers flag) and
  recvRequestInfo on the sender side — if the control message from the
  peer arrives before sendAsync registers, getOrCreateInFlightCancelFlag
  creates the flag at recvRequestInfo time, so lifecycle is intact, but
  a cancel fired in that early window uses the getOrCreate flag as the
  authority — confirmed consistent.
- No sender-side equivalent of the catch(...) drain-loop hardening
  from commit f63172f; CacheSender's response() worker has a top-level
  catch(std::exception) but no catch(...). If a non-std throw escapes
  from sendSync, the sender worker thread dies and all queued sends
  stall. Separate commit.
- No regression test for sender-side in-flight cancel (same constraint
  as the Option C commit: requires a stalled-peer harness).

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…t catch(...) to sender worker

Two pre-e2e cleanups identified during code trace:

1. Sender-side mInFlightCancelFlags erase was unsafe
--------------------------------------------------------
In commit 3736d11 I erased the per-request cancel flag from
mInFlightCancelFlags inside removeResponse() and inside the
sendResponse cancel-cleanup else-branch. The session's DataContext
(built in recvRequestInfo) holds a reference (not ownership) to the
atomic managed by the shared_ptr in that map. Tracing:

- Success path: sendAndRemoveResponse calls release(id) which destroys
  the session, THEN sendResponse calls removeResponse. With the erase
  in removeResponse, the flag was dropped AFTER the session — safe.
- Cancel-throw path: sendSync throws (via the Option C cancel check),
  sendAndRemoveResponse's catch skips release(id), session remains in
  mRequestToSession. removeResponse still runs → flag erased while the
  session (with its DataContext ref) is still live. Dangling reference.
- mCancelledRequests cancel path (isReady=false): release not called;
  inline erase of the flag similarly dangles session's DataContext.

Nothing in the current code dereferences the session's DataContext
after sendSync returns, so this is a latent UAF rather than an
observed crash. But it's fragile and would trip the first time anyone
adds post-send session iteration.

Move the flag erase out of removeResponse and out of sendResponse's
cancel-cleanup into release(id), where the session is actually
destroyed. Cancel paths that skip release() intentionally leak the
flag shared_ptr until ~Impl / terminate(), matching the existing
session-leak behavior on those same error paths. release() orders
session-destroy before flag-drop so DataContext reference is never
dangling.

2. CacheSender::Impl::response() lacked catch(...)
----------------------------------------------------
Symmetric to the catch(...) added to CacheReceiver::Impl::request()
in commit f63172f. response() is noexcept, so a non-std::exception
escape (integer throw, custom type throw from NIXL/UCX backends) would
call std::terminate. Adding catch(...) resolves pending promises with
current_exception() so callers see a failure, then exits the worker
cleanly. Worker thread still exits after the catch — sender is dead
for this process, but fail-closed via promises is strictly better
than terminate().

No behavior change on the happy path. No new primitives. Trace is now
clean for both receiver and sender sides of the cancel plumbing.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
… diagnostic [buf] logs

Trace through the receiver-side sendRequestInfo path identified a
more likely wedge site than notifySyncMessage: the CV wait inside
BaseTransBufferManager::assignBufferIndex (baseTransBuffer.cpp:228
pre-fix) parks on `mBuffersCV.wait(lk, predicate)` with no timeout
and no awareness of the per-request cancel flag. Under saturation
with N in-flight wedged transfers all holding their receive-buffer
slots, every new call to sendRequestInfo parks here and never
observes perRequestCancel — exactly fitting v11b's [reqSync]
pre-sendRequestInfo-with-no-post-sendRequestInfo signature.

Unlike notifySyncMessage (a thin wrap over NIXL's genNotif with no
timeout API), the buffer-pool CV does have a natural wait_for
variant we can plumb.

Changes
-------

baseTransBuffer.h: extend public `assignBufferIndexForRecv` and
private `assignBufferIndex` with three optional parameters:
  - std::atomic<bool> const* perRequestCancel = nullptr
  - int64_t waitSliceMs = 100
  - std::optional<uint64_t> requestIdForLog = std::nullopt
Default values preserve existing behavior for unchanged callers;
the new capability is opt-in.

baseTransBuffer.cpp: replace the unconditional `mBuffersCV.wait`
with a bounded loop that:
  - re-checks the predicate every waitSliceMs via wait_for;
  - checks perRequestCancel between slices and TLLM_THROWs on cancel,
    so the drain worker can unwind via its existing catch(std::exception);
  - logs [buf] markers — ENTER (WAIT) when the pool is exhausted,
    STILL_WAITING every ~5 s of parking (50 × 100 ms), ACQUIRED on
    success, CANCEL on cancel — each tagged with the reqId so they
    cross-reference cleanly with [reqSync] and [drain] log trails.
The hot path (pool not exhausted) takes the no-wait fast-path with
zero log spam.

dataTransceiver.cpp (CacheReceiver::Impl::sendRequestInfo, agent
path): pass &perRequestCancel and the llmRequest's mRequestId into
each assignBufferIndexForRecv call. Wrap the loop with [reqSync]
pre-/post-assignBufferIndexForRecv markers so the v12 diagnostic
trail can distinguish pool-exhaustion wedges from deeper NIXL
notifySyncMessage wedges without ambiguity.

Expected behavior after this change on the v11b workload
--------------------------------------------------------

If the wedge was pool exhaustion (the most-likely hypothesis):
  - [buf] assignBufferIndex WAIT fires whenever saturation hits
  - Some mix of ACQUIRED (pool drains, wait wakes up) and CANCEL
    (kv_transfer_timeout_ms-driven cancelRequest unblocks the wait)
  - No more unmatched [reqSync] pre-sendRequestInfo; all requests
    either finish or are cancelled with a captured latency and
    pool-occupancy snapshot.

If the wedge was actually deeper (notifySyncMessage / NIXL genNotif):
  - [reqSync] pre-assignBufferIndexForRecv / post-assignBufferIndexForRecv
    pair cleanly (no [buf] WAIT → STILL_WAITING pattern)
  - [reqSync] post-sendRequestInfo is still missing for wedged reqIds
  - Diagnosis shifts to limitation 1 (notify backend) with the next
    fix direction being a supervisor thread or a NIXL API addition

Either outcome narrows the remaining problem surface.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…aking pool slots on non-happy exits

v12 image (PR HEAD with assignBufferIndexForRecv bounded-wait + cancel
plumbing) reproduced saturation with Phase 1 = 11/104 and probes 0/5.
Decode log showed the mechanism unambiguously:
  - 12 reqIds reached [reqSync] post-assignBufferIndexForRecv
    (buffer slot acquired from the size-1 pool)
  - 11 completed via the normal receiveSync → formatter →
    freeBufferIndexForRecv chain (slot released)
  - 1 (reqId=8584867770318859) took the `(not-ready or cancel after
    ready)` early-return branch because prefill's own ctx-timeout
    fired and signalled isReady=0 — receiveSync was skipped entirely,
    so the formatter never released the slot. That single leak
    permanently wedged the size-1 pool, and every subsequent
    assignBufferIndexForRecv parked on the CV wait forever.

Root cause pattern
------------------
CacheReceiver::Impl::requestSync has at least seven exit paths (normal,
early-cancel, not-ready, cancel-after-ready, cancel from inside
sendRequestInfo, cancel from inside receiveReadySignal, exception
escape). Pre-fix, only one of them — normal → receiveSync →
formatter — released the buffer index. Every other path leaked
it. v12 empirically proved the (not-ready) branch; the remaining paths
were latent but reachable under different workloads.

Fix: RAII holder
----------------
New BufferIndexHolder in baseTransBuffer.h — move-only scoped owner
for a buffer index, releases on destructor (including stack unwind
from exceptions). detach() hands off ownership when the formatter
inside receiveSync takes over on the happy path.

Changes:

cpp/tensorrt_llm/batch_manager/baseTransBuffer.h:
  - Added BufferIndexHolder class (inline, move-only, noexcept
    release).

cpp/tensorrt_llm/batch_manager/baseTransBuffer.cpp:
  - BufferIndexHolder::release() out-of-line so it can call
    freeBufferIndexForRecv/Send on the manager. Emits
    [buf] AUTO_RELEASE on each RAII release so post-saturation
    traces can attribute how often non-happy-path exits fire vs
    formatter releases.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (CacheReceiver::Impl):
  - requestSync now acquires each receive-buffer index directly
    (moving the acquisition up from sendRequestInfo), wraps each
    acquisition in a BufferIndexHolder, and passes the raw
    cacheBufferIds vector into sendRequestInfo as a new parameter.
  - sendRequestInfo(llmRequest, perRequestCancel) becomes a thin
    delegator to a new three-arg overload sendRequestInfo(
    llmRequest, perRequestCancel, preAcquiredCacheBufferIds). The
    three-arg variant skips the internal acquire when the caller
    provides ids (requestSync path), matching pre-RAII behavior
    when the vector is empty (legacy / public-wrapper path).
  - After receiveSync returns on the happy path, detach() every
    holder so the formatter's existing freeBufferIndexForRecv call
    doesn't double-release.
  - All other exits from requestSync (early-cancel, not-ready,
    cancel-after-ready, throws from sendRequestInfo /
    receiveReadySignal) unwind through ~BufferIndexHolder and
    release the slot. No branch-specific release calls needed.

Expected v13 behavior
---------------------
  - Phase 1 completion count substantially higher than 11/104 —
    pool no longer wedges after the first (not-ready) exit.
  - All 5 recovery probes return HTTP 200 within single-digit
    seconds (assuming limitations 1-3 don't trip).
  - [buf] AUTO_RELEASE fires on (not-ready) and cancel branches;
    count correlates with early-exit traffic.
  - [buf] WAIT → ACQUIRED pairs stay bounded; no permanent
    concurrency=C/C saturation.
  - The verification script in docs/pr-13056-raii-buffer-release.md
    (every post-assignBufferIndexForRecv reqId has a matching
    [reqSync] EXIT) runs clean.

Still not addressed here
------------------------
  - Sender-side parity (assignBufferIndexForSend release discipline
    on CacheSender's non-happy paths) — separate commit if observed
    to leak.
  - Pool size = 1 is architecturally small for conc=48 workloads.
    Throughput will be bounded by per-request buffer dwell time
    even without leaks. Separate tuning discussion.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…qId-tagged [buf] logs

Extends the receiver-side RAII pattern (30f9cb2) to the two sender-side
cache formatters so any non-happy exit between assignBufferIndexForSend and
the explicit freeBufferIndexForSend auto-releases the send-pool slot.
v13 saturation testing showed the same wedge class on the send side:
one early exit permanently wedged the size-1 pool and every subsequent
transfer parked forever on the CV wait.

Happy-path timing is unchanged: sendAllBuffers returning is still treated
as "transport done, slot safe to free immediately" (the pre-PR invariant
at cacheFormatter.cpp:568 / mlaCacheFormatter.cpp:383). The holder is
detached before the existing free call, so the destructor is a no-op on
the happy path; only non-happy exits fire AUTO_RELEASE.

Also closes three instrumentation gaps for diagnosing a wedged pool:

1. assignBufferIndexForSend gains an optional requestIdForLog parameter
   (parity with assignBufferIndexForRecv), forwarded into the internal
   bounded CV-wait so send-pool WAIT / ACQUIRED / STILL_WAITING / CANCEL
   lines can be attributed to a specific reqId.

2. BufferIndexHolder carries requestIdForLog, so the AUTO_RELEASE log on
   stack unwind names the responsible request. Without it, the v13-class
   bug surfaces as "something leaked" with no way to cross-reference the
   [reqSync] trail. Receiver-side call site now passes reqIdForLog too.

3. New SEND_ACQUIRED / SEND_RELEASE DEBUG markers at both formatter sites
   give per-request send-pool turnover without relying on AUTO_RELEASE
   firing only on error paths.

rnnCacheFormatter has the same bug class at its send/recv acquire sites
but is off the NIXL ghost-UCX path that drove this PR — flagged as a
follow-up rather than expanding scope here.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
… FUTURE_WAIT/JOIN diagnostics

Follow-up to the sender-side RAII commit. Addresses three residual risks
raised during review:

1) **Race between sendAllBuffers return and the explicit free.**
   The previous detach() + freeBufferIndexForSend() pair left a narrow
   window where a throw between the two calls (e.g. from setTime's
   std::map::at()) would trip the destructor fallback and emit a
   misleading AUTO_RELEASE for what is really a correctly-handled
   exception. Replaced with a single noexcept BufferIndexHolder::release()
   (now public) that frees and disarms atomically at the API level, and
   moved the call to immediately after sendAllBuffers returns. Post-
   release throws no longer leak OR pollute the AUTO_RELEASE leak signal.

   BufferIndexHolder now has three API modes:
   - release(): happy-path, silent free + disarm.
   - detach(): hand off ownership to a downstream owner (recv side uses
     this when receiveSync's formatter takes the release responsibility).
   - destructor fallback: any exit path that left mHeld=true — logs
     AUTO_RELEASE so the non-happy exit is visible in [buf] diagnostics.

2) **No cancel on send-pool CV wait.**
   assignBufferIndexForSend now accepts (perRequestCancel, waitSliceMs,
   requestIdForLog) — full parity with assignBufferIndexForRecv. Both
   formatter sites pass &session.getDataContext().getTransferTerminate()
   so a cancel-in-flight while the sender is parked on a saturated pool
   unblocks via the existing bounded-wait loop in assignBufferIndex.

3) **Wrong-hypothesis risk: wedge inside sendAllBuffers, not at formatter.**
   Added [send] diagnostic markers in the send loops so a wedge in the
   transport layer is distinguishable from a formatter-level short-circuit:
   - [send] FUTURE_WAIT reqId=X batch=K concurrency=C remain=R  (before future.get())
   - [send] FUTURE_JOIN reqId=X batch=K elapsedMs=E              (after future.get())
   - [send] START/PRE/DONE reqId=X connIdx=N ...                 (per send DEBUG)
   These pair with the existing [buf] WAIT/ACQUIRED/AUTO_RELEASE trail
   so a v13-class wedge can be attributed to (reqId, batch, connIdx) even
   if the cause turns out to be transport-layer rather than formatter-
   level. FUTURE_WAIT/JOIN are WARN-level (one per concurrency batch);
   per-send markers stay DEBUG to avoid hot-path spam.

Touches both cacheFormatter.cpp and mlaCacheFormatter.cpp symmetrically.
rnnCacheFormatter still uses the simple no-arg assignBufferIndexForSend();
listed as a follow-up (off the NIXL ghost-UCX path).

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…eiver — close UAF, remove Path A/B guards, unwedge KV block leak

The KV-block leak described in docs/pr-13056-next-step-asan-and-leak.md
Track B (decode pod wedged at 0/213 requests scheduled after the gen
transfer deadline fired) is a direct consequence of the Path A/B
guards in py_executor.py deferring Python-side _terminate_request
while a request is DISAGG_*_TRANS_IN_PROGRESS. Those guards exist to
avoid a forensically-confirmed UAF on CacheTransceiver's raw
LlmRequest* tracking (mRequestId observed as 0x5555555555555555 under
MALLOC_PERTURB_=85 — glibc's free-fill pattern read through a dangling
pointer). Whenever the C++ deadline eviction failed to reach an entry
— e.g. the receiver-side orphan class where Python tracks IN_PROGRESS
but the C++ mRequesterFutures has no future to iterate — the guards
kept skipping cleanup forever and the request pinned its KV blocks.

Instead of working around the UAF with a grace-period or
reconciliation backstop, this commit fixes the UAF at the source:

- BaseCacheTransceiver's respondAndSendAsync / requestAndReceiveSync /
  requestAndReceiveAsync / cancelRequest now take
  std::shared_ptr<LlmRequest> (not a raw pointer). Pybind already
  exposes LlmRequest via std::shared_ptr, so the binding auto-converts
  at the boundary with no Python-side changes.

- CacheTransceiver::mSenderFutures and mRequesterFutures now hold
  std::shared_ptr<LlmRequest> in the pair. Entries are still erased on
  the normal completion / deadline / exception paths, so the strong
  reference is dropped at the end of the transfer lifecycle — no leak.

- CacheReceiver::Impl::RequestAndPromise::mRequest and
  CacheSender::Impl::Response::mRequest now hold
  std::shared_ptr<LlmRequest>. The receive-async std::async lambda
  captures the shared_ptr by value, so Python-side _terminate_request
  can no longer drop the LlmRequest out from under an in-flight
  worker's dereferences. Same for the send-async path's Response.

- With the UAF closed, the Path A guard in _handle_responses (gen-side
  timeout-and-skip-if-IN_PROGRESS) and the Path B guard in
  _check_disagg_ctx_cache_transfer_status (ctx-side ditto) are no
  longer needed. Both are removed: when py_kv_transfer_timed_out
  fires, Python calls cancel_request + _handle_errors immediately,
  recovering the KV blocks even when the C++ tracker has orphaned the
  entry.

Lifecycle after this commit, for a non-orphan request:
  1. Python: request_and_receive_async → C++ enqueues, async worker
     spins up.
  2. Worker finishes (completion / peer ack / cancel) → future
     resolves.
  3. checkGenTransferStatus sees future ready → completeEntry sets
     state=COMPLETE, erases entry (drops its shared_ptr).
  4. Python observes state=COMPLETE → runs through normal completion
     path (no _terminate_request yet; request proceeds to
     generation).
  5. Eventually Python's active_requests drops the request → pybind
     shared_ptr refcount=0 → LlmRequest destroyed.

For an orphan (C++ lost the future but Python state still IN_PROGRESS):
  1. Python _check_kv_transfer_timeout sets py_kv_transfer_timed_out.
  2. _handle_responses sees the flag; with the guard removed, proceeds
     to cancel_request + _handle_errors.
  3. _handle_errors → _terminate_request → free_resources →
     remove_sequence → KV blocks released.
  4. LlmRequest still alive while any remaining C++ reference holds
     it (async worker if any); eventually worker drops, request
     destroyed. No UAF.

Also reverts 80474ce's grace-period fallback — it was a fragile
workaround (magic 2x multiplier, wall-clock vs monotonic skew, residual
UAF window) for the same underlying issue the shared_ptr refactor
cleanly fixes at the source. The [reqFut] instrumentation from
5904b9e stays (still useful for investigating the orphan formation
path itself; the leak is now decoupled from the orphan).

Scope: touches BaseCacheTransceiver's API boundary but since every
caller already holds std::shared_ptr<LlmRequest> (RequestVector is
std::vector<std::shared_ptr<LlmRequest>>; pybind auto-converts),
existing call sites drop their .get() and compile unchanged.
No behavior change for Python tests — they pass Python LlmRequest
objects which pybind converts to shared_ptr.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…ion escapes

The v9 in-flight cancellation primitive (267e813) assumes the
requestAndReceiveAsyncMultiThreads drain worker is alive and pulling
from mRequestsQueue. In the post-saturation "no recovery" repro the
worker has either exited or is no longer consuming the queue, so:

- New requests enqueue but are never dequeued.
- The in-flight cancel flag is registered before enqueue, so the flag
  map has entries for every pending request.
- When the 60s deadline fires, cancelRequest finds the request still
  in the queue (Path A), erases it silently, and returns true. The
  per-request-flag flip path (Path B) is never reached because Path
  A always succeeds first.

The fix must target the worker lifecycle. Two additions here:

1. Add a catch(...) block to CacheReceiver::Impl::request()'s drain
   loop. Today the only handlers are
   RequestSpecificException / std::exception, so any non-std throw
   from NIXL / UCX / C++ ABI escapes the loop, terminates the worker
   thread, and strands every queued receive. The catch-all sets the
   caller's promise (so the future resolves with an exception rather
   than hanging) and continues the loop.

2. Add TLLM_LOG_WARNING markers at ENTER / pre-requestSync /
   post-requestSync / EXIT / catch-all so the next image rebuild can
   distinguish between:
     - worker never launched (no ENTER)
     - worker died on non-std escape (EXIT with mTerminate=0 + a
       matching catch-all log)
     - worker deadlocked inside requestSync (pre-requestSync with no
       matching post-requestSync for >kv_transfer_timeout_ms)
     - worker healthy (many pre/post pairs, no EXIT)
   The decision tree in the ghost-UCX investigation doc maps each
   pattern to the next diagnostic step.

The catch(...) is both the diagnostic that would name the culprit
and a likely fix — keeping the drain loop robust to any escape has
no downside and closes the lifecycle hole that the per-request
cancel mechanism implicitly depends on.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
@yifjiang force-pushed the fix/prefill-sender-total-timeout branch from 33262c5 to 369d944 — April 22, 2026 23:12
@yifjiang changed the title from "[https://nvbugs/5969206][fix] Enforce kv_transfer_timeout_ms on prefill sender + guard against dangling pointers" to "[NVBug 5969206][draft] Subset of PR 13056 cherry-picked onto 4e69c14f73 (rc11 base) — 15 C++/Python fixes" — Apr 22, 2026
@yifjiang changed the title to "[NVBug 5969206][draft] Subset rebased onto 4e69c14f73 (v1.3.0rc11 tag) — 15 C++/Python fixes" — Apr 22, 2026
@yifjiang changed the title to "[https://nvbugs/5969206][fix] Disagg transceiver: shared_ptr<LlmRequest>, RAII buffer holders, kv_transfer_timeout_ms enforcement" — Apr 22, 2026
@yifjiang changed the title to "[https://nvbugspro.nvidia.com/bug/6104831][fix] Disagg transceiver: shared_ptr<LlmRequest>, RAII buffer holders, kv_transfer_timeout_ms enforcement" — Apr 22, 2026
…opped during d7844fc cherry-pick conflict resolution on the 4e69c14 base

PR 13056 commit d7844fc preserves an is_disagg_context_complete_state
branch that was introduced earlier in the PR's base. On the rc11 base
(4e69c14) that guard is not present, so the conflict resolution
during cherry-pick dropped it — which leaves _handle_responses vulnerable
to double-terminating a COMPLETE-state request that _end_transfer_and_maybe_terminate
left in active_requests (end_transfer returned False). Re-add the guard
to match PR 13056's intent. nvbug/5961736.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
@yifjiang force-pushed the fix/prefill-sender-total-timeout branch 2 times, most recently from 8217bcb to c9777c4 — April 22, 2026 23:33
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 1, 2026
…ifestations broaden sig #7 to a CacheSender::Impl::* bug class

run9 (rc11 + our fixes + UCX) and run10 (PR NVIDIA#13056 + UCX) both used a
gdb capture loop on the dataTransResp thread to pin down the exact mutex
behind sig #7. The wedge changed character in both runs:

- run9 ctx mpi worker SIGSEGVs at iter 92 of the burst inside
  _PyObject_GenericGetAttrWithDict (Python C-API), downstream of the
  sig #1 fix path firing cleanly.
- run10 ctx mpi worker SIGSEGVs synchronously inside
  CacheSender::Impl::handleAsyncSend(AsyncSendResource&) on the very
  first sanity-probe request - no concurrency, no burst, no
  cancellation. Zero UCX/NIXL frames in the stack; root cause is most
  plausibly a null shared_ptr<LlmRequest> deref at line 594 since
  PR NVIDIA#13056's ownership model is necessary but does not enforce
  Response::mRequest non-null at producer or consumer.

Both findings are cleanly TRT-LLM-internal and broaden sig #7 from a
single pthread_mutex_lock deadlock to a class of CacheSender::Impl::*
bugs with at least four observed manifestations across two transports
and three fix bundles.

Touched sections: Status block, caveat block, signature table (sig #7
row), sig #7 section variants/fix/status, Phase 12 cross-reference,
new Phase 13 section, Next Steps items 7 / 7a / 7b.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 3, 2026
…le structure

Split the 3193-line single-file report into 13 reader-friendly files
under docs/investigations/nvbug-6104831-disagg-permanent-wedge/:

- README.md: top-level orientation, status, navigation, suggested
  reading paths.
- 01-background.md: architecture diagrams, request lifecycle
  walkthrough, state machine, cancellation flow.
- 02-failure-signatures.md: the seven failure signatures (#1–#7).
- 03-defect-class-stack.md: NEW - the L1-L8 defect-class layering
  that frames the four-approach comparison.
- 04-reproduction.md: how to reproduce locally, load shape, run
  archive index.
- 05-investigation-timeline.md: chronological story (Phases 0-14).
- 06-fix-approaches/: side-by-side comparison plus one file per
  approach (A chained, B PR NVIDIA#13056, C PR NVIDIA#13495, D combo).
- 07-architectural-reflections.md: seven invariants + retrospective.
- 08-next-steps-and-pr-map.md: PR map, outstanding work, deadline
  effort estimate, run archive index.

The L1-L8 defect class stack is the new framework introduced here.
It re-frames the seven signatures as the visible faces of eight
underlying invariant gaps and is used to explain why the combo
approach (D) is the only stack that recovers cleanly across the
test matrix. The five Mermaid diagrams from the request walkthrough
move into 01-background.md as the prerequisite reading for the rest
of the investigation.

The single-file form was hard to navigate at this size; the split
follows the user's reading-path needs (cold reader, single-PR
reviewer, fix-path picker, retrospective reader) called out in the
README.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 3, 2026
Adds 00-tldr.md as the front door to the investigation. Targets a
10-minute read with two inline Mermaid figures:

- The disagg architecture + cancellation flow (where the bugs live).
- The L1-L8 → combo PR NVIDIA#13713 fix mapping (which piece closes which
  layer).

Written under the assumption that PR NVIDIA#13713 (combo: PR NVIDIA#13056 + PR
NVIDIA#13495 + eval-order fix + Python idempotency guards) is the chosen
final solution. Doesn't compare approaches A/B/C — that comparison
lives in 06-fix-approaches/README.md for readers who want depth on the
landing decision.

Section structure:
1. What is broken (the wedge symptom).
2. Where the bugs live (architecture figure).
3. Root cause: eight invariant gaps (L1-L8 table).
4. The fix (combo figure + per-piece breakdown).
5. Does it work (empirical recovery table).
6. Why this took 8 days to find (cascade pattern).
7. What is left to do.
8. Where to go from here (reader paths to deeper files).

README.md updated to put 00-tldr.md at the top of the navigation
table and add a 10-minute reading path entry.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 3, 2026
… colored fix components

Two improvements to 00-tldr.md based on review feedback:

1. Add a 7-row table after the architecture section that gives a
   one-line "where it lives" + "symptom" for each signature #1–#7.
   The TL;DR previously mentioned the seven signatures by number
   without describing them, forcing the reader to jump to
   02-failure-signatures.md to make sense of references like
   "sig #4" or "sig #7".

2. Color-code the four fix components in the combo diagram:
   - PR NVIDIA#13056: blue (lifetime + cancel-flag + RAII)
   - PR NVIDIA#13495: orange (NIXL TransferStatus::release)
   - eval-order fix: green
   - Python idempotency guards: purple

   Each L1-L8 layer node is tinted with the lighter shade of
   whichever fix closes it, so the "closes" arrows are reinforced
   by colour-matching. Makes the fix-to-layer mapping visually
   scannable at a glance.

Note added underneath the diagram explaining the colour scheme.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>