[https://nvbugspro.nvidia.com/bug/6104831][fix] Disagg transceiver: shared_ptr<LlmRequest>, RAII buffer holders, kv_transfer_timeout_ms enforcement #13056
Conversation
… side Cherry-pick of PR NVIDIA#13056. Adds total elapsed time check against kv_transfer_timeout_ms in checkContextTransferStatus. Without this, stuck transfers retry the per-iteration timeout forever, holding KV blocks indefinitely and exhausting the cache pool. Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
📝 Walkthrough
Two files are modified to enhance KV cache transfer timeout handling and request state management. The C++ cache transceiver adds timeout configuration retrieval, debug logging, and timeout-based cleanup logic for pending transfers. The Python executor adds conditional state handling to defer termination of disaggregated context transmission requests.
Actionable comments posted: 3
📒 Files selected for processing (2)
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
- tensorrt_llm/_torch/pyexecutor/py_executor.py
```cpp
if (!blockAll && mCacheTransceiverConfig.has_value())
{
    senderFutureTimeoutMs = mCacheTransceiverConfig->getKvTransferSenderFutureTimeoutMs();
    kvTransferTimeoutMs = mCacheTransceiverConfig->getKvTransferTimeoutMs();
}
```
Load kvTransferTimeoutMs even when blockAll is true.
blockAll should disable the per-iteration poll timeout, not the overall transfer deadline. With the current guard, any caller that omits atLeastRequestNum still falls through to a blocking future.get() with no kv_transfer_timeout_ms enforcement, so a stuck sender can still hang forever.
Suggested fix:

```diff
- if (!blockAll && mCacheTransceiverConfig.has_value())
+ if (mCacheTransceiverConfig.has_value())
  {
-     senderFutureTimeoutMs = mCacheTransceiverConfig->getKvTransferSenderFutureTimeoutMs();
      kvTransferTimeoutMs = mCacheTransceiverConfig->getKvTransferTimeoutMs();
+     if (!blockAll)
+     {
+         senderFutureTimeoutMs = mCacheTransceiverConfig->getKvTransferSenderFutureTimeoutMs();
+     }
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp` around lines 493 - 497,
The code currently only loads senderFutureTimeoutMs and kvTransferTimeoutMs when
(!blockAll), which causes kvTransferTimeoutMs to be unset when blockAll is true
and allows future.get() to block indefinitely; change the logic so
kvTransferTimeoutMs is always read from mCacheTransceiverConfig (call
getKvTransferTimeoutMs() and assign to kvTransferTimeoutMs) regardless of
blockAll, but keep the conditional skip of getKvTransferSenderFutureTimeoutMs()
/ senderFutureTimeoutMs when blockAll is true; update the block that references
mCacheTransceiverConfig, senderFutureTimeoutMs, kvTransferTimeoutMs, and
blockAll to reflect this separation.
```cpp
// Log mSenderFutures state for diagnosing dangling pointer issues.
// Each entry's pointer address and request ID are logged so we can detect
// when a pointer's underlying memory is freed (reqId changes to 0).
if (!mSenderFutures.empty())
{
    TLLM_LOG_DEBUG(
        "checkContextTransferStatus: mSenderFutures.size()=%zu, blockAll=%d, "
        "kvTransferTimeoutMs=%d",
        mSenderFutures.size(), blockAll ? 1 : 0, kvTransferTimeoutMs.value_or(-1));
    for (size_t i = 0; i < mSenderFutures.size(); ++i)
    {
        auto& [req, fut] = mSenderFutures[i];
        auto startTs = req->getKvCacheTransferStart().time_since_epoch().count();
        TLLM_LOG_DEBUG(" [%zu] ptr=%p reqId=%ld startTs=%ld", i, static_cast<void const*>(req), req->mRequestId,
            static_cast<long>(startTs));
    }
}
```
Don't dereference req in the dangling-pointer diagnostics.
This loop reads req->mRequestId and req->getKvCacheTransferStart() from mSenderFutures, but that container is exactly where the raw pointer may already be stale. In the failure mode this PR is fixing, the debug path itself becomes the use-after-free.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp` around lines 499 - 513,
The diagnostic loop in checkContextTransferStatus is dereferencing
potentially-stale raw pointers from mSenderFutures (it reads req->mRequestId and
req->getKvCacheTransferStart()), causing use-after-free; change the logging so
it does not dereference req (only log the pointer address and index) or else
store safe copies of req->mRequestId and the start timestamp at the time entries
are inserted into mSenderFutures so the diagnostics read those cached values
instead of accessing req. Update the loop using the mSenderFutures entry
variables (auto& [req,fut]) to avoid any req->... calls, or refactor insertion
code to capture mRequestId and getKvCacheTransferStart into the stored struct so
getKvCacheTransferStart and mRequestId are never read from the raw pointer
during logging.
```cpp
if (shouldTimeout)
{
    // IMPORTANT: Do NOT call request->setState() or
    // mCacheSender->cancelRequest(*request) here. The LlmRequest*
    // in mSenderFutures is a raw pointer with no ownership — the
    // Python layer may have already freed the request. Dereferencing
    // a dangling pointer causes heap corruption (free(): invalid
    // next size). Just erase the stale entry from mSenderFutures.
    // The Python layer handles state transitions via
    // _end_transfer_and_maybe_terminate when it processes the
    // error/completion from check_context_transfer_status.
    if (startTimeValid)
    {
        TLLM_LOG_WARNING(
            "Context KV cache transfer timed out: elapsed %ld ms > limit %d ms. "
            "Removing entry from mSenderFutures (ptr=%p).",
            elapsedMs, kvTransferTimeoutMs.value(), static_cast<void const*>(request));
    }
    else
    {
        TLLM_LOG_WARNING(
            "Removing stale entry from mSenderFutures: transfer start time is "
            "uninitialized (request pointer %p may be dangling).",
            static_cast<void const*>(request));
    }
    it = mSenderFutures.erase(it);
    continue;
}
```
Don't erase timed-out or failed entries without surfacing the request ID back to Python.
After these erase(it) calls, check_context_transfer_status() no longer returns the request ID, but tensorrt_llm/_torch/pyexecutor/py_executor.py:3244-3278 only calls _end_transfer_and_maybe_terminate() for IDs present in finished_requests + error_requests. That means the request can stay pinned in AsyncTransferManager indefinitely unless a later cancel_request() happens to succeed.
Also applies to: 658-662
@yifjiang Is it possible to have a test that demonstrates the issues? Can you also provide a more concise summary of the issue and a minimal reproducer? Finally, in the PR description please reference only PRs that are in the TRT-LLM main branch (do NOT reference PRs that are closed or unmerged).
This link doesn't work: https://github.com/yifjiang/dynamo-gb200-test/blob/main/docs/rc11-crash-and-hang-proof.md |
@Tabrizian thanks for the review. Addressed in 88987cc:
```python
elif request.is_disagg_context_transmission_state:
    # Request is still transferring KV cache to the decode worker.
    # Do NOT terminate — the C++ CacheTransceiver still holds a raw
    # pointer to this request in mSenderFutures. Terminating now
    # would free the LlmRequest and create a dangling pointer,
    # causing use-after-free (request ID reads as 0, transfer start
    # reads as epoch). Let the transfer complete or time out via
    # kv_transfer_timeout_ms in checkContextTransferStatus.
    pass
```
The LlmRequest pointer is safe. It is kept alive in AsyncTransferManager until transfer is completed.
Reviewing my earlier reply — I was sloppy. Let me split the answer by code path because your statement holds on the line you anchored on but not universally.
For this specific line (elif request.is_disagg_context_transmission_state: pass inside the force_terminate_for_partial_reuse branch of _handle_responses): you are correct. The pre-fix code would call _terminate_request → _do_terminate_request → resource_manager.free_resources(request). None of that touches async_transfer_manager; AsyncTransferManager._requests_in_transfer still holds the Python reference. So the LlmRequest itself is not destroyed. No classic LlmRequest-UAF on this path.
What the guard still prevents is some form of corruption that empirically fires under load (free(): invalid next size through LRUEvictionPolicy::claimBlock, heap metadata detected as invalid at a later unrelated free()). I previously claimed this was specifically "KV-cache-block corruption" — I retract that claim. I don't have ASan or similar verification for the actual mechanism; I was constructing a plausible story. Could be something else entirely. What I can say is that eliminating this termination reliably eliminates the crash under the overload workload.
For Path B (_check_disagg_ctx_cache_transfer_status, the py_kv_transfer_timed_out retry-cancel loop): the invariant "AsyncTransferManager keeps it alive" is not applicable. Path B calls _end_transfer_and_maybe_terminate, which calls AsyncTransferManager.end_transfer, which is precisely the code that pops _requests_in_transfer — i.e. that path itself is the "transfer is completed" signal that drops AsyncTransferManager's hold. A "Request X not found in transfer manager" warning observed once in an instrumented run is the direct fingerprint of this premature pop. Surviving Python references after Path B (active_requests membership, locals) are not load-bearing. So on Path B, LlmRequest lifetime after the pop is not guaranteed.
For Path A gen side (_handle_responses → py_kv_transfer_timed_out → cancel_request + _handle_errors, gen request in DISAGG_GENERATION_TRANS_IN_PROGRESS): gen requests never enter AsyncTransferManager._requests_in_transfer at all (only ctx requests go through async_transfer_manager.start_transfer). The only Python references are active_requests and locals. _handle_errors explicitly removes from active_requests then calls _terminate_request. No AsyncTransferManager holder exists to save us here — this guard is the only thing standing between _handle_errors and a classic UAF on mRequesterFutures.
I've pushed 568454fbd that updates the inline comments to reflect this split (and admits the force_terminate_for_partial_reuse crash mechanism is uncorroborated) rather than the blanket "dangling pointer / UAF" framing that was there before.
This is not true, store_blocks_for_reuse has a pin argument to make sure the blocks are not deallocated even after the request is terminated.
You're right — I hadn't accounted for the store_blocks_for_reuse(request, True) pin set at AsyncTransferManager.start_transfer (py_executor.py:207). I've confirmed:
- `start_transfer` calls `kv_cache_manager.store_blocks_for_reuse(request, True)` → blocks are pinned.
- `_terminate_request` → `free_resources(request)` calls `remove_sequence(..., pin_on_release=False)`. Per your note, that does not drop the pin set by `store_blocks_for_reuse`, so the blocks remain live through a Python-side termination.
- `unpin_blocks_by_id` is invoked only from `AsyncTransferManager.end_transfer` (py_executor.py:241), i.e. the transfer-completion signal.
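A toy model of the pin semantics confirmed above. Class and method names mirror the ones under discussion but this is an illustrative sketch, not TRT-LLM's actual KV cache manager API:

```python
class KVBlockPool:
    """Toy model: pinned blocks survive sequence removal until unpinned."""

    def __init__(self):
        self.pinned = set()
        self.freed = set()

    def store_blocks_for_reuse(self, block_ids, pin):
        # Modeled after store_blocks_for_reuse(request, True) at start_transfer.
        if pin:
            self.pinned.update(block_ids)

    def remove_sequence(self, block_ids, pin_on_release=False):
        # Termination path: releases blocks, but pinned ones stay live.
        for b in block_ids:
            if b not in self.pinned:
                self.freed.add(b)

    def unpin_blocks_by_id(self, block_ids):
        # end_transfer path: only now can the blocks actually be reclaimed.
        self.pinned.difference_update(block_ids)
```

In this model, a `_terminate_request`-style `remove_sequence` before `unpin_blocks_by_id` frees nothing, which is the safety net the pin provides.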
So for the force_terminate_for_partial_reuse path (the line you anchored on), both safety nets apply: AsyncTransferManager holds the LlmRequest Python ref, and the pin keeps the KV blocks live. My earlier "KV-block corruption from free_resources" story is ruled out by the pin. I withdraw that framing.
That leaves me more confused, not less, about why the crash actually fires on that path. Empirically, with enable_partial_reuse_for_disagg=true on rc11 (no guard), 37× storeContextBlocks: Can not find sequence warnings and free(): invalid next size (fast) through LRUEvictionPolicy::claimBlock → __libc_free reproduce in ~45–65 s at conc 48, ISL 8000. With the is_disagg_context_transmission_state guard active, that crash disappears and the storeContextBlocks count drops to 1. Something about running _terminate_request on an in-transmission request is breaking the KV cache manager's invariants in a way that later surfaces during claimBlock — but with both LlmRequest and pinned blocks accounted for, I can't offer a mechanism. My best current guess is that the storeContextBlocks: Can not find sequence warnings are a symptom of remove_sequence running with a sequence the partial-reuse path expected to later storeContextBlocks for, leaving a dangling block-id / sequence-id mapping that claimBlock later trips on — but that's still speculation.
Do you have a theory on what's actually being corrupted? If not, I'd like to run this under MALLOC_CHECK_=3 or ASan to get a writer stack on the corruption site rather than leaving it as "empirical correlation, mechanism TBD." The fix is straightforward; I just don't want to ship a comment that keeps hand-waving about the cause.
For the other two paths the pin argument doesn't change the per-path verification I posted: _check_disagg_ctx_cache_transfer_status Path B explicitly calls end_transfer which unpins the blocks (py_executor.py:241) and pops _requests_in_transfer; _handle_responses Path A gen side has no AsyncTransferManager/pin path at all (receiver-side transfers go through request_and_receive_async, not async_transfer_manager.start_transfer). So those guards are still preventing real, nameable hazards. I'll update the code comment at 3791 to reflect your correction on the pin.
Follow-up — got the forensic mechanism. Ran the same rc11 pre-fix image under MALLOC_PERTURB_=85 / MALLOC_CHECK_=3 with the same workload (conc 48, ISL 8000, 120 s) and caught the UAF directly:
```
[TensorRT-LLM][ERROR] Error occurred during context transfer for
request 6148914691236517205: std::future_error: Broken promise
```
6148914691236517205 == 0x5555555555555555 — the glibc poison fill written into freed memory. That log line comes from checkContextTransferStatus's catch block doing LOG_ERROR("... %ld ...", it->first->mRequestId). Request IDs are monotonic 64-bit integers, so 0x55..55 is not a legal ID; the mRequestId field must have been read from a freed slot. Direct mechanistic proof the LlmRequest was freed out from under mSenderFutures.
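The arithmetic behind that identification is easy to verify: `MALLOC_PERTURB_=85` makes glibc fill freed memory with the byte 0x55, and eight such bytes read back as a 64-bit integer give exactly the logged "request ID":

```python
# glibc's free-perturb fills freed memory with the MALLOC_PERTURB_ byte
# (85 decimal == 0x55). Model the freed 8-byte mRequestId slot:
poison = bytes([0x55]) * 8

# Reading that slot back as a little-endian 64-bit integer:
value = int.from_bytes(poison, "little")
assert value == 0x5555555555555555
assert value == 6148914691236517205  # the "request ID" in the error log
```

Any legitimate monotonic request ID is tiny by comparison, so this value can only come from reading poisoned (freed) memory.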
Walking through why, given your pin correction:
- `AsyncTransferManager.start_transfer` pins via `store_blocks_for_reuse(request, True)`; blocks stay live through `_terminate_request` (your correction, confirmed).
- `_terminate_request` alone does not free the `LlmRequest`; AsyncTransferManager's strong ref survives. So the `force_terminate_for_partial_reuse` path in isolation does not produce the UAF; it only produces the `storeContextBlocks: Can not find sequence` warnings (observed 37x-102x) via `remove_sequence`.
- The free happens on Path B in `_check_disagg_ctx_cache_transfer_status`. Path B calls `_end_transfer_and_maybe_terminate`, which calls `AsyncTransferManager.end_transfer`, which pops `_requests_in_transfer` — the drop you were pointing at as "until transfer is completed" is baked into this path's cancel sequence. After the pop, the iteration-local `request` is the only remaining strong ref; when the loop moves on, the pybind `shared_ptr` hits zero and the C++ `LlmRequest` is destroyed. The `mSenderFutures` entry's raw pointer is now dangling.
- Under `MALLOC_PERTURB_`, the freed region is filled with 0x55; the catch-block read captures that pattern. Without `MALLOC_PERTURB_` (original rc11 run), the same region gets reused by a later allocation; a subsequent C++ write (`setState(kDISAGG_TRANS_ERROR)` at offset 0x06 into what is now another object's mid-word) corrupts heap metadata, and the next unrelated `free()` in `LRUEvictionPolicy::claimBlock` aborts with `invalid next size`.
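The reference-count choreography in the Path B step can be demonstrated with CPython refcounting directly. The class is a stand-in for the pybind-wrapped `LlmRequest`; the dict stands in for `AsyncTransferManager._requests_in_transfer`:

```python
import weakref

class LlmRequestStandIn:
    """Stand-in for the pybind-wrapped LlmRequest; only refcounts matter here."""

requests_in_transfer = {}                 # manager's strong-ref container
req = LlmRequestStandIn()
requests_in_transfer[1] = req             # AsyncTransferManager holds it
probe = weakref.ref(req)

# Path B: end_transfer pops the manager's reference...
req = requests_in_transfer.pop(1)
assert probe() is not None                # iteration-local `req` keeps it alive

# ...then the loop moves on and the local goes away:
del req
assert probe() is None                    # last ref dropped; the C++ dtor would run
```

In the real system the surviving raw pointer in `mSenderFutures` now points at the destroyed object, which is exactly the dangling read the 0x55 poison exposed.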
So the corrected mechanism for this PR's fix is:
- Path B guard (in `_check_disagg_ctx_cache_transfer_status`): forensically confirmed to prevent a classic LlmRequest-UAF. This is the path that drops the AsyncTransferManager ref.
- Path A gen-side guard (in `_handle_responses`): classic UAF by construction; gen requests never enter AsyncTransferManager, so `_handle_errors` is a direct free. No forensic run targeted gen-side separately in this investigation, but the argument is airtight from the code.
- `force_terminate_for_partial_reuse` guard (the line you anchored on): neither the LlmRequest nor the pinned blocks are freed by this path alone. Skipping it still matters because in the real workload the same request is later picked up by Path B once `py_kv_transfer_timed_out` fires, and Path B is where the confirmed free happens. Whether this path alone can crash independent of Path B is not directly isolated.
Caveats on the heap-debug run:
- The `MALLOC_CHECK_=3` abort on `invalid next size` did not trigger in the 12-min window; a pure read of a poisoned slot doesn't flag the check. The poison read is the proof, not an abort.
- An ASan rebuild would pin the first write through the dangling pointer with its stack. That's a source rebuild, so out of scope for this image-only investigation but a clean follow-up if more forensics are wanted.
Pushed 5779e5c53df42104b063a342f98a832407fec82d updating the inline comments to reflect this — the Path B guard comment now cites the 0x55..55 evidence; the Path A guard comment splits ctx vs gen accurately; the force_terminate_for_partial_reuse guard comment keeps your pin correction and frames the composite-with-Path-B story without overclaiming. The same walk-back is applied to the C++ catch-block comments.
```python
error_request_ids = []
# 20x the configured timeout is a generous wall-clock budget for CI jitter.
deadline = time.monotonic() + (kv_transfer_timeout_ms * 20) / 1000.0
```
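In test form, the loop around that deadline typically looks like the sketch below. `check_status` is a hypothetical wrapper over the bound `check_context_transfer_status` call; the shape is illustrative, not the test's literal code:

```python
import time

def wait_for_eviction(check_status, kv_transfer_timeout_ms):
    """Poll until the C++ layer reports the timed-out request, or give up.

    check_status: callable returning (finished_ids, error_ids); a
    hypothetical stand-in for the bound check_context_transfer_status.
    """
    # 20x the configured timeout: generous wall-clock budget for CI jitter.
    deadline = time.monotonic() + (kv_transfer_timeout_ms * 20) / 1000.0
    error_request_ids = []
    while time.monotonic() < deadline and not error_request_ids:
        _, error_request_ids = check_status()
        time.sleep(0.05)  # polling cadence, not a correctness knob
    return error_request_ids
```

The assertion then checks that the stuck request's ID shows up in `error_request_ids` well before the generous outer budget, proving the C++ deadline (not the test's) did the eviction.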
We already have a timeout in the py_executor level, this test doesn't trigger the py_executor level changes so that could be why that timeout is not triggered.
Correct — this test exercises the C++ check_context_transfer_status API directly, bypassing py_executor. The py_executor-level _check_kv_transfer_timeout → cancel_request → _handle_errors path is intentionally not in the loop here; the point of the test is to verify the C++ kv_transfer_timeout_ms enforcement in isolation.
Why the C++-layer enforcement still matters even though py_executor already has a timeout: the py_executor-level path (_check_kv_transfer_timeout fires py_kv_transfer_timed_out → either _handle_responses Path A or _check_disagg_ctx_cache_transfer_status Path B) calls cancel_request + _handle_errors / _end_transfer_and_maybe_terminate on requests whose C++ transfer is still in flight. Under load, that reliably produces free(): invalid next size on ctx side, and creates a classic UAF window on gen side (gen requests have no AsyncTransferManager holder). See the split analysis on the py_executor.py thread above for per-path specifics.
The Python guards in this PR skip that py-executor cleanup for in-transmission requests and defer to the C++ deadline, which evicts the entry from mSenderFutures / mRequesterFutures before any Python-side free_resources can run. That's the enforcement path the test is directly exercising.
I can add a heavier py_executor-level integration test too if you'd like — it would need a full LLM + 1P1D fixture. Let me know if that's worth it vs. the C++-API unit test.
@Tabrizian client command added to the reproducer section of the PR description — full loadgen script (asyncio, concurrency 64, ISL ~8000, OSL ~200, 180s load + 5 recovery probes) plus a single-shot curl sanity check. It talks to the OpenAI-compatible frontend. I also updated the "what actually breaks" paragraph to reflect the mechanism you pointed out in py_executor.py:3762.
Adding empirical backing for the two inline threads above. Numbers below, mechanism analysis further down. Update (2026-04-17): forensic confirmation via the MALLOC_PERTURB_ run described in the follow-up above.

Setup — reproducible from a fresh TRT-LLM build (no prebuilt image assumed) with a small observability patch that instruments counters on the two Python paths in question. Deploy 1P1D disaggregated serving (one prefill + one decode worker) with:

Workload — asyncio client hitting the frontend's OpenAI-compatible endpoint.

Run A — baseline (no PR #13056 fix), observability only. Counts how often each Python path operates on an in-transmission request:

Run B — same setup and workload, with the Python guards in this PR applied (both Path A and Path B skip termination when the request is in a transmission state):
So:
Mechanism — per-path
Update (2026-04-17): forensic confirmation of the Path B UAF via MALLOC_PERTURB_=85 / MALLOC_CHECK_=3 — see the follow-up in the py_executor.py thread above.
… + guard against dangling pointers

Two fixes for disaggregated prefill/decode KV cache transfer:

1. **C++ total timeout enforcement** (`cacheTransceiver.cpp`): `checkContextTransferStatus` retried stuck transfers indefinitely using only the per-retry `kv_transfer_sender_future_timeout_ms`. The total `kv_transfer_timeout_ms` was plumbed through config but never enforced. Now checks total elapsed time after each per-retry timeout and removes entries that exceed the limit. All pointer dereferences are avoided in timeout and exception paths since `mSenderFutures` holds raw `LlmRequest*` that may be dangling.
2. **Python guard for in-flight transfers** (`py_executor.py`): The `force_terminate_for_partial_reuse` path terminated requests unconditionally, including those in `DISAGG_CONTEXT_TRANS_IN_PROGRESS` state. This freed the `LlmRequest` while C++ `mSenderFutures` still held a raw pointer, causing heap corruption (`free(): invalid next size`). Added `is_disagg_context_transmission_state` check before the termination path — requests in active transfer are skipped and cleaned up when the transfer completes or times out.

Bug origin: PR NVIDIA#6348 (2025-09-27) moved the `is_disagg_context_transmission_state` guard into an else branch, leaving the block-reuse path unprotected.

Tested on nscale B200 cluster with Qwen3-0.6B 1P1D disagg:
- Without fix: crash at ~65s (conc 48, ISL 8000, free_gpu_mem=0.3)
- With Python guard only: no crash, 1 storeContextBlocks warning (was 37)
- With both fixes: no crash, stuck transfers cleaned up by C++ timeout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…eanup

With the Python-side is_disagg_context_transmission_state guard, ALL LlmRequest* pointers in mSenderFutures are guaranteed valid — the guard prevents Python from terminating requests during active transfer. This simplifies the C++ timeout and exception paths:

- Timeout: dereference request->mRequestId, setState(kDISAGG_TRANS_ERROR), add to errorRequestIds → Python calls end_transfer → unpin_blocks
- Exception (Broken promise): same — report as error for cleanup

Previously, timed-out entries were silently erased without reporting to Python, leaving entries in _requests_in_transfer with pinned KV blocks that were never unpinned, causing KV pool exhaustion after overload.

Removed the startTimeValid heuristic — it was checking the pointer to determine if the pointer is safe, which is itself UB on a dangling pointer. With the Python guard, this check is unnecessary.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address CodeRabbit review on NVIDIA#13056: the overall transfer deadline was only loaded when !blockAll, so callers that omit atLeastRequestNum could still hang forever on a stuck sender via future.get().

- Always load kvTransferTimeoutMs from mCacheTransceiverConfig.
- In the timeout branch, enforce the overall deadline regardless of blockAll; only skip-and-retry when the per-iter timeout is set.
- In blockAll mode, fall through to future.get() only when the overall deadline has not yet been exceeded.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…eadline

Empirical testing (v5-diag run on Qwen3-0.6B 1P1D, conc 48, ISL 8000) surfaced three remaining paths that still hit use-after-free or hang:

1. Python `_handle_responses` — the `py_kv_transfer_timed_out` branch (line 3643) calls `cancel_request` + `_handle_errors` + `continue`, bypassing the existing `is_disagg_context_transmission_state` guard. `cancel_request` on the sender only drains `mReadyResponses`; the raw LlmRequest* stays in `mSenderFutures`, so `_handle_errors` frees the request and the next `checkContextTransferStatus` dereferences a dangling pointer. Same hazard on the receiver side with `mRequesterFutures`.
2. Python `_check_disagg_ctx_cache_transfer_status` — the retry-cancel loop (line 3267) iterates `requests_in_transfer`, every entry of which is in `DISAGG_CONTEXT_TRANS_IN_PROGRESS` by construction. Cancelling + terminating always frees an in-transmission request, identical to Path 1. In v5-diag this fired 23 times per 120 s run.

Both Python paths now skip cancel+terminate when the request is in transmission state and defer to the C++ kv_transfer_timeout_ms path, which evicts the entry and reports it via errorRequestIds / ERROR state — at which point termination is safe.

3. C++ `checkGenTransferStatus` (receiver side) — unlike `checkContextTransferStatus`, it never enforced `kv_transfer_timeout_ms`: the happy path called `future.get()` unconditionally, so a stuck receiver hung forever and its entry stayed pinned in `mRequesterFutures`. v5f fixed the prefill UAF but the decode-side hang persisted (10/104 succeeded, probes failed) because no deadline evicted the stuck receivers. Mirror the sender restructure: load `kv_transfer_timeout_ms` (always) and `kv_transfer_sender_future_timeout_ms` (polling mode only); on `wait_for` timeout, check elapsed against the overall deadline and evict with `kDISAGG_TRANS_ERROR` when exceeded.
The existing `_check_cache_transfer_errors` surfaces the ERROR state back to Python, which then safely unwinds the request (the C++ entry was already erased, so there is no dangling pointer).

Additional correctness fix: `CacheReceiver::receiveAsync()` spawns the thread that calls `setKvCacheTransferStart` asynchronously. Until that thread runs, `getKvCacheTransferStart()` returns the default epoch time point and the new gen-side deadline check would falsely evict fresh entries. Set the transfer start synchronously in `CacheTransceiver::requestAndReceiveAsync` before pushing the future.

Debug logging: mirror `respondAndSendAsync`'s insertion log in `requestAndReceiveAsync`, and the per-entry/summary scan in `checkGenTransferStatus` to match `checkContextTransferStatus`. These make future stuck-transfer investigations tractable without rebuilding.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…ss gate

The elapsed-time check added in daf70d4 was nested inside the `status == future_status::timeout` branch of `wait_for(...)`, which is itself gated by `blockAll || toCompleteIdSet.find(...) != end()`. In polling mode (blockAll == false), `toCompleteIdSet` is populated from the initial `wait_for(0)` sweep (consensus across ranks) plus an insert-order fallback that only runs when `atLeastRequestNum > 0`. A stuck sender/receiver whose future never becomes ready is not in the consensus set, and callers of `_check_disagg_gen_cache_transfer_status` typically pass `at_least_num = 0` (see py_executor.py:2930, where `need_check_one` requires ALL non-gen-first requests to be in transmission — narrow). With `atLeastRequestNum = 0`, such entries escape the deadline check entirely and pin KV blocks forever.

Empirical evidence on 1P1D Qwen3-0.6B, conc 48, ISL 8000, 120 s:
- mRequesterFutures.size() stayed at 22 across 384 checkGenTransferStatus polls (blockAll=0, kvTransferTimeoutMs=60000, senderFutureTimeoutMs=10000 confirmed via TLLM_LOG_DEBUG)
- 0 decode-side total-timeout evictions vs 63 on prefill (prefill is masked by `_check_disagg_ctx_cache_transfer_status(1)` callers that rotate entries through `toCompleteIdSet`)
- server did not recover after phase-1 sustained load — all 5 probes timed out

Hoist the deadline check above the readiness gate in both `checkContextTransferStatus` and `checkGenTransferStatus`. A stuck entry is now evicted at `kv_transfer_timeout_ms` regardless of whether the future is ready or whether the entry is in `toCompleteIdSet`. The inner fallback checks in the `wait_for(...) == timeout` branch are removed since the outer check has already evicted any entry whose deadline has passed.

Also add a ctx-side regression test that polls with `at_least_request_num=0` — the default scheduler cadence that would have exposed this bug.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…e evicts

Addresses the post-saturation "no-recovery" bug documented in the gen-side ghost-UCX investigation: with PR NVIDIA#13056 commits 1-6 applied, the UAF/crash is fixed but every probe after a saturation event deterministically returns HTTP 500 at exactly `kv_transfer_timeout_ms`, indefinitely. Py-spy traces show Python blocked in check_gen_transfer_status waiting on at_least_num=1; the scheduler is alive and requests are being accepted, but every new gen transfer never becomes "ready."

Root cause: the timeout-eviction in `checkGenTransferStatus` (`daf70d4034` / `b9450529b`) erases the `std::future<void>` from `mRequesterFutures`, but the underlying task spawned via `CacheReceiver::Impl::requestAndReceiveAsyncMultiThreads` is still running — blocked inside `recvReadySignal` waiting on a peer that will never respond. These zombie tasks hold NIXL/UCX per-request state (TransferSession, buffer registrations). Over a saturation window ~100+ zombies accumulate and pin the receive pool, so new receives queue behind a frozen pool and every new transfer waits the 60s deadline.

Empirical signatures: no crash, server responds, thread and FD counts identical to a fresh worker (the damage is inside C++ lifetime state), and killing only the decode pod recovers fully.

Fix: extend `CacheReceiver::Impl::cancelRequest` to cover in-flight receives, and call it from `checkGenTransferStatus`'s timeout-eviction path.

dataTransceiver.cpp (CacheReceiver::Impl):
- Add a per-request cancel-flag registry (mInFlightCancelMutex + mInFlightCancelFlags) keyed by request id.
- Register a fresh std::atomic<bool> in requestAndReceiveAsyncMultiThreads before enqueueing; attach the shared_ptr to RequestAndPromise so it survives through the worker.
- Worker in request() passes the flag to requestSync; unregisters after requestSync returns (success or exception).
- requestSync takes the flag, checks it at entry, passes it to receiveReadySignal, and re-checks after the ready-signal wait returns.
- receiveReadySignal gains a per-request-cancel overload that plumbs the flag as the DataContext's transferTerminate — the existing polling loop in AgentConnectionManager::waitForNotification then observes the flag via ctx.getTransferTerminate().load() and exits early with isReady=false.
- cancelRequest, when the request is not found in mRequestsQueue, now flips the registered flag (previously it just logged "Cannot cancel"). Returns true for both queued and in-flight cases.
- ~Impl flips all per-request flags alongside setting mTerminate so process shutdown still unblocks recvReadySignal polling loops.
- Impl::receiveAsync (currently unused via the CacheReceiver public API) switched to a lambda to disambiguate the new requestSync overload.

cacheTransceiver.cpp:
- checkGenTransferStatus timeout-eviction branch now calls mCacheReceiver->cancelRequest(*request) before erase, signaling the blocked task to abort via the per-request flag.
- checkContextTransferStatus timeout-eviction branch calls mCacheSender->cancelRequest(*request) symmetrically for defense in depth; sender zombies empirically unwind on peer teardown but clearing the bookkeeping is cheap and consistent.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
- receiveReadySignal gains a per-request-cancel overload that plumbs the flag as the DataContext's transferTerminate — the existing polling loop in AgentConnectionManager::waitForNotification then observes the flag via ctx.getTransferTerminate().load() and exits early with isReady=false. - cancelRequest, when the request is not found in mRequestsQueue, now flips the registered flag (previously it just logged "Cannot cancel"). Returns true for both queued and in-flight cases. - ~Impl flips all per-request flags alongside setting mTerminate so process shutdown still unblocks recvReadySignal polling loops. - Impl::receiveAsync (currently unused via CacheReceiver public API) switched to a lambda to disambiguate the new requestSync overload. cacheTransceiver.cpp: - checkGenTransferStatus timeout-eviction branch now calls mCacheReceiver->cancelRequest(*request) before erase, signaling the blocked task to abort via the per-request flag. - checkContextTransferStatus timeout-eviction branch calls mCacheSender->cancelRequest(*request) symmetrically for defense in depth; sender zombies empirically unwind on peer teardown but clearing the bookkeeping is cheap and consistent. Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
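A minimal sketch of the per-request cancel-flag registry pattern described above, with illustrative names (the real code lives inside CacheReceiver::Impl, keyed by request id in the same way): the worker keeps a strong reference to its flag, so a cancel that races with unregistration still lands safely.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical registry: one atomic cancel flag per in-flight request.
class CancelRegistry
{
public:
    // Called before enqueue; the worker holds the returned shared_ptr so the
    // atomic outlives the map entry.
    std::shared_ptr<std::atomic<bool>> registerRequest(uint64_t reqId)
    {
        std::lock_guard<std::mutex> lk(mMutex);
        auto flag = std::make_shared<std::atomic<bool>>(false);
        mFlags[reqId] = flag;
        return flag;
    }

    // Flip the flag if the request is known (the in-flight case).
    bool cancel(uint64_t reqId)
    {
        std::lock_guard<std::mutex> lk(mMutex);
        auto it = mFlags.find(reqId);
        if (it == mFlags.end())
        {
            return false;
        }
        it->second->store(true);
        return true;
    }

    // Called after requestSync returns (success or exception).
    void unregisterRequest(uint64_t reqId)
    {
        std::lock_guard<std::mutex> lk(mMutex);
        mFlags.erase(reqId);
    }

    // Shutdown path (~Impl): flip every flag so blocked polling loops unwind.
    void cancelAll()
    {
        std::lock_guard<std::mutex> lk(mMutex);
        for (auto& entry : mFlags)
        {
            entry.second->store(true);
        }
    }

private:
    std::mutex mMutex;
    std::map<uint64_t, std::shared_ptr<std::atomic<bool>>> mFlags;
};
```

The shared_ptr ownership split (registry holds one reference, worker holds another) is what makes the unregister-after-cancel and cancel-after-unregister orderings both safe.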
…e send honor perRequestCancel

v11b ([reqSync] instrumentation commit e61a5fa) empirically pinned the
post-saturation no-recovery wedge to CacheReceiver::Impl::sendRequestInfo:
pre-sendRequestInfo fires for a reqId and has no matching post-sendRequestInfo
for 60 s+; after the overall kv_transfer_timeout_ms, cancelRequest flips the
per-request flag and logs "Flipped in-flight cancel flag", but the blocking
send call inside sendRequestInfo never observes the flag because the polling
loops below requestSync don't read ctx.getTransferTerminate(). The drain
worker therefore never returns to the CV wait, and every subsequent enqueue's
notify_all() is delivered to a thread not waiting on it. The queue backs up.
Every new transfer at the 60 s deadline hits the Path A silent erase.

Fix: plumb perRequestCancel through the send path so the blocking calls below
can observe the flag.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (CacheReceiver::Impl):
- sendRequestInfo(LlmRequest) -> overload now delegating to the new
  sendRequestInfo(LlmRequest, std::atomic<bool> const& perRequestCancel) that
  threads perRequestCancel through:
  * checks perRequestCancel at the top of the per-counterpart for-loop
    (throws to abort further peer notifications once a cancel has fired),
  * passes it to the non-agent sendRequestInfo(connection, info,
    perRequestCancel) helper,
  * uses it as the DataContext's transferTerminate in the TransferSession
    constructor, so the downstream data-phase operations (receiveSync ->
    unformat -> per-connection send/recv -> AgentConnection::send) also
    honor the cancel flag.
- sendRequestInfo(Connection*, RequestInfo const&, perRequestCancel): plumbs
  the flag into each of the three DataContext constructions (kID_TAG /
  kINFO_SIZE_TAG / kINFO_TAG) used in the non-agent path.
- requestSync(..., perRequestCancel): calls the new two-arg sendRequestInfo.
- receiveReadySignal(session, perRequestCancel): adds a between-iteration
  perRequestCancel check in the per-peer for-loop so the multi-rank wait
  sequence bails out on cancel.

cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
(AgentConnection::send):
- Replaces the unbounded status->wait() with a polling loop of
  status->wait(100ms) that checks ctx.getTransferTerminate() between polls.
  If the cancel fires mid-send, throws via TLLM_THROW so requestSync unwinds
  through RequestSpecificException; the drain worker's existing
  catch(std::exception) path handles the throw, sets the promise, and
  resumes. TransferStatus exposes no abort primitive, so this leaks at most
  one in-flight xfer descriptor per cancel event — preferable to an
  unrecoverable wedge.

The v11b evidence showed the wedge sub-call is sendRequestInfo. This commit
addresses that path (via DataContext plumbing into the underlying
AgentConnection::send poll loop) and extends the same pattern to
receiveReadySignal (for iteration-level cancel observance) and to the
data-phase transfers that go through the TransferSession's DataContext
(receiveSync's unformat -> AgentConnection::send).

Still out of scope for this commit (per docs/pr-13056-option-c.md):
per-sub-call cancel inside notifySyncMessage (NIXL backend internals not
reachable from this layer), and any sender-side in-flight flag parity
(CacheSender::Impl still lacks the flag mechanism entirely).

See docs/pr-13056-no-recovery-ghost-ucx.md for the saturation repro and
docs/pr-13056-option-c.md for the v11b evidence + decision tree.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
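The poll-wait replacement can be sketched as follows. `waitOrCancel` is a hypothetical free function standing in for AgentConnection::send's inner loop; the real code reads `ctx.getTransferTerminate()` and throws via TLLM_THROW rather than using a bare atomic and std::runtime_error.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <stdexcept>

// Replace an unbounded wait with 100 ms slices so a cancel flag flipped by
// another thread is observed between polls. On cancel, throw so the caller
// unwinds through its existing exception-handling path.
inline void waitOrCancel(std::future<void>& status, std::atomic<bool> const& cancel,
    std::chrono::milliseconds slice = std::chrono::milliseconds(100))
{
    while (status.wait_for(slice) != std::future_status::ready)
    {
        if (cancel.load())
        {
            throw std::runtime_error("transfer cancelled while waiting for completion");
        }
    }
    status.get(); // propagate any stored exception from the transfer
}
```

The trade-off noted in the commit applies here too: the underlying operation is not aborted, only abandoned, so at most one in-flight descriptor can leak per cancel event.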
…Message pre-notify check

Closes the two TBDs left open after Option C (commit 7381830):

1. Sender-side in-flight cancel flag (symmetric to CacheReceiver::Impl)
-----------------------------------------------------------------------
Pre-fix, CacheSender::cancelRequest could only abort requests still in the
mReadyResponses queue that were NOT the mCurrentRequest. In-flight sends
(request is mCurrentRequest, worker is inside sendSync's data transfer) just
logged "Cannot cancel request" and returned false — leaving the send to block
on the peer indefinitely when that peer's receive side has already been
evicted.

Add a per-request cancel-flag registry to CacheSender::Impl:
- mInFlightCancelMutex + mInFlightCancelFlags map
  (reqId -> shared_ptr<atomic<bool>>)
- getOrCreateInFlightCancelFlag(reqId) helper
- sendAsync registers a flag at enqueue time
- recvRequestInfo constructs the TransferSession's DataContext using the
  registered flag (via *cancelFlag) instead of mTerminate, so the downstream
  data-phase send/recv operations — specifically AgentConnection::send's
  poll-wait loop added in 7381830 — observe the flag via
  ctx.getTransferTerminate()
- cancelRequest flips the flag on the previously "Cannot cancel" miss path
  (returns true now)
- removeResponse and the sendResponse cancel-cleanup branch erase the flag
  after send completion / cancel
- terminate() flips every registered flag before joining the response worker
  so shutdown still interrupts in-flight sends (the DataContext references
  per-request flags, not mTerminate)

2. notifySyncMessage pre-notify cancellability
----------------------------------------------
notifySyncMessage drops into the NIXL backend and is not interruptible from
this layer. Two narrow windows existed where a cancel fired but the notify
still ran — fix both:
- AgentConnection::send: added a cancel check between the poll-wait's
  TransferState::kSUCCESS return and the notifySyncMessage call. If the
  cancel fires in that gap, throw instead of sending a notify the peer will
  never act on.
- AgentConnection::sendRequestAndBufferInfo: added an optional
  `std::atomic<bool> const* perRequestCancel = nullptr` parameter. When
  non-null and flipped true, throws before notifySyncMessage. The caller in
  CacheReceiver::Impl::sendRequestInfo passes &perRequestCancel.

Limitations still standing after this commit
--------------------------------------------
- notifySyncMessage itself is not interruptible mid-call. The pre-notify
  check only catches the gap between wait() returning and the notify
  starting. If notifySyncMessage blocks internally on a stuck peer (it is
  unclear whether NIXL's genNotif is fully synchronous), the cancel can't
  unwind it from this layer. Full remediation would need a new cancellation
  API on BaseTransferAgent.
- TransferStatus exposes no abort primitive; AgentConnection::send's mid-poll
  throw leaks at most one in-flight xfer descriptor per cancel event
  (acceptable vs the alternative of a drain-worker wedge).
- Race window between sendAsync (sender registers flag) and recvRequestInfo
  on the sender side — if the control message from the peer arrives before
  sendAsync registers, getOrCreateInFlightCancelFlag creates the flag at
  recvRequestInfo time, so the lifecycle is intact, but a cancel fired in
  that early window uses the getOrCreate flag as the authority — confirmed
  consistent.
- No sender-side equivalent of the catch(...) drain-loop hardening from
  commit f63172f; CacheSender's response() worker has a top-level
  catch(std::exception) but no catch(...). If a non-std throw escapes from
  sendSync, the sender worker thread dies and all queued sends stall.
  Separate commit.
- No regression test for sender-side in-flight cancel (same constraint as the
  Option C commit: requires a stalled-peer harness).

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
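The optional-pointer parameter pattern from item 2 can be sketched in isolation. All names here are stand-ins: a vector records the "notify" in place of the real notifySyncMessage call, and `sendRequestAndBufferInfo` is reduced to the cancellation-relevant shape only.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Nullable cancel pointer: nullptr means "no per-request cancellation"
// (legacy callers compile unchanged); a non-null flipped flag aborts just
// before the non-interruptible notify would run.
inline void sendRequestAndBufferInfo(std::vector<std::string>& sentNotifies, std::string const& info,
    std::atomic<bool> const* perRequestCancel = nullptr)
{
    if (perRequestCancel != nullptr && perRequestCancel->load())
    {
        // Cancel fired in the gap before the notify: throw rather than send
        // a notification the peer will never act on.
        throw std::runtime_error("cancelled before notify");
    }
    sentNotifies.push_back(info); // stands in for notifySyncMessage
}
```

The default argument is what keeps the change additive: only the one call site that owns a per-request flag opts in.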
…t catch(...) to sender worker

Two pre-e2e cleanups identified during a code trace:

1. Sender-side mInFlightCancelFlags erase was unsafe
----------------------------------------------------
In commit 3736d11 I erased the per-request cancel flag from
mInFlightCancelFlags inside removeResponse() and inside the sendResponse
cancel-cleanup else-branch. The session's DataContext (built in
recvRequestInfo) holds a reference (not ownership) to the atomic managed by
the shared_ptr in that map. Tracing:

- Success path: sendAndRemoveResponse calls release(id), which destroys the
  session, THEN sendResponse calls removeResponse. With the erase in
  removeResponse, the flag was dropped AFTER the session — safe.
- Cancel-throw path: sendSync throws (via the Option C cancel check),
  sendAndRemoveResponse's catch skips release(id), and the session remains
  in mRequestToSession. removeResponse still runs → flag erased while the
  session (with its DataContext ref) is still live. Dangling reference.
- mCancelledRequests cancel path (isReady=false): release not called; the
  inline erase of the flag similarly dangles the session's DataContext.

Nothing in the current code dereferences the session's DataContext after
sendSync returns, so this is a latent UAF rather than an observed crash. But
it's fragile and would trip the first time anyone adds post-send session
iteration.

Move the flag erase out of removeResponse and out of sendResponse's
cancel-cleanup into release(id), where the session is actually destroyed.
Cancel paths that skip release() intentionally leak the flag shared_ptr until
~Impl / terminate(), matching the existing session-leak behavior on those
same error paths. release() orders session-destroy before flag-drop, so the
DataContext reference is never dangling.

2. CacheSender::Impl::response() lacked catch(...)
--------------------------------------------------
Symmetric to the catch(...) added to CacheReceiver::Impl::request() in commit
f63172f. response() is noexcept, so a non-std::exception escape (an integer
throw, or a custom-type throw from the NIXL/UCX backends) would call
std::terminate. Adding catch(...) resolves pending promises with
current_exception() so callers see a failure, then exits the worker cleanly.
The worker thread still exits after the catch — the sender is dead for this
process, but failing closed via promises is strictly better than
terminate().

No behavior change on the happy path. No new primitives. The trace is now
clean for both the receiver and sender sides of the cancel plumbing.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
… diagnostic [buf] logs
Trace through the receiver-side sendRequestInfo path identified a
more likely wedge site than notifySyncMessage: the CV wait inside
BaseTransBufferManager::assignBufferIndex (baseTransBuffer.cpp:228
pre-fix) parks on `mBuffersCV.wait(lk, predicate)` with no timeout
and no awareness of the per-request cancel flag. Under saturation
with N in-flight wedged transfers all holding their receive-buffer
slots, every new call to sendRequestInfo parks here and never
observes perRequestCancel — exactly fitting v11b's [reqSync]
pre-sendRequestInfo-with-no-post-sendRequestInfo signature.
Unlike notifySyncMessage (a thin wrap over NIXL's genNotif with no
timeout API), the buffer-pool CV does have a natural wait_for
variant we can plumb.
Changes
-------
baseTransBuffer.h: extend public `assignBufferIndexForRecv` and
private `assignBufferIndex` with three optional parameters:
- std::atomic<bool> const* perRequestCancel = nullptr
- int64_t waitSliceMs = 100
- std::optional<uint64_t> requestIdForLog = std::nullopt
Default values preserve existing behavior for unchanged callers;
the new capability is opt-in.
baseTransBuffer.cpp: replace the unconditional `mBuffersCV.wait`
with a bounded loop that:
- re-checks the predicate every waitSliceMs via wait_for;
- checks perRequestCancel between slices and TLLM_THROWs on cancel,
so the drain worker can unwind via its existing catch(std::exception);
- logs [buf] markers — ENTER (WAIT) when the pool is exhausted,
STILL_WAITING every ~5 s of parking (50 × 100 ms), ACQUIRED on
success, CANCEL on cancel — each tagged with the reqId so they
cross-reference cleanly with [reqSync] and [drain] log trails.
The hot path (pool not exhausted) takes the no-wait fast-path with
zero log spam.
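The bounded CV wait can be sketched as below, under the assumption of a hypothetical `boundedWait` helper; the real assignBufferIndex holds the pool lock, frees a slot index on success, and emits the [buf] log markers, all of which are omitted here.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <stdexcept>

// Re-check the predicate every waitSliceMs and observe an optional
// per-request cancel flag between slices. The fast path (predicate already
// true) never sleeps and never logs.
template <typename Pred>
void boundedWait(std::condition_variable& cv, std::unique_lock<std::mutex>& lk, Pred pred,
    std::atomic<bool> const* perRequestCancel, int64_t waitSliceMs)
{
    if (pred())
    {
        return; // fast path: pool not exhausted
    }
    while (!cv.wait_for(lk, std::chrono::milliseconds(waitSliceMs), pred))
    {
        if (perRequestCancel != nullptr && perRequestCancel->load())
        {
            // Unwind so the drain worker's existing catch handler can
            // resolve the promise instead of parking forever.
            throw std::runtime_error("cancelled while waiting for a buffer slot");
        }
    }
}
```

Because `wait_for` re-evaluates the predicate on every wakeup, a slot freed by another thread is picked up within one slice even if the notify is missed.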
dataTransceiver.cpp (CacheReceiver::Impl::sendRequestInfo, agent
path): pass &perRequestCancel and the llmRequest's mRequestId into
each assignBufferIndexForRecv call. Wrap the loop with [reqSync]
pre-/post-assignBufferIndexForRecv markers so the v12 diagnostic
trail can distinguish pool-exhaustion wedges from deeper NIXL
notifySyncMessage wedges without ambiguity.
Expected behavior after this change on the v11b workload
--------------------------------------------------------
If the wedge was pool exhaustion (the most-likely hypothesis):
- [buf] assignBufferIndex WAIT fires whenever saturation hits
- Some mix of ACQUIRED (pool drains, wait wakes up) and CANCEL
(kv_transfer_timeout_ms-driven cancelRequest unblocks the wait)
- No more unmatched [reqSync] pre-sendRequestInfo; all requests
either finish or are cancelled with a captured latency and
pool-occupancy snapshot.
If the wedge was actually deeper (notifySyncMessage / NIXL genNotif):
- [reqSync] pre-assignBufferIndexForRecv / post-assignBufferIndexForRecv
pair cleanly (no [buf] WAIT → STILL_WAITING pattern)
- [reqSync] post-sendRequestInfo is still missing for wedged reqIds
- Diagnosis shifts to limitation 1 (notify backend) with the next
fix direction being a supervisor thread or a NIXL API addition
Either outcome narrows the remaining problem surface.
Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…aking pool slots on non-happy exits
v12 image (PR HEAD with assignBufferIndexForRecv bounded-wait + cancel
plumbing) reproduced saturation with Phase 1 = 11/104 and probes 0/5.
Decode log showed the mechanism unambiguously:
- 12 reqIds reached [reqSync] post-assignBufferIndexForRecv
(buffer slot acquired from the size-1 pool)
- 11 completed via the normal receiveSync → formatter →
freeBufferIndexForRecv chain (slot released)
- 1 (reqId=8584867770318859) took the `(not-ready or cancel after
ready)` early-return branch because prefill's own ctx-timeout
fired and signalled isReady=0 — receiveSync was skipped entirely,
so the formatter never released the slot. That single leak
permanently wedged the size-1 pool, and every subsequent
assignBufferIndexForRecv parked on the CV wait forever.
Root cause pattern
------------------
CacheReceiver::Impl::requestSync has at least seven exit paths (normal,
early-cancel, not-ready, cancel-after-ready, cancel from inside
sendRequestInfo, cancel from inside receiveReadySignal, exception
escape). Pre-fix, only one of them — normal → receiveSync →
formatter — released the buffer index. All six other paths leaked
it. v12 empirically proved the (not-ready) branch; the other five
were latent but reachable under different workloads.
Fix: RAII holder
----------------
New BufferIndexHolder in baseTransBuffer.h — move-only scoped owner
for a buffer index, releases on destructor (including stack unwind
from exceptions). detach() hands off ownership when the formatter
inside receiveSync takes over on the happy path.
Changes:
cpp/tensorrt_llm/batch_manager/baseTransBuffer.h:
- Added BufferIndexHolder class (inline, move-only, noexcept
release).
cpp/tensorrt_llm/batch_manager/baseTransBuffer.cpp:
- BufferIndexHolder::release() out-of-line so it can call
freeBufferIndexForRecv/Send on the manager. Emits
[buf] AUTO_RELEASE on each RAII release so post-saturation
traces can attribute how often non-happy-path exits fire vs
formatter releases.
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (CacheReceiver::Impl):
- requestSync now acquires each receive-buffer index directly
(moving the acquisition up from sendRequestInfo), wraps each
acquisition in a BufferIndexHolder, and passes the raw
cacheBufferIds vector into sendRequestInfo as a new parameter.
- sendRequestInfo(llmRequest, perRequestCancel) becomes a thin
delegator to a new three-arg overload sendRequestInfo(
llmRequest, perRequestCancel, preAcquiredCacheBufferIds). The
three-arg variant skips the internal acquire when the caller
provides ids (requestSync path), matching pre-RAII behavior
when the vector is empty (legacy / public-wrapper path).
- After receiveSync returns on the happy path, detach() every
holder so the formatter's existing freeBufferIndexForRecv call
doesn't double-release.
- All other exits from requestSync (early-cancel, not-ready,
cancel-after-ready, throws from sendRequestInfo /
receiveReadySignal) unwind through ~BufferIndexHolder and
release the slot. No branch-specific release calls needed.
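A minimal sketch of the move-only holder, using a generic free callback in place of the manager's freeBufferIndexForRecv/Send calls (the real class carries the manager pointer, a send/recv tag, and a requestIdForLog; those are assumptions dropped here for brevity):

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Scoped owner for one buffer-pool index: releases on destruction (including
// stack unwind from exceptions) unless ownership was handed off or released.
class BufferIndexHolder
{
public:
    BufferIndexHolder(int index, std::function<void(int)> freeFn)
        : mIndex(index)
        , mFreeFn(std::move(freeFn))
        , mHeld(true)
    {
    }

    BufferIndexHolder(BufferIndexHolder&& other) noexcept
        : mIndex(other.mIndex)
        , mFreeFn(std::move(other.mFreeFn))
        , mHeld(other.mHeld)
    {
        other.mHeld = false;
    }
    BufferIndexHolder(BufferIndexHolder const&) = delete;
    BufferIndexHolder& operator=(BufferIndexHolder const&) = delete;

    ~BufferIndexHolder()
    {
        if (mHeld)
        {
            mFreeFn(mIndex); // non-happy-path exits release here (AUTO_RELEASE)
        }
    }

    // Hand ownership to a downstream owner (the formatter on the happy path).
    void detach()
    {
        mHeld = false;
    }

    // Explicit happy-path free: release and disarm in one call.
    void release()
    {
        if (mHeld)
        {
            mFreeFn(mIndex);
            mHeld = false;
        }
    }

private:
    int mIndex;
    std::function<void(int)> mFreeFn;
    bool mHeld;
};
```

The key property is that no branch-specific cleanup is needed: any exit from the owning scope, including a throw, drops the slot exactly once.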
Expected v13 behavior
---------------------
- Phase 1 completion count substantially higher than 11/104 —
pool no longer wedges after the first (not-ready) exit.
- All 5 recovery probes return HTTP 200 within single-digit
seconds (assuming limitations 1-3 don't trip).
- [buf] AUTO_RELEASE fires on (not-ready) and cancel branches;
count correlates with early-exit traffic.
- [buf] WAIT → ACQUIRED pairs stay bounded; no permanent
concurrence=C/C saturation.
- The verification script in docs/pr-13056-raii-buffer-release.md
(every post-assignBufferIndexForRecv reqId has a matching
[reqSync] EXIT) runs clean.
Still not addressed here
------------------------
- Sender-side parity (assignBufferIndexForSend release discipline
on CacheSender's non-happy paths) — separate commit if observed
to leak.
- Pool size = 1 is architecturally small for conc=48 workloads.
Throughput will be bounded by per-request buffer dwell time
even without leaks. Separate tuning discussion.
Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
…qId-tagged [buf] logs

Extends the receiver-side RAII pattern (30f9cb2) to the two sender-side cache
formatters so any non-happy exit between assignBufferIndexForSend and the
explicit freeBufferIndexForSend auto-releases the send-pool slot. v13
saturation testing showed the same wedge class on the send side: one early
exit permanently wedged the size-1 pool, and every subsequent transfer parked
forever on the CV wait.

Happy-path timing is unchanged: sendAllBuffers returning is still treated as
"transport done, slot safe to free immediately" (the pre-PR invariant at
cacheFormatter.cpp:568 / mlaCacheFormatter.cpp:383). The holder is detached
before the existing free call, so the destructor is a no-op on the happy
path; only non-happy exits fire AUTO_RELEASE.

Also closes three instrumentation gaps for diagnosing a wedged pool:

1. assignBufferIndexForSend gains an optional requestIdForLog parameter
   (parity with assignBufferIndexForRecv), forwarded into the internal
   bounded CV-wait so send-pool WAIT / ACQUIRED / STILL_WAITING / CANCEL
   lines can be attributed to a specific reqId.
2. BufferIndexHolder carries requestIdForLog, so the AUTO_RELEASE log on
   stack unwind names the responsible request. Without it, the v13-class bug
   surfaces as "something leaked" with no way to cross-reference the
   [reqSync] trail. The receiver-side call site now passes reqIdForLog too.
3. New SEND_ACQUIRED / SEND_RELEASE DEBUG markers at both formatter sites
   give per-request send-pool turnover without relying on AUTO_RELEASE firing
   only on error paths.

rnnCacheFormatter has the same bug class at its send/recv acquire sites but
is off the NIXL ghost-UCX path that drove this PR — flagged as a follow-up
rather than expanding scope here.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
… FUTURE_WAIT/JOIN diagnostics
Follow-up to the sender-side RAII commit. Addresses three residual risks
raised during review:
1) **Race between sendAllBuffers return and the explicit free.**
The previous detach() + freeBufferIndexForSend() pair left a narrow
window where a throw between the two calls (e.g. from setTime's
std::map::at()) would trip the destructor fallback and emit a
misleading AUTO_RELEASE for what is really a correctly-handled
exception. Replaced with a single noexcept BufferIndexHolder::release()
(now public) that frees and disarms atomically at the API level, and
moved the call to immediately after sendAllBuffers returns. Post-
release throws no longer leak OR pollute the AUTO_RELEASE leak signal.
BufferIndexHolder now has three API modes:
- release(): happy-path, silent free + disarm.
- detach(): hand off ownership to a downstream owner (recv side uses
this when receiveSync's formatter takes the release responsibility).
- destructor fallback: any exit path that left mHeld=true — logs
AUTO_RELEASE so the non-happy exit is visible in [buf] diagnostics.
2) **No cancel on send-pool CV wait.**
assignBufferIndexForSend now accepts (perRequestCancel, waitSliceMs,
requestIdForLog) — full parity with assignBufferIndexForRecv. Both
formatter sites pass &session.getDataContext().getTransferTerminate()
so a cancel-in-flight while the sender is parked on a saturated pool
unblocks via the existing bounded-wait loop in assignBufferIndex.
3) **Wrong-hypothesis risk: wedge inside sendAllBuffers, not at formatter.**
Added [send] diagnostic markers in the send loops so a wedge in the
transport layer is distinguishable from a formatter-level short-circuit:
- [send] FUTURE_WAIT reqId=X batch=K concurrency=C remain=R (before future.get())
- [send] FUTURE_JOIN reqId=X batch=K elapsedMs=E (after future.get())
- [send] START/PRE/DONE reqId=X connIdx=N ... (per send DEBUG)
These pair with the existing [buf] WAIT/ACQUIRED/AUTO_RELEASE trail
so a v13-class wedge can be attributed to (reqId, batch, connIdx) even
if the cause turns out to be transport-layer rather than formatter-
level. FUTURE_WAIT/JOIN are WARN-level (one per concurrency batch);
per-send markers stay DEBUG to avoid hot-path spam.
Touches both cacheFormatter.cpp and mlaCacheFormatter.cpp symmetrically.
rnnCacheFormatter still uses the simple no-arg assignBufferIndexForSend();
listed as a follow-up (off the NIXL ghost-UCX path).
Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
…eiver — close UAF, remove Path A/B guards, unwedge KV block leak
The KV-block leak described in docs/pr-13056-next-step-asan-and-leak.md
Track B (decode pod wedged at 0/213 requests scheduled after the gen
transfer deadline fired) is a direct consequence of the Path A/B
guards in py_executor.py deferring Python-side _terminate_request
while a request is DISAGG_*_TRANS_IN_PROGRESS. Those guards exist to
avoid a forensically-confirmed UAF on CacheTransceiver's raw
LlmRequest* tracking (mRequestId observed as 0x5555555555555555 under
MALLOC_PERTURB_=85 — glibc's free-fill pattern read through a dangling
pointer). Whenever the C++ deadline eviction failed to reach an entry
— e.g. the receiver-side orphan class where Python tracks IN_PROGRESS
but the C++ mRequesterFutures has no future to iterate — the guards
kept skipping cleanup forever and the request pinned its KV blocks.
Instead of working around the UAF with a grace-period or
reconciliation backstop, this commit fixes the UAF at the source:
- BaseCacheTransceiver's respondAndSendAsync / requestAndReceiveSync /
requestAndReceiveAsync / cancelRequest now take
std::shared_ptr<LlmRequest> (not a raw pointer). Pybind already
exposes LlmRequest via std::shared_ptr, so the binding auto-converts
at the boundary with no Python-side changes.
- CacheTransceiver::mSenderFutures and mRequesterFutures now hold
std::shared_ptr<LlmRequest> in the pair. Entries are still erased on
the normal completion / deadline / exception paths, so the strong
reference is dropped at the end of the transfer lifecycle — no leak.
- CacheReceiver::Impl::RequestAndPromise::mRequest and
CacheSender::Impl::Response::mRequest now hold
std::shared_ptr<LlmRequest>. The receive-async std::async lambda
captures the shared_ptr by value, so Python-side _terminate_request
can no longer drop the LlmRequest out from under an in-flight
worker's dereferences. Same for the send-async path's Response.
- With the UAF closed, the Path A guard in _handle_responses (gen-side
timeout-and-skip-if-IN_PROGRESS) and the Path B guard in
_check_disagg_ctx_cache_transfer_status (ctx-side ditto) are no
longer needed. Both are removed: when py_kv_transfer_timed_out
fires, Python calls cancel_request + _handle_errors immediately,
recovering the KV blocks even when the C++ tracker has orphaned the
entry.
Lifecycle after this commit, for a non-orphan request:
1. Python: request_and_receive_async → C++ enqueues, async worker
spins up.
2. Worker finishes (completion / peer ack / cancel) → future
resolves.
3. checkGenTransferStatus sees future ready → completeEntry sets
state=COMPLETE, erases entry (drops its shared_ptr).
4. Python observes state=COMPLETE → runs through normal completion
path (no _terminate_request yet; request proceeds to
generation).
5. Eventually Python's active_requests drops the request → pybind
shared_ptr refcount=0 → LlmRequest destroyed.
For an orphan (C++ lost the future but Python state still IN_PROGRESS):
1. Python _check_kv_transfer_timeout sets py_kv_transfer_timed_out.
2. _handle_responses sees the flag; with the guard removed, proceeds
to cancel_request + _handle_errors.
3. _handle_errors → _terminate_request → free_resources →
remove_sequence → KV blocks released.
4. LlmRequest still alive while any remaining C++ reference holds
it (async worker if any); eventually worker drops, request
destroyed. No UAF.
Also reverts 80474ce's grace-period fallback — it was a fragile
workaround (magic 2x multiplier, wall-clock vs monotonic skew, residual
UAF window) for the same underlying issue the shared_ptr refactor
cleanly fixes at the source. The [reqFut] instrumentation from
5904b9e stays (still useful for investigating the orphan formation
path itself; the leak is now decoupled from the orphan).
Scope: touches BaseCacheTransceiver's API boundary but since every
caller already holds std::shared_ptr<LlmRequest> (RequestVector is
std::vector<std::shared_ptr<LlmRequest>>; pybind auto-converts),
existing call sites drop their .get() and compile unchanged.
No behavior change for Python tests — they pass Python LlmRequest
objects which pybind converts to shared_ptr.
Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
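The by-value shared_ptr capture that closes the UAF can be demonstrated in isolation. `Request` and `receiveAsync` are stand-ins for LlmRequest and the real receive-async path; the point is only the ownership semantics, not the transfer logic.

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <memory>

// Stand-in for LlmRequest.
struct Request
{
    uint64_t id;
};

// The async worker captures the shared_ptr BY VALUE, so a caller dropping
// its own reference (the Python-side _terminate_request analogue) cannot
// destroy the object out from under the worker's dereferences.
inline std::future<uint64_t> receiveAsync(std::shared_ptr<Request> req)
{
    return std::async(std::launch::deferred, [req]() { return req->id; });
}
```

Pybind already exposes LlmRequest via shared_ptr, which is why the commit can change the C++ signature without touching Python callers: the binding layer supplies the strong reference at the boundary.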
…ion escapes

The v9 in-flight cancellation primitive (267e813) assumes the
requestAndReceiveAsyncMultiThreads drain worker is alive and pulling from
mRequestsQueue. In the post-saturation "no recovery" repro the worker has
either exited or is no longer consuming the queue, so:

- New requests enqueue but are never dequeued.
- The in-flight cancel flag is registered before enqueue, so the flag map
  has entries for every pending request.
- When the 60 s deadline fires, cancelRequest finds the request still in the
  queue (Path A), erases it silently, and returns true. The per-request-flag
  flip path (Path B) is never reached because Path A always succeeds first.

The fix must target the worker lifecycle. Two additions here:

1. Add a catch(...) block to CacheReceiver::Impl::request()'s drain loop.
   Today the only handlers are RequestSpecificException / std::exception, so
   any non-std throw from NIXL / UCX / the C++ ABI escapes the loop,
   terminates the worker thread, and strands every queued receive. The
   catch-all sets the caller's promise (so the future resolves with an
   exception rather than hanging) and continues the loop.
2. Add TLLM_LOG_WARNING markers at ENTER / pre-requestSync /
   post-requestSync / EXIT / catch-all so the next image rebuild can
   distinguish between:
   - worker never launched (no ENTER)
   - worker died on a non-std escape (EXIT with mTerminate=0 plus a matching
     catch-all log)
   - worker deadlocked inside requestSync (pre-requestSync with no matching
     post-requestSync for >kv_transfer_timeout_ms)
   - worker healthy (many pre/post pairs, no EXIT)

The decision tree in the ghost-UCX investigation doc maps each pattern to
the next diagnostic step. The catch(...) is both the diagnostic that would
name the culprit and a likely fix — keeping the drain loop robust to any
escape has no downside, and it closes the lifecycle hole that the
per-request cancel mechanism implicitly depends on.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
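The promise-resolving catch-all can be sketched as one drain-loop iteration. `processOne` is a hypothetical helper, not the real request() worker; the real loop additionally logs the markers described above and keeps iterating after the catch.

```cpp
#include <cassert>
#include <exception>
#include <future>
#include <utility>

// One drain-loop iteration: run the work item, resolve its promise either
// with a value or with whatever was thrown. catch(...) covers non-std
// throws (integer throws, custom types from backend code), so the worker
// thread survives and the caller's future fails instead of hanging.
template <typename Fn>
void processOne(std::promise<void>& promise, Fn&& work)
{
    try
    {
        std::forward<Fn>(work)();
        promise.set_value();
    }
    catch (...)
    {
        promise.set_exception(std::current_exception());
    }
}
```

Failing closed through the promise is the property the commit relies on: even if the work item misbehaves, no caller is ever left waiting on a future that can never resolve.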
33262c5 to
369d944
Compare
…opped during d7844fc cherry-pick conflict resolution on the 4e69c14 base

PR 13056 commit d7844fc preserves an is_disagg_context_complete_state branch
that was introduced earlier in the PR's base. On the rc11 base (4e69c14) that
guard is not present, so the conflict resolution during the cherry-pick
dropped it — which leaves _handle_responses vulnerable to double-terminating
a COMPLETE-state request that _end_transfer_and_maybe_terminate left in
active_requests (end_transfer returned False). Re-add the guard to match PR
13056's intent. nvbug/5961736.

Signed-off-by: Yifan Jiang <yifjiang@users.noreply.github.com>
8217bcb to
c9777c4
Compare
…ifestations broaden sig NVIDIA#7 to a CacheSender::Impl::* bug class

run9 (rc11 + our fixes + UCX) and run10 (PR NVIDIA#13056 + UCX) both used a
gdb capture loop on the dataTransResp thread to pin down the exact mutex
behind sig NVIDIA#7. The wedge changed character in both runs:

- run9: the ctx mpi worker SIGSEGVs at iter 92 of the burst inside
  _PyObject_GenericGetAttrWithDict (Python C-API), downstream of the sig #1
  fix path firing cleanly.
- run10: the ctx mpi worker SIGSEGVs synchronously inside
  CacheSender::Impl::handleAsyncSend(AsyncSendResource&) on the very first
  sanity-probe request — no concurrency, no burst, no cancellation. Zero
  UCX/NIXL frames in the stack; the root cause is most plausibly a null
  shared_ptr<LlmRequest> deref at line 594, since PR NVIDIA#13056's
  ownership model is necessary but does not enforce Response::mRequest
  non-null at the producer or the consumer.

Both findings are cleanly TRT-LLM-internal and broaden sig NVIDIA#7 from a
single pthread_mutex_lock deadlock to a class of CacheSender::Impl::* bugs
with at least four observed manifestations across two transports and three
fix bundles.

Touched sections: Status block, caveat block, signature table (sig NVIDIA#7
row), sig NVIDIA#7 section variants/fix/status, Phase 12 cross-reference,
new Phase 13 section, Next Steps items 7 / 7a / 7b.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
…le structure

Split the 3193-line single-file report into 13 reader-friendly files under
docs/investigations/nvbug-6104831-disagg-permanent-wedge/:

- README.md: top-level orientation, status, navigation, suggested reading
  paths.
- 01-background.md: architecture diagrams, request lifecycle walkthrough,
  state machine, cancellation flow.
- 02-failure-signatures.md: the seven failure signatures (#1-NVIDIA#7).
- 03-defect-class-stack.md: NEW: the L1-L8 defect-class layering that frames
  the four-approach comparison.
- 04-reproduction.md: how to reproduce locally, load shape, run archive
  index.
- 05-investigation-timeline.md: chronological story (Phases 0-14).
- 06-fix-approaches/: side-by-side comparison plus one file per approach
  (A chained, B PR NVIDIA#13056, C PR NVIDIA#13495, D combo).
- 07-architectural-reflections.md: seven invariants + retrospective.
- 08-next-steps-and-pr-map.md: PR map, outstanding work, deadline effort
  estimate, run archive index.

The L1-L8 defect class stack is the new framework introduced here. It
re-frames the seven signatures as the visible faces of eight underlying
invariant gaps and is used to explain why the combo approach (D) is the only
stack that recovers cleanly across the test matrix.

The five Mermaid diagrams from the request walkthrough move into
01-background.md as the prerequisite reading for the rest of the
investigation. The single-file form was hard to navigate at this size; the
split follows the user's reading-path needs (cold reader, single-PR
reviewer, fix-path picker, retrospective reader) called out in the README.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Adds 00-tldr.md as the front door to the investigation. Targets a 10-minute read with two inline Mermaid figures:

- The disagg architecture + cancellation flow (where the bugs live).
- The L1-L8 → combo PR NVIDIA#13713 fix mapping (which piece closes which layer).

Written under the assumption that PR NVIDIA#13713 (combo: PR NVIDIA#13056 + PR NVIDIA#13495 + eval-order fix + Python idempotency guards) is the chosen final solution. Doesn't compare approaches A/B/C — that comparison lives in 06-fix-approaches/README.md for readers who want depth on the landing decision.

Section structure:
1. What is broken (the wedge symptom).
2. Where the bugs live (architecture figure).
3. Root cause: eight invariant gaps (L1-L8 table).
4. The fix (combo figure + per-piece breakdown).
5. Does it work (empirical recovery table).
6. Why this took 8 days to find (cascade pattern).
7. What is left to do.
8. Where to go from here (reader paths to deeper files).

README.md updated to put 00-tldr.md at the top of the navigation table and add a 10-minute reading path entry.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
… colored fix components

Two improvements to 00-tldr.md based on review feedback:

1. Add a 7-row table after the architecture section that gives a one-line "where it lives" + "symptom" for each signature #1-NVIDIA#7. The TL;DR previously mentioned the seven signatures by number without describing them, forcing the reader to jump to 02-failure-signatures.md to make sense of references like "sig NVIDIA#4" or "sig NVIDIA#7".
2. Color-code the four fix components in the combo diagram:
   - PR NVIDIA#13056: blue (lifetime + cancel-flag + RAII)
   - PR NVIDIA#13495: orange (NIXL TransferStatus::release)
   - eval-order fix: green
   - Python idempotency guards: purple

Each L1-L8 layer node is tinted with the lighter shade of whichever fix closes it, so the "closes" arrows are reinforced by colour-matching. This makes the fix-to-layer mapping visually scannable at a glance. A note added underneath the diagram explains the colour scheme.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Summary
Fixes NVBug 6104831: disaggregated-serving wedges where prefill/decode workers either crash with
`std::future_error: Broken promise` / `bad_optional_access`, or silently leak KV-cache blocks and buffer-pool slots until the pool saturates and no new transfers can start. Full investigation notes are in the debug doc.

16 coupled C++/Python commits covering: (a) closing a raw-pointer UAF class on the transceiver's futures maps, (b) enforcing `kv_transfer_timeout_ms` on both sender and receiver, (c) plumbing `perRequestCancel` through the data-phase paths so evicted requests actually release their resources, (d) RAII on the buffer-index pool, and (e) hardening the drain worker against non-`std::exception` escapes.

One-sentence version: `shared_ptr` ownership + enforced deadline + working cancel path + RAII on pool slots.
Why all 16 commits?
Commit 14 (`649d1466b`) is the UAF fix proper — it changes `mSenderFutures` / `mRequesterFutures` in `CacheTransceiver` to hold `std::shared_ptr<LlmRequest>` instead of raw pointers. Confirmed via `MALLOC_PERTURB_=85` (observed `mRequestId=0x5555555555555555`). With that ownership change, the Python-side Path A/B guards that previously deferred cleanup to avoid the UAF are no longer needed; commit 14 removes them.

But fixing the UAF alone doesn't give a working system. These are coupled fixes, not independent — each addresses a distinct failure mode on the same bug family, and removing any one leaves a different no-recovery path open:
- (a) UAF / crash class (`Broken promise` / `bad_optional_access`): commit 4 (`43b518070`).
- (b) Deadline enforcement: commits 1–3, 5 (`a31977564`, `a3bce0ce3`, `a8c27a45b`, `3d0a6d183`). Without these, the `std::async` task stays blocked in `recvReadySignal` even after the deadline fires.
- (c) Cancel path: commit 6 (`6bb116dac`) + 7–10 cancel-flag plumbing (`62c5d24ae`, `7e8016a2f`, `e7fcbab01`, `2860ac50a`).
- (d) Buffer-index RAII: commits 11–13 (`39f5452ea`, `28f836e19`, `6f85ce895`). Without these: non-happy exit from `requestSync` → index never released → size-1 buffer pool permanently wedged.
- (e) Drain-worker hardening against non-`std::exception` escapes: commit 15 (`abfcb6d39`). Without it, a `std::bad_optional_access` (or similar) still kills the thread.

Concretely, running a 1P1D disagg loadgen stress reproducer (single prefill worker, single decode worker, concurrent request burst at the saturation point, then idle and probe for recovery):
- Crashes (`Broken promise`, `bad_optional_access`) disappear, but stuck transfers still never get evicted. The pool exhausts, and behavior looks superficially similar to the pre-fix wedge (no crash marker, no forward progress). Still no recovery after a 180 s idle probe window.
- A `std::async` task keeps spinning inside `recvReadySignal`, pinning NIXL state. After a saturation window, every new transfer deterministically waits `kv_transfer_timeout_ms` even if the system is otherwise idle.

Order also matters on the removal side: commit 14 only becomes safe after the earlier commits are in place, because the deadline + cancel plumbing they add is what the Path A/B guards were substituting for. Removing the guards without that plumbing in place would regress the orphan-induced KV-block leak the guards existed to mitigate.
Commits (in applied order)
1. `a31977564` — Enforce `kv_transfer_timeout_ms` on prefill sender + guard against dangling pointers
2. `a3bce0ce3` — Report timed-out transfers to Python for full cleanup
3. `a8c27a45b` — Apply `kv_transfer_timeout_ms` in blockAll mode
4. `43b518070` — Close remaining UAF paths and enforce gen-side deadline
5. `3d0a6d183` — Hoist `kv_transfer_timeout_ms` check above readiness gate
6. `6bb116dac` — Release receiver resources when gen-side deadline evicts
7. `62c5d24ae` — `sendRequestInfo` / `receiveReadySignal` / data-phase send honor `perRequestCancel`
8. `7e8016a2f` — Sender-side in-flight cancel parity + `notifySyncMessage` pre-notify check
9. `e7fcbab01` — Tie sender cancel-flag lifetime to session + port `catch(...)` to sender worker
10. `2860ac50a` — Bound `assignBufferIndex` CV wait + plumb cancel + diagnostic `[buf]` logs
11. `39f5452ea` — RAII `BufferIndexHolder` for `requestSync` — stop leaking pool slots on non-happy exits
12. `28f836e19` — RAII `BufferIndexHolder` on sender formatters + reqId-tagged `[buf]` logs
13. `6f85ce895` — Send-side RAII: atomic `release()`, cancel parity, FUTURE_WAIT/JOIN diagnostics
14. `649d1466b` — Carry `std::shared_ptr<LlmRequest>` through transceiver — close UAF, remove Path A/B guards, unwedge KV block leak
15. `abfcb6d39` — Harden drain worker loop against non-`std::exception` escapes
16. `c9777c4ac` — Re-add `is_disagg_context_complete_state` guard to `_handle_responses`

Base
Branch is linear on top of `v1.3.0rc11` (`4e69c14f732a6e6afce4f71616db5b5cd2b10530`) — chosen so it can be built + tested against the `tensorrt-llm==1.3.0rc11` wheel baked into NVIDIA NGC's `dynamo-trtllm` images. 16 commits on top of that tag.