[https://nvbugs/6104831][fix] Enforce request and buffer index lifecycle integrity by chienchunhung · Pull Request #14768 · NVIDIA/TensorRT-LLM

chienchunhung · 2026-05-30T00:35:02Z

Summary

First step in landing the disaggregated KV cache transfer hardening work that has been iterating in PR #13713. This PR cherry-picks the subset of those changes that closes correctness and reliability hazards that exist today, independent of in-flight cancellation. The cancellation surface and the related KV-block use-after-free fix are deferred to a follow-up PR.

This split lets the baseline land cleanly while the cancel-surface (a much larger blast radius) goes through review on top of a settled foundation.

What this PR changes and why

All changes are direct ports from PR #13713. Each addresses an issue that exists today regardless of whether in-flight cancellation is enabled.

C++

mSenderFutures / mRequesterFutures hold std::shared_ptr<LlmRequest> instead of raw LlmRequest*.

Issue resolved: the async KV-transfer worker can outlive the request's scheduling lifetime. When the executor terminates the request and frees its LlmRequest while the worker thread is still mid-transfer, the worker dereferences a dangling raw pointer. This use-after-free manifests as rare crashes or KV-cache corruption under load. Shared ownership keeps the request alive until both the scheduler and the worker are done with it. Affects the three transfer entry points (respondAndSendAsync, requestAndReceiveSync, requestAndReceiveAsync) plus cancelRequest.
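A minimal sketch of the ownership change described above, not the PR's actual code. `Request`, `FutureTable`, and `submitThenDrop` are hypothetical stand-ins for `LlmRequest`, the `mSenderFutures` container, and a caller. The sketch only assumes that the container stores a `std::shared_ptr` next to each future.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for LlmRequest; only the id matters here.
struct Request
{
    explicit Request(int id)
        : requestId(id)
    {
    }

    int requestId;
};

// Hypothetical stand-in for the transceiver's future container.
class FutureTable
{
public:
    // Taking the shared_ptr by value gives the table and the worker their own references.
    void sendAsync(std::shared_ptr<Request> request)
    {
        assert(request != nullptr);
        auto future = std::async(std::launch::async, [request]() { return request->requestId; });
        mFutures.emplace_back(std::move(request), std::move(future));
    }

    // Waits for every worker and returns the sum of the ids they read.
    int drain()
    {
        int sum = 0;
        for (auto& entry : mFutures)
        {
            sum += entry.second.get();
        }
        mFutures.clear();
        return sum;
    }

private:
    std::vector<std::pair<std::shared_ptr<Request>, std::future<int>>> mFutures;
};

// The caller drops its reference right after submitting; the request must stay alive.
inline int submitThenDrop()
{
    FutureTable table;
    auto request = std::make_shared<Request>(7);
    std::weak_ptr<Request> observer = request;
    table.sendAsync(request);
    request.reset(); // the "scheduler" lets go of the request
    bool const stillAlive = !observer.expired();
    int const sum = table.drain();
    return stillAlive ? sum : -1;
}
```

With a raw pointer in place of the `shared_ptr`, the `request.reset()` call would leave the worker reading freed memory; here the table's reference keeps the object valid until `drain()` clears it.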

NIXL agent kept alive while its TransferStatus exists.

Issue resolved: the NIXL agent can be torn down (e.g., during executor recycle or shutdown) while a pending TransferStatus still holds a non-owning reference to it. Subsequent operations on the TransferStatus dereference freed agent memory. Shared-ownership extension of the agent's lifetime closes the hazard.
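The general idea can be sketched as a status object that co-owns the agent that produced it. This is an illustration under assumptions: `Agent`, `TransferStatus`, `submit`, and `pollCompleted` below are hypothetical names, and the PR may achieve the same effect through binding-level lifetime policies instead of a member `shared_ptr`.

```cpp
#include <cassert>
#include <memory>
#include <utility>

class Agent;

// Hypothetical status handle: it co-owns the agent instead of pointing at it.
class TransferStatus
{
public:
    explicit TransferStatus(std::shared_ptr<Agent> agent);
    bool isCompleted() const;

private:
    std::shared_ptr<Agent> mAgent;
};

// Hypothetical agent; created only through create() so shared_from_this() is valid.
class Agent : public std::enable_shared_from_this<Agent>
{
public:
    static std::shared_ptr<Agent> create()
    {
        return std::shared_ptr<Agent>(new Agent());
    }

    std::unique_ptr<TransferStatus> submit()
    {
        return std::make_unique<TransferStatus>(shared_from_this());
    }

    // Placeholder for a real completion query.
    bool pollCompleted() const
    {
        return true;
    }

private:
    Agent() = default;
};

inline TransferStatus::TransferStatus(std::shared_ptr<Agent> agent)
    : mAgent(std::move(agent))
{
    assert(mAgent != nullptr);
}

inline bool TransferStatus::isCompleted() const
{
    return mAgent->pollCompleted();
}

// The owner releases the agent first; the pending status keeps it alive.
inline bool statusOutlivesOwner()
{
    auto agent = Agent::create();
    std::weak_ptr<Agent> observer = agent;
    auto status = agent->submit();
    agent.reset(); // simulated executor recycle or shutdown
    bool const aliveWhileStatusExists = !observer.expired();
    bool const completed = status->isCompleted();
    status.reset(); // last owner gone, agent is destroyed here
    return aliveWhileStatusExists && completed && observer.expired();
}
```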

New BufferIndexHolder RAII class in baseTransBuffer.{h,cpp}, wired at the four buffer-index acquisition sites in cacheFormatter.cpp and mlaCacheFormatter.cpp (send + recv).

Issue resolved: transfer buffer-slot indices acquired during setup are not released when a transfer throws partway through (transient I/O error, peer disconnect, allocation failure). Each leaked slot reduces the pool's usable capacity until the pool is effectively exhausted and new transfers are refused or hang. The RAII holder releases held indices on destruction, whether the scope exits by normal return or by exception. The explicit freeBufferIndexFor{Send,Recv} calls are replaced with holder.release() on the happy path; the destructor covers all other exit paths.
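A minimal sketch of the RAII pattern described above, assuming a simplified pool. `SlotPool` and `IndexHolder` are hypothetical stand-ins for `BaseTransBufferManager` and `BufferIndexHolder`; the real class's members and signatures may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical slot pool standing in for the transfer buffer manager.
class SlotPool
{
public:
    explicit SlotPool(int numSlots)
        : mInUse(static_cast<std::size_t>(numSlots), false)
    {
    }

    // Returns the first free slot, or nullopt when the pool is exhausted.
    std::optional<int> acquire()
    {
        for (std::size_t i = 0; i < mInUse.size(); ++i)
        {
            if (!mInUse[i])
            {
                mInUse[i] = true;
                return static_cast<int>(i);
            }
        }
        return std::nullopt;
    }

    void free(int index) noexcept
    {
        mInUse[static_cast<std::size_t>(index)] = false;
    }

    int numFree() const
    {
        return static_cast<int>(std::count(mInUse.begin(), mInUse.end(), false));
    }

private:
    std::vector<bool> mInUse;
};

// Move-only holder: frees its slot on destruction unless released or detached first.
class IndexHolder
{
public:
    IndexHolder(SlotPool& pool, std::optional<int> index)
        : mPool(&pool)
        , mIndex(index)
    {
    }

    IndexHolder(IndexHolder const&) = delete;
    IndexHolder& operator=(IndexHolder const&) = delete;

    IndexHolder(IndexHolder&& other) noexcept
        : mPool(other.mPool)
        , mIndex(other.detach())
    {
    }

    IndexHolder& operator=(IndexHolder&& other) noexcept
    {
        if (this != &other)
        {
            release();
            mPool = other.mPool;
            mIndex = other.detach();
        }
        return *this;
    }

    ~IndexHolder()
    {
        release();
    }

    // Returns the slot to the pool now; safe to call more than once.
    void release() noexcept
    {
        if (mIndex)
        {
            mPool->free(*mIndex);
            mIndex.reset();
        }
    }

    // Hands responsibility for the slot to the caller without freeing it.
    std::optional<int> detach() noexcept
    {
        return std::exchange(mIndex, std::nullopt);
    }

private:
    SlotPool* mPool;
    std::optional<int> mIndex;
};

// A transfer that throws after acquiring a slot must not leak it.
inline int freeSlotsAfterFailedTransfer()
{
    SlotPool pool(2);
    try
    {
        IndexHolder holder(pool, pool.acquire());
        throw std::runtime_error("simulated peer disconnect");
    }
    catch (std::runtime_error const&)
    {
        // The holder's destructor already ran during stack unwinding.
    }
    return pool.numFree();
}
```

On the happy path a caller would invoke `holder.release()` explicitly at the point where the old code called the free function; the destructor then has nothing left to do.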

Observe-only KV-transfer timeout WARN in checkContextTransferStatus / checkGenTransferStatus, plus the mTimedOutSenderIds / mTimedOutRequesterIds dedup sets that suppress duplicate emissions.

Issue resolved: today a wedged disagg KV transfer (peer crash, network partition, slow disk) is silent — the request blocks indefinitely with no operator-visible signal identifying which request is stuck or for how long. The new WARN gives operators a triage anchor. Strictly diagnostic: no state transition, no eviction, no cancellation triggered. Only emitted when kv_transfer_timeout_ms is configured by the user.
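The observe-only behaviour can be sketched as follows. `TimeoutObserver` and its members are hypothetical; the sketch only assumes what the description states: a warning is emitted at most once per request, nothing else changes, and no warning is produced when no budget is configured.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <optional>
#include <unordered_set>

using Clock = std::chrono::steady_clock;

// Hypothetical observer: logs once per request, never changes request state.
class TimeoutObserver
{
public:
    explicit TimeoutObserver(std::optional<std::chrono::milliseconds> budget)
        : mBudget(budget)
    {
    }

    // Returns true only when this call emitted a WARN.
    bool observe(std::uint64_t requestId, Clock::time_point start, Clock::time_point now)
    {
        if (!mBudget || now - start <= *mBudget)
        {
            return false; // no budget configured, or still within budget
        }
        if (!mWarned.insert(requestId).second)
        {
            return false; // already warned for this request
        }
        std::cerr << "[WARN] KV transfer for request " << requestId << " exceeded " << mBudget->count()
                  << " ms\n";
        return true;
    }

    // Forget the request on completion or error so a re-enqueue can warn again.
    void onFinished(std::uint64_t requestId)
    {
        mWarned.erase(requestId);
    }

private:
    std::optional<std::chrono::milliseconds> mBudget;
    std::unordered_set<std::uint64_t> mWarned; // plays the role of the dedup sets
};

// Polls one request four times against a 100 ms budget and counts the warnings.
inline int countWarnings()
{
    TimeoutObserver observer(std::chrono::milliseconds(100));
    Clock::time_point const start{};
    int warnings = 0;
    warnings += observer.observe(1, start, start + std::chrono::milliseconds(50));  // within budget
    warnings += observer.observe(1, start, start + std::chrono::milliseconds(150)); // first overrun
    warnings += observer.observe(1, start, start + std::chrono::milliseconds(250)); // deduplicated
    observer.onFinished(1);
    warnings += observer.observe(1, start, start + std::chrono::milliseconds(350)); // re-armed
    return warnings;
}

// With no budget configured, nothing is ever reported.
inline bool silentWithoutBudget()
{
    TimeoutObserver observer(std::nullopt);
    Clock::time_point const start{};
    return !observer.observe(1, start, start + std::chrono::hours(1));
}
```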

What this PR explicitly excludes (deferred to follow-up)

No in-flight cancellation logic of any kind. Specifically excluded from PR #13713 to keep this PR's blast radius small:

TRTLLM_DISAGG_ENABLE_INFLIGHT_CANCEL env var, is_disagg_inflight_cancel_enabled(), getEnvDisaggEnableInflightCancel(), and every code path they gated.
cancel_request mid-flight surface in kv_cache_transceiver.py.
sendHolder.poison(), mInFlightCancelFlags, has_poisoned_transfer_buffer, per-request cancel-flag propagation through AgentConnection::send/recv and sendRequestInfo.
Deadline-driven eviction from mSenderFutures / mRequesterFutures.
Deferred-cleanup paths (_can_terminate_request_now, _handle_errors deferred-termination).
Promise idempotency (catch (std::future_error const&) around mPromise->set_exception).
Python recv-side dedup sets (_disagg_gen_init_prepared_ids, _disagg_gen_kv_recv_started_ids). Those exist to make recv idempotent under the cancel-throw retry pattern; without that pattern in scope, the scheduler's state-based filter is sufficient.
Reordering of setState / receiveAsync / emplace in requestAndReceiveAsync (kept upstream ordering: receiveAsync → emplace → setState).
Unconditional setKvCacheTransferStart(LlmRequest::getSteadyClockNow()) scaffolding in requestAndReceiveAsync.

Also excluded for now: rank-symmetric collective entry on the gen and ctx sides. PR #13713 removed the if need_check: / if not recv_reqs: return early-exits in _check_disagg_gen_transfer_status and the if num_fitting_reqs == 0 ...: gate in _executor_loop_pp / _executor_loop to ensure every rank enters the downstream C++ gatherRequestIds Allgather symmetrically. Those removals are reverted in this PR to reduce the delta against upstream/main and to isolate any cross-rank divergence they introduce for separate verification. They will be reintroduced in a follow-up if CI shows they are needed.

Exclusion verification (zero matches inside this PR's diff):

is_disagg_inflight_cancel_enabled | TRTLLM_DISAGG_ENABLE_INFLIGHT_CANCEL |
getEnvDisaggEnableInflightCancel | sendHolder.poison | mInFlightCancelFlags |
has_poisoned_transfer_buffer | _disagg_gen_kv_recv_started_ids |
_disagg_gen_init_prepared_ids

Follow-up

A subsequent PR will introduce the in-flight cancellation surface end-to-end, gated behind the TRTLLM_DISAGG_ENABLE_INFLIGHT_CANCEL opt-in, including:

The KV-block use-after-free fix that originally motivated NVBug 6104831.
The Python recv-side dedup sets needed for cancel-throw retry idempotency.
The C++ deadline-driven eviction + poison + fail-closed path.

Splitting the work this way isolates regression risk for the cancel surface to a single opt-in change on top of a settled baseline.


coderabbitai · 2026-05-30T00:43:14Z

📝 Walkthrough

Walkthrough

This PR migrates the cache transceiver request methods from raw pointer arguments to std::shared_ptr<LlmRequest> for better lifetime management, adds observe-only KV cache transfer timeout warnings, introduces a buffer holder RAII utility, and fixes rank-asymmetric deadlock risks in disaggregated serving.

Changes

Cache Transceiver and Disaggregated Safety

| Layer | File(s) | Summary |
| --- | --- | --- |
| Cache Transceiver Interface Contract | `cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h` | `BaseCacheTransceiver` and `CacheTransceiver` virtual/override methods (`respondAndSendAsync`, `requestAndReceiveSync`, `requestAndReceiveAsync`, `cancelRequest`) change from `LlmRequest*` to `std::shared_ptr<LlmRequest>`. Internal `mSenderFutures` and `mRequesterFutures` containers updated to store shared pointers alongside futures to maintain request lifetime through async operations. |
| Cache Transceiver Implementation | `cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp` | Method bodies for `respondAndSendAsync`, `requestAndReceiveSync`, `requestAndReceiveAsync`, and `cancelRequest` adapted to accept shared pointers and move them into future containers while preserving request state access for completion/error paths. |
| KV Cache Transfer Timeout Monitoring | `cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp` | Both context and generation transfer status checks gain observe-only deadline mechanism (`kvTransferTimeoutMs` config). Deduplicated per-request WARN logging emitted when elapsed time exceeds budget; timeout entries erased on completion/error to allow re-warning on re-enqueue. |
| Call Sites and Binding Updates | `cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp`, `cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp`, `cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/agentBindings.cpp` | `trtGptModelInflightBatching` passes shared pointers directly to `respondAndSendAsync` and receive methods. Nanobind `PyCacheTransceiver` overrides and `BaseTransferAgent`/`NixlTransferAgent` bindings updated with new shared pointer signatures and `keep_alive<0, 1>()` lifetime policies. |
| Buffer Index Holder RAII Utility | `cpp/tensorrt_llm/batch_manager/baseTransBuffer.h`, `cpp/tensorrt_llm/batch_manager/baseTransBuffer.cpp` | Move-only `BufferIndexHolder` class added to own and auto-release buffer slots from `BaseTransBufferManager`, with exception-safe out-of-line `release()` and `detach()` for responsibility transfer. |
| Disaggregated Cache Transfer Deadlock Prevention | `tensorrt_llm/_torch/pyexecutor/py_executor.py` | Python executor's context and generation cache transfer status checks compute rank-local `at_least_num` and unconditionally invoke collective operations, removing rank-asymmetric branching that could cause ABBA deadlocks with downstream `gatherRequestIds` collectives. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

Shixiaowei02
pcastonguay
chuangz0
joyang-nv
nv-guomingz
syuoni

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The PR title '[https://nvbugs/6104831][fix] Enforce request and buffer index lifecycle integrity' is specific and directly relates to the primary changes, which focus on request ownership using shared pointers and buffer index management via the new BufferIndexHolder RAII class. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, following the template with clear sections on summary, what changed and why, what's excluded, and follow-up plans. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`:
- Around line 555-565: The timeout logging is computed using now - start before
verifying the transfer future completed, causing polling delay to be reported as
a timeout; update the polling/inspection logic (the loop that reads the transfer
future and emits WARNs) so you first check the transfer's completion/readiness
(e.g., future::wait_for(0) or equivalent) and skip timeout calculation/logging
if the future is already ready, then only compute elapsed = now - start and
consult kvTransferTimeoutMs and mTimedOutSenderIds when the transfer remains
incomplete; apply the same fix to the other similar blocks referenced around the
622-640 and 808-834 regions.
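The ordering this comment asks for can be sketched as below: check readiness first, and consult the budget only for transfers that are still pending. `PollResult` and `pollTransfer` are hypothetical names for illustration, not code from `cacheTransceiver.cpp`.

```cpp
#include <cassert>
#include <chrono>
#include <future>

enum class PollResult
{
    kDone,
    kPending,
    kOverBudget
};

// Hypothetical helper: readiness is checked before any elapsed-time arithmetic.
inline PollResult pollTransfer(std::future<void> const& transfer, std::chrono::steady_clock::time_point start,
    std::chrono::steady_clock::time_point now, std::chrono::milliseconds budget)
{
    assert(transfer.valid());
    if (transfer.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
    {
        return PollResult::kDone; // finished transfers are never reported as timed out
    }
    return (now - start > budget) ? PollResult::kOverBudget : PollResult::kPending;
}

// A transfer that is polled late but has already finished must not count as a timeout.
inline bool lateButFinishedIsNotATimeout()
{
    std::promise<void> promise;
    std::future<void> transfer = promise.get_future();
    std::chrono::steady_clock::time_point const start{};
    auto const late = start + std::chrono::milliseconds(500);
    auto const budget = std::chrono::milliseconds(100);

    bool const pendingIsFlagged = pollTransfer(transfer, start, late, budget) == PollResult::kOverBudget;
    promise.set_value(); // the transfer completes before the next poll
    bool const finishedIsNotFlagged = pollTransfer(transfer, start, late, budget) == PollResult::kDone;
    return pendingIsFlagged && finishedIsNotFlagged;
}
```

Note that `wait_for` reports `std::future_status::deferred` for futures created with `std::launch::deferred`, so this check assumes the transfer runs on its own thread or is backed by a promise.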


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8433b8e6-f940-404e-99bb-f9f6e4f6d3ae

📥 Commits

Reviewing files that changed from the base of the PR and between 74d7c3a and c06e70c.

📒 Files selected for processing (8)

cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/baseTransBuffer.cpp
cpp/tensorrt_llm/batch_manager/baseTransBuffer.h
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/agentBindings.cpp
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
tensorrt_llm/_torch/pyexecutor/py_executor.py


…ve entry Restores the pre-A9/A10 behavior in _check_disagg_gen_transfer_status, _prepare_disagg_gen_init, _executor_loop_pp, and _executor_loop to match upstream/main. Reduces this PR's delta vs main and isolates the rank-symmetric collective-entry changes for separate verification. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung · 2026-05-30T02:33:29Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-30T02:39:13Z

PR_Github #51140 [ run ] triggered by Bot. Commit: 5dfa4e9 Link to invocation

tensorrt-cicd · 2026-05-30T11:32:32Z

PR_Github #51140 [ run ] completed with state SUCCESS. Commit: 5dfa4e9
/LLM/main/L0_MergeRequest_PR pipeline #40576 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

chienchunhung added 3 commits May 29, 2026 16:42

[NVBUGS-6104831][fix] keep NIXL agent alive while its TransferStatus …

268200e

…exists Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

[https://nvbugs/6104831][fix] remove rank-asymmetric gates on disagg …

0a6f7b5

…ctx-side checkContextTransferStatus Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

[https://nvbugs/6104831][fix] Enforce request and buffer index lifecy…

c06e70c

…cle integrity. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung requested review from a team as code owners May 30, 2026 00:35

chienchunhung requested review from bo-nv, dongxuy04 and reasonsolo May 30, 2026 00:35

github-actions Bot assigned chienchunhung May 30, 2026

chienchunhung changed the title ~~[https://nvbugs/6104831][fix] Tier-1 always-on baseline (A1+A2+A4+A7+A8+A9+A10)~~ [https://nvbugs/6104831][fix] Enforce request and buffer index lifecycle integrity May 30, 2026

coderabbitai Bot reviewed May 30, 2026


Comment thread cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp Outdated

chienchunhung marked this pull request as draft May 30, 2026 00:48

chienchunhung added 2 commits May 29, 2026 18:20

[https://nvbugs/6104831][fix] Wire BufferIndexHolder at formatter sites

a60ca3b

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung wants to merge 5 commits into NVIDIA:main from chienchunhung:nvbug6104831-tier-always-on-verify
