[https://nvbugs/6104831][fix] Enforce request and buffer index lifecycle integrity #14768

Draft
chienchunhung wants to merge 5 commits into NVIDIA:main from chienchunhung:nvbug6104831-tier-always-on-verify

@chienchunhung chienchunhung commented May 30, 2026

Summary

First step in landing the disaggregated KV cache transfer hardening work that has been iterating in PR #13713. This PR cherry-picks the subset of those changes that close pre-existing correctness and reliability hazards which exist today independent of in-flight cancellation. The cancellation surface and the related KV-block use-after-free fix are deferred to a follow-up PR.

This split lets the baseline land cleanly while the cancel-surface (a much larger blast radius) goes through review on top of a settled foundation.

What this PR changes and why

All changes are direct ports from PR #13713. Each addresses an issue that exists today regardless of whether in-flight cancellation is enabled.

C++

mSenderFutures / mRequesterFutures hold std::shared_ptr<LlmRequest> instead of raw LlmRequest*.

Issue resolved: the async KV-transfer worker can outlive the request's scheduling lifetime. When the executor terminates the request and frees its LlmRequest while the worker thread is still mid-transfer, the worker dereferences a dangling raw pointer. This use-after-free manifests as rare crashes or KV-cache corruption under load. Shared ownership keeps the request alive until both the scheduler and the worker are done with it. Affects the three transfer entry points (respondAndSendAsync, requestAndReceiveSync, requestAndReceiveAsync) plus cancelRequest.
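
A minimal sketch of the ownership change described above, not the PR's actual code. `Request`, `FutureEntry`, `sendAsync`, and `runDemo` are hypothetical stand-ins; the only point is that a container entry and a worker that each hold a `std::shared_ptr` keep the object valid after the scheduler drops its own reference.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for LlmRequest; only what the sketch needs.
struct Request
{
    explicit Request(int requestId) : id(requestId) {}
    int id;
    std::string state = "in_progress";
};

// Hypothetical entry type mirroring the idea behind mSenderFutures: the entry
// co-owns the request, so the request stays valid until the entry is erased.
using FutureEntry = std::pair<std::shared_ptr<Request>, std::future<void>>;

// Starts a fake transfer. The worker lambda captures its own shared_ptr copy,
// so it never touches freed memory even if every other owner goes away.
inline FutureEntry sendAsync(std::shared_ptr<Request> request)
{
    auto future = std::async(std::launch::async, [request] { request->state = "transferred"; });
    return {std::move(request), std::move(future)};
}

// Simulates the scheduler releasing the request while the transfer is pending.
inline std::string runDemo()
{
    auto request = std::make_shared<Request>(7);
    std::vector<FutureEntry> senderFutures;
    senderFutures.push_back(sendAsync(request));
    request.reset(); // scheduler is done; the entry and the worker still own it

    auto& [owned, future] = senderFutures.front();
    future.get(); // wait for the worker to finish
    assert(owned != nullptr);
    return owned->state;
}
```

With a raw `Request*` in the container, the `request.reset()` line would destroy the object and both the worker and the later `owned->state` read would be undefined behavior.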

NIXL agent kept alive while its TransferStatus exists.

Issue resolved: the NIXL agent can be torn down (e.g., during executor recycle or shutdown) while a pending TransferStatus still holds a non-owning reference to it. Subsequent operations on the TransferStatus dereference freed agent memory. Extending the agent's lifetime through shared ownership closes the hazard.
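
The lifetime extension can be illustrated with a small sketch. It assumes the status handle simply co-owns the agent through a `std::shared_ptr`; `Agent`, `TransferStatus`, `poll`, and `statusOutlivesOwner` here are hypothetical types and functions written for illustration, not the NIXL or TensorRT-LLM classes.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical agent; the external counter lets us observe its destruction.
struct Agent
{
    explicit Agent(int& counter) : liveCount(counter) { ++liveCount; }
    ~Agent() { --liveCount; }
    Agent(Agent const&) = delete;
    Agent& operator=(Agent const&) = delete;

    bool poll() const { return true; } // pretend the transfer has finished

    int& liveCount;
};

// Hypothetical status handle that co-owns its agent instead of borrowing it.
class TransferStatus
{
public:
    explicit TransferStatus(std::shared_ptr<Agent> agent) : mAgent(std::move(agent))
    {
        assert(mAgent != nullptr);
    }

    bool isCompleted() const { return mAgent->poll(); }

private:
    std::shared_ptr<Agent> mAgent; // keeps the agent alive as long as the status exists
};

// Returns true if the agent is still alive and usable after its original
// owner has dropped its reference.
inline bool statusOutlivesOwner(int& liveCount)
{
    auto agent = std::make_shared<Agent>(liveCount);
    TransferStatus status(agent);
    agent.reset(); // e.g. executor shutdown drops its reference
    return liveCount == 1 && status.isCompleted();
}
```

The agent is destroyed only when the last `TransferStatus` referring to it goes out of scope, which is when `statusOutlivesOwner` returns.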

New BufferIndexHolder RAII class in baseTransBuffer.{h,cpp}, wired at the four buffer-index acquisition sites in cacheFormatter.cpp and mlaCacheFormatter.cpp (send + recv).

Issue resolved: transfer buffer-slot indices acquired during setup are not released when a transfer throws partway through (transient I/O error, peer disconnect, allocation failure). Each leaked slot reduces the pool's usable capacity until the pool is effectively exhausted and new transfers are refused or hang. The RAII holder releases held indices on destruction, whether the function returns normally or throws. The explicit freeBufferIndexFor{Send,Recv} calls are replaced with holder.release() on the happy path; the destructor covers all other exit paths.
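
A rough sketch of the RAII pattern described above, under stated assumptions: `SlotPool` and `IndexHolder` are hypothetical simplifications and do not reproduce the actual BaseTransBufferManager or BufferIndexHolder interfaces. It shows only that a slot taken before a throw is returned to the pool by the destructor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical fixed-size slot pool standing in for the buffer manager.
class SlotPool
{
public:
    explicit SlotPool(std::size_t capacity) : mInUse(capacity, false) {}

    std::optional<std::size_t> acquire()
    {
        for (std::size_t i = 0; i < mInUse.size(); ++i)
        {
            if (!mInUse[i]) { mInUse[i] = true; return i; }
        }
        return std::nullopt; // pool exhausted
    }

    void free(std::size_t index) noexcept
    {
        assert(index < mInUse.size() && mInUse[index]);
        mInUse[index] = false;
    }

    std::size_t available() const
    {
        return static_cast<std::size_t>(std::count(mInUse.begin(), mInUse.end(), false));
    }

private:
    std::vector<bool> mInUse;
};

// Move-only holder: returns its slot to the pool exactly once.
class IndexHolder
{
public:
    IndexHolder(SlotPool& pool, std::optional<std::size_t> index) : mPool(&pool), mIndex(index) {}
    IndexHolder(IndexHolder&& other) noexcept
        : mPool(other.mPool), mIndex(std::exchange(other.mIndex, std::nullopt)) {}
    IndexHolder(IndexHolder const&) = delete;
    IndexHolder& operator=(IndexHolder const&) = delete;
    IndexHolder& operator=(IndexHolder&&) = delete;
    ~IndexHolder() { release(); } // covers throw and early-return paths

    void release() noexcept
    {
        if (mIndex) { mPool->free(*mIndex); mIndex.reset(); }
    }

private:
    SlotPool* mPool;
    std::optional<std::size_t> mIndex;
};

// Fake transfer that can fail after the slot has been acquired.
inline void transfer(SlotPool& pool, bool fail)
{
    IndexHolder holder(pool, pool.acquire());
    if (fail) { throw std::runtime_error("peer disconnected"); }
    holder.release(); // happy path: release explicitly, destructor becomes a no-op
}

inline std::size_t availableAfterFailure()
{
    SlotPool pool(2);
    try { transfer(pool, true); } catch (std::runtime_error const&) {}
    return pool.available();
}
```

Because `release()` clears the stored index, calling it on the happy path and then running the destructor frees the slot only once.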

Observe-only KV-transfer timeout WARN in checkContextTransferStatus / checkGenTransferStatus, plus the mTimedOutSenderIds / mTimedOutRequesterIds dedup sets that suppress duplicate emissions.

Issue resolved: today a wedged disagg KV transfer (peer crash, network partition, slow disk) is silent — the request blocks indefinitely with no operator-visible signal identifying which request is stuck or for how long. The new WARN gives operators a triage anchor. Strictly diagnostic: no state transition, no eviction, no cancellation triggered. Only emitted when kv_transfer_timeout_ms is configured by the user.
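
One way to picture the observe-only warning with deduplication is the sketch below. `TransferTimeoutMonitor` and its members are hypothetical names; the real logic lives in checkContextTransferStatus / checkGenTransferStatus and logs through the project's own logger, whereas this sketch just collects messages in a vector. It changes no request state, which is the property the description emphasizes.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical monitor: records at most one warning per request while it is stuck.
class TransferTimeoutMonitor
{
public:
    // std::nullopt models "kv_transfer_timeout_ms not configured": never warn.
    explicit TransferTimeoutMonitor(std::optional<std::chrono::milliseconds> timeout)
        : mTimeout(timeout) {}

    // Returns true if this poll emitted a warning. Purely diagnostic.
    bool observe(std::uint64_t requestId, Clock::time_point start, Clock::time_point now)
    {
        assert(now >= start);
        if (!mTimeout || now - start <= *mTimeout) { return false; }
        if (!mWarnedIds.insert(requestId).second) { return false; } // already warned
        auto const elapsedMs
            = std::chrono::duration_cast<std::chrono::milliseconds>(now - start).count();
        mWarnings.push_back("request " + std::to_string(requestId)
            + ": KV transfer still pending after " + std::to_string(elapsedMs) + " ms");
        return true;
    }

    // Call on completion or error so a re-enqueued request can warn again.
    void forget(std::uint64_t requestId) { mWarnedIds.erase(requestId); }

    std::vector<std::string> const& warnings() const { return mWarnings; }

private:
    std::optional<std::chrono::milliseconds> mTimeout;
    std::unordered_set<std::uint64_t> mWarnedIds; // dedup set
    std::vector<std::string> mWarnings;
};
```

Passing the timestamps in as arguments keeps the sketch deterministic; a real poll loop would read the steady clock itself.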

What this PR explicitly excludes (deferred to follow-up)

No in-flight cancellation logic of any kind. The following pieces of PR #13713 are specifically excluded to keep this PR's blast radius small:

  • TRTLLM_DISAGG_ENABLE_INFLIGHT_CANCEL env var, is_disagg_inflight_cancel_enabled(), getEnvDisaggEnableInflightCancel(), and every code path they gated.
  • cancel_request mid-flight surface in kv_cache_transceiver.py.
  • sendHolder.poison(), mInFlightCancelFlags, has_poisoned_transfer_buffer, per-request cancel-flag propagation through AgentConnection::send/recv and sendRequestInfo.
  • Deadline-driven eviction from mSenderFutures / mRequesterFutures.
  • Deferred-cleanup paths (_can_terminate_request_now, _handle_errors deferred-termination).
  • Promise idempotency (catch (std::future_error const&) around mPromise->set_exception).
  • Python recv-side dedup sets (_disagg_gen_init_prepared_ids, _disagg_gen_kv_recv_started_ids). Those exist to make recv idempotent under the cancel-throw retry pattern; without that pattern in scope, the scheduler's state-based filter is sufficient.
  • Reordering of setState / receiveAsync / emplace in requestAndReceiveAsync (kept upstream ordering: receiveAsync → emplace → setState).
  • Unconditional setKvCacheTransferStart(LlmRequest::getSteadyClockNow()) scaffolding in requestAndReceiveAsync.

Also excluded for now: rank-symmetric collective entry on the gen and ctx sides. PR #13713 removed the if need_check: / if not recv_reqs: return early exits in _check_disagg_gen_transfer_status and the if num_fitting_reqs == 0 ...: gate in _executor_loop_pp / _executor_loop so that every rank enters the downstream C++ gatherRequestIds Allgather symmetrically. This PR reverts those removals to reduce the delta against upstream/main and to isolate any cross-rank divergence they introduce for separate verification. They will be reintroduced in a follow-up if CI shows they are needed.

Exclusion verification (zero matches inside this PR's diff):

is_disagg_inflight_cancel_enabled | TRTLLM_DISAGG_ENABLE_INFLIGHT_CANCEL |
getEnvDisaggEnableInflightCancel | sendHolder.poison | mInFlightCancelFlags |
has_poisoned_transfer_buffer | _disagg_gen_kv_recv_started_ids |
_disagg_gen_init_prepared_ids

Follow-up

A subsequent PR will introduce the in-flight cancellation surface end-to-end, gated behind the TRTLLM_DISAGG_ENABLE_INFLIGHT_CANCEL opt-in, including:

  • The KV-block use-after-free fix that originally motivated NVBug 6104831.
  • The Python recv-side dedup sets needed for cancel-throw retry idempotency.
  • The C++ deadline-driven eviction + poison + fail-closed path.

Splitting the work this way isolates regression risk for the cancel surface to a single opt-in change on top of a settled baseline.

…exists

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
…ctx-side checkContextTransferStatus

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
…cle integrity.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung chienchunhung requested review from a team as code owners May 30, 2026 00:35
@chienchunhung chienchunhung changed the title from "[https://nvbugs/6104831][fix] Tier-1 always-on baseline (A1+A2+A4+A7+A8+A9+A10)" to "[https://nvbugs/6104831][fix] Enforce request and buffer index lifecycle integrity" on May 30, 2026
coderabbitai Bot commented May 30, 2026

Walkthrough

This PR migrates the cache transceiver request methods from raw pointer arguments to std::shared_ptr<LlmRequest> for better lifetime management, adds observe-only KV cache transfer timeout warnings, introduces a buffer holder RAII utility, and fixes rank-asymmetric deadlock risks in disaggregated serving.

Changes

Cache Transceiver and Disaggregated Safety

| Layer | File(s) | Summary |
| --- | --- | --- |
| Cache Transceiver Interface Contract | cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h | BaseCacheTransceiver and CacheTransceiver virtual/override methods (respondAndSendAsync, requestAndReceiveSync, requestAndReceiveAsync, cancelRequest) change from LlmRequest* to std::shared_ptr<LlmRequest>. Internal mSenderFutures and mRequesterFutures containers updated to store shared pointers alongside futures to maintain request lifetime through async operations. |
| Cache Transceiver Implementation | cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp | Method bodies for respondAndSendAsync, requestAndReceiveSync, requestAndReceiveAsync, and cancelRequest adapted to accept shared pointers and move them into future containers while preserving request state access for completion/error paths. |
| KV Cache Transfer Timeout Monitoring | cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp | Both context and generation transfer status checks gain observe-only deadline mechanism (kvTransferTimeoutMs config). Deduplicated per-request WARN logging emitted when elapsed time exceeds budget; timeout entries erased on completion/error to allow re-warning on re-enqueue. |
| Call Sites and Binding Updates | cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp, cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp, cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/agentBindings.cpp | trtGptModelInflightBatching passes shared pointers directly to respondAndSendAsync and receive methods. Nanobind PyCacheTransceiver overrides and BaseTransferAgent/NixlTransferAgent bindings updated with new shared pointer signatures and keep_alive<0, 1>() lifetime policies. |
| Buffer Index Holder RAII Utility | cpp/tensorrt_llm/batch_manager/baseTransBuffer.h, cpp/tensorrt_llm/batch_manager/baseTransBuffer.cpp | Move-only BufferIndexHolder class added to own and auto-release buffer slots from BaseTransBufferManager, with exception-safe out-of-line release() and detach() for responsibility transfer. |
| Disaggregated Cache Transfer Deadlock Prevention | tensorrt_llm/_torch/pyexecutor/py_executor.py | Python executor's context and generation cache transfer status checks compute rank-local at_least_num and unconditionally invoke collective operations, removing rank-asymmetric branching that could cause ABBA deadlocks with downstream gatherRequestIds collectives. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • Shixiaowei02
  • pcastonguay
  • chuangz0
  • joyang-nv
  • nv-guomingz
  • syuoni
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The PR title '[https://nvbugs/6104831][fix] Enforce request and buffer index lifecycle integrity' is specific and directly relates to the primary changes, which focus on request ownership using shared pointers and buffer index management via the new BufferIndexHolder RAII class. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, following the template with clear sections on summary, what changed and why, what's excluded, and follow-up plans. |



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`:
- Around line 555-565: The timeout logging is computed using now - start before
verifying the transfer future completed, causing polling delay to be reported as
a timeout; update the polling/inspection logic (the loop that reads the transfer
future and emits WARNs) so you first check the transfer's completion/readiness
(e.g., future::wait_for(0) or equivalent) and skip timeout calculation/logging
if the future is already ready, then only compute elapsed = now - start and
consult kvTransferTimeoutMs and mTimedOutSenderIds when the transfer remains
incomplete; apply the same fix to the other similar blocks referenced around the
622-640 and 808-834 regions.
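
The ordering the review comment asks for can be sketched as follows. This is an illustration under assumptions, not the code in cacheTransceiver.cpp: `PollResult` and `pollTransfer` are hypothetical names, and the sketch assumes the future comes from a promise or an asynchronously launched task rather than a deferred one.

```cpp
#include <cassert>
#include <chrono>
#include <future>

using Clock = std::chrono::steady_clock;

enum class PollResult
{
    Completed,
    Pending,
    PendingTimedOut
};

// Checks readiness first, so a transfer that finished between two polls is
// never reported as timed out just because the poll itself ran late.
inline PollResult pollTransfer(std::future<void>& transfer, Clock::time_point start,
    Clock::time_point now, std::chrono::milliseconds timeout)
{
    assert(transfer.valid());
    if (transfer.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
    {
        transfer.get(); // rethrows a stored exception, if any
        return PollResult::Completed;
    }
    // Only a transfer that is still incomplete is compared against the budget.
    return (now - start > timeout) ? PollResult::PendingTimedOut : PollResult::Pending;
}
```

After `Completed` is returned the future is consumed, so a caller would erase the entry rather than poll it again.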
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8433b8e6-f940-404e-99bb-f9f6e4f6d3ae

📥 Commits

Reviewing files that changed from the base of the PR and between 74d7c3a and c06e70c.

📒 Files selected for processing (8)
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/batch_manager/baseTransBuffer.cpp
  • cpp/tensorrt_llm/batch_manager/baseTransBuffer.h
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/agentBindings.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py

Comment thread cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp Outdated
@chienchunhung chienchunhung marked this pull request as draft May 30, 2026 00:48
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
…ve entry

Restores the pre-A9/A10 behavior in _check_disagg_gen_transfer_status,
_prepare_disagg_gen_init, _executor_loop_pp, and _executor_loop to
match upstream/main. Reduces this PR's delta vs main and isolates the
rank-symmetric collective-entry changes for separate verification.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #51140 [ run ] triggered by Bot. Commit: 5dfa4e9

@tensorrt-cicd
Collaborator

PR_Github #51140 [ run ] completed with state SUCCESS. Commit: 5dfa4e9
/LLM/main/L0_MergeRequest_PR pipeline #40576 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again
