[https://nvbugs/6104831][fix] Disaggregated KV transfer: lifecycle, cancellation, and quiescence hardening#13713

Draft
chienchunhung wants to merge 8 commits into NVIDIA:main from chienchunhung:pr13056-pr13495-combo-nvbug6104831

Conversation


chienchunhung (Collaborator) commented May 3, 2026


tl;dr

This PR fixes a permanent disaggregated-serving wedge triggered by long-prompt, cancellation-heavy traffic in the PyTorch backend. The root cause is a stack of cleanup/lifetime invariant gaps across C++ KV transfer, Python request termination, NIXL cancellation, and block reuse. The final solution makes disaggregated KV cleanup transport-aware and block-reuse-safe: cancelled requests stop scheduling immediately, but KV resources are only released after C++ transfer status becomes terminal, and deferred termination is retried exactly once when it becomes safe.

Background

Disaggregated serving splits a request across:

  • a context worker that runs prefill and sends KV cache;
  • a generation worker that receives KV cache and decodes tokens;
  • a disaggregated frontend that routes OpenAI-compatible requests.

Under long-prompt bursts with client-side cancellations, the old cleanup path could leave the deployment alive but permanently non-responsive: workers stayed up, health endpoints could still return 200, but every post-burst request timed out.

The investigation found that this was not one bug. It was a stack of independent lifecycle gaps. Fixing one exposed the next.

Failure signatures observed

| Signature | Symptom | Root cause |
|---|---|---|
| #1 | Sender-side `Broken promise` | cancel-after-ready path erased a promise without fulfilling it |
| #4 | Generation event loop blocked forever | `checkGenTransferStatus(atLeastNum=1)` could call `future.get()` on an unready future |
| #5 | Receiver-side `Broken promise` | queued cancel erased the receiver promise without fulfilling it |
| #6 | Receiver buffer-pool wedge | `requestSync()` early-return path leaked a recv-buffer slot |
| #7 | First-request SIGSEGV / mutex-wedge variants | request-lifetime and eval-order bugs after moving from a raw pointer to `shared_ptr<LlmRequest>` |
| rc13 regression | No recovery with block reuse enabled, even after the above fixes | Python deferred termination while a context transfer was in progress, then later skipped cleanup because it assumed early termination had already succeeded |

Correctness model

The central invariant is:

A request can be cancelled at the API level immediately,
but its KV resources cannot be released/reused until:
  1. C++ KV transfer is terminal, or the transfer/buffer state is fail-closed;
  2. transfer buffer ownership has been released or poisoned;
  3. block-reuse/prefix-cache references are accounted for;
  4. Python resource cleanup runs exactly once.

Python request state alone is not sufficient. With block reuse enabled, a cancelled request may still have KV blocks referenced by the reuse tree or pinned by an in-flight context transfer. Freeing those blocks too early risks memory corruption; forgetting to free them later causes a permanent wedge.
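As a rough Python sketch of this release gate (the names `TransferStatus` and `can_release_kv` are illustrative, not the actual TRT-LLM API), KV resources may be freed only when every condition holds:

```python
from enum import Enum, auto

class TransferStatus(Enum):
    # Illustrative tri-state for a request's C++ KV transfer.
    IN_FLIGHT = auto()
    TERMINAL = auto()      # completed, failed, or cancelled-and-quiesced
    FAIL_CLOSED = auto()   # quiescence unknown -> buffers poisoned, never reused

def can_release_kv(status: TransferStatus,
                   buffer_released_or_poisoned: bool,
                   reuse_refcount: int,
                   cleanup_already_ran: bool) -> bool:
    """KV blocks may be freed/reused only when all four invariants hold."""
    return (
        status in (TransferStatus.TERMINAL, TransferStatus.FAIL_CLOSED)
        and buffer_released_or_poisoned
        and reuse_refcount == 0        # no prefix-cache / reuse-tree references
        and not cleanup_already_ran    # cleanup must run exactly once
    )
```

Freeing on Python request state alone would be the buggy pre-fix behavior: the gate above is what the rest of this PR enforces piecewise.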

What this PR changes

1. C++ transfer request lifetime

The C++ transceiver now uses std::shared_ptr<LlmRequest> across async send/receive state so Python-side termination cannot destroy a request object while C++ futures/workers still dereference it.

This closes the UAF class that previously manifested as corrupted request IDs and first-request crashes.

2. Sender/receiver cancellation bookkeeping

Cancellation paths now fulfill promises before dropping queued/ready work. This prevents std::future_error: Broken promise from escaping on both sender and receiver sides.
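The same hazard exists with any promise/future pair: if the producer discards a promise without fulfilling it, the consumer sees a broken promise in C++, or waits forever in Python. A hedged Python analogue of the fix using `concurrent.futures.Future` (not the actual C++ code):

```python
from concurrent.futures import Future, CancelledError

def cancel_pending_transfer(fut: Future) -> None:
    # Bug analogue: erasing the future/promise here without fulfilling it
    # leaves any waiter blocked (Python) or with a Broken promise (C++).
    # Fix: fulfill the promise with a terminal state *before* dropping it.
    if not fut.done():
        fut.set_exception(CancelledError("transfer cancelled"))

fut = Future()
cancel_pending_transfer(fut)
assert fut.done()  # a waiter now observes CancelledError instead of hanging
```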

3. Non-blocking generation transfer polling

checkGenTransferStatus(atLeastNum=1) no longer blocks indefinitely on an unready future. It skips unready entries during non-blocking polling and lets timeout handling drive cancellation/error state.
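In Python terms (a sketch of the polling discipline, not the C++ implementation), the fixed loop only ever collects results from futures that are already done:

```python
from concurrent.futures import Future

def check_transfer_status(futures: dict, at_least_num: int = 1) -> list:
    """Collect results from *ready* futures only; never block on an unready one.

    at_least_num is a target, not a guarantee: unready entries are skipped and
    left for the next poll, and timeout handling drives cancellation/error state.
    """
    finished = []
    for req_id, fut in list(futures.items()):
        if fut.done():                        # wait_for(0ms) equivalent
            finished.append((req_id, fut.result()))
            del futures[req_id]
    return finished

futures = {1: Future(), 2: Future()}
futures[2].set_result("ok")
assert check_transfer_status(futures) == [(2, "ok")]  # id 1 skipped, no blocking
assert 1 in futures                                   # retried on the next poll
```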

4. NIXL transfer cancellation and release

The NIXL path now releases transfer handles on cancellation (TransferStatus::release() → nixlAgent::releaseXferReq()), so cancelled requests do not leave backend transfer state stranded.

5. Eval-order fix after shared_ptr

After moving Response::mRequest to shared_ptr, this pattern became unsafe:

sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp));

C++ does not specify function-argument evaluation order, so the parameter could be move-constructed from std::move(resp) before resp.mRequest->mRequestId is read, leaving mRequest empty at the point of the read. This PR materializes the id into a local before the move:

auto const reqId = resp.mRequest->mRequestId;
sendAndRemoveResponse(reqId, std::move(resp));

6. Python idempotency guards

Generation-side disagg initialization and KV receive setup are now guarded by py_request_id, preventing repeated side effects such as duplicate KVCacheManager::addSequence() or duplicate request_and_receive_async() for the same request.
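A minimal sketch of such a guard (class and method names here are illustrative, not the PR's actual helpers): side effects are keyed by py_request_id and run at most once per request.

```python
class GenInitGuard:
    """Illustrative once-per-request guard keyed by py_request_id.

    Mirrors the PR's idempotency guards; the callback stands in for side
    effects like KVCacheManager::addSequence() or request_and_receive_async().
    """
    def __init__(self):
        self._initialized = set()

    def init_once(self, py_request_id, side_effect) -> bool:
        if py_request_id in self._initialized:
            return False            # duplicate call: skip repeated side effects
        self._initialized.add(py_request_id)
        side_effect()
        return True

calls = []
guard = GenInitGuard()
guard.init_once(42, lambda: calls.append("addSequence"))
guard.init_once(42, lambda: calls.append("addSequence"))
assert calls == ["addSequence"]     # second call was a no-op
```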

7. Fail-closed behavior for unknown transfer quiescence

If TRT-LLM cannot prove transport quiescence, it fails closed instead of returning buffers to the pool. This includes:

  • BufferIndexHolder::poison()
  • tri-state ready-signal handling (ready, not-ready, cancelled/unknown)
  • poison-on-NIXL-throw in cacheFormatter.cpp
  • equivalent MLA formatter hardening in mlaCacheFormatter.cpp
  • Python-side deferral/fail-safe handling for unquiesced transfer states

This protects against silent buffer-pool corruption under cancellation races.
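The fail-closed idea can be sketched as follows (names are illustrative; this is not the C++ BufferIndexHolder): a slot returns to the free list only when transport quiescence is proven, otherwise it is poisoned, i.e. deliberately leaked so NIXL/UCX can never scribble into a buffer another request now owns.

```python
class RecvBufferPool:
    """Sketch of fail-closed recv-buffer slot handling."""
    def __init__(self, num_slots: int):
        self.free = list(range(num_slots))
        self.poisoned = set()

    def acquire(self) -> int:
        return self.free.pop()

    def release(self, slot: int, quiesced: bool) -> None:
        if quiesced:
            self.free.append(slot)   # safe: transport provably done with slot
        else:
            self.poisoned.add(slot)  # fail closed: slot is never reused

pool = RecvBufferPool(2)
slot = pool.acquire()
pool.release(slot, quiesced=False)
assert slot in pool.poisoned and slot not in pool.free
```

Poisoning trades a bounded leak for memory safety: a leaked slot is observable and recoverable; a corrupted one is neither.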

8. Deferred cleanup for timed-out context/generation transfer

A successful cancel_request() is not a proof of quiescence. The PR therefore defers Python-side resource cleanup after a timeout cancellation until the C++ transfer status reports the request as terminal.

This preserves memory safety: resources are not freed while C++/NIXL may still reference transfer buffers or KV blocks.

9. Block-reuse-safe deferred termination

rc13 enabled block reuse with the overlap scheduler, which exposed a new liveness hole.

Before the final fix:

_handle_responses() decides request is done
→ attempts _terminate_request()
→ _terminate_request() refuses because context KV transfer is still in progress
→ later transfer completes
→ AsyncTransferManager.end_transfer() assumes early termination already happened
→ cleanup obligation is lost
→ request/resources remain pinned
→ frontend times out forever

This PR records whether termination was requested and whether resources were actually freed:

RequestTransferMetadata:
  termination_requested
  resources_freed

AsyncTransferManager.end_transfer() now returns:

EndTransferResult(
    completed: bool,
    needs_termination: bool,
)

When transfer becomes terminal, _end_transfer_and_maybe_terminate() retries termination if an earlier attempt was deferred. This preserves both invariants:

  • do not free/reuse KV resources while context transfer is still in progress;
  • do not drop the cleanup obligation after deferring it.
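The two-flag handoff can be sketched in Python as follows (a simplified model of the structures named above; the real signatures may differ):

```python
from dataclasses import dataclass

@dataclass
class RequestTransferMetadata:
    termination_requested: bool = False
    resources_freed: bool = False

@dataclass
class EndTransferResult:
    completed: bool
    needs_termination: bool

def end_transfer(meta: RequestTransferMetadata) -> EndTransferResult:
    """Report whether a deferred termination is still owed, instead of
    assuming early termination already succeeded (the rc13 bug)."""
    needs = meta.termination_requested and not meta.resources_freed
    return EndTransferResult(completed=True, needs_termination=needs)

def end_transfer_and_maybe_terminate(meta, terminate) -> None:
    result = end_transfer(meta)
    if result.needs_termination:     # an earlier attempt was deferred: retry once
        terminate()
        meta.resources_freed = True

meta = RequestTransferMetadata(termination_requested=True)
freed = []
end_transfer_and_maybe_terminate(meta, lambda: freed.append("terminated"))
assert freed == ["terminated"] and meta.resources_freed
```

If resources were already freed early, needs_termination stays False, so cleanup still runs exactly once.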

Why deferring cleanup is intentional

This PR intentionally defers cleanup in some cancellation paths.

That is the correct behavior because cancel_request() only requests cancellation. It does not prove that C++/NIXL has stopped touching transfer buffers or KV blocks.

The final behavior is:

request cancelled
→ stop scheduling it
→ request C++ cancellation once
→ keep transfer/KV resources alive
→ wait for terminal C++ transfer status
→ release/poison transfer buffers
→ unpin block-reuse refs
→ free Python resources exactly once

So the PR does not mean "never clean up." It means "do not clean up until it is safe, and then guarantee cleanup happens."

Tests

New/updated unit coverage includes:

  • sender cancel-after-ready promise fulfillment;
  • receiver queued-cancel promise fulfillment;
  • non-blocking checkGenTransferStatus(atLeastNum=1);
  • deferred termination surviving until async transfer completion;
  • partial-reuse path still skipping termination when no termination was requested;
  • successful early free preventing duplicate termination;
  • direct EndTransferResult.needs_termination coverage.

Local test runs:

python -m py_compile tensorrt_llm/_torch/pyexecutor/py_executor.py
pytest -v tests/unittest/_torch/executor/test_async_transfer_manager.py
  9 passed
pytest -v tests/unittest/others/test_kv_cache_transceiver.py -k \
  "test_cancel_request_in_transmission_does_not_break_sender_future or \
   test_check_gen_transfer_status_at_least_one_does_not_block_on_unready_future or \
   test_cancel_queued_gen_request_fulfills_receiver_future"
  6 passed

Notes / limitations

  • The direct UCX path has a separate throughput/backpressure boundary under high concurrency. This PR primarily fixes the NIXL customer path and the shared disaggregated cleanup invariants.
  • MLA send-path hardening is included for parity with the non-MLA formatter path, but the stress validation above uses Qwen3-0.6B and therefore primarily exercises the non-MLA formatter.
  • The local CuTe DSL import guard used during development is not part of this PR.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 3, 2026
Adds 00-tldr.md as the front door to the investigation. Targets a
10-minute read with two inline Mermaid figures:

- The disagg architecture + cancellation flow (where the bugs live).
- The L1-L8 → combo PR NVIDIA#13713 fix mapping (which piece closes which
  layer).

Written under the assumption that PR NVIDIA#13713 (combo: PR NVIDIA#13056 + PR
NVIDIA#13495 + eval-order fix + Python idempotency guards) is the chosen
final solution. Doesn't compare approaches A/B/C — that comparison
lives in 06-fix-approaches/README.md for readers who want depth on the
landing decision.

Section structure:
1. What is broken (the wedge symptom).
2. Where the bugs live (architecture figure).
3. Root cause: eight invariant gaps (L1-L8 table).
4. The fix (combo figure + per-piece breakdown).
5. Does it work (empirical recovery table).
6. Why this took 8 days to find (cascade pattern).
7. What is left to do.
8. Where to go from here (reader paths to deeper files).

README.md updated to put 00-tldr.md at the top of the navigation
table and add a 10-minute reading path entry.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung force-pushed the pr13056-pr13495-combo-nvbug6104831 branch 4 times, most recently from 631975c to 262796c on May 4, 2026 06:57
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 5, 2026
…rt, and L9 invariant

Update the NVBug 6104831 investigation report to reflect the current
state of the combo fix path (PR NVIDIA#13713):

- Fold PR NVIDIA#13728 (fail-closed on unquiesced disagg KV transfer) into
  the combo as a fifth piece, plus the MLA-formatter port that
  closes the same hazard for DeepSeek-style models.
- Add L9 (transport quiescence on unsafe exit) to the defect-class
  stack as a defense-in-depth memory-safety invariant. L1-L8 are
  wedge-prevention; L9 is the rip-cord that prevents silent
  buffer-pool corruption on cancel/exception when transport
  quiescence is unknown.
- Update the empirical results tables across README.md, 00-tldr.md,
  06-fix-approaches/D-combo.md, and 06-fix-approaches/README.md to
  show CONC=128 (3-pair) review-fix-v3 5/5 PASS post-fold-in plus
  the prior pre-fold-in stress runs through CONC=256.
- Add MLA-stress validation as a follow-up item and motivate the
  RAII-audit task with the MLA-formatter finding.
- Update the PR map to include NVIDIA#13728 and the chained sig regression
  PRs that the combo retains as test scaffolding.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
chienchunhung changed the title from "[https://nvbugs/6104831][fix] Disagg request cancellation fix" to "[https://nvbugs/6104831][fix] Disaggregated KV transfer: lifecycle, cancellation, and quiescence hardening" on May 5, 2026
yifjiang added a commit to yifjiang/TensorRT-LLM that referenced this pull request May 6, 2026
…r queues

Follow-up to the prior commit. The worker queues inside CacheSender::Impl
and CacheReceiver::Impl still held raw LlmRequest* pointers — meaning
the executor's mSenderFutures / mRequesterFutures pinned the lifetime
but the worker thread that actually dereferences the request had only
a raw observer. If the executor erased its tracking entry while the
worker was mid-flight, the LlmRequest could be freed under the worker.

This commit closes that surface:

* CacheSender::Impl::Response::mRequest → std::shared_ptr<LlmRequest>.
* CacheReceiver::Impl::RequestAndPromise::mRequest → std::shared_ptr<LlmRequest>.
  Move/copy semantics simplified now that the field is a smart pointer.
* CacheSender::sendAsync(LlmRequest&) → sendAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::receiveAsync(LlmRequest&) → receiveAsync(std::shared_ptr<LlmRequest>).
* CacheReceiver::Impl::requestAndReceiveAsyncMultiThreads similarly.
* CacheReceiver::Impl::receiveAsync now captures the shared_ptr by value
  in the std::async lambda so the worker thread pins the LlmRequest
  independently of the caller.
* CacheTransceiver::respondAndSendAsync/-LayerWise/requestAndReceive*
  pass the shared_ptr (no longer .get()-strip it) and move into the
  TrackedFuture entry where appropriate.

Also fixes the eval-order UAF in CacheSender::Impl::handleAsyncSend that
PR NVIDIA#13713 / PR NVIDIA#13728 called out: once Response::mRequest is a
shared_ptr, the one-liner

    sendAndRemoveResponse(resp.mRequest->mRequestId, std::move(resp));

becomes undefined behaviour because C++ argument evaluation order is
unspecified — the compiler may evaluate std::move(resp) first, leaving
resp.mRequest empty when reading mRequestId. Materialise the id into
a local before the move.

The PyCacheTransceiver trampoline's bool cancelRequest override is
updated to match the new shared_ptr signature.

The doc at docs/source/features/disagg-kv-transfer-session-lifecycle.md
spells out the full ownership chain: executor tracking + worker queue
both hold shared_ptr, and TransferSession::mRequest is an ephemeral
observer used only inside the worker frame where the shared_ptr is
already held.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
yifjiang added a commit to yifjiang/TensorRT-LLM that referenced this pull request May 6, 2026
…g, structured cancel, bounded quarantine

The C++ cache transceiver and Python executor cooperate on disaggregated
KV transfer cancellation, timeout, and process health. Earlier attempts
(PR NVIDIA#13706, PR NVIDIA#13713, PR NVIDIA#13728) each fixed one slice of the problem
but still hung in the field:

* PR NVIDIA#13706 with Python changes restarted workers on every routine
  per-request timeout (Python fail-close was too aggressive).
* PR NVIDIA#13713 / PR NVIDIA#13728 hung even with Python cleanup removed because
  the C++ status polling could still call future.get() on an unready
  worker future, freezing the executor event loop while NIXL/UCX was
  wedged.

This PR keeps each layer's responsibility separate, as described in
docs/source/features/disagg-kv-transfer-hang-restart-analysis.md and
the new docs/source/features/disagg-kv-transfer-session-lifecycle.md:

* checkContextTransferStatus / checkGenTransferStatus are now strictly
  non-blocking. They poll with wait_for(0ms) and only call future.get()
  when the future is already ready. atLeastRequestNum > 0 still admits
  additional ready entries but never selects an unready one to satisfy
  the count. drainContextTransferStatus / drainGenTransferStatus are
  the only blocking variants and are intended for shutdown drain.

* cancelRequestStructured returns a TransferCancelResult enum with
  six outcomes (NotFound, AlreadyComplete, CancelledBeforeAdvertise,
  CancelRequestedInFlight, BackendUnhealthy, NotCancellable). Only
  the first three permit Python to free request resources; the others
  keep the request owned by C++ until the worker future reaches a
  final state. The historical bool cancelRequest is preserved as a
  backward-compatible wrapper.

* The transceiver maintains a bounded quarantine counter and a global
  progress deadline. A per-request timeout marks the entry quarantined
  but keeps the future pinned so NIXL/UCX cannot write into freed
  memory. If quarantined entries exceed mQuarantineBudget or no
  worker has reached a final state for longer than
  mGlobalProgressDeadlineMs, isHealthy() flips false and getHealth()
  surfaces the snapshot for orchestration.

* PyExecutor adds _can_terminate_request_now / _inflight_cancel_-
  requested_ids so _do_terminate_request defers freeing resources for
  any request that is still in disagg transfer state or whose cancel
  is in flight. Per-request timeouts no longer clear active_requests,
  the waiting queue, or set is_shutdown — that was the PR NVIDIA#13706
  restart-loop trigger and is explicitly forbidden by the analysis
  doc.

* The ADP synchronized pending-response flush from PR NVIDIA#13112 is
  ported here (the base commit precedes that merge). Transfer-
  completion responses created in _end_transfer_and_maybe_terminate
  are buffered and flushed at synchronised loop points so every DP
  rank participates in the tp_gather collective.

Test coverage:

* tests/unittest/disaggregated/test_kv_transfer_session_lifecycle.py
  — 14 unit tests for the deferred-terminate guard, the structured-
  cancel decision tree, and the ADP flush symmetry. They run without
  GPU or MPI so they fail fast in pre-merge CI.

References:
* analysis: docs/source/features/disagg-kv-transfer-hang-restart-analysis.md
* contract: docs/source/features/disagg-kv-transfer-session-lifecycle.md

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 7, 2026
…sig NVIDIA#8, Phase 15

The combo (PR NVIDIA#13713 + NVIDIA#13728 fold + MLA port) regressed when applied
to rc13: with rc13's default-on block reuse, `_handle_responses`'s
early-termination branch and `_end_transfer_and_maybe_terminate` can
each refuse termination under the right timing, leaving the request
with no cleanup owner. Server hangs on scenarios that succeeded on
rc11.

Captures this in the investigation report:

- 03-defect-class-stack.md: add L10 (redundant block-reuse cleanup
  mechanism on the disagg path) with code sites, customer-visible
  symptom, and the latent symptoms the Phase 1 stop-gap leaves open
  (pin leak on cancel/timeout, PP > 1 disagg without block reuse,
  eviction race in the unpin → release window, regression risk on
  adjacent code).
- 03-defect-class-stack.md: extend the layer-to-signature mermaid
  with L10 → sig NVIDIA#8 plus a dotted edge to the latent-symptom set.
- 02-failure-signatures.md: add sig NVIDIA#8 (rc13 server hang under
  disagg + block reuse + in-flight cancel) with full root-cause,
  short-term stop-gap, and medium-term Phase 2 plan.
- 05-investigation-timeline.md: add Phase 15 documenting the rc13
  regression discovery, the two competing fix proposals, and the
  recommended staged plan.
- 08-next-steps-and-pr-map.md: split item 2 into "land combo with
  rc13 stop-gap" and a new item 2a "land Phase 2 of the
  block-reuse-overlap-scheduler design".
- README.md: add the rc13 caveat callout in the status section;
  update navigation language from L1-L8/seven-sig to L1-L10/eight-sig.

Adds incentive cross-reference in the existing design doc:

- docs/design/block-reuse-overlap-scheduler/README.md: promote Phase 2
  status from "deprioritised" to "load-bearing for stable disagg
  block reuse" with cross-link to the rc13 evidence.
- docs/design/block-reuse-overlap-scheduler/phase2-unify-reuse-mechanisms.md:
  add an "Empirical confirmation: the rc13 regression" section at the
  top, framing Phase 2 as the architectural answer the rc13 bug
  predicted.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 7, 2026
Adds 09-executive-summary-rc11-to-rc13.md as the 15-minute read for
someone briefing the full journey: original rc11 wedge, how PR NVIDIA#13713
solved it, why it regressed on rc13 (block reuse → L10 dual-path),
how the short-term stop-gap unblocks rc13, and why the design doc's
Phase 2 is the right long-term answer.

Five Mermaid figures:
- L1-L9 layer stack as the rc11 root cause framework
- PR NVIDIA#13713's four-piece composition with L1-L9 mapping
- The L10 dual-path: both cleanup paths refusing termination on rc13
- Stop-gap coverage vs latent symptoms it leaves open
- Add-coordination vs delete-redundancy comparison + staged plan

Updates README.md navigation table and reading paths to point at the
new file with a "15 minutes / brief someone on the full journey"
entry.

The file extends 00-tldr.md (10-minute read on rc11 only). Where
00-tldr stops at "the rc11 wedge is fixed", 09 picks up at the rc13
regression discovery and walks through to the architectural follow-up.
Designed for an exec briefing or a teammate joining mid-investigation.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 7, 2026
…cancellation/cleanup invariants

Two changes to 09-executive-summary-rc11-to-rc13.md per review feedback:

1. Treat the file as an independent report. Removed the "extends
   00-tldr.md with the rc13 chapter" framing; removed the inline
   "> Detail: <other-file>.md" pointers from each section; restated
   the ten invariants in-place rather than referring readers to
   03-defect-class-stack.md to follow along; moved the cross-file
   pointers to a clearly-marked "Optional deep-dive pointers"
   appendix at the end. A reader can now walk away with a complete
   understanding of the bug class, the fix, the regression, and the
   plans without needing to consult any other file.

2. Add a new section 2 "The invariants for correct request
   cancellation and cleanup" that names the ten invariants the bug
   class violates, organised into four architectural categories:
     - Lifetime invariants (L2/L7/L9): what stays alive while
       transfers are in flight.
     - Resource invariants (L5/L6): every acquired resource must
       release on every exit.
     - Synchronization invariants (L1/L3/L4): no thread waits forever.
     - Coordination invariants (L8/L10): exactly one owner, no
       implicit handoffs.
   Each invariant is named, defined, and explained in terms of why it
   matters (with the rc11/rc13 evidence that violated it). Section 3
   onwards then describes PR NVIDIA#13713, the rc13 regression, the stop-gap,
   and the architectural fix in terms of which invariants they enforce
   or fail to enforce.

A new Mermaid figure visualises the four invariant categories so a
reader can see the architectural hierarchy at a glance: lifetime
governs who is alive, resource governs what is held, synchronization
governs who is waiting, coordination governs who is responsible.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request May 7, 2026
…p-dive pointers

Per review feedback:

- Title now just mentions the NVBug number: "09 — NVBug 6104831
  Executive Summary".
- Preface trimmed to one sentence that briefly mentions PR NVIDIA#13713 and
  the rc11/rc13 contrast. Audience block, reading-time block, and
  self-contained note removed.
- "Optional deep-dive pointers" appendix removed entirely.

The file is now strictly the rc11 → rc13 narrative with the four-
category invariant model, no meta-framing or cross-reference
appendix.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung force-pushed the pr13056-pr13495-combo-nvbug6104831 branch from e4c232d to 87c4bb8 on May 7, 2026 05:59
chienchunhung and others added 7 commits May 6, 2026 23:05
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
…disagg cancellation.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
(cherry picked from commit 944561d)
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

# Conflicts:
#	tensorrt_llm/_torch/pyexecutor/py_executor.py
chienchunhung force-pushed the pr13056-pr13495-combo-nvbug6104831 branch from 87c4bb8 to 484655e on May 7, 2026 06:22
@chienchunhung
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #47132 [ run ] triggered by Bot. Commit: 484655e

…transfer

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung force-pushed the pr13056-pr13495-combo-nvbug6104831 branch from 484655e to 5719008 on May 7, 2026 06:43