[TRTLLM-12732][fix] Fence V2 `_batched_migrate` behind `execution_stream` by Barry-Delaney · Pull Request #14245 · NVIDIA/TensorRT-LLM

Barry-Delaney · 2026-05-18T08:17:01Z

Summary

Re-applies PR13488 (c4fbc246b2), which was reverted by 8355a6f349 without rationale, on top of feat/deepseek_v4. Fixes the DSv4-Flash + EPLB + MEGAMOE_DEEPGEMM IMA (illegal memory access) on 4xB300 (nvbug 6183549) and removes the two corresponding SKIP entries in waives.txt.

Root cause

`StorageManager._batched_migrate` copies on a transient non-blocking stream from `TemporaryCudaStream`'s pool; forward kernels run on `execution_stream`. The two are distinct cudaStream_t handles (verified below). Without an explicit fence, the copy stream may recycle a slot before in-flight forward kernels release it, which produces CUDA_ERROR_ILLEGAL_ADDRESS.
The other fences already on the branch cover different races:

| Fence | Race covered |
| --- | --- |
| PR14128 `_prepare_inputs_event` | host-staging buffer vs in-flight H2D in `_prepare_inputs` (records on `execution_stream`) |
| `044c13c41e` memoryview-sync in `_KVCache.resize` | host-visible `base_page_indices` readers vs slot `ready_event`s |
| PR13488 (this PR) | `_batched_migrate` copy stream vs in-flight forward on `execution_stream` |
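
To make the last row concrete, here is a minimal sketch of the ordering problem in pure Python. It is a toy model, not TensorRT-LLM code: `Stream`, `Event`, `run`, and `simulate` are hypothetical names invented for illustration, and the scheduler is deliberately adversarial (it always lets the copy stream go first when it can), which stands in for the lack of any ordering guarantee between two independent CUDA streams.

```python
from dataclasses import dataclass
from typing import List, Tuple


class Stream:
    """Toy in-order work queue standing in for a CUDA stream (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ops: List[Tuple[str, object]] = []
        self.done = 0  # number of ops already executed

    def launch(self, label: str) -> None:
        self.ops.append(("kernel", label))

    def record_event(self) -> "Event":
        # The event completes once every op enqueued so far has executed.
        return Event(self, len(self.ops))

    def wait_event(self, event: "Event") -> None:
        self.ops.append(("wait", event))


@dataclass(frozen=True)
class Event:
    stream: Stream
    target: int

    def is_complete(self) -> bool:
        return self.stream.done >= self.target


def run(streams: List[Stream]) -> List[str]:
    """Drain all streams, preferring the earliest stream in the list that can progress."""
    trace: List[str] = []
    while any(s.done < len(s.ops) for s in streams):
        for s in streams:
            if s.done == len(s.ops):
                continue
            kind, payload = s.ops[s.done]
            if kind == "wait" and not payload.is_complete():
                continue  # this stream is blocked on an event; try the next one
            if kind == "kernel":
                trace.append(payload)
            s.done += 1
            break
        else:
            raise RuntimeError("deadlock: every stream is blocked")
    return trace


def simulate(fenced: bool) -> List[str]:
    execution_stream = Stream("execution_stream")
    copy_stream = Stream("copy_stream")
    execution_stream.launch("forward reads slot 0")
    if fenced:
        # The fence: record on execution_stream, make the copy stream wait on it.
        copy_stream.wait_event(execution_stream.record_event())
    copy_stream.launch("migrate overwrites slot 0")
    # Adversarial order: the copy stream is always tried first.
    return run([copy_stream, execution_stream])


print(simulate(fenced=False))
print(simulate(fenced=True))
```

Without the fence, the toy scheduler runs the migrate step before the forward read; with the fence, the copy stream cannot start until the event recorded on `execution_stream` has completed. Real CUDA scheduling is not this deterministic, which is why the unfenced race shows up only intermittently.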

Barry-Delaney · 2026-05-18T08:20:34Z

/bot run --disable-fail-fast

Barry-Delaney · 2026-05-18T08:23:56Z

/bot kill

tensorrt-cicd · 2026-05-18T08:27:44Z

PR_Github #48868 [ run ] triggered by Bot. Commit: 4be6c2b Link to invocation

tensorrt-cicd · 2026-05-18T08:31:23Z

PR_Github #48870 [ kill ] triggered by Bot. Commit: 4be6c2b Link to invocation

tensorrt-cicd · 2026-05-18T08:35:02Z

PR_Github #48868 [ run ] completed with state ABORTED. Commit: 4be6c2b

Link to invocation

tensorrt-cicd · 2026-05-18T08:35:07Z

PR_Github #48870 [ kill ] completed with state SUCCESS. Commit: 4be6c2b
Successfully killed previous jobs for commit 4be6c2b

Link to invocation

Barry-Delaney · 2026-05-18T09:06:05Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-18T09:12:31Z

PR_Github #48882 [ run ] triggered by Bot. Commit: d96585e Link to invocation

tensorrt-cicd · 2026-05-18T14:05:46Z

PR_Github #48882 [ run ] completed with state SUCCESS. Commit: d96585e
/LLM/main/L0_MergeRequest_PR pipeline #38632 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

lfr-0531 · 2026-05-18T15:55:00Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-18T16:03:55Z

PR_Github #48931 [ run ] triggered by Bot. Commit: d96585e Link to invocation

tensorrt-cicd · 2026-05-18T16:42:01Z

PR_Github #48931 [ run ] completed with state FAILURE. Commit: d96585e
/LLM/main/L0_MergeRequest_PR pipeline #38677 completed with status: 'FAILURE'

Barry-Delaney · 2026-05-18T16:45:44Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-18T16:52:01Z

PR_Github #48940 [ run ] triggered by Bot. Commit: 3337a3f Link to invocation

tensorrt-cicd · 2026-05-18T21:56:06Z

PR_Github #48940 [ run ] completed with state SUCCESS. Commit: 3337a3f
/LLM/main/L0_MergeRequest_PR pipeline #38686 completed with status: 'FAILURE'

lfr-0531 · 2026-05-19T01:29:33Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-19T01:37:45Z

PR_Github #49027 [ run ] triggered by Bot. Commit: 3337a3f Link to invocation

tensorrt-cicd · 2026-05-19T04:43:41Z

PR_Github #49027 [ run ] completed with state SUCCESS. Commit: 3337a3f
/LLM/main/L0_MergeRequest_PR pipeline #38765 completed with status: 'SUCCESS'

CI Report

Link to invocation

Barry-Delaney · 2026-05-19T12:20:37Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-19T12:27:57Z

PR_Github #49187 [ run ] triggered by Bot. Commit: 1c75018 Link to invocation

…ion_stream

Re-applies PR13488 (cherry-picked as c4fbc24, reverted by 8355a6f without rationale) on top of feat/deepseek_v4.

Root cause
----------
`StorageManager._batched_migrate` runs cross-tier copy on a transient non-blocking stream from `TemporaryCudaStream`'s pool (handle in the ~10^14 range), while forward kernels run on `execution_stream` (one specific stream, handle in the ~10^8 range). Empirically verified via runtime stream-id dump: the two streams are distinct cudaStream_t handles.

Without an explicit fence the copy stream may begin recycling a slot before in-flight forward kernels on execution_stream have released it, producing CUDA_ERROR_ILLEGAL_ADDRESS when the forward path reads or the copy path writes the slot concurrently.

The pre-existing fences do not cover this race:

* PR14128's `_prepare_inputs_event` fences host-staging buffer vs in-flight H2D copy enqueued during `_prepare_inputs`.
* 044c13c's memoryview-sync fences host-visible base_page_indices vs slot ready_events.
* Neither fences the cross-tier copy stream against forward kernels on execution_stream.

Fix
---
Wire `execution_stream` into `StorageManager` so `_batched_migrate` can record a fresh event on it and add it to the copy stream's prior_events wait set. This is the V2 equivalent of V1's `syncWithBufferManager()` two-way sync between the BufferManager stream and the offload/onboard streams.

The companion +22 lines of interim memoryview-sync in `_KVCache.resize` (introduced by 044c13c) become redundant once this fence is in place and are removed -- exactly as PR13488 originally did.

Empirical verification
----------------------
With the same DEP8 / static_eplb / MEGAMOE_DEEPGEMM workload on B300:

Without fix (latest feat tip f4d5e59):
  RANK 3 raises kv_cache_manager_v2._exceptions.CuError: CUresult.CUDA_ERROR_ILLEGAL_ADDRESS: 700
  mpirun aborts; pytest hangs.

With fix:
  1 passed in 4:12, 0 IMA, gsm8k accuracy 95.30.

Waives removed
--------------
The two corresponding SKIP entries in tests/integration/test_lists/waives.txt are removed so the fix is exercised in CI.

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
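
The wiring described under "Fix" in the commit message above might look roughly like the sketch below. This is an assumption-laden illustration, not the actual change: `FakeStream`, `FakeEvent`, and `StorageManagerSketch` are hypothetical stand-ins, and the real `StorageManager`, `_batched_migrate`, and `prior_events` signatures are not shown in this conversation. The only point is the shape of the fix: the manager holds a reference to the execution stream and appends a freshly recorded event to whatever the copy stream already waits on.

```python
from typing import List


class FakeEvent:
    """Hypothetical stand-in for a CUDA event; it only remembers where it was recorded."""

    def __init__(self, origin: str) -> None:
        self.origin = origin


class FakeStream:
    """Hypothetical stand-in for a CUDA stream that logs the events it waits on."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.waited_on: List[str] = []

    def record_event(self) -> FakeEvent:
        return FakeEvent(self.name)

    def wait_event(self, event: FakeEvent) -> None:
        self.waited_on.append(event.origin)


class StorageManagerSketch:
    """Illustrative only; not the TensorRT-LLM StorageManager."""

    def __init__(self, execution_stream: FakeStream) -> None:
        # The execution stream is passed in at construction time ("wired in").
        self._execution_stream = execution_stream

    def batched_migrate(self, copy_stream: FakeStream,
                        prior_events: List[FakeEvent]) -> None:
        # Fence: snapshot the work currently queued on the execution stream
        # and add that event to the copy stream's wait set.
        wait_set = list(prior_events) + [self._execution_stream.record_event()]
        for event in wait_set:
            copy_stream.wait_event(event)
        # ... the cross-tier copy would be enqueued on copy_stream here ...


execution_stream = FakeStream("execution_stream")
copy_stream = FakeStream("copy_stream")
manager = StorageManagerSketch(execution_stream)
manager.batched_migrate(copy_stream, [FakeEvent("slot_ready")])
print(copy_stream.waited_on)
```

Recording a fresh event on every call, rather than reusing one, means each migration waits only for the forward work enqueued up to that moment.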

Barry-Delaney · 2026-05-20T05:25:53Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-20T05:32:41Z

PR_Github #49344 [ run ] triggered by Bot. Commit: e142cba Link to invocation

tensorrt-cicd · 2026-05-20T05:38:38Z

PR_Github #49187 [ run ] completed with state ABORTED. Commit: 1c75018

Link to invocation

jiaganc

LGTM

tensorrt-cicd · 2026-05-20T10:38:06Z

PR_Github #49344 [ run ] completed with state SUCCESS. Commit: e142cba
/LLM/main/L0_MergeRequest_PR pipeline #39001 completed with status: 'FAILURE'

peihu-nv · 2026-05-20T20:16:56Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-20T20:23:06Z

PR_Github #49482 [ run ] triggered by Bot. Commit: 55eb3b5 Link to invocation

tensorrt-cicd · 2026-05-20T23:17:09Z

PR_Github #49482 [ run ] completed with state SUCCESS. Commit: 55eb3b5
/LLM/main/L0_MergeRequest_PR pipeline #39124 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

github-actions Bot assigned Barry-Delaney May 18, 2026

Barry-Delaney force-pushed the user/barry/fix_nvbugs_6183549_kv_block_offset_guard branch 2 times, most recently from 8bfa99b to 4be6c2b Compare May 18, 2026 08:20

Barry-Delaney force-pushed the user/barry/fix_nvbugs_6183549_kv_block_offset_guard branch 5 times, most recently from ce00a01 to d96585e Compare May 18, 2026 09:05

Barry-Delaney force-pushed the user/barry/fix_nvbugs_6183549_kv_block_offset_guard branch from d96585e to 3337a3f Compare May 18, 2026 16:45

Barry-Delaney marked this pull request as ready for review May 19, 2026 04:21

Barry-Delaney requested a review from a team as a code owner May 19, 2026 04:21

Barry-Delaney requested a review from lfr-0531 May 19, 2026 04:54

Barry-Delaney requested review from heyuhhh and jiaganc May 19, 2026 04:54

heyuhhh approved these changes May 19, 2026

Barry-Delaney force-pushed the user/barry/fix_nvbugs_6183549_kv_block_offset_guard branch from 3337a3f to 1c75018 Compare May 19, 2026 09:43

Barry-Delaney changed the title ~~[TRTLLM-12732][fix] Fence V2 KV cache block-offset H2D copy on KV manager stream~~ [TRTLLM-12732][fix] Fence V2 _batched_migrate behind execution_stream May 19, 2026

Barry-Delaney force-pushed the user/barry/fix_nvbugs_6183549_kv_block_offset_guard branch from 1c75018 to e142cba Compare May 20, 2026 05:25

lfr-0531 added the deepseek-v4 label May 20, 2026

jiaganc approved these changes May 20, 2026

peihu-nv added 2 commits May 20, 2026 13:04

Merge branch 'feat/deepseek_v4' into user/barry/fix_nvbugs_6183549_kv…

631ee1f

…_block_offset_guard Signed-off-by: peihengh <259410613+peihu-nv@users.noreply.github.com>

Fix pre-commit formatting after manual merge conflict resolution

55eb3b5

Signed-off-by: peihengh <259410613+peihu-nv@users.noreply.github.com>

peihu-nv merged commit 433ac3d into NVIDIA:feat/deepseek_v4 May 20, 2026
6 checks passed

6 participants
