[TRTLLM-12732][fix] Fence V2 _batched_migrate behind execution_stream #14245

Merged
peihu-nv merged 3 commits into NVIDIA:feat/deepseek_v4 from Barry-Delaney:user/barry/fix_nvbugs_6183549_kv_block_offset_guard
May 20, 2026

@Barry-Delaney Barry-Delaney commented May 18, 2026

Summary

Re-applies PR13488 (c4fbc246b2), reverted by 8355a6f349 without rationale, on top of feat/deepseek_v4. Fixes the DSv4-Flash + EPLB + MEGAMOE_DEEPGEMM IMA (illegal memory access) on 4xB300 (nvbug 6183549) and removes the two corresponding SKIP entries in waives.txt.

Root cause

StorageManager._batched_migrate copies on a transient non-blocking stream from TemporaryCudaStream's pool; forward kernels run on execution_stream. The two are distinct cudaStream_t handles (verified below). Without an explicit fence, the copy stream may recycle a slot before in-flight forward kernels release it → CUDA_ERROR_ILLEGAL_ADDRESS.
The other fences already on the branch cover different races:

| Fence | Race covered |
| --- | --- |
| PR14128 _prepare_inputs_event | host-staging buffer vs in-flight H2D in _prepare_inputs (records on execution_stream) |
| 044c13c41e memoryview-sync in _KVCache.resize | host-visible base_page_indices readers vs slot ready_events |
| PR13488 (this PR) | _batched_migrate copy stream vs in-flight forward on execution_stream |
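
The race described in the root cause can be illustrated with a toy, CPU-only timeline model. This is only a sketch: it does not touch CUDA or any TensorRT-LLM code, and `Op`, `schedule`, and `slot_conflict` are hypothetical names made up for the illustration. Each stream runs its ops in order, streams are otherwise independent, and an op may wait on events recorded by ops on other streams.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    duration: int
    slot: int                                   # slot the op reads or writes
    waits: list = field(default_factory=list)   # event names to wait for
    records: str = ""                           # event recorded on completion


def schedule(streams):
    """Return {op name: (start, end)} for in-order, mutually independent streams."""
    event_time, spans = {}, {}
    clock = {sid: 0 for sid in streams}
    pending = {sid: list(ops) for sid, ops in streams.items()}
    progressed = True
    while progressed:
        progressed = False
        for sid, ops in pending.items():
            # Run the head op once every event it waits on has been recorded.
            while ops and all(e in event_time for e in ops[0].waits):
                op = ops.pop(0)
                start = max([clock[sid]] + [event_time[e] for e in op.waits])
                clock[sid] = start + op.duration
                spans[op.name] = (start, clock[sid])
                if op.records:
                    event_time[op.records] = clock[sid]
                progressed = True
    return spans


def slot_conflict(fenced):
    """True if the forward read and the migrate write overlap on one slot."""
    forward = Op("forward_reads_slot", 5, slot=0, records="forward_done")
    migrate = Op("migrate_writes_slot", 2, slot=0,
                 waits=["forward_done"] if fenced else [])
    spans = schedule({"execution_stream": [forward], "copy_stream": [migrate]})
    (a0, a1), (b0, b1) = spans[forward.name], spans[migrate.name]
    return forward.slot == migrate.slot and a0 < b1 and b0 < a1


print(slot_conflict(fenced=False))  # True: copy overlaps the in-flight forward
print(slot_conflict(fenced=True))   # False: copy starts after forward_done
```

In the unfenced case nothing orders the two streams, so the write to the slot can land while the forward op is still using it; the fence only adds one wait edge from the copy stream to the execution stream.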

@Barry-Delaney Barry-Delaney force-pushed the user/barry/fix_nvbugs_6183549_kv_block_offset_guard branch 2 times, most recently from 8bfa99b to 4be6c2b on May 18, 2026 08:20
@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@Barry-Delaney
Collaborator Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #48868 [ run ] triggered by Bot. Commit: 4be6c2b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48870 [ kill ] triggered by Bot. Commit: 4be6c2b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48868 [ run ] completed with state ABORTED. Commit: 4be6c2b

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48870 [ kill ] completed with state SUCCESS. Commit: 4be6c2b
Successfully killed previous jobs for commit 4be6c2b

Link to invocation

@Barry-Delaney Barry-Delaney force-pushed the user/barry/fix_nvbugs_6183549_kv_block_offset_guard branch 5 times, most recently from ce00a01 to d96585e on May 18, 2026 09:05
@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48882 [ run ] triggered by Bot. Commit: d96585e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48882 [ run ] completed with state SUCCESS. Commit: d96585e
/LLM/main/L0_MergeRequest_PR pipeline #38632 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lfr-0531
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48931 [ run ] triggered by Bot. Commit: d96585e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48931 [ run ] completed with state FAILURE. Commit: d96585e
/LLM/main/L0_MergeRequest_PR pipeline #38677 completed with status: 'FAILURE'

@Barry-Delaney Barry-Delaney force-pushed the user/barry/fix_nvbugs_6183549_kv_block_offset_guard branch from d96585e to 3337a3f on May 18, 2026 16:45
@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48940 [ run ] triggered by Bot. Commit: 3337a3f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48940 [ run ] completed with state SUCCESS. Commit: 3337a3f
/LLM/main/L0_MergeRequest_PR pipeline #38686 completed with status: 'FAILURE'

@lfr-0531
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49027 [ run ] triggered by Bot. Commit: 3337a3f Link to invocation

@Barry-Delaney Barry-Delaney marked this pull request as ready for review May 19, 2026 04:21
@Barry-Delaney Barry-Delaney requested a review from a team as a code owner May 19, 2026 04:21
@tensorrt-cicd
Collaborator

PR_Github #49027 [ run ] completed with state SUCCESS. Commit: 3337a3f
/LLM/main/L0_MergeRequest_PR pipeline #38765 completed with status: 'SUCCESS'

CI Report

Link to invocation

@Barry-Delaney Barry-Delaney requested a review from lfr-0531 May 19, 2026 04:54
@Barry-Delaney Barry-Delaney requested review from heyuhhh and jiaganc May 19, 2026 04:54
@Barry-Delaney Barry-Delaney force-pushed the user/barry/fix_nvbugs_6183549_kv_block_offset_guard branch from 3337a3f to 1c75018 on May 19, 2026 09:43
@Barry-Delaney Barry-Delaney changed the title from "[TRTLLM-12732][fix] Fence V2 KV cache block-offset H2D copy on KV manager stream" to "[TRTLLM-12732][fix] Fence V2 _batched_migrate behind execution_stream" on May 19, 2026
@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49187 [ run ] triggered by Bot. Commit: 1c75018 Link to invocation

…ion_stream

Re-applies PR13488 (cherry-picked as c4fbc24, reverted by 8355a6f
without rationale) on top of feat/deepseek_v4.

Root cause
----------
`StorageManager._batched_migrate` runs the cross-tier copy on a transient
non-blocking stream from `TemporaryCudaStream`'s pool (handle in the
~10^14 range), while forward kernels run on `execution_stream` (one
specific stream, handle in the ~10^8 range).  A runtime stream-id dump
confirms that the two streams are distinct cudaStream_t handles.

Without an explicit fence the copy stream may begin recycling a slot
before in-flight forward kernels on execution_stream have released it,
producing CUDA_ERROR_ILLEGAL_ADDRESS when the forward path reads or the
copy path writes the slot concurrently.

The pre-existing fences do not cover this race:
  * PR14128's `_prepare_inputs_event` fences host-staging buffer vs
    in-flight H2D copy enqueued during `_prepare_inputs`.
  * 044c13c's memoryview-sync fences host-visible base_page_indices
    vs slot ready_events.
  * Neither fences the cross-tier copy stream against forward kernels
    on execution_stream.

Fix
---
Wire `execution_stream` into `StorageManager` so `_batched_migrate` can
record a fresh event on it and add it to the copy stream's prior_events
wait set.  This is the V2 equivalent of V1's `syncWithBufferManager()`
two-way sync between the BufferManager stream and the offload/onboard
streams.

The companion +22 lines of interim memoryview-sync in `_KVCache.resize`
(introduced by 044c13c) become redundant once this fence is in place
and are removed -- exactly as PR13488 originally did.
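
The fence described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `FakeStream` and `batched_migrate_sketch` are hypothetical names, and the stream objects are only assumed to offer `record_event()` and `wait_event(event)` (methods of the same names exist on `torch.cuda.Stream`, but which stream wrapper `StorageManager` actually uses is not shown here). The fake stream just logs calls so the ordering can be checked without a GPU.

```python
class FakeStream:
    """Hypothetical stand-in for a CUDA stream wrapper; it only logs calls."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def record_event(self):
        # The event marks everything enqueued on this stream so far.
        event = f"event@{self.name}#{len(self.log)}"
        self.log.append(("record", self.name, event))
        return event

    def wait_event(self, event):
        # Work enqueued on this stream afterwards runs only once `event` fired.
        self.log.append(("wait", self.name, event))


def batched_migrate_sketch(copy_stream, execution_stream, prior_events, do_copy):
    """Enqueue a copy that is fenced behind execution_stream (illustrative)."""
    # Record a fresh event on the execution stream ...
    fence = execution_stream.record_event()
    # ... and add it to the set of events the copy stream must wait on.
    for event in [*prior_events, fence]:
        copy_stream.wait_event(event)
    do_copy(copy_stream)
    # Event that later consumers of the migrated slot can wait on.
    return copy_stream.record_event()


log = []
execution = FakeStream("execution", log)
copy = FakeStream("copy", log)
ready = batched_migrate_sketch(
    copy, execution, ["slot_ready"],
    lambda stream: log.append(("copy", stream.name, None)),
)
for entry in log:
    print(entry)
```

The point of the ordering is that the wait on the execution-stream event is enqueued before the copy itself, so the copy cannot begin until the forward work submitted earlier has finished.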

Empirical verification
----------------------
With the same DEP8 / static_eplb / MEGAMOE_DEEPGEMM workload on B300:

  Without fix (latest feat tip f4d5e59):
    RANK 3 raises kv_cache_manager_v2._exceptions.CuError:
      CUresult.CUDA_ERROR_ILLEGAL_ADDRESS: 700
    mpirun aborts; pytest hangs.

  With fix:
    1 passed in 4:12, 0 IMA, gsm8k accuracy 95.30.

Waives removed
--------------
The two corresponding SKIP entries in tests/integration/test_lists/waives.txt
are removed so the fix is exercised in CI.

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
@Barry-Delaney Barry-Delaney force-pushed the user/barry/fix_nvbugs_6183549_kv_block_offset_guard branch from 1c75018 to e142cba on May 20, 2026 05:25
@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49344 [ run ] triggered by Bot. Commit: e142cba Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49187 [ run ] completed with state ABORTED. Commit: 1c75018

Link to invocation

Collaborator

@jiaganc jiaganc left a comment

LGTM

@tensorrt-cicd
Collaborator

PR_Github #49344 [ run ] completed with state SUCCESS. Commit: e142cba
/LLM/main/L0_MergeRequest_PR pipeline #39001 completed with status: 'FAILURE'

peihu-nv added 2 commits May 20, 2026 13:04
…_block_offset_guard

Signed-off-by: peihengh <259410613+peihu-nv@users.noreply.github.com>
Signed-off-by: peihengh <259410613+peihu-nv@users.noreply.github.com>
@peihu-nv
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49482 [ run ] triggered by Bot. Commit: 55eb3b5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49482 [ run ] completed with state SUCCESS. Commit: 55eb3b5
/LLM/main/L0_MergeRequest_PR pipeline #39124 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

@peihu-nv peihu-nv merged commit 433ac3d into NVIDIA:feat/deepseek_v4 May 20, 2026
6 checks passed

6 participants