[TRTLLM-12732][fix] Fence V2 _batched_migrate behind execution_stream #14245
…ion_stream

Re-applies PR13488 (cherry-picked as c4fbc24, reverted by 8355a6f without
rationale) on top of feat/deepseek_v4.

Root cause
----------
`StorageManager._batched_migrate` runs cross-tier copy on a transient
non-blocking stream from `TemporaryCudaStream`'s pool (handle in the
~10^14 range), while forward kernels run on `execution_stream` (one
specific stream, handle in the ~10^8 range). Empirically verified via
runtime stream-id dump: the two streams are distinct cudaStream_t handles.

Without an explicit fence the copy stream may begin recycling a slot
before in-flight forward kernels on execution_stream have released it,
producing CUDA_ERROR_ILLEGAL_ADDRESS when the forward path reads or the
copy path writes the slot concurrently.

The pre-existing fences do not cover this race:

* PR14128's `_prepare_inputs_event` fences host-staging buffer vs
  in-flight H2D copy enqueued during `_prepare_inputs`.
* 044c13c's memoryview-sync fences host-visible base_page_indices vs
  slot ready_events.
* Neither fences the cross-tier copy stream against forward kernels on
  execution_stream.

Fix
---
Wire `execution_stream` into `StorageManager` so `_batched_migrate` can
record a fresh event on it and add it to the copy stream's prior_events
wait set. This is the V2 equivalent of V1's `syncWithBufferManager()`
two-way sync between the BufferManager stream and the offload/onboard
streams.

The companion +22 lines of interim memoryview-sync in `_KVCache.resize`
(introduced by 044c13c) become redundant once this fence is in place and
are removed -- exactly as PR13488 originally did.

Empirical verification
----------------------
With the same DEP8 / static_eplb / MEGAMOE_DEEPGEMM workload on B300:

Without fix (latest feat tip f4d5e59):
  RANK 3 raises kv_cache_manager_v2._exceptions.CuError:
    CUresult.CUDA_ERROR_ILLEGAL_ADDRESS: 700
  mpirun aborts; pytest hangs.

With fix:
  1 passed in 4:12, 0 IMA, gsm8k accuracy 95.30.

Waives removed
--------------
The two corresponding SKIP entries in
tests/integration/test_lists/waives.txt are removed so the fix is
exercised in CI.

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
…_block_offset_guard

Signed-off-by: peihengh <259410613+peihu-nv@users.noreply.github.com>
Signed-off-by: peihengh <259410613+peihu-nv@users.noreply.github.com>
Summary
Re-applies PR13488 (`c4fbc246b2`), reverted by `8355a6f349` without rationale, on top of `feat/deepseek_v4`. Fixes the DSv4-Flash + EPLB + MEGAMOE_DEEPGEMM IMA on 4xB300 (nvbug 6183549) and removes the two corresponding SKIP entries in `waives.txt`.

Root cause

`StorageManager._batched_migrate` copies on a transient non-blocking stream from `TemporaryCudaStream`'s pool; forward kernels run on `execution_stream`. The two are distinct `cudaStream_t` handles (verified below). Without an explicit fence, the copy stream may recycle a slot before in-flight forward kernels release it → `CUDA_ERROR_ILLEGAL_ADDRESS`.

The other fences already on the branch cover different races:

* PR14128's `_prepare_inputs_event` in `_prepare_inputs` (records on `execution_stream`): host-staging buffer vs in-flight H2D copy.
* `044c13c41e` memoryview-sync in `_KVCache.resize`: `base_page_indices` readers vs slot `ready_events`.
* Neither covers the `_batched_migrate` copy stream vs in-flight forward on `execution_stream`, which is the race this PR fences.
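To make the ordering argument concrete, here is a minimal pure-Python sketch of the race and of the fence (record an event on `execution_stream`, then have the copy stream wait on it). It is a toy model, not the `StorageManager` code: `Stream`, `Event`, `run`, and `migrate` are hypothetical stand-ins that only mimic FIFO stream ordering and event waits, and the scheduler deliberately favours the copy stream to show the worst-case interleaving.

```python
from collections import deque


class Event:
    """Toy stand-in for a CUDA event: done once the recording stream reaches it."""

    def __init__(self):
        self.done = False


class Stream:
    """Toy stand-in for a CUDA stream: a FIFO queue of operations."""

    def __init__(self, name):
        self.name = name
        self.ops = deque()

    def launch(self, label):
        self.ops.append(("work", label))

    def record(self):
        event = Event()
        self.ops.append(("record", event))
        return event

    def wait(self, event):
        self.ops.append(("wait", event))

    def step(self, trace):
        """Run the head operation if it is runnable; return True on progress."""
        if not self.ops:
            return False
        kind, arg = self.ops[0]
        if kind == "wait" and not arg.done:
            return False  # blocked until the event's stream reaches the record
        self.ops.popleft()
        if kind == "work":
            trace.append(arg)
        elif kind == "record":
            arg.done = True
        return True


def run(streams_by_priority):
    """Adversarial scheduler: always advance the first stream that can progress."""
    trace = []
    while any(stream.step(trace) for stream in streams_by_priority):
        pass
    return trace


def migrate(fenced):
    """Hypothetical model of one migration racing one forward kernel."""
    execution_stream = Stream("execution")
    copy_stream = Stream("copy")

    execution_stream.launch("forward reads slot")

    prior_events = []
    if fenced:
        # The fence: a fresh event on execution_stream joins the wait set.
        prior_events.append(execution_stream.record())
    for event in prior_events:
        copy_stream.wait(event)
    copy_stream.launch("copy overwrites slot")

    # Prefer the copy stream to expose the worst-case interleaving.
    return run([copy_stream, execution_stream])


print(migrate(fenced=False))  # copy runs first: the slot is recycled too early
print(migrate(fenced=True))   # forward runs first: the copy waits on the event
```

In this model the unfenced run lets the copy overwrite the slot before the forward kernel has read it, while the fenced run cannot reorder the two regardless of scheduling, because the copy stream's wait only clears after `execution_stream` has passed the recorded point.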