[https://nvbugs/6095421][fix] fix PP>=3 executor shutdown hang in broadcast sample state loop#13267
Conversation
|
/bot run --disable-fail-fast |
|
PR_Github #44686 [ run ] triggered by Bot. Commit: |
📝 WalkthroughWalkthroughThe changes enhance request completion tracking in the PyExecutor by introducing per-request synchronization primitives and methods, while refactoring KV cache capacity configuration to synchronize across distributed ranks. MPI communicator handling for the broadcast loop is also optimized to pre-duplicate and reuse the communicator instead of creating/freeing it per iteration. Changes
Sequence Diagram(s)sequenceDiagram
participant Root as Root Rank (0)
participant Other as Other Ranks
participant PyExec as PyExecutor
participant CUDA as CUDA Memory
Root->>PyExec: enqueue_warmup_requests(req_ids)
PyExec->>PyExec: distribute to workers
Root->>PyExec: track_request_completions(req_ids)
PyExec->>PyExec: store in tracked_request_completion_ids
Root->>PyExec: start_worker()
par
Root->>PyExec: await_responses(req_ids)<br/>(rank==0 only)
Other->>PyExec: process locally
end
PyExec->>PyExec: _do_terminate_request()<br/>(all ranks)
PyExec->>PyExec: notify_all() on<br/>request_completion_cv
Root->>PyExec: await_request_completions(req_ids)<br/>(barrier across ranks)
Other->>PyExec: await_request_completions(req_ids)<br/>(barrier across ranks)
par
Root->>CUDA: read memory statistics
Other->>CUDA: ready
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsTimed out fetching pipeline failures after 30000ms Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 433-446: The current change removes rank-local error propagation:
instead of only calling py_executor.await_responses(req_ids) and checking
ExecutorResponse.has_error()/error_msg on rank 0, restore per-rank error
checking (call py_executor.await_responses(req_ids) and inspect each
ExecutorResponse.has_error() on every rank) before reading memory stats, or
implement an explicit distributed error reduction (use py_executor.dist.rank and
a collective to fail-fast) that aggregates any per-rank error flags/messages and
raises a RuntimeError with the combined message; ensure checks reference
py_executor.await_responses, ExecutorResponse.has_error, and response.error_msg
and run on non-zero ranks prior to
py_executor.await_request_completions(req_ids).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 32f5b885-213e-4e83-8e0c-210281e417cb
📒 Files selected for processing (3)
tensorrt_llm/_torch/pyexecutor/_util.pytensorrt_llm/_torch/pyexecutor/py_executor.pytests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/waives.txt
|
PR_Github #44686 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #45209 [ run ] triggered by Bot. Commit: |
|
PR_Github #45209 [ run ] completed with state
|
63ecd5e to
999e25a
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #45858 [ run ] triggered by Bot. Commit: |
|
/bot help |
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. Run See details below for each supported subcommand. Details
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request. skip
Skip testing for latest commit on pull request. reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break. |
|
/bot kill |
|
PR_Github #45946 [ kill ] triggered by Bot. Commit: |
|
PR_Github #45946 [ kill ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #46153 [ run ] triggered by Bot. Commit: |
|
PR_Github #46153 [ run ] completed with state
|
999e25a to
feb9fe0
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #46270 [ run ] triggered by Bot. Commit: |
|
PR_Github #46270 [ run ] completed with state
|
feb9fe0 to
9980b92
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #46408 [ run ] triggered by Bot. Commit: |
|
PR_Github #46408 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #46547 [ run ] triggered by Bot. Commit: |
|
PR_Github #46547 [ run ] completed with state
|
9980b92 to
1a5b4eb
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #46944 [ run ] triggered by Bot. Commit: |
|
PR_Github #46944 [ run ] completed with state
|
…adcast sample state loop Signed-off-by: Yihan Wang <yihwang@nvidia.com>
1a5b4eb to
0049242
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #47291 [ run ] triggered by Bot. Commit: |
|
PR_Github #47291 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47398 [ run ] triggered by Bot. Commit: |
|
PR_Github #47398 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47575 [ run ] triggered by Bot. Commit: |
|
PR_Github #47575 [ run ] completed with state |
…adcast sample state loop (NVIDIA#13267) Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Description
Background
TestLlama3_1_8BInstruct::test_bfloat16_4gpus[pp4-attn_backend=FLASHINFER-torch_compile=False]and other PP4 cases deadlock during KV cache size estimation, right after the single dummy request is processed. The hang consistently reproduces withoutlogger.infocalls on the wait path and disappears when any I/O is injected — a classic timing-sensitive MPI/UCX deadlock. All cases were previously waived undernvbug/6095421.Root cause is three independent PP event-loop / MPI-transport issues that compound at shutdown. Fixing any one of them reduces but does not eliminate the hang; all three must be fixed.
Root cause & fix
1.
MPI_Comm_dup()race at startupThe
broadcast_sample_state_handlerthread called the collectiveMPI_Comm_dupon itself, concurrently with the event-loop thread's point-to-point traffic on the original communicator. Thread-scheduling jitter could let the event loop'sisend/recv_objectgrab the pkl5 intracomm send/recv locks beforeDup()returned, producing an intermittent startup deadlock.Fix: Move
Dup()to the main thread insidestart_worker(), before either the worker or broadcast thread starts. All ranks participate in the collective while no rank is yet issuing point-to-point calls.2. Freeing the duplicated comm while peers still had pending pkl5 sub-requests
The PP ring is asymmetric:
3 → 0 → 1 → 2 is_second_last_pp_rank (rank 2): only recv, no send is_last_pp_rank (rank 3): only send, no recv
Rank 2's broadcast thread therefore finishes its end-of-loop flush as a no-op, receives the shutdown sentinel, enters
finally, and callsbroadcast_mpi_comm.Free()almost immediately. pkl5 large objects are delivered as multiple underlyingMPI_Requestobjects (header + data chunks); once the receiving side has torn down its handle to the communicator,MPI_Teston the peer ranks' subsidiary requests stops making progress, leavingwait_on_pp_send_handlesspinning forever.Fix: Do not
Free()the duplicated communicator in the broadcast thread'sfinally. Let it live until process teardown;MPI_Finalizereclaims it. Cost is at most a couple of leaked comms per process lifetime (estimation + real executor).3. UCX worker progress starvation at shutdown
Even after (1) and (2), the remaining rank's broadcast thread was stuck in
MPI_Waitall(count=2) → opal_progress → ucp_worker_progress → pthread_spin_lock (busy-spin, wchan=0)
while peer ranks had moved on to
MPI_Allreduceon the main communicator inside the next executor'scalculate_max_num_blocks. UCX rendezvous sends need two-sided progress to finalize, but no peer was polling the broadcast communicator anymore; the stuck rank could not push the handshake through on its own.Fix: Remove the end-of-iteration
wait_on_pp_send_handlesin_broadcast_sample_state_loop. Correctness is preserved because the drain at the top of the next_ring_broadcast_sample_statealready waits on the previous isend for the samemicrobatch_idbefore posting a new one. During shutdown the shared UCX worker drives broadcast-comm progress implicitly via subsequent MPI activity on the main communicator; remaining requests are reclaimed atMPI_Finalize.Side refactor (not load-bearing for the hang fix)
KV cache estimation previously had every rank wait on
await_responses. Only rank 0 actually needs the response payload (to surfaceresponse.has_error()). Estimation now has:await_responses(req_ids)and error-checkstrack_request_completions/await_request_completions)This makes the estimation lifecycle easier to reason about and removes the non-rank-0 dependency on response delivery, which is unnecessary for estimation.
Files changed
tensorrt_llm/_torch/pyexecutor/py_executor.pystart_worker(): moveMPI_Comm_dup()to main thread_broadcast_sample_state_loop(): use pre-duplicated comm, drop end-of-loop flush, dropFree()_do_terminate_request(): notify tracked req-id completiontrack_request_completions,await_request_completionstensorrt_llm/_torch/pyexecutor/_util.pyconfigure_kv_cache_capacity(): rank-0 awaits response, all ranks await local terminationtests/integration/test_lists/waives.txtVerification
Locally on 4× B300 (inside
tensorrt_llm-develcontainer), each case previously waived undernvbug/6095421withpp4in its id:TestLlama3_1_8BInstruct::test_bfloat16_4gpus[pp4-attn_backend=FLASHINFER-torch_compile=False](82 s)TestLlama3_1_8BInstruct::test_fp8_4gpus[pp4-fp8kv=True-attn_backend=TRTLLM-torch_compile=False]TestDeepSeekV3Lite::test_bfloat16_4gpus[pp4-mtp_nextn=0-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False]TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=CUTLASS-mtp_nextn=0-pp4-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-low_precision_combine=False-torch_compile=True]TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=CUTLASS-mtp_nextn=0-pp4-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-low_precision_combine=False-torch_compile=True]TestDeepSeekV3Lite::test_fp8_block_scales_4gpus[pp4-mtp_nextn=0-fp8kv=False-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False-sampler_async_worker=False]TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=CUTLASS-mtp_nextn=2-pp4-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-low_precision_combine=False-torch_compile=False]— no longer hangs; now fails with an unrelated MoE NVFP4 autotune assertion atmoe_gemm_template_dispatch_tma_ws.h:108("No Smem epilogue schedule is not supported for block scaled types or finalize fusion"). Tracked separately.Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests