[None][test] Waive qwen3_30b_a3b_fp8 pd_disagg mm-encoder test on single-GPU (NIXL setup failure) #14968
…gle-GPU test_single_request_chat_multiple_images[pd_disagg-qwen3_30b_a3b_fp8] in test_mm_encoder_standalone.py fails at setup on the pre-merge DGX_B200 single-GPU stage with a NIXL CacheTransceiver init assertion (status == NIXL_SUCCESS, transferAgent.cpp:614) when the pd_disagg LLM constructs its disaggregated KV-cache transfer agent. This is a fleet-wide failure on the single-GPU pre-merge stage, observed across many unrelated PRs (e.g. NVIDIA#13978, NVIDIA#13925, NVIDIA#14841, NVIDIA#14599, NVIDIA#14524, NVIDIA#14941, NVIDIA#14398); it passes only intermittently (~1/3) depending on node. Waiving until the single-GPU NIXL EPD-disagg path is fixed or the variant is gated to multi-GPU.
NVBug: https://nvbugs/6269683
Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
Summary
Waive test_mm_encoder_standalone.py::test_single_request_chat_multiple_images[pd_disagg-qwen3_30b_a3b_fp8], which fails fleet-wide on the pre-merge DGX_B200-PyTorch single-GPU stage.
NVBug: https://nvbugs/6269683
Failure
At test setup, constructing the pd_disagg LLM's disaggregated KV-cache transfer agent hits a hard NIXL assertion (status == NIXL_SUCCESS, transferAgent.cpp:614), i.e. NIXL registerMem fails while pinning the transceiver's transfer buffers during executor-worker construction on a single GPU. The sibling no_pd_disagg / raw_inputs sub-cases pass (they never build a NIXL CacheTransceiver).
Exact failing builds (CI Report)
Not specific to any one PR: the same case fails identically across many unrelated pipelines (stage DGX_B200-PyTorch, single-GPU):
L0_PostMerge build 2759: http://tensorrt-llm.tensorrt-llm-ci-report.sc2-paas.nvidia.com/?job=%2FLLM%2Fmain%2FL0_PostMerge&build=2759
Also observed on #14599 (41365), #14524 (41373), #14941 (41337), #14908 (41360), #14948 (41357), #12958 (41363); related "Hang detected on rank 0 in PyExecutor" on #14812/#14891.
Passes only intermittently (~1/3, node-dependent), which is consistent with single-GPU NIXL EPD-disagg not being reliably supported on the DGX_B200 single-GPU nodes (no IB/RoCE fabric and/or memory pressure from the multi-instance disagg fixture).
Scope
One waives.txt entry; no source/product changes.
Waived until the single-GPU NIXL EPD-disagg path is fixed or the pd_disagg variant is gated to multi-GPU (tracked in https://nvbugs/6269683).
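For reviewers unfamiliar with waiver lists, here is a rough sketch of how a skip-list entry can scope a waiver to exactly one parametrized case while leaving its siblings active. It is not the repository's actual loader: the `<test id> SKIP (<reason>)` line layout, the helper name `parse_waives`, the shortened test ID (shown here without its directory prefix), and the `no_pd_disagg-qwen3_30b_a3b_fp8` sibling ID are all assumptions made for illustration.

```python
from typing import Dict


def parse_waives(text: str) -> Dict[str, str]:
    """Map each waived test ID to its reason.

    Assumes one entry per line in the form '<test id> SKIP (<reason>)';
    this layout is an assumption, not a statement about the real file.
    """
    waived: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        test_id, sep, rest = line.partition(" SKIP")
        if not sep:
            continue  # not a waiver line under the assumed layout
        waived[test_id.strip()] = rest.strip().strip("()")
    return waived


# Hypothetical entry; the ID is shortened to the form used in this PR text.
WAIVES = (
    "test_mm_encoder_standalone.py::test_single_request_chat_multiple_images"
    "[pd_disagg-qwen3_30b_a3b_fp8] SKIP (https://nvbugs/6269683)\n"
)

waived = parse_waives(WAIVES)
```

Because the match is on the full parametrized ID, only the `pd_disagg` variant is skipped; sub-cases with different parameter IDs do not match the entry and keep running.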