
[None][test] Waive qwen3_30b_a3b_fp8 pd_disagg mm-encoder test on single-GPU (NIXL setup failure)#14968

Open
venkywonka wants to merge 1 commit into
NVIDIA:mainfrom
venkywonka:venky/waive-qwen3-30b-pd-disagg-nixl

Conversation

@venkywonka
Collaborator

@venkywonka venkywonka commented Jun 4, 2026

Summary

Waive test_mm_encoder_standalone.py::test_single_request_chat_multiple_images[pd_disagg-qwen3_30b_a3b_fp8], which fails fleet-wide on the pre-merge DGX_B200-PyTorch single-GPU stage.

NVBug: https://nvbugs/6269683

Failure

At test setup, constructing the pd_disagg LLM's disaggregated KV-cache transfer agent hits a hard NIXL assertion:

RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: status == NIXL_SUCCESS
  (.../executor/cache_transmission/nixl_utils/transferAgent.cpp:614)
  AgentConnectionManager::AgentConnectionManager(...) -> CacheTransceiver::CacheTransceiver(...)
RuntimeError: Executor worker returned error

i.e. NIXL registerMem fails while pinning the transceiver's transfer buffers during executor-worker construction on a single GPU. The sibling no_pd_disagg / raw_inputs sub-cases pass (they never build a NIXL CacheTransceiver).

Exact failing builds (CI Report)

Not specific to any one PR — the same case fails identically across many unrelated pipelines (stage DGX_B200-PyTorch, single-GPU):

Also observed on #14599 (41365), #14524 (41373), #14941 (41337), #14908 (41360), #14948 (41357), and #12958 (41363); a related "Hang detected on rank 0 in PyExecutor" failure appeared on #14812/#14891. The case passes only intermittently (~1/3 of runs, node-dependent), which is consistent with single-GPU NIXL EPD-disagg not being reliably supported on the DGX_B200 single-GPU nodes (no IB/RoCE fabric and/or memory pressure from the multi-instance disagg fixture).
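
To put the ~1/3 pass rate in perspective, here is a rough back-of-the-envelope sketch. It assumes every run passes independently with probability 1/3, which is a simplification (the failures are reported as node-dependent), and the function names are illustrative only.

```python
from fractions import Fraction

# Assumption: each pre-merge run of the case passes independently with
# probability ~1/3 (the rough figure quoted above). Real runs are
# node-dependent, so treat these numbers as an order-of-magnitude guide.
PASS_RATE = Fraction(1, 3)


def prob_all_pass(num_runs: int, pass_rate: Fraction = PASS_RATE) -> Fraction:
    """Probability that every one of `num_runs` independent runs passes."""
    return pass_rate ** num_runs


def expected_runs_until_pass(pass_rate: Fraction = PASS_RATE) -> Fraction:
    """Mean number of runs until the first pass (geometric distribution)."""
    return 1 / pass_rate


if __name__ == "__main__":
    print(prob_all_pass(1))            # one PR passing on its first run
    print(prob_all_pass(3))            # three unrelated PRs all passing
    print(expected_runs_until_pass())  # average reruns needed per PR
```

Under that assumption, three unrelated PRs would all pass on their first run only 1 time in 27, and each PR would need about three runs on average before this case passes.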

Scope

  • One-line waives.txt entry; no source/product changes.
  • Stopgap until the single-GPU NIXL EPD-disagg path is fixed, or the pd_disagg variant is gated to multi-GPU (tracked in https://nvbugs/6269683).
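
One way the multi-GPU gating mentioned above could look is sketched below. This is not the repository's fixture code: `MIN_GPUS_FOR_VARIANT` and `skip_reason` are hypothetical names, and the threshold of 2 GPUs for `pd_disagg` is an assumption rather than a stated requirement.

```python
from typing import Optional

# Hypothetical mapping from test variant to the minimum number of GPUs it
# needs. The value 2 for "pd_disagg" is an assumption for illustration.
MIN_GPUS_FOR_VARIANT = {
    "pd_disagg": 2,
    "no_pd_disagg": 1,
}


def skip_reason(variant: str, available_gpus: int) -> Optional[str]:
    """Return a skip message if `variant` should not run, else None."""
    required = MIN_GPUS_FOR_VARIANT.get(variant, 1)
    if available_gpus < required:
        return (f"{variant} needs at least {required} GPUs, "
                f"found {available_gpus}")
    return None


if __name__ == "__main__":
    # On a single-GPU node only the pd_disagg variant is skipped.
    for variant in ("pd_disagg", "no_pd_disagg"):
        print(variant, "->", skip_reason(variant, available_gpus=1))
```

In a pytest suite, the returned message could be passed to `pytest.skip(...)` from the fixture that builds the LLM, with the GPU count taken from something like `torch.cuda.device_count()`.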

Summary by CodeRabbit

  • Tests
    • Updated test configuration to skip a specific test case.

…gle-GPU

test_single_request_chat_multiple_images[pd_disagg-qwen3_30b_a3b_fp8] in test_mm_encoder_standalone.py fails at setup on the pre-merge DGX_B200 single-GPU stage with a NIXL CacheTransceiver init assertion (status == NIXL_SUCCESS, transferAgent.cpp:614) when the pd_disagg LLM constructs its disaggregated KV-cache transfer agent.

This is a fleet-wide failure on the single-GPU pre-merge stage, observed across many unrelated PRs (e.g. NVIDIA#13978, NVIDIA#13925, NVIDIA#14841, NVIDIA#14599, NVIDIA#14524, NVIDIA#14941, NVIDIA#14398); it passes only intermittently (~1/3) depending on node. Waiving until the single-GPU NIXL EPD-disagg path is fixed or the variant is gated to multi-GPU.

NVBug: https://nvbugs/6269683
Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>
@venkywonka venkywonka force-pushed the venky/waive-qwen3-30b-pd-disagg-nixl branch from 1f744f4 to 0e10683 Compare June 4, 2026 18:27
@venkywonka venkywonka marked this pull request as ready for review June 4, 2026 18:57
@venkywonka
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented Jun 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: efeb623d-b322-4a2c-a8de-fa463083cb1a

📥 Commits

Reviewing files that changed from the base of the PR and between 33b0a32 and 0e10683.

📒 Files selected for processing (1)
  • tests/integration/test_lists/waives.txt

📝 Walkthrough

This PR adds a single waiver entry to the test skip list for a failing multimodal encoder test case parameterized with pd_disagg-qwen3_30b_a3b_fp8, tracked by nvbug 6269683.

Changes

Multimodal encoder test waiver

| Layer / File(s) | Summary |
|---|---|
| Test waiver entry for multimodal encoder (tests/integration/test_lists/waives.txt) | Added a SKIP waiver entry for the test_single_request_chat_multiple_images[pd_disagg-qwen3_30b_a3b_fp8] test case in the multimodal encoder unit tests, referencing nvbug 6269683. |
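
For readers unfamiliar with waiver lists, the sketch below parses a single entry of the kind described above. The layout `<test id> SKIP (<reason>)` is an assumption about the file format, the sample line omits the test's directory prefix because it is not shown in this conversation, and `Waiver` and `parse_waiver` are hypothetical helpers, not part of the repository.

```python
import re
from typing import NamedTuple, Optional


class Waiver(NamedTuple):
    test_id: str
    action: str
    reason: str


# Assumed layout: "<test id> SKIP (<reason>)". The test id may contain
# brackets and dashes but no whitespace.
_WAIVER_RE = re.compile(
    r"^(?P<test_id>\S+)\s+(?P<action>SKIP)\s+\((?P<reason>.*)\)\s*$"
)


def parse_waiver(line: str) -> Optional[Waiver]:
    """Parse one waiver line; return None for blanks, comments, or no match."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    match = _WAIVER_RE.match(stripped)
    if match is None:
        return None
    return Waiver(match["test_id"], match["action"], match["reason"])


# Illustrative entry only; the real line in waives.txt may differ.
SAMPLE = (
    "test_mm_encoder_standalone.py::test_single_request_chat_multiple_images"
    "[pd_disagg-qwen3_30b_a3b_fp8] SKIP (https://nvbugs/6269683)"
)

if __name__ == "__main__":
    print(parse_waiver(SAMPLE))
```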

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#14854: Updates tests/integration/test_lists/waives.txt by adding SKIP/waiver entries for failing tests using the same mechanism.
  • NVIDIA/TensorRT-LLM#14883: Modifies tests/integration/test_lists/waives.txt by adding new SKIP/waiver entries for failing tests, including waiving nvbug 6269683-related cases.
  • NVIDIA/TensorRT-LLM#14789: Modifies the same tests/integration/test_lists/waives.txt file by adding SKIP waiver entries for failing multimodal/openai-disagg/multigpu test cases.

Suggested reviewers

  • crazydemo
  • jieli-matrix
  • LarryXFly
  • StanleySun639
  • xinhe-nv
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The pull request title clearly and specifically describes the main change: waiving a specific test case for qwen3_30b_a3b_fp8 with pd_disagg on single-GPU due to NIXL setup failure. |
| Description check | ✅ Passed | The pull request description provides a comprehensive explanation of the issue, failure details, exact failing builds with CI links, and scope, exceeding the template requirements for clarity and justification. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@tensorrt-cicd
Collaborator

PR_Github #52139 [ run ] triggered by Bot. Commit: 0e10683 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52139 [ run ] completed with state FAILURE. Commit: 0e10683
/LLM/main/L0_MergeRequest_PR pipeline #41463 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@venkywonka venkywonka requested a review from 2ez4bz June 4, 2026 20:11
@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #52157 [ run ] triggered by Bot. Commit: 0e10683 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52157 [ run ] completed with state SUCCESS. Commit: 0e10683
/LLM/main/L0_MergeRequest_PR pipeline #41479 completed with status: 'FAILURE'

