[None][test] Waive qwen3_30b_a3b_fp8 pd_disagg mm-encoder test on single-GPU (NIXL setup failure) by venkywonka · Pull Request #14968 · NVIDIA/TensorRT-LLM

venkywonka · 2026-06-04T16:45:41Z

Summary

Waive test_mm_encoder_standalone.py::test_single_request_chat_multiple_images[pd_disagg-qwen3_30b_a3b_fp8], which fails fleet-wide on the pre-merge DGX_B200-PyTorch single-GPU stage.

NVBug: https://nvbugs/6269683

Failure

At test setup, constructing the pd_disagg LLM's disaggregated KV-cache transfer agent hits a hard NIXL assertion:

RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: status == NIXL_SUCCESS
  (.../executor/cache_transmission/nixl_utils/transferAgent.cpp:614)
  AgentConnectionManager::AgentConnectionManager(...) -> CacheTransceiver::CacheTransceiver(...)
RuntimeError: Executor worker returned error

In other words, NIXL registerMem fails while pinning the transceiver's transfer buffers during executor-worker construction on a single GPU. The sibling no_pd_disagg / raw_inputs sub-cases pass because they never build a NIXL CacheTransceiver.

Exact failing builds (CI Report)

This is not specific to any one PR: the same case fails identically across many unrelated pipelines (stage DGX_B200-PyTorch, single-GPU):

PR [TRTLLM-11457][feat] Async Ulysses pipeline (Enabled for LTX-2 + WAN) #13978, build 41402 — http://tensorrt-llm.tensorrt-llm-ci-report.sc2-paas.nvidia.com/?job=%2FLLM%2Fmain%2FL0_MergeRequest_PR&build=41402
PR [None][fix] Fix AutoDeploy accuracy tests #13925, build 41352 — http://tensorrt-llm.tensorrt-llm-ci-report.sc2-paas.nvidia.com/?job=%2FLLM%2Fmain%2FL0_MergeRequest_PR&build=41352
PR [None][fix] Stabilize Mamba replay state update #14841, build 41381 — http://tensorrt-llm.tensorrt-llm-ci-report.sc2-paas.nvidia.com/?job=%2FLLM%2Fmain%2FL0_MergeRequest_PR&build=41381
PR [TRTLLM-12842][feat] Maximal LLMAPI capture in usage telemetry #14398, build 41375 — http://tensorrt-llm.tensorrt-llm-ci-report.sc2-paas.nvidia.com/?job=%2FLLM%2Fmain%2FL0_MergeRequest_PR&build=41375
Post-merge (main), L0_PostMerge build 2759 — http://tensorrt-llm.tensorrt-llm-ci-report.sc2-paas.nvidia.com/?job=%2FLLM%2Fmain%2FL0_PostMerge&build=2759
This PR ([TRTLLM-12467][feat] EPD improvements #13864), build 41351 — http://tensorrt-llm.tensorrt-llm-ci-report.sc2-paas.nvidia.com/?job=%2FLLM%2Fmain%2FL0_MergeRequest_PR&build=41351

Also observed on #14599 (41365), #14524 (41373), #14941 (41337), #14908 (41360), #14948 (41357), and #12958 (41363). A related "Hang detected on rank 0 in PyExecutor" failure was seen on #14812 and #14891. The case passes only intermittently (~1/3, node-dependent), which is consistent with single-GPU NIXL EPD-disagg not being reliably supported on the DGX_B200 single-GPU nodes (no IB/RoCE fabric and/or memory pressure from the multi-instance disagg fixture).
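To put the ~1/3 pass rate in perspective, here is a rough back-of-the-envelope sketch. It assumes each CI run passes this case independently with probability exactly 1/3, which is a simplification because the outcome is described as node-dependent. The function names are illustrative only.

```python
from fractions import Fraction

# Assumption: the "~1/3" pass rate quoted in the PR, treated as an exact,
# independent per-run probability. Real runs are node-dependent, so this
# is only a rough model.
PASS_RATE = Fraction(1, 3)


def prob_all_runs_pass(num_runs: int, pass_rate: Fraction = PASS_RATE) -> Fraction:
    """Probability that the case passes in every one of `num_runs` CI runs."""
    return pass_rate ** num_runs


def expected_runs_until_pass(pass_rate: Fraction = PASS_RATE) -> Fraction:
    """Mean number of runs until the first pass (geometric distribution)."""
    return 1 / pass_rate


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} run(s): P(all pass) = {prob_all_runs_pass(n)}")
    print(f"expected runs until first pass: {expected_runs_until_pass()}")
```

Under these assumptions a pipeline would need about three attempts on average before this case passes, which fits the observation that many unrelated PRs hit the failure.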

Scope

One-line waives.txt entry; no source/product changes.
Stopgap until the single-GPU NIXL EPD-disagg path is fixed, or the pd_disagg variant is gated to multi-GPU (tracked in https://nvbugs/6269683).
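The second option, gating the pd_disagg variant to multi-GPU, could be sketched roughly as follows. This is not the repository's code: `MIN_GPUS`, `skip_reason`, and the two-GPU threshold for pd_disagg are hypothetical, chosen only to illustrate the idea.

```python
from typing import Optional

# Hypothetical minimum GPU counts per test mode. The threshold of 2 for
# pd_disagg is an assumption for illustration, not taken from the PR.
MIN_GPUS = {
    "no_pd_disagg": 1,
    "pd_disagg": 2,
}


def skip_reason(mode: str, gpu_count: int) -> Optional[str]:
    """Return a skip reason if `mode` should not run on `gpu_count` GPUs.

    Returns None when the mode is allowed to run. Unknown modes are
    treated as needing a single GPU.
    """
    required = MIN_GPUS.get(mode, 1)
    if gpu_count < required:
        return f"{mode} needs >= {required} GPUs, found {gpu_count}"
    return None


if __name__ == "__main__":
    # On a single-GPU stage only the pd_disagg variant would be skipped.
    for mode in ("no_pd_disagg", "pd_disagg"):
        print(mode, "->", skip_reason(mode, gpu_count=1) or "run")
```

In a pytest suite, a non-None reason from a helper like this could be passed to `pytest.skip(reason)` or used as the condition of a `pytest.mark.skipif` mark on the pd_disagg parameter, so the variant is skipped by the test itself rather than by a waives.txt entry.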

Summary by CodeRabbit

Tests
- Updated test configuration to skip a specific test case.

…gle-GPU test_single_request_chat_multiple_images[pd_disagg-qwen3_30b_a3b_fp8] in test_mm_encoder_standalone.py fails at setup on the pre-merge DGX_B200 single-GPU stage with a NIXL CacheTransceiver init assertion (status == NIXL_SUCCESS, transferAgent.cpp:614) when the pd_disagg LLM constructs its disaggregated KV-cache transfer agent. This is a fleet-wide failure on the single-GPU pre-merge stage, observed across many unrelated PRs (e.g. NVIDIA#13978, NVIDIA#13925, NVIDIA#14841, NVIDIA#14599, NVIDIA#14524, NVIDIA#14941, NVIDIA#14398); it passes only intermittently (~1/3) depending on node. Waiving until the single-GPU NIXL EPD-disagg path is fixed or the variant is gated to multi-GPU. NVBug: https://nvbugs/6269683 Signed-off-by: venkywonka <23023424+venkywonka@users.noreply.github.com>

venkywonka · 2026-06-04T18:57:35Z

/bot run

coderabbitai · 2026-06-04T19:00:53Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: efeb623d-b322-4a2c-a8de-fa463083cb1a

📥 Commits

Reviewing files that changed from the base of the PR and between 33b0a32 and 0e10683.

📒 Files selected for processing (1)

tests/integration/test_lists/waives.txt

📝 Walkthrough

Walkthrough

This PR adds a single waiver entry to the test skip list for a failing multimodal encoder test case parameterized with pd_disagg-qwen3_30b_a3b_fp8, tracked by nvbug 6269683.
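For readers unfamiliar with the skip list, the sketch below shows how such an entry might be assembled and parsed. The `<test id> SKIP (<bug URL>)` layout is an assumption inferred from the description in this PR, not copied from the diff, and the test ID omits whatever directory prefix the real entry carries. `format_waive` and `parse_waive` are hypothetical helpers.

```python
import re

# Test ID and bug URL as quoted in the PR description. The real waives.txt
# entry likely carries a directory prefix that is not shown here.
TEST_ID = (
    "test_mm_encoder_standalone.py::"
    "test_single_request_chat_multiple_images[pd_disagg-qwen3_30b_a3b_fp8]"
)
BUG_URL = "https://nvbugs/6269683"

# Assumed entry layout: "<test id> SKIP (<reason>)".
WAIVE_RE = re.compile(r"^(?P<test_id>\S+)\s+SKIP\s+\((?P<reason>[^)]*)\)$")


def format_waive(test_id: str, reason: str) -> str:
    """Build one skip-list line in the assumed layout."""
    return f"{test_id} SKIP ({reason})"


def parse_waive(line: str) -> tuple:
    """Split a skip-list line back into (test_id, reason)."""
    match = WAIVE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a waive entry: {line!r}")
    return match.group("test_id"), match.group("reason")


if __name__ == "__main__":
    entry = format_waive(TEST_ID, BUG_URL)
    print(entry)
    print(parse_waive(entry))
```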

Changes

Multimodal encoder test waiver

Layer / File(s)	Summary
Test waiver entry for multimodal encoder `tests/integration/test_lists/waives.txt`	Added a SKIP waiver entry for the `test_single_request_chat_multiple_images[pd_disagg-qwen3_30b_a3b_fp8]` test case in the multimodal encoder unit tests, referencing nvbug 6269683.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#14854: Updates tests/integration/test_lists/waives.txt by adding SKIP/waiver entries for failing tests using the same mechanism.
NVIDIA/TensorRT-LLM#14883: Modifies tests/integration/test_lists/waives.txt by adding new SKIP/waiver entries for failing tests, including waiving nvbug 6269683-related cases.
NVIDIA/TensorRT-LLM#14789: Modifies the same tests/integration/test_lists/waives.txt file by adding SKIP waiver entries for failing multimodal/openai-disagg/multigpu test cases.

Suggested reviewers

crazydemo
jieli-matrix
LarryXFly
StanleySun639
xinhe-nv

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title clearly and specifically describes the main change: waiving a specific test case for qwen3_30b_a3b_fp8 with pd_disagg on single-GPU due to NIXL setup failure.
Description check	✅ Passed	The pull request description provides a comprehensive explanation of the issue, failure details, exact failing builds with CI links, and scope, exceeding the template requirements for clarity and justification.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


tensorrt-cicd · 2026-06-04T19:03:06Z

PR_Github #52139 [ run ] triggered by Bot. Commit: 0e10683 Link to invocation

tensorrt-cicd · 2026-06-04T19:52:55Z

PR_Github #52139 [ run ] completed with state FAILURE. Commit: 0e10683
/LLM/main/L0_MergeRequest_PR pipeline #41463 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

venkywonka · 2026-06-04T20:12:03Z

/bot run

tensorrt-cicd · 2026-06-04T20:18:06Z

PR_Github #52157 [ run ] triggered by Bot. Commit: 0e10683 Link to invocation

tensorrt-cicd · 2026-06-05T00:59:41Z

PR_Github #52157 [ run ] completed with state SUCCESS. Commit: 0e10683
/LLM/main/L0_MergeRequest_PR pipeline #41479 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

github-actions Bot assigned venkywonka Jun 4, 2026

venkywonka force-pushed the venky/waive-qwen3-30b-pd-disagg-nixl branch from 1f744f4 to 0e10683 on June 4, 2026 18:27

venkywonka marked this pull request as ready for review June 4, 2026 18:57

venkywonka requested a review from 2ez4bz June 4, 2026 20:11

venkywonka wants to merge 1 commit into NVIDIA:main from venkywonka:venky/waive-qwen3-30b-pd-disagg-nixl

2 participants
