[https://nvbugs/6027560][fix] fix hang issues on DGX_B200-8_GPUs-PyTo… by bo-nv · Pull Request #12656 · NVIDIA/TensorRT-LLM

bo-nv · 2026-04-01T09:46:56Z

…rch-1

Summary by CodeRabbit

Release Notes

Tests
- Updated test infrastructure with improved configuration for disaggregated serving scenarios.
- Removed test waivers for multiple disaggregated serving test configurations, indicating these tests are now expected to pass reliably.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

bo-nv · 2026-04-01T09:49:36Z

/bot run --stage-list "DGX_B200-8_GPUs-PyTorch-1"

coderabbitai · 2026-04-01T09:52:12Z

📝 Walkthrough

Walkthrough

Added UCX TLS configuration (UCX_TLS = "^ib") to the per-worker environment for both context and generation servers in disaggregated serving launches. Removed test skip/waive entries for multiple disaggregated serving test cases, suggesting these tests should now pass.

Changes

Cohort / File(s)	Summary
Disaggregated Serving Environment Configuration `tests/integration/defs/accuracy/test_disaggregated_serving.py`	Added `env["UCX_TLS"] = "^ib"` configuration to per-worker environment for context servers and generation servers in `launch_disaggregated_llm`.
Test Waive List Updates `tests/integration/test_lists/waives.txt`	Removed 7 SKIP waive entries for disaggregated serving tests, including entries for TestDeepSeekV32Exp, TestDeepSeekV3Lite, TestNemotron3Super120B, and TestQwen3_8B test cases.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description contains only template boilerplate with empty sections (Description and Test Coverage); no actual explanation of the changes, problem statement, or testing approach is provided.	Complete the Description section to explain the hang issue, why UCX_TLS configuration fixes it, and list the test cases that validate the fix in the Test Coverage section.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title references a specific issue (NVBugs `#6027560`) and indicates a fix for hang issues on DGX_B200-8 GPUs, which aligns with the changeset's purpose of modifying UCX configuration and removing test waivers to resolve hanging issues.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

⚔️ Resolve merge conflicts

Resolve merge conflict in branch main-6027560

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/integration/defs/accuracy/test_disaggregated_serving.py (1)

270-283: ⚠️ Potential issue | 🟠 Major

Preserve the ^ib workaround instead of overwriting it later.

Lines 12/45 set UCX_TLS="^ib" (B200 hang mitigation), but lines 24/56 unconditionally overwrite it to UCX_TLS="^cuda_ipc" when has_nvlink() returns false, entirely dropping the hang mitigation. Use valid UCX syntax ^ib,cuda_ipc to exclude both transports in a single assignment.

Proposed fix

     for i, port in enumerate(ctx_ports):
         env = base_env.copy()
         env["TRTLLM_USE_UCX_KVCACHE"] = "1"
-        # Need to set UCX_TLS to ^ib to avoid hangs on CI B200 cluster.
-        env["UCX_TLS"] = "^ib"
+        # Exclude ib to avoid hangs on CI B200 cluster; also exclude cuda_ipc on non-NVLink systems.
+        ucx_excludes = ["ib"]
+        if not has_nvlink():
+            ucx_excludes.append("cuda_ipc")
+        env["UCX_TLS"] = "^" + ",".join(ucx_excludes)
         if enable_perf:
             env["TRTLLM_KVCACHE_TIME_OUTPUT_PATH"] = kv_cache_perf_dir
@@ -22,8 +22,6 @@
         gpu_range = range(current_gpu_offset,
                           current_gpu_offset + ctx_total_gpus)
         env["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, gpu_range))
-        if not has_nvlink():
-            env["UCX_TLS"] = "^cuda_ipc"
         current_gpu_offset += ctx_total_gpus
 
     for i, port in enumerate(gen_ports):
         env = base_env.copy()
         env["TRTLLM_USE_UCX_KVCACHE"] = "1"
-        # Need to set UCX_TLS to ^ib to avoid hangs on CI B200 cluster.
-        env["UCX_TLS"] = "^ib"
+        # Exclude ib to avoid hangs on CI B200 cluster; also exclude cuda_ipc on non-NVLink systems.
+        ucx_excludes = ["ib"]
+        if not has_nvlink():
+            ucx_excludes.append("cuda_ipc")
+        env["UCX_TLS"] = "^" + ",".join(ucx_excludes)
         if enable_perf:
             env["TRTLLM_KVCACHE_TIME_OUTPUT_PATH"] = kv_cache_perf_dir
@@ -52,8 +52,6 @@
         gpu_range = range(current_gpu_offset,
                           current_gpu_offset + gen_total_gpus)
         env["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, gpu_range))
-        if not has_nvlink():
-            env["UCX_TLS"] = "^cuda_ipc"
         current_gpu_offset += gen_total_gpus

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/accuracy/test_disaggregated_serving.py` around lines
270 - 283, The current logic sets env["UCX_TLS"] = "^ib" then later overwrites
it to "^cuda_ipc" when has_nvlink() is false, losing the "^ib" workaround;
change the update inside the has_nvlink() check to preserve the ib exclusion by
assigning env["UCX_TLS"] = "^ib,cuda_ipc" (or append ",cuda_ipc" to the existing
env["UCX_TLS"] if present) so both transports are excluded; modify the code
around the existing env assignment and the has_nvlink() branch (referencing env
and has_nvlink()) to perform this combined exclusion instead of an unconditional
overwrite.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tests/integration/defs/accuracy/test_disaggregated_serving.py`:
- Around line 270-283: The current logic sets env["UCX_TLS"] = "^ib" then later
overwrites it to "^cuda_ipc" when has_nvlink() is false, losing the "^ib"
workaround; change the update inside the has_nvlink() check to preserve the ib
exclusion by assigning env["UCX_TLS"] = "^ib,cuda_ipc" (or append ",cuda_ipc" to
the existing env["UCX_TLS"] if present) so both transports are excluded; modify
the code around the existing env assignment and the has_nvlink() branch
(referencing env and has_nvlink()) to perform this combined exclusion instead of
an unconditional overwrite.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cfdbdea3-8a31-4614-a76c-151681570382

📥 Commits

Reviewing files that changed from the base of the PR and between 075e36a and 940132d.

📒 Files selected for processing (2)

tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/integration/test_lists/waives.txt

💤 Files with no reviewable changes (1)

tests/integration/test_lists/waives.txt

bo-nv · 2026-04-01T09:55:54Z

/bot run --stage-list "DGX_B200-8_GPUs-PyTorch-1"

tensorrt-cicd · 2026-04-01T09:55:56Z

PR_Github #41176 [ run ] triggered by Bot. Commit: 9b8e0d6 Link to invocation

tensorrt-cicd · 2026-04-01T10:01:45Z

PR_Github #41178 [ run ] triggered by Bot. Commit: 3be3f4c Link to invocation

tensorrt-cicd · 2026-04-01T10:01:48Z

PR_Github #41176 [ run ] completed with state ABORTED. Commit: 9b8e0d6

Link to invocation

bo-nv · 2026-04-01T10:02:01Z

/bot run --stage-list "DGX_B200-8_GPUs-PyTorch-1"

tensorrt-cicd · 2026-04-01T10:07:58Z

PR_Github #41180 [ run ] triggered by Bot. Commit: 2964699 Link to invocation

tensorrt-cicd · 2026-04-01T10:08:00Z

PR_Github #41178 [ run ] completed with state ABORTED. Commit: 3be3f4c

Link to invocation

tensorrt-cicd · 2026-04-01T15:05:06Z

PR_Github #41180 [ run ] completed with state SUCCESS. Commit: 2964699
/LLM/main/L0_MergeRequest_PR pipeline #32143 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

bo-nv · 2026-04-01T15:52:59Z

/bot run

tensorrt-cicd · 2026-04-01T15:58:29Z

PR_Github #41223 [ run ] triggered by Bot. Commit: 2964699 Link to invocation

tensorrt-cicd · 2026-04-01T19:07:24Z

PR_Github #41223 [ run ] completed with state SUCCESS. Commit: 2964699
/LLM/main/L0_MergeRequest_PR pipeline #32184 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bo-nv · 2026-04-01T19:53:44Z

/bot run

tensorrt-cicd · 2026-04-01T19:59:31Z

PR_Github #41249 [ run ] triggered by Bot. Commit: 2964699 Link to invocation

tensorrt-cicd · 2026-04-01T21:56:05Z

PR_Github #41249 [ run ] completed with state SUCCESS. Commit: 2964699
/LLM/main/L0_MergeRequest_PR pipeline #32207 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bo-nv · 2026-04-01T22:54:19Z

/bot run

tensorrt-cicd · 2026-04-01T23:00:27Z

PR_Github #41273 [ run ] triggered by Bot. Commit: 2964699 Link to invocation

tensorrt-cicd · 2026-04-02T00:59:24Z

PR_Github #41273 [ run ] completed with state SUCCESS. Commit: 2964699
/LLM/main/L0_MergeRequest_PR pipeline #32231 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…rch-1 Signed-off-by: Bo Deng <deemod@nvidia.com>

Signed-off-by: Bo Deng <deemod@nvidia.com>

tensorrt-cicd · 2026-04-02T18:25:42Z

PR_Github #41427 [ run ] completed with state SUCCESS. Commit: 01a48f9
/LLM/main/L0_MergeRequest_PR pipeline #32360 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bo-nv · 2026-04-03T01:34:38Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-03T01:41:15Z

PR_Github #41538 [ run ] triggered by Bot. Commit: 01a48f9 Link to invocation

tensorrt-cicd · 2026-04-03T04:04:22Z

PR_Github #41538 [ run ] completed with state SUCCESS. Commit: 01a48f9
/LLM/main/L0_MergeRequest_PR pipeline #32454 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bo-nv · 2026-04-03T06:08:56Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-03T06:14:19Z

PR_Github #41604 [ run ] triggered by Bot. Commit: 01a48f9 Link to invocation

tensorrt-cicd · 2026-04-03T06:56:28Z

PR_Github #41604 [ run ] completed with state SUCCESS. Commit: 01a48f9
/LLM/main/L0_MergeRequest_PR pipeline #32514 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bo-nv · 2026-04-03T08:11:39Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-03T08:18:06Z

PR_Github #41632 [ run ] triggered by Bot. Commit: 01a48f9 Link to invocation

tensorrt-cicd · 2026-04-03T08:50:52Z

PR_Github #41632 [ run ] completed with state SUCCESS. Commit: 01a48f9
/LLM/main/L0_MergeRequest_PR pipeline #32540 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bo-nv · 2026-04-03T10:32:55Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-03T10:38:25Z

PR_Github #41652 [ run ] triggered by Bot. Commit: 48a0ad4 Link to invocation

bo-nv · 2026-04-03T13:42:04Z

/bot run --add-multi-gpu-test --disable-fail-fast

brb-nv · 2026-04-03T15:49:54Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-03T15:55:26Z

PR_Github #41698 [ run ] triggered by Bot. Commit: 48a0ad4 Link to invocation

tensorrt-cicd · 2026-04-03T23:33:59Z

PR_Github #41698 [ run ] completed with state SUCCESS. Commit: 48a0ad4
/LLM/main/L0_MergeRequest_PR pipeline #32601 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bo-nv · 2026-04-04T02:55:32Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-04T03:01:07Z

PR_Github #41792 [ run ] triggered by Bot. Commit: 48a0ad4 Link to invocation

tensorrt-cicd · 2026-04-04T07:09:05Z

PR_Github #41792 [ run ] completed with state SUCCESS. Commit: 48a0ad4
/LLM/main/L0_MergeRequest_PR pipeline #32686 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bo-nv · 2026-04-04T15:07:53Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-04T15:14:12Z

PR_Github #41831 [ run ] triggered by Bot. Commit: 02ea75e Link to invocation

tensorrt-cicd · 2026-04-05T02:14:49Z

PR_Github #41831 [ run ] completed with state SUCCESS. Commit: 02ea75e
/LLM/main/L0_MergeRequest_PR pipeline #32704 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bo-nv · 2026-04-05T06:28:59Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-05T06:34:56Z

PR_Github #41856 [ run ] triggered by Bot. Commit: 02ea75e Link to invocation

tensorrt-cicd · 2026-04-05T09:21:10Z

PR_Github #41856 [ run ] completed with state SUCCESS. Commit: 02ea75e
/LLM/main/L0_MergeRequest_PR pipeline #32725 completed with status: 'SUCCESS'

CI Report

Link to invocation

NVIDIA#12656) Signed-off-by: Bo Deng <deemod@nvidia.com>

bo-nv self-assigned this Apr 1, 2026

bo-nv requested a review from a team as a code owner April 1, 2026 09:46

bo-nv requested review from Shixiaowei02 and chuangz0 April 1, 2026 09:47

bo-nv force-pushed the main-6027560 branch from 940132d to 047fcd0 Compare April 1, 2026 09:49

coderabbitai bot reviewed Apr 1, 2026

View reviewed changes

bo-nv force-pushed the main-6027560 branch from 047fcd0 to 9b8e0d6 Compare April 1, 2026 09:55

bo-nv force-pushed the main-6027560 branch from 9b8e0d6 to 3be3f4c Compare April 1, 2026 09:56

Shixiaowei02 approved these changes Apr 1, 2026

View reviewed changes

bo-nv added 2 commits April 2, 2026 01:51

[https://nvbugs/6027560][fix] fix hang issues on DGX_B200-8_GPUs-PyTo…

1b5c7cb

…rch-1 Signed-off-by: Bo Deng <deemod@nvidia.com>

unwaive tests

668ec2a

Signed-off-by: Bo Deng <deemod@nvidia.com>

Merge branch 'main' into main-6027560

48a0ad4

Merge branch 'main' into main-6027560

02ea75e

bo-nv merged commit 1d31029 into NVIDIA:main Apr 5, 2026
5 checks passed

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Apr 7, 2026

[https://nvbugs/6027560][fix] fix hang issues on DGX_B200-8_GPUs-PyTo… (

1eca31a

NVIDIA#12656) Signed-off-by: Bo Deng <deemod@nvidia.com>

karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026

[https://nvbugs/6027560][fix] fix hang issues on DGX_B200-8_GPUs-PyTo… (

be967de

NVIDIA#12656) Signed-off-by: Bo Deng <deemod@nvidia.com>

Conversation

bo-nv commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

bo-nv commented Apr 1, 2026

Uh oh!

coderabbitai bot commented Apr 1, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

bo-nv commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

bo-nv commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

bo-nv commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

bo-nv commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

bo-nv commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

bo-nv commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

bo-nv commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

bo-nv commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

bo-nv commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

bo-nv commented Apr 3, 2026

bo-nv commented Apr 1, 2026 •

edited

Loading