
[https://nvbugs/6027560][fix] fix hang issues on DGX_B200-8_GPUs-PyTorch-1 (#12656)

Merged
bo-nv merged 7 commits into NVIDIA:main from bo-nv:main-6027560 on Apr 5, 2026

Conversation

@bo-nv (Collaborator) commented Apr 1, 2026


Summary by CodeRabbit

Release Notes

  • Tests
    • Updated test infrastructure with improved configuration for disaggregated serving scenarios.
    • Removed test waivers for multiple disaggregated serving test configurations, indicating these tests are now expected to pass reliably.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@bo-nv bo-nv self-assigned this Apr 1, 2026
@bo-nv bo-nv requested a review from a team as a code owner April 1, 2026 09:46
@bo-nv bo-nv requested review from Shixiaowei02 and chuangz0 April 1, 2026 09:47
@bo-nv (Collaborator, Author) commented Apr 1, 2026

/bot run --stage-list "DGX_B200-8_GPUs-PyTorch-1"

@coderabbitai (Contributor, bot) commented Apr 1, 2026

📝 Walkthrough

Walkthrough

Added UCX TLS configuration (UCX_TLS = "^ib") to the per-worker environment for both context and generation servers in disaggregated serving launches. Removed test skip/waive entries for multiple disaggregated serving test cases, suggesting these tests should now pass.

Changes

  • Disaggregated Serving Environment Configuration — tests/integration/defs/accuracy/test_disaggregated_serving.py
    Added env["UCX_TLS"] = "^ib" to the per-worker environment for context servers and generation servers in launch_disaggregated_llm.
  • Test Waive List Updates — tests/integration/test_lists/waives.txt
    Removed 7 SKIP waive entries for disaggregated serving tests, including entries for TestDeepSeekV32Exp, TestDeepSeekV3Lite, TestNemotron3Super120B, and TestQwen3_8B test cases.
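The per-worker environment change summarized above can be sketched as follows. This is a hedged illustration only: build_worker_env is a hypothetical helper name, and the real logic lives inline in launch_disaggregated_llm with different surroundings.

```python
import os

def build_worker_env(base_env: dict, gpu_offset: int, total_gpus: int) -> dict:
    # Hypothetical helper illustrating the per-worker env setup described in
    # the file summary; not the actual test code.
    env = base_env.copy()
    env["TRTLLM_USE_UCX_KVCACHE"] = "1"
    # The fix under review: exclude InfiniBand transports so UCX does not
    # hang on the CI B200 cluster.
    env["UCX_TLS"] = "^ib"
    # Pin this worker to its slice of GPUs.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(
        map(str, range(gpu_offset, gpu_offset + total_gpus)))
    return env

env = build_worker_env(dict(os.environ), gpu_offset=0, total_gpus=4)
print(env["UCX_TLS"], env["CUDA_VISIBLE_DEVICES"])  # ^ib 0,1,2,3
```

Each context or generation server would get its own copy of the base environment with these overrides applied before launch.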

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning. The PR description contains only template boilerplate with empty Description and Test Coverage sections; no actual explanation of the changes, problem statement, or testing approach is provided. Resolution: complete the Description section to explain the hang issue and why the UCX_TLS configuration fixes it, and list the validating test cases under Test Coverage.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed. The title references a specific issue (NVBugs #6027560) and indicates a fix for hang issues on DGX_B200-8 GPUs, which aligns with the changeset's purpose of modifying the UCX configuration and removing test waivers to resolve hangs.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (Contributor, bot) left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)

270-283: ⚠️ Potential issue | 🟠 Major

Preserve the ^ib workaround instead of overwriting it later.

Lines 12/45 set UCX_TLS="^ib" (B200 hang mitigation), but lines 24/56 unconditionally overwrite it to UCX_TLS="^cuda_ipc" when has_nvlink() returns false, entirely dropping the hang mitigation. Use valid UCX syntax ^ib,cuda_ipc to exclude both transports in a single assignment.

Proposed fix
     for i, port in enumerate(ctx_ports):
         env = base_env.copy()
         env["TRTLLM_USE_UCX_KVCACHE"] = "1"
-        # Need to set UCX_TLS to ^ib to avoid hangs on CI B200 cluster.
-        env["UCX_TLS"] = "^ib"
+        # Exclude ib to avoid hangs on CI B200 cluster; also exclude cuda_ipc on non-NVLink systems.
+        ucx_excludes = ["ib"]
+        if not has_nvlink():
+            ucx_excludes.append("cuda_ipc")
+        env["UCX_TLS"] = "^" + ",".join(ucx_excludes)
         if enable_perf:
             env["TRTLLM_KVCACHE_TIME_OUTPUT_PATH"] = kv_cache_perf_dir
@@ -22,8 +22,6 @@
         gpu_range = range(current_gpu_offset,
                           current_gpu_offset + ctx_total_gpus)
         env["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, gpu_range))
-        if not has_nvlink():
-            env["UCX_TLS"] = "^cuda_ipc"
         current_gpu_offset += ctx_total_gpus
 
     for i, port in enumerate(gen_ports):
         env = base_env.copy()
         env["TRTLLM_USE_UCX_KVCACHE"] = "1"
-        # Need to set UCX_TLS to ^ib to avoid hangs on CI B200 cluster.
-        env["UCX_TLS"] = "^ib"
+        # Exclude ib to avoid hangs on CI B200 cluster; also exclude cuda_ipc on non-NVLink systems.
+        ucx_excludes = ["ib"]
+        if not has_nvlink():
+            ucx_excludes.append("cuda_ipc")
+        env["UCX_TLS"] = "^" + ",".join(ucx_excludes)
         if enable_perf:
             env["TRTLLM_KVCACHE_TIME_OUTPUT_PATH"] = kv_cache_perf_dir
@@ -52,8 +52,6 @@
         gpu_range = range(current_gpu_offset,
                           current_gpu_offset + gen_total_gpus)
         env["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, gpu_range))
-        if not has_nvlink():
-            env["UCX_TLS"] = "^cuda_ipc"
         current_gpu_offset += gen_total_gpus
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/accuracy/test_disaggregated_serving.py` around lines
270 - 283, The current logic sets env["UCX_TLS"] = "^ib" then later overwrites
it to "^cuda_ipc" when has_nvlink() is false, losing the "^ib" workaround;
change the update inside the has_nvlink() check to preserve the ib exclusion by
assigning env["UCX_TLS"] = "^ib,cuda_ipc" (or append ",cuda_ipc" to the existing
env["UCX_TLS"] if present) so both transports are excluded; modify the code
around the existing env assignment and the has_nvlink() branch (referencing env
and has_nvlink()) to perform this combined exclusion instead of an unconditional
overwrite.
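The combined exclusion the reviewer proposes can be sketched as a small helper. This is a hedged sketch: build_ucx_tls is a hypothetical name, and NVLink availability is passed in as a plain boolean rather than calling the test module's has_nvlink() helper.

```python
def build_ucx_tls(has_nvlink: bool) -> str:
    # Always exclude ib (B200 hang mitigation); additionally exclude
    # cuda_ipc on systems without NVLink, as the review suggests.
    excludes = ["ib"]
    if not has_nvlink:
        excludes.append("cuda_ipc")
    # UCX negation syntax: a leading ^ excludes the listed transports.
    return "^" + ",".join(excludes)

print(build_ucx_tls(True))   # -> ^ib
print(build_ucx_tls(False))  # -> ^ib,cuda_ipc
```

Building the value once this way avoids the unconditional overwrite the reviewer flags, so the ib exclusion survives on non-NVLink systems.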

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cfdbdea3-8a31-4614-a76c-151681570382

📥 Commits

Reviewing files that changed from the base of the PR and between 075e36a and 940132d.

📒 Files selected for processing (2)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

@bo-nv (Collaborator, Author) commented Apr 1, 2026

/bot run --stage-list "DGX_B200-8_GPUs-PyTorch-1"

@tensorrt-cicd (Collaborator)

PR_Github #41176 [ run ] triggered by Bot. Commit: 9b8e0d6 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41178 [ run ] triggered by Bot. Commit: 3be3f4c Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41176 [ run ] completed with state ABORTED. Commit: 9b8e0d6

Link to invocation

@bo-nv (Collaborator, Author) commented Apr 1, 2026

/bot run --stage-list "DGX_B200-8_GPUs-PyTorch-1"

@tensorrt-cicd (Collaborator)

PR_Github #41180 [ run ] triggered by Bot. Commit: 2964699 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41178 [ run ] completed with state ABORTED. Commit: 3be3f4c

Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41180 [ run ] completed with state SUCCESS. Commit: 2964699
/LLM/main/L0_MergeRequest_PR pipeline #32143 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@bo-nv (Collaborator, Author) commented Apr 1, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #41223 [ run ] triggered by Bot. Commit: 2964699 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41223 [ run ] completed with state SUCCESS. Commit: 2964699
/LLM/main/L0_MergeRequest_PR pipeline #32184 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@bo-nv (Collaborator, Author) commented Apr 1, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #41249 [ run ] triggered by Bot. Commit: 2964699 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41249 [ run ] completed with state SUCCESS. Commit: 2964699
/LLM/main/L0_MergeRequest_PR pipeline #32207 completed with status: 'FAILURE'

CI Report


Link to invocation

@bo-nv (Collaborator, Author) commented Apr 1, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #41273 [ run ] triggered by Bot. Commit: 2964699 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41273 [ run ] completed with state SUCCESS. Commit: 2964699
/LLM/main/L0_MergeRequest_PR pipeline #32231 completed with status: 'FAILURE'

CI Report


Link to invocation

bo-nv added 2 commits April 2, 2026 01:51
…rch-1

Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
@tensorrt-cicd (Collaborator)

PR_Github #41427 [ run ] completed with state SUCCESS. Commit: 01a48f9
/LLM/main/L0_MergeRequest_PR pipeline #32360 completed with status: 'FAILURE'

CI Report


Link to invocation

@bo-nv (Collaborator, Author) commented Apr 3, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #41538 [ run ] triggered by Bot. Commit: 01a48f9 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41538 [ run ] completed with state SUCCESS. Commit: 01a48f9
/LLM/main/L0_MergeRequest_PR pipeline #32454 completed with status: 'FAILURE'

CI Report


Link to invocation

@bo-nv (Collaborator, Author) commented Apr 3, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #41604 [ run ] triggered by Bot. Commit: 01a48f9 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41604 [ run ] completed with state SUCCESS. Commit: 01a48f9
/LLM/main/L0_MergeRequest_PR pipeline #32514 completed with status: 'FAILURE'

CI Report


Link to invocation

@bo-nv (Collaborator, Author) commented Apr 3, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #41632 [ run ] triggered by Bot. Commit: 01a48f9 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41632 [ run ] completed with state SUCCESS. Commit: 01a48f9
/LLM/main/L0_MergeRequest_PR pipeline #32540 completed with status: 'FAILURE'

CI Report


Link to invocation

@bo-nv (Collaborator, Author) commented Apr 3, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #41652 [ run ] triggered by Bot. Commit: 48a0ad4 Link to invocation

@bo-nv (Collaborator, Author) commented Apr 3, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

1 similar comment
@brb-nv (Collaborator) commented Apr 3, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #41698 [ run ] triggered by Bot. Commit: 48a0ad4 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41698 [ run ] completed with state SUCCESS. Commit: 48a0ad4
/LLM/main/L0_MergeRequest_PR pipeline #32601 completed with status: 'FAILURE'

CI Report


Link to invocation

@bo-nv (Collaborator, Author) commented Apr 4, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #41792 [ run ] triggered by Bot. Commit: 48a0ad4 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41792 [ run ] completed with state SUCCESS. Commit: 48a0ad4
/LLM/main/L0_MergeRequest_PR pipeline #32686 completed with status: 'FAILURE'

CI Report


Link to invocation

@bo-nv (Collaborator, Author) commented Apr 4, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #41831 [ run ] triggered by Bot. Commit: 02ea75e Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41831 [ run ] completed with state SUCCESS. Commit: 02ea75e
/LLM/main/L0_MergeRequest_PR pipeline #32704 completed with status: 'FAILURE'

CI Report


Link to invocation

@bo-nv (Collaborator, Author) commented Apr 5, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #41856 [ run ] triggered by Bot. Commit: 02ea75e Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #41856 [ run ] completed with state SUCCESS. Commit: 02ea75e
/LLM/main/L0_MergeRequest_PR pipeline #32725 completed with status: 'SUCCESS'

CI Report

Link to invocation

@bo-nv bo-nv merged commit 1d31029 into NVIDIA:main Apr 5, 2026
5 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Apr 7, 2026
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026


5 participants