Skip to content

[https://nvbugs/6094100][fix] set UCX_TLS for gb300 to enable disagg test#14168

Merged
chuangz0 merged 1 commit into
NVIDIA:mainfrom
chuangz0:ucx_env_on_aws_cmh_slurm_gb300
May 18, 2026
Merged

[https://nvbugs/6094100][fix] set UCX_TLS for gb300 to enable disagg test#14168
chuangz0 merged 1 commit into
NVIDIA:mainfrom
chuangz0:ucx_env_on_aws_cmh_slurm_gb300

Conversation

@chuangz0
Copy link
Copy Markdown
Collaborator

@chuangz0 chuangz0 commented May 15, 2026

Summary by CodeRabbit

  • Tests
    • Improved disaggregated inference test configuration to automatically adapt based on GPU SM version.
    • Enabled two previously waived disaggregated inference tests to run.

Review Change Stack

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@chuangz0 chuangz0 requested a review from a team as a code owner May 15, 2026 04:28
@chuangz0 chuangz0 force-pushed the ucx_env_on_aws_cmh_slurm_gb300 branch from 64a5757 to 925a6df Compare May 15, 2026 04:29
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 08d50bbe-e654-4e1c-bb27-e48435cadb96

📥 Commits

Reviewing files that changed from the base of the PR and between a3e0f56 and ef6e6f7.

📒 Files selected for processing (2)
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

📝 Walkthrough

Walkthrough

The PR centralizes UCX_TLS transport detection into an SM version-aware helper function, then applies it consistently through the disaggregated test runner and individual test cases, eliminating hardcoded per-test transport strings.

Changes

UCX Transport Centralization

Layer / File(s) Summary
SM-aware UCX transport detection
tests/integration/defs/disaggregated/test_disaggregated.py
get_ucx_tls() computes transport strings from SM version, with special handling for SM 103 and differentiated thresholds for SM < 90 vs newer GPUs.
Test runner environment setup
tests/integration/defs/disaggregated/test_disaggregated.py
run_disaggregated_test() creates run_env from the provided environment, injects UCX_TLS via get_ucx_tls(), and passes it to both cluster setup and client test execution.
Individual test UCX configuration
tests/integration/defs/disaggregated/test_disaggregated.py
Three test cases—test_disaggregated_benchmark_gen_only_insufficient_kv, test_disaggregated_logprobs_serving, test_disaggregated_mamba_conc_greater_than_mbs—switch from hardcoded transport values to dynamic setup via get_ucx_tls().

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#13594: Both PRs modify how UCX_TLS is set/propagated for disaggregated test runs—this PR dynamically computes/injects UCX_TLS via run_env, while the other clears UCX_TLS and adjusts related launch command environment handling.

Suggested reviewers

  • niukuo
  • bo-nv
  • dpitman-nvda
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is incomplete, containing only the template with empty sections (Description, Test Coverage) and an unfilled checklist; the author provided no actual implementation details or test coverage information. Fill in the Description section explaining the issue and solution, and the Test Coverage section listing relevant tests that safeguard these changes.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: setting UCX_TLS for gb300 to enable disaggregated tests, which matches the code modifications to handle dynamic UCX_TLS configuration.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@chuangz0 chuangz0 force-pushed the ucx_env_on_aws_cmh_slurm_gb300 branch from 925a6df to ef6e6f7 Compare May 15, 2026 04:31
@chuangz0
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #48518 [ run ] triggered by Bot. Commit: ef6e6f7 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #48518 [ run ] completed with state FAILURE. Commit: ef6e6f7
/LLM/main/L0_MergeRequest_PR pipeline #38312 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
@chuangz0 chuangz0 force-pushed the ucx_env_on_aws_cmh_slurm_gb300 branch from ef6e6f7 to e4a32e3 Compare May 15, 2026 06:03
@chuangz0
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #48541 [ run ] triggered by Bot. Commit: e4a32e3 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #48541 [ run ] completed with state SUCCESS. Commit: e4a32e3
/LLM/main/L0_MergeRequest_PR pipeline #38332 completed with status: 'SUCCESS'

CI Report

Link to invocation

@chuangz0 chuangz0 enabled auto-merge (squash) May 18, 2026 01:07
Copy link
Copy Markdown
Collaborator

@fredricz-20070104 fredricz-20070104 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve

@chuangz0 chuangz0 merged commit 1109e1c into NVIDIA:main May 18, 2026
8 checks passed
KleinBlueC pushed a commit to KleinBlueC/TensorRT-LLM that referenced this pull request May 19, 2026
…test (NVIDIA#14168)

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants