[None][chore] Fix GB300 support issues #10196

fredricz-20070104 · 2025-12-22T09:19:28Z

Summary by CodeRabbit

New Features
- Added GB300 GPU support for SLURM configurations.
- Added new performance and accuracy test cases.
Bug Fixes
- Improved test execution reliability with enhanced error handling and logging backup mechanisms.
- Strengthened result verification and log preservation workflows.
Chores
- Migrated cache transceiver backend from UCX to NIXL for specific test configurations.
- Updated metric log file naming conventions.

Fix GB300 support issues

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

Signed-off-by: fredricz-20070104 <226039983+fredricz-20070104@users.noreply.github.com>

coderabbitai · 2025-12-22T09:24:04Z

Walkthrough

Refactored test infrastructure for disaggregated execution with improved error handling, updated configuration defaults, and enhanced SLURM management capabilities. Changes include method signature updates for job status checking, unified test completion handling, new SLURM configuration options, and metric log file naming updates, along with control flow adjustments to increase robustness.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Execution error handling & job management | `tests/integration/defs/perf/disagg/execution/executor.py` | Refactored job status checking from `check_job_status() → str` to `check_job_exists() → bool`; enhanced `wait_for_completion()` to return `None` and use existence checks for completion determination; added try/except wrapping for result checking, directory cleanup, and log printing with pre-existence validation; improved error propagation into result structures. |
| Test workflow simplification | `tests/integration/defs/perf/disagg/test_disagg.py` | Unified perf and accuracy test flows with centralized job submission and result handling; introduced debug mode job_id bypass; replaced separate completion/error handling with single `wait_for_completion()` call; added finally block for consistent log backup regardless of test outcome; removed per-case manual cancellation and result backup logic. |
| SLURM configuration & utilities | `tests/integration/defs/perf/disagg/utils/common.py` | Added `get_slurm_extra_args()` method for per-GPU-type SLURM arguments; enabled GB300 support for SLURM segment flag; changed disaggregated context directory naming prefix from `ctx` to `disagg_ctx`. |
| Configuration defaults & overrides | `tests/integration/defs/perf/disagg/utils/config_loader.py` | Updated metric log file names (`accuracy_eval.log` → `7_accuracy_eval.log`, `bench.log` → `6_bench.log`); added environment-driven override for SLURM extra arguments via `EnvManager.get_slurm_extra_args()`. |
| Backend configuration updates | `tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml`, `tests/integration/defs/perf/disagg/test_configs/wideep/perf/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml`, `tests/integration/defs/perf/disagg/test_configs/wideep/perf/kimi-k2-thinking-fp4_8k1k_ctx8_gen1_dep32_bs256_eplb416_mtp0_ccb-NIXL.yaml` | Changed `cache_transceiver_config.backend` from UCX to NIXL in worker configurations. |
| Test list updates | `tests/integration/defs/perf/disagg/testlist/all.txt`, `tests/integration/defs/perf/disagg/testlist/wideep.txt` | Added new test cases for Qwen3, deepseek-r1, and kimi-k2 variants with NIXL backend; replaced UCX-based kimi-k2 perf entries with NIXL variants. |
| Benchmark submission refactoring | `examples/disaggregated/slurm/benchmark/submit.py` | Replaced nested allocation loop with flat single-level iteration; updated command construction to reference allocation object fields directly; expanded shell quoting for concurrency lists, file paths, model paths, dataset files, and log directories for robust parsing. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

executor.py requires careful review of the new error-handling paths and try/except blocks, the return-type change to wait_for_completion(), and the replacement of check_job_status() with check_job_exists(), to confirm that the job existence/completion logic is correct.
test_disagg.py control flow refactoring with unified wait/backup patterns should be traced through both perf and accuracy test paths to verify proper exception handling and log backup execution.
common.py and config_loader.py field additions and behavior changes (GB300 support, log file naming, context directory prefix) require verification that downstream components correctly reference updated names.
submit.py refactored loop and expanded quoting should be validated to ensure command construction remains correct and shell parsing is robust across all code paths.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is minimal and lacks required sections from the template (Description, Test Coverage, and PR Checklist). | Expand the description to explain the issue being fixed, list relevant test cases ensuring sufficient coverage, and complete the PR Checklist items. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 76.47%, which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main objective of the PR: fixing GB300 support issues across multiple files and configurations. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

examples/disaggregated/slurm/benchmark/submit.py (1)
276-277: Consider using next(iter(...)) for single-element access.

Once the critical data structure issue is resolved, consider refactoring line 277 to use next(iter(allocation["nodes"].values())) instead of list(allocation["nodes"].values())[0]. This is more idiomatic, avoids creating an intermediate list, and is slightly more efficient.
🔎 Proposed refactor
-        cuda_devices = ",".join(
-            [str(device) for device in list(allocation["nodes"].values())[0]])
+        cuda_devices = ",".join(
+            [str(device) for device in next(iter(allocation["nodes"].values()))])
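
To make the suggestion concrete, here is a minimal sketch of the two forms side by side. It assumes `allocation["nodes"]` is a non-empty dict mapping a node name to a list of device indices; the sample data is hypothetical and the real structure in submit.py may differ.

```python
# Hypothetical allocation; the real structure in submit.py may differ.
allocation = {"nodes": {"node-0": [0, 1, 2, 3]}}

# Original form: copies every value into a list, then takes the first one.
via_list = list(allocation["nodes"].values())[0]

# Suggested form: pulls only the first value from the dict's value view.
via_iter = next(iter(allocation["nodes"].values()))

cuda_devices = ",".join(str(device) for device in via_iter)
print(cuda_devices)  # 0,1,2,3
```

One behavioral difference to keep in mind: on an empty dict the list form raises `IndexError`, while `next(iter(...))` raises `StopIteration` unless a default is passed as the second argument to `next`.
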
tests/integration/defs/perf/disagg/execution/executor.py (2)

559-563: Caller cannot distinguish timeout from normal completion.

After timeout handling (cancelling the job and sleeping), the function returns implicitly with None, same as normal completion. If the caller needs to know whether the job completed normally or was cancelled due to timeout, consider returning a status indicator or raising an exception.

However, per the docstring, "The actual success/failure will be determined by log file parsing," so this design may be intentional.
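
If a status indicator were wanted, one possible shape is sketched below. This is not the PR's implementation: the `WaitOutcome` enum and the injected `job_exists`, `cancel_job`, `sleep`, and `clock` callables are hypothetical names chosen so the loop can run without a SLURM cluster.

```python
import enum
import time
from typing import Callable


class WaitOutcome(enum.Enum):
    """Hypothetical status values; not part of the PR."""
    COMPLETED = "completed"
    TIMED_OUT = "timed_out"


def wait_for_completion(
    job_exists: Callable[[], bool],
    cancel_job: Callable[[], None],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> WaitOutcome:
    """Poll until the job leaves the queue or the timeout elapses."""
    deadline = clock() + timeout_s
    while job_exists():
        if clock() >= deadline:
            # Cancel the job, but report the timeout instead of returning None.
            cancel_job()
            return WaitOutcome.TIMED_OUT
        sleep(poll_interval_s)
    return WaitOutcome.COMPLETED
```

A caller that only cares about log parsing can ignore the return value, so this shape would stay compatible with the "log files decide success" design noted above.
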

648-654: Consider extracting hardcoded log filename to a constant.

The filename "7_accuracy_eval.log" is hardcoded here. If this filename is used elsewhere or might change, consider extracting it to a module-level constant for maintainability.
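
A minimal sketch of that extraction follows. The filename comes from this PR; the constant name `ACCURACY_EVAL_LOG` and the helper `accuracy_log_path` are hypothetical.

```python
import os

# Filename taken from the PR; the constant name is a hypothetical choice.
ACCURACY_EVAL_LOG = "7_accuracy_eval.log"


def accuracy_log_path(result_dir: str) -> str:
    """Return the path of the accuracy log inside a result directory."""
    return os.path.join(result_dir, ACCURACY_EVAL_LOG)
```
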

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e9523c and 3209553.

📒 Files selected for processing (10)

examples/disaggregated/slurm/benchmark/submit.py
tests/integration/defs/perf/disagg/execution/executor.py
tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/kimi-k2-thinking-fp4_8k1k_ctx8_gen1_dep32_bs256_eplb416_mtp0_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_disagg.py
tests/integration/defs/perf/disagg/testlist/all.txt
tests/integration/defs/perf/disagg/testlist/wideep.txt
tests/integration/defs/perf/disagg/utils/common.py
tests/integration/defs/perf/disagg/utils/config_loader.py

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic

Files:

tests/integration/defs/perf/disagg/test_disagg.py
tests/integration/defs/perf/disagg/utils/common.py
tests/integration/defs/perf/disagg/execution/executor.py
tests/integration/defs/perf/disagg/utils/config_loader.py
examples/disaggregated/slurm/benchmark/submit.py

**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

tests/integration/defs/perf/disagg/test_disagg.py
tests/integration/defs/perf/disagg/utils/common.py
tests/integration/defs/perf/disagg/execution/executor.py
tests/integration/defs/perf/disagg/utils/config_loader.py
examples/disaggregated/slurm/benchmark/submit.py

🧠 Learnings (2)

📓 Common learnings

Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7785
File: tests/integration/defs/perf/utils.py:321-333
Timestamp: 2025-09-17T06:01:01.836Z
Learning: In test infrastructure code for disaggregated serving tests, prefer logging errors and continuing execution rather than raising exceptions on timeout, to avoid disrupting test cleanup and causing cascading failures.

📚 Learning: 2025-08-22T19:08:10.822Z

Learnt from: yuanjingx87
Repo: NVIDIA/TensorRT-LLM PR: 7176
File: jenkins/L0_Test.groovy:361-389
Timestamp: 2025-08-22T19:08:10.822Z
Learning: In Slurm job monitoring scripts, when jobs have built-in timeouts configured (via --time parameter or partition/system timeouts), an additional timeout mechanism in the monitoring loop is typically unnecessary. When a Slurm job times out, it gets terminated and removed from the active queue, causing `squeue -j $jobId` to return non-zero and break monitoring loops naturally. The job's final status can then be checked via `sacct` to determine if it failed due to timeout.

Applied to files:

tests/integration/defs/perf/disagg/execution/executor.py

🧬 Code graph analysis (3)

tests/integration/defs/perf/disagg/test_disagg.py (3)

tests/integration/defs/perf/disagg/execution/executor.py (4)

JobManager (186-838)

wait_for_completion (504-563)

get_result_dir (348-366)

backup_logs (257-325)

tests/integration/defs/perf/disagg/utils/logger.py (1)

error (89-91)

tests/integration/defs/perf/disagg/utils/trackers.py (2)

end_test_case (30-38)

get_timestamps (40-55)

tests/integration/defs/perf/disagg/execution/executor.py (2)

tests/integration/defs/perf/disagg/utils/common.py (1)

get_work_dir (85-86)

tests/integration/defs/perf/disagg/reporting/report.py (4)

LogWriter (12-28)

print_to_console (16-28)

LogParser (31-238)

parse (73-101)

tests/integration/defs/perf/disagg/utils/config_loader.py (1)

tests/integration/defs/perf/disagg/utils/common.py (2)

EnvManager (44-155)

get_slurm_extra_args (70-74)

🪛 Ruff (0.14.8)

tests/integration/defs/perf/disagg/test_disagg.py

134-134: Do not catch blind exception: Exception

(BLE001)

216-216: Do not catch blind exception: Exception

(BLE001)

tests/integration/defs/perf/disagg/execution/executor.py

413-413: Do not catch blind exception: Exception

(BLE001)

415-415: Use explicit conversion flag

Replace with conversion flag

(RUF010)

423-423: Do not catch blind exception: Exception

(BLE001)

498-498: Do not catch blind exception: Exception

(BLE001)

examples/disaggregated/slurm/benchmark/submit.py

277-277: Prefer next(iter(allocation["nodes"].values())) over single element slice

Replace with next(iter(allocation["nodes"].values()))

(RUF015)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (15)

examples/disaggregated/slurm/benchmark/submit.py (1)

295-302: LGTM: Improved shell command quoting.

The addition of single quotes around arguments like model_path, dataset_file, concurrency_list, log_dir, cur_worker_env_var, and profiling ranges improves shell safety by preventing word splitting and glob expansion. This is a solid defensive coding practice.

Also applies to: 344-344, 351-351, 358-358
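
For illustration, the word-splitting problem that quoting prevents can be sketched with the standard library. Note that this uses `shlex.quote` rather than the literal single quotes added in the PR; it is an alternative that also handles values containing a single quote. The script name and values are hypothetical.

```python
import shlex

# Hypothetical values; a path containing a space is the case quoting protects.
model_path = "/models/my model"
concurrency_list = "1 2 4"

command = " ".join([
    "run_benchmark.sh",  # hypothetical script name
    shlex.quote(model_path),
    shlex.quote(concurrency_list),
])

# shlex.split parses like a POSIX shell, so each quoted value stays one argument.
print(shlex.split(command))
```

Without quoting, the same command string would split into five arguments instead of three.
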

tests/integration/defs/perf/disagg/utils/config_loader.py (2)

80-80: LGTM! Log file naming improved with sequence prefixes.

The numeric prefixes (6_, 7_) added to log filenames improve clarity and suggest execution ordering. This aligns with structured logging practices for multi-stage benchmarks.

Also applies to: 88-88, 99-99

481-481: LGTM! GB300 SLURM support added correctly.

The new extra_args mapping enables per-GPU-type SLURM arguments via EnvManager.get_slurm_extra_args(), correctly returning --gres=gpu:4 for GB200 and an empty string for GB300 (which doesn't require gres).

tests/integration/defs/perf/disagg/utils/common.py (3)

64-67: LGTM! GB300 segment support enabled.

GB300 now correctly supports SLURM segment flags (changed from False to True), aligning with the PR objective to fix GB300 support issues.

69-74: LGTM! SLURM extra arguments properly configured per GPU type.

The new method correctly returns GPU-specific SLURM arguments (--gres=gpu:4 for GB200, empty for GB300), enabling flexible SLURM configuration across different hardware.
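
The behavior described here can be sketched roughly as follows. Only the two return values (`--gres=gpu:4` for GB200, empty for GB300) come from the review; the dictionary name, the `GPU_TYPE` environment variable, and the fallback for unknown types are assumptions, and the real method lives on `EnvManager`.

```python
import os

# Values as described in the review; the dict name is hypothetical.
SLURM_EXTRA_ARGS_BY_GPU = {
    "GB200": "--gres=gpu:4",
    "GB300": "",  # per the review, GB300 does not need a gres request
}


def get_slurm_extra_args(gpu_type=None):
    """Return extra SLURM arguments for a GPU type (empty if unknown)."""
    if gpu_type is None:
        # GPU_TYPE is an assumed environment variable name.
        gpu_type = os.environ.get("GPU_TYPE", "")
    return SLURM_EXTRA_ARGS_BY_GPU.get(gpu_type, "")
```
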

199-202: Verify any test result parsing or monitoring that depends on the old directory naming.

The context directory naming changed from ctx to disagg_ctx prefix in the test utility (lines 199-202). This affects temporary directory structure for disaggregated serving performance tests. While the codebase search found no hardcoded references to the old ctx*_gen* directory pattern in automation scripts, any custom monitoring or result aggregation tools that parse these directory names should be updated to handle the new disagg_ctx*_gen* pattern. This is a test infrastructure refactoring, not a core API breaking change.

tests/integration/defs/perf/disagg/test_disagg.py (3)

80-112: LGTM! Test execution flow simplified with unified completion handling.

The refactored perf test correctly:

Initializes placeholders for job tracking

Uses unified wait_for_completion for timeout and early failure detection

Maintains debug mode support

Simplifies control flow for better maintainability

127-135: LGTM! Robust log backup with appropriate error handling.

The finally block correctly ensures logs are always backed up, with exception handling that prevents backup failures from masking primary test failures. The broad exception catch is intentional here to guarantee cleanup occurs.

Based on learnings, test infrastructure should log errors and continue execution rather than raising exceptions during cleanup.
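
The pattern can be sketched in isolation as below. `run_case`, `run_test`, and `backup_logs` are hypothetical stand-ins for the test body and `JobManager.backup_logs`; the actual code in test_disagg.py may be structured differently.

```python
import logging

logger = logging.getLogger(__name__)


def run_case(run_test, backup_logs):
    """Run a test callable and always attempt a log backup afterwards."""
    try:
        return run_test()
    finally:
        try:
            backup_logs()
        except Exception as e:  # broad on purpose: backup must not mask the test failure
            logger.error("Log backup failed: %s", e)
```

Because the backup error is caught inside the `finally` block, an exception raised by `run_test` still propagates to the caller unchanged.
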

156-217: LGTM! Accuracy test follows consistent pattern with robust error handling.

The accuracy test refactoring mirrors the perf test improvements:

Unified completion handling with appropriate timeout (3 hours for accuracy vs 2 hours for perf)

Consistent log backup in finally block

Safe None-checking for result variable

Proper exception handling that doesn't mask primary failures

Based on learnings, this approach aligns with preferred test infrastructure patterns.

tests/integration/defs/perf/disagg/execution/executor.py (6)

396-424: Good defensive error handling pattern.

The initialization of a default result before the try block, combined with exception handling that logs errors and continues, aligns well with the test infrastructure's need to avoid cascading failures. This ensures check_result always returns a valid result dict.

Based on learnings, broad exception handling is preferred in test infrastructure to avoid disrupting test cleanup.
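
A minimal sketch of the default-result pattern, assuming a result dict with `success` and `error` keys; those key names and the `parse_logs` callable are hypothetical, not taken from executor.py.

```python
import logging

logger = logging.getLogger(__name__)


def check_result(parse_logs):
    """Return a result dict even if parsing raises."""
    # The default is built before the try, so every exit path returns a valid dict.
    result = {"success": False, "error": None}
    try:
        result["metrics"] = parse_logs()
        result["success"] = True
    except Exception as e:  # broad by design in this sketch
        logger.error("Result check failed: %s", e)
        # Propagate the reason so callers can read it programmatically.
        result["error"] = str(e)
    return result
```
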

485-501: Cleaner API with boolean existence check.

The simplification from check_job_status() -> str to check_job_exists() -> bool is a good refactor. The logic correctly treats squeue failures as "job no longer exists," which aligns with the learning that SLURM job monitoring loops naturally break when jobs leave the queue.
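
A boolean existence check along these lines might look like the sketch below. It is an illustration rather than the code in executor.py: the injectable `run` parameter and the 30-second timeout are assumptions added so the function can be exercised without a SLURM installation.

```python
import subprocess


def check_job_exists(job_id, run=subprocess.run):
    """Return True while squeue still lists the job."""
    try:
        proc = run(
            ["squeue", "-j", str(job_id), "--noheader"],
            capture_output=True,
            text=True,
            timeout=30,  # assumed value
        )
    except (OSError, subprocess.SubprocessError):
        # squeue missing or hung: treat as "job no longer exists".
        return False
    # A non-zero exit or an empty listing both mean the job has left the queue.
    return proc.returncode == 0 and bool(proc.stdout.strip())
```
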

584-595: Good defensive file existence checks.

Adding pre-checks for SLURM log and result directory existence before attempting to print logs prevents FileNotFoundError exceptions and provides meaningful warning messages instead.
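
As a rough sketch of such a pre-check (the helper name `read_log_if_present` is hypothetical and the PR's own code may print rather than return the text):

```python
import logging
import os

logger = logging.getLogger(__name__)


def read_log_if_present(path):
    """Return the log text, or None (with a warning) if the file is absent."""
    if not os.path.isfile(path):
        logger.warning("Log file not found, skipping: %s", path)
        return None
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()
```
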

740-753: Consistent pre-check pattern with accuracy result checking.

Good symmetry with _check_accuracy_result. Both methods now validate file existence before attempting to parse, preventing cryptic downstream errors.

759-768: Good error propagation to result dict.

Propagating meaningful error messages into the result structure (instead of just logging) allows callers to access failure reasons programmatically.

1-2: Verify NVIDIA copyright header presence.

As per coding guidelines, all TensorRT-LLM code should contain an NVIDIA copyright header with the year of its latest modification. The provided code starts with the module docstring at line 1. Please ensure the copyright header exists above line 1 (not shown in the review context).

examples/disaggregated/slurm/benchmark/submit.py

fredricz-20070104 · 2025-12-22T10:26:18Z

/bot run --skip-test

tensorrt-cicd · 2025-12-22T10:33:58Z

PR_Github #29412 [ run ] triggered by Bot. Commit: ecca266

tensorrt-cicd · 2025-12-22T11:44:25Z

PR_Github #29412 [ run ] completed with state FAILURE. Commit: ecca266
/LLM/main/L0_MergeRequest_PR pipeline #22598 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fredricz-20070104 · 2025-12-22T11:49:37Z

/bot run --skip-test

tensorrt-cicd · 2025-12-22T11:57:54Z

PR_Github #29417 [ run ] triggered by Bot. Commit: aa9d8ac

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

tensorrt-cicd · 2025-12-22T14:14:52Z

PR_Github #29417 [ run ] completed with state SUCCESS. Commit: aa9d8ac
/LLM/main/L0_MergeRequest_PR pipeline #22602 (Partly Tested) completed with status: 'SUCCESS'

fredricz-20070104 · 2025-12-23T01:10:18Z

/bot reuse-pipeline

tensorrt-cicd · 2025-12-23T01:16:48Z

PR_Github #29480 [ reuse-pipeline ] triggered by Bot. Commit: 5bbad9e

tensorrt-cicd · 2025-12-23T01:58:20Z

PR_Github #29480 [ reuse-pipeline ] completed with state SUCCESS. Commit: 5bbad9e
Reusing PR_Github #29417 (Partly Tested) for commit 5bbad9e

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com> Signed-off-by: fredricz-20070104 <226039983+fredricz-20070104@users.noreply.github.com>

fredricz-20070104 added 9 commits December 18, 2025 10:27

fix confilict

f5fe42a

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

add adapt code

edc0b60

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fx not consistent ref

0d789a1

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

update all disagg and wideep cases

08bf31b

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

change kimi k2's backend to NIXL

b7254bf

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

simplify the current job wait logic

21b9d93

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fx log dir issue

54851b9

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

add new submit

85f983c

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fx

3dbbc98

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fredricz-20070104 requested review from a team as code owners December 22, 2025 09:19

fredricz-20070104 requested review from chuangz0, laikhtewari, nv-guomingz and schetlur-nv December 22, 2025 09:19

Merge branch 'main' into feature/fz_gb300

3209553

Signed-off-by: fredricz-20070104 <226039983+fredricz-20070104@users.noreply.github.com>

coderabbitai bot reviewed Dec 22, 2025

examples/disaggregated/slurm/benchmark/submit.py (outdated, resolved)

fredricz-20070104 assigned kaiyux and unassigned kaiyux Dec 22, 2025

fredricz-20070104 requested a review from kaiyux December 22, 2025 10:21

Merge branch 'main' into feature/fz_gb300

ecca266

fx pre-commit error

aa9d8ac

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fx

e4109ed

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fredricz-20070104 enabled auto-merge (squash) December 22, 2025 12:33

fredricz-20070104 added 2 commits December 22, 2025 12:54

fx pre-commit error

8ffce1d

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

Merge branch 'main' into feature/fz_gb300

e133d93

Merge branch 'main' into feature/fz_gb300

5bbad9e

fredricz-20070104 requested a review from Shixiaowei02 December 23, 2025 01:12

Shixiaowei02 approved these changes Dec 23, 2025

fredricz-20070104 merged commit 621156a into NVIDIA:main Dec 23, 2025
5 checks passed
