@fredricz-20070104
Collaborator

@fredricz-20070104 fredricz-20070104 commented Dec 22, 2025

Summary by CodeRabbit

  • New Features

    • Added GB300 GPU support for SLURM configurations.
    • Added new performance and accuracy test cases.
  • Bug Fixes

    • Improved test execution reliability with enhanced error handling and logging backup mechanisms.
    • Strengthened result verification and log preservation workflows.
  • Chores

    • Migrated cache transceiver backend from UCX to NIXL for specific test configurations.
    • Updated metric log file naming conventions.

Fix GB300 support issues

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: fredricz-20070104 <226039983+fredricz-20070104@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Dec 22, 2025

Walkthrough

Refactored test infrastructure for disaggregated execution with improved error handling, updated configuration defaults, and enhanced SLURM management capabilities. Changes include method signature updates for job status checking, unified test completion handling, new SLURM configuration options, and metric log file naming updates, along with control flow adjustments to increase robustness.

Changes

Cohort / File(s) Summary
Execution error handling & job management
tests/integration/defs/perf/disagg/execution/executor.py
Refactored job status checking from check_job_status() -> str to check_job_exists() -> bool; changed wait_for_completion() to return None and use existence checks to determine completion; added try/except wrapping for result checking, directory cleanup, and log printing with pre-existence validation; improved error propagation into result structures.
Test workflow simplification
tests/integration/defs/perf/disagg/test_disagg.py
Unified perf and accuracy test flows with centralized job submission and result handling; introduced debug mode job_id bypass; replaced separate completion/error handling with single wait_for_completion() call; added finally block for consistent log backup regardless of test outcome; removed per-case manual cancellation and result backup logic.
SLURM configuration & utilities
tests/integration/defs/perf/disagg/utils/common.py
Added get_slurm_extra_args() method for per-GPU-type SLURM arguments; enabled GB300 support for SLURM segment flag; changed disaggregated context directory naming prefix from ctx to disagg_ctx.
Configuration defaults & overrides
tests/integration/defs/perf/disagg/utils/config_loader.py
Updated metric log file names (accuracy_eval.log → 7_accuracy_eval.log, bench.log → 6_bench.log); added environment-driven override for SLURM extra arguments via EnvManager.get_slurm_extra_args().
Backend configuration updates
tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml, tests/integration/defs/perf/disagg/test_configs/wideep/perf/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml, tests/integration/defs/perf/disagg/test_configs/wideep/perf/kimi-k2-thinking-fp4_8k1k_ctx8_gen1_dep32_bs256_eplb416_mtp0_ccb-NIXL.yaml
Changed cache_transceiver_config.backend from UCX to NIXL in worker configurations.
Test list updates
tests/integration/defs/perf/disagg/testlist/all.txt, tests/integration/defs/perf/disagg/testlist/wideep.txt
Added new test cases for Qwen3, deepseek-r1, and kimi-k2 variants with NIXL backend; replaced UCX-based kimi-k2 perf entries with NIXL variants.
Benchmark submission refactoring
examples/disaggregated/slurm/benchmark/submit.py
Replaced nested allocation loop with flat single-level iteration; updated command construction to reference allocation object fields directly; expanded shell quoting for concurrency lists, file paths, model paths, dataset files, and log directories for robust parsing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

  • executor.py requires careful review of new error handling paths, try/except blocks, and return type changes to wait_for_completion() and signature change to check_job_exists() to ensure correctness of job existence/completion logic.
  • test_disagg.py control flow refactoring with unified wait/backup patterns should be traced through both perf and accuracy test paths to verify proper exception handling and log backup execution.
  • common.py and config_loader.py field additions and behavior changes (GB300 support, log file naming, context directory prefix) require verification that downstream components correctly reference updated names.
  • submit.py refactored loop and expanded quoting should be validated to ensure command construction remains correct and shell parsing is robust across all code paths.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name | Status | Explanation | Resolution
Description check | ⚠️ Warning | The PR description is minimal and lacks required sections from the template (Description, Test Coverage, and PR Checklist). | Expand the description to explain the issue being fixed, list relevant test cases ensuring sufficient coverage, and complete the PR Checklist items.
Docstring Coverage | ⚠️ Warning | Docstring coverage is 76.47%, which is below the required threshold of 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately describes the main objective of the PR: fixing GB300 support issues across multiple files and configurations.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
examples/disaggregated/slurm/benchmark/submit.py (1)

276-277: Consider using next(iter(...)) for single-element access.

Once the critical data structure issue is resolved, consider refactoring line 277 to use next(iter(allocation["nodes"].values())) instead of list(allocation["nodes"].values())[0]. This is more idiomatic, avoids creating an intermediate list, and is slightly more efficient.

🔎 Proposed refactor
-        cuda_devices = ",".join(
-            [str(device) for device in list(allocation["nodes"].values())[0]])
+        cuda_devices = ",".join(
+            [str(device) for device in next(iter(allocation["nodes"].values()))])
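
A small sketch of the difference between the two forms. The shape of `allocation` here (a `nodes` dict mapping a hostname to a list of device IDs) is an assumption inferred from the diff above, not something the PR confirms.

```python
# Hypothetical allocation; the real structure in submit.py may differ.
allocation = {"nodes": {"node-0": [0, 1, 2, 3]}}

# Builds a full list of all values only to read the first one.
first_via_list = list(allocation["nodes"].values())[0]

# Pulls only the first value from the dict view; no intermediate list.
first_via_iter = next(iter(allocation["nodes"].values()))

cuda_devices = ",".join(str(device) for device in first_via_iter)
```

One behavioral difference to keep in mind: on an empty dict the `list(...)[0]` form raises `IndexError`, while `next(iter(...))` raises `StopIteration` unless a default is passed as the second argument to `next`.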
tests/integration/defs/perf/disagg/execution/executor.py (2)

559-563: Caller cannot distinguish timeout from normal completion.

After timeout handling (cancelling the job and sleeping), the function returns implicitly with None, same as normal completion. If the caller needs to know whether the job completed normally or was cancelled due to timeout, consider returning a status indicator or raising an exception.

However, per the docstring, "The actual success/failure will be determined by log file parsing," so this design may be intentional.
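
If a status indicator is ever wanted, one possible shape is sketched below. `WaitOutcome`, `wait_for_job`, and the injected `job_exists` callable are all hypothetical names for illustration; they are not part of the PR, and the sketch deliberately returns a value instead of raising so that cleanup is never interrupted.

```python
import enum
import time
import typing


class WaitOutcome(enum.Enum):
    """Hypothetical status values; not part of the PR."""
    COMPLETED = "completed"
    TIMED_OUT = "timed_out"


def wait_for_job(job_exists: typing.Callable[[], bool],
                 timeout_s: float,
                 poll_s: float = 1.0,
                 sleep: typing.Callable[[float], None] = time.sleep
                 ) -> WaitOutcome:
    """Poll until the job leaves the queue or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while job_exists():
        if time.monotonic() >= deadline:
            # Report the timeout; the caller decides whether to cancel.
            return WaitOutcome.TIMED_OUT
        sleep(poll_s)
    return WaitOutcome.COMPLETED
```

A caller could log the outcome and still rely on log parsing for the final pass/fail decision, which keeps the current design intact.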


648-654: Consider extracting hardcoded log filename to a constant.

The filename "7_accuracy_eval.log" is hardcoded here. If this filename is used elsewhere or might change, consider extracting it to a module-level constant for maintainability.
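
A minimal sketch of that extraction. The constant and helper names are assumptions chosen for illustration; only the two filenames come from this PR.

```python
import os

# Assumed constant names; the PR does not define them.
ACCURACY_EVAL_LOG = "7_accuracy_eval.log"
BENCH_LOG = "6_bench.log"


def accuracy_log_path(result_dir: str) -> str:
    """Build the accuracy log path from a single shared constant."""
    return os.path.join(result_dir, ACCURACY_EVAL_LOG)
```

With this in place, a future rename touches one line instead of every call site.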

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e9523c and 3209553.

📒 Files selected for processing (10)
  • examples/disaggregated/slurm/benchmark/submit.py
  • tests/integration/defs/perf/disagg/execution/executor.py
  • tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/kimi-k2-thinking-fp4_8k1k_ctx8_gen1_dep32_bs256_eplb416_mtp0_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_disagg.py
  • tests/integration/defs/perf/disagg/testlist/all.txt
  • tests/integration/defs/perf/disagg/testlist/wideep.txt
  • tests/integration/defs/perf/disagg/utils/common.py
  • tests/integration/defs/perf/disagg/utils/config_loader.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic

Files:

  • tests/integration/defs/perf/disagg/test_disagg.py
  • tests/integration/defs/perf/disagg/utils/common.py
  • tests/integration/defs/perf/disagg/execution/executor.py
  • tests/integration/defs/perf/disagg/utils/config_loader.py
  • examples/disaggregated/slurm/benchmark/submit.py
**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

  • tests/integration/defs/perf/disagg/test_disagg.py
  • tests/integration/defs/perf/disagg/utils/common.py
  • tests/integration/defs/perf/disagg/execution/executor.py
  • tests/integration/defs/perf/disagg/utils/config_loader.py
  • examples/disaggregated/slurm/benchmark/submit.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7785
File: tests/integration/defs/perf/utils.py:321-333
Timestamp: 2025-09-17T06:01:01.836Z
Learning: In test infrastructure code for disaggregated serving tests, prefer logging errors and continuing execution rather than raising exceptions on timeout, to avoid disrupting test cleanup and causing cascading failures.
📚 Learning: 2025-08-22T19:08:10.822Z
Learnt from: yuanjingx87
Repo: NVIDIA/TensorRT-LLM PR: 7176
File: jenkins/L0_Test.groovy:361-389
Timestamp: 2025-08-22T19:08:10.822Z
Learning: In Slurm job monitoring scripts, when jobs have built-in timeouts configured (via --time parameter or partition/system timeouts), an additional timeout mechanism in the monitoring loop is typically unnecessary. When a Slurm job times out, it gets terminated and removed from the active queue, causing `squeue -j $jobId` to return non-zero and break monitoring loops naturally. The job's final status can then be checked via `sacct` to determine if it failed due to timeout.

Applied to files:

  • tests/integration/defs/perf/disagg/execution/executor.py
🧬 Code graph analysis (3)
tests/integration/defs/perf/disagg/test_disagg.py (3)
tests/integration/defs/perf/disagg/execution/executor.py (4)
  • JobManager (186-838)
  • wait_for_completion (504-563)
  • get_result_dir (348-366)
  • backup_logs (257-325)
tests/integration/defs/perf/disagg/utils/logger.py (1)
  • error (89-91)
tests/integration/defs/perf/disagg/utils/trackers.py (2)
  • end_test_case (30-38)
  • get_timestamps (40-55)
tests/integration/defs/perf/disagg/execution/executor.py (2)
tests/integration/defs/perf/disagg/utils/common.py (1)
  • get_work_dir (85-86)
tests/integration/defs/perf/disagg/reporting/report.py (4)
  • LogWriter (12-28)
  • print_to_console (16-28)
  • LogParser (31-238)
  • parse (73-101)
tests/integration/defs/perf/disagg/utils/config_loader.py (1)
tests/integration/defs/perf/disagg/utils/common.py (2)
  • EnvManager (44-155)
  • get_slurm_extra_args (70-74)
🪛 Ruff (0.14.8)
tests/integration/defs/perf/disagg/test_disagg.py

134-134: Do not catch blind exception: Exception

(BLE001)


216-216: Do not catch blind exception: Exception

(BLE001)

tests/integration/defs/perf/disagg/execution/executor.py

413-413: Do not catch blind exception: Exception

(BLE001)


415-415: Use explicit conversion flag

Replace with conversion flag

(RUF010)


423-423: Do not catch blind exception: Exception

(BLE001)


498-498: Do not catch blind exception: Exception

(BLE001)

examples/disaggregated/slurm/benchmark/submit.py

277-277: Prefer next(iter(allocation["nodes"].values())) over single element slice

Replace with next(iter(allocation["nodes"].values()))

(RUF015)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (15)
examples/disaggregated/slurm/benchmark/submit.py (1)

295-302: LGTM: Improved shell command quoting.

The addition of single quotes around arguments like model_path, dataset_file, concurrency_list, log_dir, cur_worker_env_var, and profiling ranges improves shell safety by preventing word splitting and glob expansion. This is a solid defensive coding practice.

Also applies to: 344-344, 351-351, 358-358
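
For comparison, the standard library's `shlex.quote` gives the same protection and also handles values that themselves contain single quotes. This is a sketch with made-up values; `run_benchmark.sh` and both variables are placeholders, not the actual command built in submit.py.

```python
import shlex

# Hypothetical values; the real ones come from the benchmark config.
model_path = "/models/my model/fp4"
concurrency_list = "1 2 4 8"

cmd = " ".join([
    "bash",
    "run_benchmark.sh",  # placeholder script name
    shlex.quote(model_path),
    shlex.quote(concurrency_list),
])

# shlex.split follows POSIX shell word-splitting rules, so it shows
# how a shell would tokenize the command.
parsed = shlex.split(cmd)
```

Each quoted value survives as a single argument even though it contains spaces.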

tests/integration/defs/perf/disagg/utils/config_loader.py (2)

80-80: LGTM! Log file naming improved with sequence prefixes.

The numeric prefixes (6_, 7_) added to log filenames improve clarity and suggest execution ordering. This aligns with structured logging practices for multi-stage benchmarks.

Also applies to: 88-88, 99-99


481-481: LGTM! GB300 SLURM support added correctly.

The new extra_args mapping enables per-GPU-type SLURM arguments via EnvManager.get_slurm_extra_args(), correctly returning --gres=gpu:4 for GB200 and an empty string for GB300 (which doesn't require gres).

tests/integration/defs/perf/disagg/utils/common.py (3)

64-67: LGTM! GB300 segment support enabled.

GB300 now correctly supports SLURM segment flags (changed from False to True), aligning with the PR objective to fix GB300 support issues.


69-74: LGTM! SLURM extra arguments properly configured per GPU type.

The new method correctly returns GPU-specific SLURM arguments (--gres=gpu:4 for GB200, empty for GB300), enabling flexible SLURM configuration across different hardware.
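
A rough sketch of what such a per-GPU-type lookup could look like. The two mapped values come from this review; the `GPU_TYPE` environment variable name, the module-level function form, and the empty-string fallback for unknown types are assumptions, and the real `EnvManager` method may be structured differently.

```python
import os

# Values as described in the review comment above.
SLURM_EXTRA_ARGS = {
    "GB200": "--gres=gpu:4",
    "GB300": "",
}


def get_slurm_extra_args(gpu_type=None) -> str:
    """Return extra SLURM arguments for a GPU type.

    Falls back to the (assumed) GPU_TYPE environment variable, and to
    an empty string for unknown types.
    """
    if gpu_type is None:
        gpu_type = os.environ.get("GPU_TYPE", "")
    return SLURM_EXTRA_ARGS.get(gpu_type.upper(), "")
```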


199-202: Verify any test result parsing or monitoring that depends on the old directory naming.

The context directory naming changed from ctx to disagg_ctx prefix in the test utility (lines 199-202). This affects temporary directory structure for disaggregated serving performance tests. While the codebase search found no hardcoded references to the old ctx*_gen* directory pattern in automation scripts, any custom monitoring or result aggregation tools that parse these directory names should be updated to handle the new disagg_ctx*_gen* pattern. This is a test infrastructure refactoring, not a core API breaking change.

tests/integration/defs/perf/disagg/test_disagg.py (3)

80-112: LGTM! Test execution flow simplified with unified completion handling.

The refactored perf test correctly:

  • Initializes placeholders for job tracking
  • Uses unified wait_for_completion for timeout and early failure detection
  • Maintains debug mode support
  • Simplifies control flow for better maintainability

127-135: LGTM! Robust log backup with appropriate error handling.

The finally block correctly ensures logs are always backed up, with exception handling that prevents backup failures from masking primary test failures. The broad exception catch is intentional here to guarantee cleanup occurs.

Based on learnings, test infrastructure should log errors and continue execution rather than raising exceptions during cleanup.
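
The pattern described here can be sketched as follows. `run_case` and its two callables are hypothetical stand-ins for the test body and `backup_logs`; the point is only that a failing backup is logged and never replaces the original test exception.

```python
import logging

logger = logging.getLogger("disagg_sketch")


def run_case(run_test, backup_logs):
    """Run a test and always attempt a log backup afterwards."""
    try:
        return run_test()
    finally:
        try:
            backup_logs()
        except Exception as exc:
            # Broad on purpose: a backup failure must not mask the
            # primary test failure, so it is logged and swallowed.
            logger.error("Log backup failed: %s", exc)
```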


156-217: LGTM! Accuracy test follows consistent pattern with robust error handling.

The accuracy test refactoring mirrors the perf test improvements:

  • Unified completion handling with appropriate timeout (3 hours for accuracy vs 2 hours for perf)
  • Consistent log backup in finally block
  • Safe None-checking for result variable
  • Proper exception handling that doesn't mask primary failures

Based on learnings, this approach aligns with preferred test infrastructure patterns.

tests/integration/defs/perf/disagg/execution/executor.py (6)

396-424: Good defensive error handling pattern.

The initialization of a default result before the try block, combined with exception handling that logs errors and continues, aligns well with the test infrastructure's need to avoid cascading failures. This ensures check_result always returns a valid result dict.

Based on learnings, broad exception handling is preferred in test infrastructure to avoid disrupting test cleanup.
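
In outline, the pattern looks like the sketch below. The dict keys and the `parse_logs` callable are placeholders, since the actual result structure in executor.py is not shown in this review.

```python
import logging

logger = logging.getLogger("disagg_sketch")


def check_result(parse_logs):
    """Return a result dict on every path, including failures."""
    # Default is built before the try, so it exists even if parsing fails.
    result = {"success": False, "error": None}
    try:
        result["success"] = bool(parse_logs())
    except Exception as exc:
        # Broad catch is intentional in this test-infrastructure sketch.
        logger.error("Result checking failed: %s", exc)
        result["error"] = f"Exception during result checking: {exc}"
    return result
```

Because the error text is stored in the dict as well as logged, callers can report the failure reason without re-reading the logs.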


485-501: Cleaner API with boolean existence check.

The simplification from check_job_status() -> str to check_job_exists() -> bool is a good refactor. The logic correctly treats squeue failures as "job no longer exists," which aligns with the learning that SLURM job monitoring loops naturally break when jobs leave the queue.
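
A minimal sketch of an existence check along these lines, assuming `squeue` is on the PATH. The injectable `run` parameter is added here only to make the function testable without a SLURM cluster; the PR's actual implementation may differ.

```python
import subprocess


def check_job_exists(job_id: str, run=subprocess.run) -> bool:
    """Return True while squeue still lists the job.

    Any squeue failure (non-zero exit, missing binary, timeout) is
    treated as "job no longer exists".
    """
    try:
        proc = run(["squeue", "-j", job_id, "--noheader"],
                   capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    return proc.returncode == 0 and job_id in proc.stdout
```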


584-595: Good defensive file existence checks.

Adding pre-checks for SLURM log and result directory existence before attempting to print logs prevents FileNotFoundError exceptions and provides meaningful warning messages instead.
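
A small sketch of the pre-check idea. `print_log_if_present` and its `emit` parameter are illustrative names, not functions from the PR.

```python
import logging
import os

logger = logging.getLogger("disagg_sketch")


def print_log_if_present(path: str, emit=print) -> bool:
    """Emit a log file's contents, or warn and skip if it is missing."""
    if not os.path.isfile(path):
        logger.warning("Log file not found, skipping: %s", path)
        return False
    with open(path, encoding="utf-8", errors="replace") as log_file:
        emit(log_file.read())
    return True
```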


740-753: Consistent pre-check pattern with accuracy result checking.

Good symmetry with _check_accuracy_result. Both methods now validate file existence before attempting to parse, preventing cryptic downstream errors.


759-768: Good error propagation to result dict.

Propagating meaningful error messages into the result structure (instead of just logging) allows callers to access failure reasons programmatically.


1-2: Verify NVIDIA copyright header presence.

As per coding guidelines, all TensorRT-LLM code should contain an NVIDIA copyright header with the year of its latest modification. The provided code starts with the module docstring at line 1. Please ensure the copyright header exists above line 1 (not shown in the review context).

@fredricz-20070104
Collaborator Author

/bot run --skip-test

@tensorrt-cicd
Collaborator

PR_Github #29412 [ run ] triggered by Bot. Commit: ecca266

@tensorrt-cicd
Collaborator

PR_Github #29412 [ run ] completed with state FAILURE. Commit: ecca266
/LLM/main/L0_MergeRequest_PR pipeline #22598 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
@fredricz-20070104
Collaborator Author

/bot run --skip-test

@tensorrt-cicd
Collaborator

PR_Github #29417 [ run ] triggered by Bot. Commit: aa9d8ac

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
@fredricz-20070104 fredricz-20070104 enabled auto-merge (squash) December 22, 2025 12:33
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
@tensorrt-cicd
Collaborator

PR_Github #29417 [ run ] completed with state SUCCESS. Commit: aa9d8ac
/LLM/main/L0_MergeRequest_PR pipeline #22602 (Partly Tested) completed with status: 'SUCCESS'

@fredricz-20070104
Collaborator Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #29480 [ reuse-pipeline ] triggered by Bot. Commit: 5bbad9e

@tensorrt-cicd
Collaborator

PR_Github #29480 [ reuse-pipeline ] completed with state SUCCESS. Commit: 5bbad9e
Reusing PR_Github #29417 (Partly Tested) for commit 5bbad9e

@fredricz-20070104 fredricz-20070104 merged commit 621156a into NVIDIA:main Dec 23, 2025
5 checks passed
JunyiXu-nv pushed a commit to JunyiXu-nv/TensorRT-LLM that referenced this pull request Dec 30, 2025
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: fredricz-20070104 <226039983+fredricz-20070104@users.noreply.github.com>