-
Notifications
You must be signed in to change notification settings - Fork 2k
[None][chore] Fix GB300 support issues #10196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[None][chore] Fix GB300 support issues #10196
Conversation
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: fredricz-20070104 <226039983+fredricz-20070104@users.noreply.github.com>
📝 WalkthroughWalkthroughRefactored test infrastructure for disaggregated execution with improved error handling, updated configuration defaults, and enhanced SLURM management capabilities. Changes include method signature updates for job status checking, unified test completion handling, new SLURM configuration options, and metric log file naming updates, along with control flow adjustments to increase robustness. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~35 minutes
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
examples/disaggregated/slurm/benchmark/submit.py (1)
276-277: Consider usingnext(iter(...))for single-element access.Once the critical data structure issue is resolved, consider refactoring line 277 to use
next(iter(allocation["nodes"].values()))instead oflist(allocation["nodes"].values())[0]. This is more idiomatic, avoids creating an intermediate list, and is slightly more efficient.🔎 Proposed refactor
- cuda_devices = ",".join( - [str(device) for device in list(allocation["nodes"].values())[0]]) + cuda_devices = ",".join( + [str(device) for device in next(iter(allocation["nodes"].values()))])tests/integration/defs/perf/disagg/execution/executor.py (2)
559-563: Caller cannot distinguish timeout from normal completion.After timeout handling (cancelling the job and sleeping), the function returns implicitly with
None, same as normal completion. If the caller needs to know whether the job completed normally or was cancelled due to timeout, consider returning a status indicator or raising an exception.However, per the docstring, "The actual success/failure will be determined by log file parsing," so this design may be intentional.
648-654: Consider extracting hardcoded log filename to a constant.The filename
"7_accuracy_eval.log"is hardcoded here. If this filename is used elsewhere or might change, consider extracting it to a module-level constant for maintainability.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
examples/disaggregated/slurm/benchmark/submit.pytests/integration/defs/perf/disagg/execution/executor.pytests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/kimi-k2-thinking-fp4_8k1k_ctx8_gen1_dep32_bs256_eplb416_mtp0_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_disagg.pytests/integration/defs/perf/disagg/testlist/all.txttests/integration/defs/perf/disagg/testlist/wideep.txttests/integration/defs/perf/disagg/utils/common.pytests/integration/defs/perf/disagg/utils/config_loader.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming:some_file.py
Python classes should use PascalCase naming:class SomeClass
Python functions and methods should use snake_case naming:def my_awesome_function():
Python local variables should use snake_case naming:my_variable = ...
Python variable names that start with a number should be prefixed with 'k':k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G':G_MY_GLOBAL = ...
Python constants should use upper snake_case naming:MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
Files:
tests/integration/defs/perf/disagg/test_disagg.pytests/integration/defs/perf/disagg/utils/common.pytests/integration/defs/perf/disagg/execution/executor.pytests/integration/defs/perf/disagg/utils/config_loader.pyexamples/disaggregated/slurm/benchmark/submit.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
tests/integration/defs/perf/disagg/test_disagg.pytests/integration/defs/perf/disagg/utils/common.pytests/integration/defs/perf/disagg/execution/executor.pytests/integration/defs/perf/disagg/utils/config_loader.pyexamples/disaggregated/slurm/benchmark/submit.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7785
File: tests/integration/defs/perf/utils.py:321-333
Timestamp: 2025-09-17T06:01:01.836Z
Learning: In test infrastructure code for disaggregated serving tests, prefer logging errors and continuing execution rather than raising exceptions on timeout, to avoid disrupting test cleanup and causing cascading failures.
📚 Learning: 2025-08-22T19:08:10.822Z
Learnt from: yuanjingx87
Repo: NVIDIA/TensorRT-LLM PR: 7176
File: jenkins/L0_Test.groovy:361-389
Timestamp: 2025-08-22T19:08:10.822Z
Learning: In Slurm job monitoring scripts, when jobs have built-in timeouts configured (via --time parameter or partition/system timeouts), an additional timeout mechanism in the monitoring loop is typically unnecessary. When a Slurm job times out, it gets terminated and removed from the active queue, causing `squeue -j $jobId` to return non-zero and break monitoring loops naturally. The job's final status can then be checked via `sacct` to determine if it failed due to timeout.
Applied to files:
tests/integration/defs/perf/disagg/execution/executor.py
🧬 Code graph analysis (3)
tests/integration/defs/perf/disagg/test_disagg.py (3)
tests/integration/defs/perf/disagg/execution/executor.py (4)
JobManager(186-838)wait_for_completion(504-563)get_result_dir(348-366)backup_logs(257-325)tests/integration/defs/perf/disagg/utils/logger.py (1)
error(89-91)tests/integration/defs/perf/disagg/utils/trackers.py (2)
end_test_case(30-38)get_timestamps(40-55)
tests/integration/defs/perf/disagg/execution/executor.py (2)
tests/integration/defs/perf/disagg/utils/common.py (1)
get_work_dir(85-86)tests/integration/defs/perf/disagg/reporting/report.py (4)
LogWriter(12-28)print_to_console(16-28)LogParser(31-238)parse(73-101)
tests/integration/defs/perf/disagg/utils/config_loader.py (1)
tests/integration/defs/perf/disagg/utils/common.py (2)
EnvManager(44-155)get_slurm_extra_args(70-74)
🪛 Ruff (0.14.8)
tests/integration/defs/perf/disagg/test_disagg.py
134-134: Do not catch blind exception: Exception
(BLE001)
216-216: Do not catch blind exception: Exception
(BLE001)
tests/integration/defs/perf/disagg/execution/executor.py
413-413: Do not catch blind exception: Exception
(BLE001)
415-415: Use explicit conversion flag
Replace with conversion flag
(RUF010)
423-423: Do not catch blind exception: Exception
(BLE001)
498-498: Do not catch blind exception: Exception
(BLE001)
examples/disaggregated/slurm/benchmark/submit.py
277-277: Prefer next(iter(allocation["nodes"].values())) over single element slice
Replace with next(iter(allocation["nodes"].values()))
(RUF015)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (15)
examples/disaggregated/slurm/benchmark/submit.py (1)
295-302: LGTM: Improved shell command quoting.The addition of single quotes around arguments like
model_path,dataset_file,concurrency_list,log_dir,cur_worker_env_var, and profiling ranges improves shell safety by preventing word splitting and glob expansion. This is a solid defensive coding practice.Also applies to: 344-344, 351-351, 358-358
tests/integration/defs/perf/disagg/utils/config_loader.py (2)
80-80: LGTM! Log file naming improved with sequence prefixes.The numeric prefixes (6_, 7_) added to log filenames improve clarity and suggest execution ordering. This aligns with structured logging practices for multi-stage benchmarks.
Also applies to: 88-88, 99-99
481-481: LGTM! GB300 SLURM support added correctly.The new
extra_argsmapping enables per-GPU-type SLURM arguments viaEnvManager.get_slurm_extra_args(), correctly returning--gres=gpu:4for GB200 and an empty string for GB300 (which doesn't require gres).tests/integration/defs/perf/disagg/utils/common.py (3)
64-67: LGTM! GB300 segment support enabled.GB300 now correctly supports SLURM segment flags (changed from
FalsetoTrue), aligning with the PR objective to fix GB300 support issues.
69-74: LGTM! SLURM extra arguments properly configured per GPU type.The new method correctly returns GPU-specific SLURM arguments (
--gres=gpu:4for GB200, empty for GB300), enabling flexible SLURM configuration across different hardware.
199-202: Verify any test result parsing or monitoring that depends on the old directory naming.The context directory naming changed from
ctxtodisagg_ctxprefix in the test utility (lines 199-202). This affects temporary directory structure for disaggregated serving performance tests. While the codebase search found no hardcoded references to the oldctx*_gen*directory pattern in automation scripts, any custom monitoring or result aggregation tools that parse these directory names should be updated to handle the newdisagg_ctx*_gen*pattern. This is a test infrastructure refactoring, not a core API breaking change.tests/integration/defs/perf/disagg/test_disagg.py (3)
80-112: LGTM! Test execution flow simplified with unified completion handling.The refactored perf test correctly:
- Initializes placeholders for job tracking
- Uses unified
wait_for_completionfor timeout and early failure detection- Maintains debug mode support
- Simplifies control flow for better maintainability
127-135: LGTM! Robust log backup with appropriate error handling.The
finallyblock correctly ensures logs are always backed up, with exception handling that prevents backup failures from masking primary test failures. The broad exception catch is intentional here to guarantee cleanup occurs.Based on learnings, test infrastructure should log errors and continue execution rather than raising exceptions during cleanup.
156-217: LGTM! Accuracy test follows consistent pattern with robust error handling.The accuracy test refactoring mirrors the perf test improvements:
- Unified completion handling with appropriate timeout (3 hours for accuracy vs 2 hours for perf)
- Consistent log backup in
finallyblock- Safe None-checking for result variable
- Proper exception handling that doesn't mask primary failures
Based on learnings, this approach aligns with preferred test infrastructure patterns.
tests/integration/defs/perf/disagg/execution/executor.py (6)
396-424: Good defensive error handling pattern.The initialization of a default result before the try block, combined with exception handling that logs errors and continues, aligns well with the test infrastructure's need to avoid cascading failures. This ensures
check_resultalways returns a valid result dict.Based on learnings, broad exception handling is preferred in test infrastructure to avoid disrupting test cleanup.
485-501: Cleaner API with boolean existence check.The simplification from
check_job_status() -> strtocheck_job_exists() -> boolis a good refactor. The logic correctly treats squeue failures as "job no longer exists," which aligns with the learning that SLURM job monitoring loops naturally break when jobs leave the queue.
584-595: Good defensive file existence checks.Adding pre-checks for SLURM log and result directory existence before attempting to print logs prevents
FileNotFoundErrorexceptions and provides meaningful warning messages instead.
740-753: Consistent pre-check pattern with accuracy result checking.Good symmetry with
_check_accuracy_result. Both methods now validate file existence before attempting to parse, preventing cryptic downstream errors.
759-768: Good error propagation to result dict.Propagating meaningful error messages into the result structure (instead of just logging) allows callers to access failure reasons programmatically.
1-2: Verify NVIDIA copyright header presence.As per coding guidelines, all TensorRT-LLM code should contain an NVIDIA copyright header with the year of its latest modification. The provided code starts with the module docstring at line 1. Please ensure the copyright header exists above line 1 (not shown in the review context).
|
/bot run --skip-test |
|
PR_Github #29412 [ run ] triggered by Bot. Commit: |
|
PR_Github #29412 [ run ] completed with state
|
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
|
/bot run --skip-test |
|
PR_Github #29417 [ run ] triggered by Bot. Commit: |
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
|
PR_Github #29417 [ run ] completed with state |
|
/bot reuse-pipeline |
|
PR_Github #29480 [ reuse-pipeline ] triggered by Bot. Commit: |
|
PR_Github #29480 [ reuse-pipeline ] completed with state |
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com> Signed-off-by: fredricz-20070104 <226039983+fredricz-20070104@users.noreply.github.com>
Summary by CodeRabbit
New Features
Bug Fixes
Chores
✏️ Tip: You can customize this high-level summary in your review settings.
Fix GB300 support issues