-
Notifications
You must be signed in to change notification settings - Fork 42
Always set number of nodes for srun cmd #729
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Walkthroughgen_srun_prefix gained a new with_num_nodes parameter; when true it reads num_nodes from cached spec and inserts a -N{num_nodes} token into generated srun prefixes. Call sites, tests, and many reference sbatch fixtures were updated to add or suppress explicit Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (2)
🧰 Additional context used🧬 Code graph analysis (2)src/cloudai/workloads/bash_cmd/bash_cmd.py (2)
src/cloudai/workloads/slurm_container/slurm_command_gen_strategy.py (2)
🔇 Additional comments (2)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile OverviewGreptile SummaryThis PR ensures that the
Confidence Score: 5/5
Important Files ChangedFile Analysis
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
23 files reviewed, no comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (23)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py(1 hunks)tests/ref_data/ai-dynamo.sbatch(2 hunks)tests/ref_data/ddlb.sbatch(1 hunks)tests/ref_data/deepep-benchmark.sbatch(1 hunks)tests/ref_data/gpt-no-hook.sbatch(1 hunks)tests/ref_data/gpt-pre-test.sbatch(1 hunks)tests/ref_data/grok-no-hook.sbatch(1 hunks)tests/ref_data/grok-pre-test.sbatch(1 hunks)tests/ref_data/megatron-run.sbatch(1 hunks)tests/ref_data/nccl.sbatch(1 hunks)tests/ref_data/nemo-run-no-hook.sbatch(1 hunks)tests/ref_data/nemo-run-pre-test.sbatch(1 hunks)tests/ref_data/nemo-run-vboost.sbatch(1 hunks)tests/ref_data/nixl-kvbench.sbatch(1 hunks)tests/ref_data/nixl-perftest.sbatch(1 hunks)tests/ref_data/nixl_bench.sbatch(1 hunks)tests/ref_data/sleep.sbatch(1 hunks)tests/ref_data/slurm_container.sbatch(1 hunks)tests/ref_data/triton-inference.sbatch(1 hunks)tests/ref_data/ucc.sbatch(1 hunks)tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py(1 hunks)tests/slurm_command_gen_strategy/test_slurm_container_slurm_command_gen_strategy.py(3 hunks)tests/test_single_sbatch_runner.py(5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)
src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py (1)
get_cached_nodes_spec(182-232)
🔇 Additional comments (32)
tests/ref_data/nccl.sbatch (1)
13-17:-N1additions match the new srun prefix behaviorAll three sruns now explicitly specify
-N1after--mpi=pmix, which is consistent with the updatedgen_srun_prefixand the goal of always setting a node count when we care about per-node task behavior. No issues here.tests/ref_data/grok-pre-test.sbatch (1)
15-19: Pre-test sruns correctly updated to include-N1The mapping, metadata, and nccl pre-test sruns now explicitly specify
-N1alongside--mpi=pmix, aligning with the new command-generation strategy. Ordering with the--output/--erroroptions looks fine and there are no redundant node flags.tests/ref_data/ddlb.sbatch (1)
13-17: Explicit-N1on all ddlb sruns is consistent with the strategyEach of the ddlb-related sruns now includes
-N1directly after--mpi=pmix, matching the updated srun prefix semantics and ensuring the node count is explicit. Looks good.tests/ref_data/slurm_container.sbatch (1)
13-17: Container sbatch sruns correctly updated with-N1All container-related sruns now carry an explicit
-N1after--mpi=pmix, in line with the new Slurm command-generation strategy and the referenced bugfix. No further issues.tests/ref_data/ucc.sbatch (1)
13-17: UCC sbatch sruns now explicitly constrain to a single nodeThe added
-N1for mapping, metadata, anducc_perftestsruns aligns with the new srun prefix and makes the single-node allocation explicit. This looks correct.tests/ref_data/gpt-no-hook.sbatch (1)
15-17: GPT no-hook mapping/metadata sruns properly updated with-N1Both mapping and metadata sruns now explicitly set
-N1alongside--mpi=pmix, which aligns with the new command-generation behavior and the goal of always specifying a node count for these steps. No concerns.src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)
240-259: _enable_vboost_cmd is intentionally different; duplicate-Nrisk fromextra_srun_argsis valid but unlikely in practiceThe concerns in the original review are partially accurate:
_enable_vboost_cmd is NOT inconsistent: It intentionally uses
--ntasks={num_nodes}(line 311-321) rather than-N, which is semantically correct for vboost—it runs one privileged GPU operation per node as a single task per node, not as a distributed parallel job. This is different from the main test's-N{num_nodes}which sets node distribution for MPI/parallel workloads.Duplicate
-Nfromextra_srun_argsis a valid but low-risk concern: The basepre_test_srun_extra_args()returns an empty list (line 155) with no overrides found, so that path is safe. However,system.extra_srun_args(a free-form string appended at line 257) could theoretically contain-Nor--nodes, creating a duplicate. SLURM would use the last value, but it's confusing. This is unlikely in typical configs but worth documenting.No actionable changes needed—the implementation is sound. If you want to harden against misconfiguration, you could validate and strip explicit
-Nfromextra_srun_argsbefore appending, but that's optional refactoring.tests/ref_data/megatron-run.sbatch (1)
13-17: LGTM: Consistent node count specification.All three srun invocations correctly specify -N1, matching the sbatch allocation and ensuring that --ntasks-per-node is not ignored.
tests/ref_data/triton-inference.sbatch (1)
18-20: LGTM: Mapping and metadata commands correctly specify -N3.These srun invocations correctly use -N3 to match the full allocation.
tests/ref_data/nemo-run-no-hook.sbatch (1)
14-18: LGTM: Consistent node count specification.All three srun invocations correctly specify -N1, matching the sbatch allocation.
tests/ref_data/nemo-run-vboost.sbatch (1)
17-21: LGTM: Container commands correctly specify -N1.The three container-based srun invocations correctly specify -N1.
tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py (1)
274-284: LGTM: Test expectation correctly updated.The test now expects the -N flag to be included in the srun command prefix, aligning with the PR's objective.
tests/ref_data/ai-dynamo.sbatch (2)
13-15: LGTM: Mapping and metadata commands correctly specify -N2.These srun invocations correctly use -N2 to match the allocation.
21-48: Approved: Main command correctly specifies -N2.The srun command specifies both -N2 and --nodes=2, which is redundant but not incorrect since they specify the same value. While redundant, this ensures the node count is explicit and doesn't create conflicts.
tests/ref_data/deepep-benchmark.sbatch (1)
23-27: LGTM: Consistent node count specification.All three srun invocations correctly specify -N2, matching the sbatch allocation.
tests/ref_data/sleep.sbatch (1)
13-17: LGTM: Consistent node count specification.All three srun invocations correctly specify -N1, matching the sbatch allocation.
tests/ref_data/gpt-pre-test.sbatch (1)
15-19: LGTM!The explicit
-N1flag is correctly added to all three auxiliary srun invocations (mapping, metadata, and nccl pre-test), consistent with the#SBATCH -N 1allocation header on line 7.tests/slurm_command_gen_strategy/test_slurm_container_slurm_command_gen_strategy.py (3)
46-53: LGTM!The test expectation correctly includes
-N{test_run.num_nodes}in the srun command, validating the new explicit node-count behavior.
64-71: LGTM!Consistent with other test cases, the nsys test expectation now includes the explicit node count flag.
84-92: LGTM!The extra_srun_args test correctly positions
-N{test_run.num_nodes}before the extra arguments, maintaining proper argument ordering.tests/ref_data/grok-no-hook.sbatch (1)
15-17: LGTM!The explicit
-N1flag is correctly added to the mapping and metadata srun invocations, matching the#SBATCH -N 1allocation.tests/test_single_sbatch_runner.py (4)
241-255: LGTM!The bare metal test correctly expects
-N{sleep_tr.num_nodes}in both metadata and ranks mapping srun commands.
271-286: LGTM!The container test correctly expects
-N{nccl_tr.num_nodes}in the metadata and ranks mapping srun commands for containerized workloads.
321-327: LGTM!The single test run block correctly includes
-N{sleep_tr.num_nodes}in the expected srun command structure.
516-529: LGTM!The pre-test command expectation correctly includes
-N{sleep_tr.num_nodes}in the srun command.tests/ref_data/nixl-kvbench.sbatch (1)
15-17: LGTM!The explicit
-N2flag is correctly added to the mapping and metadata srun invocations, matching the#SBATCH -N 2allocation.tests/ref_data/nemo-run-pre-test.sbatch (4)
14-14: LGTM! Explicit node count added for mapping command.The addition of
-N1ensures srun respects the node allocation for the mapping command, preventing potential issues where--ntasks-per-nodemight be ignored.
16-16: LGTM! Explicit node count added for metadata collection.The addition of
-N1ensures the metadata collection runs on exactly one node, addressing the issue where--ntasks-per-node=1could be ignored without explicit-N.
18-18: LGTM! Explicit node count added for NCCL pre-test.The addition of
-N1ensures the NCCL pre-test runs on a single node as intended, preventing potential distribution issues.
22-22: LGTM! Explicit node count added for main execution.The addition of
-N1ensures the main nemorun execution respects the single-node allocation, consistent with thetrainer.num_nodes=1parameter in the command.tests/ref_data/nixl_bench.sbatch (2)
15-15: LGTM! Explicit node count added for mapping command.The addition of
-N2correctly ensures the mapping command runs across both allocated nodes.
17-17: LGTM! Explicit node count added for metadata collection.The addition of
-N2ensures metadata is collected from both nodes, preventing--ntasks-per-node=1from being ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_single_sbatch_runner.py (1)
241-255: Explicit-Nin bare‑metal auxsruncommands looks correctIncluding
-N{sleep_tr.num_nodes}in both metadata and ranks‑mappingsruncommands matches the PR’s intent and should prevent--ntasks-per-nodefrom being ignored on multi‑node allocations. Consider also asserting on-Nin thetest_max_nodes_used_for_metadata*tests so regressions in the max‑nodes path are caught, not just changes to--ntasks.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
tests/test_single_sbatch_runner.py(5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_single_sbatch_runner.py (1)
tests/report_generation_strategy/conftest.py (1)
nccl_tr(26-61)
🔇 Additional comments (2)
tests/test_single_sbatch_runner.py (2)
272-283: Container auxsruncommands now correctly carry node countThe added
-N{nccl_tr.num_nodes}in both container metadata and ranks‑mapping commands is consistent with the bare‑metal path and aligns with the Slurm bugfix around--ntasks-per-nodehandling.
516-522: Pre‑testsrunnow pins node count explicitlyAdding
-N{sleep_tr.num_nodes}to the pre‑test command ensures pre‑tests are also run with an explicit node count, consistent with the main aux/test commands and the Slurm behavior you’re targeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
23 files reviewed, no comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
29 files reviewed, no comments
Summary
When
--ntasks-per-nodeis used,-Nseems to be very important to set as well, otherwise it might be ignored.Fixes internal bug.
Test Plan
Additional Notes
Here there is a simple example to demonstrate this behavior:
Output is:
Running with -N eos0364 eos0366 Running without -N srun: warning: can't honor --ntasks-per-node set to 1 which doesn't match the requested tasks 16 with the number of requested nodes 2. Ignoring --ntasks-per-node. eos0366 eos0366 eos0366 eos0366 eos0366 eos0366 eos0366 eos0366 eos0364 eos0364 eos0364 eos0364 eos0364 eos0364 eos0364Summary by CodeRabbit
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.