@amaslenn
Contributor

@amaslenn amaslenn commented Nov 27, 2025

Summary

When --ntasks-per-node is passed to srun, -N should be set as well; otherwise srun may ignore --ntasks-per-node.

Fixes an internal bug.

Test Plan

  1. CI
  2. Manual runs

Additional Notes

Here is a simple example that demonstrates this behavior:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=8

echo "Running with -N"
srun -N2 --mpi=pmix --ntasks-per-node=1 hostname

echo "Running without -N"
srun --mpi=pmix --ntasks-per-node=1 hostname

Output is:

Running with -N
eos0364
eos0366
Running without -N
srun: warning: can't honor --ntasks-per-node set to 1 which doesn't match the requested tasks 16 with the number of requested nodes 2. Ignoring --ntasks-per-node.
eos0366
eos0366
eos0366
eos0366
eos0366
eos0366
eos0366
eos0366
eos0364
eos0364
eos0364
eos0364
eos0364
eos0364
eos0364
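
One way to check the distribution shown above is to count how many tasks printed each hostname. The sketch below is illustrative only: `tasks_per_node` is a hypothetical helper, not part of this PR, and it assumes srun diagnostics are the lines starting with `srun:`.

```python
from collections import Counter


def tasks_per_node(srun_output: str) -> Counter:
    """Count how many tasks reported each hostname, skipping srun diagnostics."""
    hosts = [
        line.strip()
        for line in srun_output.splitlines()
        if line.strip() and not line.startswith("srun:")
    ]
    return Counter(hosts)


# Output of the "Running with -N" step from the example above.
with_n = "eos0364\neos0366\n"
counts = tasks_per_node(with_n)

# --ntasks-per-node=1 was honored if every node ran exactly one task.
assert all(n == 1 for n in counts.values())
```

Applied to the "Running without -N" output, the same check fails because each host appears several times.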

Summary by CodeRabbit

  • New Features

    • Slurm job submissions now include explicit node-count flags (e.g., -N1/-N2/-N3) to enforce per-job node allocation across benchmarks and workloads.
    • Selected workflow steps now opt out of automatic node-count insertion to allow explicit, manual node specifications where needed.
  • Tests

    • Reference sbatch scripts and test cases updated to validate the new explicit node-count behavior across multiple workload scenarios.

@amaslenn amaslenn added the bug Something isn't working label Nov 27, 2025
@coderabbitai

coderabbitai bot commented Nov 27, 2025

Walkthrough

gen_srun_prefix gained a new with_num_nodes parameter; when true it reads num_nodes from cached spec and inserts a -N{num_nodes} token into generated srun prefixes. Call sites, tests, and many reference sbatch fixtures were updated to add or suppress explicit -N flags.
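
The behavior described above can be sketched roughly as follows. This is a simplified stand-in, not the project's implementation: the function name `gen_srun_prefix_sketch` is hypothetical, `num_nodes` is passed in directly instead of being read from `get_cached_nodes_spec()`, and the base flags are taken from the prefixes shown elsewhere in this conversation.

```python
from __future__ import annotations


def gen_srun_prefix_sketch(
    num_nodes: int,
    with_num_nodes: bool = True,
    extra_srun_args: str = "",
) -> list[str]:
    """Build an srun prefix, optionally pinning the node count with -N."""
    prefix = ["srun", "--export=ALL", "--mpi=pmix"]
    if with_num_nodes:
        # Explicit node count so srun does not drop --ntasks-per-node.
        prefix.append(f"-N{num_nodes}")
    if extra_srun_args:
        # Free-form extras are appended after the node flag.
        prefix.append(extra_srun_args)
    return prefix


print(" ".join(gen_srun_prefix_sketch(2)))
# prints: srun --export=ALL --mpi=pmix -N2
```

Defaulting `with_num_nodes` to `True` means existing callers get the explicit `-N` without changes, while callers that compose node flags themselves can pass `False`.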

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Core strategy: src/cloudai/systems/slurm/slurm_command_gen_strategy.py | Added with_num_nodes: bool = True to gen_srun_prefix; when true, fetches num_nodes from get_cached_nodes_spec() and inserts -N{num_nodes} into the srun prefix. |
| Container strategy / wrapper: src/cloudai/workloads/slurm_container/slurm_command_gen_strategy.py, src/cloudai/workloads/bash_cmd/bash_cmd.py | Propagated new with_num_nodes parameter in signatures and super() calls so container-related prefixes honor the flag. |
| Call sites that disable auto -N: src/cloudai/workloads/common/nixl.py, src/cloudai/workloads/nixl_perftest/slurm_command_gen_strategy.py, src/cloudai/workloads/triton_inference/slurm_command_gen_strategy.py | Updated calls to gen_srun_prefix(with_num_nodes=False) to suppress automatic -N insertion where node flags are composed manually. |
| Tests, command-generation expectations: tests/slurm_command_gen_strategy/*, tests/test_single_sbatch_runner.py | Adjusted expected srun command strings to include -N{num_nodes} or explicitly call with_num_nodes=False per new behavior. |
| Reference sbatch fixtures, many files: tests/ref_data/* (e.g. ddlb.sbatch, gpt-*.sbatch, grok-*.sbatch, megatron-run.sbatch, nccl.sbatch, nemo-*.sbatch, nixl-*.sbatch, sleep.sbatch, slurm_container.sbatch, ucc.sbatch, ai-dynamo.sbatch, deepep-benchmark.sbatch, triton-inference.sbatch, nixl_bench.sbatch, nixl-kvbench.sbatch) | Inserted explicit -N flags into srun invocations (values vary per fixture: -N1, -N2, or -N3) and adjusted placement (commonly after --mpi=pmix) to reflect generated prefixes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Attention points:
    • src/cloudai/systems/slurm/slurm_command_gen_strategy.py — verify default True, correct retrieval of num_nodes, and exact token placement/whitespace.
    • Call sites that now pass with_num_nodes=False — ensure they still append their manual --nodes/-N options without duplication.
    • Tests and tests/ref_data/* — confirm expected sbatch lines match Slurm syntax and generator output.

Poem

🐇 I slipped a dash-N into each run,

paws tapped keys until the change was done.
Nodes now counted, prefixes sing,
Tests hopped by on nimble spring.
Carrots for CI — a soft drum, a fun.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.26% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main change: ensuring the number of nodes (-N) is consistently set for srun commands across the codebase to prevent srun from ignoring --ntasks-per-node. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db9de5d and 531acef.

📒 Files selected for processing (2)
  • src/cloudai/workloads/bash_cmd/bash_cmd.py (1 hunks)
  • src/cloudai/workloads/slurm_container/slurm_command_gen_strategy.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/cloudai/workloads/bash_cmd/bash_cmd.py (2)
src/cloudai/workloads/slurm_container/slurm_command_gen_strategy.py (1)
  • gen_srun_prefix (38-41)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)
  • gen_srun_prefix (240-261)
src/cloudai/workloads/slurm_container/slurm_command_gen_strategy.py (2)
src/cloudai/workloads/bash_cmd/bash_cmd.py (1)
  • gen_srun_prefix (49-50)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)
  • gen_srun_prefix (240-261)
🔇 Additional comments (2)
src/cloudai/workloads/slurm_container/slurm_command_gen_strategy.py (1)

38-41: Propagating with_num_nodes through the container strategy looks correct

The override keeps the signature aligned with SlurmCommandGenStrategy.gen_srun_prefix, correctly forwards both use_pretest_extras and with_num_nodes to super(), and then appends container-specific extra_srun_args. This preserves existing behavior while ensuring the new -N handling in the base class also applies to container tests.

src/cloudai/workloads/bash_cmd/bash_cmd.py (1)

49-50: Signature now correctly matches SlurmCommandGenStrategy while preserving BashCmd semantics

Updating gen_srun_prefix to accept with_num_nodes: bool = True keeps this override compatible with the base SlurmCommandGenStrategy.gen_srun_prefix while still returning an empty list, which matches the existing design where BashCmd workloads rely on the user-provided cmd rather than framework-generated srun prefixes. No functional issues from this change.



TaekyungHeo
TaekyungHeo previously approved these changes Nov 27, 2025
@greptile-apps

greptile-apps bot commented Nov 27, 2025

Greptile Overview

Greptile Summary

This PR ensures that the -N (number of nodes) flag is always explicitly set for srun commands when using --ntasks-per-node. This fixes a Slurm behavior issue where --ntasks-per-node can be silently ignored if -N is not specified, leading to incorrect task distribution across nodes.

  • Added with_num_nodes parameter to gen_srun_prefix() method, defaulting to True to include -N{num_nodes} in srun commands
  • Workloads that explicitly set their own node counts (NIXL, Triton Inference) now pass with_num_nodes=False to avoid redundant/conflicting flags
  • Updated all test reference sbatch files and test expectations to reflect the new behavior
  • Interface-only change for BashCmdCommandGenStrategy which returns an empty list regardless

Confidence Score: 5/5

  • This PR is safe to merge. It addresses a documented Slurm behavior issue with a minimal, focused change.
  • The change is well-scoped, fixing a specific Slurm command generation issue. All test files have been updated consistently. The new parameter defaults to True for backward compatibility, and workloads that need custom node handling properly opt out. No logical errors found.
  • No files require special attention.

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| src/cloudai/systems/slurm/slurm_command_gen_strategy.py | 5/5 | Core change: Added with_num_nodes parameter to gen_srun_prefix() to conditionally include -N{num_nodes} flag in srun commands. This ensures proper node allocation when using --ntasks-per-node. |
| src/cloudai/workloads/common/nixl.py | 5/5 | NIXL commands that explicitly set their own node counts now use with_num_nodes=False to avoid conflicts with manual -N1 specifications. |
| src/cloudai/workloads/nixl_perftest/slurm_command_gen_strategy.py | 5/5 | Matrix generation command now passes with_num_nodes=False since it explicitly sets -N1 for single-node execution. |
| src/cloudai/workloads/slurm_container/slurm_command_gen_strategy.py | 5/5 | Updated gen_srun_prefix() to properly forward both parameters to the parent class implementation. |
| src/cloudai/workloads/triton_inference/slurm_command_gen_strategy.py | 5/5 | Server and client srun builders now pass with_num_nodes=False since they set --nodes= explicitly for their respective node counts. |
| tests/test_single_sbatch_runner.py | 5/5 | Updated multiple test expectations to include -N{num_nodes} in srun commands for metadata, mapping, and test blocks. |

Sequence Diagram

sequenceDiagram
    participant User
    participant SlurmCommandGenStrategy
    participant gen_srun_prefix
    participant Slurm

    User->>SlurmCommandGenStrategy: gen_exec_command()
    SlurmCommandGenStrategy->>gen_srun_prefix: gen_srun_prefix(with_num_nodes=True)
    gen_srun_prefix->>gen_srun_prefix: get_cached_nodes_spec()
    gen_srun_prefix-->>SlurmCommandGenStrategy: ["srun", "--export=ALL", "--mpi=pmix", "-N{num_nodes}", ...]
    SlurmCommandGenStrategy->>Slurm: Submit sbatch with srun -N{num_nodes} --ntasks-per-node=X
    Slurm-->>User: Correct task distribution across nodes

    Note over User, Slurm: For workloads with custom node specs:
    User->>SlurmCommandGenStrategy: _build_server_srun() (TritonInference)
    SlurmCommandGenStrategy->>gen_srun_prefix: gen_srun_prefix(with_num_nodes=False)
    gen_srun_prefix-->>SlurmCommandGenStrategy: ["srun", "--export=ALL", "--mpi=pmix", ...]
    SlurmCommandGenStrategy->>SlurmCommandGenStrategy: Append --nodes={custom_count}
    SlurmCommandGenStrategy->>Slurm: Submit with explicit node count
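
The opt-out path in the second half of the diagram might look roughly like the sketch below. All names here (`base_prefix`, `build_role_srun`, `./server.sh`) and the node counts are hypothetical and only illustrate the idea; the actual builders in the Triton workload are not shown in this conversation.

```python
from __future__ import annotations


def base_prefix(num_nodes: int, with_num_nodes: bool = True) -> list[str]:
    """Hypothetical stand-in for gen_srun_prefix, reduced to the node-count logic."""
    prefix = ["srun", "--export=ALL", "--mpi=pmix"]
    if with_num_nodes:
        prefix.append(f"-N{num_nodes}")
    return prefix


def build_role_srun(total_nodes: int, role_nodes: int, command: list[str]) -> list[str]:
    """Build an srun line for one role that uses only part of the allocation."""
    # Opt out of the automatic -N so the role's own --nodes is the only node flag.
    prefix = base_prefix(total_nodes, with_num_nodes=False)
    return [*prefix, f"--nodes={role_nodes}", *command]


server = build_role_srun(total_nodes=3, role_nodes=1, command=["./server.sh"])
print(" ".join(server))
# prints: srun --export=ALL --mpi=pmix --nodes=1 ./server.sh
```

Without the opt-out, the command would carry both `-N3` and `--nodes=1`, which is the conflict the `with_num_nodes=False` call sites avoid.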

@greptile-apps greptile-apps bot left a comment

23 files reviewed, no comments

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a48d097 and ceca439.

📒 Files selected for processing (23)
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1 hunks)
  • tests/ref_data/ai-dynamo.sbatch (2 hunks)
  • tests/ref_data/ddlb.sbatch (1 hunks)
  • tests/ref_data/deepep-benchmark.sbatch (1 hunks)
  • tests/ref_data/gpt-no-hook.sbatch (1 hunks)
  • tests/ref_data/gpt-pre-test.sbatch (1 hunks)
  • tests/ref_data/grok-no-hook.sbatch (1 hunks)
  • tests/ref_data/grok-pre-test.sbatch (1 hunks)
  • tests/ref_data/megatron-run.sbatch (1 hunks)
  • tests/ref_data/nccl.sbatch (1 hunks)
  • tests/ref_data/nemo-run-no-hook.sbatch (1 hunks)
  • tests/ref_data/nemo-run-pre-test.sbatch (1 hunks)
  • tests/ref_data/nemo-run-vboost.sbatch (1 hunks)
  • tests/ref_data/nixl-kvbench.sbatch (1 hunks)
  • tests/ref_data/nixl-perftest.sbatch (1 hunks)
  • tests/ref_data/nixl_bench.sbatch (1 hunks)
  • tests/ref_data/sleep.sbatch (1 hunks)
  • tests/ref_data/slurm_container.sbatch (1 hunks)
  • tests/ref_data/triton-inference.sbatch (1 hunks)
  • tests/ref_data/ucc.sbatch (1 hunks)
  • tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py (1 hunks)
  • tests/slurm_command_gen_strategy/test_slurm_container_slurm_command_gen_strategy.py (3 hunks)
  • tests/test_single_sbatch_runner.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)
src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py (1)
  • get_cached_nodes_spec (182-232)
🔇 Additional comments (32)
tests/ref_data/nccl.sbatch (1)

13-17: -N1 additions match the new srun prefix behavior

All three sruns now explicitly specify -N1 after --mpi=pmix, which is consistent with the updated gen_srun_prefix and the goal of always setting a node count when we care about per-node task behavior. No issues here.

tests/ref_data/grok-pre-test.sbatch (1)

15-19: Pre-test sruns correctly updated to include -N1

The mapping, metadata, and nccl pre-test sruns now explicitly specify -N1 alongside --mpi=pmix, aligning with the new command-generation strategy. Ordering with the --output/--error options looks fine and there are no redundant node flags.

tests/ref_data/ddlb.sbatch (1)

13-17: Explicit -N1 on all ddlb sruns is consistent with the strategy

Each of the ddlb-related sruns now includes -N1 directly after --mpi=pmix, matching the updated srun prefix semantics and ensuring the node count is explicit. Looks good.

tests/ref_data/slurm_container.sbatch (1)

13-17: Container sbatch sruns correctly updated with -N1

All container-related sruns now carry an explicit -N1 after --mpi=pmix, in line with the new Slurm command-generation strategy and the referenced bugfix. No further issues.

tests/ref_data/ucc.sbatch (1)

13-17: UCC sbatch sruns now explicitly constrain to a single node

The added -N1 for mapping, metadata, and ucc_perftest sruns aligns with the new srun prefix and makes the single-node allocation explicit. This looks correct.

tests/ref_data/gpt-no-hook.sbatch (1)

15-17: GPT no-hook mapping/metadata sruns properly updated with -N1

Both mapping and metadata sruns now explicitly set -N1 alongside --mpi=pmix, which aligns with the new command-generation behavior and the goal of always specifying a node count for these steps. No concerns.

src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)

240-259: _enable_vboost_cmd is intentionally different; duplicate -N risk from extra_srun_args is valid but unlikely in practice

The concerns in the original review are partially accurate:

  1. _enable_vboost_cmd is NOT inconsistent: It intentionally uses --ntasks={num_nodes} (lines 311-321) rather than -N, which is semantically correct for vboost: it runs one privileged GPU operation per node as a single task per node, not as a distributed parallel job. This differs from the main test's -N{num_nodes}, which sets node distribution for MPI/parallel workloads.

  2. Duplicate -N from extra_srun_args is a valid but low-risk concern: The base pre_test_srun_extra_args() returns an empty list (line 155) and no overrides were found, so that path is safe. However, system.extra_srun_args (a free-form string appended at line 257) could theoretically contain -N or --nodes, creating a duplicate. Slurm would use the last value, but the resulting command is confusing. This is unlikely in typical configs but worth documenting.

No actionable changes needed—the implementation is sound. If you want to harden against misconfiguration, you could validate and strip explicit -N from extra_srun_args before appending, but that's optional refactoring.
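
The optional hardening mentioned above could be sketched as follows. This is an illustration under assumptions, not code from the PR: `strip_node_flags` is a hypothetical helper, and it assumes the extra arguments are a shell-style string in which a node count appears as `-N<value>`, `-N <value>`, `--nodes=<value>`, or `--nodes <value>`.

```python
from __future__ import annotations

import shlex


def strip_node_flags(extra_srun_args: str) -> list[str]:
    """Return the extra srun arguments with any explicit node-count flag removed."""
    kept: list[str] = []
    skip_next = False
    for tok in shlex.split(extra_srun_args):
        if skip_next:
            # This token is the value of a preceding bare -N / --nodes.
            skip_next = False
            continue
        if tok in ("-N", "--nodes"):
            skip_next = True
            continue
        if tok.startswith("--nodes=") or (tok.startswith("-N") and len(tok) > 2):
            # Value attached to the flag, e.g. -N2 or --nodes=2.
            continue
        kept.append(tok)
    return kept


print(strip_node_flags("--gpus-per-node=8 -N2 --exclusive"))
# prints: ['--gpus-per-node=8', '--exclusive']
```

Silently dropping a user-supplied flag may be surprising, so raising an error or logging a warning instead would be an equally reasonable design.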

tests/ref_data/megatron-run.sbatch (1)

13-17: LGTM: Consistent node count specification.

All three srun invocations correctly specify -N1, matching the sbatch allocation and ensuring that --ntasks-per-node is not ignored.

tests/ref_data/triton-inference.sbatch (1)

18-20: LGTM: Mapping and metadata commands correctly specify -N3.

These srun invocations correctly use -N3 to match the full allocation.

tests/ref_data/nemo-run-no-hook.sbatch (1)

14-18: LGTM: Consistent node count specification.

All three srun invocations correctly specify -N1, matching the sbatch allocation.

tests/ref_data/nemo-run-vboost.sbatch (1)

17-21: LGTM: Container commands correctly specify -N1.

The three container-based srun invocations correctly specify -N1.

tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py (1)

274-284: LGTM: Test expectation correctly updated.

The test now expects the -N flag to be included in the srun command prefix, aligning with the PR's objective.

tests/ref_data/ai-dynamo.sbatch (2)

13-15: LGTM: Mapping and metadata commands correctly specify -N2.

These srun invocations correctly use -N2 to match the allocation.


21-48: Approved: Main command correctly specifies -N2.

The srun command specifies both -N2 and --nodes=2. This is redundant but not incorrect: both flags carry the same value, so the node count stays explicit and no conflict arises.

tests/ref_data/deepep-benchmark.sbatch (1)

23-27: LGTM: Consistent node count specification.

All three srun invocations correctly specify -N2, matching the sbatch allocation.

tests/ref_data/sleep.sbatch (1)

13-17: LGTM: Consistent node count specification.

All three srun invocations correctly specify -N1, matching the sbatch allocation.

tests/ref_data/gpt-pre-test.sbatch (1)

15-19: LGTM!

The explicit -N1 flag is correctly added to all three auxiliary srun invocations (mapping, metadata, and nccl pre-test), consistent with the #SBATCH -N 1 allocation header on line 7.

tests/slurm_command_gen_strategy/test_slurm_container_slurm_command_gen_strategy.py (3)

46-53: LGTM!

The test expectation correctly includes -N{test_run.num_nodes} in the srun command, validating the new explicit node-count behavior.


64-71: LGTM!

Consistent with other test cases, the nsys test expectation now includes the explicit node count flag.


84-92: LGTM!

The extra_srun_args test correctly positions -N{test_run.num_nodes} before the extra arguments, maintaining proper argument ordering.

tests/ref_data/grok-no-hook.sbatch (1)

15-17: LGTM!

The explicit -N1 flag is correctly added to the mapping and metadata srun invocations, matching the #SBATCH -N 1 allocation.

tests/test_single_sbatch_runner.py (4)

241-255: LGTM!

The bare metal test correctly expects -N{sleep_tr.num_nodes} in both metadata and ranks mapping srun commands.


271-286: LGTM!

The container test correctly expects -N{nccl_tr.num_nodes} in the metadata and ranks mapping srun commands for containerized workloads.


321-327: LGTM!

The single test run block correctly includes -N{sleep_tr.num_nodes} in the expected srun command structure.


516-529: LGTM!

The pre-test command expectation correctly includes -N{sleep_tr.num_nodes} in the srun command.

tests/ref_data/nixl-kvbench.sbatch (1)

15-17: LGTM!

The explicit -N2 flag is correctly added to the mapping and metadata srun invocations, matching the #SBATCH -N 2 allocation.

tests/ref_data/nemo-run-pre-test.sbatch (4)

14-14: LGTM! Explicit node count added for mapping command.

The addition of -N1 ensures srun respects the node allocation for the mapping command, preventing potential issues where --ntasks-per-node might be ignored.


16-16: LGTM! Explicit node count added for metadata collection.

The addition of -N1 ensures the metadata collection runs on exactly one node, addressing the issue where --ntasks-per-node=1 could be ignored without explicit -N.


18-18: LGTM! Explicit node count added for NCCL pre-test.

The addition of -N1 ensures the NCCL pre-test runs on a single node as intended, preventing potential distribution issues.


22-22: LGTM! Explicit node count added for main execution.

The addition of -N1 ensures the main nemorun execution respects the single-node allocation, consistent with the trainer.num_nodes=1 parameter in the command.

tests/ref_data/nixl_bench.sbatch (2)

15-15: LGTM! Explicit node count added for mapping command.

The addition of -N2 correctly ensures the mapping command runs across both allocated nodes.


17-17: LGTM! Explicit node count added for metadata collection.

The addition of -N2 ensures metadata is collected from both nodes, preventing --ntasks-per-node=1 from being ignored.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_single_sbatch_runner.py (1)

241-255: Explicit -N in bare‑metal aux srun commands looks correct

Including -N{sleep_tr.num_nodes} in both metadata and ranks‑mapping srun commands matches the PR’s intent and should prevent --ntasks-per-node from being ignored on multi‑node allocations. Consider also asserting on -N in the test_max_nodes_used_for_metadata* tests so regressions in the max‑nodes path are caught, not just changes to --ntasks.
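
An assertion on `-N` along the lines suggested above could rely on a small parsing helper. The sketch below is hypothetical (`node_counts` is not part of the test suite) and assumes the generated command is available as a single string with numeric node counts.

```python
from __future__ import annotations

import re

# Matches -N2, -N 2, --nodes=2 and --nodes 2 as standalone options.
_NODE_FLAG = re.compile(r"(?:^|\s)(?:-N\s*|--nodes[=\s])(\d+)(?=\s|$)")


def node_counts(cmd: str) -> list[int]:
    """Return every explicit node count found in an srun command string."""
    return [int(value) for value in _NODE_FLAG.findall(cmd)]


cmd = "srun --export=ALL --mpi=pmix -N2 --ntasks=2 --ntasks-per-node=1 hostname"
# Exactly one node flag, carrying the expected value.
assert node_counts(cmd) == [2]
```

Comparing against a one-element list checks both the value and that the flag is not duplicated.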

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceca439 and 85402bb.

📒 Files selected for processing (1)
  • tests/test_single_sbatch_runner.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_single_sbatch_runner.py (1)
tests/report_generation_strategy/conftest.py (1)
  • nccl_tr (26-61)
🔇 Additional comments (2)
tests/test_single_sbatch_runner.py (2)

272-283: Container aux srun commands now correctly carry node count

The added -N{nccl_tr.num_nodes} in both container metadata and ranks‑mapping commands is consistent with the bare‑metal path and aligns with the Slurm bugfix around --ntasks-per-node handling.


516-522: Pre‑test srun now pins node count explicitly

Adding -N{sleep_tr.num_nodes} to the pre‑test command ensures pre‑tests are also run with an explicit node count, consistent with the main aux/test commands and the Slurm behavior you’re targeting.

@greptile-apps greptile-apps bot left a comment

23 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

29 files reviewed, no comments

@amaslenn amaslenn merged commit 84b7161 into main Nov 28, 2025
5 checks passed
@amaslenn amaslenn deleted the am/nnodes branch November 28, 2025 14:02