Skip to content

Add srun mpi mode option#268

Merged
nkeilbart merged 4 commits intomainfrom
fix/slurm-mpi-mode
Apr 17, 2026
Merged

Add srun mpi mode option#268
nkeilbart merged 4 commits intomainfrom
fix/slurm-mpi-mode

Conversation

@nkeilbart
Copy link
Copy Markdown
Collaborator

Summary

  • add optional execution_config.srun_mpi for Slurm job runners
  • pass the configured value through as srun --mpi= and omit the flag when unset
  • add validation and test coverage for parsing, direct-mode rejection, and srun argument construction

Testing

  • cargo test --test test_execution_config test_srun_mpi -- --nocapture
  • cargo test --test test_srun_args test_srun_mpi_mode -- --nocapture
  • cargo test --test test_async_cli_command -- --nocapture
  • cargo fmt -- --check
  • cargo clippy --all --all-targets --all-features -- -D warnings
  • dprint check
  • pre-commit run --files src/client/async_cli_command.rs src/client/job_runner.rs src/client/workflow_spec.rs tests/test_async_cli_command.rs tests/test_execution_config.rs tests/test_srun_args.rs

Allow Slurm execution configs to set an optional srun MPI mode.
When configured, job runners now pass the value through as
srun --mpi=<value>; when omitted, no MPI flag is added.

Add validation and serialization coverage for the new field and
extend tests for unset and configured srun argument handling.
Apply srun_mpi only to the outer srun that launches one
job runner per allocated node in start_one_worker_per_node
mode.

Remove the inner per-job srun wiring, update validation to
allow the option for direct-mode worker-per-node workflows,
and add coverage for the submission script plus the adjusted
execution-config behavior.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an optional execution_config.srun_mpi setting to control the MPI launcher mode used by the outer srun when launching one job-runner per allocated node, and threads that value through spec parsing/validation into Slurm submission script generation.

Changes:

  • Extend ExecutionConfig with srun_mpi and include it in YAML/JSON/KDL parsing + rendering.
  • Pass srun_mpi into Slurm submission script generation and emit srun --mpi=<value> only when configured.
  • Add/adjust integration tests for config parsing/validation and script content.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/client/workflow_spec.rs Adds ExecutionConfig.srun_mpi, wires serialization, and introduces validation rules for when it’s allowed.
src/client/commands/slurm.rs Passes execution_config.srun_mpi through to Slurm script construction.
src/client/hpc/hpc_interface.rs Extends the HpcInterface::create_submission_script API with an optional srun_mpi parameter.
src/client/hpc/slurm_interface.rs Emits --mpi=<mode> on the outer srun when launching one worker per node.
src/client/hpc/hpc_manager.rs Updates call site for the new create_submission_script signature (currently passes None).
tests/test_execution_config.rs Adds parsing/roundtrip assertions and spec validation tests for srun_mpi.
tests/test_slurm_commands.rs Updates existing tests for the new signature and adds a script-content test covering --mpi.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_execution_config.rs
Comment thread src/client/workflow_spec.rs Outdated
Comment thread src/client/workflow_spec.rs Outdated
Comment thread src/client/hpc/slurm_interface.rs
@daniel-thom
Copy link
Copy Markdown
Collaborator

Can you add something in the docs about this?

Document the execution_config.srun_mpi field in the
workflow-spec reference and add a worker-per-node direct
mode example showing how it affects the outer srun job
runner launch.
@nkeilbart
Copy link
Copy Markdown
Collaborator Author

Added docs for this in two places:

  • docs/src/core/reference/workflow-spec.md documents the srun_mpi field and notes that it applies to the outer srun used with start_one_worker_per_node: true.
  • docs/src/specialized/hpc/multi-node-jobs.md now shows the direct-mode worker-per-node case with srun_mpi: "none" and the resulting outer job-runner launch.

That should cover both the field reference and the practical usage path for this PR.

Validate srun_mpi as a single safe token whenever it is
provided, reject it unless a worker-per-node schedule_nodes
action is present, and enforce the same check when writing
Slurm submission scripts.

Add regression tests for direct-mode whitespace rejection,
slurm-mode no-op rejection, and invalid script-generation
input.
@nkeilbart
Copy link
Copy Markdown
Collaborator Author

Addressed the remaining inline review comments in 0b6ac71:

  • srun_mpi is now validated whenever it is provided, not only in slurm-mode validation.
  • The value must be a single safe token matching [A-Za-z0-9+_.-]+, so whitespace-padded or shell-special values are rejected.
  • srun_mpi is now rejected unless the workflow actually has a schedule_nodes action with start_one_worker_per_node: true, regardless of mode, so it cannot become a silent no-op.
  • The same validation is enforced again when generating the Slurm submission script.
  • Added regression tests for direct-mode whitespace rejection, slurm-mode no-op rejection, and invalid script-generation input.

Targeted checks run for this follow-up:

  • cargo test --test test_execution_config test_srun_mpi -- --nocapture
  • cargo test --test test_slurm_commands test_create_submission_script_with_srun_mpi -- --nocapture
  • cargo test --test test_slurm_commands test_create_submission_script_with_invalid_srun_mpi_rejected -- --nocapture
  • cargo clippy --all --all-targets --all-features -- -D warnings
  • pre-commit run --files src/client/workflow_spec.rs src/client/hpc/slurm_interface.rs tests/test_execution_config.rs tests/test_slurm_commands.rs

@nkeilbart
Copy link
Copy Markdown
Collaborator Author

Follow-up on the four Copilot inline comments:

  1. Direct-mode whitespace coverage: added. tests/test_execution_config.rs now rejects a worker-per-node direct-mode value like " pmix ".
  2. Validation independent of mode: fixed. srun_mpi now goes through a shared validator whenever it is provided, not only in slurm-mode validation.
  3. No-op applicability: fixed. srun_mpi is now rejected unless the workflow actually has schedule_nodes with start_one_worker_per_node: true, regardless of mode. I also added a slurm-mode rejection test for that no-op case.
  4. Script injection / unsafe token handling: fixed. src/client/hpc/slurm_interface.rs now reuses the same validator before writing --mpi=..., and there is a regression test rejecting an unsafe value like pmix;rm -rf /.

These changes are in commit 0b6ac71.

@nkeilbart nkeilbart merged commit 639f834 into main Apr 17, 2026
9 checks passed
@nkeilbart nkeilbart deleted the fix/slurm-mpi-mode branch April 17, 2026 23:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants