Prohibit limit_resources: false in Slurm mode and validate execution_config fields #228

Merged
daniel-thom merged 6 commits into main from fix/limit-resources-disabled on Mar 19, 2026
@daniel-thom (Collaborator) commented Mar 18, 2026

Summary

  • Prohibit limit_resources: false in Slurm mode: srun --exact requires resource args; omitting them defaults --cpus-per-task to 1. We now reject limit_resources: false with Slurm mode at workflow creation time.
  • Validate execution_config fields against mode: Direct-only fields error in Slurm mode; Slurm-only fields error in direct mode. Auto mode infers from slurm_schedulers presence.
  • Increase max record transfer size from 10k to 100k: Centralizes as MAX_RECORD_TRANSFER_COUNT.
  • Restructure ExecutionConfig docs: Groups fields into shared/direct/slurm sections.

When limit_resources=false, srun was still passed --exact, which causes
it to default --cpus-per-task to 1. This silently restricted
multi-threaded jobs to a single CPU core via cgroups, causing significant
slowdowns (45%+ observed) compared to direct execution mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@daniel-thom requested a review from Copilot March 18, 2026 20:09

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes Slurm srun invocation when limit_resources=false by omitting --exact, preventing multi-threaded steps from being silently constrained to 1 CPU core and restoring expected performance.

Changes:

  • Make build_srun_command() add --exact only when limit_resources is enabled.
  • Update srun argument tests to assert --exact is absent when limit_resources=false.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/client/async_cli_command.rs | Conditionally adds --exact based on limit_resources to avoid unintended CPU restriction. |
| tests/test_srun_args.rs | Adds/updates assertions ensuring --exact is not passed when limit_resources=false. |

daniel-thom and others added 4 commits March 19, 2026 07:48
Centralizes the limit as MAX_RECORD_TRANSFER_COUNT in lib.rs and updates
all server pagination, client batch creation, and API defaults to use it.
Also increases the default max request body size to 200 MiB to accommodate
larger payloads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
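
The centralization described in this commit can be pictured with a minimal sketch. Only the constant name and its value (100_000) come from the PR; `resolve_pagination` and `batches` are hypothetical helpers, and rejecting an over-limit request (rather than clamping it) is an assumption based on the review note about "max checks" and updated error text.

```rust
/// Shared cap on records per request. The PR defines this in lib.rs as 100_000.
pub const MAX_RECORD_TRANSFER_COUNT: usize = 100_000;

/// Hypothetical helper: resolve optional offset/limit query parameters.
/// Offset defaults to 0; limit defaults to the shared maximum.
/// Assumption: a limit above the maximum is rejected instead of clamped.
pub fn resolve_pagination(
    offset: Option<usize>,
    limit: Option<usize>,
) -> Result<(usize, usize), String> {
    let offset = offset.unwrap_or(0);
    let limit = limit.unwrap_or(MAX_RECORD_TRANSFER_COUNT);
    if limit > MAX_RECORD_TRANSFER_COUNT {
        return Err(format!(
            "limit {} exceeds maximum of {}",
            limit, MAX_RECORD_TRANSFER_COUNT
        ));
    }
    Ok((offset, limit))
}

/// Hypothetical helper: split records into client-side batches that never
/// exceed the shared cap, so server and client agree on one number.
pub fn batches<T>(records: &[T]) -> std::slice::Chunks<'_, T> {
    records.chunks(MAX_RECORD_TRANSFER_COUNT)
}
```

Because both helpers read the same constant, raising the cap again would be a one-line change rather than an edit to every pagination and batching call site.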
Slurm mode always requires resource limits for correct srun behavior.
Setting limit_resources: false with mode: slurm now returns a validation
error directing users to use direct mode instead. Removes the
limit_resources=false code paths from srun argument building and updates
documentation and tests accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
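
A rough sketch of the resulting srun argument construction, under stated assumptions: `ResourceRequirements`, its field names, and `build_srun_args` are hypothetical stand-ins rather than the project's actual types, and the megabyte unit for `--mem` is chosen only for illustration. The skip of the "default" resource-requirements placeholder follows the later review-fix commit.

```rust
/// Hypothetical resource requirements for one job (not the project's real struct).
pub struct ResourceRequirements {
    pub name: String,
    pub num_cpus: u32,
    pub memory_mb: u64,
}

/// Build an srun argument list that always passes --exact, plus explicit
/// CPU and memory flags so that --exact does not fall back to one CPU.
pub fn build_srun_args(rr: &ResourceRequirements, command: &[&str]) -> Vec<String> {
    let mut args = vec!["srun".to_string(), "--exact".to_string()];
    // Assumption: the "default" placeholder carries no real limits, so no
    // resource flags are emitted for it.
    if rr.name != "default" {
        args.push(format!("--cpus-per-task={}", rr.num_cpus));
        args.push(format!("--mem={}M", rr.memory_mb));
    }
    // The user command follows the srun options unchanged.
    args.extend(command.iter().map(|s| s.to_string()));
    args
}
```

There is no `limit_resources` parameter in this sketch, which mirrors the commit's removal of that code path from Slurm mode.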
Reject mode-incompatible fields at workflow creation time: direct-only
fields (termination_signal, sigterm_lead_seconds, oom_exit_code) error
in slurm mode, and slurm-only fields (srun_termination_signal,
enable_cpu_bind) error in direct mode. Auto mode infers from
slurm_schedulers presence. Restructures docs and struct comments to
group fields by shared/direct/slurm.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
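
The validation rules from the last two commits can be sketched as follows. The field names are the ones listed in the commit messages; the struct layout, the field types, the `validate` and `effective_mode` functions, and the error wording are assumptions for illustration, not the code in src/client/workflow_spec.rs. Treating auto mode without Slurm schedulers as direct mode is also an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub enum Mode {
    #[default]
    Auto,
    Direct,
    Slurm,
}

/// Hypothetical mirror of execution_config; field types are guesses.
#[derive(Debug, Default)]
pub struct ExecutionConfig {
    pub mode: Mode,
    pub limit_resources: Option<bool>,
    // Direct-only fields.
    pub termination_signal: Option<String>,
    pub sigterm_lead_seconds: Option<u64>,
    pub oom_exit_code: Option<i32>,
    // Slurm-only fields.
    pub srun_termination_signal: Option<String>,
    pub enable_cpu_bind: Option<bool>,
}

/// Auto mode resolves to Slurm when the workflow defines slurm_schedulers.
pub fn effective_mode(cfg: &ExecutionConfig, has_slurm_schedulers: bool) -> Mode {
    match cfg.mode {
        Mode::Auto if has_slurm_schedulers => Mode::Slurm,
        Mode::Auto => Mode::Direct,
        other => other,
    }
}

/// Return an error naming every field that was set but is invalid for `mode`.
fn reject(fields: &[(bool, &str)], mode: &str) -> Result<(), String> {
    let bad: Vec<&str> = fields.iter().filter(|(set, _)| *set).map(|(_, n)| *n).collect();
    if bad.is_empty() {
        Ok(())
    } else {
        Err(format!("fields not valid in {} mode: {}", mode, bad.join(", ")))
    }
}

pub fn validate(cfg: &ExecutionConfig, has_slurm_schedulers: bool) -> Result<(), String> {
    match effective_mode(cfg, has_slurm_schedulers) {
        Mode::Slurm => {
            if cfg.limit_resources == Some(false) {
                return Err("limit_resources: false is not allowed in Slurm mode; \
                            use mode: direct instead".to_string());
            }
            reject(
                &[
                    (cfg.termination_signal.is_some(), "termination_signal"),
                    (cfg.sigterm_lead_seconds.is_some(), "sigterm_lead_seconds"),
                    (cfg.oom_exit_code.is_some(), "oom_exit_code"),
                ],
                "slurm",
            )
        }
        // effective_mode never returns Auto, so this arm covers direct mode.
        Mode::Direct | Mode::Auto => reject(
            &[
                (cfg.srun_termination_signal.is_some(), "srun_termination_signal"),
                (cfg.enable_cpu_bind.is_some(), "enable_cpu_bind"),
            ],
            "direct",
        ),
    }
}
```

Running the check at workflow creation time, as the PR does, surfaces a misconfiguration before any job is submitted instead of as a silent slowdown at run time.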
@daniel-thom force-pushed the fix/limit-resources-disabled branch from ad1a863 to 05f5c2e March 19, 2026 14:26
@daniel-thom requested a review from Copilot March 19, 2026 15:03

Copilot AI (Contributor) left a comment

Pull request overview

This PR adjusts Slurm execution configuration to prevent unintended CPU throttling when using srun --exact, and aligns server/client limits and documentation around larger batch/pagination sizes.

Changes:

  • Enforce that execution_config.limit_resources: false is only valid in direct mode, rejecting it for Slurm (and auto+Slurm-scheduler) workflows at spec-creation time.
  • Update srun argument construction to always include CPU/memory resource flags for non-default resource requirements (and remove Slurm-mode dependence on limit_resources).
  • Centralize and raise transfer/pagination limits via MAX_RECORD_TRANSFER_COUNT (10,000 → 100,000), updating OpenAPI, docs, and clients; also raise default request-body limit (20 MiB → 200 MiB).

Reviewed changes

Copilot reviewed 27 out of 29 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_srun_args.rs | Removes tests that exercised limit_resources=false with srun (now invalid for Slurm). |
| tests/test_execution_config.rs | Adds tests to ensure invalid mode/field combinations are rejected during spec creation. |
| src/server/routing.rs | Increases default max request body size and uses shared MAX_RECORD_TRANSFER_COUNT for default limit. |
| src/server/http_server.rs | Switches pagination defaults/max checks to MAX_RECORD_TRANSFER_COUNT and updates error text. |
| src/server/api_types.rs | Updates bulk-create docs to remove hardcoded “10,000” wording. |
| src/server/api/workflows.rs | Uses MAX_RECORD_TRANSFER_COUNT for pagination defaults/max reporting in relationship list endpoints. |
| src/server/api/jobs.rs | Enforces bulk job creation maximum using MAX_RECORD_TRANSFER_COUNT. |
| src/server/api.rs | Re-exports the crate-level MAX_RECORD_TRANSFER_COUNT for server API modules. |
| src/lib.rs | Introduces crate-level MAX_RECORD_TRANSFER_COUNT = 100_000. |
| src/client/workflow_spec.rs | Adds validation rejecting incompatible execution_config fields based on effective mode (direct vs Slurm). |
| src/client/async_cli_command.rs | Removes limit_resources from srun params and always applies CPU/mem flags for non-default RR when using srun. |
| src/client/commands/slurm.rs | Updates client-side pagination limit to use MAX_RECORD_TRANSFER_COUNT. |
| src/client/commands/pagination/base.rs | Uses MAX_RECORD_TRANSFER_COUNT as the default pagination page size. |
| src/client/commands/jobs.rs | Uses MAX_RECORD_TRANSFER_COUNT for bulk create batch sizing and list paging. |
| src/client/apis/default_api.rs | Updates generated docs for bulk-create endpoint text. |
| src/bin/torc-server.rs | Adds SQLite pragmas (synchronous=NORMAL, larger cache) to reduce latency/lock contention. |
| src/bin/torc-dash.rs | Uses MAX_RECORD_TRANSFER_COUNT when querying job lists for dashboard stats. |
| slurm-tests/workflows/no_srun_basic.yaml | Updates Slurm test workflow to use execution_config.mode: direct instead of use_srun: false. |
| slurm-tests/tests/test_no_srun_basic.sh | Updates comments to reflect direct mode terminology. |
| python_client/src/torc/api.py | Raises default client batch size to 100,000. |
| julia_client/julia_client/docs/DefaultApi.md | Updates documented default limit to 100,000. |
| julia_client/Torc/src/Torc.jl | Raises default Julia client batch size to 100,000. |
| docs/src/specialized/hpc/slurm.md | Documents that Slurm mode always enforces CPU/mem via srun and that limit_resources=false requires direct mode. |
| docs/src/specialized/design/srun-monitoring.md | Updates design doc to reflect resource flags always applied in Slurm mode and re-scopes limit_resources to direct mode. |
| docs/src/core/workflows/workflow-formats.md | Updates workflow format guidance/migration mapping for direct vs Slurm modes and limit_resources=false. |
| docs/src/core/reference/workflow-spec.md | Reorganizes execution_config reference by mode and documents validation behavior for incompatible fields. |
| docs/src/core/reference/cli.md | Updates CLI docs for new default limit (100,000). |
| docs/src/core/concepts/execution-modes.md | Updates execution-mode docs to reflect always-enforced Slurm resource flags and direct-only limit_resources=false. |
| api/openapi.yaml | Updates OpenAPI default limit values to 100,000. |

- Move MAX_RECORD_TRANSFER_COUNT import to top-of-file imports
- Fix process_pagination_params doc: offset defaults to 0, not max
- Clarify srun resource comment re: "default" RR placeholder skip

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@daniel-thom changed the title from "Omit --exact from srun when limit_resources is false" to "Prohibit limit_resources: false in Slurm mode and validate execution_config fields" Mar 19, 2026
@daniel-thom merged commit 44bf397 into main Mar 19, 2026
9 checks passed
@daniel-thom deleted the fix/limit-resources-disabled branch March 19, 2026 16:32
2 participants