Prohibit limit_resources: false in Slurm mode and validate execution_config fields#228
Conversation
When limit_resources=false, srun was still passed --exact, which causes it to default --cpus-per-task to 1. This silently restricted multi-threaded jobs to a single CPU core via cgroups, causing significant slowdowns (45%+ observed) compared to direct execution mode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes Slurm srun invocation when limit_resources=false by omitting --exact, preventing multi-threaded steps from being silently constrained to 1 CPU core and restoring expected performance.
Changes:
- Make
build_srun_command()add--exactonly whenlimit_resourcesis enabled. - Update srun argument tests to assert
--exactis absent whenlimit_resources=false.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/client/async_cli_command.rs |
Conditionally adds --exact based on limit_resources to avoid unintended CPU restriction. |
tests/test_srun_args.rs |
Adds/updates assertions ensuring --exact is not passed when limit_resources=false. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Centralizes the limit as MAX_RECORD_TRANSFER_COUNT in lib.rs and updates all server pagination, client batch creation, and API defaults to use it. Also increases the default max request body size to 200 MiB to accommodate larger payloads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Slurm mode always requires resource limits for correct srun behavior. Setting limit_resources: false with mode: slurm now returns a validation error directing users to use direct mode instead. Removes the limit_resources=false code paths from srun argument building and updates documentation and tests accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reject mode-incompatible fields at workflow creation time: direct-only fields (termination_signal, sigterm_lead_seconds, oom_exit_code) error in slurm mode, and slurm-only fields (srun_termination_signal, enable_cpu_bind) error in direct mode. Auto mode infers from slurm_schedulers presence. Restructures docs and struct comments to group fields by shared/direct/slurm. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ad1a863 to
05f5c2e
Compare
There was a problem hiding this comment.
Pull request overview
This PR adjusts Slurm execution configuration to prevent unintended CPU throttling when using srun --exact, and aligns server/client limits and documentation around larger batch/pagination sizes.
Changes:
- Enforce that
execution_config.limit_resources: falseis only valid in direct mode, rejecting it for Slurm (and auto+Slurm-scheduler) workflows at spec-creation time. - Update
srunargument construction to always include CPU/memory resource flags for non-defaultresource requirements (and remove Slurm-mode dependence onlimit_resources). - Centralize and raise transfer/pagination limits via
MAX_RECORD_TRANSFER_COUNT(10,000 → 100,000), updating OpenAPI, docs, and clients; also raise default request-body limit (20 MiB → 200 MiB).
Reviewed changes
Copilot reviewed 27 out of 29 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_srun_args.rs | Removes tests that exercised limit_resources=false with srun (now invalid for Slurm). |
| tests/test_execution_config.rs | Adds tests to ensure invalid mode/field combinations are rejected during spec creation. |
| src/server/routing.rs | Increases default max request body size and uses shared MAX_RECORD_TRANSFER_COUNT for default limit. |
| src/server/http_server.rs | Switches pagination defaults/max checks to MAX_RECORD_TRANSFER_COUNT and updates error text. |
| src/server/api_types.rs | Updates bulk-create docs to remove hardcoded “10,000” wording. |
| src/server/api/workflows.rs | Uses MAX_RECORD_TRANSFER_COUNT for pagination defaults/max reporting in relationship list endpoints. |
| src/server/api/jobs.rs | Enforces bulk job creation maximum using MAX_RECORD_TRANSFER_COUNT. |
| src/server/api.rs | Re-exports the crate-level MAX_RECORD_TRANSFER_COUNT for server API modules. |
| src/lib.rs | Introduces crate-level MAX_RECORD_TRANSFER_COUNT = 100_000. |
| src/client/workflow_spec.rs | Adds validation rejecting incompatible execution_config fields based on effective mode (direct vs Slurm). |
| src/client/async_cli_command.rs | Removes limit_resources from srun params and always applies CPU/mem flags for non-default RR when using srun. |
| src/client/commands/slurm.rs | Updates client-side pagination limit to use MAX_RECORD_TRANSFER_COUNT. |
| src/client/commands/pagination/base.rs | Uses MAX_RECORD_TRANSFER_COUNT as the default pagination page size. |
| src/client/commands/jobs.rs | Uses MAX_RECORD_TRANSFER_COUNT for bulk create batch sizing and list paging. |
| src/client/apis/default_api.rs | Updates generated docs for bulk-create endpoint text. |
| src/bin/torc-server.rs | Adds SQLite pragmas (synchronous=NORMAL, larger cache) to reduce latency/lock contention. |
| src/bin/torc-dash.rs | Uses MAX_RECORD_TRANSFER_COUNT when querying job lists for dashboard stats. |
| slurm-tests/workflows/no_srun_basic.yaml | Updates Slurm test workflow to use execution_config.mode: direct instead of use_srun: false. |
| slurm-tests/tests/test_no_srun_basic.sh | Updates comments to reflect direct mode terminology. |
| python_client/src/torc/api.py | Raises default client batch size to 100,000. |
| julia_client/julia_client/docs/DefaultApi.md | Updates documented default limit to 100,000. |
| julia_client/Torc/src/Torc.jl | Raises default Julia client batch size to 100,000. |
| docs/src/specialized/hpc/slurm.md | Documents that Slurm mode always enforces CPU/mem via srun and that limit_resources=false requires direct mode. |
| docs/src/specialized/design/srun-monitoring.md | Updates design doc to reflect resource flags always applied in Slurm mode and re-scopes limit_resources to direct mode. |
| docs/src/core/workflows/workflow-formats.md | Updates workflow format guidance/migration mapping for direct vs Slurm modes and limit_resources=false. |
| docs/src/core/reference/workflow-spec.md | Reorganizes execution_config reference by mode and documents validation behavior for incompatible fields. |
| docs/src/core/reference/cli.md | Updates CLI docs for new default limit (100,000). |
| docs/src/core/concepts/execution-modes.md | Updates execution-mode docs to reflect always-enforced Slurm resource flags and direct-only limit_resources=false. |
| api/openapi.yaml | Updates OpenAPI default limit values to 100,000. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Move MAX_RECORD_TRANSFER_COUNT import to top-of-file imports - Fix process_pagination_params doc: offset defaults to 0, not max - Clarify srun resource comment re: "default" RR placeholder skip Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary