Improve short-job throughput for runner/server hot path #286
daniel-thom merged 9 commits into NatLabRockies:main
Three small, independent changes targeting throughput when many runners churn through short jobs:

1. Skip the post-iteration sleep in the runner main loop when an iteration reported one or more completions. Completions free capacity and the deferred unblock task may have made more jobs ready, so reacting immediately closes the idle gap rather than waiting a full poll interval. Started jobs do not trigger skip.
2. Lock-free pre-check in transport_prepare_ready_jobs: a non-locking SELECT 1 against the same WHERE clause runs before BEGIN IMMEDIATE so empty claims do not contend for the SQLite write lock. The workflow existence check is also moved out of the transaction since it is a pure read.
3. New batch_complete_jobs endpoint that accepts a list of completions in one request and returns per-completion success/error vectors. The runner's handle_job_completion is split into prepare_job_completion (local recovery + status decision) and report_completions_batch (one batched send + finalization), with all four call sites routed through a single handle_completions_batch helper. Halves request count under load.

API version bumped to 0.14.0.

Also adds a 3,001-job GPU pipeline workflow at tests/workflows/gpu_pipeline_perf_test/ for manual A/B testing against main.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
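The sleep-skip rule from the first change can be sketched as a small pure function. This is a minimal illustration, not the runner's actual code: `IterationReport` and `post_iteration_sleep` are hypothetical names, and the real loop presumably tracks more state than two counters.

```rust
use std::time::Duration;

/// Hypothetical summary of one runner-loop iteration (illustrative names only).
#[derive(Debug, Default, Clone, Copy)]
struct IterationReport {
    started: usize,
    completed: usize,
}

/// Decide how long to sleep after an iteration.
/// Any completion means capacity was freed, so the loop should run again
/// immediately; started jobs alone do not skip the sleep.
fn post_iteration_sleep(report: IterationReport, poll_interval: Duration) -> Duration {
    if report.completed > 0 {
        Duration::ZERO
    } else {
        poll_interval
    }
}

fn main() {
    let poll = Duration::from_secs(5);
    let reports = [
        IterationReport::default(),
        IterationReport { started: 3, completed: 0 },
        IterationReport { started: 0, completed: 1 },
    ];
    for r in reports {
        let sleep = post_iteration_sleep(r, poll);
        println!("started={} completed={} -> sleep {:?}", r.started, r.completed, sleep);
    }
}
```

Keeping the decision in a function separate from the loop makes the "completions skip, starts do not" rule easy to unit test.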
Pull request overview
This PR targets higher short-job throughput by reducing runner idle gaps and cutting completion-reporting round trips via a new server-side batched completion API, while also adding a CPU-only performance workflow to exercise the runner/server hot path.
Changes:
- Add `batch_complete_jobs` workflow endpoint with new request/response models and propagate it through OpenAPI + generated Rust/Python/Julia clients.
- Add a condvar-based wakeup primitive and SIGCHLD-driven wakeups so runners react immediately to local subprocess exits instead of waiting for the poll interval.
- Add a 3-stage, CPU-only pipeline perf workflow (3,001 jobs) for stress/perf validation.
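A condvar-based wakeup of the kind described in the second bullet might look roughly like the sketch below. The `Wakeup` type and its methods are hypothetical and are not taken from `src/client/job_runner.rs`. A plain thread stands in for the SIGCHLD source here; locking a mutex directly inside a signal handler is not async-signal-safe, so a real hookup would presumably route the signal through a dedicated thread or similar mechanism.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical wakeup primitive: a pending flag guarded by a mutex, plus a condvar.
#[derive(Default)]
struct Wakeup {
    pending: Mutex<bool>,
    cv: Condvar,
}

impl Wakeup {
    /// Request a wakeup. The flag is sticky, so a notify that arrives
    /// before the waiter starts waiting is not lost.
    fn notify(&self) {
        *self.pending.lock().unwrap() = true;
        self.cv.notify_all();
    }

    /// Wait up to `timeout`. Returns true if a wakeup was requested,
    /// false if the timeout elapsed. Clears the pending flag either way.
    fn wait(&self, timeout: Duration) -> bool {
        let guard = self.pending.lock().unwrap();
        let (mut guard, _timeout_result) = self
            .cv
            .wait_timeout_while(guard, timeout, |pending| !*pending)
            .unwrap();
        let woken = *guard;
        *guard = false;
        woken
    }
}

fn main() {
    let wakeup = Arc::new(Wakeup::default());
    let notifier = Arc::clone(&wakeup);

    // Stand-in for a subprocess exit being observed elsewhere.
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        notifier.notify();
    });

    let start = Instant::now();
    let woken = wakeup.wait(Duration::from_secs(5));
    println!("woken={woken} after {:?}", start.elapsed());
    handle.join().unwrap();
}
```

The runner's idle wait would then call `wait(poll_interval)` instead of sleeping, so it still polls on schedule when nothing wakes it.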
Reviewed changes
Copilot reviewed 38 out of 38 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/workflows/pipeline_perf_test/workflow.yaml | Adds a CPU-only, 3-stage pipeline workload to stress short-job runner/server behavior. |
| src/server/response_types.rs | Re-exports new BatchCompleteJobsResponse via response facade. |
| src/server/live_state.rs | Minor refactor: uses HashSet import for failure-tracking state. |
| src/server/live_router.rs | Adds HTTP route + utoipa path for POST /workflows/{id}/batch_complete_jobs. |
| src/server/http_transport/response_mapping.rs | Adds HTTP response mapping function for BatchCompleteJobsResponse. |
| src/server/http_server/workflows_transport.rs | Adjusts workflow transport (notably claim path) behavior. |
| src/server/http_server/jobs_transport.rs | Implements server-side batched completion handling + claim pre-check optimization. |
| src/server/http_server.rs | Exposes batch_complete_jobs on the transport trait implementation. |
| src/server/api_responses.rs | Introduces BatchCompleteJobsResponse response enum. |
| src/server/api_contract.rs | Adds batch_complete_jobs to the transport API contract. |
| src/run_jobs_cmd.rs | Registers SIGCHLD wakeup for local runner; adjusts cookie-header handling. |
| src/openapi_spec.rs | Adds schemas + path wiring for batch completion in generated OpenAPI doc. |
| src/models.rs | Adds BatchCompleteJobsRequest/Response + completion entry/error models. |
| src/client/job_runner.rs | Uses batched completion reporting; adds wakeup primitive + idle-wait logic; adds wakeup unit tests. |
| src/client/apis/workflows_api.rs | Generated Rust client: adds batch_complete_jobs API call + typed error enum. |
| src/bin/torc-slurm-job-runner.rs | Extends signal handling to SIGCHLD and wakes runner on SIGTERM/SIGCHLD. |
| src/api_version.rs | Bumps HTTP API contract version to 0.14.0. |
| python_client/src/torc/openapi_client/models/job_completion_error.py | Generated Python model for per-job completion error entries. |
| python_client/src/torc/openapi_client/models/job_completion_entry.py | Generated Python model for a single completion entry. |
| python_client/src/torc/openapi_client/models/batch_complete_jobs_request.py | Generated Python request model for batched completion reporting. |
| python_client/src/torc/openapi_client/models/batch_complete_jobs_response.py | Generated Python response model for batched completion outcomes. |
| python_client/src/torc/openapi_client/models/__init__.py | Exports the newly generated batch-completion models. |
| python_client/src/torc/openapi_client/api/workflows_api.py | Generated Python API: adds batch_complete_jobs endpoint wrapper. |
| python_client/src/torc/openapi_client/__init__.py | Updates Python package exports for new models. |
| julia_client/julia_client/docs/WorkflowsApi.md | Documents new batch_complete_jobs workflow endpoint. |
| julia_client/julia_client/docs/JobCompletionError.md | Documents new completion error model. |
| julia_client/julia_client/docs/JobCompletionEntry.md | Documents new completion entry model. |
| julia_client/julia_client/docs/BatchCompleteJobsResponse.md | Documents new batch completion response model. |
| julia_client/julia_client/docs/BatchCompleteJobsRequest.md | Documents new batch completion request model. |
| julia_client/julia_client/README.md | Adds new endpoint + model docs to the Julia client README. |
| julia_client/Torc/src/api/models/model_JobCompletionError.jl | Generated Julia model for completion errors. |
| julia_client/Torc/src/api/models/model_JobCompletionEntry.jl | Generated Julia model for completion entries. |
| julia_client/Torc/src/api/models/model_BatchCompleteJobsResponse.jl | Generated Julia model for batched completion response. |
| julia_client/Torc/src/api/models/model_BatchCompleteJobsRequest.jl | Generated Julia model for batched completion request. |
| julia_client/Torc/src/api/modelincludes.jl | Includes newly generated Julia model files. |
| julia_client/Torc/src/api/apis/api_WorkflowsApi.jl | Generated Julia API: adds batch_complete_jobs. |
| api/openapi.yaml | Adds OpenAPI path + schema definitions for batch_complete_jobs. |
| api/openapi.codegen.yaml | Mirrors OpenAPI additions for codegen input. |
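To make the batched-completion models in the table concrete, here is a rough sketch of how per-completion success/error vectors could be produced. The struct names follow the table, but every field name (`job_id`, `return_code`, `completed`, `errors`, `message`) and the `batch_complete` helper are assumptions for illustration; the real definitions in `src/models.rs` may differ.

```rust
/// Hypothetical shape of one completion entry (field names are assumptions).
#[derive(Debug, Clone)]
struct JobCompletionEntry {
    job_id: i64,
    return_code: i32,
}

/// Hypothetical per-job error entry.
#[derive(Debug, Clone, PartialEq)]
struct JobCompletionError {
    job_id: i64,
    message: String,
}

/// Hypothetical response: one vector of successes, one of failures.
#[derive(Debug, Default, PartialEq)]
struct BatchCompleteJobsResponse {
    completed: Vec<i64>,
    errors: Vec<JobCompletionError>,
}

/// Apply each completion independently so that one bad entry
/// does not fail the whole batch.
fn batch_complete<F>(entries: &[JobCompletionEntry], mut apply: F) -> BatchCompleteJobsResponse
where
    F: FnMut(&JobCompletionEntry) -> Result<(), String>,
{
    let mut response = BatchCompleteJobsResponse::default();
    for entry in entries {
        match apply(entry) {
            Ok(()) => response.completed.push(entry.job_id),
            Err(message) => response.errors.push(JobCompletionError {
                job_id: entry.job_id,
                message,
            }),
        }
    }
    response
}

fn main() {
    let entries = vec![
        JobCompletionEntry { job_id: 1, return_code: 0 },
        JobCompletionEntry { job_id: 2, return_code: 1 },
        JobCompletionEntry { job_id: 99, return_code: 0 },
    ];
    for e in &entries {
        println!("reporting job {} with return code {}", e.job_id, e.return_code);
    }
    // Stand-in for the server-side state update: job 99 is treated as unknown.
    let response = batch_complete(&entries, |e| {
        if e.job_id == 99 {
            Err(format!("job {} not found", e.job_id))
        } else {
            Ok(())
        }
    });
    println!("{response:?}");
}
```

Returning both vectors lets the runner finalize the jobs that succeeded and retry or log only the ones that failed.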
Pull request overview
Copilot reviewed 149 out of 149 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 152 out of 152 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 152 out of 152 changed files in this pull request and generated 2 comments.
Pull request overview
Copilot reviewed 152 out of 152 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/server/http_server/jobs_transport.rs:1095
`apply_job_completion_state` (used by the single-job `complete_job` endpoint) no longer validates `run_id` against the workflow’s current `workflow_status.run_id`. The TX-backed path does this (`apply_job_completion_state_tx`), and other endpoints like `start_job`/`manage_status_change` also call `validate_run_id`, so this looks like a regression that could let stale runners complete jobs for an old run. Suggest adding the same `validate_run_id(job.workflow_id, run_id)` check here and returning an Unprocessable response on mismatch (consistent with the TX path).
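The guard this comment asks for amounts to comparing the reported run against the workflow's current run before applying the completion. The sketch below is a simplified stand-in: `check_run_id` and `CompletionError` are hypothetical, and it takes the current run ID as an argument, whereas the project's `validate_run_id` is called with a workflow ID and presumably looks the current run up itself.

```rust
/// Hypothetical error type mirroring an "Unprocessable" response.
#[derive(Debug, PartialEq)]
enum CompletionError {
    Unprocessable(String),
}

/// Reject a completion whose run_id does not match the workflow's current run,
/// so a stale runner cannot complete jobs that belong to an older run.
fn check_run_id(current_run_id: i64, reported_run_id: i64) -> Result<(), CompletionError> {
    if current_run_id == reported_run_id {
        Ok(())
    } else {
        Err(CompletionError::Unprocessable(format!(
            "run_id {reported_run_id} does not match current run_id {current_run_id}"
        )))
    }
}

fn main() {
    println!("{:?}", check_run_id(3, 3));
    println!("{:?}", check_run_id(3, 2));
}
```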
Pull request overview
Copilot reviewed 152 out of 152 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 152 out of 152 changed files in this pull request and generated 1 comment.
Summary
- Add `batch_complete_jobs` and expose the new API/models in generated clients
- Wake the runner on `SIGCHLD` so local job completions refill slots without waiting for the sleep interval
- Add `tests/workflows/pipeline_perf_test/workflow.yaml` to exercise the short-job pipeline on systems where GPU debug partitions are not practical

Long-poll note
We also explored server-backed long-poll claims (`wait_seconds` on claim endpoints) to wake idle workers when downstream jobs became ready. After working through the runner behavior, we rejected that approach for this PR and removed it entirely.

Reasons we backed it out:

The final PR keeps the lower-risk improvements (`SIGCHLD` wakeup and real batched completion handling) and leaves long-poll out.

Validation
- `cargo fmt -- --check`
- `cargo clippy --all --all-targets --all-features -- -D warnings`
- `dprint check`
- `shellcheck` (via commit hook)
- `cargo check --workspace`