Improve short-job throughput for runner/server hot path by daniel-thom · Pull Request #286 · NatLabRockies/torc

daniel-thom · 2026-04-26T19:10:04Z

Summary

add real batched server-side handling for batch_complete_jobs and expose the new API/models in generated clients
wake runners promptly on SIGCHLD so local job completions refill slots without waiting for the sleep interval
add a CPU-only three-stage perf workflow at tests/workflows/pipeline_perf_test/workflow.yaml to exercise the short-job pipeline on systems where GPU debug partitions are not practical

Long-poll note

We also explored server-backed long-poll claims (wait_seconds on claim endpoints) to wake idle workers when downstream jobs became ready. After working through the runner behavior, we rejected that approach for this PR and removed it entirely.

Reasons we backed it out:

the runner uses blocking claim RPCs, so a parked long-poll request delays local completion processing until the request returns
that can slow completion propagation and downstream unblocking near the tail of a workflow, which cuts against the goal for short jobs
it adds operational risk around OpenShift / proxy timeouts and possible thundering-herd wakeups without a proven net win in this runner architecture
it adds more behavioral complexity and unpredictability than the lower-risk changes in this PR

The final PR keeps the lower-risk improvements (SIGCHLD wakeup and real batched completion handling) and leaves long-poll out.

Validation

cargo fmt -- --check
cargo clippy --all --all-targets --all-features -- -D warnings
dprint check
shellcheck (via commit hook)
cargo check --workspace

Three small, independent changes targeting throughput when many runners churn through short jobs:

1. Skip the post-iteration sleep in the runner main loop when an iteration reported one or more completions. Completions free capacity and the deferred unblock task may have made more jobs ready, so reacting immediately closes the idle gap rather than waiting a full poll interval. Started jobs do not trigger skip.
2. Lock-free pre-check in transport_prepare_ready_jobs: a non-locking SELECT 1 against the same WHERE clause runs before BEGIN IMMEDIATE so empty claims do not contend for the SQLite write lock. The workflow existence check is also moved out of the transaction since it is a pure read.
3. New batch_complete_jobs endpoint that accepts a list of completions in one request and returns per-completion success/error vectors. The runner's handle_job_completion is split into prepare_job_completion (local recovery + status decision) and report_completions_batch (one batched send + finalization), with all four call sites routed through a single handle_completions_batch helper. Halves request count under load.

API version bumped to 0.14.0.

Also adds a 3,001-job GPU pipeline workflow at tests/workflows/gpu_pipeline_perf_test/ for manual A/B testing against main.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
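The per-completion success/error shape described for batch_complete_jobs in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the type names echo the model names listed in the review, but every field, the `completed`/`errors` split, and the `apply_one` callback are assumptions made for the example.

```rust
/// One completion reported by a runner. Fields are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct JobCompletionEntry {
    pub job_id: i64,
    pub run_id: i64,
    pub return_code: i32,
}

/// Per-job failure returned alongside the successes (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub struct JobCompletionError {
    pub job_id: i64,
    pub message: String,
}

/// Batched outcome: one vector of successes, one of errors (assumed shape).
#[derive(Debug, Default, PartialEq)]
pub struct BatchCompleteJobsResponse {
    pub completed: Vec<i64>,
    pub errors: Vec<JobCompletionError>,
}

/// Apply every entry in order. A failing entry is recorded in `errors`
/// and does not abort the rest of the batch.
/// `apply_one` is a hypothetical stand-in for the per-job state update.
pub fn batch_complete_jobs<F>(
    entries: &[JobCompletionEntry],
    mut apply_one: F,
) -> BatchCompleteJobsResponse
where
    F: FnMut(&JobCompletionEntry) -> Result<(), String>,
{
    let mut response = BatchCompleteJobsResponse::default();
    for entry in entries {
        match apply_one(entry) {
            Ok(()) => response.completed.push(entry.job_id),
            Err(message) => response.errors.push(JobCompletionError {
                job_id: entry.job_id,
                message,
            }),
        }
    }
    response
}
```

With a response like this, the runner can finalize the jobs listed as completed and retry or log only the entries that came back as errors, instead of treating the whole request as failed.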

Copilot

Pull request overview

This PR targets higher short-job throughput by reducing runner idle gaps and cutting completion-reporting round trips via a new server-side batched completion API, while also adding a CPU-only performance workflow to exercise the runner/server hot path.

Changes:

Add batch_complete_jobs workflow endpoint with new request/response models and propagate it through OpenAPI + generated Rust/Python/Julia clients.
Add a condvar-based wakeup primitive and SIGCHLD-driven wakeups so runners react immediately to local subprocess exits instead of waiting for the poll interval.
Add a 3-stage, CPU-only pipeline perf workflow (3,001 jobs) for stress/perf validation.
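A condvar-based wakeup primitive of the kind mentioned above might look roughly like this. It is a sketch using only the standard library; the `Wakeup` type, its method names, and the sticky-flag behavior are assumptions for illustration and are not taken from src/client/job_runner.rs. It also assumes `notify` is called from an ordinary thread that receives signals, not from inside a raw signal handler, since locking a mutex there is not async-signal-safe.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Hypothetical wakeup primitive: a sticky boolean flag plus a condvar.
#[derive(Clone, Default)]
pub struct Wakeup {
    inner: Arc<(Mutex<bool>, Condvar)>,
}

impl Wakeup {
    pub fn new() -> Self {
        Self::default()
    }

    /// Mark the wakeup as signaled and wake any waiting runner loop.
    /// Intended to be called when SIGCHLD or SIGTERM has been observed.
    pub fn notify(&self) {
        let (flag, cv) = &*self.inner;
        *flag.lock().unwrap() = true;
        cv.notify_all();
    }

    /// Sleep for at most `timeout`. Returns true if a notification
    /// arrived (before or during the wait), false if the wait timed out.
    /// The flag is cleared before returning so each notification is
    /// consumed once.
    pub fn wait(&self, timeout: Duration) -> bool {
        let (flag, cv) = &*self.inner;
        let guard = flag.lock().unwrap();
        let (mut guard, _) = cv
            .wait_timeout_while(guard, timeout, |signaled| !*signaled)
            .unwrap();
        let woken = *guard;
        *guard = false;
        woken
    }
}
```

The runner loop would call `wait(poll_interval)` in place of a plain sleep. Because the flag is sticky, a notification that arrives while the loop is busy is not lost: the next `wait` returns immediately.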

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated 5 comments.

File	Description
tests/workflows/pipeline_perf_test/workflow.yaml	Adds a CPU-only, 3-stage pipeline workload to stress short-job runner/server behavior.
src/server/response_types.rs	Re-exports new `BatchCompleteJobsResponse` via response facade.
src/server/live_state.rs	Minor refactor: uses `HashSet` import for failure-tracking state.
src/server/live_router.rs	Adds HTTP route + utoipa path for `POST /workflows/{id}/batch_complete_jobs`.
src/server/http_transport/response_mapping.rs	Adds HTTP response mapping function for `BatchCompleteJobsResponse`.
src/server/http_server/workflows_transport.rs	Adjusts workflow transport (notably claim path) behavior.
src/server/http_server/jobs_transport.rs	Implements server-side batched completion handling + claim pre-check optimization.
src/server/http_server.rs	Exposes `batch_complete_jobs` on the transport trait implementation.
src/server/api_responses.rs	Introduces `BatchCompleteJobsResponse` response enum.
src/server/api_contract.rs	Adds `batch_complete_jobs` to the transport API contract.
src/run_jobs_cmd.rs	Registers SIGCHLD wakeup for local runner; adjusts cookie-header handling.
src/openapi_spec.rs	Adds schemas + path wiring for batch completion in generated OpenAPI doc.
src/models.rs	Adds `BatchCompleteJobsRequest/Response` + completion entry/error models.
src/client/job_runner.rs	Uses batched completion reporting; adds wakeup primitive + idle-wait logic; adds wakeup unit tests.
src/client/apis/workflows_api.rs	Generated Rust client: adds `batch_complete_jobs` API call + typed error enum.
src/bin/torc-slurm-job-runner.rs	Extends signal handling to SIGCHLD and wakes runner on SIGTERM/SIGCHLD.
src/api_version.rs	Bumps HTTP API contract version to `0.14.0`.
python_client/src/torc/openapi_client/models/job_completion_error.py	Generated Python model for per-job completion error entries.
python_client/src/torc/openapi_client/models/job_completion_entry.py	Generated Python model for a single completion entry.
python_client/src/torc/openapi_client/models/batch_complete_jobs_request.py	Generated Python request model for batched completion reporting.
python_client/src/torc/openapi_client/models/batch_complete_jobs_response.py	Generated Python response model for batched completion outcomes.
python_client/src/torc/openapi_client/models/__init__.py	Exports the newly generated batch-completion models.
python_client/src/torc/openapi_client/api/workflows_api.py	Generated Python API: adds `batch_complete_jobs` endpoint wrapper.
python_client/src/torc/openapi_client/__init__.py	Updates Python package exports for new models.
julia_client/julia_client/docs/WorkflowsApi.md	Documents new `batch_complete_jobs` workflow endpoint.
julia_client/julia_client/docs/JobCompletionError.md	Documents new completion error model.
julia_client/julia_client/docs/JobCompletionEntry.md	Documents new completion entry model.
julia_client/julia_client/docs/BatchCompleteJobsResponse.md	Documents new batch completion response model.
julia_client/julia_client/docs/BatchCompleteJobsRequest.md	Documents new batch completion request model.
julia_client/julia_client/README.md	Adds new endpoint + model docs to the Julia client README.
julia_client/Torc/src/api/models/model_JobCompletionError.jl	Generated Julia model for completion errors.
julia_client/Torc/src/api/models/model_JobCompletionEntry.jl	Generated Julia model for completion entries.
julia_client/Torc/src/api/models/model_BatchCompleteJobsResponse.jl	Generated Julia model for batched completion response.
julia_client/Torc/src/api/models/model_BatchCompleteJobsRequest.jl	Generated Julia model for batched completion request.
julia_client/Torc/src/api/modelincludes.jl	Includes newly generated Julia model files.
julia_client/Torc/src/api/apis/api_WorkflowsApi.jl	Generated Julia API: adds `batch_complete_jobs`.
api/openapi.yaml	Adds OpenAPI path + schema definitions for `batch_complete_jobs`.
api/openapi.codegen.yaml	Mirrors OpenAPI additions for codegen input.

Copilot

Pull request overview

Copilot reviewed 149 out of 149 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 152 out of 152 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 152 out of 152 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 152 out of 152 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/server/http_server/jobs_transport.rs:1095

apply_job_completion_state (used by the single-job complete_job endpoint) no longer validates run_id against the workflow’s current workflow_status.run_id. The TX-backed path does this (apply_job_completion_state_tx), and other endpoints like start_job/manage_status_change also call validate_run_id, so this looks like a regression that could let stale runners complete jobs for an old run. Suggest adding the same validate_run_id(job.workflow_id, run_id) check here and returning an Unprocessable response on mismatch (consistent with the TX path).
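The guard suggested in this comment amounts to comparing the run_id a runner reports with the workflow's current run_id before applying the completion. A simplified sketch follows; `check_run_id` is a hypothetical stand-in, not the repository's validate_run_id, and it takes the current run_id as an argument instead of looking it up from workflow_status.

```rust
/// Hypothetical stale-run guard. Returns an error message suitable for an
/// Unprocessable-style response when the reported run_id does not match
/// the workflow's current run_id.
pub fn check_run_id(current_run_id: i64, reported_run_id: i64) -> Result<(), String> {
    if reported_run_id == current_run_id {
        Ok(())
    } else {
        Err(format!(
            "run_id mismatch: reported {} but current run is {}",
            reported_run_id, current_run_id
        ))
    }
}
```

Calling a check like this at the top of the completion path would reject completions from a runner that still belongs to an earlier run, before any job state changes.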

Copilot

Pull request overview

Copilot reviewed 152 out of 152 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 152 out of 152 changed files in this pull request and generated 1 comment.

daniel-thom and others added 7 commits April 25, 2026 17:56

Improve short-job throughput paths

9be4587

Reduce completion latency with adaptive watcher

1b14273

Wake runner on SIGCHLD

50f6237

Scope long-poll HTTP settings to runners

cb97174

Remove long-poll claim support

b6ee8f5

Rename pipeline perf test workflow

04a45de

daniel-thom requested a review from Copilot April 26, 2026 19:10

Copilot started reviewing on behalf of daniel-thom April 26, 2026 19:10