
Improve short-job throughput for runner/server hot path #286

Merged
daniel-thom merged 9 commits into NatLabRockies:main from daniel-thom:perf/db-load
Apr 26, 2026

@daniel-thom
Collaborator

Summary

  • add real batched server-side handling for batch_complete_jobs and expose the new API/models in generated clients
  • wake runners promptly on SIGCHLD so local job completions refill slots without waiting for the sleep interval
  • add a CPU-only three-stage perf workflow at tests/workflows/pipeline_perf_test/workflow.yaml to exercise the short-job pipeline on systems where GPU debug partitions are not practical
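
The SIGCHLD wakeup can be pictured as a flag plus a condition variable that the idle runner waits on with a timeout. The following is a minimal sketch under assumptions, not the PR's code: the `Wakeup` type and its `notify`/`wait` methods are hypothetical names, and a plain thread stands in for whatever observes SIGCHLD (signal registration itself is not shown).

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical wakeup primitive: a pending flag guarded by a mutex, plus a condvar.
#[derive(Default)]
struct Wakeup {
    pending: Mutex<bool>,
    cv: Condvar,
}

impl Wakeup {
    /// Record a wakeup and notify any waiter. Assumed to be called by the
    /// thread that observes a child exit, not from a signal handler itself.
    fn notify(&self) {
        *self.pending.lock().unwrap() = true;
        self.cv.notify_all();
    }

    /// Wait up to `timeout`. Returns true if woken by `notify`, false on timeout.
    /// The flag is cleared before returning so the next wait starts fresh.
    fn wait(&self, timeout: Duration) -> bool {
        let guard = self.pending.lock().unwrap();
        let (mut guard, _) = self
            .cv
            .wait_timeout_while(guard, timeout, |pending| !*pending)
            .unwrap();
        let woken = *guard;
        *guard = false;
        woken
    }
}

fn main() {
    let wakeup = Arc::new(Wakeup::default());
    let notifier = Arc::clone(&wakeup);
    // Stand-in for a local job exiting shortly after the runner goes idle.
    let child_exit = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        notifier.notify();
    });
    let start = Instant::now();
    // The runner would otherwise sleep for the full poll interval (5 s here).
    let woken = wakeup.wait(Duration::from_secs(5));
    child_exit.join().unwrap();
    println!("woken={woken} after {:?}", start.elapsed());
}
```

Because the flag is set before the notification, a wakeup that arrives just before the runner starts waiting is not lost; the wait returns immediately on the next call.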

Long-poll note

We also explored server-backed long-poll claims (wait_seconds on claim endpoints) to wake idle workers when downstream jobs became ready. After working through the runner behavior, we rejected that approach for this PR and removed it entirely.

Reasons we backed it out:

  • the runner uses blocking claim RPCs, so a parked long-poll request delays local completion processing until the request returns
  • that can slow completion propagation and downstream unblocking near the tail of a workflow, which cuts against the goal for short jobs
  • it adds operational risk around OpenShift / proxy timeouts and possible thundering-herd wakeups without a proven net win in this runner architecture
  • it adds more behavioral complexity and unpredictability than the lower-risk changes in this PR

The final PR keeps the lower-risk improvements (SIGCHLD wakeup and real batched completion handling) and leaves long-poll out.

Validation

  • cargo fmt -- --check
  • cargo clippy --all --all-targets --all-features -- -D warnings
  • dprint check
  • shellcheck (via commit hook)
  • cargo check --workspace

daniel-thom and others added 7 commits April 25, 2026 17:56
Three small, independent changes targeting throughput when many runners
churn through short jobs:

1. Skip the post-iteration sleep in the runner main loop when an
   iteration reported one or more completions. Completions free
   capacity and the deferred unblock task may have made more jobs
   ready, so reacting immediately closes the idle gap rather than
   waiting a full poll interval. Started jobs do not trigger the skip.

2. Lock-free pre-check in transport_prepare_ready_jobs: a non-locking
   SELECT 1 against the same WHERE clause runs before BEGIN IMMEDIATE
   so empty claims do not contend for the SQLite write lock. The
   workflow existence check is also moved out of the transaction since
   it is a pure read.

3. New batch_complete_jobs endpoint that accepts a list of completions
   in one request and returns per-completion success/error vectors.
   The runner's handle_job_completion is split into prepare_job_completion
   (local recovery + status decision) and report_completions_batch
   (one batched send + finalization), with all four call sites routed
   through a single handle_completions_batch helper. Halves request
   count under load. API version bumped to 0.14.0.
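
As an illustration of item 3, a batched completion handler can apply each entry independently and report failures per entry rather than failing the whole request. This is a sketch only: `CompletionEntry`, `CompletionError`, `BatchResponse`, and `batch_complete` are hypothetical names and shapes, and an in-memory set stands in for the server's job state; the actual models in this PR may differ.

```rust
use std::collections::HashSet;

/// Hypothetical shape of one completion in a batch request.
struct CompletionEntry {
    job_id: i64,
    return_code: i32,
}

/// Hypothetical per-entry failure reported alongside the successes.
#[derive(Debug)]
struct CompletionError {
    job_id: i64,
    message: String,
}

/// Hypothetical response: parallel success and error vectors.
#[derive(Debug, Default)]
struct BatchResponse {
    completed: Vec<(i64, i32)>,
    errors: Vec<CompletionError>,
}

/// Apply every entry, collecting failures instead of aborting the batch.
/// `running` is the set of job ids the server believes are running.
fn batch_complete(running: &mut HashSet<i64>, entries: &[CompletionEntry]) -> BatchResponse {
    let mut resp = BatchResponse::default();
    for e in entries {
        if running.remove(&e.job_id) {
            resp.completed.push((e.job_id, e.return_code));
        } else {
            resp.errors.push(CompletionError {
                job_id: e.job_id,
                message: format!("job {} is not running", e.job_id),
            });
        }
    }
    resp
}

fn main() {
    let mut running: HashSet<i64> = [1, 2].into_iter().collect();
    let entries = vec![
        CompletionEntry { job_id: 1, return_code: 0 },
        CompletionEntry { job_id: 3, return_code: 1 }, // unknown job
    ];
    let resp = batch_complete(&mut running, &entries);
    println!("completed: {:?}", resp.completed);
    for err in &resp.errors {
        println!("error for job {}: {}", err.job_id, err.message);
    }
}
```

With this shape, one bad entry does not force the runner to resend completions that the server already accepted.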

Also adds a 3,001-job GPU pipeline workflow at
tests/workflows/gpu_pipeline_perf_test/ for manual A/B testing against
main.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR targets higher short-job throughput by reducing runner idle gaps and cutting completion-reporting round trips via a new server-side batched completion API, while also adding a CPU-only performance workflow to exercise the runner/server hot path.

Changes:

  • Add batch_complete_jobs workflow endpoint with new request/response models and propagate it through OpenAPI + generated Rust/Python/Julia clients.
  • Add a condvar-based wakeup primitive and SIGCHLD-driven wakeups so runners react immediately to local subprocess exits instead of waiting for the poll interval.
  • Add a 3-stage, CPU-only pipeline perf workflow (3,001 jobs) for stress/perf validation.

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/workflows/pipeline_perf_test/workflow.yaml | Adds a CPU-only, 3-stage pipeline workload to stress short-job runner/server behavior. |
| src/server/response_types.rs | Re-exports new BatchCompleteJobsResponse via response facade. |
| src/server/live_state.rs | Minor refactor: uses HashSet import for failure-tracking state. |
| src/server/live_router.rs | Adds HTTP route + utoipa path for POST /workflows/{id}/batch_complete_jobs. |
| src/server/http_transport/response_mapping.rs | Adds HTTP response mapping function for BatchCompleteJobsResponse. |
| src/server/http_server/workflows_transport.rs | Adjusts workflow transport (notably claim path) behavior. |
| src/server/http_server/jobs_transport.rs | Implements server-side batched completion handling + claim pre-check optimization. |
| src/server/http_server.rs | Exposes batch_complete_jobs on the transport trait implementation. |
| src/server/api_responses.rs | Introduces BatchCompleteJobsResponse response enum. |
| src/server/api_contract.rs | Adds batch_complete_jobs to the transport API contract. |
| src/run_jobs_cmd.rs | Registers SIGCHLD wakeup for local runner; adjusts cookie-header handling. |
| src/openapi_spec.rs | Adds schemas + path wiring for batch completion in generated OpenAPI doc. |
| src/models.rs | Adds BatchCompleteJobsRequest/Response + completion entry/error models. |
| src/client/job_runner.rs | Uses batched completion reporting; adds wakeup primitive + idle-wait logic; adds wakeup unit tests. |
| src/client/apis/workflows_api.rs | Generated Rust client: adds batch_complete_jobs API call + typed error enum. |
| src/bin/torc-slurm-job-runner.rs | Extends signal handling to SIGCHLD and wakes runner on SIGTERM/SIGCHLD. |
| src/api_version.rs | Bumps HTTP API contract version to 0.14.0. |
| python_client/src/torc/openapi_client/models/job_completion_error.py | Generated Python model for per-job completion error entries. |
| python_client/src/torc/openapi_client/models/job_completion_entry.py | Generated Python model for a single completion entry. |
| python_client/src/torc/openapi_client/models/batch_complete_jobs_request.py | Generated Python request model for batched completion reporting. |
| python_client/src/torc/openapi_client/models/batch_complete_jobs_response.py | Generated Python response model for batched completion outcomes. |
| python_client/src/torc/openapi_client/models/__init__.py | Exports the newly generated batch-completion models. |
| python_client/src/torc/openapi_client/api/workflows_api.py | Generated Python API: adds batch_complete_jobs endpoint wrapper. |
| python_client/src/torc/openapi_client/__init__.py | Updates Python package exports for new models. |
| julia_client/julia_client/docs/WorkflowsApi.md | Documents new batch_complete_jobs workflow endpoint. |
| julia_client/julia_client/docs/JobCompletionError.md | Documents new completion error model. |
| julia_client/julia_client/docs/JobCompletionEntry.md | Documents new completion entry model. |
| julia_client/julia_client/docs/BatchCompleteJobsResponse.md | Documents new batch completion response model. |
| julia_client/julia_client/docs/BatchCompleteJobsRequest.md | Documents new batch completion request model. |
| julia_client/julia_client/README.md | Adds new endpoint + model docs to the Julia client README. |
| julia_client/Torc/src/api/models/model_JobCompletionError.jl | Generated Julia model for completion errors. |
| julia_client/Torc/src/api/models/model_JobCompletionEntry.jl | Generated Julia model for completion entries. |
| julia_client/Torc/src/api/models/model_BatchCompleteJobsResponse.jl | Generated Julia model for batched completion response. |
| julia_client/Torc/src/api/models/model_BatchCompleteJobsRequest.jl | Generated Julia model for batched completion request. |
| julia_client/Torc/src/api/modelincludes.jl | Includes newly generated Julia model files. |
| julia_client/Torc/src/api/apis/api_WorkflowsApi.jl | Generated Julia API: adds batch_complete_jobs. |
| api/openapi.yaml | Adds OpenAPI path + schema definitions for batch_complete_jobs. |
| api/openapi.codegen.yaml | Mirrors OpenAPI additions for codegen input. |

Comment thread src/server/http_server/workflows_transport.rs
Comment thread src/client/job_runner.rs Outdated
Comment thread src/client/job_runner.rs Outdated
Comment thread src/api_version.rs
Comment thread src/run_jobs_cmd.rs
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 149 out of 149 changed files in this pull request and generated 4 comments.


Comment thread src/server/live_router.rs
Comment thread src/client/job_runner.rs Outdated
Comment thread src/server/http_server/jobs_transport.rs
Comment thread src/server/http_server/jobs_transport.rs
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 152 out of 152 changed files in this pull request and generated 3 comments.


Comment thread src/server/http_server/jobs_transport.rs
Comment thread src/server/http_server/jobs_transport.rs
Comment thread src/server/live_router.rs
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 152 out of 152 changed files in this pull request and generated 2 comments.


Comment thread src/server/live_router.rs
Comment thread src/client/job_runner.rs
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 152 out of 152 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/server/http_server/jobs_transport.rs:1095

  • apply_job_completion_state (used by the single-job complete_job endpoint) no longer validates run_id against the workflow’s current workflow_status.run_id. The TX-backed path does this (apply_job_completion_state_tx), and other endpoints like start_job/manage_status_change also call validate_run_id, so this looks like a regression that could let stale runners complete jobs for an old run. Suggest adding the same validate_run_id(job.workflow_id, run_id) check here and returning an Unprocessable response on mismatch (consistent with the TX path).
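
The suggested guard could look roughly like the following. This is a sketch under assumptions: `check_run_id` is a hypothetical stand-in for the project's `validate_run_id`, the `Outcome` enum is not the server's real response type, and a map from workflow id to current run id replaces the `workflow_status` lookup.

```rust
use std::collections::HashMap;

/// Outcome of a completion request in this sketch (not the server's real type).
#[derive(Debug, PartialEq)]
enum Outcome {
    Applied,
    Unprocessable(String),
}

/// Hypothetical stand-in for the run_id check. `current_runs` maps a
/// workflow id to that workflow's current run id.
fn check_run_id(
    current_runs: &HashMap<i64, i64>,
    workflow_id: i64,
    run_id: i64,
) -> Result<(), String> {
    match current_runs.get(&workflow_id) {
        Some(&current) if current == run_id => Ok(()),
        Some(&current) => Err(format!(
            "run_id {run_id} does not match current run_id {current} for workflow {workflow_id}"
        )),
        None => Err(format!("workflow {workflow_id} not found")),
    }
}

/// Reject completions from stale runs before touching job state.
fn complete_job(current_runs: &HashMap<i64, i64>, workflow_id: i64, run_id: i64) -> Outcome {
    if let Err(msg) = check_run_id(current_runs, workflow_id, run_id) {
        return Outcome::Unprocessable(msg);
    }
    // The job state transition would be applied here.
    Outcome::Applied
}

fn main() {
    let mut current_runs = HashMap::new();
    current_runs.insert(1_i64, 3_i64); // workflow 1 is on run 3

    for run_id in [3, 2] {
        match complete_job(&current_runs, 1, run_id) {
            Outcome::Applied => println!("run {run_id}: applied"),
            Outcome::Unprocessable(msg) => println!("run {run_id}: rejected ({msg})"),
        }
    }
}
```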

Comment thread src/server/http_server/jobs_transport.rs
Comment thread src/server/http_server/jobs_transport.rs
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 152 out of 152 changed files in this pull request and generated 1 comment.


Comment thread src/server/http_server/jobs_transport.rs
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 152 out of 152 changed files in this pull request and generated 1 comment.


Comment thread src/server/live_router.rs
daniel-thom merged commit 6044125 into NatLabRockies:main on Apr 26, 2026
24 checks passed
daniel-thom deleted the perf/db-load branch on April 26, 2026 23:16
2 participants