Fix regression in batch_job_complete #287

Merged: daniel-thom merged 1 commit into main from fix/regression on Apr 27, 2026

Conversation

@daniel-thom
Collaborator

No description provided.

@daniel-thom daniel-thom requested a review from Copilot April 27, 2026 21:03
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes a regression in batch_job_complete by avoiding deadlocks under shared-cache SQLite when completing multiple jobs in a single transaction.

Changes:

  • Replaced jobs_api.get_job(...) with an in-transaction SQL read for job metadata used during completion.
  • Inlined run_id validation using the same transaction to avoid pool-connection reads that can deadlock.
  • Constructed a minimal JobModel to build the completion record without reloading the full model.


Comment thread: src/server/http_server/jobs_transport.rs
Comment thread: src/server/http_server/jobs_transport.rs (outdated)
Comment thread: src/server/http_server/jobs_transport.rs (outdated)
apply_job_completion_state_tx opened a transaction on connection A,
then for each completion called jobs_api.get_job and validate_run_id,
which both grab fresh pool connections. After the first iteration's
UPDATE took a write lock on tx, every subsequent iteration's read on
a separate connection blocked on that lock. With the default
--threads 1, the single tokio worker awaited the blocked read while
the lock could only release once that worker continued — self-deadlock.
Manifests under shared-cache in-memory SQLite (table-level locking);
under WAL on disk the same pattern causes pool exhaustion and
slow-acquire warnings instead.

In production the runner's 30s HTTP timeout fired, the 20-minute retry
budget was exhausted, and in-flight jobs were killed.

Read job state and validate run_id directly through &mut **tx so all
DB ops in the handler share one connection. Also use i32::try_from
for the status column and populate JobModel.command from the same row.

Add a regression test that spawns its own in-memory torc-server with
--threads 1 and asserts a 4-job batch completes in <10s. Verified to
fail on unpatched code (Reqwest TimedOut after ~47s).
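The i32::try_from mentioned above can be sketched in isolation. The function name and error shape below are illustrative assumptions, not the PR's actual code; the point is only that SQLite integer columns decode as i64 and narrowing to the i32 status representation should be checked rather than a lossy `as` cast.

```rust
// Illustrative sketch (not the PR's actual code): SQLite integer columns
// come back from sqlx as i64, so narrowing to the i32 used for the status
// field is done with a checked conversion instead of `as`.
fn status_from_column(raw: i64) -> Result<i32, String> {
    i32::try_from(raw).map_err(|_| format!("status value {raw} out of i32 range"))
}

fn main() {
    // In-range values convert losslessly; out-of-range values surface an error
    // instead of silently wrapping.
    assert_eq!(status_from_column(3), Ok(3));
    assert!(status_from_column(i64::from(i32::MAX) + 1).is_err());
}
```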
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.



Comment on lines +1219 to +1224
let job_row = match sqlx::query!(
"SELECT workflow_id, name, command, status FROM job WHERE id = ?",
id
)
.fetch_optional(&mut **tx)
.await

Copilot AI Apr 27, 2026


New sqlx::query! added here will require updating the checked-in SQLx offline metadata (.sqlx/query-*.json) so SQLX_OFFLINE=true builds (e.g. release/publish workflows) continue to compile. Please run cargo sqlx prepare (or the repo’s equivalent) and commit the generated query descriptor for this SELECT.

Comment on lines +158 to +160
let start = Instant::now();
let response = apis::workflows_api::batch_complete_jobs(config, workflow_id, request)
.expect("batch_complete_jobs failed");

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regression test measures elapsed time only after batch_complete_jobs(...) returns, but the generated OpenAPI client’s reqwest::blocking::Client has no request timeout by default. If the deadlock regresses, the call may hang indefinitely and stall CI. Consider constructing Configuration with a blocking reqwest client that has a hard timeout (e.g., 10–15s) or running the request in a separate thread and failing the test after a join timeout.
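The reviewer's second suggestion, running the request on its own thread and failing after a join timeout, can be sketched with only the standard library. The closure here is a stand-in for the batch_complete_jobs call; the helper name is hypothetical.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper sketching the suggestion above: run a possibly-hanging
/// call on its own thread, and give up after `budget` instead of stalling CI.
/// Returns None if the call did not finish in time.
fn run_with_deadline<T: Send + 'static>(
    budget: Duration,
    request: impl FnOnce() -> T + Send + 'static,
) -> Option<T> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the deadline already passed, the receiver is gone; ignore the error.
        let _ = tx.send(request());
    });
    rx.recv_timeout(budget).ok()
}

fn main() {
    // A fast request completes within budget...
    assert!(run_with_deadline(Duration::from_secs(1), || 42).is_some());
    // ...while a hang is converted into a timely failure instead of a stall.
    let hung = run_with_deadline(Duration::from_millis(50), || {
        thread::sleep(Duration::from_millis(400))
    });
    assert!(hung.is_none());
}
```

With this shape, a regressed deadlock fails the test after the budget elapses rather than hanging until the CI job itself is killed.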

Comment on lines +1330 to +1336
// Inline run_id validation against tx for the same reason: validate_run_id uses a
// fresh pool connection and would deadlock against the in-flight transaction.
let workflow_run_id_row = sqlx::query!(
"SELECT run_id FROM workflow_status WHERE id = ?",
job_workflow_id
)
.fetch_optional(&mut **tx)

Copilot AI Apr 27, 2026


The run_id check logic here now duplicates validate_run_id(...) in runtime_support.rs (query + mismatch/none handling). To avoid future divergence, consider factoring a validate_run_id_with_executor(executor, workflow_id, run_id) helper that can run against either the pool or a transaction.

@daniel-thom daniel-thom merged commit 172e504 into main Apr 27, 2026
13 checks passed
@daniel-thom daniel-thom deleted the fix/regression branch April 27, 2026 21:38
nreinicke pushed a commit to nreinicke/torc that referenced this pull request Apr 28, 2026
The v0.24.1 deadlock fix in batch_complete_jobs (NatLabRockies#287) moved the per-job
SELECT inside the same transaction as its writes. With pool.begin()
issuing BEGIN DEFERRED, that SELECT now acquires a WAL read snapshot, so
the subsequent UPDATE/INSERT in the same tx can fail immediately with
SQLITE_BUSY_SNAPSHOT (517) when another connection commits in between.
busy_timeout does not retry that error. Two other handlers had the same
shape: process_workflow_unblocks_inner (masked by its own retry loop)
and update_jobs_from_completion_reversal.

Server changes:
- transport_batch_complete_jobs, process_workflow_unblocks_inner, and
  update_jobs_from_completion_reversal now use begin_immediate so the
  write lock is acquired up front and busy_timeout applies.
- In-tx error sites in apply_job_completion_state_tx and the begin/
  commit wrappers in transport_batch_complete_jobs now use
  database_lock_aware_error so lock contention propagates to the client
  and logs at debug instead of error.
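A lock-aware error split like the one described above can be sketched as follows. The matched substrings and the enum shape are assumptions for illustration, not torc's actual implementation; the idea is that lock contention is expected under concurrency and should be retryable and quiet, while everything else stays a hard error.

```rust
/// Hypothetical sketch of a `database_lock_aware_error`-style classifier.
#[derive(Debug, PartialEq)]
enum DbError {
    /// Expected contention: propagate to the client as retryable, log at debug.
    Locked(String),
    /// Anything else remains a hard error logged at error level.
    Other(String),
}

fn database_lock_aware_error(message: &str) -> DbError {
    // Substring matches are assumptions about SQLite's error text
    // ("database is locked", "SQLITE_BUSY_SNAPSHOT").
    let lower = message.to_ascii_lowercase();
    if lower.contains("database is locked") || lower.contains("busy_snapshot") {
        DbError::Locked(message.to_string())
    } else {
        DbError::Other(message.to_string())
    }
}

fn main() {
    assert!(matches!(
        database_lock_aware_error("error code 517: SQLITE_BUSY_SNAPSHOT"),
        DbError::Locked(_)
    ));
    assert!(matches!(
        database_lock_aware_error("UNIQUE constraint failed"),
        DbError::Other(_)
    ));
}
```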

Client changes:
- send_with_retries now does a fast-retry phase for database lock errors
  (up to 6 attempts, 50ms->2s exponential backoff) before falling
  through to the existing 30s ping-and-wait loop, which stays as the
  right behavior for genuine outages. Previously a transient lock error
  cost ~30s of throughput per occurrence because the loop unconditionally
  slept PING_INTERVAL_SECONDS before retrying.
- New is_database_lock_error helper, also matched by is_retryable_error
  so the lock substring (now propagated by the server) reliably enters
  the retry path.
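The fast-retry schedule above can be sketched under its stated parameters (up to 6 attempts, delays doubling from 50ms toward 2s); exactly how the cap is applied is an assumption, not torc's actual code.

```rust
use std::time::Duration;

// Sketch of the fast-retry backoff schedule described above: doubling from
// 50ms, capped at 2s. The cap logic is an assumed reading of "50ms->2s".
fn fast_retry_delays(attempts: u32) -> Vec<Duration> {
    (0..attempts)
        .map(|i| Duration::from_millis((50u64 << i).min(2_000)))
        .collect()
}

fn main() {
    let delays = fast_retry_delays(6);
    // 50ms, 100ms, 200ms, 400ms, 800ms, 1600ms: at most ~3.15s of backoff
    // before falling through to the existing 30s ping-and-wait loop.
    println!("{delays:?}");
}
```

This keeps transient lock errors cheap (milliseconds instead of the old unconditional PING_INTERVAL_SECONDS sleep) while genuine outages still reach the slow loop.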

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
