Fix regression in batch_job_complete #287
Conversation
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
This PR fixes a regression in `batch_job_complete` by avoiding deadlocks under shared-cache SQLite when completing multiple jobs in a single transaction.
Changes:
- Replaced `jobs_api.get_job(...)` with an in-transaction SQL read for job metadata used during completion.
- Inlined `run_id` validation using the same transaction to avoid pool-connection reads that can deadlock.
- Constructed a minimal `JobModel` to build the completion record without reloading the full model.
`apply_job_completion_state_tx` opened a transaction on connection A, then for each completion called `jobs_api.get_job` and `validate_run_id`, both of which grab fresh pool connections. After the first iteration's UPDATE took a write lock on the transaction's connection, every subsequent iteration's read on a separate connection blocked on that lock. With the default `--threads 1`, the single tokio worker awaited the blocked read, but the lock could only release once that worker continued: a self-deadlock.

This manifests under shared-cache in-memory SQLite (table-level locking); under WAL on disk, the same pattern causes pool exhaustion and slow-acquire warnings instead. In production, the runner's 30s HTTP timeout fired, the 20-minute retry budget was exhausted, and in-flight jobs were killed.

The fix: read job state and validate `run_id` directly through `&mut **tx` so that all DB operations in the handler share one connection. Also use `i32::try_from` for the status column and populate `JobModel.command` from the same row.

A regression test spawns its own in-memory torc-server with `--threads 1` and asserts that a 4-job batch completes in under 10s. It was verified to fail on unpatched code (reqwest `TimedOut` after ~47s).
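For reference, a minimal sketch of the fixed shape described above: every read in the completion loop goes through the open transaction rather than the pool, the status column is converted fallibly, and a minimal `JobModel` is built from the same row. The field names and the error variant are illustrative assumptions, not the repo's exact code.

```rust
// Sketch only: `Error::InvalidStatus` and the JobModel fields are assumed.
let row = sqlx::query!(
    "SELECT workflow_id, name, command, status FROM job WHERE id = ?",
    job_id
)
.fetch_one(&mut **tx) // same connection as the UPDATEs, so no cross-connection lock wait
.await?;

// SQLite INTEGER columns decode as i64; the model field is i32, so convert
// fallibly rather than truncating with `as`.
let status = i32::try_from(row.status).map_err(|_| Error::InvalidStatus(row.status))?;

// Minimal model: only the fields the completion record needs, no full reload.
let job = JobModel {
    id: job_id,
    workflow_id: row.workflow_id,
    name: row.name,
    command: row.command,
    status,
    ..JobModel::default()
};
```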
Force-pushed from a1a6dcc to 555d10d.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```rust
let job_row = match sqlx::query!(
    "SELECT workflow_id, name, command, status FROM job WHERE id = ?",
    id
)
.fetch_optional(&mut **tx)
.await
```
The new `sqlx::query!` added here will require updating the checked-in SQLx offline metadata (`.sqlx/query-*.json`) so that `SQLX_OFFLINE=true` builds (e.g. release/publish workflows) continue to compile. Please run `cargo sqlx prepare` (or the repo's equivalent) and commit the generated query descriptor for this SELECT.
```rust
let start = Instant::now();
let response = apis::workflows_api::batch_complete_jobs(config, workflow_id, request)
    .expect("batch_complete_jobs failed");
```
This regression test measures elapsed time only after `batch_complete_jobs(...)` returns, but the generated OpenAPI client's `reqwest::blocking::Client` has no request timeout by default. If the deadlock regresses, the call may hang indefinitely and stall CI. Consider constructing `Configuration` with a blocking reqwest client that has a hard timeout (e.g., 10–15s), or running the request in a separate thread and failing the test after a join timeout.
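A sketch of the first option, assuming the generated client follows the usual openapi-generator Rust layout where `Configuration` exposes a public `client` field; the names here are illustrative:

```rust
use std::time::Duration;

// Build a blocking client with a hard timeout so a regressed deadlock fails
// the test quickly instead of hanging CI.
let http_client = reqwest::blocking::Client::builder()
    .timeout(Duration::from_secs(15))
    .build()
    .expect("failed to build reqwest client");

let config = apis::configuration::Configuration {
    base_path: server_url.clone(), // e.g. the spawned torc-server's address
    client: http_client,
    ..Default::default()
};
```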
```rust
// Inline run_id validation against tx for the same reason: validate_run_id uses a
// fresh pool connection and would deadlock against the in-flight transaction.
let workflow_run_id_row = sqlx::query!(
    "SELECT run_id FROM workflow_status WHERE id = ?",
    job_workflow_id
)
.fetch_optional(&mut **tx)
```
The run_id check logic here now duplicates `validate_run_id(...)` in runtime_support.rs (query + mismatch/none handling). To avoid future divergence, consider factoring out a `validate_run_id_with_executor(executor, workflow_id, run_id)` helper that can run against either the pool or a transaction.
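One possible shape for that helper, sketched around sqlx's `Executor` trait; the error variants and column types are placeholders, not the repo's actual definitions:

```rust
use sqlx::{Executor, Sqlite};

// Generic over the executor so callers can pass either `&pool` or `&mut **tx`.
async fn validate_run_id_with_executor<'e, E>(
    executor: E,
    workflow_id: i64,
    run_id: i64,
) -> Result<(), Error>
where
    E: Executor<'e, Database = Sqlite>,
{
    let row = sqlx::query!(
        "SELECT run_id FROM workflow_status WHERE id = ?",
        workflow_id
    )
    .fetch_optional(executor)
    .await
    .map_err(Error::Db)?;

    match row {
        Some(r) if r.run_id == run_id => Ok(()),
        Some(r) => Err(Error::RunIdMismatch { expected: r.run_id, actual: run_id }),
        None => Err(Error::WorkflowNotFound(workflow_id)),
    }
}
```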
The v0.24.1 deadlock fix in `batch_complete_jobs` (NatLabRockies#287) moved the per-job SELECT inside the same transaction as its writes. With `pool.begin()` issuing `BEGIN DEFERRED`, that SELECT now acquires a WAL read snapshot, so the subsequent UPDATE/INSERT in the same transaction can fail immediately with `SQLITE_BUSY_SNAPSHOT` (517) when another connection commits in between; `busy_timeout` does not retry that error. Two other handlers had the same shape: `process_workflow_unblocks_inner` (masked by its own retry loop) and `update_jobs_from_completion_reversal`.

Server changes:
- `transport_batch_complete_jobs`, `process_workflow_unblocks_inner`, and `update_jobs_from_completion_reversal` now use `begin_immediate` so the write lock is acquired up front and `busy_timeout` applies.
- In-tx error sites in `apply_job_completion_state_tx` and the begin/commit wrappers in `transport_batch_complete_jobs` now use `database_lock_aware_error` so lock contention propagates to the client and logs at debug instead of error.

Client changes:
- `send_with_retries` now does a fast-retry phase for database lock errors (up to 6 attempts, 50ms -> 2s exponential backoff) before falling through to the existing 30s ping-and-wait loop, which remains the right behavior for genuine outages. Previously, a transient lock error cost ~30s of throughput per occurrence because the loop unconditionally slept `PING_INTERVAL_SECONDS` before retrying.
- New `is_database_lock_error` helper, also matched by `is_retryable_error`, so the lock substring (now propagated by the server) reliably enters the retry path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
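To illustrate the client-side fast-retry phase described above, a hedged sketch: `send_once`, `is_database_lock_error`, and the request/response/error types are stand-ins for the real `send_with_retries` internals, not its actual signature.

```rust
use std::{thread, time::Duration};

const FAST_RETRY_ATTEMPTS: u32 = 6;

// Fast-retry phase for transient database lock errors: 50ms -> 2s exponential
// backoff instead of sleeping the full 30s ping interval on every failure.
fn send_with_fast_retries(request: &Request) -> Result<Response, ClientError> {
    let mut delay = Duration::from_millis(50);
    let mut last_err = None;

    for _ in 0..FAST_RETRY_ATTEMPTS {
        match send_once(request) {
            Err(err) if is_database_lock_error(&err) => {
                last_err = Some(err);
                thread::sleep(delay);
                // Sleeps of 50, 100, 200, 400, 800, 1600ms, capped at 2s.
                delay = (delay * 2).min(Duration::from_secs(2));
            }
            other => return other, // success, or a non-lock error for the outer loop
        }
    }

    // All fast attempts hit lock contention; fall through to the existing
    // 30s ping-and-wait loop (not shown) with the last error in hand.
    Err(last_err.expect("at least one attempt ran"))
}
```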