Fix regression in batch_job_complete #287
Conversation
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
This PR fixes a regression in `batch_job_complete` by avoiding deadlocks under shared-cache SQLite when completing multiple jobs in a single transaction.
Changes:
- Replaced `jobs_api.get_job(...)` with an in-transaction SQL read for job metadata used during completion.
- Inlined `run_id` validation using the same transaction to avoid pool-connection reads that can deadlock.
- Constructed a minimal `JobModel` to build the completion record without reloading the full model.
`apply_job_completion_state_tx` opened a transaction on connection A, then for each completion called `jobs_api.get_job` and `validate_run_id`, both of which grab fresh pool connections. After the first iteration's UPDATE took a write lock on the transaction's connection, every subsequent iteration's read on a separate connection blocked on that lock. With the default `--threads 1`, the single tokio worker awaited the blocked read, but the lock could only release once that worker continued: a self-deadlock.

This manifests under shared-cache in-memory SQLite (table-level locking); under WAL on disk, the same pattern causes pool exhaustion and slow-acquire warnings instead. In production, the runner's 30s HTTP timeout fired, the 20-minute retry budget was exhausted, and in-flight jobs were killed.

The fix: read job state and validate `run_id` directly through `&mut **tx` so that all DB operations in the handler share one connection. Also use `i32::try_from` for the status column and populate `JobModel.command` from the same row.

A regression test spawns its own in-memory torc-server with `--threads 1` and asserts that a 4-job batch completes in under 10s. It was verified to fail on unpatched code (reqwest `TimedOut` after ~47s).
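For reference, a minimal sketch of the fixed shape described above: every read in the completion loop goes through the open transaction rather than the pool, the status column is converted fallibly, and a minimal `JobModel` is built from the same row. The field names and the error variant are illustrative assumptions, not the repo's exact code.

```rust
// Sketch only: `Error::InvalidStatus` and the JobModel fields are assumed.
let row = sqlx::query!(
    "SELECT workflow_id, name, command, status FROM job WHERE id = ?",
    job_id
)
.fetch_one(&mut **tx) // same connection as the UPDATEs, so no cross-connection lock wait
.await?;

// SQLite INTEGER columns decode as i64; the model field is i32, so convert
// fallibly rather than truncating with `as`.
let status = i32::try_from(row.status).map_err(|_| Error::InvalidStatus(row.status))?;

// Minimal model: only the fields the completion record needs, no full reload.
let job = JobModel {
    id: job_id,
    workflow_id: row.workflow_id,
    name: row.name,
    command: row.command,
    status,
    ..JobModel::default()
};
```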
Force-pushed from a1a6dcc to 555d10d.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```rust
let job_row = match sqlx::query!(
    "SELECT workflow_id, name, command, status FROM job WHERE id = ?",
    id
)
.fetch_optional(&mut **tx)
.await
```
The new `sqlx::query!` added here will require updating the checked-in SQLx offline metadata (`.sqlx/query-*.json`) so that `SQLX_OFFLINE=true` builds (e.g. release/publish workflows) continue to compile. Please run `cargo sqlx prepare` (or the repo's equivalent) and commit the generated query descriptor for this SELECT.
```rust
let start = Instant::now();
let response = apis::workflows_api::batch_complete_jobs(config, workflow_id, request)
    .expect("batch_complete_jobs failed");
```
This regression test measures elapsed time only after `batch_complete_jobs(...)` returns, but the generated OpenAPI client's `reqwest::blocking::Client` has no request timeout by default. If the deadlock regresses, the call may hang indefinitely and stall CI. Consider constructing `Configuration` with a blocking reqwest client that has a hard timeout (e.g., 10–15s), or running the request in a separate thread and failing the test after a join timeout.
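A sketch of the first option, assuming the generated client follows the usual openapi-generator Rust layout where `Configuration` exposes a public `client` field; the names here are illustrative:

```rust
use std::time::Duration;

// Build a blocking client with a hard timeout so a regressed deadlock fails
// the test quickly instead of hanging CI.
let http_client = reqwest::blocking::Client::builder()
    .timeout(Duration::from_secs(15))
    .build()
    .expect("failed to build reqwest client");

let config = apis::configuration::Configuration {
    base_path: server_url.clone(), // e.g. the spawned torc-server's address
    client: http_client,
    ..Default::default()
};
```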
```rust
// Inline run_id validation against tx for the same reason: validate_run_id uses a
// fresh pool connection and would deadlock against the in-flight transaction.
let workflow_run_id_row = sqlx::query!(
    "SELECT run_id FROM workflow_status WHERE id = ?",
    job_workflow_id
)
.fetch_optional(&mut **tx)
```
The run_id check logic here now duplicates `validate_run_id(...)` in runtime_support.rs (query + mismatch/none handling). To avoid future divergence, consider factoring out a `validate_run_id_with_executor(executor, workflow_id, run_id)` helper that can run against either the pool or a transaction.
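One possible shape for that helper, sketched around sqlx's `Executor` trait; the error variants and column types are placeholders, not the repo's actual definitions:

```rust
use sqlx::{Executor, Sqlite};

// Generic over the executor so callers can pass either `&pool` or `&mut **tx`.
async fn validate_run_id_with_executor<'e, E>(
    executor: E,
    workflow_id: i64,
    run_id: i64,
) -> Result<(), Error>
where
    E: Executor<'e, Database = Sqlite>,
{
    let row = sqlx::query!(
        "SELECT run_id FROM workflow_status WHERE id = ?",
        workflow_id
    )
    .fetch_optional(executor)
    .await
    .map_err(Error::Db)?;

    match row {
        Some(r) if r.run_id == run_id => Ok(()),
        Some(r) => Err(Error::RunIdMismatch { expected: r.run_id, actual: run_id }),
        None => Err(Error::WorkflowNotFound(workflow_id)),
    }
}
```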
The v0.24.1 deadlock fix in `batch_complete_jobs` (NatLabRockies#287) moved the per-job SELECT inside the same transaction as its writes. With `pool.begin()` issuing `BEGIN DEFERRED`, that SELECT now acquires a WAL read snapshot, so the subsequent UPDATE/INSERT in the same transaction can fail immediately with `SQLITE_BUSY_SNAPSHOT` (517) when another connection commits in between; `busy_timeout` does not retry that error. Two other handlers had the same shape: `process_workflow_unblocks_inner` (masked by its own retry loop) and `update_jobs_from_completion_reversal`.

Server changes:
- `transport_batch_complete_jobs`, `process_workflow_unblocks_inner`, and `update_jobs_from_completion_reversal` now use `begin_immediate` so the write lock is acquired up front and `busy_timeout` applies.
- In-tx error sites in `apply_job_completion_state_tx` and the begin/commit wrappers in `transport_batch_complete_jobs` now use `database_lock_aware_error` so lock contention propagates to the client and logs at debug instead of error.

Client changes:
- `send_with_retries` now does a fast-retry phase for database lock errors (up to 6 attempts, 50ms -> 2s exponential backoff) before falling through to the existing 30s ping-and-wait loop, which remains the right behavior for genuine outages. Previously, a transient lock error cost ~30s of throughput per occurrence because the loop unconditionally slept `PING_INTERVAL_SECONDS` before retrying.
- New `is_database_lock_error` helper, also matched by `is_retryable_error`, so the lock substring (now propagated by the server) reliably enters the retry path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
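To illustrate the client-side fast-retry phase described above, a hedged sketch: `send_once`, `is_database_lock_error`, and the request/response/error types are stand-ins for the real `send_with_retries` internals, not its actual signature.

```rust
use std::{thread, time::Duration};

const FAST_RETRY_ATTEMPTS: u32 = 6;

// Fast-retry phase for transient database lock errors: 50ms -> 2s exponential
// backoff instead of sleeping the full 30s ping interval on every failure.
fn send_with_fast_retries(request: &Request) -> Result<Response, ClientError> {
    let mut delay = Duration::from_millis(50);
    let mut last_err = None;

    for _ in 0..FAST_RETRY_ATTEMPTS {
        match send_once(request) {
            Err(err) if is_database_lock_error(&err) => {
                last_err = Some(err);
                thread::sleep(delay);
                // Sleeps of 50, 100, 200, 400, 800, 1600ms, capped at 2s.
                delay = (delay * 2).min(Duration::from_secs(2));
            }
            other => return other, // success, or a non-lock error for the outer loop
        }
    }

    // All fast attempts hit lock contention; fall through to the existing
    // 30s ping-and-wait loop (not shown) with the last error in hand.
    Err(last_err.expect("at least one attempt ran"))
}
```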