
Fix SQLITE_BUSY_SNAPSHOT in completion and unblock paths#289

Merged
daniel-thom merged 1 commit into main from fix/sqlite-snapshot-completion-path on Apr 28, 2026

Conversation

daniel-thom (Collaborator) commented:

The v0.24.1 deadlock fix in batch_complete_jobs (#287) moved the per-job SELECT inside the same transaction as its writes. With pool.begin() issuing BEGIN DEFERRED, that SELECT now acquires a WAL read snapshot, so the subsequent UPDATE/INSERT in the same tx can fail immediately with SQLITE_BUSY_SNAPSHOT (517) when another connection commits in between. busy_timeout does not retry that error. Two other handlers had the same shape: process_workflow_unblocks_inner (masked by its own retry loop) and update_jobs_from_completion_reversal.
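For context on the 517 in the paragraph above: SQLITE_BUSY_SNAPSHOT is an extended result code layered on SQLITE_BUSY (5), which is why handling keyed only on the primary busy code can miss it. A standalone sanity check of the code layout (illustrative, not from this PR):

```rust
// SQLite extended result codes pack the primary code in the low byte and a
// sub-code in the higher bits: extended = primary | (sub << 8).
const SQLITE_BUSY: i32 = 5;
const SQLITE_BUSY_SNAPSHOT: i32 = SQLITE_BUSY | (2 << 8); // 517

fn main() {
    // The primary code recovered from the extended code is plain SQLITE_BUSY.
    assert_eq!(SQLITE_BUSY_SNAPSHOT & 0xff, SQLITE_BUSY);
    assert_eq!(SQLITE_BUSY_SNAPSHOT, 517);
    println!("SQLITE_BUSY_SNAPSHOT = {SQLITE_BUSY_SNAPSHOT}");
}
```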

Server changes:

  • transport_batch_complete_jobs, process_workflow_unblocks_inner, and update_jobs_from_completion_reversal now use begin_immediate so the write lock is acquired up front and busy_timeout applies.
  • In-tx error sites in apply_job_completion_state_tx and the begin/ commit wrappers in transport_batch_complete_jobs now use database_lock_aware_error so lock contention propagates to the client and logs at debug instead of error.

Client changes:

  • send_with_retries now does a fast-retry phase for database lock errors (up to 6 attempts, 50ms->2s exponential backoff) before falling through to the existing 30s ping-and-wait loop, which stays as the right behavior for genuine outages. Previously a transient lock error cost ~30s of throughput per occurrence because the loop unconditionally slept PING_INTERVAL_SECONDS before retrying.
  • New is_database_lock_error helper, also matched by is_retryable_error so the lock substring (now propagated by the server) reliably enters the retry path.
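The fast-retry budget is small relative to the old 30 s sleep. A standalone sketch of the delay schedule implied by the constants above (constant names mirror the PR; the arithmetic is just illustrative):

```rust
// Reproduce the fast-retry backoff schedule: 6 attempts, starting at 50 ms,
// doubling each time, capped at 2000 ms.
const FAST_RETRY_ATTEMPTS: u32 = 6;
const FAST_RETRY_INITIAL_MS: u64 = 50;
const FAST_RETRY_MAX_MS: u64 = 2000;

fn schedule() -> Vec<u64> {
    let mut delays = Vec::new();
    let mut delay_ms = FAST_RETRY_INITIAL_MS;
    for _ in 1..=FAST_RETRY_ATTEMPTS {
        delays.push(delay_ms);
        delay_ms = delay_ms.saturating_mul(2).min(FAST_RETRY_MAX_MS);
    }
    delays
}

fn main() {
    let delays = schedule();
    assert_eq!(delays, vec![50, 100, 200, 400, 800, 1600]);
    // Worst case, the fast phase costs about 3.15 s before the slow
    // ping-and-wait loop takes over, versus ~30 s per occurrence previously.
    assert_eq!(delays.iter().sum::<u64>(), 3150);
}
```

So even a fully exhausted fast-retry phase adds only a few seconds before the existing outage handling kicks in.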

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment:


Pull request overview

Addresses transient SQLite SQLITE_BUSY_SNAPSHOT failures introduced by moving read-before-write logic into a single deferred transaction, and improves end-to-end retry behavior for lock contention.

Changes:

  • Switch selected server transaction starts to BEGIN IMMEDIATE via begin_immediate to avoid WAL snapshot conflicts and ensure busy_timeout applies.
  • Use database_lock_aware_error at key in-transaction error sites in the completion path so lock contention propagates cleanly to clients.
  • Add a client-side fast-retry phase for detected database lock errors before falling back to the existing slower ping-and-wait retry loop.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Reviewed files:
- src/server/http_server/unblock_processing.rs — uses begin_immediate for unblock-processing transactions to prevent snapshot-conflict lock errors.
- src/server/http_server/lifecycle_support.rs — uses begin_immediate in the completion reversal path to avoid SQLITE_BUSY_SNAPSHOT during recursive updates.
- src/server/http_server/jobs_transport.rs — wraps more DB errors with database_lock_aware_error and uses begin_immediate for the batch completion transaction start/commit.
- src/client/utils.rs — adds database-lock detection plus fast retry/backoff logic and expands unit tests for the error classification helpers.
Comments suppressed due to low confidence (1)

src/server/http_server/lifecycle_support.rs:409

  • The new BEGIN IMMEDIATE usage here still returns a generic ApiError("Database error") on begin failure. If begin_immediate fails due to SQLite lock contention (e.g., timeout while waiting for the write lock), the client won’t see a lock marker and therefore won’t enter the fast-retry path introduced in this PR. Consider mapping the begin error through database_lock_aware_error (and keeping lock contention logs at debug) so transient lock contention can be retried consistently.
        // BEGIN IMMEDIATE: the first statement on this tx is a SELECT, which would
        // acquire a WAL read snapshot under BEGIN DEFERRED and then risk
        // SQLITE_BUSY_SNAPSHOT (517) on the recursive UPDATE that follows.
        // See server/api.rs::begin_immediate.
        let mut tx = match begin_immediate(&self.pool).await {
            Ok(tx) => tx,
            Err(e) => {
                error!("Failed to begin transaction for completion reversal: {}", e);
                return Err(ApiError("Database error".to_string()));
            }
        };
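The PR does not show database_lock_aware_error itself, so the helper body below is hypothetical, inferred from the description (keep a recognizable lock substring in the client-visible message, log contention at debug rather than error). A standalone sketch of the shape the suggestion implies for this begin site:

```rust
#[derive(Debug, PartialEq)]
struct ApiError(String);

// Hypothetical stand-in for the PR's database_lock_aware_error: classify the
// underlying error text so lock contention keeps a marker the client's
// fast-retry path can match, and is logged at debug instead of error.
fn database_lock_aware_error(context: &str, err_text: &str) -> ApiError {
    let lower = err_text.to_ascii_lowercase();
    if lower.contains("database is locked") || lower.contains("database table is locked") {
        // debug!-level in the real code; plain eprintln! here to stay std-only.
        eprintln!("debug: {context}: {err_text}");
        ApiError(format!("Database error: {err_text}"))
    } else {
        eprintln!("error: {context}: {err_text}");
        ApiError("Database error".to_string())
    }
}

fn main() {
    let e = database_lock_aware_error("begin completion reversal", "database is locked");
    // The lock marker survives into the client-visible message.
    assert!(e.0.contains("database is locked"));
    let e = database_lock_aware_error("begin completion reversal", "disk I/O error");
    assert_eq!(e.0, "Database error");
}
```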


Comment thread src/client/utils.rs
Comment on lines +118 to +122:

/// Check whether an error string indicates SQLite lock contention on the server.
/// Lock errors typically clear in milliseconds, so the caller can retry quickly
/// instead of falling back to the slow ping-and-wait path used for outages.
fn is_database_lock_error(error_str: &str) -> bool {
    let s = error_str.to_ascii_lowercase();

Copilot AI Apr 28, 2026


is_retryable_error lowercases error_str into s, but then calls is_database_lock_error(&s) which lowercases again (allocating a new String). This adds avoidable allocations on the error path. Consider changing is_database_lock_error to operate on an already-lowercased &str (or accept a precomputed lowercase) so is_retryable_error can reuse s without re-allocating.

Suggested change:

-/// Check whether an error string indicates SQLite lock contention on the server.
-/// Lock errors typically clear in milliseconds, so the caller can retry quickly
-/// instead of falling back to the slow ping-and-wait path used for outages.
-fn is_database_lock_error(error_str: &str) -> bool {
-    let s = error_str.to_ascii_lowercase();
+/// Check whether a lowercased error string indicates SQLite lock contention on the server.
+/// Lock errors typically clear in milliseconds, so the caller can retry quickly
+/// instead of falling back to the slow ping-and-wait path used for outages.
+fn is_database_lock_error(s: &str) -> bool {
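Concretely, the suggestion lets the caller lowercase once and reuse the buffer. A standalone sketch (the substrings and the extra retryable cases here are illustrative; the real matchers list more):

```rust
/// Operates on an already-lowercased error string, per the review suggestion.
fn is_database_lock_error(s: &str) -> bool {
    s.contains("database is locked") || s.contains("database table is locked")
}

fn is_retryable_error(error_str: &str) -> bool {
    let s = error_str.to_ascii_lowercase(); // single allocation on the error path
    // Reuse `s` rather than lowercasing again inside the helper.
    is_database_lock_error(&s) || s.contains("connection refused") || s.contains("timed out")
}

fn main() {
    assert!(is_retryable_error("Database error: DATABASE IS LOCKED"));
    assert!(is_retryable_error("connection refused"));
    assert!(!is_retryable_error("permission denied"));
}
```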

Comment thread src/client/utils.rs
Comment on lines 141 to +187
pub fn send_with_retries<T, E, F>(
    config: &Configuration,
    mut api_call: F,
    wait_for_healthy_database_minutes: u64,
) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    E: std::fmt::Display,
{
    // Fast-retry phase for SQLite lock contention. Lock errors typically clear in
    // milliseconds; the slow ping-and-wait loop below assumes the server is down
    // and would waste throughput by sleeping 30s before each retry. We try a few
    // times with short exponential backoff before falling through.
    const FAST_RETRY_ATTEMPTS: u32 = 6;
    const FAST_RETRY_INITIAL_MS: u64 = 50;
    const FAST_RETRY_MAX_MS: u64 = 2000;

    let mut e = match api_call() {
        Ok(result) => return Ok(result),
        Err(e) => e,
    };

    if is_database_lock_error(&e.to_string()) {
        let mut delay_ms = FAST_RETRY_INITIAL_MS;
        for attempt in 1..=FAST_RETRY_ATTEMPTS {
            thread::sleep(Duration::from_millis(delay_ms));
            match api_call() {
                Ok(result) => {
                    debug!(
                        "Recovered from database lock after {} fast retries",
                        attempt
                    );
                    return Ok(result);
                }
                Err(retry_err) => {
                    if !is_database_lock_error(&retry_err.to_string()) {
                        // Different error class; stop fast-retrying and let the
                        // generic path handle it.
                        e = retry_err;
                        break;
                    }
                    e = retry_err;
                    delay_ms = (delay_ms.saturating_mul(2)).min(FAST_RETRY_MAX_MS);
                }
            }
        }
    }
    // ... (falls through to the existing ping-and-wait retry loop)

Copilot AI Apr 28, 2026


send_with_retries now has non-trivial fast-retry/backoff behavior for database lock errors, but there are no unit tests covering the new control flow (fast retry success, fast retry falling through to the slow loop, and stopping fast retries when the error class changes). Adding targeted tests would help prevent regressions; you may want to make the sleep/backoff parameters injectable under cfg(test) so tests don’t actually block for 50ms+ per attempt.
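A test along these lines could drive api_call with a scripted closure. This sketch re-implements only the fast-retry loop (delays removed, lock detection injected) rather than the real send_with_retries, to show the control flow a test would pin down:

```rust
// Simplified fast-retry loop mirroring the control flow under test:
// retry on lock errors, stop as soon as the error class changes.
fn fast_retry<T, F>(mut api_call: F, is_lock: fn(&str) -> bool) -> Result<T, String>
where
    F: FnMut() -> Result<T, String>,
{
    let mut e = match api_call() {
        Ok(r) => return Ok(r),
        Err(e) => e,
    };
    if is_lock(&e) {
        for _attempt in 1..=6 {
            // The real code sleeps with exponential backoff here.
            match api_call() {
                Ok(r) => return Ok(r),
                Err(retry_err) => {
                    let stop = !is_lock(&retry_err);
                    e = retry_err;
                    if stop {
                        break;
                    }
                }
            }
        }
    }
    Err(e) // the real code falls through to the slow ping-and-wait loop
}

fn main() {
    let is_lock = |s: &str| s.contains("locked");

    // 1. Recovers after two consecutive lock errors.
    let mut calls = 0;
    let r = fast_retry(
        || { calls += 1; if calls < 3 { Err("locked".into()) } else { Ok(42) } },
        is_lock,
    );
    assert_eq!(r, Ok(42));

    // 2. Stops fast-retrying when the error class changes.
    let mut calls = 0;
    let r: Result<i32, _> = fast_retry(
        || { calls += 1; if calls == 1 { Err("locked".into()) } else { Err("refused".into()) } },
        is_lock,
    );
    assert_eq!(r, Err("refused".to_string()));
    assert_eq!(calls, 2);
}
```

Making the sleep injectable like this keeps such tests fast, per the review's suggestion.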

@daniel-thom daniel-thom merged commit 04b6785 into main Apr 28, 2026
13 checks passed
@daniel-thom daniel-thom deleted the fix/sqlite-snapshot-completion-path branch April 28, 2026 00:30