
Fix SQLITE_BUSY_SNAPSHOT in completion and unblock paths#289

Merged
daniel-thom merged 1 commit into main from fix/sqlite-snapshot-completion-path on Apr 28, 2026

Conversation

daniel-thom (Collaborator) commented:

The v0.24.1 deadlock fix in batch_complete_jobs (#287) moved the per-job SELECT inside the same transaction as its writes. With pool.begin() issuing BEGIN DEFERRED, that SELECT now acquires a WAL read snapshot, so the subsequent UPDATE/INSERT in the same tx can fail immediately with SQLITE_BUSY_SNAPSHOT (517) when another connection commits in between. busy_timeout does not retry that error. Two other handlers had the same shape: process_workflow_unblocks_inner (masked by its own retry loop) and update_jobs_from_completion_reversal.
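For context on the 517 in the paragraph above: SQLITE_BUSY_SNAPSHOT is an extended result code layered on SQLITE_BUSY (5), which is why handling keyed only on the primary busy code can miss it. A standalone sanity check of the code layout (illustrative, not from this PR):

```rust
// SQLite extended result codes pack the primary code in the low byte and a
// sub-code in the higher bits: extended = primary | (sub << 8).
const SQLITE_BUSY: i32 = 5;
const SQLITE_BUSY_SNAPSHOT: i32 = SQLITE_BUSY | (2 << 8); // 517

fn main() {
    // The primary code recovered from the extended code is plain SQLITE_BUSY.
    assert_eq!(SQLITE_BUSY_SNAPSHOT & 0xff, SQLITE_BUSY);
    assert_eq!(SQLITE_BUSY_SNAPSHOT, 517);
    println!("SQLITE_BUSY_SNAPSHOT = {SQLITE_BUSY_SNAPSHOT}");
}
```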

Server changes:

  • transport_batch_complete_jobs, process_workflow_unblocks_inner, and update_jobs_from_completion_reversal now use begin_immediate so the write lock is acquired up front and busy_timeout applies.
  • In-tx error sites in apply_job_completion_state_tx and the begin/ commit wrappers in transport_batch_complete_jobs now use database_lock_aware_error so lock contention propagates to the client and logs at debug instead of error.

Client changes:

  • send_with_retries now does a fast-retry phase for database lock errors (up to 6 attempts, 50ms->2s exponential backoff) before falling through to the existing 30s ping-and-wait loop, which stays as the right behavior for genuine outages. Previously a transient lock error cost ~30s of throughput per occurrence because the loop unconditionally slept PING_INTERVAL_SECONDS before retrying.
  • New is_database_lock_error helper, also matched by is_retryable_error so the lock substring (now propagated by the server) reliably enters the retry path.
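The fast-retry budget is small relative to the old 30 s sleep. A standalone sketch of the delay schedule implied by the constants above (constant names mirror the PR; the arithmetic is just illustrative):

```rust
// Reproduce the fast-retry backoff schedule: 6 attempts, starting at 50 ms,
// doubling each time, capped at 2000 ms.
const FAST_RETRY_ATTEMPTS: u32 = 6;
const FAST_RETRY_INITIAL_MS: u64 = 50;
const FAST_RETRY_MAX_MS: u64 = 2000;

fn schedule() -> Vec<u64> {
    let mut delays = Vec::new();
    let mut delay_ms = FAST_RETRY_INITIAL_MS;
    for _ in 1..=FAST_RETRY_ATTEMPTS {
        delays.push(delay_ms);
        delay_ms = delay_ms.saturating_mul(2).min(FAST_RETRY_MAX_MS);
    }
    delays
}

fn main() {
    let delays = schedule();
    assert_eq!(delays, vec![50, 100, 200, 400, 800, 1600]);
    // Worst case, the fast phase costs about 3.15 s before the slow
    // ping-and-wait loop takes over, versus ~30 s per occurrence previously.
    assert_eq!(delays.iter().sum::<u64>(), 3150);
}
```

So even a fully exhausted fast-retry phase adds only a few seconds before the existing outage handling kicks in.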

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment:


Pull request overview

Addresses transient SQLite SQLITE_BUSY_SNAPSHOT failures introduced by moving read-before-write logic into a single deferred transaction, and improves end-to-end retry behavior for lock contention.

Changes:

  • Switch selected server transaction starts to BEGIN IMMEDIATE via begin_immediate to avoid WAL snapshot conflicts and ensure busy_timeout applies.
  • Use database_lock_aware_error at key in-transaction error sites in the completion path so lock contention propagates cleanly to clients.
  • Add a client-side fast-retry phase for detected database lock errors before falling back to the existing slower ping-and-wait retry loop.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Reviewed files:
- src/server/http_server/unblock_processing.rs — uses begin_immediate for unblock-processing transactions to prevent snapshot-conflict lock errors.
- src/server/http_server/lifecycle_support.rs — uses begin_immediate in the completion reversal path to avoid SQLITE_BUSY_SNAPSHOT during recursive updates.
- src/server/http_server/jobs_transport.rs — wraps more DB errors with database_lock_aware_error and uses begin_immediate for the batch completion transaction start/commit.
- src/client/utils.rs — adds database-lock detection plus fast retry/backoff logic and expands unit tests for the error classification helpers.
Comments suppressed due to low confidence (1)

src/server/http_server/lifecycle_support.rs:409

  • The new BEGIN IMMEDIATE usage here still returns a generic ApiError("Database error") on begin failure. If begin_immediate fails due to SQLite lock contention (e.g., timeout while waiting for the write lock), the client won’t see a lock marker and therefore won’t enter the fast-retry path introduced in this PR. Consider mapping the begin error through database_lock_aware_error (and keeping lock contention logs at debug) so transient lock contention can be retried consistently.
        // BEGIN IMMEDIATE: the first statement on this tx is a SELECT, which would
        // acquire a WAL read snapshot under BEGIN DEFERRED and then risk
        // SQLITE_BUSY_SNAPSHOT (517) on the recursive UPDATE that follows.
        // See server/api.rs::begin_immediate.
        let mut tx = match begin_immediate(&self.pool).await {
            Ok(tx) => tx,
            Err(e) => {
                error!("Failed to begin transaction for completion reversal: {}", e);
                return Err(ApiError("Database error".to_string()));
            }
        };
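The PR does not show database_lock_aware_error itself, so the helper body below is hypothetical, inferred from the description (keep a recognizable lock substring in the client-visible message, log contention at debug rather than error). A standalone sketch of the shape the suggestion implies for this begin site:

```rust
#[derive(Debug, PartialEq)]
struct ApiError(String);

// Hypothetical stand-in for the PR's database_lock_aware_error: classify the
// underlying error text so lock contention keeps a marker the client's
// fast-retry path can match, and is logged at debug instead of error.
fn database_lock_aware_error(context: &str, err_text: &str) -> ApiError {
    let lower = err_text.to_ascii_lowercase();
    if lower.contains("database is locked") || lower.contains("database table is locked") {
        // debug!-level in the real code; plain eprintln! here to stay std-only.
        eprintln!("debug: {context}: {err_text}");
        ApiError(format!("Database error: {err_text}"))
    } else {
        eprintln!("error: {context}: {err_text}");
        ApiError("Database error".to_string())
    }
}

fn main() {
    let e = database_lock_aware_error("begin completion reversal", "database is locked");
    // The lock marker survives into the client-visible message.
    assert!(e.0.contains("database is locked"));
    let e = database_lock_aware_error("begin completion reversal", "disk I/O error");
    assert_eq!(e.0, "Database error");
}
```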


Comment thread src/client/utils.rs
Comment on lines +118 to +122:

/// Check whether an error string indicates SQLite lock contention on the server.
/// Lock errors typically clear in milliseconds, so the caller can retry quickly
/// instead of falling back to the slow ping-and-wait path used for outages.
fn is_database_lock_error(error_str: &str) -> bool {
    let s = error_str.to_ascii_lowercase();

Copilot AI Apr 28, 2026


is_retryable_error lowercases error_str into s, but then calls is_database_lock_error(&s) which lowercases again (allocating a new String). This adds avoidable allocations on the error path. Consider changing is_database_lock_error to operate on an already-lowercased &str (or accept a precomputed lowercase) so is_retryable_error can reuse s without re-allocating.

Suggested change:

-/// Check whether an error string indicates SQLite lock contention on the server.
-/// Lock errors typically clear in milliseconds, so the caller can retry quickly
-/// instead of falling back to the slow ping-and-wait path used for outages.
-fn is_database_lock_error(error_str: &str) -> bool {
-    let s = error_str.to_ascii_lowercase();
+/// Check whether a lowercased error string indicates SQLite lock contention on the server.
+/// Lock errors typically clear in milliseconds, so the caller can retry quickly
+/// instead of falling back to the slow ping-and-wait path used for outages.
+fn is_database_lock_error(s: &str) -> bool {
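Concretely, the suggestion lets the caller lowercase once and reuse the buffer. A standalone sketch (the substrings and the extra retryable cases here are illustrative; the real matchers list more):

```rust
/// Operates on an already-lowercased error string, per the review suggestion.
fn is_database_lock_error(s: &str) -> bool {
    s.contains("database is locked") || s.contains("database table is locked")
}

fn is_retryable_error(error_str: &str) -> bool {
    let s = error_str.to_ascii_lowercase(); // single allocation on the error path
    // Reuse `s` rather than lowercasing again inside the helper.
    is_database_lock_error(&s) || s.contains("connection refused") || s.contains("timed out")
}

fn main() {
    assert!(is_retryable_error("Database error: DATABASE IS LOCKED"));
    assert!(is_retryable_error("connection refused"));
    assert!(!is_retryable_error("permission denied"));
}
```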

Comment thread src/client/utils.rs
Comment on lines 141 to +187
pub fn send_with_retries<T, E, F>(
    config: &Configuration,
    mut api_call: F,
    wait_for_healthy_database_minutes: u64,
) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    E: std::fmt::Display,
{
    // Fast-retry phase for SQLite lock contention. Lock errors typically clear in
    // milliseconds; the slow ping-and-wait loop below assumes the server is down
    // and would waste throughput by sleeping 30s before each retry. We try a few
    // times with short exponential backoff before falling through.
    const FAST_RETRY_ATTEMPTS: u32 = 6;
    const FAST_RETRY_INITIAL_MS: u64 = 50;
    const FAST_RETRY_MAX_MS: u64 = 2000;

    let mut e = match api_call() {
        Ok(result) => return Ok(result),
        Err(e) => e,
    };

    if is_database_lock_error(&e.to_string()) {
        let mut delay_ms = FAST_RETRY_INITIAL_MS;
        for attempt in 1..=FAST_RETRY_ATTEMPTS {
            thread::sleep(Duration::from_millis(delay_ms));
            match api_call() {
                Ok(result) => {
                    debug!(
                        "Recovered from database lock after {} fast retries",
                        attempt
                    );
                    return Ok(result);
                }
                Err(retry_err) => {
                    if !is_database_lock_error(&retry_err.to_string()) {
                        // Different error class; stop fast-retrying and let the
                        // generic path handle it.
                        e = retry_err;
                        break;
                    }
                    e = retry_err;
                    delay_ms = (delay_ms.saturating_mul(2)).min(FAST_RETRY_MAX_MS);
                }
            }
        }
    }
    // ... (falls through to the existing ping-and-wait retry loop)

Copilot AI Apr 28, 2026


send_with_retries now has non-trivial fast-retry/backoff behavior for database lock errors, but there are no unit tests covering the new control flow (fast retry success, fast retry falling through to the slow loop, and stopping fast retries when the error class changes). Adding targeted tests would help prevent regressions; you may want to make the sleep/backoff parameters injectable under cfg(test) so tests don’t actually block for 50ms+ per attempt.
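A test along these lines could drive api_call with a scripted closure. This sketch re-implements only the fast-retry loop (delays removed, lock detection injected) rather than the real send_with_retries, to show the control flow a test would pin down:

```rust
// Simplified fast-retry loop mirroring the control flow under test:
// retry on lock errors, stop as soon as the error class changes.
fn fast_retry<T, F>(mut api_call: F, is_lock: fn(&str) -> bool) -> Result<T, String>
where
    F: FnMut() -> Result<T, String>,
{
    let mut e = match api_call() {
        Ok(r) => return Ok(r),
        Err(e) => e,
    };
    if is_lock(&e) {
        for _attempt in 1..=6 {
            // The real code sleeps with exponential backoff here.
            match api_call() {
                Ok(r) => return Ok(r),
                Err(retry_err) => {
                    let stop = !is_lock(&retry_err);
                    e = retry_err;
                    if stop {
                        break;
                    }
                }
            }
        }
    }
    Err(e) // the real code falls through to the slow ping-and-wait loop
}

fn main() {
    let is_lock = |s: &str| s.contains("locked");

    // 1. Recovers after two consecutive lock errors.
    let mut calls = 0;
    let r = fast_retry(
        || { calls += 1; if calls < 3 { Err("locked".into()) } else { Ok(42) } },
        is_lock,
    );
    assert_eq!(r, Ok(42));

    // 2. Stops fast-retrying when the error class changes.
    let mut calls = 0;
    let r: Result<i32, _> = fast_retry(
        || { calls += 1; if calls == 1 { Err("locked".into()) } else { Err("refused".into()) } },
        is_lock,
    );
    assert_eq!(r, Err("refused".to_string()));
    assert_eq!(calls, 2);
}
```

Making the sleep injectable like this keeps such tests fast, per the review's suggestion.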

@daniel-thom daniel-thom merged commit 04b6785 into main Apr 28, 2026
13 checks passed
@daniel-thom daniel-thom deleted the fix/sqlite-snapshot-completion-path branch April 28, 2026 00:30