
fix: Prevent SQLite corruption in parallel mode with atomic operations#108

Closed
cabana8471-arch wants to merge 11 commits into AutoForgeAI:master from cabana8471-arch:fix/sqlite-parallel-corruption

Conversation

Contributor

@cabana8471-arch cabana8471-arch commented Jan 26, 2026

Summary

  • Fixes database corruption (sqlite3.DatabaseError: database disk image is malformed) when running multiple agents in parallel
  • Replaces thread-based locking (ineffective cross-process) with atomic SQL operations
  • Adds proper engine cleanup with WAL checkpointing on shutdown

Fixes #106

Root Causes Addressed

  1. threading.Lock() useless cross-process - Removed and replaced with atomic SQL
  2. Read-modify-write races - Fixed with UPDATE ... WHERE clauses
  3. Engine never disposed - Added cleanup() method with signal handlers
  4. No WAL checkpoint - Added PRAGMA wal_checkpoint(FULL) before dispose
  5. assistant.db cache race - Added thread-safe locking

Changes

api/database.py

  • Added atomic_transaction() context manager with BEGIN IMMEDIATE/EXCLUSIVE
  • Added atomic_claim_feature(), atomic_mark_passing(), atomic_update_priority_to_end()
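For illustration, a minimal sketch of what such a context manager can look like, using raw sqlite3 rather than the PR's SQLAlchemy session (the db_path parameter and mode handling here are assumptions, not the actual implementation):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def atomic_transaction(db_path, mode="IMMEDIATE"):
    """Open a transaction that takes SQLite's write lock up front.

    BEGIN IMMEDIATE acquires the reserved (write) lock at transaction
    start, so two processes cannot both read stale state and then race
    on the write.
    """
    conn = sqlite3.connect(db_path, timeout=30)
    conn.isolation_level = None  # disable pysqlite's implicit BEGIN
    try:
        conn.execute(f"BEGIN {mode}")
        yield conn
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    finally:
        conn.close()
```

The key design point is that the lock is taken before any reads inside the block, which is what closes the read-modify-write window between cooperating processes.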

mcp_server/feature_mcp.py

  • Removed useless _priority_lock = threading.Lock()
  • Fixed all high-risk tools with atomic operations:
    • feature_skip - atomic priority update with subquery
    • feature_mark_in_progress - atomic UPDATE WHERE in_progress=0
    • feature_claim_and_get - atomic claim
    • feature_create_bulk - EXCLUSIVE transaction for priority range
    • feature_add_dependency - IMMEDIATE transaction for cycle detection
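The atomic pattern these tools share can be sketched as follows. This is an illustrative raw-sqlite3 version; the column names (in_progress, passes) are taken from the PR description, and the real tools wrap a SQLAlchemy session:

```python
import sqlite3

def atomic_claim_feature(conn: sqlite3.Connection, feature_id: int) -> bool:
    """Claim a feature only if nobody else has.

    The WHERE clause makes the check-and-set a single atomic statement
    inside SQLite, so it is safe across processes without any Python lock.
    """
    cur = conn.execute(
        "UPDATE features SET in_progress = 1 "
        "WHERE id = ? AND in_progress = 0 AND passes = 0",
        (feature_id,),
    )
    conn.commit()
    return cur.rowcount == 1  # exactly one caller can win the claim
```

Callers branch on the rowcount instead of reading first and writing second, which is what eliminates the race.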

parallel_orchestrator.py

  • Added cleanup() method with WAL checkpoint before dispose
  • Added atexit and signal handlers (SIGTERM, SIGINT)
  • Always calls cleanup in finally block
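The shape of that cleanup can be sketched like this, using a raw sqlite3 connection in place of the PR's SQLAlchemy engine (class and attribute names here are illustrative, and signal wiring is omitted):

```python
import atexit
import sqlite3

class Orchestrator:
    """Owns the database connection for the orchestrator process."""

    def __init__(self, db_path: str):
        self._conn = sqlite3.connect(db_path)
        # Cover normal interpreter exit; signal handlers should only set
        # a shutdown flag and let the main loop reach cleanup() safely.
        atexit.register(self.cleanup)

    def cleanup(self) -> None:
        # Idempotent: swap the handle out first so re-entry is a no-op.
        conn, self._conn = self._conn, None
        if conn is not None:
            # Flush WAL pages into the main database file before closing,
            # so no stale -wal file is left behind for the next process.
            conn.execute("PRAGMA wal_checkpoint(FULL)")
            conn.close()
```

Clearing the handle before doing I/O is what makes repeated calls (finally block plus atexit) harmless.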

progress.py

  • Added _get_connection() helper with timeout=30 and PRAGMA busy_timeout=30000

server/services/assistant_database.py

  • Added threading.Lock() for engine cache
  • Uses double-checked locking pattern
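The double-checked locking pattern mentioned above can be sketched like this (illustrative; _create_engine is a hypothetical stand-in for the real SQLAlchemy engine factory):

```python
import threading

_engine_cache: dict = {}
_cache_lock = threading.Lock()

def _create_engine(db_path: str) -> object:
    # Stand-in for sqlalchemy.create_engine(...); hypothetical factory.
    return object()

def get_engine(db_path: str):
    # Fast path: no lock needed once the engine already exists.
    engine = _engine_cache.get(db_path)
    if engine is None:
        with _cache_lock:
            # Re-check under the lock: another thread may have created
            # the engine between our unlocked read and acquiring the lock.
            engine = _engine_cache.get(db_path)
            if engine is None:
                engine = _create_engine(db_path)
                _engine_cache[db_path] = engine
    return engine
```

The second check under the lock is what prevents two threads from each creating (and leaking) an engine for the same path.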

Test plan

  • Run parallel mode with max concurrency (5 agents)
  • After 60s, check PRAGMA integrity_check returns "ok"
  • Verify no duplicate priorities: SELECT priority, COUNT(*) FROM features GROUP BY priority HAVING COUNT(*) > 1
  • Stop orchestrator and verify process count returns to baseline
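The database checks in this plan can be scripted, for example like this (hypothetical helper; the features table and priority column are taken from the query above):

```python
import sqlite3

def verify_db_health(db_path: str) -> list:
    """Run the test plan's checks: integrity and priority uniqueness."""
    problems = []
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA integrity_check returns a single row "ok" when healthy.
        if conn.execute("PRAGMA integrity_check").fetchone()[0] != "ok":
            problems.append("integrity_check failed")
        dupes = conn.execute(
            "SELECT priority, COUNT(*) FROM features "
            "GROUP BY priority HAVING COUNT(*) > 1"
        ).fetchall()
        if dupes:
            problems.append(f"duplicate priorities: {dupes}")
    finally:
        conn.close()
    return problems
```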

Migration

  • No schema changes - backward compatible
  • No data migration needed
  • Works with existing databases

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Explicit atomic transaction support for safer concurrent writes.
    • Graceful cleanup on exit to release DB resources reliably.
    • DB dependency now yields connections for safer lifecycle management.
  • Bug Fixes

    • Stronger SQLite locking to avoid races and ensure write-lock acquisition.
    • Improved timeout/busy handling and more consistent error responses.
  • Refactor

    • Replaced in-process locks with atomic DB-backed updates.
  • Tests

    • Added concurrency and transaction isolation tests.


Fixes AutoForgeAI#106

## Problem
When running multiple agents in parallel (up to 5), database corruption occurs:
- `sqlite3.DatabaseError: database disk image is malformed`
- `PRAGMA integrity_check` shows btreeInitPage errors

## Root Causes
1. `threading.Lock()` in feature_mcp.py only works per-process, not cross-process
2. Read-modify-write patterns in 6+ MCP tools cause race conditions
3. Orchestrator engine never disposed - connection pool not cleaned up
4. No WAL checkpoint on agent exit causes stale cache issues
5. assistant.db engine cache has no locking

## Solution
- Replace read-modify-write patterns with atomic `UPDATE ... WHERE` clauses
- Use SQLite IMMEDIATE transactions for consistent reads during cycle detection
- Add WAL checkpoint before engine dispose on shutdown
- Add thread-safe locking to assistant.db engine cache
- Add proper connection timeout settings (30s) throughout

## Changes
- api/database.py: Add atomic_transaction(), atomic_claim_feature(), etc.
- mcp_server/feature_mcp.py: Fix all high-risk tools with atomic SQL
- parallel_orchestrator.py: Add cleanup() with WAL checkpoint, signal handlers
- progress.py: Add _get_connection() with timeout settings
- server/services/assistant_database.py: Add thread-safe cache locking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai bot commented Jan 26, 2026

Warning

Rate limit exceeded

@cabana8471-arch has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 26 minutes and 0 seconds before requesting another review.


📝 Walkthrough

Adds SQLAlchemy 2.0-style declarative base and an explicit atomic_transaction(session_maker, isolation_level="IMMEDIATE") context manager, configures SQLite to use BEGIN IMMEDIATE via event hooks, converts many feature operations to atomic SQL updates, improves SQLite parallel-safety timeouts/connections, adds orchestrator cleanup and thread-safe engine caching, and adds concurrency tests.

Changes

  • Database core & API (api/database.py): Added atomic_transaction(session_maker, isolation_level="IMMEDIATE") and _configure_sqlite_immediate_transactions(engine), switched to SQLAlchemy 2.0 DeclarativeBase for Base, made get_db() a generator (Generator[Session, None, None]), and added event hooks to enforce BEGIN IMMEDIATE and PRAGMA setup before Base.metadata.create_all.
  • Feature operations, atomic updates (mcp_server/feature_mcp.py): Replaced in-process locks with atomic SQL UPDATE ... WHERE ... inside atomic_transaction for claim, in_progress, passes, priority, dependencies, bulk create, and related flows; added state refreshes and explicit JSON error responses.
  • Orchestrator lifecycle & cleanup (parallel_orchestrator.py): Added ParallelOrchestrator.cleanup() to checkpoint WAL and dispose the engine (idempotent), registered atexit and signal handlers, ensured cleanup in finally blocks, and widened some parameter types to allow None.
  • SQLite connection helpers & timeouts (progress.py): Added SQLITE_TIMEOUT, SQLITE_BUSY_TIMEOUT_MS, and _get_connection(db_file: Path) to apply timeouts and busy_timeout; replaced direct sqlite3.connect() usages with the helper and context-managed connections.
  • Thread-safe engine cache (server/services/assistant_database.py): Introduced a module-level _cache_lock and double-checked locking in get_engine(); engines created with check_same_thread=False and a 30s timeout, then cached for reuse.
  • Registry / ORM base update (registry.py): Replaced declarative_base() with class Base(DeclarativeBase) to adopt the SQLAlchemy 2.0 declarative base pattern for models.
  • Tests, concurrency & behavior (test_atomic_operations.py): Added extensive concurrency tests validating single-winner claims, concurrent priority updates, transaction isolation/rollback, event hook firing, and orchestrator cleanup() idempotency.
  • Dead/legacy helpers noted (api/database.py comments): Marked some legacy atomic helper functions as dead/relocated in comments and added minor # type: ignore annotations around table creation to satisfy typing.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant MCP as mcp_server.feature_mcp
  participant API as api.database
  participant DB as SQLite
  Client->>MCP: request to claim feature (id)
  MCP->>API: atomic_claim_feature(session_maker, id)
  API->>DB: BEGIN IMMEDIATE
  DB-->>API: lock acquired
  API->>DB: UPDATE features SET in_progress=1 WHERE id=? AND conditions...
  DB-->>API: affected_rows (1 or 0)
  API->>DB: COMMIT
  API-->>MCP: return success/failure with feature payload
  MCP-->>Client: respond with claim result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hopped through rows with PRAGMAs in paw,

Began IMMEDIATE to keep chaos at bay,
Timeouts and checkpoints to guard every draw,
Threads now behave while updates hold sway,
A rabbit nods — the DB is safe today.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title accurately summarizes the main change, preventing SQLite corruption in parallel mode using atomic operations, which is the primary objective of the PR.
  • Linked Issues Check ✅ Passed: the PR addresses all coding requirements from issue #106: replaces threading locks with atomic SQL operations, implements transaction semantics (BEGIN IMMEDIATE), adds engine cleanup with WAL checkpointing, and ensures cross-process-safe database access.
  • Out of Scope Changes Check ✅ Passed: all changes are directly scoped to fixing SQLite corruption (database connection handling, atomic operations, cleanup routines, and concurrency tests). No unrelated refactoring or scope creep detected.
  • Docstring Coverage ✅ Passed: docstring coverage is 84.00%, which meets the required threshold of 80.00%.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
progress.py (1)

65-74: Ensure SQLite connections close on error paths.
If cursor.execute(...) raises, conn.close() is skipped, which can leak file descriptors and keep locks longer than needed. Consider wrapping connections with contextlib.closing (or explicit try/finally) in all three call sites.

♻️ Suggested fix (apply same pattern to the other functions)
+from contextlib import closing
...
-    try:
-        conn = _get_connection(db_file)
-        cursor = conn.cursor()
-        cursor.execute("SELECT COUNT(*) FROM features")
-        count = cursor.fetchone()[0]
-        conn.close()
-        return count > 0
-    except Exception:
-        return False
+    try:
+        with closing(_get_connection(db_file)) as conn:
+            cursor = conn.cursor()
+            cursor.execute("SELECT COUNT(*) FROM features")
+            count = cursor.fetchone()[0]
+            return count > 0
+    except Exception:
+        return False

Also applies to: 93-126, 145-158

🤖 Fix all issues with AI agents
In `@mcp_server/feature_mcp.py`:
- Around line 37-45: The import block is unsorted causing ruff errors; reorder
the names imported from api.database so they are alphabetically sorted (e.g.,
adjust the import that currently lists Feature, create_database,
atomic_transaction to list them in alphabetical order: atomic_transaction,
create_database, Feature) and ensure the overall import group ordering follows
standard (stdlib, third-party, local) so FastMCP, BaseModel, Field, text,
sys.path insertion, then the sorted api.database import and subsequent
dependency_resolver imports.
- Around line 417-426: In the result.rowcount == 0 handling block, remove the
unnecessary f-string prefix on the final error message; change the return
json.dumps({"error": f"Failed to mark feature in-progress for unknown reason"})
to a plain string literal (no f) so it reads return json.dumps({"error": "Failed
to mark feature in-progress for unknown reason"}); keep the other messages that
interpolate feature_id (e.g., the returns for Feature not found, already
passing, already in-progress) as f-strings and leave session.query(Feature) and
feature_id usage unchanged.

cabana8471-arch and others added 2 commits January 26, 2026 22:29
- Sort imports alphabetically in feature_mcp.py (I001)
- Remove unnecessary f-string prefix (F541)
- Use contextlib.closing to ensure SQLite connections close on error paths

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ruff requires classes (uppercase) to come before functions (lowercase):
Feature, atomic_transaction, create_database

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cabana8471-arch added a commit to cabana8471-arch/autocoder that referenced this pull request Jan 27, 2026
Combined:
- Stuck features clearing at orchestrator startup (my-features)
- Signal handlers and atexit cleanup (PR AutoForgeAI#108)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cabana8471-arch added a commit to cabana8471-arch/autocoder that referenced this pull request Jan 27, 2026
…orgeAI#100, AutoForgeAI#108, AutoForgeAI#109, AutoForgeAI#110

PR AutoForgeAI#110 (Quality Gates):
- Move quality checks before DB session to avoid holding locks
- Return error instead of None for missing configured custom script
- Use contextlib.closing for SQLite connections in progress.py

PR AutoForgeAI#109 (Rate Limit):
- Extract rate limit logic to shared rate_limit_utils.py module
- Remove duplicated code from agent.py and test_agent.py

PR AutoForgeAI#108 (SQLite Parallel):
- Sort imports alphabetically in feature_mcp.py

PR AutoForgeAI#100 (Config Diagnostics):
- Add logger.warning for pkill_processes validation failures

PR AutoForgeAI#95 (Infrastructure Mock):
- Add language tags to fenced code blocks in initializer template

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cabana8471-arch added a commit to cabana8471-arch/autocoder that referenced this pull request Jan 27, 2026
cabana8471-arch and others added 2 commits January 28, 2026 22:27
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
getworken pushed a commit to getworken/autocoder that referenced this pull request Jan 29, 2026
@cabana8471-arch cabana8471-arch deleted the fix/sqlite-parallel-corruption branch January 29, 2026 16:35
@cabana8471-arch cabana8471-arch restored the fix/sqlite-parallel-corruption branch January 29, 2026 20:37
@leonvanzyl
Collaborator

Hey @cabana8471-arch! Thanks for tackling this - SQLite corruption in parallel mode is a serious issue and the core approach here (replacing threading.Lock() with atomic SQL operations) is the right call. Nice work identifying the root causes.

I did a thorough review and found a few things worth addressing before merging. Splitting them into must-fix and should-fix:


Critical (Must Fix)

1. Signal handler performs unsafe database operations

parallel_orchestrator.py, signal_handler()

The signal handler calls orchestrator.cleanup() which does SQLite I/O (PRAGMA wal_checkpoint(FULL), engine.dispose()). Python signal handlers execute between bytecode instructions — if a signal arrives while the main thread is mid-SQLite operation, the handler will try to acquire the same internal SQLite mutexes, causing a deadlock.

This is a documented real-world issue.

Suggested fix: Use a flag-based approach instead:

def signal_handler(signum, frame):
    orchestrator._shutdown_requested = True
    orchestrator.is_running = False

Then let the main run_loop() check the flag and the finally block handle cleanup on a safe code path. The atexit handler already covers normal termination.

2. atomic_transaction() relies on fragile pysqlite behavior

api/database.py, atomic_transaction()

Issuing session.execute(text("BEGIN IMMEDIATE")) on a session created with autocommit=False works today only because it's the first operation on a fresh session (before pysqlite's legacy mode auto-begins). This is fragile:

  • Breaks if any ORM flush occurs before BEGIN IMMEDIATE
  • Will break in Python 3.16 when the legacy pysqlite transaction mode is removed

Per the SQLAlchemy SQLite docs, the recommended approach is event hooks:

from sqlalchemy import event

@event.listens_for(engine, "connect")
def do_connect(dbapi_connection, connection_record):
    dbapi_connection.isolation_level = None

@event.listens_for(engine, "begin")
def do_begin(conn):
    conn.exec_driver_sql("BEGIN IMMEDIATE")

This makes atomic_transaction() unnecessary for most cases and is forward-compatible.


Warnings (Should Address)

3. Dead code in api/database.py

atomic_claim_feature(), atomic_mark_passing(), atomic_update_priority_to_end(), and atomic_get_next_priority() are defined but never imported or called — feature_mcp.py reimplements the same logic inline. That's ~100 lines of unused code. Either remove them or refactor feature_mcp.py to call them.

4. cleanup() is not idempotent

cleanup() can be called up to 3 times (signal handler + finally + atexit), but self._engine is never set to None after dispose. Subsequent calls will attempt operations on a disposed engine.

Quick fix:

def cleanup(self):
    engine = self._engine
    self._engine = None  # Prevent re-entry
    if engine is not None:
        # ... checkpoint and dispose

5. except KeyboardInterrupt becomes dead code

signal.signal(signal.SIGINT, signal_handler) overrides Python's default behavior, so the except KeyboardInterrupt: block in run_parallel_orchestrator() will never execute. Either remove it or don't register SIGINT (let Python raise KeyboardInterrupt naturally).

6. Windows note

signal.signal(signal.SIGTERM, handler) is silently ignored on Windows. Since this project's primary dev platform is Windows, worth a comment noting the limitation.

7. Rollback failure can swallow the original exception

In atomic_transaction(), if the yield block raises and session.rollback() also fails, the rollback exception replaces the original. Consider wrapping:

except Exception:
    try:
        session.rollback()
    except Exception:
        pass
    raise

8. No automated tests

This is concurrency-critical code. Even a basic test that spawns N threads calling atomic_claim_feature() for the same feature (verifying exactly 1 succeeds) would catch regressions.
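Such a test might look like the following sketch. It uses raw sqlite3 and threads rather than the project's session_maker, so it validates SQL-level atomicity but does not fully reproduce the multi-process scenario; table and column names are assumed:

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

def try_claim(db_path: str, feature_id: int) -> bool:
    """One worker's attempt: atomic check-and-set via UPDATE ... WHERE."""
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        cur = conn.execute(
            "UPDATE features SET in_progress = 1 "
            "WHERE id = ? AND in_progress = 0",
            (feature_id,),
        )
        conn.commit()
        return cur.rowcount == 1
    finally:
        conn.close()

def run_claim_race(n_workers: int = 8) -> int:
    """Race n_workers threads for one feature; return how many won."""
    db_path = os.path.join(tempfile.mkdtemp(), "features.db")
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE features (id INTEGER PRIMARY KEY, "
        "in_progress INTEGER DEFAULT 0)"
    )
    conn.execute("INSERT INTO features (id) VALUES (1)")
    conn.commit()
    conn.close()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda _: try_claim(db_path, 1), range(n_workers)))
    return sum(results)
```

The assertion the test would make is that run_claim_race() returns exactly 1, whatever the thread interleaving.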


What's Good

  • Correctly identifies threading.Lock() as useless cross-process
  • Atomic UPDATE...WHERE is the right pattern for SQLite multi-process safety
  • WAL checkpoint on shutdown prevents stale cache issues
  • contextlib.closing() for connection leak prevention (good catch from the CodeRabbit feedback too)
  • Thread-safe engine cache with double-checked locking in assistant_database.py
  • Backward compatible — no schema changes needed

The concept is solid. Once the signal handler and atomic_transaction issues are addressed, this should be good to go. Let me know if you have questions about any of the above!

…afety

Changes per leonvanzyl and CodeRabbit feedback:

Signal handling (parallel_orchestrator.py):
- Make signal handler async-safe (only sets flags, no I/O)
- Add _shutdown_requested flag for graceful shutdown
- Remove SIGINT handler to let KeyboardInterrupt work naturally
- Make cleanup() idempotent by clearing _engine first
- Fix Optional type hints for mypy compliance

SQLite transactions (api/database.py):
- Add SQLAlchemy event hooks for BEGIN IMMEDIATE transactions
- This replaces fragile pysqlite implicit transaction handling
- Future-proof for Python 3.16 when legacy mode is removed
- Fix rollback exception swallowing in atomic_transaction()
- Remove ~150 lines of dead code (unused atomic_* functions)

Type safety improvements:
- Migrate to SQLAlchemy 2.0 DeclarativeBase (database.py, registry.py)
- Fix all mypy errors in modified files
- Add explicit type annotations for Optional parameters

Testing:
- Add test_atomic_operations.py with 7 concurrency tests
- Tests verify atomic claims, priority updates, cleanup idempotency

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cabana8471-arch added a commit to cabana8471-arch/autocoder that referenced this pull request Jan 30, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@api/database.py`:
- Around line 496-513: The code currently attempts session.execute(text("BEGIN
EXCLUSIVE")) when isolation_level == "EXCLUSIVE", but the SQLAlchemy "begin"
event hook already issues "BEGIN IMMEDIATE", causing a nested-transaction error;
fix by removing the explicit BEGIN EXCLUSIVE and do not support starting
EXCLUSIVE at the session level—either (preferred) map the "EXCLUSIVE" argument
to "IMMEDIATE" (treat isolation_level == "EXCLUSIVE" as "IMMEDIATE") in the
atomic_transaction/session creation logic, or immediately raise a clear
ValueError when callers pass "EXCLUSIVE"; update references to isolation_level
handling and remove session.execute(text("BEGIN EXCLUSIVE")) so the existing
begin event hook (which issues BEGIN IMMEDIATE) is the single source of BEGIN
semantics.
🧹 Nitpick comments (1)
test_atomic_operations.py (1)

218-280: Clarify: Test validates serialization, not parallel reads.

The test comment says "IMMEDIATE prevents stale reads" but the expected behavior (initial + 200) actually validates that transactions serialize properly - the second transaction waits for the first to complete, then reads the updated value.

This is correct behavior for BEGIN IMMEDIATE, but the test name/comments could be clearer. With IMMEDIATE, the second transaction blocks until the first commits, so each thread reads the previously committed value (not a stale value). This is serialization via write-lock contention, not snapshot isolation.

Suggested clarification in docstring
 class TestAtomicTransactionIsolation:
-    """Test that atomic_transaction with IMMEDIATE prevents stale reads."""
+    """Test that atomic_transaction with IMMEDIATE serializes concurrent writes."""

     def test_read_modify_write_isolation(self, temp_db):
-        """Verify IMMEDIATE transaction prevents stale read in read-modify-write."""
+        """Verify IMMEDIATE transactions serialize, preventing lost updates."""
         engine, session_maker = temp_db

-        # This test verifies that two concurrent read-modify-write operations
-        # don't both read the same value and create a conflict
+        # This test verifies that IMMEDIATE transactions serialize correctly.
+        # The second transaction blocks until the first commits, then reads
+        # the updated value. This prevents lost updates.

Comment on lines +496 to +513
    session = session_maker()
    try:
        # For EXCLUSIVE mode, override the default IMMEDIATE from event hooks
        # For IMMEDIATE mode, the event hooks handle BEGIN IMMEDIATE automatically
        if isolation_level == "EXCLUSIVE":
            session.execute(text("BEGIN EXCLUSIVE"))
        # Note: For IMMEDIATE, we don't issue BEGIN here - the event hook handles it
        # This prevents the fragile "BEGIN on already-begun transaction" issue
        yield session
        session.commit()
    except Exception:
        try:
            session.rollback()
        except Exception:
            pass  # Don't let rollback failure mask original error
        raise
    finally:
        session.close()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if EXCLUSIVE mode is actually used anywhere in the codebase
echo "=== Searching for EXCLUSIVE mode usage ==="
rg -n "EXCLUSIVE" --type=py -B 2 -A 2

echo -e "\n=== Searching for isolation_level assignments ==="
rg -n 'isolation_level' --type=py -B 2 -A 2

echo -e "\n=== Looking for atomic_transaction context manager ==="
rg -n "atomic_transaction\|@atomic_transaction" --type=py -B 2 -A 2 | head -50

Repository: leonvanzyl/autocoder



🏁 Script executed:

#!/bin/bash
# Find the event hook implementation
echo "=== Looking for do_begin event hook ==="
rg -n "do_begin\|event.listen\|before_cursor_execute" --type=py -B 3 -A 5



🏁 Script executed:

#!/bin/bash
# Check the database.py file structure to find event hook setup
echo "=== Examining api/database.py for event hooks and session setup ==="
head -100 api/database.py



🏁 Script executed:

#!/bin/bash
echo "=== Looking for _configure_sqlite_immediate_transactions function ==="
rg -n "_configure_sqlite_immediate_transactions" --type=py -A 20

echo -e "\n=== Searching for event.listen calls ==="
rg -n "event\.listen" --type=py -B 2 -A 5



🏁 Script executed:

#!/bin/bash
echo "=== Searching for BEGIN IMMEDIATE in database.py ==="
rg -n "BEGIN IMMEDIATE" api/database.py -B 3 -A 3

echo -e "\n=== Full context around lines 350-420 in api/database.py ==="
sed -n '350,420p' api/database.py



🏁 Script executed:

#!/bin/bash
echo "=== Check lines around the atomic_transaction function and event setup ==="
sed -n '460,520p' api/database.py



EXCLUSIVE mode causes "cannot start a transaction within a transaction" error.

When isolation_level == "EXCLUSIVE", the code executes session.execute(text("BEGIN EXCLUSIVE")) at line 501. However, SQLAlchemy's "begin" event (lines 367-370) fires and executes BEGIN IMMEDIATE before this statement runs. Since autocommit=False on the sessionmaker, attempting BEGIN EXCLUSIVE within an already-started transaction causes SQLite to error.

This affects the actual usage in feature_mcp.py line 564, which calls atomic_transaction(_session_maker, "EXCLUSIVE").

Proposed fix

Either:

  1. Disable EXCLUSIVE mode support and only allow IMMEDIATE:
     session = session_maker()
     try:
-        # For EXCLUSIVE mode, override the default IMMEDIATE from event hooks
-        # For IMMEDIATE mode, the event hooks handle BEGIN IMMEDIATE automatically
-        if isolation_level == "EXCLUSIVE":
-            session.execute(text("BEGIN EXCLUSIVE"))
-        # Note: For IMMEDIATE, we don't issue BEGIN here - the event hook handles it
-        # This prevents the fragile "BEGIN on already-begun transaction" issue
+        if isolation_level != "IMMEDIATE":
+            raise ValueError(f"Only IMMEDIATE isolation level is supported, got: {isolation_level}")
         yield session
  2. Or conditionally disable the event hook for EXCLUSIVE mode connections (requires connection-level isolation, which adds complexity).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

From:

    session = session_maker()
    try:
        # For EXCLUSIVE mode, override the default IMMEDIATE from event hooks
        # For IMMEDIATE mode, the event hooks handle BEGIN IMMEDIATE automatically
        if isolation_level == "EXCLUSIVE":
            session.execute(text("BEGIN EXCLUSIVE"))
        # Note: For IMMEDIATE, we don't issue BEGIN here - the event hook handles it
        # This prevents the fragile "BEGIN on already-begun transaction" issue
        yield session
        session.commit()
    except Exception:
        try:
            session.rollback()
        except Exception:
            pass  # Don't let rollback failure mask original error
        raise
    finally:
        session.close()

To:

    session = session_maker()
    try:
        if isolation_level != "IMMEDIATE":
            raise ValueError(f"Only IMMEDIATE isolation level is supported, got: {isolation_level}")
        yield session
        session.commit()
    except Exception:
        try:
            session.rollback()
        except Exception:
            pass  # Don't let rollback failure mask original error
        raise
    finally:
        session.close()

cabana8471-arch and others added 3 commits January 30, 2026 23:17
These imports are needed for the default_concurrency column and migration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add explicit encoding="utf-8" and errors="replace" to popen_kwargs
dict in parallel_orchestrator.py (3 locations) to fix Windows CP1252
encoding issues.

Closes leonvanzyl#138

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@leonvanzyl
Collaborator

Hey @cabana8471-arch — thanks for this PR and the thorough work on diagnosing the SQLite corruption issue in parallel mode. Your root cause analysis was spot-on: threading.Lock() is indeed useless across processes, and atomic SQL operations are the right fix.

We've gone ahead and implemented the core improvements from this PR directly on master, incorporating the best parts while addressing a few issues we found during review. Here's what we took and what we adjusted:

What we implemented (inspired by this PR)

  • SQLAlchemy 2.0 event hooks (do_connect + do_begin) for reliable BEGIN IMMEDIATE transactions in api/database.py
  • Atomic UPDATE ... WHERE operations across all feature MCP tools (feature_mark_passing, feature_mark_in_progress, feature_claim_and_get, feature_skip, etc.)
  • Removed threading.Lock() (_priority_lock) from feature_mcp.py — replaced with SQL-level atomicity
  • atomic_transaction() context manager for multi-statement operations (feature_create_bulk, feature_add_dependency, etc.)
  • WAL checkpoint on shutdown + idempotent cleanup() method in parallel_orchestrator.py
  • Async-safe signal handler (flag-only, no I/O) with atexit registration
  • contextlib.closing() for connection handling in progress.py
  • Thread-safe engine cache with double-checked locking in assistant_database.py
  • DeclarativeBase migration in registry.py, database.py, and assistant_database.py
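The do_connect/do_begin pair in the first bullet follows the recipe from the SQLAlchemy SQLite dialect documentation; a minimal sketch, with an in-memory URL and a features table standing in for the real schema:

```python
from sqlalchemy import create_engine, event, text

engine = create_engine("sqlite://")  # in-memory here; real code points at the features db

@event.listens_for(engine, "connect")
def do_connect(dbapi_connection, connection_record):
    # Stop pysqlite from emitting its own (late, implicit) BEGIN so that the
    # begin hook below fully controls transaction start.
    dbapi_connection.isolation_level = None

@event.listens_for(engine, "begin")
def do_begin(conn):
    # Take the write lock up front: concurrent writers now wait at BEGIN
    # instead of failing mid-transaction with "database is locked".
    conn.exec_driver_sql("BEGIN IMMEDIATE")

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE features (id INTEGER PRIMARY KEY, passes INTEGER DEFAULT 0)"))
```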

What we fixed vs this PR

  1. Removed BEGIN EXCLUSIVE support entirely — The do_begin event hook fires BEGIN IMMEDIATE for every new transaction. When atomic_transaction(session_maker, "EXCLUSIVE") then executes BEGIN EXCLUSIVE, it creates a nested transaction that throws OperationalError: cannot start a transaction within a transaction. Since IMMEDIATE and EXCLUSIVE provide identical write-lock guarantees in WAL mode, we use IMMEDIATE everywhere. This was the main blocker for merging.

  2. Added WHERE passes = 0 state guard to feature_mark_passing — The PR's version used UPDATE features SET passes = 1, in_progress = 0 WHERE id = :id without checking current state, allowing duplicate success reports. We added AND passes = 0 and check rowcount for already-passing detection.
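The guarded UPDATE in point 2 is a single-statement compare-and-swap; a sketch with the stdlib driver, where the schema and helper name are assumed for illustration:

```python
import sqlite3

def mark_passing(conn, feature_id):
    """Return True if we flipped the feature to passing, False if it already was."""
    cur = conn.execute(
        "UPDATE features SET passes = 1, in_progress = 0 "
        "WHERE id = :id AND passes = 0",   # state guard: repeat calls match zero rows
        {"id": feature_id},
    )
    return cur.rowcount == 1  # 0 rows touched => duplicate success report

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id INTEGER PRIMARY KEY, passes INTEGER, in_progress INTEGER)")
conn.execute("INSERT INTO features VALUES (1, 0, 1)")
```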

What we left out

  • Cross-process tests — The test suite uses threading.Barrier / ThreadPoolExecutor, which validates SQL-level atomicity but doesn't exercise the actual multi-process scenario from issue #106 ([Bug] Database gets corrupted from time to time since multi agents). We may add multiprocessing-based tests separately.
  • The full test file — Since we adjusted several implementations (no EXCLUSIVE, added state guards), the tests would need rework. We'll write tests tailored to our implementation.
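A multiprocessing-based test of the kind mentioned above might look like this sketch (fork start method forced, so Unix-only; the schema and claim query are assumptions, not the project's real suite):

```python
import multiprocessing as mp
import os
import sqlite3
import tempfile

def try_claim(db_path, results):
    # Each worker process races to claim feature 1 with an atomic compare-and-swap.
    conn = sqlite3.connect(db_path, timeout=10, isolation_level=None)
    try:
        cur = conn.execute(
            "UPDATE features SET in_progress = 1 WHERE id = 1 AND in_progress = 0"
        )
        results.put(cur.rowcount)  # 1 = claimed, 0 = lost the race
    finally:
        conn.close()

def run_race(num_workers=8):
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    setup = sqlite3.connect(db_path)
    setup.execute("PRAGMA journal_mode=WAL")
    setup.execute("CREATE TABLE features (id INTEGER PRIMARY KEY, in_progress INTEGER)")
    setup.execute("INSERT INTO features VALUES (1, 0)")
    setup.commit()
    setup.close()

    ctx = mp.get_context("fork")  # Unix-only; keeps this sketch free of re-import issues
    results = ctx.Queue()
    workers = [ctx.Process(target=try_claim, args=(db_path, results)) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    outcomes = [results.get() for _ in range(num_workers)]
    os.unlink(db_path)
    return outcomes
```

Exactly one worker should observe rowcount 1, no matter how the processes interleave.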

Thanks again for identifying these issues and proposing solid solutions. The atomic SQL pattern and event hook approach were exactly right, and this PR was instrumental in getting these fixes onto master. We're going to close this PR since the changes have been incorporated, but your contribution is very much appreciated! 🙏

@leonvanzyl
Collaborator

Closing this PR as the core improvements have been implemented directly on master (commit d8a8c83). See the detailed comment above for what was incorporated. Thanks again for the excellent work on this!

@leonvanzyl leonvanzyl closed this Feb 1, 2026
leonvanzyl added a commit that referenced this pull request Feb 1, 2026
Replace ineffective threading.Lock() with atomic SQL operations for
cross-process safety. Key changes:

- Add SQLAlchemy event hooks (do_connect/do_begin) for BEGIN IMMEDIATE
  transactions in api/database.py
- Add atomic_transaction() context manager for multi-statement ops
- Convert all feature MCP write operations to atomic UPDATE...WHERE
  with compare-and-swap patterns (feature_claim, mark_passing, etc.)
- Add WHERE passes=0 state guard to feature_mark_passing
- Add WAL checkpoint on shutdown and idempotent cleanup() in
  parallel_orchestrator.py with async-safe signal handling
- Wrap SQLite connections with contextlib.closing() in progress.py
- Add thread-safe engine cache with double-checked locking in
  assistant_database.py
- Migrate to SQLAlchemy 2.0 DeclarativeBase across all modules

Inspired by PR #108 (cabana8471-arch), with fixes for nested
BEGIN EXCLUSIVE bug and missing state guards.

Closes #106

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
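The shutdown pieces in this commit (idempotent cleanup, WAL checkpoint, flag-only signal handler) compose roughly as in this sketch, which uses the stdlib driver and invented names rather than the orchestrator's real API:

```python
import atexit
import signal
import sqlite3

class Orchestrator:
    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self._cleaned_up = False
        self._shutdown_requested = False
        atexit.register(self.cleanup)
        signal.signal(signal.SIGTERM, self._on_signal)
        signal.signal(signal.SIGINT, self._on_signal)

    def _on_signal(self, signum, frame):
        # Async-safe: only set a flag; the main loop polls it and calls cleanup().
        self._shutdown_requested = True

    def cleanup(self):
        if self._cleaned_up:  # idempotent: atexit and a finally block may both call this
            return
        self._cleaned_up = True
        try:
            # Fold the WAL back into the main db file so a later reader never
            # inherits a stale -wal left behind by a killed process.
            self.conn.execute("PRAGMA wal_checkpoint(FULL)")
        finally:
            self.conn.close()
```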