
fix: Prevent SQLite corruption in parallel mode with atomic operations#108

Closed
cabana8471-arch wants to merge 11 commits into AutoForgeAI:master from cabana8471-arch:fix/sqlite-parallel-corruption

Conversation

Contributor

@cabana8471-arch cabana8471-arch commented Jan 26, 2026

Summary

  • Fixes database corruption (sqlite3.DatabaseError: database disk image is malformed) when running multiple agents in parallel
  • Replaces thread-based locking (ineffective cross-process) with atomic SQL operations
  • Adds proper engine cleanup with WAL checkpointing on shutdown

Fixes #106

Root Causes Addressed

  1. threading.Lock() useless cross-process - Removed and replaced with atomic SQL
  2. Read-modify-write races - Fixed with UPDATE ... WHERE clauses
  3. Engine never disposed - Added cleanup() method with signal handlers
  4. No WAL checkpoint - Added PRAGMA wal_checkpoint(FULL) before dispose
  5. assistant.db cache race - Added thread-safe locking

Changes

api/database.py

  • Added atomic_transaction() context manager with BEGIN IMMEDIATE/EXCLUSIVE
  • Added atomic_claim_feature(), atomic_mark_passing(), atomic_update_priority_to_end()
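For illustration, a minimal sketch of what such a context manager can look like, using raw sqlite3 rather than the PR's SQLAlchemy session (the db_path parameter and mode handling here are assumptions, not the actual implementation):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def atomic_transaction(db_path, mode="IMMEDIATE"):
    """Open a transaction that takes SQLite's write lock up front.

    BEGIN IMMEDIATE acquires the reserved (write) lock at transaction
    start, so two processes cannot both read stale state and then race
    on the write.
    """
    conn = sqlite3.connect(db_path, timeout=30)
    conn.isolation_level = None  # disable pysqlite's implicit BEGIN
    try:
        conn.execute(f"BEGIN {mode}")
        yield conn
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    finally:
        conn.close()
```

The key design point is that the lock is taken before any reads inside the block, which is what closes the read-modify-write window between cooperating processes.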

mcp_server/feature_mcp.py

  • Removed useless _priority_lock = threading.Lock()
  • Fixed all high-risk tools with atomic operations:
    • feature_skip - atomic priority update with subquery
    • feature_mark_in_progress - atomic UPDATE WHERE in_progress=0
    • feature_claim_and_get - atomic claim
    • feature_create_bulk - EXCLUSIVE transaction for priority range
    • feature_add_dependency - IMMEDIATE transaction for cycle detection
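The atomic pattern these tools share can be sketched as follows. This is an illustrative raw-sqlite3 version; the column names (in_progress, passes) are taken from the PR description, and the real tools wrap a SQLAlchemy session:

```python
import sqlite3

def atomic_claim_feature(conn: sqlite3.Connection, feature_id: int) -> bool:
    """Claim a feature only if nobody else has.

    The WHERE clause makes the check-and-set a single atomic statement
    inside SQLite, so it is safe across processes without any Python lock.
    """
    cur = conn.execute(
        "UPDATE features SET in_progress = 1 "
        "WHERE id = ? AND in_progress = 0 AND passes = 0",
        (feature_id,),
    )
    conn.commit()
    return cur.rowcount == 1  # exactly one caller can win the claim
```

Callers branch on the rowcount instead of reading first and writing second, which is what eliminates the race.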

parallel_orchestrator.py

  • Added cleanup() method with WAL checkpoint before dispose
  • Added atexit and signal handlers (SIGTERM, SIGINT)
  • Always calls cleanup in finally block
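The shape of that cleanup can be sketched like this, using a raw sqlite3 connection in place of the PR's SQLAlchemy engine (class and attribute names here are illustrative, and signal wiring is omitted):

```python
import atexit
import sqlite3

class Orchestrator:
    """Owns the database connection for the orchestrator process."""

    def __init__(self, db_path: str):
        self._conn = sqlite3.connect(db_path)
        # Cover normal interpreter exit; signal handlers should only set
        # a shutdown flag and let the main loop reach cleanup() safely.
        atexit.register(self.cleanup)

    def cleanup(self) -> None:
        # Idempotent: swap the handle out first so re-entry is a no-op.
        conn, self._conn = self._conn, None
        if conn is not None:
            # Flush WAL pages into the main database file before closing,
            # so no stale -wal file is left behind for the next process.
            conn.execute("PRAGMA wal_checkpoint(FULL)")
            conn.close()
```

Clearing the handle before doing I/O is what makes repeated calls (finally block plus atexit) harmless.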

progress.py

  • Added _get_connection() helper with timeout=30 and PRAGMA busy_timeout=30000

server/services/assistant_database.py

  • Added threading.Lock() for engine cache
  • Uses double-checked locking pattern
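The double-checked locking pattern mentioned above can be sketched like this (illustrative; _create_engine is a hypothetical stand-in for the real SQLAlchemy engine factory):

```python
import threading

_engine_cache: dict = {}
_cache_lock = threading.Lock()

def _create_engine(db_path: str) -> object:
    # Stand-in for sqlalchemy.create_engine(...); hypothetical factory.
    return object()

def get_engine(db_path: str):
    # Fast path: no lock needed once the engine already exists.
    engine = _engine_cache.get(db_path)
    if engine is None:
        with _cache_lock:
            # Re-check under the lock: another thread may have created
            # the engine between our unlocked read and acquiring the lock.
            engine = _engine_cache.get(db_path)
            if engine is None:
                engine = _create_engine(db_path)
                _engine_cache[db_path] = engine
    return engine
```

The second check under the lock is what prevents two threads from each creating (and leaking) an engine for the same path.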

Test plan

  • Run parallel mode with max concurrency (5 agents)
  • After 60s, check PRAGMA integrity_check returns "ok"
  • Verify no duplicate priorities: SELECT priority, COUNT(*) FROM features GROUP BY priority HAVING COUNT(*) > 1
  • Stop orchestrator and verify process count returns to baseline
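The database checks in this plan can be scripted, for example like this (hypothetical helper; the features table and priority column are taken from the query above):

```python
import sqlite3

def verify_db_health(db_path: str) -> list:
    """Run the test plan's checks: integrity and priority uniqueness."""
    problems = []
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA integrity_check returns a single row "ok" when healthy.
        if conn.execute("PRAGMA integrity_check").fetchone()[0] != "ok":
            problems.append("integrity_check failed")
        dupes = conn.execute(
            "SELECT priority, COUNT(*) FROM features "
            "GROUP BY priority HAVING COUNT(*) > 1"
        ).fetchall()
        if dupes:
            problems.append(f"duplicate priorities: {dupes}")
    finally:
        conn.close()
    return problems
```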

Migration

  • No schema changes - backward compatible
  • No data migration needed
  • Works with existing databases

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Explicit atomic transaction support for safer concurrent writes.
    • Graceful cleanup on exit to release DB resources reliably.
    • DB dependency now yields connections for safer lifecycle management.
  • Bug Fixes

    • Stronger SQLite locking to avoid races and ensure write-lock acquisition.
    • Improved timeout/busy handling and more consistent error responses.
  • Refactor

    • Replaced in-process locks with atomic DB-backed updates.
  • Tests

    • Added concurrency and transaction isolation tests.


Fixes AutoForgeAI#106

## Problem
When running multiple agents in parallel (up to 5), database corruption occurs:
- `sqlite3.DatabaseError: database disk image is malformed`
- `PRAGMA integrity_check` shows btreeInitPage errors

## Root Causes
1. `threading.Lock()` in feature_mcp.py only works per-process, not cross-process
2. Read-modify-write patterns in 6+ MCP tools cause race conditions
3. Orchestrator engine never disposed - connection pool not cleaned up
4. No WAL checkpoint on agent exit causes stale cache issues
5. assistant.db engine cache has no locking

## Solution
- Replace read-modify-write patterns with atomic `UPDATE ... WHERE` clauses
- Use SQLite IMMEDIATE transactions for consistent reads during cycle detection
- Add WAL checkpoint before engine dispose on shutdown
- Add thread-safe locking to assistant.db engine cache
- Add proper connection timeout settings (30s) throughout

## Changes
- api/database.py: Add atomic_transaction(), atomic_claim_feature(), etc.
- mcp_server/feature_mcp.py: Fix all high-risk tools with atomic SQL
- parallel_orchestrator.py: Add cleanup() with WAL checkpoint, signal handlers
- progress.py: Add _get_connection() with timeout settings
- server/services/assistant_database.py: Add thread-safe cache locking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai bot commented Jan 26, 2026

Warning

Rate limit exceeded

@cabana8471-arch has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 26 minutes and 0 seconds before requesting another review.


📝 Walkthrough

Adds SQLAlchemy 2.0-style declarative base and an explicit atomic_transaction(session_maker, isolation_level="IMMEDIATE") context manager, configures SQLite to use BEGIN IMMEDIATE via event hooks, converts many feature operations to atomic SQL updates, improves SQLite parallel-safety timeouts/connections, adds orchestrator cleanup and thread-safe engine caching, and adds concurrency tests.

Changes

  • Database core & API (api/database.py): Added atomic_transaction(session_maker, isolation_level="IMMEDIATE") and _configure_sqlite_immediate_transactions(engine), switched to SQLAlchemy 2.0 DeclarativeBase for Base, made get_db() a generator (Generator[Session, None, None]), and added event hooks to enforce BEGIN IMMEDIATE and PRAGMA setup before Base.metadata.create_all.
  • Feature operations, atomic updates (mcp_server/feature_mcp.py): Replaced in-process locks with atomic SQL UPDATE ... WHERE ... inside atomic_transaction for claim, in_progress, passes, priority, dependencies, bulk create, and related flows; added state refreshes and explicit JSON error responses.
  • Orchestrator lifecycle & cleanup (parallel_orchestrator.py): Added ParallelOrchestrator.cleanup() to checkpoint WAL and dispose the engine (idempotent), registered atexit and signal handlers, ensured cleanup in finally blocks, and widened some parameter types to allow None.
  • SQLite connection helpers & timeouts (progress.py): Added SQLITE_TIMEOUT, SQLITE_BUSY_TIMEOUT_MS, and _get_connection(db_file: Path) to apply timeouts and busy_timeout; replaced direct sqlite3.connect() usages with the helper and context-managed connections.
  • Thread-safe engine cache (server/services/assistant_database.py): Introduced a module-level _cache_lock and double-checked locking in get_engine(); engines created with check_same_thread=False and a 30s timeout, then cached for reuse.
  • Registry / ORM base update (registry.py): Replaced declarative_base() with class Base(DeclarativeBase) to adopt the SQLAlchemy 2.0 declarative base pattern for models.
  • Tests, concurrency & behavior (test_atomic_operations.py): Added extensive concurrency tests validating single-winner claims, concurrent priority updates, transaction isolation/rollback, event hook firing, and orchestrator cleanup() idempotency.
  • Dead/legacy helpers noted (api/database.py comments): Marked some legacy atomic helper functions as dead/relocated in comments and added minor # type: ignore annotations around table creation to satisfy typing.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant MCP as mcp_server.feature_mcp
  participant API as api.database
  participant DB as SQLite
  Client->>MCP: request to claim feature (id)
  MCP->>API: atomic_claim_feature(session_maker, id)
  API->>DB: BEGIN IMMEDIATE
  DB-->>API: lock acquired
  API->>DB: UPDATE features SET in_progress=1 WHERE id=? AND conditions...
  DB-->>API: affected_rows (1 or 0)
  API->>DB: COMMIT
  API-->>MCP: return success/failure with feature payload
  MCP-->>Client: respond with claim result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hopped through rows with PRAGMAs in paw,

Began IMMEDIATE to keep chaos at bay,
Timeouts and checkpoints to guard every draw,
Threads now behave while updates hold sway,
A rabbit nods — the DB is safe today.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title accurately summarizes the main change, preventing SQLite corruption in parallel mode using atomic operations, which is the primary objective of the PR.
  • Linked Issues Check ✅ Passed: the PR addresses all coding requirements from issue #106: replaces threading locks with atomic SQL operations, implements transaction semantics (BEGIN IMMEDIATE), adds engine cleanup with WAL checkpointing, and ensures cross-process-safe database access.
  • Out of Scope Changes Check ✅ Passed: all changes are directly scoped to fixing SQLite corruption (database connection handling, atomic operations, cleanup routines, and concurrency tests). No unrelated refactoring or scope creep detected.
  • Docstring Coverage ✅ Passed: docstring coverage is 84.00%, which meets the required threshold of 80.00%.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
progress.py (1)

65-74: Ensure SQLite connections close on error paths.
If cursor.execute(...) raises, conn.close() is skipped, which can leak file descriptors and keep locks longer than needed. Consider wrapping connections with contextlib.closing (or explicit try/finally) in all three call sites.

♻️ Suggested fix (apply same pattern to the other functions)
+from contextlib import closing
...
-    try:
-        conn = _get_connection(db_file)
-        cursor = conn.cursor()
-        cursor.execute("SELECT COUNT(*) FROM features")
-        count = cursor.fetchone()[0]
-        conn.close()
-        return count > 0
-    except Exception:
-        return False
+    try:
+        with closing(_get_connection(db_file)) as conn:
+            cursor = conn.cursor()
+            cursor.execute("SELECT COUNT(*) FROM features")
+            count = cursor.fetchone()[0]
+            return count > 0
+    except Exception:
+        return False

Also applies to: 93-126, 145-158

🤖 Fix all issues with AI agents
In `@mcp_server/feature_mcp.py`:
- Around line 37-45: The import block is unsorted causing ruff errors; reorder
the names imported from api.database so they are alphabetically sorted (e.g.,
adjust the import that currently lists Feature, create_database,
atomic_transaction to list them in alphabetical order: atomic_transaction,
create_database, Feature) and ensure the overall import group ordering follows
standard (stdlib, third-party, local) so FastMCP, BaseModel, Field, text,
sys.path insertion, then the sorted api.database import and subsequent
dependency_resolver imports.
- Around line 417-426: In the result.rowcount == 0 handling block, remove the
unnecessary f-string prefix on the final error message; change the return
json.dumps({"error": f"Failed to mark feature in-progress for unknown reason"})
to a plain string literal (no f) so it reads return json.dumps({"error": "Failed
to mark feature in-progress for unknown reason"}); keep the other messages that
interpolate feature_id (e.g., the returns for Feature not found, already
passing, already in-progress) as f-strings and leave session.query(Feature) and
feature_id usage unchanged.

cabana8471-arch and others added 2 commits January 26, 2026 22:29
- Sort imports alphabetically in feature_mcp.py (I001)
- Remove unnecessary f-string prefix (F541)
- Use contextlib.closing to ensure SQLite connections close on error paths

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ruff requires classes (uppercase) to come before functions (lowercase):
Feature, atomic_transaction, create_database

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cabana8471-arch added a commit to cabana8471-arch/autocoder that referenced this pull request Jan 27, 2026
Combined:
- Stuck features clearing at orchestrator startup (my-features)
- Signal handlers and atexit cleanup (PR AutoForgeAI#108)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cabana8471-arch added a commit to cabana8471-arch/autocoder that referenced this pull request Jan 27, 2026
…orgeAI#100, AutoForgeAI#108, AutoForgeAI#109, AutoForgeAI#110

PR AutoForgeAI#110 (Quality Gates):
- Move quality checks before DB session to avoid holding locks
- Return error instead of None for missing configured custom script
- Use contextlib.closing for SQLite connections in progress.py

PR AutoForgeAI#109 (Rate Limit):
- Extract rate limit logic to shared rate_limit_utils.py module
- Remove duplicated code from agent.py and test_agent.py

PR AutoForgeAI#108 (SQLite Parallel):
- Sort imports alphabetically in feature_mcp.py

PR AutoForgeAI#100 (Config Diagnostics):
- Add logger.warning for pkill_processes validation failures

PR AutoForgeAI#95 (Infrastructure Mock):
- Add language tags to fenced code blocks in initializer template

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cabana8471-arch added a commit to cabana8471-arch/autocoder that referenced this pull request Jan 27, 2026
cabana8471-arch and others added 2 commits January 28, 2026 22:27
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
getworken pushed a commit to getworken/autocoder that referenced this pull request Jan 29, 2026
@cabana8471-arch cabana8471-arch deleted the fix/sqlite-parallel-corruption branch January 29, 2026 16:35
@cabana8471-arch cabana8471-arch restored the fix/sqlite-parallel-corruption branch January 29, 2026 20:37
@leonvanzyl
Collaborator

Hey @cabana8471-arch! Thanks for tackling this - SQLite corruption in parallel mode is a serious issue and the core approach here (replacing threading.Lock() with atomic SQL operations) is the right call. Nice work identifying the root causes.

I did a thorough review and found a few things worth addressing before merging. Splitting them into must-fix and should-fix:


Critical (Must Fix)

1. Signal handler performs unsafe database operations

parallel_orchestrator.py, signal_handler()

The signal handler calls orchestrator.cleanup() which does SQLite I/O (PRAGMA wal_checkpoint(FULL), engine.dispose()). Python signal handlers execute between bytecode instructions — if a signal arrives while the main thread is mid-SQLite operation, the handler will try to acquire the same internal SQLite mutexes, causing a deadlock.

This is a documented real-world issue.

Suggested fix: Use a flag-based approach instead:

def signal_handler(signum, frame):
    orchestrator._shutdown_requested = True
    orchestrator.is_running = False

Then let the main run_loop() check the flag and the finally block handle cleanup on a safe code path. The atexit handler already covers normal termination.

2. atomic_transaction() relies on fragile pysqlite behavior

api/database.py, atomic_transaction()

Issuing session.execute(text("BEGIN IMMEDIATE")) on a session created with autocommit=False works today only because it's the first operation on a fresh session (before pysqlite's legacy mode auto-begins). This is fragile:

  • Breaks if any ORM flush occurs before BEGIN IMMEDIATE
  • Will break in Python 3.16 when the legacy pysqlite transaction mode is removed

Per the SQLAlchemy SQLite docs, the recommended approach is event hooks:

from sqlalchemy import event

@event.listens_for(engine, "connect")
def do_connect(dbapi_connection, connection_record):
    dbapi_connection.isolation_level = None

@event.listens_for(engine, "begin")
def do_begin(conn):
    conn.exec_driver_sql("BEGIN IMMEDIATE")

This makes atomic_transaction() unnecessary for most cases and is forward-compatible.


Warnings (Should Address)

3. Dead code in api/database.py

atomic_claim_feature(), atomic_mark_passing(), atomic_update_priority_to_end(), and atomic_get_next_priority() are defined but never imported or called — feature_mcp.py reimplements the same logic inline. That's ~100 lines of unused code. Either remove them or refactor feature_mcp.py to call them.

4. cleanup() is not idempotent

cleanup() can be called up to 3 times (signal handler + finally + atexit), but self._engine is never set to None after dispose. Subsequent calls will attempt operations on a disposed engine.

Quick fix:

def cleanup(self):
    engine = self._engine
    self._engine = None  # Prevent re-entry
    if engine is not None:
        # ... checkpoint and dispose

5. except KeyboardInterrupt becomes dead code

signal.signal(signal.SIGINT, signal_handler) overrides Python's default behavior, so the except KeyboardInterrupt: block in run_parallel_orchestrator() will never execute. Either remove it or don't register SIGINT (let Python raise KeyboardInterrupt naturally).

6. Windows note

signal.signal(signal.SIGTERM, handler) is silently ignored on Windows. Since this project's primary dev platform is Windows, worth a comment noting the limitation.

7. Rollback failure can swallow the original exception

In atomic_transaction(), if the yield block raises and session.rollback() also fails, the rollback exception replaces the original. Consider wrapping:

except Exception:
    try:
        session.rollback()
    except Exception:
        pass
    raise

8. No automated tests

This is concurrency-critical code. Even a basic test that spawns N threads calling atomic_claim_feature() for the same feature (verifying exactly 1 succeeds) would catch regressions.
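Such a test might look like the following sketch. It uses raw sqlite3 and threads rather than the project's session_maker, so it validates SQL-level atomicity but does not fully reproduce the multi-process scenario; table and column names are assumed:

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

def try_claim(db_path: str, feature_id: int) -> bool:
    """One worker's attempt: atomic check-and-set via UPDATE ... WHERE."""
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        cur = conn.execute(
            "UPDATE features SET in_progress = 1 "
            "WHERE id = ? AND in_progress = 0",
            (feature_id,),
        )
        conn.commit()
        return cur.rowcount == 1
    finally:
        conn.close()

def run_claim_race(n_workers: int = 8) -> int:
    """Race n_workers threads for one feature; return how many won."""
    db_path = os.path.join(tempfile.mkdtemp(), "features.db")
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE features (id INTEGER PRIMARY KEY, "
        "in_progress INTEGER DEFAULT 0)"
    )
    conn.execute("INSERT INTO features (id) VALUES (1)")
    conn.commit()
    conn.close()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda _: try_claim(db_path, 1), range(n_workers)))
    return sum(results)
```

The assertion the test would make is that run_claim_race() returns exactly 1, whatever the thread interleaving.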


What's Good

  • Correctly identifies threading.Lock() as useless cross-process
  • Atomic UPDATE...WHERE is the right pattern for SQLite multi-process safety
  • WAL checkpoint on shutdown prevents stale cache issues
  • contextlib.closing() for connection leak prevention (good catch from the CodeRabbit feedback too)
  • Thread-safe engine cache with double-checked locking in assistant_database.py
  • Backward compatible — no schema changes needed

The concept is solid. Once the signal handler and atomic_transaction issues are addressed, this should be good to go. Let me know if you have questions about any of the above!

…afety

Changes per leonvanzyl and CodeRabbit feedback:

Signal handling (parallel_orchestrator.py):
- Make signal handler async-safe (only sets flags, no I/O)
- Add _shutdown_requested flag for graceful shutdown
- Remove SIGINT handler to let KeyboardInterrupt work naturally
- Make cleanup() idempotent by clearing _engine first
- Fix Optional type hints for mypy compliance

SQLite transactions (api/database.py):
- Add SQLAlchemy event hooks for BEGIN IMMEDIATE transactions
- This replaces fragile pysqlite implicit transaction handling
- Future-proof for Python 3.16 when legacy mode is removed
- Fix rollback exception swallowing in atomic_transaction()
- Remove ~150 lines of dead code (unused atomic_* functions)

Type safety improvements:
- Migrate to SQLAlchemy 2.0 DeclarativeBase (database.py, registry.py)
- Fix all mypy errors in modified files
- Add explicit type annotations for Optional parameters

Testing:
- Add test_atomic_operations.py with 7 concurrency tests
- Tests verify atomic claims, priority updates, cleanup idempotency

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cabana8471-arch added a commit to cabana8471-arch/autocoder that referenced this pull request Jan 30, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@api/database.py`:
- Around line 496-513: The code currently attempts session.execute(text("BEGIN
EXCLUSIVE")) when isolation_level == "EXCLUSIVE", but the SQLAlchemy "begin"
event hook already issues "BEGIN IMMEDIATE", causing a nested-transaction error;
fix by removing the explicit BEGIN EXCLUSIVE and do not support starting
EXCLUSIVE at the session level—either (preferred) map the "EXCLUSIVE" argument
to "IMMEDIATE" (treat isolation_level == "EXCLUSIVE" as "IMMEDIATE") in the
atomic_transaction/session creation logic, or immediately raise a clear
ValueError when callers pass "EXCLUSIVE"; update references to isolation_level
handling and remove session.execute(text("BEGIN EXCLUSIVE")) so the existing
begin event hook (which issues BEGIN IMMEDIATE) is the single source of BEGIN
semantics.
🧹 Nitpick comments (1)
test_atomic_operations.py (1)

218-280: Clarify: Test validates serialization, not parallel reads.

The test comment says "IMMEDIATE prevents stale reads" but the expected behavior (initial + 200) actually validates that transactions serialize properly - the second transaction waits for the first to complete, then reads the updated value.

This is correct behavior for BEGIN IMMEDIATE, but the test name/comments could be clearer. With IMMEDIATE, the second transaction blocks until the first commits, so each thread reads the previously committed value (not a stale value). This is serialization via write-lock contention, not snapshot isolation.

Suggested clarification in docstring
 class TestAtomicTransactionIsolation:
-    """Test that atomic_transaction with IMMEDIATE prevents stale reads."""
+    """Test that atomic_transaction with IMMEDIATE serializes concurrent writes."""

     def test_read_modify_write_isolation(self, temp_db):
-        """Verify IMMEDIATE transaction prevents stale read in read-modify-write."""
+        """Verify IMMEDIATE transactions serialize, preventing lost updates."""
         engine, session_maker = temp_db

-        # This test verifies that two concurrent read-modify-write operations
-        # don't both read the same value and create a conflict
+        # This test verifies that IMMEDIATE transactions serialize correctly.
+        # The second transaction blocks until the first commits, then reads
+        # the updated value. This prevents lost updates.

Comment on lines +496 to +513
    session = session_maker()
    try:
        # For EXCLUSIVE mode, override the default IMMEDIATE from event hooks
        # For IMMEDIATE mode, the event hooks handle BEGIN IMMEDIATE automatically
        if isolation_level == "EXCLUSIVE":
            session.execute(text("BEGIN EXCLUSIVE"))
        # Note: For IMMEDIATE, we don't issue BEGIN here - the event hook handles it
        # This prevents the fragile "BEGIN on already-begun transaction" issue
        yield session
        session.commit()
    except Exception:
        try:
            session.rollback()
        except Exception:
            pass  # Don't let rollback failure mask original error
        raise
    finally:
        session.close()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if EXCLUSIVE mode is actually used anywhere in the codebase
echo "=== Searching for EXCLUSIVE mode usage ==="
rg -n "EXCLUSIVE" --type=py -B 2 -A 2

echo -e "\n=== Searching for isolation_level assignments ==="
rg -n 'isolation_level' --type=py -B 2 -A 2

echo -e "\n=== Looking for atomic_transaction context manager ==="
rg -n "atomic_transaction\|@atomic_transaction" --type=py -B 2 -A 2 | head -50

Repository: leonvanzyl/autocoder



🏁 Script executed:

#!/bin/bash
# Find the event hook implementation
echo "=== Looking for do_begin event hook ==="
rg -n "do_begin\|event.listen\|before_cursor_execute" --type=py -B 3 -A 5



🏁 Script executed:

#!/bin/bash
# Check the database.py file structure to find event hook setup
echo "=== Examining api/database.py for event hooks and session setup ==="
head -100 api/database.py



🏁 Script executed:

#!/bin/bash
echo "=== Looking for _configure_sqlite_immediate_transactions function ==="
rg -n "_configure_sqlite_immediate_transactions" --type=py -A 20

echo -e "\n=== Searching for event.listen calls ==="
rg -n "event\.listen" --type=py -B 2 -A 5



🏁 Script executed:

#!/bin/bash
echo "=== Searching for BEGIN IMMEDIATE in database.py ==="
rg -n "BEGIN IMMEDIATE" api/database.py -B 3 -A 3

echo -e "\n=== Full context around lines 350-420 in api/database.py ==="
sed -n '350,420p' api/database.py



🏁 Script executed:

#!/bin/bash
echo "=== Check lines around the atomic_transaction function and event setup ==="
sed -n '460,520p' api/database.py



EXCLUSIVE mode causes "cannot start a transaction within a transaction" error.

When isolation_level == "EXCLUSIVE", the code executes session.execute(text("BEGIN EXCLUSIVE")) at line 501. However, SQLAlchemy's "begin" event (lines 367-370) fires and executes BEGIN IMMEDIATE before this statement runs. Since autocommit=False on the sessionmaker, attempting BEGIN EXCLUSIVE within an already-started transaction causes SQLite to error.

This affects the actual usage in feature_mcp.py line 564, which calls atomic_transaction(_session_maker, "EXCLUSIVE").

Proposed fix

Either:

  1. Disable EXCLUSIVE mode support and only allow IMMEDIATE:
     session = session_maker()
     try:
-        # For EXCLUSIVE mode, override the default IMMEDIATE from event hooks
-        # For IMMEDIATE mode, the event hooks handle BEGIN IMMEDIATE automatically
-        if isolation_level == "EXCLUSIVE":
-            session.execute(text("BEGIN EXCLUSIVE"))
-        # Note: For IMMEDIATE, we don't issue BEGIN here - the event hook handles it
-        # This prevents the fragile "BEGIN on already-begun transaction" issue
+        if isolation_level != "IMMEDIATE":
+            raise ValueError(f"Only IMMEDIATE isolation level is supported, got: {isolation_level}")
         yield session
  2. Or conditionally disable the event hook for EXCLUSIVE mode connections (requires connection-level isolation, which adds complexity).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

From:

    session = session_maker()
    try:
        # For EXCLUSIVE mode, override the default IMMEDIATE from event hooks
        # For IMMEDIATE mode, the event hooks handle BEGIN IMMEDIATE automatically
        if isolation_level == "EXCLUSIVE":
            session.execute(text("BEGIN EXCLUSIVE"))
        # Note: For IMMEDIATE, we don't issue BEGIN here - the event hook handles it
        # This prevents the fragile "BEGIN on already-begun transaction" issue
        yield session
        session.commit()
    except Exception:
        try:
            session.rollback()
        except Exception:
            pass  # Don't let rollback failure mask original error
        raise
    finally:
        session.close()

To:

    session = session_maker()
    try:
        if isolation_level != "IMMEDIATE":
            raise ValueError(f"Only IMMEDIATE isolation level is supported, got: {isolation_level}")
        yield session
        session.commit()
    except Exception:
        try:
            session.rollback()
        except Exception:
            pass  # Don't let rollback failure mask original error
        raise
    finally:
        session.close()

cabana8471-arch and others added 3 commits January 30, 2026 23:17
These imports are needed for the default_concurrency column and migration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add explicit encoding="utf-8" and errors="replace" to popen_kwargs
dict in parallel_orchestrator.py (3 locations) to fix Windows CP1252
encoding issues.

Closes leonvanzyl#138

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@leonvanzyl
Collaborator

Hey @cabana8471-arch — thanks for this PR and the thorough work on diagnosing the SQLite corruption issue in parallel mode. Your root cause analysis was spot-on: threading.Lock() is indeed useless across processes, and atomic SQL operations are the right fix.

We've gone ahead and implemented the core improvements from this PR directly on master, incorporating the best parts while addressing a few issues we found during review. Here's what we took and what we adjusted:

What we implemented (inspired by this PR)

  • SQLAlchemy 2.0 event hooks (do_connect + do_begin) for reliable BEGIN IMMEDIATE transactions in api/database.py
  • Atomic UPDATE ... WHERE operations across all feature MCP tools (feature_mark_passing, feature_mark_in_progress, feature_claim_and_get, feature_skip, etc.)
  • Removed threading.Lock() (_priority_lock) from feature_mcp.py — replaced with SQL-level atomicity
  • atomic_transaction() context manager for multi-statement operations (feature_create_bulk, feature_add_dependency, etc.)
  • WAL checkpoint on shutdown + idempotent cleanup() method in parallel_orchestrator.py
  • Async-safe signal handler (flag-only, no I/O) with atexit registration
  • contextlib.closing() for connection handling in progress.py
  • Thread-safe engine cache with double-checked locking in assistant_database.py
  • DeclarativeBase migration in registry.py, database.py, and assistant_database.py
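The do_connect/do_begin pair in the first bullet follows the recipe from the SQLAlchemy SQLite dialect documentation; a minimal sketch, with an in-memory URL and a features table standing in for the real schema:

```python
from sqlalchemy import create_engine, event, text

engine = create_engine("sqlite://")  # in-memory here; real code points at the features db

@event.listens_for(engine, "connect")
def do_connect(dbapi_connection, connection_record):
    # Stop pysqlite from emitting its own (late, implicit) BEGIN so that the
    # begin hook below fully controls transaction start.
    dbapi_connection.isolation_level = None

@event.listens_for(engine, "begin")
def do_begin(conn):
    # Take the write lock up front: concurrent writers now wait at BEGIN
    # instead of failing mid-transaction with "database is locked".
    conn.exec_driver_sql("BEGIN IMMEDIATE")

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE features (id INTEGER PRIMARY KEY, passes INTEGER DEFAULT 0)"))
```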

What we fixed vs this PR

  1. Removed BEGIN EXCLUSIVE support entirely — The do_begin event hook fires BEGIN IMMEDIATE for every new transaction. When atomic_transaction(session_maker, "EXCLUSIVE") then executes BEGIN EXCLUSIVE, it creates a nested transaction that throws OperationalError: cannot start a transaction within a transaction. Since IMMEDIATE and EXCLUSIVE provide identical write-lock guarantees in WAL mode, we use IMMEDIATE everywhere. This was the main blocker for merging.

  2. Added WHERE passes = 0 state guard to feature_mark_passing — The PR's version used UPDATE features SET passes = 1, in_progress = 0 WHERE id = :id without checking current state, allowing duplicate success reports. We added AND passes = 0 and check rowcount for already-passing detection.
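The guarded UPDATE in point 2 is a single-statement compare-and-swap; a sketch with the stdlib driver, where the schema and helper name are assumed for illustration:

```python
import sqlite3

def mark_passing(conn, feature_id):
    """Return True if we flipped the feature to passing, False if it already was."""
    cur = conn.execute(
        "UPDATE features SET passes = 1, in_progress = 0 "
        "WHERE id = :id AND passes = 0",   # state guard: repeat calls match zero rows
        {"id": feature_id},
    )
    return cur.rowcount == 1  # 0 rows touched => duplicate success report

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id INTEGER PRIMARY KEY, passes INTEGER, in_progress INTEGER)")
conn.execute("INSERT INTO features VALUES (1, 0, 1)")
```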

What we left out

  • Cross-process tests — The test suite uses threading.Barrier / ThreadPoolExecutor, which validates SQL-level atomicity but doesn't exercise the actual multi-process scenario from issue #106 ([Bug] Database gets corrupted from time to time since multi agents). We may add multiprocessing-based tests separately.
  • The full test file — Since we adjusted several implementations (no EXCLUSIVE, added state guards), the tests would need rework. We'll write tests tailored to our implementation.
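A multiprocessing-based test of the kind mentioned above might look like this sketch (fork start method forced, so Unix-only; the schema and claim query are assumptions, not the project's real suite):

```python
import multiprocessing as mp
import os
import sqlite3
import tempfile

def try_claim(db_path, results):
    # Each worker process races to claim feature 1 with an atomic compare-and-swap.
    conn = sqlite3.connect(db_path, timeout=10, isolation_level=None)
    try:
        cur = conn.execute(
            "UPDATE features SET in_progress = 1 WHERE id = 1 AND in_progress = 0"
        )
        results.put(cur.rowcount)  # 1 = claimed, 0 = lost the race
    finally:
        conn.close()

def run_race(num_workers=8):
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    setup = sqlite3.connect(db_path)
    setup.execute("PRAGMA journal_mode=WAL")
    setup.execute("CREATE TABLE features (id INTEGER PRIMARY KEY, in_progress INTEGER)")
    setup.execute("INSERT INTO features VALUES (1, 0)")
    setup.commit()
    setup.close()

    ctx = mp.get_context("fork")  # Unix-only; keeps this sketch free of re-import issues
    results = ctx.Queue()
    workers = [ctx.Process(target=try_claim, args=(db_path, results)) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    outcomes = [results.get() for _ in range(num_workers)]
    os.unlink(db_path)
    return outcomes
```

Exactly one worker should observe rowcount 1, no matter how the processes interleave.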

Thanks again for identifying these issues and proposing solid solutions. The atomic SQL pattern and event hook approach were exactly right, and this PR was instrumental in getting these fixes onto master. We're going to close this PR since the changes have been incorporated, but your contribution is very much appreciated! 🙏

@leonvanzyl
Collaborator

Closing this PR as the core improvements have been implemented directly on master (commit d8a8c83). See the detailed comment above for what was incorporated. Thanks again for the excellent work on this!

@leonvanzyl leonvanzyl closed this Feb 1, 2026
leonvanzyl added a commit that referenced this pull request Feb 1, 2026
Replace ineffective threading.Lock() with atomic SQL operations for
cross-process safety. Key changes:

- Add SQLAlchemy event hooks (do_connect/do_begin) for BEGIN IMMEDIATE
  transactions in api/database.py
- Add atomic_transaction() context manager for multi-statement ops
- Convert all feature MCP write operations to atomic UPDATE...WHERE
  with compare-and-swap patterns (feature_claim, mark_passing, etc.)
- Add WHERE passes=0 state guard to feature_mark_passing
- Add WAL checkpoint on shutdown and idempotent cleanup() in
  parallel_orchestrator.py with async-safe signal handling
- Wrap SQLite connections with contextlib.closing() in progress.py
- Add thread-safe engine cache with double-checked locking in
  assistant_database.py
- Migrate to SQLAlchemy 2.0 DeclarativeBase across all modules

Inspired by PR #108 (cabana8471-arch), with fixes for nested
BEGIN EXCLUSIVE bug and missing state guards.

Closes #106

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
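The shutdown pieces in this commit (idempotent cleanup, WAL checkpoint, flag-only signal handler) compose roughly as in this sketch, which uses the stdlib driver and invented names rather than the orchestrator's real API:

```python
import atexit
import signal
import sqlite3

class Orchestrator:
    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self._cleaned_up = False
        self._shutdown_requested = False
        atexit.register(self.cleanup)
        signal.signal(signal.SIGTERM, self._on_signal)
        signal.signal(signal.SIGINT, self._on_signal)

    def _on_signal(self, signum, frame):
        # Async-safe: only set a flag; the main loop polls it and calls cleanup().
        self._shutdown_requested = True

    def cleanup(self):
        if self._cleaned_up:  # idempotent: atexit and a finally block may both call this
            return
        self._cleaned_up = True
        try:
            # Fold the WAL back into the main db file so a later reader never
            # inherits a stale -wal left behind by a killed process.
            self.conn.execute("PRAGMA wal_checkpoint(FULL)")
        finally:
            self.conn.close()
```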