
fix: DB lock resilience + MLX health recovery#44

Merged
EtanHey merged 1 commit into main from fix/enrichment-db-lock-mlx-health
Feb 27, 2026
Conversation


EtanHey (Owner) commented Feb 27, 2026

Summary

  • DB Lock: APSW's bestpractice connection_optimize hook runs PRAGMA optimize inside the Connection() constructor, before setbusytimeout() can be called. Registered a connection_hook that sets busy_timeout=10s before the bestpractice hooks fire, and added a retry with exponential backoff (5 attempts) around _init_db().
  • MLX Health: Added inter-batch health checks, auto-restart of MLX server when it dies mid-run, batch fail ratio detection (>80% = pause + health check), and better error categorization (ConnectionError vs Timeout).
  • 14 new tests, 541 total passing

Test plan

  • test_db_lock_resilience.py — 5 tests for retry, concurrent init, WAL mode
  • test_mlx_health_recovery.py — 9 tests for health checks, error categorization, fail ratio detection
  • Full test suite: 541 passed, 0 regressions

🤖 Generated with Claude Code


Note

Medium Risk
Changes DB connection/initialization behavior and enrichment run control flow (including optional MLX subprocess restarts), which could affect reliability under contention or during backend failures.

Overview
Improves enrichment robustness when the LLM backend degrades: adds check_backend_health, distinguishes MLX ConnectionError vs Timeout, detects high per-batch failure ratios, and attempts MLX auto-restart/recovery before aborting after circuit-breaker events (with new env tunables like BATCH_FAIL_RATIO_THRESHOLD and restart timing).
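The batch-fail-ratio gating described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper names (batch_fail_ratio, should_pause_for_health_check, run_batches) are stand-ins; only the BATCH_FAIL_RATIO_THRESHOLD env name and the >80% default come from the PR description.

```python
BATCH_FAIL_RATIO_THRESHOLD = 0.8  # PR default: >80% of a batch failing


def batch_fail_ratio(results: list) -> float:
    """Fraction of items in a batch that failed (False = failure)."""
    if not results:
        return 0.0
    return sum(1 for ok in results if not ok) / len(results)


def should_pause_for_health_check(results, threshold=BATCH_FAIL_RATIO_THRESHOLD):
    """Trigger a backend health check when most of a batch failed."""
    return batch_fail_ratio(results) > threshold


def run_batches(batches, call_backend, probe, restart):
    """Process batches; on a high fail ratio, probe the backend, try a restart,
    and stop cleanly if recovery fails (instead of grinding through failures)."""
    for batch in batches:
        results = [call_backend(item) for item in batch]
        if should_pause_for_health_check(results):
            if not probe():        # backend looks dead
                if not restart():  # auto-restart failed -> stop cleanly
                    return "stopped: backend unrecoverable"
    return "completed"
```

The design point is that a high per-batch failure ratio is treated as a signal about the backend's health rather than about the data, so the pipeline pauses and probes instead of burning through the queue.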

Hardens SQLite/APSW startup under contention by setting busy_timeout via an early apsw.connection_hooks hook, applying setbusytimeout() immediately on new connections, and retrying VectorStore initialization with exponential backoff on BusyError; adds tests covering retry behavior, concurrent init, WAL mode, and MLX recovery paths.
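The retry-with-backoff behavior can be illustrated with a minimal sketch. BusyError below is a stand-in for apsw.BusyError, and init_db_with_retry is a hypothetical shape for the PR's _init_db_with_retry (the real signature may differ); the 5 attempts and 0.5s–8s window come from the PR description.

```python
import time


class BusyError(Exception):
    """Stand-in for apsw.BusyError (raised on SQLITE_BUSY under contention)."""


def init_db_with_retry(init_fn, attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call init_fn, retrying with exponential backoff on BusyError."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return init_fn()
        except BusyError:
            if attempt == attempts:
                raise                          # retries exhausted: propagate
            sleep(delay)
            delay = min(delay * 2, max_delay)  # 0.5 -> 1 -> 2 -> 4 -> capped at 8
```

Injecting `sleep` as a parameter is what makes the backoff schedule testable without real delays, which is presumably how the retry tests verify the 0.5s/1s/2s progression.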

Written by Cursor Bugbot for commit 14fa56a.

Summary by CodeRabbit

  • New Features

    • Added automatic backend health checks and recovery mechanisms for improved system reliability.
    • Added configurable auto-restart for backends with customizable wait periods.
    • Added periodic statistics syncing for enhanced monitoring.
  • Bug Fixes

    • Improved database lock resilience with automatic retry during initialization.
    • Enhanced error handling with better connection failure logging.
    • Enhanced circuit-breaker recovery with automatic backend recovery attempts.

Two critical fixes for bugs that prevented enrichment from running:

1. DB Lock (CRITICAL): APSW bestpractice hooks run PRAGMA optimize inside the
   Connection() constructor — before setbusytimeout() can be called.
   Fix: register a connection_hook that sets busy_timeout=10s BEFORE the
   bestpractice hooks fire. Also wrap _init_db() in retry with exponential
   backoff (5 attempts, 0.5s-8s) for DDL contention.

2. MLX Health: When MLX server dies mid-run, pipeline now detects it via
   inter-batch health checks, attempts auto-restart, and stops cleanly
   if recovery fails. Also: batch fail ratio detection (>80% = pause),
   better error categorization (ConnectionError vs Timeout).

Tests: 14 new tests, 541 total passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EtanHey merged commit b7a0c5a into main Feb 27, 2026
1 of 6 checks passed

coderabbitai Bot commented Feb 27, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58371cb and 14fa56a.

📒 Files selected for processing (4)
  • src/brainlayer/pipeline/enrichment.py
  • src/brainlayer/vector_store.py
  • tests/test_db_lock_resilience.py
  • tests/test_mlx_health_recovery.py

📝 Walkthrough

Walkthrough

This PR introduces health monitoring and recovery orchestration for backend services in the enrichment pipeline, adds database lock resilience with retry logic and busy timeout handling to the vector store, and includes comprehensive test coverage for both features.

Changes

Cohort / File(s) Summary
Backend Health & Recovery Orchestration
src/brainlayer/pipeline/enrichment.py
Added health check function check_backend_health() to probe MLX, Ollama, and Groq reachability. Introduced recovery helpers: _try_restart_mlx() for server restart detection, _recover_backend() for orchestrated recovery attempts, and _check_fallback_available() for fallback validation. Enhanced circuit-breaker and batch-failure handling to attempt recovery before stopping. Integrated _sync_stats_to_supabase() for periodic stats syncing. Added environment-driven toggles for MLX auto-restart, restart wait period, batch fail ratio thresholds, and health check pause behavior. Improved MLX error logging for connection and timeout errors.
Database Lock Resilience & Timeout
src/brainlayer/vector_store.py
Introduced module-level busy timeout hook to set APSW busy timeout to 10 seconds. Added retry mechanism (_init_db_with_retry()) with exponential backoff for database initialization on BusyError. Enhanced _init_db() to apply busy timeout immediately and enforce WAL journaling on every initialization.
Database Lock Resilience Tests
tests/test_db_lock_resilience.py
New test module validating VectorStore initialization resilience: successful first-try initialization, retry-on-BusyError with backoff verification, BusyError propagation after max retries, WAL journal mode verification via PRAGMA, and concurrent multi-instance initialization without errors.
MLX Health & Recovery Tests
tests/test_mlx_health_recovery.py
New test module with three test suites: TestCheckBackendHealth validates health checks for MLX, Ollama, and Groq (with/without API key); TestCallMlxErrorCategorization verifies distinct logging of connection errors vs. timeouts; TestHighFailRatioBehavior confirms health check triggering on high fail ratio and circuit breaker recovery flow.
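The busy-timeout and WAL enforcement described for vector_store.py can be checked with the stdlib sqlite3 module. This is a sketch under assumptions: the PR itself uses APSW, where setbusytimeout(10_000) is the equivalent of the busy_timeout pragma below.

```python
import os
import sqlite3
import tempfile

# Open a throwaway database file in autocommit mode so PRAGMAs apply directly.
path = os.path.join(tempfile.mkdtemp(), "store.db")
con = sqlite3.connect(path, isolation_level=None)

# Equivalent of APSW's setbusytimeout(10_000): wait up to 10s on SQLITE_BUSY
# instead of failing immediately when another connection holds the lock.
con.execute("PRAGMA busy_timeout = 10000")
timeout_ms = con.execute("PRAGMA busy_timeout").fetchone()[0]

# Enforce WAL journaling; the pragma returns the mode actually in effect,
# which is how the WAL test in test_db_lock_resilience.py can verify it.
mode = con.execute("PRAGMA journal_mode = WAL").fetchone()[0]
con.close()
```

WAL mode is persistent per database file, but a busy timeout is per connection, which is why the PR applies it on every new connection via a hook rather than once at setup.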

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant EnrichmentService as Enrichment<br/>Service
    participant HealthChecker as Health<br/>Check
    participant MLXBackend as MLX<br/>Backend
    participant CircuitBreaker as Circuit<br/>Breaker
    participant Recovery as Recovery<br/>Orchestrator

    Client->>EnrichmentService: run_enrichment(batch)
    EnrichmentService->>EnrichmentService: process batch
    
    alt Batch Fail Ratio Exceeds Threshold
        EnrichmentService->>HealthChecker: check_backend_health(mlx)
        HealthChecker->>MLXBackend: health probe (HTTP)
        alt Backend Unhealthy
            MLXBackend-->>HealthChecker: connection error
            HealthChecker-->>EnrichmentService: health=False
            EnrichmentService->>Recovery: _recover_backend(mlx)
            Recovery->>MLXBackend: attempt restart
            alt Restart Succeeds
                MLXBackend-->>Recovery: ready
                Recovery-->>EnrichmentService: recovery=True
                EnrichmentService->>EnrichmentService: reset state & continue
            else Restart Fails
                MLXBackend-->>Recovery: timeout/error
                Recovery-->>EnrichmentService: recovery=False
                EnrichmentService->>EnrichmentService: pause/stop pipeline
            end
        else Backend Healthy
            MLXBackend-->>HealthChecker: OK
            HealthChecker-->>EnrichmentService: health=True
        end
    end
    
    alt Circuit Breaker Triggered
        EnrichmentService->>CircuitBreaker: circuit opened
        CircuitBreaker-->>EnrichmentService: breaker open
        EnrichmentService->>HealthChecker: check_backend_health(mlx)
        HealthChecker->>MLXBackend: health probe
        EnrichmentService->>Recovery: _recover_backend(mlx)
        Recovery->>MLXBackend: attempt restart
        alt Recovery Succeeds
            MLXBackend-->>Recovery: ready
            Recovery-->>CircuitBreaker: reset circuit
            CircuitBreaker-->>EnrichmentService: breaker closed
        else Recovery Fails
            Recovery-->>EnrichmentService: stop
        end
    end
    
    EnrichmentService-->>Client: enriched batch or error

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hoppy healthchecks bound through the night,
Backends recovered with all their might,
DB locks tamed with a timeout embrace,
Recovery dances at a speedy pace,
Resilience hops where errors once were!




cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


# Register busy_timeout hook BEFORE bestpractice hooks so it fires first.
# bestpractice.apply() adds hooks that run PRAGMA optimize inside Connection(),
# which needs busy_timeout active or it crashes under contention.
apsw.connection_hooks.insert(0, _set_busy_timeout_hook)

Bestpractice hook overrides custom busy_timeout to 100ms

Medium Severity

The _set_busy_timeout_hook inserted at position 0 sets busy_timeout to 10s, but apsw.bestpractice.recommended includes connection_busy_timeout which resets it to 100ms (its default duration_ms). Since bestpractice hooks are appended after position 0, connection_busy_timeout fires after the custom hook, overriding 10s back to 100ms before connection_optimize runs. The stated fix — protecting PRAGMA optimize during the Connection() constructor with a 10s timeout — doesn't achieve its goal.
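The override Bugbot describes can be demonstrated with a simulated hooks list. Everything below is a stand-in (FakeConnection, the hook names, the list itself), not real apsw API; it only models "hooks run in registration order on each new connection."

```python
# Simulated apsw-style connection hooks: a list of callables that every new
# connection runs in order. All names here are stand-ins, not real apsw APIs.
connection_hooks = []

class FakeConnection:
    def __init__(self):
        self.busy_timeout_ms = 0
        for hook in connection_hooks:
            hook(self)

def custom_busy_timeout(con):        # the PR's hook: 10 seconds
    con.busy_timeout_ms = 10_000

def bestpractice_busy_timeout(con):  # bestpractice default: 100 ms
    con.busy_timeout_ms = 100

# The PR inserts its hook at position 0 ...
connection_hooks.insert(0, custom_busy_timeout)
# ... but bestpractice.apply() appends its own busy-timeout hook afterwards:
connection_hooks.append(bestpractice_busy_timeout)

con = FakeConnection()
print(con.busy_timeout_ms)  # 100 — the 10s setting was silently overridden
```

One possible remedy is registering the custom hook after bestpractice's so it runs last, or configuring bestpractice's own busy-timeout duration instead of competing with it; which is correct depends on where apsw runs connection_optimize relative to the other hooks.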


return None
except requests.exceptions.Timeout as e:
print(f" MLX timeout ({timeout}s): {e}", file=sys.stderr)
return None

ConnectTimeout caught as ConnectionError, not Timeout

Low Severity

requests.exceptions.ConnectTimeout inherits from both ConnectionError and Timeout. Since the ConnectionError except clause appears first, connect timeouts are caught there and logged as "MLX connection error (server dead?)" instead of the Timeout handler's "MLX timeout" message. This defeats the PR's goal of better error categorization between connection errors and timeouts.
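The except-ordering pitfall can be reproduced with a stand-in hierarchy that mirrors requests.exceptions (where ConnectTimeout really does inherit from both ConnectionError and Timeout). The categorize_* helpers are hypothetical; one possible fix is shown, catching the dual-inheritance case explicitly before either parent.

```python
# Stand-in hierarchy mirroring requests.exceptions: ConnectTimeout inherits
# from BOTH the connection-error and timeout branches.
class ReqConnectionError(Exception): pass
class Timeout(Exception): pass
class ConnectTimeout(ReqConnectionError, Timeout): pass

def categorize_buggy(exc):
    try:
        raise exc
    except ReqConnectionError:  # listed first, so ConnectTimeout lands here
        return "connection"
    except Timeout:
        return "timeout"

def categorize_fixed(exc):
    try:
        raise exc
    except ConnectTimeout:      # handle the dual-inheritance case explicitly
        return "connect-timeout"
    except ReqConnectionError:
        return "connection"
    except Timeout:
        return "timeout"
```

Whether a connect timeout should count as "server dead" or "slow server" is a judgment call, but catching it explicitly makes the choice deliberate instead of an accident of clause order.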

