
fix: DB lock resilience + MLX health recovery#44

Merged
EtanHey merged 1 commit into main from fix/enrichment-db-lock-mlx-health
Feb 27, 2026
Conversation


EtanHey (Owner) commented Feb 27, 2026

Summary

  • DB Lock: APSW's bestpractice connection_optimize hook runs PRAGMA optimize inside the Connection() constructor, before setbusytimeout() can be called. Registered a connection_hook that sets busy_timeout=10s before the bestpractice hooks fire, and added a retry with exponential backoff (5 attempts) around _init_db().
  • MLX Health: Added inter-batch health checks, auto-restart of MLX server when it dies mid-run, batch fail ratio detection (>80% = pause + health check), and better error categorization (ConnectionError vs Timeout).
  • 14 new tests, 541 total passing

Test plan

  • test_db_lock_resilience.py — 5 tests for retry, concurrent init, WAL mode
  • test_mlx_health_recovery.py — 9 tests for health checks, error categorization, fail ratio detection
  • Full test suite: 541 passed, 0 regressions

🤖 Generated with Claude Code


Note

Medium Risk
Changes DB connection/initialization behavior and enrichment run control flow (including optional MLX subprocess restarts), which could affect reliability under contention or during backend failures.

Overview
Improves enrichment robustness when the LLM backend degrades: adds check_backend_health, distinguishes MLX ConnectionError vs Timeout, detects high per-batch failure ratios, and attempts MLX auto-restart/recovery before aborting after circuit-breaker events (with new env tunables like BATCH_FAIL_RATIO_THRESHOLD and restart timing).
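The batch-fail-ratio gating described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper names (batch_fail_ratio, should_pause_for_health_check, run_batches) are stand-ins; only the BATCH_FAIL_RATIO_THRESHOLD env name and the >80% default come from the PR description.

```python
BATCH_FAIL_RATIO_THRESHOLD = 0.8  # PR default: >80% of a batch failing


def batch_fail_ratio(results: list) -> float:
    """Fraction of items in a batch that failed (False = failure)."""
    if not results:
        return 0.0
    return sum(1 for ok in results if not ok) / len(results)


def should_pause_for_health_check(results, threshold=BATCH_FAIL_RATIO_THRESHOLD):
    """Trigger a backend health check when most of a batch failed."""
    return batch_fail_ratio(results) > threshold


def run_batches(batches, call_backend, probe, restart):
    """Process batches; on a high fail ratio, probe the backend, try a restart,
    and stop cleanly if recovery fails (instead of grinding through failures)."""
    for batch in batches:
        results = [call_backend(item) for item in batch]
        if should_pause_for_health_check(results):
            if not probe():        # backend looks dead
                if not restart():  # auto-restart failed -> stop cleanly
                    return "stopped: backend unrecoverable"
    return "completed"
```

The design point is that a high per-batch failure ratio is treated as a signal about the backend's health rather than about the data, so the pipeline pauses and probes instead of burning through the queue.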

Hardens SQLite/APSW startup under contention by setting busy_timeout via an early apsw.connection_hooks hook, applying setbusytimeout() immediately on new connections, and retrying VectorStore initialization with exponential backoff on BusyError; adds tests covering retry behavior, concurrent init, WAL mode, and MLX recovery paths.
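The retry-with-backoff behavior can be illustrated with a minimal sketch. BusyError below is a stand-in for apsw.BusyError, and init_db_with_retry is a hypothetical shape for the PR's _init_db_with_retry (the real signature may differ); the 5 attempts and 0.5s–8s window come from the PR description.

```python
import time


class BusyError(Exception):
    """Stand-in for apsw.BusyError (raised on SQLITE_BUSY under contention)."""


def init_db_with_retry(init_fn, attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call init_fn, retrying with exponential backoff on BusyError."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return init_fn()
        except BusyError:
            if attempt == attempts:
                raise                          # retries exhausted: propagate
            sleep(delay)
            delay = min(delay * 2, max_delay)  # 0.5 -> 1 -> 2 -> 4 -> capped at 8
```

Injecting `sleep` as a parameter is what makes the backoff schedule testable without real delays, which is presumably how the retry tests verify the 0.5s/1s/2s progression.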

Written by Cursor Bugbot for commit 14fa56a.

Summary by CodeRabbit

  • New Features

    • Added automatic backend health checks and recovery mechanisms for improved system reliability.
    • Added configurable auto-restart for backends with customizable wait periods.
    • Added periodic statistics syncing for enhanced monitoring.
  • Bug Fixes

    • Improved database lock resilience with automatic retry during initialization.
    • Enhanced error handling with better connection failure logging.
    • Enhanced circuit-breaker recovery with automatic backend recovery attempts.

Two critical fixes for bugs that prevented enrichment from running:

1. DB Lock (CRITICAL): APSW bestpractice hooks run PRAGMA optimize inside the
   Connection() constructor — before setbusytimeout() can be called.
   Fix: register a connection_hook that sets busy_timeout=10s BEFORE the
   bestpractice hooks fire. Also wrap _init_db() in retry with exponential
   backoff (5 attempts, 0.5s-8s) for DDL contention.

2. MLX Health: When MLX server dies mid-run, pipeline now detects it via
   inter-batch health checks, attempts auto-restart, and stops cleanly
   if recovery fails. Also: batch fail ratio detection (>80% = pause),
   better error categorization (ConnectionError vs Timeout).

Tests: 14 new tests, 541 total passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EtanHey merged commit b7a0c5a into main Feb 27, 2026
1 of 6 checks passed

coderabbitai Bot commented Feb 27, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58371cb and 14fa56a.

📒 Files selected for processing (4)
  • src/brainlayer/pipeline/enrichment.py
  • src/brainlayer/vector_store.py
  • tests/test_db_lock_resilience.py
  • tests/test_mlx_health_recovery.py

📝 Walkthrough

Walkthrough

This PR introduces health monitoring and recovery orchestration for backend services in the enrichment pipeline, adds database lock resilience with retry logic and busy timeout handling to the vector store, and includes comprehensive test coverage for both features.

Changes

Cohort / File(s) Summary
Backend Health & Recovery Orchestration
src/brainlayer/pipeline/enrichment.py
Added health check function check_backend_health() to probe MLX, Ollama, and Groq reachability. Introduced recovery helpers: _try_restart_mlx() for server restart detection, _recover_backend() for orchestrated recovery attempts, and _check_fallback_available() for fallback validation. Enhanced circuit-breaker and batch-failure handling to attempt recovery before stopping. Integrated _sync_stats_to_supabase() for periodic stats syncing. Added environment-driven toggles for MLX auto-restart, restart wait period, batch fail ratio thresholds, and health check pause behavior. Improved MLX error logging for connection and timeout errors.
Database Lock Resilience & Timeout
src/brainlayer/vector_store.py
Introduced module-level busy timeout hook to set APSW busy timeout to 10 seconds. Added retry mechanism (_init_db_with_retry()) with exponential backoff for database initialization on BusyError. Enhanced _init_db() to apply busy timeout immediately and enforce WAL journaling on every initialization.
Database Lock Resilience Tests
tests/test_db_lock_resilience.py
New test module validating VectorStore initialization resilience: successful first-try initialization, retry-on-BusyError with backoff verification, BusyError propagation after max retries, WAL journal mode verification via PRAGMA, and concurrent multi-instance initialization without errors.
MLX Health & Recovery Tests
tests/test_mlx_health_recovery.py
New test module with three test suites: TestCheckBackendHealth validates health checks for MLX, Ollama, and Groq (with/without API key); TestCallMlxErrorCategorization verifies distinct logging of connection errors vs. timeouts; TestHighFailRatioBehavior confirms health check triggering on high fail ratio and circuit breaker recovery flow.
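The busy-timeout and WAL enforcement described for vector_store.py can be checked with the stdlib sqlite3 module. This is a sketch under assumptions: the PR itself uses APSW, where setbusytimeout(10_000) is the equivalent of the busy_timeout pragma below.

```python
import os
import sqlite3
import tempfile

# Open a throwaway database file in autocommit mode so PRAGMAs apply directly.
path = os.path.join(tempfile.mkdtemp(), "store.db")
con = sqlite3.connect(path, isolation_level=None)

# Equivalent of APSW's setbusytimeout(10_000): wait up to 10s on SQLITE_BUSY
# instead of failing immediately when another connection holds the lock.
con.execute("PRAGMA busy_timeout = 10000")
timeout_ms = con.execute("PRAGMA busy_timeout").fetchone()[0]

# Enforce WAL journaling; the pragma returns the mode actually in effect,
# which is how the WAL test in test_db_lock_resilience.py can verify it.
mode = con.execute("PRAGMA journal_mode = WAL").fetchone()[0]
con.close()
```

WAL mode is persistent per database file, but a busy timeout is per connection, which is why the PR applies it on every new connection via a hook rather than once at setup.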

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant EnrichmentService as Enrichment<br/>Service
    participant HealthChecker as Health<br/>Check
    participant MLXBackend as MLX<br/>Backend
    participant CircuitBreaker as Circuit<br/>Breaker
    participant Recovery as Recovery<br/>Orchestrator

    Client->>EnrichmentService: run_enrichment(batch)
    EnrichmentService->>EnrichmentService: process batch
    
    alt Batch Fail Ratio Exceeds Threshold
        EnrichmentService->>HealthChecker: check_backend_health(mlx)
        HealthChecker->>MLXBackend: health probe (HTTP)
        alt Backend Unhealthy
            MLXBackend-->>HealthChecker: connection error
            HealthChecker-->>EnrichmentService: health=False
            EnrichmentService->>Recovery: _recover_backend(mlx)
            Recovery->>MLXBackend: attempt restart
            alt Restart Succeeds
                MLXBackend-->>Recovery: ready
                Recovery-->>EnrichmentService: recovery=True
                EnrichmentService->>EnrichmentService: reset state & continue
            else Restart Fails
                MLXBackend-->>Recovery: timeout/error
                Recovery-->>EnrichmentService: recovery=False
                EnrichmentService->>EnrichmentService: pause/stop pipeline
            end
        else Backend Healthy
            MLXBackend-->>HealthChecker: OK
            HealthChecker-->>EnrichmentService: health=True
        end
    end
    
    alt Circuit Breaker Triggered
        EnrichmentService->>CircuitBreaker: circuit opened
        CircuitBreaker-->>EnrichmentService: breaker open
        EnrichmentService->>HealthChecker: check_backend_health(mlx)
        HealthChecker->>MLXBackend: health probe
        EnrichmentService->>Recovery: _recover_backend(mlx)
        Recovery->>MLXBackend: attempt restart
        alt Recovery Succeeds
            MLXBackend-->>Recovery: ready
            Recovery-->>CircuitBreaker: reset circuit
            CircuitBreaker-->>EnrichmentService: breaker closed
        else Recovery Fails
            Recovery-->>EnrichmentService: stop
        end
    end
    
    EnrichmentService-->>Client: enriched batch or error

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hoppy healthchecks bound through the night,
Backends recovered with all their might,
DB locks tamed with a timeout embrace,
Recovery dances at a speedy pace,
Resilience hops where errors once were!




cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


# Register busy_timeout hook BEFORE bestpractice hooks so it fires first.
# bestpractice.apply() adds hooks that run PRAGMA optimize inside Connection(),
# which needs busy_timeout active or it crashes under contention.
apsw.connection_hooks.insert(0, _set_busy_timeout_hook)

Bestpractice hook overrides custom busy_timeout to 100ms

Medium Severity

The _set_busy_timeout_hook inserted at position 0 sets busy_timeout to 10s, but apsw.bestpractice.recommended includes connection_busy_timeout which resets it to 100ms (its default duration_ms). Since bestpractice hooks are appended after position 0, connection_busy_timeout fires after the custom hook, overriding 10s back to 100ms before connection_optimize runs. The stated fix — protecting PRAGMA optimize during the Connection() constructor with a 10s timeout — doesn't achieve its goal.
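The override Bugbot describes can be demonstrated with a simulated hooks list. Everything below is a stand-in (FakeConnection, the hook names, the list itself), not real apsw API; it only models "hooks run in registration order on each new connection."

```python
# Simulated apsw-style connection hooks: a list of callables that every new
# connection runs in order. All names here are stand-ins, not real apsw APIs.
connection_hooks = []

class FakeConnection:
    def __init__(self):
        self.busy_timeout_ms = 0
        for hook in connection_hooks:
            hook(self)

def custom_busy_timeout(con):        # the PR's hook: 10 seconds
    con.busy_timeout_ms = 10_000

def bestpractice_busy_timeout(con):  # bestpractice default: 100 ms
    con.busy_timeout_ms = 100

# The PR inserts its hook at position 0 ...
connection_hooks.insert(0, custom_busy_timeout)
# ... but bestpractice.apply() appends its own busy-timeout hook afterwards:
connection_hooks.append(bestpractice_busy_timeout)

con = FakeConnection()
print(con.busy_timeout_ms)  # 100 — the 10s setting was silently overridden
```

One possible remedy is registering the custom hook after bestpractice's so it runs last, or configuring bestpractice's own busy-timeout duration instead of competing with it; which is correct depends on where apsw runs connection_optimize relative to the other hooks.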


return None
except requests.exceptions.Timeout as e:
print(f" MLX timeout ({timeout}s): {e}", file=sys.stderr)
return None

ConnectTimeout caught as ConnectionError, not Timeout

Low Severity

requests.exceptions.ConnectTimeout inherits from both ConnectionError and Timeout. Since the ConnectionError except clause appears first, connect timeouts are caught there and logged as "MLX connection error (server dead?)" instead of the Timeout handler's "MLX timeout" message. This defeats the PR's goal of better error categorization between connection errors and timeouts.
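The except-ordering pitfall can be reproduced with a stand-in hierarchy that mirrors requests.exceptions (where ConnectTimeout really does inherit from both ConnectionError and Timeout). The categorize_* helpers are hypothetical; one possible fix is shown, catching the dual-inheritance case explicitly before either parent.

```python
# Stand-in hierarchy mirroring requests.exceptions: ConnectTimeout inherits
# from BOTH the connection-error and timeout branches.
class ReqConnectionError(Exception): pass
class Timeout(Exception): pass
class ConnectTimeout(ReqConnectionError, Timeout): pass

def categorize_buggy(exc):
    try:
        raise exc
    except ReqConnectionError:  # listed first, so ConnectTimeout lands here
        return "connection"
    except Timeout:
        return "timeout"

def categorize_fixed(exc):
    try:
        raise exc
    except ConnectTimeout:      # handle the dual-inheritance case explicitly
        return "connect-timeout"
    except ReqConnectionError:
        return "connection"
    except Timeout:
        return "timeout"
```

Whether a connect timeout should count as "server dead" or "slow server" is a judgment call, but catching it explicitly makes the choice deliberate instead of an accident of clause order.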

