fix: DB lock resilience + MLX health recovery #44
Two critical fixes that prevented enrichment from running:

1. **DB Lock (CRITICAL):** APSW bestpractice hooks run `PRAGMA optimize` inside the `Connection()` constructor — before `setbusytimeout()` can be called. Fix: register a `connection_hook` that sets `busy_timeout=10s` BEFORE the bestpractice hooks fire. Also wrap `_init_db()` in a retry with exponential backoff (5 attempts, 0.5s–8s) for DDL contention.
2. **MLX Health:** When the MLX server dies mid-run, the pipeline now detects it via inter-batch health checks, attempts an auto-restart, and stops cleanly if recovery fails.

Also: batch fail-ratio detection (>80% failed = pause) and better error categorization (`ConnectionError` vs `Timeout`).

Tests: 14 new tests, 541 total passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
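The retry-with-backoff behavior described above can be sketched as follows. This is a minimal illustration, not the PR's actual `_init_db()` wrapper; the function name and parameters are hypothetical, but the delay schedule (5 attempts, 0.5s–8s) matches the description:

```python
import random
import time

def retry_with_backoff(fn, attempts=5, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,)):
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Delays grow 0.5s, 1s, 2s, 4s, 8s (capped at max_delay), matching the
    0.5s-8s window in the PR description. A small jitter is added so that
    concurrent processes retrying DDL do not collide in lockstep.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, 0.1 * delay))
```

For SQLite DDL contention specifically, `retry_on` would be narrowed to the lock-related exception type (e.g. APSW's `BusyError`) rather than `Exception`.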
> **Caution:** Review failed — the pull request is closed.
📝 Walkthrough

This PR introduces health monitoring and recovery orchestration for backend services in the enrichment pipeline, adds database lock resilience with retry logic and busy-timeout handling to the vector store, and includes comprehensive test coverage for both features.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Client
    participant EnrichmentService as Enrichment<br/>Service
    participant HealthChecker as Health<br/>Check
    participant MLXBackend as MLX<br/>Backend
    participant CircuitBreaker as Circuit<br/>Breaker
    participant Recovery as Recovery<br/>Orchestrator
    Client->>EnrichmentService: run_enrichment(batch)
    EnrichmentService->>EnrichmentService: process batch
    alt Batch Fail Ratio Exceeds Threshold
        EnrichmentService->>HealthChecker: check_backend_health(mlx)
        HealthChecker->>MLXBackend: health probe (HTTP)
        alt Backend Unhealthy
            MLXBackend-->>HealthChecker: connection error
            HealthChecker-->>EnrichmentService: health=False
            EnrichmentService->>Recovery: _recover_backend(mlx)
            Recovery->>MLXBackend: attempt restart
            alt Restart Succeeds
                MLXBackend-->>Recovery: ready
                Recovery-->>EnrichmentService: recovery=True
                EnrichmentService->>EnrichmentService: reset state & continue
            else Restart Fails
                MLXBackend-->>Recovery: timeout/error
                Recovery-->>EnrichmentService: recovery=False
                EnrichmentService->>EnrichmentService: pause/stop pipeline
            end
        else Backend Healthy
            MLXBackend-->>HealthChecker: OK
            HealthChecker-->>EnrichmentService: health=True
        end
    end
    alt Circuit Breaker Triggered
        EnrichmentService->>CircuitBreaker: circuit opened
        CircuitBreaker-->>EnrichmentService: breaker open
        EnrichmentService->>HealthChecker: check_backend_health(mlx)
        HealthChecker->>MLXBackend: health probe
        EnrichmentService->>Recovery: _recover_backend(mlx)
        Recovery->>MLXBackend: attempt restart
        alt Recovery Succeeds
            MLXBackend-->>Recovery: ready
            Recovery-->>CircuitBreaker: reset circuit
            CircuitBreaker-->>EnrichmentService: breaker closed
        else Recovery Fails
            Recovery-->>EnrichmentService: stop
        end
    end
    EnrichmentService-->>Client: enriched batch or error
```
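The inter-batch health-check and recovery path in the diagram can be simulated in a few lines. This is a dependency-free sketch with a stubbed backend — `StubBackend`, `run_batches`, and their signatures are illustrative inventions, not the PR's actual classes:

```python
class StubBackend:
    """Stand-in for the MLX server: a healthy flag plus a restart outcome."""
    def __init__(self, healthy=True, restart_succeeds=True):
        self.healthy = healthy
        self.restart_succeeds = restart_succeeds

    def probe(self):
        """Mimics the HTTP health probe."""
        return self.healthy

    def restart(self):
        """Mimics the auto-restart attempt; returns True on recovery."""
        self.healthy = self.restart_succeeds
        return self.healthy

def run_batches(batches, backend):
    """Process batches with an inter-batch health check and one-shot recovery."""
    done = []
    for batch in batches:
        if not backend.probe():        # backend died mid-run?
            if not backend.restart():  # attempt auto-restart
                return done, "stopped" # recovery failed: stop cleanly
        done.append(len(batch))        # stand-in for the real enrichment work
    return done, "ok"
```

With a dead but recoverable backend the run completes after one restart; with an unrecoverable one it stops cleanly with partial results, mirroring the diagram's two branches.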
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
# Register busy_timeout hook BEFORE bestpractice hooks so it fires first.
# bestpractice.apply() adds hooks that run PRAGMA optimize inside Connection(),
# which needs busy_timeout active or it crashes under contention.
apsw.connection_hooks.insert(0, _set_busy_timeout_hook)
```
Bestpractice hook overrides custom busy_timeout to 100ms
Medium Severity
The _set_busy_timeout_hook inserted at position 0 sets busy_timeout to 10s, but apsw.bestpractice.recommended includes connection_busy_timeout which resets it to 100ms (its default duration_ms). Since bestpractice hooks are appended after position 0, connection_busy_timeout fires after the custom hook, overriding 10s back to 100ms before connection_optimize runs. The stated fix — protecting PRAGMA optimize during the Connection() constructor with a 10s timeout — doesn't achieve its goal.
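The ordering problem the reviewer describes can be reproduced without apsw at all, since it is purely about list order: hooks registered later run later and can override earlier ones. In this simulation, `FakeConnection` and both hook functions are stand-ins that mirror (but are not) the real apsw objects:

```python
class FakeConnection:
    """Stand-in for apsw.Connection; just records the last busy_timeout set."""
    busy_timeout_ms = 0

def custom_hook(con):
    con.busy_timeout_ms = 10_000   # the PR's intended 10s timeout

def bestpractice_busy_timeout(con):
    con.busy_timeout_ms = 100      # mirrors connection_busy_timeout's default

hooks = []
hooks.insert(0, custom_hook)             # what the PR does
hooks.append(bestpractice_busy_timeout)  # what bestpractice.apply() adds later

con = FakeConnection()
for hook in hooks:             # Connection() runs hooks in list order
    hook(con)
print(con.busy_timeout_ms)     # 100 -- the 10s setting was overridden
```

One plausible fix, assuming apsw's documented API, is to apply a customized hook list instead of `bestpractice.recommended` — either omitting `connection_busy_timeout` or passing it with a larger `duration_ms` via `functools.partial` — so no later hook clobbers the custom timeout.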
```python
        return None
    except requests.exceptions.Timeout as e:
        print(f"  MLX timeout ({timeout}s): {e}", file=sys.stderr)
        return None
```
ConnectTimeout caught as ConnectionError, not Timeout
Low Severity
requests.exceptions.ConnectTimeout inherits from both ConnectionError and Timeout. Since the ConnectionError except clause appears first, connect timeouts are caught there and logged as "MLX connection error (server dead?)" instead of the Timeout handler's "MLX timeout" message. This defeats the PR's goal of better error categorization between connection errors and timeouts.
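Because Python matches `except` clauses top to bottom against the exception's class hierarchy, one fix is to catch the most specific class first. This dependency-free miniature reproduces the mechanics — the underscore-suffixed classes are stand-ins for requests' hierarchy, not the real classes:

```python
# Miniature of requests' hierarchy: ConnectTimeout inherits from BOTH
# ConnectionError and Timeout, so except-clause order decides the category.
class ConnectionError_(Exception): pass
class Timeout_(Exception): pass
class ConnectTimeout_(ConnectionError_, Timeout_): pass

def categorize_buggy(exc):
    try:
        raise exc
    except ConnectionError_:   # listed first, so ConnectTimeout_ lands here
        return "connection"
    except Timeout_:
        return "timeout"

def categorize_fixed(exc):
    try:
        raise exc
    except ConnectTimeout_:    # most specific class first
        return "connect-timeout"
    except ConnectionError_:
        return "connection"
    except Timeout_:
        return "timeout"

print(categorize_buggy(ConnectTimeout_()))  # connection  (miscategorized)
print(categorize_fixed(ConnectTimeout_()))  # connect-timeout
```

In the real code the same shape applies: an `except requests.exceptions.ConnectTimeout` clause before the `ConnectionError` one would route connect timeouts to the timeout message.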
Summary

- `bestpractice.connection_optimize` runs `PRAGMA optimize` inside the `Connection()` constructor before `setbusytimeout()` can be called. Registered a `connection_hook` that sets `busy_timeout=10s` before bestpractice hooks fire.
- Added retry with exponential backoff (5 attempts) on `_init_db()`.
- Better error categorization (`ConnectionError` vs `Timeout`).

Test plan

- `test_db_lock_resilience.py` — 5 tests for retry, concurrent init, WAL mode
- `test_mlx_health_recovery.py` — 9 tests for health checks, error categorization, fail ratio detection

🤖 Generated with Claude Code
Note
Medium Risk
Changes DB connection/initialization behavior and enrichment run control flow (including optional MLX subprocess restarts), which could affect reliability under contention or during backend failures.
Overview

Improves enrichment robustness when the LLM backend degrades: adds `check_backend_health`, distinguishes MLX `ConnectionError` vs `Timeout`, detects high per-batch failure ratios, and attempts MLX auto-restart/recovery before aborting after circuit-breaker events (with new env tunables like `BATCH_FAIL_RATIO_THRESHOLD` and restart timing).

Hardens SQLite/APSW startup under contention by setting `busy_timeout` via an early `apsw.connection_hooks` hook, applying `setbusytimeout()` immediately on new connections, and retrying `VectorStore` initialization with exponential backoff on `BusyError`; adds tests covering retry behavior, concurrent init, WAL mode, and MLX recovery paths.

Written by Cursor Bugbot for commit 14fa56a.
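The per-batch failure-ratio check referenced above (with its `BATCH_FAIL_RATIO_THRESHOLD` tunable) might look like the following. The function name, signature, and the exact semantics of the env var are assumptions for illustration; only the variable name and the 80% default come from the PR text:

```python
import os

DEFAULT_THRESHOLD = 0.8  # ">80% failed => pause", per the PR description

def batch_fail_ratio_exceeded(failed: int, total: int) -> bool:
    """True when the fraction of failed items in a batch crosses the threshold.

    Reads BATCH_FAIL_RATIO_THRESHOLD from the environment (a new env
    tunable per the review); falls back to the 0.8 default. An empty
    batch never trips the check.
    """
    threshold = float(os.environ.get("BATCH_FAIL_RATIO_THRESHOLD",
                                     DEFAULT_THRESHOLD))
    return total > 0 and failed / total > threshold
```

A strict `>` comparison means exactly 80% failures does not trigger a pause; whether the real implementation uses `>` or `>=` is not stated in the PR.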