Harden BrainLayer search validation and backfill coverage (#79)
Conversation
📝 Walkthrough

Adds a sidecar SQLite checkpoint system, Gemini batch request building and cost estimation, read-only VectorStore fallback, hybrid-search result caching with invalidation hooks, search parameter validation, and extensive tests for backfill and search behaviors.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant BackfillFlow
    participant VectorStore
    participant CheckpointDB as Checkpoint DB
    participant ReadOnlyStore as ReadOnlyBackfillStore
    participant GeminiAPI as Gemini Batch API
    Client->>BackfillFlow: run_full_backfill(db_path, submit_only?)
    BackfillFlow->>VectorStore: open (init)
    alt VectorStore locked (BusyError)
        VectorStore-->>BackfillFlow: BusyError
        BackfillFlow->>ReadOnlyStore: open_backfill_store -> create ReadOnlyBackfillStore
        ReadOnlyStore-->>BackfillFlow: ro_store
    else VectorStore available
        VectorStore-->>BackfillFlow: store
    end
    BackfillFlow->>CheckpointDB: _open_checkpoint_conn / _ensure_checkpoint_table
    BackfillFlow->>CheckpointDB: _migrate_checkpoints_from_main_db(store, conn)
    BackfillFlow->>BackfillFlow: get_unsubmitted_export_files() (submit_only path)
    alt export required
        BackfillFlow->>BackfillFlow: build_batch_request_line(chunk_id, prompt)
        BackfillFlow->>BackfillFlow: estimate_batch_cost_usd(input_tokens, output_tokens)
        BackfillFlow->>GeminiAPI: submit batch JSONL
        GeminiAPI-->>BackfillFlow: batch_id / response
        BackfillFlow->>CheckpointDB: persist checkpoint with batch_id
    end
    BackfillFlow-->>Client: complete
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a108affa87
```python
DEFAULT_DB_PATH = get_db_path()
EXPORT_DIR = Path(__file__).resolve().parent / "backfill_data"
CHECKPOINT_TABLE = "enrichment_checkpoints"
CHECKPOINT_DB_PATH = DEFAULT_DB_PATH.with_name("enrichment_checkpoints.db")
```
Derive checkpoint DB path from the active --db argument
`CHECKPOINT_DB_PATH` is bound once to `DEFAULT_DB_PATH`, so checkpoint operations ignore the `db_path` passed to `run_full_backfill`, `resume_backfill`, and `show_status`. If an operator runs the script with `--db` pointing to a non-default database, status/resume/checkpoint writes still hit the default sidecar file, which mixes job state across databases and can hide or replay jobs for the wrong dataset.
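The suggested fix can be sketched as a runtime helper; `get_checkpoint_db_path` is a hypothetical name, not the script's current API:

```python
from pathlib import Path

def get_checkpoint_db_path(db_path: Path) -> Path:
    """Derive the sidecar checkpoint DB next to whichever DB is active.

    Hypothetical helper: replaces the import-time CHECKPOINT_DB_PATH
    constant so a --db override also redirects checkpoint state.
    """
    return Path(db_path).with_name("enrichment_checkpoints.db")

# A non-default --db argument yields a matching sidecar path.
print(get_checkpoint_db_path(Path("/tmp/alt/brainlayer.db")))
# → /tmp/alt/enrichment_checkpoints.db
```

Callers such as `run_full_backfill(db_path)` would then derive the sidecar path from their own `db_path` argument instead of the module-level constant.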
```python
elif list(EXPORT_DIR.glob("batch_*.jsonl")):
    print("All existing JSONL batch files are already checkpointed. No new submissions needed.")
    return
```
Export new chunks when all prior JSONLs are already checkpointed
In `submit_only` mode with `sample == 0`, this branch returns early whenever batch files exist but are all checkpointed, so it never calls `export_unenriched_chunks` for newly added unenriched chunks. After one successful cycle, recurring submit-only runs can silently stop submitting fresh work unless the export directory is manually cleaned.
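The reordered control flow can be sketched as below. `get_unsubmitted_export_files` and `export_unenriched_chunks` exist in the script, but the stub bodies and the `submit_only_run` wrapper here are illustrative placeholders, not the real implementation:

```python
from pathlib import Path

def get_unsubmitted_export_files() -> list[Path]:
    """Placeholder: batch_*.jsonl files with no checkpoint row."""
    return []

def export_unenriched_chunks(store) -> list[Path]:
    """Placeholder: export fresh unenriched chunks to new JSONL batches."""
    return []

def submit_only_run(store) -> list[Path]:
    # 1. Prefer existing exports that were never submitted.
    jsonl_files = get_unsubmitted_export_files()
    if jsonl_files:
        return jsonl_files
    # 2. All prior exports are checkpointed: attempt a fresh export for
    #    newly added chunks instead of returning early (the reported bug).
    jsonl_files = export_unenriched_chunks(store)
    if not jsonl_files:
        print("All existing JSONL batch files are already checkpointed. No new submissions needed.")
    return jsonl_files
```

The early return only happens after the export attempt also comes back empty, so recurring submit-only runs keep picking up new work.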
```python
INSERT OR REPLACE INTO {CHECKPOINT_TABLE} ({', '.join(CHECKPOINT_COLUMNS)})
VALUES ({placeholders})
```
Stop clobbering sidecar state during legacy checkpoint migration
Legacy migration runs on every `ensure_checkpoint_table` call and uses `INSERT OR REPLACE`, which overwrites newer sidecar state with stale rows from the main DB table. Since new updates now write only to the sidecar DB, a legacy `submitted` row can repeatedly replace a later `completed`/`imported` sidecar row and make resume/status regress.
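One way to make the migration non-destructive is `INSERT OR IGNORE`, which lets existing sidecar rows win. This sketch uses a minimal two-column schema for illustration; the real `CHECKPOINT_COLUMNS` are wider:

```python
import sqlite3

def migrate_legacy_rows(main: sqlite3.Connection, sidecar: sqlite3.Connection) -> None:
    """Copy legacy checkpoints without overwriting newer sidecar state.

    INSERT OR IGNORE keeps any existing sidecar row, so a stale legacy
    'submitted' row can no longer clobber a later 'completed' row.
    """
    rows = main.execute("SELECT batch_id, status FROM enrichment_checkpoints").fetchall()
    sidecar.executemany(
        "INSERT OR IGNORE INTO enrichment_checkpoints (batch_id, status) VALUES (?, ?)",
        rows,
    )
    sidecar.commit()

# Demo with in-memory DBs.
main = sqlite3.connect(":memory:")
side = sqlite3.connect(":memory:")
main.execute("CREATE TABLE enrichment_checkpoints (batch_id TEXT PRIMARY KEY, status TEXT)")
side.execute("CREATE TABLE enrichment_checkpoints (batch_id TEXT PRIMARY KEY, status TEXT)")
main.execute("INSERT INTO enrichment_checkpoints VALUES ('b1', 'submitted')")  # stale legacy row
main.execute("INSERT INTO enrichment_checkpoints VALUES ('b2', 'submitted')")  # not yet migrated
side.execute("INSERT INTO enrichment_checkpoints VALUES ('b1', 'completed')")  # newer sidecar row
migrate_legacy_rows(main, side)
print(side.execute("SELECT status FROM enrichment_checkpoints WHERE batch_id='b1'").fetchone()[0])
# → completed
```

If status precedence matters beyond "sidecar wins", a rank lookup over `CHECKPOINT_STABLE_STATUSES` could gate the write instead, but `INSERT OR IGNORE` already fixes the regression described here.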
```python
cache_key = _hybrid_cache_key(
    store_key,
    query_text,
    n_results,
```
Include query embedding in the hybrid search cache key
The cache key excludes `query_embedding` even though semantic ranking is computed from it, so two calls with identical text/filters but different embeddings will incorrectly reuse the first result for up to 60 seconds. This can return stale or wrong rankings during embedding model changes, alternate embedding pipelines, or recovery from a bad first embedding.
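The suggested fix can be sketched as a hashing helper; `embedding_digest` is a hypothetical name and the tuple layout below is illustrative, not the real `_hybrid_cache_key` signature:

```python
import hashlib
import struct

def embedding_digest(embedding) -> str:
    """Fixed-size key component for a query embedding.

    Hashes a stable byte representation so the cache key stays small.
    Accepts numpy arrays (via .tobytes()) or plain float sequences.
    """
    if hasattr(embedding, "tobytes"):
        raw = embedding.tobytes()
    else:
        raw = b"".join(struct.pack("<f", float(x)) for x in embedding)
    return hashlib.sha256(raw).hexdigest()

# Same text, different embeddings → different cache keys.
key_a = ("store.db", "query", 10, embedding_digest([0.1, 0.2]))
key_b = ("store.db", "query", 10, embedding_digest([0.1, 0.3]))
print(key_a != key_b)
# → True
```

Both cache-key helpers mentioned in the review would fold this digest into their returned tuple alongside the existing text and filter components.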
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/cloud_backfill.py`:
- Around line 52-59: The checkpoint sidecar path is being fixed at import via
CHECKPOINT_DB_PATH (derived from get_db_path()), causing resume/status
operations against a non-default DB to use the wrong sidecar; change
CHECKPOINT_DB_PATH from a module-level constant to a function (e.g.,
get_checkpoint_db_path(db_path)) that returns
db_path.with_name("enrichment_checkpoints.db"), and update all callers
(run_full_backfill, resume_backfill, show_status and any code in the ranges that
reference CHECKPOINT_DB_PATH) to call get_checkpoint_db_path(db_path) or
otherwise derive the checkpoint path from the supplied db_path/--db argument
instead of the import-time constant. Ensure CHECKPOINT_STABLE_STATUSES and retry
constants remain constants; only the sidecar path resolution is moved to
runtime.
- Around line 807-820: The branch in the submit-only path returns early whenever
any batch_*.jsonl exists (EXPORT_DIR.glob), which prevents creating new exports;
instead, when jsonl_files is empty call export_unenriched_chunks(store,
max_chunks=max_chunks, no_sanitize=no_sanitize) to attempt producing new batch
files (use existing get_unsubmitted_export_files() first), then if after
exporting there are still no files print the "already checkpointed" message and
return; update the logic around variables submit_only, sample, jsonl_files and
functions get_unsubmitted_export_files() / export_unenriched_chunks(...) so the
early return is removed and the export is attempted before deciding to exit.
In `@src/brainlayer/search_repo.py`:
- Around line 30-65: The cache key in _hybrid_cache_key must include the
query_embedding so different embeddings for the same text don't share results;
change _hybrid_cache_key to accept the query_embedding and include a stable
representation (preferably a fixed-size hash like SHA256 of the embedding bytes
or a tuple of floats) in the returned tuple, convert/normalize list/np.ndarray
embeddings before hashing, and update all callers to pass query_embedding; apply
the same change to the other cache-key helper referenced around lines 424-441 so
both helpers use the embedding hash rather than only query_text.
- Around line 14-27: The hybrid search cache (_hybrid_cache) must be invalidated
on every write to avoid serving stale results; update the write-path functions
store_memory(), update_enrichment(), update_chunk(), and archive_chunk() so they
call _hybrid_cache.clear() (or a small helper flush_hybrid_cache()) immediately
after successfully persisting/modifying chunks (and after any enrichment
pipeline calls that ultimately mutate chunks via update_enrichment()); ensure
the clear happens only on successful commits/returns and reference the
module-level _hybrid_cache symbol so the module-wide LRU state is reset for the
affected store.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 328cfbb1-deda-4a3a-8bfa-4562135dd203
📒 Files selected for processing (7)
- scripts/cloud_backfill.py
- src/brainlayer/mcp/search_handler.py
- src/brainlayer/search_repo.py
- tests/conftest.py
- tests/test_cloud_backfill.py
- tests/test_eval_baselines.py
- tests/test_search_validation.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.12)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (2)
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`src/brainlayer/**/*.py`:
- Python package structure should follow the layout: `src/brainlayer/` for package code, with separate modules for `vector_store.py`, `embeddings.py`, `daemon.py`, `dashboard/`, and `mcp/` for different concerns
- Use `paths.py:get_db_path()` for all database path resolution instead of hardcoding paths; support environment variable overrides and canonical path fallback (`~/.local/share/brainlayer/brainlayer.db`)
- Lint and format Python code using `ruff check src/` and `ruff format src/`
- Preserve verbatim content for `ai_code`, `stack_trace`, and `user_message` message types during classification and chunking; skip `noise` content entirely; summarize `build_log` content; extract structure-only for `dir_listing`
- Use AST-aware chunking with tree-sitter; never split stack traces; mask large tool output during chunking
- Handle SQLite concurrency by implementing retry logic on `SQLITE_BUSY` errors; ensure each worker uses its own database connection
- Prioritize MLX (Qwen2.5-Coder-14B-Instruct-4bit) on Apple Silicon (port 8080) as the enrichment backend; fall back to Ollama (`glm-4.7-flash` on port 11434) after 3 consecutive MLX failures; support backend override via `BRAINLAYER_ENRICH_BACKEND` environment variable
- Brain graph API must expose endpoints: `/brain/graph`, `/brain/node/{node_id}` (FastAPI)
- Backlog API must support endpoints: `/backlog/items` with GET, POST, PATCH, DELETE operations (FastAPI)
- Provide `brainlayer brain-export` command to export brain graph as JSON for dashboard consumption
- Provide `brainlayer export-obsidian` command to export as Markdown vault with backlinks and tags
- For bulk database operations: stop enrichment workers first, checkpoint WAL before and after operations, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks with checkpoint every 3 batches, never delete from `chunks` while FTS trigger is active
Files:
`src/brainlayer/search_repo.py`, `src/brainlayer/mcp/search_handler.py`
src/brainlayer/mcp/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
MCP tools must implement:
`brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_get_person` with legacy `brainlayer_*` aliases for backward compatibility
Files:
src/brainlayer/mcp/search_handler.py
🧠 Learnings (3)
📚 Learning: 2026-03-12T14:22:54.798Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.798Z
Learning: Applies to src/brainlayer/**/*.py : Use `paths.py:get_db_path()` for all database path resolution instead of hardcoding paths; support environment variable overrides and canonical path fallback (`~/.local/share/brainlayer/brainlayer.db`)
Applied to files:
tests/test_eval_baselines.py
📚 Learning: 2026-03-12T14:22:54.798Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.798Z
Learning: Applies to src/brainlayer/mcp/**/*.py : MCP tools must implement: `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_get_person` with legacy `brainlayer_*` aliases for backward compatibility
Applied to files:
src/brainlayer/mcp/search_handler.py
📚 Learning: 2026-03-12T14:22:54.798Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.798Z
Learning: Applies to src/brainlayer/**/*.py : For bulk database operations: stop enrichment workers first, checkpoint WAL before and after operations, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks with checkpoint every 3 batches, never delete from `chunks` while FTS trigger is active
Applied to files:
scripts/cloud_backfill.py
```python
# ── hybrid_search result cache ───────────────────────────────────────────────
# Caches identical (store, query_text, filters) → results for 60s.
# Warm repeated queries (e.g. hook firing on the same prompt twice) return
# instantly instead of re-running the full 303K-vector brute-force scan.
#
# Correctness constraints:
# - Store-scoped: different DB files must never share cached results.
# - Filter-scoped: all query-affecting filters belong in the cache key.
# - Copy-on-read: callers enrich and mutate result metadata after search.
_HYBRID_CACHE_TTL = 60.0  # seconds
_HYBRID_CACHE_MAX = 128  # max entries (LRU eviction)

# Module-level LRU cache: {cache_key: (result, timestamp)}
_hybrid_cache: "OrderedDict[tuple, tuple[dict, float]]" = OrderedDict()
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
printf '\nCache references:\n'
rg -n --type=py '\b_hybrid_cache\b|_hybrid_cache_key|_HYBRID_CACHE_TTL'
printf '\nPotential chunk/enrichment write paths:\n'
rg -n --type=py 'store_memory\s*\(|update_enrichment\s*\(|INSERT INTO chunks|UPDATE chunks|DELETE FROM chunks'
```

Repository: EtanHey/brainlayer
Length of output: 10109
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "=== store_memory() function in store.py ==="
sed -n '46,160p' src/brainlayer/store.py
echo -e "\n=== update_enrichment() in session_repo.py ==="
sed -n '88,190p' src/brainlayer/session_repo.py
echo -e "\n=== Enrichment pipeline operations in enrichment.py (around line 834) ==="
sed -n '830,850p' src/brainlayer/pipeline/enrichment.py
```

Repository: EtanHey/brainlayer
Length of output: 9459
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "=== vector_store.py update methods (lines 680-720) ==="
sed -n '680,720p' src/brainlayer/vector_store.py
echo -e "\n=== Check if _hybrid_cache is cleared anywhere in store.py or vector_store.py ==="
rg -n '\b_hybrid_cache\s*\.clear\(\)|_hybrid_cache\s*=|_hybrid_cache\.pop' src/brainlayer/
```

Repository: EtanHey/brainlayer
Length of output: 2016
Add cache invalidation to all write paths.
The hybrid search cache has no version component, so every write path must explicitly clear `_hybrid_cache` to prevent stale results. Currently, `store_memory()`, `update_enrichment()`, `archive_chunk()`, and other mutations do not invalidate the cache; only tests manually call `_hybrid_cache.clear()`. This allows repeated queries against the same store to serve pre-write results for up to 60 seconds after any write.

Affected write paths:
- `store_memory()` in `store.py`
- `update_enrichment()` in `session_repo.py`
- `update_chunk()` and `archive_chunk()` in `vector_store.py`
- Enrichment pipeline calls to `update_enrichment()`

Each should clear `_hybrid_cache` after modifying chunks.
Force-pushed dcab659 to fe9fac8
Summary
- Validate the `brain_search` MCP boundary so invalid `detail` values and out-of-range `num_results` fail fast
- Make `hybrid_search` caching safe by scoping entries to the backing DB, including all relevant filters, and returning defensive copies
- Harden `cloud_backfill`, cover the changes with regression tests, and fix the `--no-sanitize` export crash
- `~/.claude` copy so full pytest verifies repo code
Review Notes
- `+411/-21` summary; the active diff included larger `cloud_backfill` changes around checkpoint handling and export reuse, so this PR reviews and tests the code that is actually present
Testing
- `pytest -q tests/test_search_validation.py tests/test_cloud_backfill.py`
- `pytest -q tests/test_eval_baselines.py -k "PromptHookEntityInjection or HookLatency"`
- `pytest`
Summary by CodeRabbit
New Features
Bug Fixes
Tests