
Harden BrainLayer search validation and backfill coverage#79

Merged
EtanHey merged 3 commits into main from fix/search-hardening-pr-round on Mar 14, 2026
Conversation


EtanHey (Owner) commented Mar 13, 2026

Summary

  • harden the public brain_search MCP boundary so invalid detail values and out-of-range num_results fail fast
  • make hybrid_search caching safe by scoping entries to the backing DB, including all relevant filters, and returning defensive copies
  • factor the Gemini batch payload/cost helpers in cloud_backfill, cover them with regression tests, and fix the --no-sanitize export crash
  • make eval hook tests prefer the repo hook implementation instead of a stale ~/.claude copy so full pytest verifies repo code
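As a rough illustration of the fail-fast boundary in the first bullet, the validation could look like the sketch below. The constant names match the PR walkthrough, but the allowed detail values and the result-count bounds here are assumptions, not taken from the actual handler:

```python
# Hypothetical sketch of the brain_search parameter validation; the real
# allowed values and bounds in search_handler.py may differ.
_VALID_SEARCH_DETAILS = {"low", "medium", "high"}  # assumed allowed values
_MIN_PUBLIC_NUM_RESULTS = 1
_MAX_PUBLIC_NUM_RESULTS = 50                       # assumed upper bound


def validate_search_params(detail: str, num_results: int) -> None:
    """Reject invalid public search parameters before any DB work happens."""
    if detail not in _VALID_SEARCH_DETAILS:
        raise ValueError(
            f"invalid detail {detail!r}; expected one of {sorted(_VALID_SEARCH_DETAILS)}"
        )
    if not (_MIN_PUBLIC_NUM_RESULTS <= num_results <= _MAX_PUBLIC_NUM_RESULTS):
        raise ValueError(
            f"num_results must be between {_MIN_PUBLIC_NUM_RESULTS} and "
            f"{_MAX_PUBLIC_NUM_RESULTS}, got {num_results}"
        )
```

Failing fast at the MCP boundary keeps malformed tool calls from ever reaching the vector store.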

Review Notes

  • the current worktree did not match the original +411/-21 summary; the active diff included larger cloud_backfill changes around checkpoint handling and export reuse, so this PR reviews and tests the code that is actually present
  • the warm-cache optimization is now documented in code, and the tests cover cache scoping plus mutation safety

Testing

  • pytest -q tests/test_search_validation.py tests/test_cloud_backfill.py
  • pytest -q tests/test_eval_baselines.py -k "PromptHookEntityInjection or HookLatency"
  • pytest

Summary by CodeRabbit

  • New Features

    • Added hybrid search caching (60s TTL) and automatic cache invalidation after updates.
    • Added batch-export helpers: standardized batch request builder, cost estimation, and a sidecar checkpoint DB with read-only fallback support.
  • Bug Fixes

    • Enforced search parameter validation (detail and result count bounds).
  • Tests

    • Added extensive tests for search validation, hybrid-cache behavior, and batch export/checkpoint workflows.


EtanHey (Owner, Author) commented Mar 13, 2026

@codex review
@cursor review
@BugBot review


coderabbitai Bot commented Mar 13, 2026

Warning

Rate limit exceeded

@EtanHey has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 14 minutes and 10 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b5d7ccce-43c2-4de5-983a-9a7beed38d03

📥 Commits

Reviewing files that changed from the base of the PR and between 0998aaf and fe9fac8.

📒 Files selected for processing (8)
  • scripts/cloud_backfill.py
  • src/brainlayer/search_repo.py
  • src/brainlayer/session_repo.py
  • src/brainlayer/store.py
  • src/brainlayer/vector_store.py
  • tests/test_cloud_backfill.py
  • tests/test_eval_baselines.py
  • tests/test_search_validation.py
📝 Walkthrough

Walkthrough

Adds a sidecar SQLite checkpoint system, Gemini batch request building and cost estimation, read-only VectorStore fallback, hybrid-search result caching with invalidation hooks, search parameter validation, and extensive tests for backfill and search behaviors.

Changes

Cohort / File(s) Summary
Cloud Backfill & Batch
scripts/cloud_backfill.py
Adds sidecar enrichment_checkpoints.db with APSW/WAL handling, migration helpers, checkpoint management helpers, JSONL export reuse, Gemini batch request builder build_batch_request_line, cost estimator estimate_batch_cost_usd, ReadOnlyBackfillStore and open_backfill_store, and related logging.
Cloud Backfill Tests
tests/test_cloud_backfill.py
Adds regression tests covering checkpoint migration, write separation, submit-only/export reuse, read-only fallback, export path filtering, and cost/JSONL behaviors.
Hybrid Search Caching
src/brainlayer/search_repo.py, src/brainlayer/session_repo.py, src/brainlayer/store.py, src/brainlayer/vector_store.py
Introduces LRU TTL cache for hybrid_search (60s, 128 entries), cache-key helpers, defensive copy semantics, clear_hybrid_search_cache and invalidation calls after enrichment/storage/vector updates.
Search Validation
src/brainlayer/mcp/search_handler.py
Adds _VALID_SEARCH_DETAILS, _MAX_PUBLIC_NUM_RESULTS, _MIN_PUBLIC_NUM_RESULTS and input validation in _brain_search for detail and num_results.
Search Tests & Validation
tests/test_search_validation.py
Adds extensive tests for search parameter validation, eval isolation, hybrid_search latency and cache behavior.
Test Fixtures & Hook Path
tests/conftest.py, tests/test_eval_baselines.py
Adds eval_project fixture (unique per test) and repository-aware hook resolution (REPO_HOOK_PATH, HOME_HOOK_PATH, _resolve_hook_path) used by eval baseline tests.
Test Adjustments
tests/test_eval_baselines.py
Rewires HOOK_PATH usage to prefer repo hook when present.
Misc. Exports
scripts/cloud_backfill.py (public additions)
New public API: build_batch_request_line, estimate_batch_cost_usd, get_checkpoint_db_path, ReadOnlyBackfillStore, open_backfill_store (several underscored helpers added as internal utilities).
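For illustration, the two batch helpers named in the walkthrough could have shapes like the sketch below. The Gemini batch JSONL request schema and the per-token prices used here are assumptions, not taken from the PR:

```python
import json


def build_batch_request_line(chunk_id: str, prompt: str) -> str:
    """One JSONL line per chunk for a batch submission (schema is assumed)."""
    return json.dumps({
        "custom_id": chunk_id,
        "request": {"contents": [{"parts": [{"text": prompt}]}]},
    })


def estimate_batch_cost_usd(
    input_tokens: int,
    output_tokens: int,
    in_price_per_m: float = 0.075,   # placeholder $/1M input tokens
    out_price_per_m: float = 0.30,   # placeholder $/1M output tokens
) -> float:
    """Linear token pricing; default rates are illustrative placeholders."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000
```

Keeping these as pure functions is what makes the regression tests in tests/test_cloud_backfill.py cheap to write.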

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant BackfillFlow
    participant VectorStore
    participant CheckpointDB as Checkpoint DB
    participant ReadOnlyStore as ReadOnlyBackfillStore
    participant GeminiAPI as Gemini Batch API

    Client->>BackfillFlow: run_full_backfill(db_path, submit_only?)
    BackfillFlow->>VectorStore: open (init)
    alt VectorStore locked (BusyError)
        VectorStore-->>BackfillFlow: BusyError
        BackfillFlow->>ReadOnlyStore: open_backfill_store -> create ReadOnlyBackfillStore
        ReadOnlyStore-->>BackfillFlow: ro_store
    else VectorStore available
        VectorStore-->>BackfillFlow: store
    end

    BackfillFlow->>CheckpointDB: _open_checkpoint_conn / _ensure_checkpoint_table
    BackfillFlow->>CheckpointDB: _migrate_checkpoints_from_main_db(store, conn)
    BackfillFlow->>BackfillFlow: get_unsubmitted_export_files() (submit_only path)
    alt export required
        BackfillFlow->>BackfillFlow: build_batch_request_line(chunk_id, prompt)
        BackfillFlow->>BackfillFlow: estimate_batch_cost_usd(input_tokens, output_tokens)
        BackfillFlow->>GeminiAPI: submit batch JSONL
        GeminiAPI-->>BackfillFlow: batch_id / response
        BackfillFlow->>CheckpointDB: persist checkpoint with batch_id
    end
    BackfillFlow-->>Client: complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Poem

🐰 I hopped through SQLite tunnels, sidecar snug and bright,
I bundled batch requests and priced them neat and light,
When DB doors are bolted, I nibble read-only crumbs,
I clear cached crumbs when new enrichments come,
A little rabbit cheer for backfills done just right!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Harden BrainLayer search validation and backfill coverage' directly summarizes the main objectives of the PR, which includes hardening search validation, increasing backfill test coverage, and improving cache safety.
Docstring Coverage ✅ Passed Docstring coverage is 80.72% which is sufficient. The required threshold is 80.00%.




@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a108affa87


Comment thread scripts/cloud_backfill.py
DEFAULT_DB_PATH = get_db_path()
EXPORT_DIR = Path(__file__).resolve().parent / "backfill_data"
CHECKPOINT_TABLE = "enrichment_checkpoints"
CHECKPOINT_DB_PATH = DEFAULT_DB_PATH.with_name("enrichment_checkpoints.db")

P1 Badge Derive checkpoint DB path from the active --db argument

CHECKPOINT_DB_PATH is bound once to DEFAULT_DB_PATH, so checkpoint operations ignore the db_path passed to run_full_backfill, resume_backfill, and show_status. If an operator runs the script with --db pointing to a non-default database, status/resume/checkpoint writes still hit the default sidecar file, which mixes job state across databases and can hide or replay jobs for the wrong dataset.

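The fix the reviewer suggests is small: derive the sidecar path from the active database at call time rather than at import time. A minimal sketch, assuming the sidecar keeps its current filename:

```python
from pathlib import Path


def get_checkpoint_db_path(db_path: Path) -> Path:
    """Place the checkpoint sidecar next to whichever DB --db selected."""
    return db_path.with_name("enrichment_checkpoints.db")
```

Callers such as run_full_backfill, resume_backfill, and show_status would then pass their db_path through instead of reading a module-level constant, so job state never mixes across databases.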

Comment thread scripts/cloud_backfill.py Outdated
Comment on lines +816 to +818
elif list(EXPORT_DIR.glob("batch_*.jsonl")):
    print("All existing JSONL batch files are already checkpointed. No new submissions needed.")
    return

P1 Badge Export new chunks when all prior JSONLs are already checkpointed

In submit_only mode with sample == 0, this branch returns early whenever batch files exist but are all checkpointed, so it never calls export_unenriched_chunks for newly-added unenriched chunks. After one successful cycle, recurring submit-only runs can silently stop submitting fresh work unless the export directory is manually cleaned.

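One way to restructure the submit-only path along the lines of this finding is to attempt an export whenever nothing is pending, and only then conclude there is no work. The function names below mirror the review comment but are illustrative, not the script's exact API:

```python
# Hypothetical restructuring of the submit-only path: export fresh unenriched
# chunks before deciding there is nothing to submit.
def collect_files_to_submit(get_unsubmitted, export_new):
    """Return JSONL files to submit, exporting new chunks if none are pending."""
    files = get_unsubmitted()
    if not files:
        # Previously the code returned early whenever old (already checkpointed)
        # batch files existed; exporting first avoids silently skipping new work.
        files = export_new()
    return files
```

With this shape, recurring submit-only runs keep picking up newly-added unenriched chunks without manual cleanup of the export directory.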

Comment thread scripts/cloud_backfill.py
Comment on lines +198 to +199
INSERT OR REPLACE INTO {CHECKPOINT_TABLE} ({', '.join(CHECKPOINT_COLUMNS)})
VALUES ({placeholders})

P2 Badge Stop clobbering sidecar state during legacy checkpoint migration

Legacy migration runs on every ensure_checkpoint_table call and uses INSERT OR REPLACE, which overwrites newer sidecar state with stale rows from the main DB table. Since new updates now write only to the sidecar DB, a legacy submitted row can repeatedly replace a later completed/imported sidecar row and make resume/status regress.


Comment thread src/brainlayer/search_repo.py
Comment on lines +424 to +427
cache_key = _hybrid_cache_key(
store_key,
query_text,
n_results,

P2 Badge Include query embedding in the hybrid search cache key

The cache key excludes query_embedding even though semantic ranking is computed from it, so two calls with identical text/filters but different embeddings will incorrectly reuse the first result for up to 60 seconds. This can return stale or wrong rankings during embedding model changes, alternate embedding pipelines, or recovery from a bad first embedding.

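Folding the embedding into the cache key is straightforward once the vector has a stable fixed-size fingerprint. A sketch of such a helper (illustrative, not from the PR) that treats lists, tuples, and numpy arrays with equal values identically:

```python
import hashlib
import struct


def embedding_digest(embedding) -> str:
    """Stable fixed-size fingerprint of a float vector for a cache key.

    Normalizes any iterable of numbers (list, tuple, np.ndarray) to floats,
    packs them to bytes, and hashes, so equal vectors share one digest.
    """
    floats = [float(x) for x in embedding]
    packed = struct.pack(f"{len(floats)}d", *floats)
    return hashlib.sha256(packed).hexdigest()
```

Including this digest in _hybrid_cache_key means two calls with identical text but different embeddings can no longer reuse each other's results within the TTL window.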


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/cloud_backfill.py`:
- Around line 52-59: The checkpoint sidecar path is being fixed at import via
CHECKPOINT_DB_PATH (derived from get_db_path()), causing resume/status
operations against a non-default DB to use the wrong sidecar; change
CHECKPOINT_DB_PATH from a module-level constant to a function (e.g.,
get_checkpoint_db_path(db_path)) that returns
db_path.with_name("enrichment_checkpoints.db"), and update all callers
(run_full_backfill, resume_backfill, show_status and any code in the ranges that
reference CHECKPOINT_DB_PATH) to call get_checkpoint_db_path(db_path) or
otherwise derive the checkpoint path from the supplied db_path/--db argument
instead of the import-time constant. Ensure CHECKPOINT_STABLE_STATUSES and retry
constants remain constants; only the sidecar path resolution is moved to
runtime.
- Around line 807-820: The branch in the submit-only path returns early whenever
any batch_*.jsonl exists (EXPORT_DIR.glob), which prevents creating new exports;
instead, when jsonl_files is empty call export_unenriched_chunks(store,
max_chunks=max_chunks, no_sanitize=no_sanitize) to attempt producing new batch
files (use existing get_unsubmitted_export_files() first), then if after
exporting there are still no files print the "already checkpointed" message and
return; update the logic around variables submit_only, sample, jsonl_files and
functions get_unsubmitted_export_files() / export_unenriched_chunks(...) so the
early return is removed and the export is attempted before deciding to exit.

In `@src/brainlayer/search_repo.py`:
- Around line 30-65: The cache key in _hybrid_cache_key must include the
query_embedding so different embeddings for the same text don't share results;
change _hybrid_cache_key to accept the query_embedding and include a stable
representation (preferably a fixed-size hash like SHA256 of the embedding bytes
or a tuple of floats) in the returned tuple, convert/normalize list/np.ndarray
embeddings before hashing, and update all callers to pass query_embedding; apply
the same change to the other cache-key helper referenced around lines 424-441 so
both helpers use the embedding hash rather than only query_text.
- Around line 14-27: The hybrid search cache (_hybrid_cache) must be invalidated
on every write to avoid serving stale results; update the write-path functions
store_memory(), update_enrichment(), update_chunk(), and archive_chunk() so they
call _hybrid_cache.clear() (or a small helper flush_hybrid_cache()) immediately
after successfully persisting/modifying chunks (and after any enrichment
pipeline calls that ultimately mutate chunks via update_enrichment()); ensure
the clear happens only on successful commits/returns and reference the
module-level _hybrid_cache symbol so the module-wide LRU state is reset for the
affected store.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 328cfbb1-deda-4a3a-8bfa-4562135dd203

📥 Commits

Reviewing files that changed from the base of the PR and between 50d637d and a108aff.

📒 Files selected for processing (7)
  • scripts/cloud_backfill.py
  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/search_repo.py
  • tests/conftest.py
  • tests/test_cloud_backfill.py
  • tests/test_eval_baselines.py
  • tests/test_search_validation.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (2)
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Python package structure should follow the layout: src/brainlayer/ for package code, with separate modules for vector_store.py, embeddings.py, daemon.py, dashboard/, and mcp/ for different concerns
Use paths.py:get_db_path() for all database path resolution instead of hardcoding paths; support environment variable overrides and canonical path fallback (~/.local/share/brainlayer/brainlayer.db)
Lint and format Python code using ruff check src/ and ruff format src/
Preserve verbatim content for ai_code, stack_trace, and user_message message types during classification and chunking; skip noise content entirely; summarize build_log content; extract structure-only for dir_listing
Use AST-aware chunking with tree-sitter; never split stack traces; mask large tool output during chunking
Handle SQLite concurrency by implementing retry logic on SQLITE_BUSY errors; ensure each worker uses its own database connection
Prioritize MLX (Qwen2.5-Coder-14B-Instruct-4bit) on Apple Silicon (port 8080) as the enrichment backend; fall back to Ollama (glm-4.7-flash on port 11434) after 3 consecutive MLX failures; support backend override via BRAINLAYER_ENRICH_BACKEND environment variable
Brain graph API must expose endpoints: /brain/graph, /brain/node/{node_id} (FastAPI)
Backlog API must support endpoints: /backlog/items with GET, POST, PATCH, DELETE operations (FastAPI)
Provide brainlayer brain-export command to export brain graph as JSON for dashboard consumption
Provide brainlayer export-obsidian command to export as Markdown vault with backlinks and tags
For bulk database operations: stop enrichment workers first, checkpoint WAL before and after operations, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks with checkpoint every 3 batches, never delete from chunks while FTS trigger is active

Files:

  • src/brainlayer/search_repo.py
  • src/brainlayer/mcp/search_handler.py
src/brainlayer/mcp/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

MCP tools must implement: brain_search, brain_store, brain_recall, brain_entity, brain_expand, brain_update, brain_digest, brain_get_person with legacy brainlayer_* aliases for backward compatibility

Files:

  • src/brainlayer/mcp/search_handler.py
🧠 Learnings (3)
📚 Learning: 2026-03-12T14:22:54.798Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.798Z
Learning: Applies to src/brainlayer/**/*.py : Use `paths.py:get_db_path()` for all database path resolution instead of hardcoding paths; support environment variable overrides and canonical path fallback (`~/.local/share/brainlayer/brainlayer.db`)

Applied to files:

  • tests/test_eval_baselines.py
📚 Learning: 2026-03-12T14:22:54.798Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.798Z
Learning: Applies to src/brainlayer/mcp/**/*.py : MCP tools must implement: `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_get_person` with legacy `brainlayer_*` aliases for backward compatibility

Applied to files:

  • src/brainlayer/mcp/search_handler.py
📚 Learning: 2026-03-12T14:22:54.798Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.798Z
Learning: Applies to src/brainlayer/**/*.py : For bulk database operations: stop enrichment workers first, checkpoint WAL before and after operations, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks with checkpoint every 3 batches, never delete from `chunks` while FTS trigger is active

Applied to files:

  • scripts/cloud_backfill.py

Comment thread src/brainlayer/search_repo.py
Comment on lines +14 to +27
# ── hybrid_search result cache ───────────────────────────────────────────────
# Caches identical (store, query_text, filters) → results for 60s.
# Warm repeated queries (e.g. hook firing on the same prompt twice) return
# instantly instead of re-running the full 303K-vector brute-force scan.
#
# Correctness constraints:
# - Store-scoped: different DB files must never share cached results.
# - Filter-scoped: all query-affecting filters belong in the cache key.
# - Copy-on-read: callers enrich and mutate result metadata after search.
_HYBRID_CACHE_TTL = 60.0 # seconds
_HYBRID_CACHE_MAX = 128 # max entries (LRU eviction)

# Module-level LRU cache: {cache_key: (result, timestamp)}
_hybrid_cache: "OrderedDict[tuple, tuple[dict, float]]" = OrderedDict()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\nCache references:\n'
rg -n --type=py '\b_hybrid_cache\b|_hybrid_cache_key|_HYBRID_CACHE_TTL'

printf '\nPotential chunk/enrichment write paths:\n'
rg -n --type=py 'store_memory\s*\(|update_enrichment\s*\(|INSERT INTO chunks|UPDATE chunks|DELETE FROM chunks'

Repository: EtanHey/brainlayer

Length of output: 10109


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== store_memory() function in store.py ==="
sed -n '46,160p' src/brainlayer/store.py

echo -e "\n=== update_enrichment() in session_repo.py ==="
sed -n '88,190p' src/brainlayer/session_repo.py

echo -e "\n=== Enrichment pipeline operations in enrichment.py (around line 834) ==="
sed -n '830,850p' src/brainlayer/pipeline/enrichment.py

Repository: EtanHey/brainlayer

Length of output: 9459


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== vector_store.py update methods (lines 680-720) ==="
sed -n '680,720p' src/brainlayer/vector_store.py

echo -e "\n=== Check if _hybrid_cache is cleared anywhere in store.py or vector_store.py ==="
rg -n '\b_hybrid_cache\s*\.clear\(\)|_hybrid_cache\s*=|_hybrid_cache\.pop' src/brainlayer/

Repository: EtanHey/brainlayer

Length of output: 2016


Add cache invalidation to all write paths.

The hybrid search cache has no version component, so every write path must explicitly clear _hybrid_cache to prevent stale results. Currently, store_memory(), update_enrichment(), archive_chunk(), and other mutations do not invalidate the cache—only tests manually call _hybrid_cache.clear(). This allows repeated queries against the same store to serve pre-write results for up to 60 seconds after any write.

Affected write paths:

  • store_memory() in store.py
  • update_enrichment() in session_repo.py
  • update_chunk() and archive_chunk() in vector_store.py
  • Enrichment pipeline calls to update_enrichment()

Each should clear _hybrid_cache after modifying chunks.
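The cache-plus-flush pattern the reviewer asks for can be sketched as below, mirroring the 60s TTL / 128-entry LRU described in the snippet above; the names here are illustrative rather than the repo's:

```python
import time
from collections import OrderedDict

_CACHE_TTL = 60.0   # seconds, matching the PR's TTL
_CACHE_MAX = 128    # max entries before LRU eviction
_cache: OrderedDict = OrderedDict()  # {key: (value, timestamp)}


def cache_get(key):
    entry = _cache.get(key)
    if entry is None:
        return None
    value, ts = entry
    if time.monotonic() - ts > _CACHE_TTL:
        del _cache[key]     # expired
        return None
    _cache.move_to_end(key)  # refresh LRU position
    return value


def cache_put(key, value):
    _cache[key] = (value, time.monotonic())
    _cache.move_to_end(key)
    while len(_cache) > _CACHE_MAX:
        _cache.popitem(last=False)  # evict least-recently-used


def flush_cache() -> None:
    """Call from every write path (store/update/archive) after a commit."""
    _cache.clear()
```

In the real module, cache_get would additionally return a defensive copy, since callers enrich and mutate result metadata after search; the key point of the finding is that every successful write path ends with a flush.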


EtanHey force-pushed the fix/search-hardening-pr-round branch from dcab659 to fe9fac8 on March 14, 2026 at 02:21
@EtanHey EtanHey merged commit 7337bd9 into main Mar 14, 2026
4 checks passed
@EtanHey EtanHey deleted the fix/search-hardening-pr-round branch March 14, 2026 02:21