feat: search hardening phase 1 — indexes, tag table, FTS5, compact importance (#71)
Conversation
…D/OR, compact importance

- Gap F: Add content_type + language indexes to chunks table (eliminates full table scans)
- Gap G: Add chunk_tags junction table with triggers + backfill (replaces O(n) json_each scanning)
- Gap H: FTS5 AND/OR hybrid — AND for ≤3 terms (precision), OR for 4+ (recall via RRF)
- Gap I: Route brainlayer_search backward-compat alias through _brain_search (entity detection, KG routing)
- Gap J: Include importance field in compact search results

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
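The Gap H heuristic above (AND for ≤3 terms, OR for 4+) can be sketched roughly as follows. The real helper is `_escape_fts5_query` in `src/brainlayer/_helpers.py` and is not shown in this thread, so the standalone function name, the exact tokenization, and the quoting strategy here are illustrative assumptions:

```python
import re

def escape_fts5_query(query: str, match_mode: str = "auto") -> str:
    """Hypothetical sketch of an FTS5 MATCH-expression builder.

    "auto" joins terms with AND when there are 3 or fewer (precision)
    and with OR when there are 4 or more (recall, later fused via RRF).
    """
    # Keep only word characters; double-quote each term so FTS5
    # operators and punctuation in user input are neutralized.
    terms = re.findall(r"\w+", query)
    if not terms:
        return ""
    quoted = [f'"{t}"' for t in terms]
    if match_mode == "or" or (match_mode == "auto" and len(terms) >= 4):
        return " OR ".join(quoted)
    return " AND ".join(quoted)

print(escape_fts5_query("sqlite fts5 index"))      # "sqlite" AND "fts5" AND "index"
print(escape_fts5_query("how to tune fts5 rank"))  # OR mode kicks in at 4+ terms
```

Quoting every term is the usual way to stop user input from being parsed as FTS5 syntax (`NEAR`, `*`, column filters); the 3-term threshold is taken from the commit message, not from the code.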
…string

- Add chunk_tags_update_clear trigger for NULL/invalid tag updates
- Detect partial backfills via count comparison (crash recovery)
- Update compact result docstring to include importance

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Introduces configurable FTS5 term matching, migrates tag filtering from JSON scanning to a normalized `chunk_tags` junction table, and surfaces `importance` in compact search results.
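A minimal sketch of what such a junction-table migration might look like, using plain `sqlite3`; the table, index, and trigger names here are assumptions, since the PR's actual DDL is not shown in this thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chunks (id INTEGER PRIMARY KEY, tags TEXT);

    -- One row per (chunk, tag) pair: tag filters become indexed
    -- joins instead of per-row json_each() scans.
    CREATE TABLE chunk_tags (
        chunk_id INTEGER NOT NULL,
        tag TEXT NOT NULL,
        PRIMARY KEY (chunk_id, tag)
    ) WITHOUT ROWID;
    CREATE INDEX idx_chunk_tags_tag ON chunk_tags(tag);

    -- Keep the junction table in sync with the JSON column on insert.
    CREATE TRIGGER chunk_tags_insert AFTER INSERT ON chunks
    WHEN NEW.tags IS NOT NULL AND json_valid(NEW.tags) = 1
    BEGIN
        INSERT OR IGNORE INTO chunk_tags(chunk_id, tag)
        SELECT NEW.id, j.value FROM json_each(NEW.tags) j;
    END;
""")

conn.execute("""INSERT INTO chunks VALUES (1, '["sql", "fts5"]')""")
rows = conn.execute(
    "SELECT tag FROM chunk_tags WHERE chunk_id = 1 ORDER BY tag"
).fetchall()
print(rows)  # [('fts5',), ('sql',)]
```

`json_each` is usable inside trigger bodies because it is an innocuous table-valued function; a real migration would also need update/delete triggers (the review mentions a `chunk_tags_update_clear` trigger) and the backfill discussed below.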
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant MCP as MCP/_brain_search
    participant Repo as search_repo / kg_repo
    participant Helper as _helpers._escape_fts5_query
    participant DB as SQLite (FTS5 + chunk_tags)
    participant Builder as mcp/_shared._build_compact_result
    Client->>MCP: search request (query, tags, detail)
    MCP->>Repo: build query (vector/text/hybrid) + tags
    Repo->>Helper: escape FTS5 query (match_mode: "or" or "auto")
    Helper-->>Repo: escaped MATCH clause
    Repo->>DB: execute SQL (MATCH + join on chunk_tags for tag filtering)
    DB-->>Repo: rows (including importance)
    Repo->>Builder: map rows -> compact items
    Builder-->>MCP: compact results (includes importance)
    MCP-->>Client: response
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/brainlayer/mcp/__init__.py (1)
924-940: ⚠️ Potential issue | 🟠 Major

Missing `entity_id` parameter breaks backward compatibility.

The `brainlayer_search` alias now routes to `_brain_search` but doesn't forward `entity_id`. The old `_search` function accepted `entity_id` (per context snippet 2), so existing callers using `brainlayer_search(entity_id=...)` will have that parameter silently dropped. Compare with `brain_search` (line 832), which correctly passes `entity_id=arguments.get("entity_id")`.

🐛 Proposed fix to add entity_id forwarding

```diff
 elif name == "brainlayer_search":
     return await _with_timeout(
         _brain_search(
             query=arguments["query"],
             project=arguments.get("project"),
             content_type=arguments.get("content_type"),
             num_results=arguments.get("num_results", 5),
             source=arguments.get("source"),
             tag=arguments.get("tag"),
             intent=arguments.get("intent"),
             importance_min=arguments.get("importance_min"),
             date_from=arguments.get("date_from"),
             date_to=arguments.get("date_to"),
             sentiment=arguments.get("sentiment"),
+            entity_id=arguments.get("entity_id"),
             detail=arguments.get("detail", "compact"),
         )
     )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/brainlayer/mcp/__init__.py` around lines 924-940: the brainlayer_search branch calls _brain_search but omits forwarding entity_id, breaking backward compatibility; update the brainlayer_search call so it passes entity_id=arguments.get("entity_id") (matching how brain_search does it) when invoking _brain_search to ensure existing callers receive the entity_id parameter.

src/brainlayer/kg_repo.py (1)
341-349: ⚠️ Potential issue | 🟠 Major

Always using OR here makes entity resolution too permissive.

Later in this file, `resolve_entity()` falls back to `search_entities(name_or_alias, limit=1)`. With `match_mode="or"`, a miss on a multi-token name can now resolve to any entity sharing just one token, and because that fallback does not constrain `entity_type`, the wrong type can win too. I would keep `search_entities()` precision-oriented by default and let only the broad recall path opt into OR, or do a second OR pass only after the precise query returns nothing.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/brainlayer/kg_repo.py` around lines 341-349: the search_entities function always uses _escape_fts5_query(..., match_mode="or"), which makes resolution too permissive; change search_entities to default to a precision-oriented match_mode (e.g., "and" or a new default like "phrase") and add an optional parameter match_mode: str = "and" to its signature so callers can request OR when needed; then update resolve_entity to call search_entities(name_or_alias, limit=1, match_mode="or") only as the fallback path after a precise search returns nothing. Ensure references to _escape_fts5_query and search_entities are updated accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/brainlayer/vector_store.py`:
- Around line 282-294: The current completeness check using tagged_chunks and
backfilled_chunks is lossy; change it to compare distinct (chunk_id, tag) pairs
instead of rows: compute tagged_pairs by running SELECT COUNT(*) FROM chunks c,
json_each(c.tags) j WHERE c.tags IS NOT NULL AND json_valid(c.tags)=1 (or SELECT
COUNT(DISTINCT c.id || '::' || j.value) if you need distinctness), compute
backfilled_pairs as SELECT COUNT(*) FROM chunk_tags, and only run the INSERT OR
IGNORE when backfilled_pairs < tagged_pairs; alternatively implement an atomic
backfill by creating a transaction that builds a temporary table of (chunk_id,
tag) from json_each(c.tags), replaces chunk_tags with that table, and sets a
completion sentinel flag in a metadata table to avoid re-running partial
backfills—update the code paths referencing tagged_chunks, backfilled_chunks,
chunk_tags, chunks.tags, json_each, and the existing INSERT OR IGNORE
accordingly.
In `@tests/test_smart_search_entity_dedup.py`:
- Around line 102-104: Update the test's docstring to match the current
assertions: it should state that in compact mode the result omits "source_file"
but does include "importance". Locate the test's docstring near the assertions
that check assert "source_file" not in result and assert "importance" in result
and change the wording (previously saying compact omits importance) to reflect
the new contract.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 73fe12eb-5fe4-4a21-a242-4fa978949810
📒 Files selected for processing (9)

- src/brainlayer/_helpers.py
- src/brainlayer/kg_repo.py
- src/brainlayer/mcp/__init__.py
- src/brainlayer/mcp/_shared.py
- src/brainlayer/search_repo.py
- src/brainlayer/vector_store.py
- tests/test_phase6_critical.py
- tests/test_search_routing.py
- tests/test_smart_search_entity_dedup.py
💤 Files with no reviewable changes (1)
- tests/test_search_routing.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.12)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (5)
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Python package structure with Typer CLI located in
src/brainlayer/
Files:
src/brainlayer/_helpers.py, src/brainlayer/mcp/__init__.py, src/brainlayer/kg_repo.py, src/brainlayer/mcp/_shared.py, src/brainlayer/search_repo.py, src/brainlayer/vector_store.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Lint and format Python code using
`ruff check src/` and `ruff format src/`
Files:
src/brainlayer/_helpers.py, src/brainlayer/mcp/__init__.py, src/brainlayer/kg_repo.py, src/brainlayer/mcp/_shared.py, src/brainlayer/search_repo.py, src/brainlayer/vector_store.py
**/mcp/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement MCP tools:
`brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_get_person` with legacy `brainlayer_*` aliases
Files:
src/brainlayer/mcp/__init__.py, src/brainlayer/mcp/_shared.py
**/*test*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Run tests with
pytest
Files:
tests/test_smart_search_entity_dedup.py, tests/test_phase6_critical.py
**/vector_store.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use APSW with sqlite-vec for vector storage (in
vector_store.py)
Files:
src/brainlayer/vector_store.py
🧠 Learnings (4)
📚 Learning: 2026-03-02T21:14:04.731Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-02T21:14:04.731Z
Learning: Applies to **/mcp/**/*.py : Implement MCP tools: `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_get_person` with legacy `brainlayer_*` aliases
Applied to files:
src/brainlayer/mcp/__init__.py, src/brainlayer/mcp/_shared.py
📚 Learning: 2026-03-02T21:14:04.731Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-02T21:14:04.731Z
Learning: Applies to scripts/**/*.py : Fix 8 scripts in `scripts/` that hardcode `brainlayer.db` path to use dynamic path resolution during DB consolidation
Applied to files:
src/brainlayer/mcp/__init__.py
📚 Learning: 2026-03-02T21:14:04.731Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-02T21:14:04.731Z
Learning: Applies to scripts/**/*.py : Drop FTS triggers (especially `chunks_fts_delete`) before bulk deletes and recreate after — FTS triggers cause massive performance degradation on large datasets
Applied to files:
src/brainlayer/vector_store.py
📚 Learning: 2026-03-02T21:14:04.731Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-02T21:14:04.731Z
Learning: Applies to scripts/**/*.py : Never delete from `chunks` table while FTS trigger is active on large datasets
Applied to files:
src/brainlayer/vector_store.py
🔇 Additional comments (2)

src/brainlayer/mcp/_shared.py (1)

242-246: Good additive compact payload change.

Keeping `importance` behind the existing `is not None` guard makes this backward-compatible for result rows that still do not populate the field.

src/brainlayer/search_repo.py (1)

67-71: Good switch to indexed tag lookups across all search paths.

These `chunk_tags` subqueries remove the per-row JSON scan from vector, LIKE, and FTS search without disturbing the existing parameter flow.

Also applies to: 136-140, 404-408
```python
# Backfill chunk_tags from existing data (detects partial fills from crashes)
tagged_chunks = cursor.execute(
    "SELECT COUNT(*) FROM chunks WHERE tags IS NOT NULL AND json_valid(tags) = 1"
).fetchone()[0]
backfilled_chunks = cursor.execute(
    "SELECT COUNT(DISTINCT chunk_id) FROM chunk_tags"
).fetchone()[0]
if tagged_chunks > 0 and backfilled_chunks < tagged_chunks:
    cursor.execute("""
        INSERT OR IGNORE INTO chunk_tags(chunk_id, tag)
        SELECT c.id, j.value FROM chunks c, json_each(c.tags) j
        WHERE c.tags IS NOT NULL AND json_valid(c.tags) = 1
    """)
```
The crash-recovery completeness check is still lossy.
tagged_chunks counts rows with any valid JSON in chunks.tags, while backfilled_chunks counts distinct chunk_ids in chunk_tags. That means tags = '[]' will force a full backfill on every startup, and a crash after inserting only some tags for one chunk still looks “complete” because that chunk contributes 1 on both sides. Since src/brainlayer/search_repo.py now filters by chunk_tags only, this can leave tag search permanently incomplete. Please key this off distinct (chunk_id, tag) pairs or make the backfill atomic with a completion sentinel.
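A sketch of the pair-based check this comment asks for, using plain `sqlite3` and a hypothetical minimal schema; it assumes tag arrays contain no duplicate values (otherwise the `COUNT(DISTINCT ...)` variant is needed):

```python
import sqlite3

def backfill_is_complete(conn: sqlite3.Connection) -> bool:
    """Compare distinct (chunk_id, tag) pairs, not chunk counts.

    A per-chunk count misses a crash that copied only some of a
    chunk's tags; expanding the JSON to pairs does not.
    """
    tagged_pairs = conn.execute(
        """SELECT COUNT(*) FROM chunks c, json_each(c.tags) j
           WHERE c.tags IS NOT NULL AND json_valid(c.tags) = 1"""
    ).fetchone()[0]
    backfilled_pairs = conn.execute(
        "SELECT COUNT(*) FROM chunk_tags"
    ).fetchone()[0]
    return backfilled_pairs >= tagged_pairs

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chunks (id INTEGER PRIMARY KEY, tags TEXT);
    CREATE TABLE chunk_tags (chunk_id INTEGER, tag TEXT, PRIMARY KEY (chunk_id, tag));
""")
conn.execute("""INSERT INTO chunks VALUES (1, '["a", "b"]')""")

# Simulate a crash after only one of the chunk's two tags was copied:
conn.execute("INSERT INTO chunk_tags VALUES (1, 'a')")
partial = backfill_is_complete(conn)   # False: 1 pair present, 2 expected

conn.execute("INSERT INTO chunk_tags VALUES (1, 'b')")
complete = backfill_is_complete(conn)  # True: 2 of 2 pairs present
```

Note that this also fixes the `tags = '[]'` case the comment mentions: an empty array expands to zero pairs, so it no longer forces a backfill on every startup.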
```diff
 # Compact mode should not have verbose fields (importance IS included now)
 assert "source_file" not in result
-assert "importance" not in result
+assert "importance" in result
```
🧹 Nitpick | 🔵 Trivial

The test description is stale after the assertion flip.

The docstring on line 86 still says compact results omit `importance`, but line 104 now requires it. Please update the description so the test documents the current contract.
Summary

Fixes 5 remaining search/index gaps found in the March 8 audit (Gaps F-J).

- `content_type` + `language` indexes — eliminates full table scans on 268K+ rows
- `chunk_tags` junction table with triggers + backfill — replaces O(n) `json_each()` scanning with indexed lookup
- FTS5 AND/OR hybrid — AND for ≤3 terms (precision), OR for 4+ (recall via RRF)
- `brainlayer_search` backward-compat alias through `_brain_search` — enables entity detection, KG routing, detail param
- `importance` field in compact search results — visible without drill-down

Test plan
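The Gap F index additions can be sketched like this; the index names are assumptions, since the migration DDL is not shown in this thread, and `EXPLAIN QUERY PLAN` confirms the scan is gone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, content_type TEXT, language TEXT)"
)
# Hypothetical index names; the PR's migration may use different ones.
conn.execute("CREATE INDEX idx_chunks_content_type ON chunks(content_type)")
conn.execute("CREATE INDEX idx_chunks_language ON chunks(language)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM chunks WHERE content_type = ?", ("code",)
).fetchall()
print(plan[0][3])  # a SEARCH via idx_chunks_content_type, not a full table SCAN
```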
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Search Behavior
Filters
Tests