feat: Phase 1 — KG schema + entity table + youtube source fix#29
feat: Phase 1 — KG schema + entity table + youtube source fix#29
Conversation
Add 5 knowledge graph tables to BrainLayer's sqlite DB: - kg_entities: core entity storage (person, company, meeting, topic, project, golem, skill) - kg_relations: explicit relationships between entities with confidence scoring - kg_entity_chunks: bridge table linking entities to existing 270K chunks - kg_vec_entities: sqlite-vec virtual table for semantic entity search (1024 dims) - kg_entities_fts: FTS5 full-text search on entity names/metadata with sync triggers Also adds: - source_project_id column on chunks table (critical issue #3 from plan) - KG CRUD methods on VectorStore (upsert_entity, add_relation, link_entity_chunk, search_entities, search_entities_semantic, kg_stats, etc.) - Fix brain_search youtube bug: skip project auto-scope for non-claude_code sources (youtube, whatsapp, telegram) since those chunks have null/different project values 40 new tests covering schema creation, CRUD, FTS5 triggers, vector search, idempotency, and source filter routing. 299 total tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review infoConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughIntroduces a Knowledge Graph subsystem into the VectorStore (schema, FTS5 and vector-backed entity search, CRUD APIs, and KG-chunk linking) and tightens auto-scope logic in brain_search to disable auto-scoping for non-claude_code sources (youtube, whatsapp, telegram, all). Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant VectorStore
participant EmbeddingService
participant DB
Client->>VectorStore: search_entities_semantic(query or embedding)
VectorStore->>EmbeddingService: embed(query) %% if query text provided
EmbeddingService-->>VectorStore: embedding
VectorStore->>DB: query kg_vec_entities (ANN) with embedding + filters
DB-->>VectorStore: matching entity_ids + scores
VectorStore->>DB: fetch kg_entities (FTS fallback / metadata)
DB-->>VectorStore: entity records
VectorStore-->>Client: aggregated entity results (vectors + metadata)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/brainlayer/vector_store.py`:
- Around line 2289-2312: search_entities_semantic currently passes the requested
limit as the vec0 "k" which is applied before the WHERE entity_type filter, so
when entity_type is set you can get fewer than limit results; update the
function (search_entities_semantic) to over-fetch and then slice: compute a
fetch_k (e.g. min(max(limit * 3, limit + 50), some_upper_bound)) and use fetch_k
in place of k when binding to the SQL (still using query_bytes and the same JOIN
on kg_vec_entities v / kg_entities e), execute the query, then post-filter/slice
the returned rows to return at most limit entries matching entity_type; ensure
you bound fetch_k to avoid unbounded scans and preserve the existing behavior
when entity_type is None by using limit directly.
- Around line 2049-2072: The add_relation function currently returns the
caller-supplied relation_id even when the INSERT ON CONFLICT update keeps the
existing row id; fix by retrieving and returning the actual stored id after the
upsert (e.g., SELECT id FROM kg_relations WHERE source_id=? AND target_id=? AND
relation_type=?), so add_relation returns that fetched id instead of the input
relation_id; update references to relation_id, source_id, target_id,
relation_type and the kg_relations table to perform the select after the execute
and return the selected id.
- Around line 2016-2047: The current upsert_entity returns the caller-provided
entity_id even when ON CONFLICT causes the existing row (with a different id) to
be kept; change upsert_entity so that after executing the INSERT ... ON
CONFLICT(...) DO UPDATE statement you query the actual stored id (e.g. SELECT id
FROM kg_entities WHERE entity_type = ? AND name = ?) to obtain the real entity
ID, use that stored_id when writing the embedding into kg_vec_entities (replace
uses of the original entity_id for the embedding INSERT/REPLACE), and return the
stored_id instead of the caller's id; keep all other metadata/embedding handling
the same and ensure you use the same entity_type/name parameters to locate the
row.
- Around line 347-428: The DB schema enables foreign key constraints on
kg_relations and kg_entity_chunks but SQLite’s foreign key enforcement is off by
default; in _init_db enable enforcement by executing PRAGMA foreign_keys = ON
(use cursor.execute("PRAGMA foreign_keys = ON") immediately after obtaining the
cursor/connection in _init_db) so the REFERENCES on kg_relations and
kg_entity_chunks are actually enforced; if you intentionally don’t want
enforcement, remove the REFERENCES clauses from kg_relations and
kg_entity_chunks instead.
In `@tests/test_kg_schema.py`:
- Around line 118-125: Update the test_upsert_entity_updates_on_conflict to
verify that on conflict the original entity ID is preserved and that
upsert_entity returns the stored ID: after the second upsert_entity call
retrieve the entity via get_entity_by_name and assert entity["id"] == "person-1"
(original ID preserved), and then call upsert_entity again (e.g., with
"person-3") and assert the returned_id equals entity["id"]; this will detect the
stale-ID bug in upsert_entity in vector_store.py where the method may return the
new candidate ID instead of the existing stored ID.
- Around line 358-377: In test_fts_trigger_updates_on_entity_update add an
assertion to verify the stale FTS entry was removed: after calling old_results =
store.search_entities("Old Name") assert len(old_results) == 0; this ensures the
UPDATE to kg_entities (done via direct SQL in the test) actually caused the FTS
triggers to remove the old entry—update the test function
test_fts_trigger_updates_on_entity_update accordingly, referencing old_results
and using store.search_entities to perform the check.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
src/brainlayer/mcp/__init__.pysrc/brainlayer/vector_store.pytests/test_kg_schema.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🧰 Additional context used
📓 Path-based instructions (4)
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use
ruff check src/for linting andruff format src/for code formatting
Files:
src/brainlayer/mcp/__init__.pysrc/brainlayer/vector_store.py
src/brainlayer/mcp/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names
Files:
src/brainlayer/mcp/__init__.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Run tests using
pytestfrom the project root
Files:
tests/test_kg_schema.py
src/brainlayer/vector_store.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use sqlite-vec with APSW for vector storage, WAL mode, and
PRAGMA busy_timeout = 5000for concurrent multi-process safety
Files:
src/brainlayer/vector_store.py
🧠 Learnings (1)
📚 Learning: 2026-02-23T16:51:38.317Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/mcp/**/*.py : Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names
Applied to files:
src/brainlayer/mcp/__init__.py
🧬 Code graph analysis (1)
src/brainlayer/vector_store.py (2)
src/brainlayer/clustering.py (1)
serialize_f32(53-60)scripts/reembed_bge_m3.py (1)
serialize_f32(39-41)
🔇 Additional comments (5)
src/brainlayer/vector_store.py (1)
342-345: LGTM —source_project_idmigration is clean.Correctly checks
existing_cols(populated at Line 115) before altering, and creates the index conditionally. Consistent with the existing migration pattern.src/brainlayer/mcp/__init__.py (1)
1008-1017: Clean and well-documented bugfix for source-aware auto-scoping.The negative check
source not in ("youtube", "whatsapp", "telegram", "all")correctly preserves auto-scoping for the default case (source=None→ claude_code) and explicitsource="claude_code", while disabling it for sources whose chunks lack project values. The inline comment clearly explains the root cause.tests/test_kg_schema.py (3)
22-39: LGTM — Clean fixtures with deterministic embeddings.The
tmp_path-based store ensures test isolation, and the deterministic embedding generator using character ordinals provides reproducible vector search results. Both fixtures properly handle setup/teardown.
45-92: LGTM — Thorough schema validation.Covers all 5 KG tables plus the
source_project_idmigration, including exact column set assertions. Good use of set comparison for column validation.
490-521: LGTM — Valuable idempotency and persistence tests.Tests confirm
_init_dbis safe to re-run and that KG data survives acrossVectorStoreinstances. Good coverage for migration safety.
- upsert_entity: return actual stored ID on conflict (not caller's ID) - add_relation: same stale-ID fix - search_entities_semantic: over-fetch with entity_type filter (vec0 applies k before WHERE, so filtered results could be fewer than limit) - Remove dead REFERENCES clauses from kg_relations and kg_entity_chunks (SQLite FK enforcement is off by default, not worth enabling) - Test: assert original ID preserved on upsert conflict - Test: fix FTS trigger test — use unique terms to avoid OR match overlap Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| # Auto-scope project from CWD if not provided — but ONLY for claude_code source. | ||
| # Non-claude_code sources (youtube, whatsapp, etc.) have null/different project values, | ||
| # so auto-scoping filters them out entirely (bug: brain_search(source="youtube") → 0 results). | ||
| if project is None and source not in ("youtube", "whatsapp", "telegram", "all"): |
There was a problem hiding this comment.
Fragile negative source list breaks future source types
Medium Severity
The project auto-scope bypass uses a negative list (source not in ("youtube", "whatsapp", "telegram", "all")) instead of a positive list. The PR description states the intent as "skip project auto-scope for non-claude_code sources," but the implementation only skips for three hardcoded sources plus "all". Any future non-claude_code source (or existing ones like "claude_desktop") that has null/different project values would still get auto-scoped, silently returning 0 results — the exact same bug this fix aims to prevent. Using source in (None, "claude_code") would match the stated intent and default to the safe behavior (no auto-scoping) for unknown sources.


Summary
_init_db()migration:kg_entities— core entity storage (person, company, meeting, topic, project, golem, skill) withUNIQUE(entity_type, name)kg_relations— explicit relationships between entities (works_at, participant_in, mentioned_in, etc.) with confidence scoringkg_entity_chunks— bridge table linking KG entities to existing 270K chunks with relevance scoringkg_vec_entities— sqlite-vec virtual table for semantic entity search (1024 dims, matching existing bge-large)kg_entities_fts— FTS5 full-text search on entity names/metadata, with auto-sync triggers (insert/update/delete)source_project_idcolumn tochunkstable (critical issue feat(mcp): tool annotations + completions support #3 — links every chunk to its project entity)VectorStore:upsert_entity,add_relation,link_entity_chunk,get_entity,get_entity_by_name,get_entity_relations,get_entity_chunks,search_entities,search_entities_semantic,kg_statsbrain_search(source="youtube")to return 0 resultsTest plan
tests/test_kg_schema.pycovering:ruff checkpasses clean🤖 Generated with Claude Code
Note
Medium Risk
Moderate risk due to new sqlite schema migrations (new tables, triggers, and an
ALTER TABLEonchunks) that will run on existing DBs; functional impact is largely additive and is covered by new tests.Overview
Adds Phase 1 Knowledge Graph persistence to the sqlite-backed
VectorStore: a newsource_project_idcolumn onchunks, plus new KG tables (kg_entities,kg_relations,kg_entity_chunks,kg_vec_entities,kg_entities_fts) including FTS sync triggers and indexes, and exposes CRUD/query APIs for entity upsert, relations, entity↔chunk linking, FTS search, semantic entity search, and KG stats.Fixes
brain_searchauto project-scoping so it does not infer/force a CWD project when searching non-claude_codesources (e.g. YouTube/WhatsApp/Telegram), preventing those results from being filtered out. Adds a comprehensivetest_kg_schema.pysuite covering schema creation/idempotency, KG CRUD, search, and the source-scoping regression.Written by Cursor Bugbot for commit 0328b35. This will update automatically on new commits. Configure here.
Summary by CodeRabbit
New Features
Bug Fixes
Tests