feat: Phase 1 — KG schema + entity table + youtube source fix by EtanHey · Pull Request #29 · EtanHey/brainlayer

EtanHey · 2026-02-25T11:32:32Z

Summary

Add 5 knowledge graph tables to BrainLayer's sqlite DB via _init_db() migration:
- kg_entities — core entity storage (person, company, meeting, topic, project, golem, skill) with UNIQUE(entity_type, name)
- kg_relations — explicit relationships between entities (works_at, participant_in, mentioned_in, etc.) with confidence scoring
- kg_entity_chunks — bridge table linking KG entities to existing 270K chunks with relevance scoring
- kg_vec_entities — sqlite-vec virtual table for semantic entity search (1024 dims, matching existing bge-large)
- kg_entities_fts — FTS5 full-text search on entity names/metadata, with auto-sync triggers (insert/update/delete)
Add source_project_id column to chunks table (critical issue feat(mcp): tool annotations + completions support #3 — links every chunk to its project entity)
Add KG CRUD methods to VectorStore: upsert_entity, add_relation, link_entity_chunk, get_entity, get_entity_by_name, get_entity_relations, get_entity_chunks, search_entities, search_entities_semantic, kg_stats
Fix brain_search youtube bug: skip project auto-scope for non-claude_code sources (youtube, whatsapp, telegram) since those chunks have null/different project values, causing brain_search(source="youtube") to return 0 results

Test plan

40 new tests in tests/test_kg_schema.py covering:
- Table creation and column verification
- Entity CRUD (upsert, get, get_by_name, conflict handling)
- Relation CRUD (add, query by direction, conflict handling)
- Entity-chunk bridge (link, query, relevance ordering)
- FTS5 search (by name, metadata, type filter, trigger sync on update/delete)
- Vector search (basic, with type filter, multiple entities)
- KG stats
- source_project_id column existence
- YouTube source filter function signature verification
- Idempotency (double init, data persistence across instances)
299 total tests pass (259 existing + 40 new), 0 regressions
ruff check passes clean

🤖 Generated with Claude Code

Note

Medium Risk
Moderate risk due to new sqlite schema migrations (new tables, triggers, and an ALTER TABLE on chunks) that will run on existing DBs; functional impact is largely additive and is covered by new tests.

Overview
Adds Phase 1 Knowledge Graph persistence to the sqlite-backed VectorStore: a new source_project_id column on chunks, plus new KG tables (kg_entities, kg_relations, kg_entity_chunks, kg_vec_entities, kg_entities_fts) including FTS sync triggers and indexes, and exposes CRUD/query APIs for entity upsert, relations, entity↔chunk linking, FTS search, semantic entity search, and KG stats.

Fixes brain_search auto project-scoping so it does not infer/force a CWD project when searching non-claude_code sources (e.g. YouTube/WhatsApp/Telegram), preventing those results from being filtered out. Adds a comprehensive test_kg_schema.py suite covering schema creation/idempotency, KG CRUD, search, and the source-scoping regression.

^{Written by Cursor Bugbot for commit 0328b35. This will update automatically on new commits. Configure here.}

Summary by CodeRabbit

New Features
- Added a Knowledge Graph: entity management, relations, entity–chunk linking, full-text and semantic entity search, and KG stats for richer retrieval.
Bug Fixes
- Corrected source-based search filtering to avoid unintended matches for YouTube, WhatsApp, Telegram and "all" sources.
- Ensured chunks track project linkage to improve source-specific results.
Tests
- Added comprehensive tests covering KG schema, CRUD, FTS, vector search, relations, linking, stats, and source-filter behavior.

Add 5 knowledge graph tables to BrainLayer's sqlite DB: - kg_entities: core entity storage (person, company, meeting, topic, project, golem, skill) - kg_relations: explicit relationships between entities with confidence scoring - kg_entity_chunks: bridge table linking entities to existing 270K chunks - kg_vec_entities: sqlite-vec virtual table for semantic entity search (1024 dims) - kg_entities_fts: FTS5 full-text search on entity names/metadata with sync triggers Also adds: - source_project_id column on chunks table (critical issue #3 from plan) - KG CRUD methods on VectorStore (upsert_entity, add_relation, link_entity_chunk, search_entities, search_entities_semantic, kg_stats, etc.) - Fix brain_search youtube bug: skip project auto-scope for non-claude_code sources (youtube, whatsapp, telegram) since those chunks have null/different project values 40 new tests covering schema creation, CRUD, FTS5 triggers, vector search, idempotency, and source filter routing. 299 total tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-02-25T11:42:20Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 058262d and 0328b35.

📒 Files selected for processing (2)

src/brainlayer/vector_store.py
tests/test_kg_schema.py

📝 Walkthrough

Walkthrough

Introduces a Knowledge Graph subsystem into the VectorStore (schema, FTS5 and vector-backed entity search, CRUD APIs, and KG-chunk linking) and tightens auto-scope logic in brain_search to disable auto-scoping for non-claude_code sources (youtube, whatsapp, telegram, all).

Changes

Cohort / File(s)	Summary
Brain Search Auto-Scoping `src/brainlayer/mcp/__init__.py`	Tightened auto-scope: auto-scoping now applies only when project is None AND source is not in (`"youtube"`, `"whatsapp"`, `"telegram"`, `"all"`), avoiding unintended zero-result source-based searches.
Knowledge Graph Subsystem `src/brainlayer/vector_store.py`	Added KG schema tables/indexes (`kg_entities`, `kg_relations`, `kg_entity_chunks`, `kg_vec_entities`, `kg_entities_fts`), synchronization triggers, timezone-aware timestamps, added `source_project_id` on chunks, and nine KG public methods (upsert_entity, add_relation, link_entity_chunk, get_entity, get_entity_by_name, get_entity_relations, get_entity_chunks, search_entities, search_entities_semantic, kg_stats).
KG Tests `tests/test_kg_schema.py`	Comprehensive tests for KG schema creation, entity/relation CRUD, entity-chunk linking, FTS triggers and searches, semantic/vector search, and KG statistics; also validates the YouTube source filter fix and chunks schema changes.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant VectorStore
  participant EmbeddingService
  participant DB

  Client->>VectorStore: search_entities_semantic(query or embedding)
  VectorStore->>EmbeddingService: embed(query)        %% if query text provided
  EmbeddingService-->>VectorStore: embedding
  VectorStore->>DB: query kg_vec_entities (ANN) with embedding + filters
  DB-->>VectorStore: matching entity_ids + scores
  VectorStore->>DB: fetch kg_entities (FTS fallback / metadata)
  DB-->>VectorStore: entity records
  VectorStore-->>Client: aggregated entity results (vectors + metadata)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

feat: Phase 5 — search/store hardening #26: Modifies auto-scoping logic in src/brainlayer/mcp/__init__.py; closely related to the tightened auto-scope changes here.
test: Phase 4 QA — comprehensive tests for Phase 3 core fixes #4: Touches VectorStore schema and chunks fields; related to chunks schema changes (timestamps/source fields) and KG-related schema evolution.

Poem

🐰 I hopped through tables, rows in tow,
Entities bloom where vectors flow,
I fixed the scope by streams of light,
FTS and links now hum just right,
A tiny hop — the graph grows bright 🌱

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately captures all three main components of the changeset: KG schema introduction, entity table addition, and the YouTube source bug fix.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/phase-1-kg-schema

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/vector_store.py`:
- Around line 2289-2312: search_entities_semantic currently passes the requested
limit as the vec0 "k" which is applied before the WHERE entity_type filter, so
when entity_type is set you can get fewer than limit results; update the
function (search_entities_semantic) to over-fetch and then slice: compute a
fetch_k (e.g. min(max(limit * 3, limit + 50), some_upper_bound)) and use fetch_k
in place of k when binding to the SQL (still using query_bytes and the same JOIN
on kg_vec_entities v / kg_entities e), execute the query, then post-filter/slice
the returned rows to return at most limit entries matching entity_type; ensure
you bound fetch_k to avoid unbounded scans and preserve the existing behavior
when entity_type is None by using limit directly.
- Around line 2049-2072: The add_relation function currently returns the
caller-supplied relation_id even when the INSERT ON CONFLICT update keeps the
existing row id; fix by retrieving and returning the actual stored id after the
upsert (e.g., SELECT id FROM kg_relations WHERE source_id=? AND target_id=? AND
relation_type=?), so add_relation returns that fetched id instead of the input
relation_id; update references to relation_id, source_id, target_id,
relation_type and the kg_relations table to perform the select after the execute
and return the selected id.
- Around line 2016-2047: The current upsert_entity returns the caller-provided
entity_id even when ON CONFLICT causes the existing row (with a different id) to
be kept; change upsert_entity so that after executing the INSERT ... ON
CONFLICT(...) DO UPDATE statement you query the actual stored id (e.g. SELECT id
FROM kg_entities WHERE entity_type = ? AND name = ?) to obtain the real entity
ID, use that stored_id when writing the embedding into kg_vec_entities (replace
uses of the original entity_id for the embedding INSERT/REPLACE), and return the
stored_id instead of the caller's id; keep all other metadata/embedding handling
the same and ensure you use the same entity_type/name parameters to locate the
row.
- Around line 347-428: The DB schema enables foreign key constraints on
kg_relations and kg_entity_chunks but SQLite’s foreign key enforcement is off by
default; in _init_db enable enforcement by executing PRAGMA foreign_keys = ON
(use cursor.execute("PRAGMA foreign_keys = ON") immediately after obtaining the
cursor/connection in _init_db) so the REFERENCES on kg_relations and
kg_entity_chunks are actually enforced; if you intentionally don’t want
enforcement, remove the REFERENCES clauses from kg_relations and
kg_entity_chunks instead.

In `@tests/test_kg_schema.py`:
- Around line 118-125: Update the test_upsert_entity_updates_on_conflict to
verify that on conflict the original entity ID is preserved and that
upsert_entity returns the stored ID: after the second upsert_entity call
retrieve the entity via get_entity_by_name and assert entity["id"] == "person-1"
(original ID preserved), and then call upsert_entity again (e.g., with
"person-3") and assert the returned_id equals entity["id"]; this will detect the
stale-ID bug in upsert_entity in vector_store.py where the method may return the
new candidate ID instead of the existing stored ID.
- Around line 358-377: In test_fts_trigger_updates_on_entity_update add an
assertion to verify the stale FTS entry was removed: after calling old_results =
store.search_entities("Old Name") assert len(old_results) == 0; this ensures the
UPDATE to kg_entities (done via direct SQL in the test) actually caused the FTS
triggers to remove the old entry—update the test function
test_fts_trigger_updates_on_entity_update accordingly, referencing old_results
and using store.search_entities to perform the check.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96eb5df and 058262d.

📒 Files selected for processing (3)

src/brainlayer/mcp/__init__.py
src/brainlayer/vector_store.py
tests/test_kg_schema.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Cursor Bugbot

🧰 Additional context used

📓 Path-based instructions (4)

src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use ruff check src/ for linting and ruff format src/ for code formatting

Files:

src/brainlayer/mcp/__init__.py
src/brainlayer/vector_store.py

src/brainlayer/mcp/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names

Files:

src/brainlayer/mcp/__init__.py

tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Run tests using pytest from the project root

Files:

tests/test_kg_schema.py

src/brainlayer/vector_store.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use sqlite-vec with APSW for vector storage, WAL mode, and PRAGMA busy_timeout = 5000 for concurrent multi-process safety

Files:

src/brainlayer/vector_store.py

🧠 Learnings (1)

📚 Learning: 2026-02-23T16:51:38.317Z

Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/mcp/**/*.py : Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names

Applied to files:

src/brainlayer/mcp/__init__.py

🧬 Code graph analysis (1)

src/brainlayer/vector_store.py (2)

src/brainlayer/clustering.py (1)

serialize_f32 (53-60)

scripts/reembed_bge_m3.py (1)

serialize_f32 (39-41)

🔇 Additional comments (5)

src/brainlayer/vector_store.py (1)

342-345: LGTM — source_project_id migration is clean.

Correctly checks existing_cols (populated at Line 115) before altering, and creates the index conditionally. Consistent with the existing migration pattern.

src/brainlayer/mcp/__init__.py (1)

1008-1017: Clean and well-documented bugfix for source-aware auto-scoping.

The negative check source not in ("youtube", "whatsapp", "telegram", "all") correctly preserves auto-scoping for the default case (source=None → claude_code) and explicit source="claude_code", while disabling it for sources whose chunks lack project values. The inline comment clearly explains the root cause.

tests/test_kg_schema.py (3)

22-39: LGTM — Clean fixtures with deterministic embeddings.

The tmp_path-based store ensures test isolation, and the deterministic embedding generator using character ordinals provides reproducible vector search results. Both fixtures properly handle setup/teardown.

45-92: LGTM — Thorough schema validation.

Covers all 5 KG tables plus the source_project_id migration, including exact column set assertions. Good use of set comparison for column validation.

490-521: LGTM — Valuable idempotency and persistence tests.

Tests confirm _init_db is safe to re-run and that KG data survives across VectorStore instances. Good coverage for migration safety.

- upsert_entity: return actual stored ID on conflict (not caller's ID) - add_relation: same stale-ID fix - search_entities_semantic: over-fetch with entity_type filter (vec0 applies k before WHERE, so filtered results could be fewer than limit) - Remove dead REFERENCES clauses from kg_relations and kg_entity_chunks (SQLite FK enforcement is off by default, not worth enabling) - Test: assert original ID preserved on upsert conflict - Test: fix FTS trigger test — use unique terms to avoid OR match overlap Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.}

cursor · 2026-02-25T12:02:31Z

+    # Auto-scope project from CWD if not provided — but ONLY for claude_code source.
+    # Non-claude_code sources (youtube, whatsapp, etc.) have null/different project values,
+    # so auto-scoping filters them out entirely (bug: brain_search(source="youtube") → 0 results).
+    if project is None and source not in ("youtube", "whatsapp", "telegram", "all"):


Fragile negative source list breaks future source types

Medium Severity

The project auto-scope bypass uses a negative list (source not in ("youtube", "whatsapp", "telegram", "all")) instead of a positive list. The PR description states the intent as "skip project auto-scope for non-claude_code sources," but the implementation only skips for three hardcoded sources plus "all". Any future non-claude_code source (or existing ones like "claude_desktop") that has null/different project values would still get auto-scoped, silently returning 0 results — the exact same bug this fix aims to prevent. Using source in (None, "claude_code") would match the stated intent and default to the safe behavior (no auto-scoping) for unknown sources.

EtanHey and others added 2 commits February 25, 2026 13:32

style: format KG schema code with ruff

058262d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot reviewed Feb 25, 2026

View reviewed changes

Comment thread src/brainlayer/vector_store.py

Comment thread src/brainlayer/vector_store.py Outdated

Comment thread src/brainlayer/vector_store.py Outdated

Comment thread src/brainlayer/vector_store.py

Comment thread tests/test_kg_schema.py

Comment thread tests/test_kg_schema.py Outdated

cursor Bot reviewed Feb 25, 2026

View reviewed changes

Comment thread src/brainlayer/vector_store.py Outdated

EtanHey merged commit f7a8e17 into main Feb 25, 2026
5 checks passed

cursor Bot reviewed Feb 25, 2026

View reviewed changes

This was referenced Feb 26, 2026

feat(6pm): entity upgrades — per-person search, brain_get_person, entity-tagged store #33

Merged

feat: KG extraction pipeline — wires entities to KG tables #47

Merged

This was referenced Mar 28, 2026

docs: update README — 1,454 tests, 12 MCP tools, entity contracts #120

Merged

feat: implement all 6 BrainBar stub tools #135

Merged

fix: KG search via pure SQL path — no embedding model required (R69) #191

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Phase 1 — KG schema + entity table + youtube source fix#29

feat: Phase 1 — KG schema + entity table + youtube source fix#29
EtanHey merged 3 commits intomainfrom
feat/phase-1-kg-schema

EtanHey commented Feb 25, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Feb 25, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Feb 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EtanHey commented Feb 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Feb 25, 2026

Choose a reason for hiding this comment

Fragile negative source list breaks future source types

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

EtanHey commented Feb 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Feb 25, 2026 •

edited

Loading