Skip to content

feat: Phase 1 — KG schema + entity table + youtube source fix#29

Merged
EtanHey merged 3 commits intomainfrom
feat/phase-1-kg-schema
Feb 25, 2026
Merged

feat: Phase 1 — KG schema + entity table + youtube source fix#29
EtanHey merged 3 commits intomainfrom
feat/phase-1-kg-schema

Conversation

@EtanHey
Copy link
Copy Markdown
Owner

@EtanHey EtanHey commented Feb 25, 2026

Summary

  • Add 5 knowledge graph tables to BrainLayer's sqlite DB via _init_db() migration:
    • kg_entities — core entity storage (person, company, meeting, topic, project, golem, skill) with UNIQUE(entity_type, name)
    • kg_relations — explicit relationships between entities (works_at, participant_in, mentioned_in, etc.) with confidence scoring
    • kg_entity_chunks — bridge table linking KG entities to existing 270K chunks with relevance scoring
    • kg_vec_entities — sqlite-vec virtual table for semantic entity search (1024 dims, matching existing bge-large)
    • kg_entities_fts — FTS5 full-text search on entity names/metadata, with auto-sync triggers (insert/update/delete)
  • Add source_project_id column to chunks table (critical issue feat(mcp): tool annotations + completions support #3 — links every chunk to its project entity)
  • Add KG CRUD methods to VectorStore: upsert_entity, add_relation, link_entity_chunk, get_entity, get_entity_by_name, get_entity_relations, get_entity_chunks, search_entities, search_entities_semantic, kg_stats
  • Fix brain_search youtube bug: skip project auto-scope for non-claude_code sources (youtube, whatsapp, telegram) since those chunks have null/different project values, causing brain_search(source="youtube") to return 0 results

Test plan

  • 40 new tests in tests/test_kg_schema.py covering:
    • Table creation and column verification
    • Entity CRUD (upsert, get, get_by_name, conflict handling)
    • Relation CRUD (add, query by direction, conflict handling)
    • Entity-chunk bridge (link, query, relevance ordering)
    • FTS5 search (by name, metadata, type filter, trigger sync on update/delete)
    • Vector search (basic, with type filter, multiple entities)
    • KG stats
    • source_project_id column existence
    • YouTube source filter function signature verification
    • Idempotency (double init, data persistence across instances)
  • 299 total tests pass (259 existing + 40 new), 0 regressions
  • ruff check passes clean

🤖 Generated with Claude Code


Note

Medium Risk
Moderate risk due to new sqlite schema migrations (new tables, triggers, and an ALTER TABLE on chunks) that will run on existing DBs; functional impact is largely additive and is covered by new tests.

Overview
Adds Phase 1 Knowledge Graph persistence to the sqlite-backed VectorStore: a new source_project_id column on chunks, plus new KG tables (kg_entities, kg_relations, kg_entity_chunks, kg_vec_entities, kg_entities_fts) including FTS sync triggers and indexes, and exposes CRUD/query APIs for entity upsert, relations, entity↔chunk linking, FTS search, semantic entity search, and KG stats.

Fixes brain_search auto project-scoping so it does not infer/force a CWD project when searching non-claude_code sources (e.g. YouTube/WhatsApp/Telegram), preventing those results from being filtered out. Adds a comprehensive test_kg_schema.py suite covering schema creation/idempotency, KG CRUD, search, and the source-scoping regression.

Written by Cursor Bugbot for commit 0328b35. This will update automatically on new commits. Configure here.

Summary by CodeRabbit

  • New Features

    • Added a Knowledge Graph: entity management, relations, entity–chunk linking, full-text and semantic entity search, and KG stats for richer retrieval.
  • Bug Fixes

    • Corrected source-based search filtering to avoid unintended matches for YouTube, WhatsApp, Telegram and "all" sources.
    • Ensured chunks track project linkage to improve source-specific results.
  • Tests

    • Added comprehensive tests covering KG schema, CRUD, FTS, vector search, relations, linking, stats, and source-filter behavior.

EtanHey and others added 2 commits February 25, 2026 13:32
Add 5 knowledge graph tables to BrainLayer's sqlite DB:
- kg_entities: core entity storage (person, company, meeting, topic, project, golem, skill)
- kg_relations: explicit relationships between entities with confidence scoring
- kg_entity_chunks: bridge table linking entities to existing 270K chunks
- kg_vec_entities: sqlite-vec virtual table for semantic entity search (1024 dims)
- kg_entities_fts: FTS5 full-text search on entity names/metadata with sync triggers

Also adds:
- source_project_id column on chunks table (critical issue #3 from plan)
- KG CRUD methods on VectorStore (upsert_entity, add_relation, link_entity_chunk,
  search_entities, search_entities_semantic, kg_stats, etc.)
- Fix brain_search youtube bug: skip project auto-scope for non-claude_code sources
  (youtube, whatsapp, telegram) since those chunks have null/different project values

40 new tests covering schema creation, CRUD, FTS5 triggers, vector search,
idempotency, and source filter routing. 299 total tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 25, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 058262d and 0328b35.

📒 Files selected for processing (2)
  • src/brainlayer/vector_store.py
  • tests/test_kg_schema.py

📝 Walkthrough

Walkthrough

Introduces a Knowledge Graph subsystem into the VectorStore (schema, FTS5 and vector-backed entity search, CRUD APIs, and KG-chunk linking) and tightens auto-scope logic in brain_search to disable auto-scoping for non-claude_code sources (youtube, whatsapp, telegram, all).

Changes

Cohort / File(s) Summary
Brain Search Auto-Scoping
src/brainlayer/mcp/__init__.py
Tightened auto-scope: auto-scoping now applies only when project is None AND source is not in ("youtube", "whatsapp", "telegram", "all"), avoiding unintended zero-result source-based searches.
Knowledge Graph Subsystem
src/brainlayer/vector_store.py
Added KG schema tables/indexes (kg_entities, kg_relations, kg_entity_chunks, kg_vec_entities, kg_entities_fts), synchronization triggers, timezone-aware timestamps, added source_project_id on chunks, and nine KG public methods (upsert_entity, add_relation, link_entity_chunk, get_entity, get_entity_by_name, get_entity_relations, get_entity_chunks, search_entities, search_entities_semantic, kg_stats).
KG Tests
tests/test_kg_schema.py
Comprehensive tests for KG schema creation, entity/relation CRUD, entity-chunk linking, FTS triggers and searches, semantic/vector search, and KG statistics; also validates the YouTube source filter fix and chunks schema changes.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant VectorStore
  participant EmbeddingService
  participant DB

  Client->>VectorStore: search_entities_semantic(query or embedding)
  VectorStore->>EmbeddingService: embed(query)        %% if query text provided
  EmbeddingService-->>VectorStore: embedding
  VectorStore->>DB: query kg_vec_entities (ANN) with embedding + filters
  DB-->>VectorStore: matching entity_ids + scores
  VectorStore->>DB: fetch kg_entities (FTS fallback / metadata)
  DB-->>VectorStore: entity records
  VectorStore-->>Client: aggregated entity results (vectors + metadata)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I hopped through tables, rows in tow,
Entities bloom where vectors flow,
I fixed the scope by streams of light,
FTS and links now hum just right,
A tiny hop — the graph grows bright 🌱

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures all three main components of the changeset: KG schema introduction, entity table addition, and the YouTube source bug fix.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/phase-1-kg-schema

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/vector_store.py`:
- Around line 2289-2312: search_entities_semantic currently passes the requested
limit as the vec0 "k" which is applied before the WHERE entity_type filter, so
when entity_type is set you can get fewer than limit results; update the
function (search_entities_semantic) to over-fetch and then slice: compute a
fetch_k (e.g. min(max(limit * 3, limit + 50), some_upper_bound)) and use fetch_k
in place of k when binding to the SQL (still using query_bytes and the same JOIN
on kg_vec_entities v / kg_entities e), execute the query, then post-filter/slice
the returned rows to return at most limit entries matching entity_type; ensure
you bound fetch_k to avoid unbounded scans and preserve the existing behavior
when entity_type is None by using limit directly.
- Around line 2049-2072: The add_relation function currently returns the
caller-supplied relation_id even when the INSERT ON CONFLICT update keeps the
existing row id; fix by retrieving and returning the actual stored id after the
upsert (e.g., SELECT id FROM kg_relations WHERE source_id=? AND target_id=? AND
relation_type=?), so add_relation returns that fetched id instead of the input
relation_id; update references to relation_id, source_id, target_id,
relation_type and the kg_relations table to perform the select after the execute
and return the selected id.
- Around line 2016-2047: The current upsert_entity returns the caller-provided
entity_id even when ON CONFLICT causes the existing row (with a different id) to
be kept; change upsert_entity so that after executing the INSERT ... ON
CONFLICT(...) DO UPDATE statement you query the actual stored id (e.g. SELECT id
FROM kg_entities WHERE entity_type = ? AND name = ?) to obtain the real entity
ID, use that stored_id when writing the embedding into kg_vec_entities (replace
uses of the original entity_id for the embedding INSERT/REPLACE), and return the
stored_id instead of the caller's id; keep all other metadata/embedding handling
the same and ensure you use the same entity_type/name parameters to locate the
row.
- Around line 347-428: The DB schema enables foreign key constraints on
kg_relations and kg_entity_chunks but SQLite’s foreign key enforcement is off by
default; in _init_db enable enforcement by executing PRAGMA foreign_keys = ON
(use cursor.execute("PRAGMA foreign_keys = ON") immediately after obtaining the
cursor/connection in _init_db) so the REFERENCES on kg_relations and
kg_entity_chunks are actually enforced; if you intentionally don’t want
enforcement, remove the REFERENCES clauses from kg_relations and
kg_entity_chunks instead.

In `@tests/test_kg_schema.py`:
- Around line 118-125: Update the test_upsert_entity_updates_on_conflict to
verify that on conflict the original entity ID is preserved and that
upsert_entity returns the stored ID: after the second upsert_entity call
retrieve the entity via get_entity_by_name and assert entity["id"] == "person-1"
(original ID preserved), and then call upsert_entity again (e.g., with
"person-3") and assert the returned_id equals entity["id"]; this will detect the
stale-ID bug in upsert_entity in vector_store.py where the method may return the
new candidate ID instead of the existing stored ID.
- Around line 358-377: In test_fts_trigger_updates_on_entity_update add an
assertion to verify the stale FTS entry was removed: after calling old_results =
store.search_entities("Old Name") assert len(old_results) == 0; this ensures the
UPDATE to kg_entities (done via direct SQL in the test) actually caused the FTS
triggers to remove the old entry—update the test function
test_fts_trigger_updates_on_entity_update accordingly, referencing old_results
and using store.search_entities to perform the check.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96eb5df and 058262d.

📒 Files selected for processing (3)
  • src/brainlayer/mcp/__init__.py
  • src/brainlayer/vector_store.py
  • tests/test_kg_schema.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Cursor Bugbot
🧰 Additional context used
📓 Path-based instructions (4)
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use ruff check src/ for linting and ruff format src/ for code formatting

Files:

  • src/brainlayer/mcp/__init__.py
  • src/brainlayer/vector_store.py
src/brainlayer/mcp/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names

Files:

  • src/brainlayer/mcp/__init__.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Run tests using pytest from the project root

Files:

  • tests/test_kg_schema.py
src/brainlayer/vector_store.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use sqlite-vec with APSW for vector storage, WAL mode, and PRAGMA busy_timeout = 5000 for concurrent multi-process safety

Files:

  • src/brainlayer/vector_store.py
🧠 Learnings (1)
📚 Learning: 2026-02-23T16:51:38.317Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/mcp/**/*.py : Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names

Applied to files:

  • src/brainlayer/mcp/__init__.py
🧬 Code graph analysis (1)
src/brainlayer/vector_store.py (2)
src/brainlayer/clustering.py (1)
  • serialize_f32 (53-60)
scripts/reembed_bge_m3.py (1)
  • serialize_f32 (39-41)
🔇 Additional comments (5)
src/brainlayer/vector_store.py (1)

342-345: LGTM — source_project_id migration is clean.

Correctly checks existing_cols (populated at Line 115) before altering, and creates the index conditionally. Consistent with the existing migration pattern.

src/brainlayer/mcp/__init__.py (1)

1008-1017: Clean and well-documented bugfix for source-aware auto-scoping.

The negative check source not in ("youtube", "whatsapp", "telegram", "all") correctly preserves auto-scoping for the default case (source=None → claude_code) and explicit source="claude_code", while disabling it for sources whose chunks lack project values. The inline comment clearly explains the root cause.

tests/test_kg_schema.py (3)

22-39: LGTM — Clean fixtures with deterministic embeddings.

The tmp_path-based store ensures test isolation, and the deterministic embedding generator using character ordinals provides reproducible vector search results. Both fixtures properly handle setup/teardown.


45-92: LGTM — Thorough schema validation.

Covers all 5 KG tables plus the source_project_id migration, including exact column set assertions. Good use of set comparison for column validation.


490-521: LGTM — Valuable idempotency and persistence tests.

Tests confirm _init_db is safe to re-run and that KG data survives across VectorStore instances. Good coverage for migration safety.

Comment thread src/brainlayer/vector_store.py
Comment thread src/brainlayer/vector_store.py Outdated
Comment thread src/brainlayer/vector_store.py Outdated
Comment thread src/brainlayer/vector_store.py
Comment thread tests/test_kg_schema.py
Comment thread tests/test_kg_schema.py Outdated
Comment thread src/brainlayer/vector_store.py Outdated
- upsert_entity: return actual stored ID on conflict (not caller's ID)
- add_relation: same stale-ID fix
- search_entities_semantic: over-fetch with entity_type filter (vec0
  applies k before WHERE, so filtered results could be fewer than limit)
- Remove dead REFERENCES clauses from kg_relations and kg_entity_chunks
  (SQLite FK enforcement is off by default, not worth enabling)
- Test: assert original ID preserved on upsert conflict
- Test: fix FTS trigger test — use unique terms to avoid OR match overlap

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@EtanHey EtanHey merged commit f7a8e17 into main Feb 25, 2026
5 checks passed
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

# Auto-scope project from CWD if not provided — but ONLY for claude_code source.
# Non-claude_code sources (youtube, whatsapp, etc.) have null/different project values,
# so auto-scoping filters them out entirely (bug: brain_search(source="youtube") → 0 results).
if project is None and source not in ("youtube", "whatsapp", "telegram", "all"):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fragile negative source list breaks future source types

Medium Severity

The project auto-scope bypass uses a negative list (source not in ("youtube", "whatsapp", "telegram", "all")) instead of a positive list. The PR description states the intent as "skip project auto-scope for non-claude_code sources," but the implementation only skips for three hardcoded sources plus "all". Any future non-claude_code source (or existing ones like "claude_desktop") that has null/different project values would still get auto-scoped, silently returning 0 results — the exact same bug this fix aims to prevent. Using source in (None, "claude_code") would match the stated intent and default to the safe behavior (no auto-scoping) for unknown sources.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant