feat: promote concept tags into KG entities #199
Conversation
@coderabbitai review
@codex review
✅ Actions performed: Review triggered.
📝 Walkthrough

This change introduces a tag-to-entity promotion pipeline that discovers high-frequency concept tags not yet in the knowledge graph and promotes them into entities. It includes a CLI script, core pipeline logic with tag classification, candidate discovery, and entity creation, updated database schema seeds, and comprehensive test coverage.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as promote_tag_entities.py
    participant Store as VectorStore
    participant DB as SQLite DB
    participant Pipeline as tag_entity_promotion
    CLI->>Store: Initialize VectorStore
    Store->>DB: Connect & seed entity_type_hierarchy
    Store-->>CLI: Ready
    CLI->>Pipeline: promote_tag_entities(store, min_count, limit, dry_run)
    Pipeline->>Pipeline: find_promotion_candidates(store, min_count, limit)
    Pipeline->>DB: Query chunk_tags for unlinked tags
    DB-->>Pipeline: Tag frequency results
    Pipeline->>Pipeline: Filter & classify each tag
    Pipeline-->>Pipeline: Return candidates with entity_type
    alt not dry_run
        Pipeline->>DB: Check schema for mention_type column
        DB-->>Pipeline: Schema info
        loop For each candidate
            Pipeline->>Store: upsert_entity(auto-tag-<slug>)
            Store->>DB: Insert/get kg_entities entry
            DB-->>Store: Entity ID
            Store-->>Pipeline: Entity confirmed
            Pipeline->>DB: INSERT kg_entity_chunks with mention_type
            DB-->>Pipeline: Links created count
            Pipeline->>Pipeline: Update stats
        end
    end
    Pipeline-->>CLI: Return promotion stats (JSON)
    CLI->>Store: Close VectorStore
    CLI->>DB: Disconnect
    CLI-->>User: Print stats, exit 0
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/promote_tag_entities.py`:
- Around line 19-28: The script opens the canonical DB implicitly which can
accidentally write to the wrong database; add an explicit CLI option (e.g.,
--db-path) to argparse and use that value when constructing
VectorStore(get_db_path()) (or replace get_db_path() with the provided arg), and
require explicit confirmation for non-dry runs (either a --confirm flag or an
interactive prompt) before calling promote_tag_entities and performing writes;
ensure the code path that performs writes checks args.dry_run and the
confirmation flag/response before proceeding so accidental writes are prevented.
In `@src/brainlayer/pipeline/tag_entity_promotion.py`:
- Around line 132-146: The SQL uses raw ct.tag which leads to split counts and
missed joins for casing/whitespace variants; update all queries (the
SELECT/GROUP BY/ORDER BY block that defines query and any later link queries) to
normalize keys using lower(trim(ct.tag)) (alias it e.g. norm_tag) everywhere you
reference ct.tag, use the same normalized expression in the JOIN condition
against lower(e.name), in NOT IN placeholders convert the excluded tags to
lower(trim(...)), and GROUP BY the normalized expression so counts and
subsequent linking (the code that inserts/promotes using ct.tag) operate on the
normalized tag values consistently.
- Around line 181-223: The loop that upserts entities and inserts into
kg_entity_chunks (uses cursor, store.conn, upsert_entity, kg_entity_chunks,
chunk_tags, stats) must be protected with exclusive-write orchestration and
SQLITE_BUSY retries: wrap the entire promotion sequence for all candidates in a
single exclusive write transaction (e.g., acquire a DB-level exclusive lock via
BEGIN EXCLUSIVE or an application-level one-writer mutex) so only one promoter
runs at a time, perform a WAL checkpoint before taking the exclusive lock, and
implement retry/backoff on SQLITE_BUSY (or set busy_timeout) for transient
contention; ensure you commit or rollback the exclusive transaction and release
the lock in a finally block so partial writes don’t occur and stats remain
consistent.
In `@tests/test_tag_entity_promotion.py`:
- Around line 47-57: Update the
test_find_candidates_skips_existing_and_activity_tags test (and similar tests
between lines 60-91) to include mixed-case tag variants (e.g., insert chunks
with ["Telegram", "telegram", "existing-topic"]) using the existing helper
_insert_chunk_with_tags so tags are created with differing cases, then call
find_promotion_candidates(store, min_count=2) and assert that the result
normalizes case by returning a single candidate "telegram" (lowercased) and that
existing entities inserted via store.upsert_entity("existing-topic", "topic",
"existing-topic") are still skipped; this ensures find_promotion_candidates and
the chunk/link logic are case-normalized end-to-end.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d22d7fd4-60c1-48e0-b866-3e437f31b05c
📒 Files selected for processing (4)
- scripts/promote_tag_entities.py
- src/brainlayer/pipeline/tag_entity_promotion.py
- src/brainlayer/vector_store.py
- tests/test_tag_entity_promotion.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Macroscope - Correctness Check
- GitHub Check: test (3.12)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
Files:
- src/brainlayer/vector_store.py
- scripts/promote_tag_entities.py
- tests/test_tag_entity_promotion.py
- src/brainlayer/pipeline/tag_entity_promotion.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in `src/brainlayer/`
All scripts and CLI must use `paths.py:get_db_path()` for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns `superseded_by`, `aggregated_into`, `archived_at`; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on the `chunks` table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on `SQLITE_BUSY` errors; each worker must use its own database connection
Use `ruff check src/ && ruff format src/` for linting and formatting
Files:
- src/brainlayer/vector_store.py
- src/brainlayer/pipeline/tag_entity_promotion.py
src/brainlayer/vector_store.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use sqlite-vec storage via APSW in `vector_store.py` for vector operations
Files:
src/brainlayer/vector_store.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use `pytest` for testing
Files:
tests/test_tag_entity_promotion.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export
Applied to files:
src/brainlayer/pipeline/tag_entity_promotion.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Implement chunk lifecycle management with columns `superseded_by`, `aggregated_into`, `archived_at`; default search excludes lifecycle-managed chunks
Applied to files:
src/brainlayer/pipeline/tag_entity_promotion.py
📚 Learning: 2026-03-17T01:04:22.497Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-03-17T01:04:22.497Z
Learning: Applies to src/brainlayer/mcp/**/*.py and brain-bar/Sources/BrainBar/MCPRouter.swift: The 8 required MCP tools are `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_tags`. `brain_tags` is the 8th tool, replacing `brain_get_person`, as defined in the Phase B spec merged in PR `#72`. The Python MCP server already implements `brain_tags`. Legacy `brainlayer_*` aliases must be maintained for backward compatibility.
Applied to files:
src/brainlayer/pipeline/tag_entity_promotion.py
🔇 Additional comments (2)
src/brainlayer/vector_store.py (1)
630-635: Type hierarchy seed additions look consistent with promotion tests.

The new `entity_type_hierarchy` seed rows are coherent and match the expected `child_type -> parent_type` assertions used by the new regression test coverage.

tests/test_tag_entity_promotion.py (1)

92-108: Hierarchy seed regression assertion is solid.

This test correctly locks in the new taxonomy edges and protects future seed regressions.
```python
    parser = argparse.ArgumentParser(description="Promote high-frequency chunk tags into KG entities")
    parser.add_argument("--min-count", type=int, default=500, help="Minimum tagged chunk count to promote")
    parser.add_argument("--limit", type=int, default=None, help="Optional candidate limit")
    parser.add_argument("--dry-run", action="store_true", help="Show candidates without writing")
    args = parser.parse_args()

    store = None
    try:
        store = VectorStore(get_db_path())
        stats = promote_tag_entities(
```
Add explicit DB-target and write confirmation for non-dry runs.
Line 27 currently opens the canonical DB implicitly. A mistaken shell/env can write to the wrong database with no guard.
Proposed hardening
```diff
 def main() -> int:
     parser = argparse.ArgumentParser(description="Promote high-frequency chunk tags into KG entities")
+    parser.add_argument("--db-path", type=Path, default=None, help="Override DB path (defaults to get_db_path())")
     parser.add_argument("--min-count", type=int, default=500, help="Minimum tagged chunk count to promote")
     parser.add_argument("--limit", type=int, default=None, help="Optional candidate limit")
     parser.add_argument("--dry-run", action="store_true", help="Show candidates without writing")
+    parser.add_argument("--yes", action="store_true", help="Confirm writes when not using --dry-run")
     args = parser.parse_args()
+    if not args.dry_run and not args.yes:
+        parser.error("Refusing to write without --yes (or use --dry-run).")
+
+    db_path = args.db_path or get_db_path()
     store = None
     try:
-        store = VectorStore(get_db_path())
+        store = VectorStore(db_path)
         stats = promote_tag_entities(
             store,
             min_count=args.min_count,
             limit=args.limit,
             dry_run=args.dry_run,
         )
```

Also applies to: 34-35
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/promote_tag_entities.py` around lines 19 - 28, The script opens the
canonical DB implicitly which can accidentally write to the wrong database; add
an explicit CLI option (e.g., --db-path) to argparse and use that value when
constructing VectorStore(get_db_path()) (or replace get_db_path() with the
provided arg), and require explicit confirmation for non-dry runs (either a
--confirm flag or an interactive prompt) before calling promote_tag_entities and
performing writes; ensure the code path that performs writes checks args.dry_run
and the confirmation flag/response before proceeding so accidental writes are
prevented.
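The proposed guard can be exercised as a standalone snippet. This is a minimal sketch assuming the flag names from the suggestion above (`--db-path`, `--yes`); the `VectorStore` wiring and actual promotion call are omitted.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Promote high-frequency chunk tags into KG entities")
    parser.add_argument("--db-path", type=Path, default=None)
    parser.add_argument("--min-count", type=int, default=500)
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--yes", action="store_true")
    return parser


def parse_and_guard(argv: list[str]) -> argparse.Namespace:
    """Parse args and refuse non-dry runs that lack explicit confirmation."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.dry_run and not args.yes:
        # parser.error prints usage and raises SystemExit(2)
        parser.error("Refusing to write without --yes (or use --dry-run).")
    return args
```

With this shape, `--dry-run` stays frictionless while any writing invocation must opt in with `--yes`.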
```python
        query = f"""
            SELECT ct.tag, COUNT(*) as cnt
            FROM chunk_tags ct
            LEFT JOIN kg_entities e ON lower(e.name) = lower(ct.tag)
            WHERE e.id IS NULL
              AND ct.tag IS NOT NULL
              AND ct.tag != ''
              AND ct.tag NOT LIKE 'act:%'
              AND ct.tag NOT LIKE 'dom:%'
              AND ct.tag NOT LIKE 'meta/%'
              AND lower(ct.tag) NOT IN ({placeholders})
            GROUP BY ct.tag
            HAVING COUNT(*) >= ?
            ORDER BY cnt DESC, ct.tag ASC
        """
```
Normalize tag keys in SQL to prevent split counts and missed links.
Lines 133-146 aggregate by raw ct.tag, and lines 209/219 link by exact tag equality. Mixed-case/whitespace variants can be undercounted or unlinked.
Proposed normalization fix
```diff
-        query = f"""
-            SELECT ct.tag, COUNT(*) as cnt
+        query = f"""
+            SELECT lower(trim(ct.tag)) AS normalized_tag, COUNT(*) as cnt
             FROM chunk_tags ct
-            LEFT JOIN kg_entities e ON lower(e.name) = lower(ct.tag)
+            LEFT JOIN kg_entities e ON lower(e.name) = lower(trim(ct.tag))
             WHERE e.id IS NULL
               AND ct.tag IS NOT NULL
               AND ct.tag != ''
               AND ct.tag NOT LIKE 'act:%'
               AND ct.tag NOT LIKE 'dom:%'
               AND ct.tag NOT LIKE 'meta/%'
-              AND lower(ct.tag) NOT IN ({placeholders})
-            GROUP BY ct.tag
+              AND lower(trim(ct.tag)) NOT IN ({placeholders})
+            GROUP BY normalized_tag
             HAVING COUNT(*) >= ?
-            ORDER BY cnt DESC, ct.tag ASC
+            ORDER BY cnt DESC, normalized_tag ASC
         """
@@
-            "tag": row[0],
+            "tag": row[0],
             "count": row[1],
             "entity_type": classify_tag_entity_type(row[0]),
@@
-            WHERE ct.tag = ?
+            WHERE lower(trim(ct.tag)) = ?
             """,
-            (entity_id, tag),
+            (entity_id, tag.lower().strip()),
         )
@@
-            WHERE ct.tag = ?
+            WHERE lower(trim(ct.tag)) = ?
             """,
-            (entity_id, tag),
+            (entity_id, tag.lower().strip()),
         )
```

Also applies to: 153-161, 203-222
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/pipeline/tag_entity_promotion.py` around lines 132 - 146, The
SQL uses raw ct.tag which leads to split counts and missed joins for
casing/whitespace variants; update all queries (the SELECT/GROUP BY/ORDER BY
block that defines query and any later link queries) to normalize keys using
lower(trim(ct.tag)) (alias it e.g. norm_tag) everywhere you reference ct.tag,
use the same normalized expression in the JOIN condition against lower(e.name),
in NOT IN placeholders convert the excluded tags to lower(trim(...)), and GROUP
BY the normalized expression so counts and subsequent linking (the code that
inserts/promotes using ct.tag) operate on the normalized tag values
consistently.
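The split-count failure mode is easy to reproduce with the standard-library sqlite3 module (the project itself uses APSW; this is only an illustrative sketch with a minimal table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunk_tags (chunk_id TEXT, tag TEXT)")
conn.executemany(
    "INSERT INTO chunk_tags VALUES (?, ?)",
    [("c1", "Telegram"), ("c2", "telegram"), ("c3", " telegram ")],
)

# Raw grouping treats every case/whitespace variant as a distinct tag.
raw = conn.execute("SELECT tag, COUNT(*) FROM chunk_tags GROUP BY tag").fetchall()

# Normalized grouping merges the variants into a single candidate.
normalized = conn.execute(
    "SELECT lower(trim(tag)) AS norm_tag, COUNT(*) FROM chunk_tags GROUP BY norm_tag"
).fetchall()

print(raw)         # three rows, each with count 1
print(normalized)  # [('telegram', 3)]
```

With a `min_count` threshold of 2 or 3, the raw query would miss this tag entirely while the normalized query promotes it.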
```python
    cursor = store.conn.cursor()
    kg_entity_chunk_cols = {row[1] for row in cursor.execute("PRAGMA table_info(kg_entity_chunks)")}
    has_mention_type = "mention_type" in kg_entity_chunk_cols

    for candidate in candidates:
        tag = candidate["tag"]
        entity_type = candidate["entity_type"]
        entity_id = f"auto-tag-{_slugify_tag(tag)}"
        existing = store.get_entity_by_name(entity_type, tag)
        if existing is None:
            store.upsert_entity(
                entity_id,
                entity_type,
                tag,
                metadata={"source": "tag-promotion", "tag_count": candidate["count"]},
                confidence=0.8,
                importance=0.6,
            )
            stats["entities_created"] += 1
        else:
            entity_id = existing["id"]

        if has_mention_type:
            cursor.execute(
                """
                INSERT OR IGNORE INTO kg_entity_chunks (entity_id, chunk_id, relevance, context, mention_type)
                SELECT ?, ct.chunk_id, 0.8, 'tag-promotion', 'tag'
                FROM chunk_tags ct
                WHERE ct.tag = ?
                """,
                (entity_id, tag),
            )
        else:
            cursor.execute(
                """
                INSERT OR IGNORE INTO kg_entity_chunks (entity_id, chunk_id, relevance, context)
                SELECT ?, ct.chunk_id, 0.8, 'tag-promotion'
                FROM chunk_tags ct
                WHERE ct.tag = ?
                """,
                (entity_id, tag),
            )
        stats["links_created"] += store.conn.changes()
```
Protect promotion writes with exclusive-write orchestration and BUSY retries.
This function performs bulk writes (kg_entities, kg_entity_chunks) without explicit one-writer coordination. Under concurrent enrichment/MCP writes, this can fail partially or contend heavily.
As per coding guidelines: "Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work" and "Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/pipeline/tag_entity_promotion.py` around lines 181 - 223, The
loop that upserts entities and inserts into kg_entity_chunks (uses cursor,
store.conn, upsert_entity, kg_entity_chunks, chunk_tags, stats) must be
protected with exclusive-write orchestration and SQLITE_BUSY retries: wrap the
entire promotion sequence for all candidates in a single exclusive write
transaction (e.g., acquire a DB-level exclusive lock via BEGIN EXCLUSIVE or an
application-level one-writer mutex) so only one promoter runs at a time, perform
a WAL checkpoint before taking the exclusive lock, and implement retry/backoff
on SQLITE_BUSY (or set busy_timeout) for transient contention; ensure you commit
or rollback the exclusive transaction and release the lock in a finally block so
partial writes don’t occur and stats remain consistent.
```python
    def test_find_candidates_skips_existing_and_activity_tags(self, store):
        from brainlayer.pipeline.tag_entity_promotion import find_promotion_candidates

        _insert_chunk_with_tags(store, "chunk-1", ["telegram", "debugging", "existing-topic"])
        _insert_chunk_with_tags(store, "chunk-2", ["telegram", "debugging", "existing-topic"])
        store.upsert_entity("existing-topic", "topic", "existing-topic")

        candidates = find_promotion_candidates(store, min_count=2)

        assert [candidate["tag"] for candidate in candidates] == ["telegram"]
```
🧹 Nitpick | 🔵 Trivial
Add a mixed-case tag regression test.
Current scenarios only use lowercase tags. Please add coverage for variants like ["Telegram", "telegram"] to ensure candidate counting and chunk linking are case-normalized end-to-end.
Also applies to: 60-91
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_tag_entity_promotion.py` around lines 47 - 57, Update the
test_find_candidates_skips_existing_and_activity_tags test (and similar tests
between lines 60-91) to include mixed-case tag variants (e.g., insert chunks
with ["Telegram", "telegram", "existing-topic"]) using the existing helper
_insert_chunk_with_tags so tags are created with differing cases, then call
find_promotion_candidates(store, min_count=2) and assert that the result
normalizes case by returning a single candidate "telegram" (lowercased) and that
existing entities inserted via store.upsert_entity("existing-topic", "topic",
"existing-topic") are still skipped; this ensures find_promotion_candidates and
the chunk/link logic are case-normalized end-to-end.
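A self-contained version of the requested scenario, using plain sqlite3 instead of the PR's `store` fixture and `_insert_chunk_with_tags` helper (table shapes simplified), makes the expected normalization explicit:

```python
import sqlite3


def find_candidates_normalized(conn, min_count):
    """Sketch of a case/whitespace-normalized candidate query."""
    return conn.execute(
        """
        SELECT lower(trim(ct.tag)) AS norm_tag, COUNT(*) AS cnt
        FROM chunk_tags ct
        LEFT JOIN kg_entities e ON lower(e.name) = lower(trim(ct.tag))
        WHERE e.id IS NULL
        GROUP BY norm_tag
        HAVING COUNT(*) >= ?
        ORDER BY cnt DESC, norm_tag ASC
        """,
        (min_count,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunk_tags (chunk_id TEXT, tag TEXT)")
conn.execute("CREATE TABLE kg_entities (id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO chunk_tags VALUES (?, ?)",
    [
        ("chunk-1", "Telegram"), ("chunk-1", "existing-topic"),
        ("chunk-2", "telegram"), ("chunk-2", "existing-topic"),
    ],
)
conn.execute("INSERT INTO kg_entities VALUES ('existing-topic', 'existing-topic')")

candidates = find_candidates_normalized(conn, min_count=2)
# mixed-case variants collapse to a single 'telegram' candidate;
# 'existing-topic' is skipped because it already has an entity
```

The real regression test would assert the same shape through `find_promotion_candidates` against the project fixtures.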
Summary

- Promote high-frequency tags from chunk_tags into KG entities with conservative type heuristics
- Seed new entity types: topic, protocol, community, health_metric, workflow, and device

Test plan

- python3 -m pytest tests/test_tag_entity_promotion.py -v
- python3 -m pytest tests/test_tag_entity_promotion.py tests/test_kg_standard.py -q
- python3 -m pytest (full suite not captured to completion in this turn)

Note
Promote high-frequency concept tags into knowledge graph entities
- Scans `chunk_tags` for high-frequency tags, filters out tags already in `kg_entities` and activity-like tags, then creates new KG entities with deterministic IDs and links them to their source chunks.
- Each tag is classified as `person`, `technology`, `community`, or `topic`.
- Adds a CLI with `--min-count`, `--limit`, and `--dry-run` flags that prints promotion statistics as JSON.
- Updates `VectorStore._init_db` to seed new entity type hierarchy entries: `topic`, `protocol`, `community`, `health_metric`, `workflow`, and `device`.
- Seeds `entity_type_hierarchy` on every startup (via `INSERT OR IGNORE`).

Macroscope summarized f2d1a35.
Summary by CodeRabbit
New Features
Tests