
feat: promote concept tags into KG entities#199

Merged
EtanHey merged 2 commits into main from feat/concept-entity-extraction on Apr 4, 2026

Conversation

@EtanHey (Owner) commented Apr 4, 2026

Summary

  • add a reusable tag-promotion pipeline that promotes high-frequency non-activity tags from chunk_tags into KG entities with conservative type heuristics
  • seed the new entity hierarchy types needed for concept-level extraction: topic, protocol, community, health_metric, workflow, and device
  • add a small CLI script for running tag promotion and regression tests for candidate filtering, type classification, entity creation, and chunk linking

Test plan

  • python3 -m pytest tests/test_tag_entity_promotion.py -v
  • python3 -m pytest tests/test_tag_entity_promotion.py tests/test_kg_standard.py -q
  • python3 -m pytest (full suite; not run to completion in this turn)

Note

Promote high-frequency concept tags into knowledge graph entities

  • Adds tag_entity_promotion.py with a pipeline that queries chunk_tags for high-frequency tags, filters out tags already in kg_entities and activity-like tags, then creates new KG entities with deterministic IDs and links them to their source chunks.
  • Tag entity type is inferred via heuristics mapping tags to person, technology, community, or topic.
  • Adds a CLI script promote_tag_entities.py with --min-count, --limit, and --dry-run flags; it prints promotion statistics as JSON.
  • Extends VectorStore._init_db to seed new entity type hierarchy entries: topic, protocol, community, health_metric, workflow, and device.
  • Behavioral Change: DB initialization now inserts additional rows into entity_type_hierarchy on every startup (via INSERT OR IGNORE).
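The conservative type heuristics are only named in this description, not quoted. A minimal sketch of what such a classifier could look like follows; the function name mirrors classify_tag_entity_type from the walkthrough, but the hint sets and the person-name guess are illustrative assumptions, not the shipped logic:

```python
# Hedged sketch of a conservative tag-to-type heuristic (assumed, not quoted
# from the PR): known technology names, community keywords, a crude
# person-name guess, and a "topic" fallback.

TECH_HINTS = {"python", "sqlite", "telegram", "docker"}
COMMUNITY_HINTS = ("community", "group", "forum")

def classify_tag_entity_type(tag: str) -> str:
    """Map a raw tag to person/technology/community, defaulting to topic."""
    cleaned = tag.strip()
    lowered = cleaned.lower()
    if lowered in TECH_HINTS:
        return "technology"
    if any(hint in lowered for hint in COMMUNITY_HINTS):
        return "community"
    if " " in cleaned and cleaned.istitle():
        # Crude person guess: multi-word Title Case, e.g. "Jane Doe"
        return "person"
    return "topic"
```

The conservative part is the fallback: anything not positively matched becomes a low-stakes "topic" entity rather than a riskier person/technology guess.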

Macroscope summarized f2d1a35.

Summary by CodeRabbit

  • New Features

    • Added command-line tool to automatically promote frequently-used tags into knowledge-graph entities with configurable thresholds
    • Extended entity type hierarchy with new types: topic, protocol, health metric, workflow, community, and device
  • Tests

    • Added comprehensive test coverage for tag-to-entity promotion pipeline

@greptile-apps Bot left a comment

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey (Owner, Author) commented Apr 4, 2026

@coderabbitai review

@EtanHey (Owner, Author) commented Apr 4, 2026

@codex review

@EtanHey (Owner, Author) commented Apr 4, 2026

@cursor @BugBot review

@cursor Bot commented Apr 4, 2026

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

@coderabbitai Bot commented Apr 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai Bot commented Apr 4, 2026

📝 Walkthrough
This change introduces a tag-to-entity promotion pipeline that discovers high-frequency concept tags not yet in the knowledge graph and promotes them into entities. It includes a CLI script, core pipeline logic with tag classification, candidate discovery, and entity creation, updated database schema seeds, and comprehensive test coverage.

Changes

Cohort / File(s) Summary
CLI Entry Point
scripts/promote_tag_entities.py
New executable script providing command-line interface for tag promotion with --min-count, --limit, and --dry-run arguments. Initializes VectorStore, invokes promotion pipeline, and outputs JSON statistics.
Promotion Pipeline
src/brainlayer/pipeline/tag_entity_promotion.py
Core promotion logic: tag classification (person, technology, community, topic), candidate discovery from chunk_tags excluding already-promoted tags, and entity creation with deterministic IDs (auto-tag-<slug>). Includes conditional mention_type column handling based on schema inspection.
Database Schema
src/brainlayer/vector_store.py
Added entity type hierarchy seed data extending taxonomy with topic, protocol, health_metric, workflow, community, and device types.
Test Coverage
tests/test_tag_entity_promotion.py
Comprehensive tests validating tag classification mappings, candidate filtering logic, entity creation with correct IDs and types, and kg_entity_chunks link insertion with proper mention_type values.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as promote_tag_entities.py
    participant Store as VectorStore
    participant DB as SQLite DB
    participant Pipeline as tag_entity_promotion

    CLI->>Store: Initialize VectorStore
    Store->>DB: Connect & seed entity_type_hierarchy
    Store-->>CLI: Ready

    CLI->>Pipeline: promote_tag_entities(store, min_count, limit, dry_run)
    
    Pipeline->>Pipeline: find_promotion_candidates(store, min_count, limit)
    Pipeline->>DB: Query chunk_tags for unlinked tags
    DB-->>Pipeline: Tag frequency results
    Pipeline->>Pipeline: Filter & classify each tag
    Pipeline-->>Pipeline: Return candidates with entity_type

    alt not dry_run
        Pipeline->>DB: Check schema for mention_type column
        DB-->>Pipeline: Schema info
        
        loop For each candidate
            Pipeline->>Store: upsert_entity(auto-tag-<slug>)
            Store->>DB: Insert/get kg_entities entry
            DB-->>Store: Entity ID
            Store-->>Pipeline: Entity confirmed
            
            Pipeline->>DB: INSERT kg_entity_chunks with mention_type
            DB-->>Pipeline: Links created count
            Pipeline->>Pipeline: Update stats
        end
    end

    Pipeline-->>CLI: Return promotion stats (JSON)
    CLI->>Store: Close VectorStore
    CLI->>DB: Disconnect
    CLI-->>User: Print stats, exit 0

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

🐰 Hops through the tags with glee,
Where concepts bloom into entities!
High-frequency friends get their throne,
In graphs they call home, fully grown. 🌱✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: coverage is 33.33%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2)
  • Title check ✅ Passed: the title accurately summarizes the main change, promoting concept tags into KG entities, which is the core functionality added across the new modules and CLI script.
  • Description check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.


@chatgpt-codex-connector Bot commented Apr 4, 2026
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Comment thread on src/brainlayer/pipeline/tag_entity_promotion.py (outdated)
@EtanHey (Owner, Author) commented Apr 4, 2026

@coderabbitai review

@coderabbitai Bot commented Apr 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/promote_tag_entities.py`:
- Around line 19-28: The script opens the canonical DB implicitly which can
accidentally write to the wrong database; add an explicit CLI option (e.g.,
--db-path) to argparse and use that value when constructing
VectorStore(get_db_path()) (or replace get_db_path() with the provided arg), and
require explicit confirmation for non-dry runs (either a --confirm flag or an
interactive prompt) before calling promote_tag_entities and performing writes;
ensure the code path that performs writes checks args.dry_run and the
confirmation flag/response before proceeding so accidental writes are prevented.

In `@src/brainlayer/pipeline/tag_entity_promotion.py`:
- Around line 132-146: The SQL uses raw ct.tag which leads to split counts and
missed joins for casing/whitespace variants; update all queries (the
SELECT/GROUP BY/ORDER BY block that defines query and any later link queries) to
normalize keys using lower(trim(ct.tag)) (alias it e.g. norm_tag) everywhere you
reference ct.tag, use the same normalized expression in the JOIN condition
against lower(e.name), in NOT IN placeholders convert the excluded tags to
lower(trim(...)), and GROUP BY the normalized expression so counts and
subsequent linking (the code that inserts/promotes using ct.tag) operate on the
normalized tag values consistently.
- Around line 181-223: The loop that upserts entities and inserts into
kg_entity_chunks (uses cursor, store.conn, upsert_entity, kg_entity_chunks,
chunk_tags, stats) must be protected with exclusive-write orchestration and
SQLITE_BUSY retries: wrap the entire promotion sequence for all candidates in a
single exclusive write transaction (e.g., acquire a DB-level exclusive lock via
BEGIN EXCLUSIVE or an application-level one-writer mutex) so only one promoter
runs at a time, perform a WAL checkpoint before taking the exclusive lock, and
implement retry/backoff on SQLITE_BUSY (or set busy_timeout) for transient
contention; ensure you commit or rollback the exclusive transaction and release
the lock in a finally block so partial writes don’t occur and stats remain
consistent.

In `@tests/test_tag_entity_promotion.py`:
- Around line 47-57: Update the
test_find_candidates_skips_existing_and_activity_tags test (and similar tests
between lines 60-91) to include mixed-case tag variants (e.g., insert chunks
with ["Telegram", "telegram", "existing-topic"]) using the existing helper
_insert_chunk_with_tags so tags are created with differing cases, then call
find_promotion_candidates(store, min_count=2) and assert that the result
normalizes case by returning a single candidate "telegram" (lowercased) and that
existing entities inserted via store.upsert_entity("existing-topic", "topic",
"existing-topic") are still skipped; this ensures find_promotion_candidates and
the chunk/link logic are case-normalized end-to-end.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d22d7fd4-60c1-48e0-b866-3e437f31b05c

📥 Commits

Reviewing files that changed from the base of the PR and between 58ea11c and abdbbb2.

📒 Files selected for processing (4)
  • scripts/promote_tag_entities.py
  • src/brainlayer/pipeline/tag_entity_promotion.py
  • src/brainlayer/vector_store.py
  • tests/test_tag_entity_promotion.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Macroscope - Correctness Check
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • src/brainlayer/vector_store.py
  • scripts/promote_tag_entities.py
  • tests/test_tag_entity_promotion.py
  • src/brainlayer/pipeline/tag_entity_promotion.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in src/brainlayer/
All scripts and CLI must use paths.py:get_db_path() for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on chunks table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection
Use ruff check src/ && ruff format src/ for linting and formatting

Files:

  • src/brainlayer/vector_store.py
  • src/brainlayer/pipeline/tag_entity_promotion.py
src/brainlayer/vector_store.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use sqlite-vec storage via APSW in vector_store.py for vector operations

Files:

  • src/brainlayer/vector_store.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest for testing

Files:

  • tests/test_tag_entity_promotion.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Build extraction, classification, chunking, embedding, and indexing pipeline with post-processing for enrichment, brain graph, and Obsidian export

Applied to files:

  • src/brainlayer/pipeline/tag_entity_promotion.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Implement chunk lifecycle management with columns `superseded_by`, `aggregated_into`, `archived_at`; default search excludes lifecycle-managed chunks

Applied to files:

  • src/brainlayer/pipeline/tag_entity_promotion.py
📚 Learning: 2026-03-17T01:04:22.497Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-03-17T01:04:22.497Z
Learning: Applies to src/brainlayer/mcp/**/*.py and brain-bar/Sources/BrainBar/MCPRouter.swift: The 8 required MCP tools are `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_tags`. `brain_tags` is the 8th tool, replacing `brain_get_person`, as defined in the Phase B spec merged in PR `#72`. The Python MCP server already implements `brain_tags`. Legacy `brainlayer_*` aliases must be maintained for backward compatibility.

Applied to files:

  • src/brainlayer/pipeline/tag_entity_promotion.py
🔇 Additional comments (2)
src/brainlayer/vector_store.py (1)

630-635: Type hierarchy seed additions look consistent with promotion tests.

The new entity_type_hierarchy seed rows are coherent and match the expected child_type -> parent_type assertions used by the new regression test coverage.

tests/test_tag_entity_promotion.py (1)

92-108: Hierarchy seed regression assertion is solid.

This test correctly locks in the new taxonomy edges and protects future seed regressions.

Comment on lines +19 to +28
    parser = argparse.ArgumentParser(description="Promote high-frequency chunk tags into KG entities")
    parser.add_argument("--min-count", type=int, default=500, help="Minimum tagged chunk count to promote")
    parser.add_argument("--limit", type=int, default=None, help="Optional candidate limit")
    parser.add_argument("--dry-run", action="store_true", help="Show candidates without writing")
    args = parser.parse_args()

    store = None
    try:
        store = VectorStore(get_db_path())
        stats = promote_tag_entities(

⚠️ Potential issue | 🟠 Major

Add explicit DB-target and write confirmation for non-dry runs.

Line [27] currently opens the canonical DB implicitly. A mistaken shell/env can write to the wrong database with no guard.

Proposed hardening
 def main() -> int:
     parser = argparse.ArgumentParser(description="Promote high-frequency chunk tags into KG entities")
+    parser.add_argument("--db-path", type=Path, default=None, help="Override DB path (defaults to get_db_path())")
     parser.add_argument("--min-count", type=int, default=500, help="Minimum tagged chunk count to promote")
     parser.add_argument("--limit", type=int, default=None, help="Optional candidate limit")
     parser.add_argument("--dry-run", action="store_true", help="Show candidates without writing")
+    parser.add_argument("--yes", action="store_true", help="Confirm writes when not using --dry-run")
     args = parser.parse_args()

+    if not args.dry_run and not args.yes:
+        parser.error("Refusing to write without --yes (or use --dry-run).")
+
+    db_path = args.db_path or get_db_path()
     store = None
     try:
-        store = VectorStore(get_db_path())
+        store = VectorStore(db_path)
         stats = promote_tag_entities(
             store,
             min_count=args.min_count,
             limit=args.limit,
             dry_run=args.dry_run,
         )

Also applies to: 34-35


Comment on lines +132 to +146
query = f"""
SELECT ct.tag, COUNT(*) as cnt
FROM chunk_tags ct
LEFT JOIN kg_entities e ON lower(e.name) = lower(ct.tag)
WHERE e.id IS NULL
AND ct.tag IS NOT NULL
AND ct.tag != ''
AND ct.tag NOT LIKE 'act:%'
AND ct.tag NOT LIKE 'dom:%'
AND ct.tag NOT LIKE 'meta/%'
AND lower(ct.tag) NOT IN ({placeholders})
GROUP BY ct.tag
HAVING COUNT(*) >= ?
ORDER BY cnt DESC, ct.tag ASC
"""

⚠️ Potential issue | 🟠 Major

Normalize tag keys in SQL to prevent split counts and missed links.

Line [133]-Line [146] aggregates by raw ct.tag, and Line [209]/Line [219] links by exact tag equality. Mixed-case/whitespace variants can be undercounted or unlinked.

Proposed normalization fix
-    query = f"""
-        SELECT ct.tag, COUNT(*) as cnt
+    query = f"""
+        SELECT lower(trim(ct.tag)) AS normalized_tag, COUNT(*) as cnt
         FROM chunk_tags ct
-        LEFT JOIN kg_entities e ON lower(e.name) = lower(ct.tag)
+        LEFT JOIN kg_entities e ON lower(e.name) = lower(trim(ct.tag))
         WHERE e.id IS NULL
           AND ct.tag IS NOT NULL
           AND ct.tag != ''
           AND ct.tag NOT LIKE 'act:%'
           AND ct.tag NOT LIKE 'dom:%'
           AND ct.tag NOT LIKE 'meta/%'
-          AND lower(ct.tag) NOT IN ({placeholders})
-        GROUP BY ct.tag
+          AND lower(trim(ct.tag)) NOT IN ({placeholders})
+        GROUP BY normalized_tag
         HAVING COUNT(*) >= ?
-        ORDER BY cnt DESC, ct.tag ASC
+        ORDER BY cnt DESC, normalized_tag ASC
     """
@@
-            "tag": row[0],
+            "tag": row[0],
             "count": row[1],
             "entity_type": classify_tag_entity_type(row[0]),
@@
-                WHERE ct.tag = ?
+                WHERE lower(trim(ct.tag)) = ?
                 """,
-                (entity_id, tag),
+                (entity_id, tag.lower().strip()),
             )
@@
-                WHERE ct.tag = ?
+                WHERE lower(trim(ct.tag)) = ?
                 """,
-                (entity_id, tag),
+                (entity_id, tag.lower().strip()),
             )

Also applies to: 153-161, 203-222

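The effect of the suggested lower(trim(...)) normalization is easy to check against a throwaway in-memory table (the two-column chunk_tags shape here is simplified for the demo and is not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunk_tags (chunk_id TEXT, tag TEXT)")
conn.executemany(
    "INSERT INTO chunk_tags VALUES (?, ?)",
    [("c1", "Telegram"), ("c2", "telegram"), ("c3", " telegram ")],
)
rows = conn.execute(
    """
    SELECT lower(trim(tag)) AS norm_tag, COUNT(*) AS cnt
    FROM chunk_tags
    GROUP BY norm_tag
    """
).fetchall()
# Without normalization this would be three groups of one; with it, one group.
assert rows == [("telegram", 3)]
```

The same normalized expression has to be used in the later link queries too, otherwise the candidate counts and the chunk links would disagree on which rows belong to a tag.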

Comment on lines +181 to +223
    cursor = store.conn.cursor()
    kg_entity_chunk_cols = {row[1] for row in cursor.execute("PRAGMA table_info(kg_entity_chunks)")}
    has_mention_type = "mention_type" in kg_entity_chunk_cols

    for candidate in candidates:
        tag = candidate["tag"]
        entity_type = candidate["entity_type"]
        entity_id = f"auto-tag-{_slugify_tag(tag)}"
        existing = store.get_entity_by_name(entity_type, tag)
        if existing is None:
            store.upsert_entity(
                entity_id,
                entity_type,
                tag,
                metadata={"source": "tag-promotion", "tag_count": candidate["count"]},
                confidence=0.8,
                importance=0.6,
            )
            stats["entities_created"] += 1
        else:
            entity_id = existing["id"]

        if has_mention_type:
            cursor.execute(
                """
                INSERT OR IGNORE INTO kg_entity_chunks (entity_id, chunk_id, relevance, context, mention_type)
                SELECT ?, ct.chunk_id, 0.8, 'tag-promotion', 'tag'
                FROM chunk_tags ct
                WHERE ct.tag = ?
                """,
                (entity_id, tag),
            )
        else:
            cursor.execute(
                """
                INSERT OR IGNORE INTO kg_entity_chunks (entity_id, chunk_id, relevance, context)
                SELECT ?, ct.chunk_id, 0.8, 'tag-promotion'
                FROM chunk_tags ct
                WHERE ct.tag = ?
                """,
                (entity_id, tag),
            )
        stats["links_created"] += store.conn.changes()

⚠️ Potential issue | 🟠 Major

Protect promotion writes with exclusive-write orchestration and BUSY retries.

This function performs bulk writes (kg_entities, kg_entity_chunks) without explicit one-writer coordination. Under concurrent enrichment/MCP writes, this can fail partially or contend heavily.

As per coding guidelines: "Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work" and "Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first."


Comment on lines +47 to +57
def test_find_candidates_skips_existing_and_activity_tags(self, store):
from brainlayer.pipeline.tag_entity_promotion import find_promotion_candidates

_insert_chunk_with_tags(store, "chunk-1", ["telegram", "debugging", "existing-topic"])
_insert_chunk_with_tags(store, "chunk-2", ["telegram", "debugging", "existing-topic"])
store.upsert_entity("existing-topic", "topic", "existing-topic")

candidates = find_promotion_candidates(store, min_count=2)

assert [candidate["tag"] for candidate in candidates] == ["telegram"]


🧹 Nitpick | 🔵 Trivial

Add a mixed-case tag regression test.

Current scenarios only use lowercase tags. Please add coverage for variants like ["Telegram", "telegram"] to ensure candidate counting and chunk linking are case-normalized end-to-end.

Also applies to: 60-91


@EtanHey EtanHey merged commit 4d7573d into main Apr 4, 2026
5 of 6 checks passed