feat: KG rebuild pipeline with audit fixes #67
Conversation
…ety, audit fixes
Rebuild the knowledge graph from scratch. The KG had only 46 entities from 194K
enriched chunks because the extraction hook ran with use_llm=False, use_gliner=False,
and seed_entities={} — all three extraction paths disabled.
Changes:
- Enable seed entity matching + tag-based extraction in enrichment hook
- Add extract_entities_from_tags() for zero-API-call entity extraction
- Add kg_extraction_groq.py for Groq-backed multi-chunk NER batching
- Add scripts/kg_rebuild.py (tier1 seed+tag, tier2 Groq NER)
- Add scripts/kg_dedup.py for entity dedup, alias, false positive cleanup
- Fix thread-safety: per-thread read connections via threading.local()
- Fix mention_type downgrade: explicit mentions preserved over inferred
- Fix close() leaving stale read_conn reference after closing
- Fix non-string tags crashing extract_entities_from_tags
- Fix tag normalization missing dots (node.js → nodejs)
- Update CLAUDE.md with accurate enrichment backend docs
Results: 119 entities (was 46), 153,773 entity-chunk links (was 597).
Tests: 18 new tests in test_kg_rebuild.py, 680 total passing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

The PR introduces KG extraction with tag-based and seed-based approaches, implements a two-tier batch rebuild via Groq NER, adds entity deduplication and cleanup scripts, enables thread-local read connections for concurrent database access, and updates documentation on enrichment backend configuration and bulk DB operation safety.
Sequence Diagram(s)

    sequenceDiagram
        participant Chunk as Enriched Chunk
        participant Pipeline as Enrichment Pipeline
        participant TagExt as Tag Extractor
        participant SeedExt as Seed Extractor
        participant KGStore as KG Store
        Chunk->>Pipeline: enrichment metadata + content
        alt Tag Processing
            Pipeline->>TagExt: tags from enrichment
            TagExt->>TagExt: normalize & match tags
            TagExt->>KGStore: ExtractionResult (0.85/0.80 confidence)
        end
        alt Seed-based Extraction
            Pipeline->>SeedExt: chunk content + DEFAULT_SEED_ENTITIES
            SeedExt->>SeedExt: extract entities from text
            SeedExt->>KGStore: ExtractionResult + relationships
        end
        KGStore->>KGStore: process_extraction_result
        KGStore->>KGStore: link entities to chunk

    sequenceDiagram
        participant Store as VectorStore
        participant Tier1 as Tier 1<br/>Seed & Tags
        participant Tier2 as Tier 2<br/>Groq NER
        participant Groq as Groq API
        participant KGBackend as KG Backend
        Store->>Tier1: Tier 1: extract all chunks
        Tier1->>Tier1: batch seed/tag extraction
        Tier1->>KGBackend: process results
        Store->>Tier2: Tier 2: fetch high-importance chunks
        Tier2->>Tier2: batch chunks (chunks_per_call)
        Tier2->>Tier2: RateLimiter: throttle calls
        Tier2->>Groq: build_multi_chunk_ner_prompt
        Groq-->>Tier2: parse_multi_chunk_response
        Tier2->>KGBackend: entities + relations per chunk
        Tier2->>Tier2: save_progress (resume capable)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
- Fix Tier 2 pagination bug: remove OFFSET (shrinking result set skips chunks)
- Use setbusytimeout(30_000) for read connections (consistent with write conn)
- Log KG extraction failures at WARNING level (was DEBUG, invisible in prod)
- Use _read_cursor() for read-only queries in kg_dedup.py
- Remove unused imports (uuid, os, time, parse_llm_ner_response, resolve_entity)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CLAUDE.md`:
- Around line 68-69: Add a blank line immediately after the Markdown heading "##
Bulk DB Operations (SAFETY)" so the heading is followed by an empty line (to
satisfy MD022); update the CLAUDE.md file by inserting a single blank line
between that heading and the following list item "1. **Stop enrichment workers
first** — never run bulk ops while enrichment is writing (causes WAL bloat +
potential freeze)".
In `@scripts/kg_dedup.py`:
- Around line 124-128: The deletion block removes rows from kg_entity_chunks,
kg_relations, kg_entity_aliases and kg_entities but misses deleting
corresponding embeddings in kg_vec_entities, leaving orphans; update the
deletion sequence (the cursor.execute block that uses eid) to also execute
"DELETE FROM kg_vec_entities WHERE entity_id = ?" (using the same eid parameter)
before/alongside deleting from kg_entities, and ensure this runs inside the same
transaction so vector and KG tables remain consistent.
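
A minimal sketch of the deletion order this finding asks for, assuming all deletes share one write cursor and one transaction; the relation and alias column names (source_id, target_id, entity_id) are illustrative assumptions, not taken from the actual schema:

```python
def purge_entity(cursor, eid) -> None:
    """Delete an entity plus every row that references it, vectors included."""
    cursor.execute("BEGIN")
    try:
        cursor.execute("DELETE FROM kg_entity_chunks WHERE entity_id = ?", (eid,))
        # source_id/target_id are assumed column names for kg_relations
        cursor.execute(
            "DELETE FROM kg_relations WHERE source_id = ? OR target_id = ?", (eid, eid)
        )
        cursor.execute("DELETE FROM kg_entity_aliases WHERE entity_id = ?", (eid,))
        # The extra delete the review asks for: drop the embedding row too,
        # so kg_vec_entities cannot keep orphans after the entity is gone.
        cursor.execute("DELETE FROM kg_vec_entities WHERE entity_id = ?", (eid,))
        cursor.execute("DELETE FROM kg_entities WHERE id = ?", (eid,))
        cursor.execute("COMMIT")
    except Exception:
        cursor.execute("ROLLBACK")
        raise
```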
In `@scripts/kg_rebuild.py`:
- Around line 229-277: Wrap the processing of each chunk_result in a try/except
so one malformed chunk_result cannot abort the entire parsed_results loop:
inside the for chunk_result in parsed_results loop (the block that builds
entities, relations, creates ExtractionResult and calls
process_extraction_result), catch any Exception, log the error with context
(chunk_id and the chunk_result payload) and continue to the next chunk_result;
apply the same per-item try/except pattern to the analogous block that handles
results at the later location referenced (the code that appends to
stats["entities_found"] and stats["relations_found"]) so each chunk is isolated
from failures.
- Around line 64-67: load_progress currently assumes PROGRESS_FILE contains
valid JSON and save_progress writes directly to the final path, so implement
atomic persistence: modify save_progress to write JSON to a temporary file
(e.g., PROGRESS_FILE.with_suffix(".tmp") or similar) and then atomically replace
the final file (os.replace) to avoid truncated files; update load_progress to
catch JSONDecodeError (and FileNotFoundError) and fall back to the default dict
{"tier1_done": False, "tier2_last_offset": 0, "tier2_processed": 0} so a
partial/corrupted file won't break --resume; apply these changes to the
functions named save_progress and load_progress and use the PROGRESS_FILE symbol
to locate the file handling.
- Around line 185-196: The query uses OFFSET pagination which will skip rows as
links are created because the WHERE clause filters out linked chunks; replace
OFFSET-based paging with stable keyset pagination: change the SQL in the query
variable to accept a last-seen cursor (e.g., last_importance and last_id) and
add a WHERE clause like "AND (c.importance < :last_importance OR (c.importance =
:last_importance AND c.id > :last_id))" (matching the existing ORDER BY
c.importance DESC, c.id) and remove OFFSET; update the loop that calls this
query to pass and update the cursor values (track the last row's importance and
id after each page) and repeat until fewer than LIMIT rows are returned; apply
the same change pattern to the other queries noted around the chunks of code at
the locations you mentioned (the queries used at lines 206-209 and 284-289) so
all pagination switches from LIMIT/OFFSET to keyset using last_importance and
last_id.
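
For reference, a sketch of the keyset-paginated query this finding describes, keeping the existing ORDER BY c.importance DESC, c.id; the surrounding loop and the starting sentinel values are assumptions, not the actual kg_rebuild.py code:

```python
KEYSET_QUERY = """
    SELECT c.id, c.content, c.importance
    FROM chunks c
    LEFT JOIN kg_entity_chunks ec ON c.id = ec.chunk_id
    WHERE c.summary IS NOT NULL AND c.summary != ''
      AND c.importance >= 6
      AND ec.chunk_id IS NULL
      AND c.content IS NOT NULL
      AND LENGTH(c.content) > 50
      AND (c.importance < :last_importance
           OR (c.importance = :last_importance AND c.id > :last_id))
    ORDER BY c.importance DESC, c.id
    LIMIT :limit
"""

def iter_candidate_pages(cursor, page_size: int):
    # Start above any real importance value (scale assumed 0-10) and below any id.
    last_importance, last_id = 1_000_000, 0
    while True:
        rows = list(cursor.execute(KEYSET_QUERY, {
            "last_importance": last_importance,
            "last_id": last_id,
            "limit": page_size,
        }))
        if not rows:
            return
        yield rows
        # Advance the keyset cursor past the last row actually seen; unlike OFFSET,
        # this never skips candidates that drop out of the filter as links are created.
        last_importance, last_id = rows[-1][2], rows[-1][0]
```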
- Around line 326-356: Wrap the orchestration from after store creation in a
try/finally so store.close() always runs: create db_path and store with
get_db_path() and VectorStore(...) as before, then put the logic that checks
args.stats, args.tier1/2, calls print_kg_stats, tier1_seed_and_tags,
tier2_groq_ner, and logger.info("Done!") inside a try block and move the single
store.close() into the finally block; remove the multiple early store.close()
calls so the finally always handles cleanup even if an exception occurs.
In `@src/brainlayer/pipeline/entity_extraction.py`:
- Around line 450-481: The loop over tags in entity_extraction.py currently
appends an ExtractedEntity for every matching tag, causing duplicates; fix it by
deduplicating on the normalized tag key (the same key used to lookup
norm_projects/norm_tech) before emitting entities—introduce a seen_norm_tags
set, compute tag_norm the same way (tag.lower().replace("-", "").replace("_",
"").replace(".", "")), skip if in seen_norm_tags, otherwise add to
seen_norm_tags and then append the ExtractedEntity (preserving the existing
project/technology priority and fields such as text, entity_type, start, end,
confidence, source).
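
A minimal sketch of the dedup this finding describes. The norm_projects/norm_tech lookups and the -1 spans are assumptions about the surrounding function, the ExtractedEntity class below is an illustrative stand-in for the real dataclass, and the 0.85/0.80 confidences mirror the values quoted in the walkthrough diagram:

```python
from dataclasses import dataclass

@dataclass
class ExtractedEntity:  # illustrative stand-in for the real dataclass
    text: str
    entity_type: str
    start: int
    end: int
    confidence: float
    source: str

def entities_from_tags(tags, norm_projects: set[str], norm_tech: set[str]) -> list[ExtractedEntity]:
    seen_norm_tags: set[str] = set()
    entities: list[ExtractedEntity] = []
    for tag in tags:
        if not isinstance(tag, str):
            continue  # non-string tags previously crashed extraction
        # Same normalization as the lookup keys: node.js / Node-JS / node_js -> nodejs
        tag_norm = tag.lower().replace("-", "").replace("_", "").replace(".", "")
        if tag_norm in seen_norm_tags:
            continue  # emit each normalized tag once to avoid redundant upserts
        seen_norm_tags.add(tag_norm)
        if tag_norm in norm_projects:  # project match takes priority over technology
            entities.append(ExtractedEntity(tag, "project", -1, -1, 0.85, "tag"))
        elif tag_norm in norm_tech:
            entities.append(ExtractedEntity(tag, "technology", -1, -1, 0.80, "tag"))
    return entities
```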
In `@src/brainlayer/pipeline/kg_extraction_groq.py`:
- Around line 70-80: The loop over parsed.get("chunks", []) assumes each
chunk_data is a dict and will raise if an element is malformed; update the loop
that builds results (the block using chunk_data, chunk_id, entities, relations,
and results.append) to first verify chunk_data is a dict (e.g., isinstance
check), and if not skip it or coerce to an empty dict, then safely call .get for
"chunk_id", "entities", and "relations" with your existing default values so
malformed entries do not abort processing.
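
A small sketch of the guard described above, consistent with the later commit note about skipping non-dict and empty chunk_id entries; extract_chunk_entries is a hypothetical helper name, not the module's actual function:

```python
def extract_chunk_entries(parsed: dict) -> list[dict]:
    """Keep only well-formed chunk entries from a parsed multi-chunk NER reply."""
    results: list[dict] = []
    for chunk_data in parsed.get("chunks", []):
        if not isinstance(chunk_data, dict):
            continue  # malformed element; skip instead of raising mid-batch
        chunk_id = chunk_data.get("chunk_id", "")
        if not chunk_id:
            continue  # nothing to link results back to
        results.append({
            "chunk_id": chunk_id,
            "entities": chunk_data.get("entities", []),
            "relations": chunk_data.get("relations", []),
        })
    return results
```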
In `@src/brainlayer/vector_store.py`:
- Around line 503-519: The close() path currently only closes the current
thread's self._local.read_conn and misses per-thread read connections created by
_get_read_conn; to fix, add a thread-safe registry (e.g., self._read_conns) that
_get_read_conn appends each new apsw.Connection to (use weakref.WeakSet or store
weakrefs and protect with a lock) and then update close() to iterate this
registry and close every connection found, clearing the registry and removing
any references; ensure you still set/clear self._local.read_conn as before and
guard registry mutations with a threading.Lock to avoid races.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (9)
CLAUDE.md, scripts/kg_dedup.py, scripts/kg_rebuild.py, src/brainlayer/kg_repo.py, src/brainlayer/pipeline/enrichment.py, src/brainlayer/pipeline/entity_extraction.py, src/brainlayer/pipeline/kg_extraction_groq.py, src/brainlayer/vector_store.py, tests/test_kg_rebuild.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
- GitHub Check: test (3.12)
🧰 Additional context used
📓 Path-based instructions (5)

tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
- Run tests with `pytest`
Files:
- tests/test_kg_rebuild.py

src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
- Use Typer CLI framework for command-line interface implementation in `src/brainlayer/`
- Handle concurrency by retrying on `SQLITE_BUSY` errors; each worker should use its own database connection
- Export brain graph as JSON for Next.js dashboard using `brainlayer brain-export` command
- Export Obsidian vault with Markdown files including backlinks and tags using `brainlayer export-obsidian` command
Files:
- src/brainlayer/pipeline/kg_extraction_groq.py, src/brainlayer/kg_repo.py, src/brainlayer/pipeline/enrichment.py, src/brainlayer/pipeline/entity_extraction.py, src/brainlayer/vector_store.py

src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
- Use `ruff check src/` for linting and `ruff format src/` for code formatting
Files:
- src/brainlayer/pipeline/kg_extraction_groq.py, src/brainlayer/kg_repo.py, src/brainlayer/pipeline/enrichment.py, src/brainlayer/pipeline/entity_extraction.py, src/brainlayer/vector_store.py

src/brainlayer/**/*enrich*.py
📄 CodeRabbit inference engine (CLAUDE.md)
- Use either Ollama (`glm4`) or MLX as enrichment backends, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable
- Set `"think": false` for GLM-4.7 enrichment backend for speed optimization
- Enrichment should add metadata: summary, tags, importance, and intent; capture decisions and corrections in session enrichment
- Cache prompts in `~/.local/share/brainlayer/prompts/`
- Use enrichment lock file at `/tmp/brainlayer-enrichment.lock` to prevent concurrent enrichment
Files:
- src/brainlayer/pipeline/enrichment.py

src/brainlayer/vector_store.py
📄 CodeRabbit inference engine (CLAUDE.md)
- Use APSW with sqlite-vec for vector storage implementation in `vector_store.py`
- Store database at `~/.local/share/brainlayer/brainlayer.db` with WAL mode and `busy_timeout=5000`
Files:
- src/brainlayer/vector_store.py
🧠 Learnings (12)
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Applies to src/brainlayer/**/*chunk*.py : Preserve verbatim chunks for types: `ai_code`, `stack_trace`, `user_message` in classification and chunking
Applied to files:
src/brainlayer/kg_repo.pyCLAUDE.md
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Applies to src/brainlayer/**/*enrich*.py : Enrichment should add metadata: summary, tags, importance, and intent; capture decisions and corrections in session enrichment
Applied to files:
src/brainlayer/pipeline/enrichment.pysrc/brainlayer/pipeline/entity_extraction.pyCLAUDE.md
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Applies to src/brainlayer/**/*enrich*.py : Use either Ollama (`glm4`) or MLX as enrichment backends, configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable
Applied to files:
src/brainlayer/pipeline/enrichment.pyCLAUDE.md
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Follow the Extract -> Classify -> Chunk -> Embed -> Index pipeline workflow
Applied to files:
CLAUDE.md
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Applies to src/brainlayer/**/*chunk*.py : Skip noise chunks; summarize build_log chunks; extract structure only from dir_listing chunks
Applied to files:
CLAUDE.md
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Implement post-processing including enrichment, brain graph construction, and Obsidian export
Applied to files:
CLAUDE.md
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Applies to src/brainlayer/**/*chunk*.py : Use AST-aware chunking with tree-sitter; never split stack traces; mask large tool output
Applied to files:
CLAUDE.md
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Applies to src/brainlayer/daemon.py : Implement FastAPI daemon in `daemon.py` with core endpoints: `/health`, `/stats`, `/search`, `/context/{chunk_id}`, `/session/{session_id}`
Applied to files:
CLAUDE.md
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Applies to src/brainlayer/**/*enrich*.py : Use enrichment lock file at `/tmp/brainlayer-enrichment.lock` to prevent concurrent enrichment
Applied to files:
CLAUDE.md
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Applies to src/brainlayer/**/*.py : Handle concurrency by retrying on `SQLITE_BUSY` errors; each worker should use its own database connection
Applied to files:
CLAUDE.mdsrc/brainlayer/vector_store.py
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Applies to src/brainlayer/vector_store.py : Use APSW with sqlite-vec for vector storage implementation in `vector_store.py`
Applied to files:
src/brainlayer/vector_store.py
📚 Learning: 2026-02-28T09:42:45.691Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-28T09:42:45.691Z
Learning: Applies to src/brainlayer/vector_store.py : Store database at `~/.local/share/brainlayer/brainlayer.db` with WAL mode and `busy_timeout=5000`
Applied to files:
src/brainlayer/vector_store.py
🧬 Code graph analysis (3)

tests/test_kg_rebuild.py (6)
- src/brainlayer/pipeline/entity_extraction.py (2): extract_entities_from_tags (435-481), parse_llm_ner_response (140-205)
- src/brainlayer/vector_store.py (3): VectorStore (56-679), close (664-673), _read_cursor (521-523)
- src/brainlayer/pipeline/kg_extraction_groq.py (2): build_multi_chunk_ner_prompt (35-54), parse_multi_chunk_response (57-82)
- src/brainlayer/pipeline/kg_extraction.py (1): extract_kg_from_chunk (126-174)
- src/brainlayer/cli/__init__.py (1): stats (269-273)
- src/brainlayer/kg_repo.py (2): upsert_entity (13-86), link_entity_chunk (144-168)

src/brainlayer/pipeline/enrichment.py (2)
- src/brainlayer/pipeline/entity_extraction.py (2): extract_entities_from_tags (435-481), ExtractionResult (45-51)
- src/brainlayer/pipeline/kg_extraction.py (2): extract_kg_from_chunk (126-174), process_extraction_result (34-123)

scripts/kg_dedup.py (5)
- src/brainlayer/paths.py (1): get_db_path (23-39)
- src/brainlayer/pipeline/entity_resolution.py (1): merge_entities (139-195)
- src/brainlayer/vector_store.py (2): VectorStore (56-679), close (664-673)
- tests/test_kg_rebuild.py (1): store (163-167)
- src/brainlayer/kg_repo.py (3): get_entity_by_name (201-230), get_entity_by_alias (474-499), add_entity_alias (461-472)
🪛 markdownlint-cli2 (0.21.0)
CLAUDE.md
[warning] 68-68: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
🔇 Additional comments (6)

src/brainlayer/kg_repo.py (1)
- 162-165: Good conflict policy for `mention_type` preservation. This prevents inferred updates from downgrading existing explicit links, which is the right behavior for provenance fidelity.

src/brainlayer/vector_store.py (1)
- 503-523: Per-thread readonly connection setup looks correct. Using `threading.local()` plus a readonly connection per worker is a solid fix for the busy-in-another-thread failure mode.

src/brainlayer/pipeline/enrichment.py (1)
- 841-871: Tier-1 KG hook integration looks good. Seed extraction plus tag extraction is integrated cleanly and keeps failures non-fatal to enrichment throughput.

tests/test_kg_rebuild.py (1)
- 21-297: Great coverage for the KG rebuild path. These tests exercise the core regressions and integration points introduced by this PR (tag extraction, Groq parsing, mention-type preservation, and seed-based extraction).

src/brainlayer/pipeline/kg_extraction_groq.py (1)
- 149-158: 429 handling/backoff logic is well implemented. Good use of `retry-after` with jitter plus bounded exponential fallback (a sketch of this pattern follows these comments).

scripts/kg_rebuild.py (1)
- 107-146: Tier 1 dedup + confidence selection is clean and effective. Merging seed/tag entities and retaining the highest-confidence `(text, entity_type)` before KG ingestion is a strong implementation.
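
For reference, a sketch of the retry-after-plus-jitter backoff praised in the kg_extraction_groq.py comment above; the function name and header handling are illustrative, not the module's actual code:

```python
import random
import time

def sleep_with_backoff(attempt: int, retry_after: str | None = None,
                       base: float = 2.0, cap: float = 60.0) -> None:
    """Prefer the server's Retry-After hint; otherwise back off exponentially, capped, with jitter."""
    if retry_after:
        try:
            delay = float(retry_after)
        except ValueError:
            delay = base
    else:
        delay = min(cap, base * (2 ** attempt))
    time.sleep(delay + random.uniform(0, 1))  # jitter de-synchronizes concurrent retries
```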
    ## Bulk DB Operations (SAFETY)
    1. **Stop enrichment workers first** — never run bulk ops while enrichment is writing (causes WAL bloat + potential freeze)
Add a blank line after the heading.
## Bulk DB Operations (SAFETY) should be followed by an empty line to satisfy MD022.
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 68-68: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
    def load_progress() -> dict:
        if PROGRESS_FILE.exists():
            return json.loads(PROGRESS_FILE.read_text())
        return {"tier1_done": False, "tier2_last_offset": 0, "tier2_processed": 0}
Harden progress persistence against partial writes.
save_progress() writes directly to the final file and load_progress() assumes valid JSON. A truncated write can break --resume entirely. Use atomic replace and JSON-fallback handling.
Proposed fix
def load_progress() -> dict:
- if PROGRESS_FILE.exists():
- return json.loads(PROGRESS_FILE.read_text())
- return {"tier1_done": False, "tier2_last_offset": 0, "tier2_processed": 0}
+ default = {"tier1_done": False, "tier2_last_offset": 0, "tier2_processed": 0}
+ if not PROGRESS_FILE.exists():
+ return default
+ try:
+ return json.loads(PROGRESS_FILE.read_text())
+ except (json.JSONDecodeError, OSError):
+ logger.warning("Invalid progress file at %s; restarting Tier 2 progress.", PROGRESS_FILE)
+ return default
def save_progress(progress: dict):
- PROGRESS_FILE.write_text(json.dumps(progress, indent=2))
+ tmp = PROGRESS_FILE.with_suffix(PROGRESS_FILE.suffix + ".tmp")
+ tmp.write_text(json.dumps(progress, indent=2))
+ tmp.replace(PROGRESS_FILE)

Also applies to: 70-71
    query = """
        SELECT c.id, c.content
        FROM chunks c
        LEFT JOIN kg_entity_chunks ec ON c.id = ec.chunk_id
        WHERE c.summary IS NOT NULL AND c.summary != ''
          AND c.importance >= 6
          AND ec.chunk_id IS NULL
          AND c.content IS NOT NULL
          AND LENGTH(c.content) > 50
        ORDER BY c.importance DESC, c.id
        LIMIT ? OFFSET ?
    """
OFFSET pagination will skip chunks as links are created.
Because the query excludes already-linked chunks (ec.chunk_id IS NULL), each processed batch mutates the result set. Incrementing OFFSET then skips remaining candidates.
Proposed fix (stable keyset pagination)
- progress = load_progress() if resume else {"tier2_last_offset": 0, "tier2_processed": 0}
- start_offset = progress.get("tier2_last_offset", 0)
+ progress = load_progress() if resume else {"tier2_last_id": 0, "tier2_processed": 0}
+ last_id = progress.get("tier2_last_id", 0)
@@
- query = """
+ query = """
SELECT c.id, c.content
FROM chunks c
LEFT JOIN kg_entity_chunks ec ON c.id = ec.chunk_id
WHERE c.summary IS NOT NULL AND c.summary != ''
AND c.importance >= 6
AND ec.chunk_id IS NULL
AND c.content IS NOT NULL
AND LENGTH(c.content) > 50
- ORDER BY c.importance DESC, c.id
- LIMIT ? OFFSET ?
+ AND c.id > ?
+ ORDER BY c.id
+ LIMIT ?
"""
@@
- offset = start_offset
while stats["chunks_processed"] < limit:
- rows = list(cursor.execute(query, (chunks_per_call, offset)))
+ rows = list(cursor.execute(query, (last_id, chunks_per_call)))
if not rows:
logger.info("No more unprocessed chunks")
break
@@
- offset += chunks_per_call
+ last_id = rows[-1][0]
@@
- progress["tier2_last_offset"] = offset
+ progress["tier2_last_id"] = last_id
progress["tier2_processed"] = stats["chunks_processed"]
save_progress(progress)

Also applies to: 206-209, 284-289
    for chunk_result in parsed_results:
        chunk_id = chunk_result["chunk_id"]
        # Find the original content for span matching
        content = ""
        for c in chunks:
            if c["id"] == chunk_id:
                content = c["content"]
                break

        entities = []
        for ent_data in chunk_result.get("entities", []):
            text = ent_data.get("text", "")
            etype = ent_data.get("type", "")
            if not text or not etype:
                continue
            # Find span in content
            idx = content.lower().find(text.lower()) if content else -1
            entities.append(ExtractedEntity(
                text=text,
                entity_type=etype,
                start=idx,
                end=idx + len(text) if idx >= 0 else -1,
                confidence=0.75,
                source="llm",
            ))

        relations = []
        for rel_data in chunk_result.get("relations", []):
            source = rel_data.get("source", "")
            target = rel_data.get("target", "")
            rtype = rel_data.get("type", "")
            if source and target and rtype:
                relations.append(ExtractedRelation(
                    source_text=source,
                    target_text=target,
                    relation_type=rtype,
                    confidence=0.70,
                ))

        if entities or relations:
            result = ExtractionResult(
                entities=entities,
                relations=relations,
                chunk_id=chunk_id,
            )
            kg_stats = process_extraction_result(store, result)
            stats["entities_found"] += kg_stats["entities_created"]
            stats["relations_found"] += kg_stats["relations_created"]
Isolate failures per chunk result to avoid dropping whole batches.
A single bad chunk_result currently aborts the entire parsed batch. Catch exceptions inside the loop so other chunk results in the same Groq response still get ingested.
Proposed fix
- for chunk_result in parsed_results:
- chunk_id = chunk_result["chunk_id"]
- # Find the original content for span matching
- content = ""
- for c in chunks:
- if c["id"] == chunk_id:
- content = c["content"]
- break
-
- entities = []
- for ent_data in chunk_result.get("entities", []):
- text = ent_data.get("text", "")
- etype = ent_data.get("type", "")
- if not text or not etype:
- continue
- # Find span in content
- idx = content.lower().find(text.lower()) if content else -1
- entities.append(ExtractedEntity(
- text=text,
- entity_type=etype,
- start=idx,
- end=idx + len(text) if idx >= 0 else -1,
- confidence=0.75,
- source="llm",
- ))
-
- relations = []
- for rel_data in chunk_result.get("relations", []):
- source = rel_data.get("source", "")
- target = rel_data.get("target", "")
- rtype = rel_data.get("type", "")
- if source and target and rtype:
- relations.append(ExtractedRelation(
- source_text=source,
- target_text=target,
- relation_type=rtype,
- confidence=0.70,
- ))
-
- if entities or relations:
- result = ExtractionResult(
- entities=entities,
- relations=relations,
- chunk_id=chunk_id,
- )
- kg_stats = process_extraction_result(store, result)
- stats["entities_found"] += kg_stats["entities_created"]
- stats["relations_found"] += kg_stats["relations_created"]
+ for chunk_result in parsed_results:
+ try:
+ chunk_id = chunk_result["chunk_id"]
+ content = ""
+ for c in chunks:
+ if c["id"] == chunk_id:
+ content = c["content"]
+ break
+
+ entities = []
+ for ent_data in chunk_result.get("entities", []):
+ text = ent_data.get("text", "")
+ etype = ent_data.get("type", "")
+ if not text or not etype:
+ continue
+ idx = content.lower().find(text.lower()) if content else -1
+ entities.append(ExtractedEntity(
+ text=text,
+ entity_type=etype,
+ start=idx,
+ end=idx + len(text) if idx >= 0 else -1,
+ confidence=0.75,
+ source="llm",
+ ))
+
+ relations = []
+ for rel_data in chunk_result.get("relations", []):
+ source = rel_data.get("source", "")
+ target = rel_data.get("target", "")
+ rtype = rel_data.get("type", "")
+ if source and target and rtype:
+ relations.append(ExtractedRelation(
+ source_text=source,
+ target_text=target,
+ relation_type=rtype,
+ confidence=0.70,
+ ))
+
+ if entities or relations:
+ result = ExtractionResult(
+ entities=entities,
+ relations=relations,
+ chunk_id=chunk_id,
+ )
+ kg_stats = process_extraction_result(store, result)
+ stats["entities_found"] += kg_stats["entities_created"]
+ stats["relations_found"] += kg_stats["relations_created"]
+ except Exception:
+ logger.exception("Error processing parsed Groq result: %s", chunk_result)
+ stats["errors"] += 1

Also applies to: 280-283
    db_path = get_db_path()
    logger.info("Using DB: %s", db_path)
    store = VectorStore(db_path)

    if args.stats:
        print_kg_stats(store)
        store.close()
        return

    if not args.tier1 and not args.tier2:
        parser.print_help()
        print("\nSpecify --tier1, --tier2, or both.")
        store.close()
        return

    print_kg_stats(store)

    if args.tier1:
        tier1_stats = tier1_seed_and_tags(store)
        print_kg_stats(store)

    if args.tier2:
        tier2_stats = tier2_groq_ner(
            store, limit=args.limit,
            chunks_per_call=args.chunks_per_call,
            resume=args.resume,
        )
        print_kg_stats(store)

    store.close()
    logger.info("Done!")
🧹 Nitpick | 🔵 Trivial
Wrap main workflow in try/finally to guarantee connection cleanup.
store.close() is handled for expected paths, but an unexpected exception can bypass it. Use try/finally around the orchestration block.
Proposed refactor
db_path = get_db_path()
logger.info("Using DB: %s", db_path)
store = VectorStore(db_path)
-
- if args.stats:
- print_kg_stats(store)
- store.close()
- return
-
- if not args.tier1 and not args.tier2:
- parser.print_help()
- print("\nSpecify --tier1, --tier2, or both.")
- store.close()
- return
-
- print_kg_stats(store)
-
- if args.tier1:
- tier1_stats = tier1_seed_and_tags(store)
- print_kg_stats(store)
-
- if args.tier2:
- tier2_stats = tier2_groq_ner(
- store, limit=args.limit,
- chunks_per_call=args.chunks_per_call,
- resume=args.resume,
- )
- print_kg_stats(store)
-
- store.close()
- logger.info("Done!")
+ try:
+ if args.stats:
+ print_kg_stats(store)
+ return
+
+ if not args.tier1 and not args.tier2:
+ parser.print_help()
+ print("\nSpecify --tier1, --tier2, or both.")
+ return
+
+ print_kg_stats(store)
+
+ if args.tier1:
+ tier1_stats = tier1_seed_and_tags(store)
+ print_kg_stats(store)
+
+ if args.tier2:
+ tier2_stats = tier2_groq_ner(
+ store, limit=args.limit,
+ chunks_per_call=args.chunks_per_call,
+ resume=args.resume,
+ )
+ print_kg_stats(store)
+
+ logger.info("Done!")
+ finally:
+ store.close()
        # Thread-local storage for per-thread read connections.
        # APSW connections are NOT thread-safe — each thread needs its own.
        # This prevents "Connection is busy in another thread" when parallel
        # MCP tool calls (e.g., brain_search) hit the same VectorStore.
        self._local = threading.local()

    def _get_read_conn(self) -> apsw.Connection:
        """Get or create a per-thread readonly connection."""
        conn = getattr(self._local, "read_conn", None)
        if conn is None:
            conn = apsw.Connection(str(self.db_path), flags=apsw.SQLITE_OPEN_READONLY)
            conn.enableloadextension(True)
            conn.loadextension(sqlite_vec.loadable_path())
            conn.enableloadextension(False)
            conn.cursor().execute("PRAGMA busy_timeout = 5000")
            self._local.read_conn = conn
        return conn
🧩 Analysis chain
🏁 Script executed:

    # First, let's explore the repository structure and locate the file
    find . -name "vector_store.py" -type f | head -5

Repository: EtanHey/brainlayer
Length of output: 94

🏁 Script executed:

    # Get the size of vector_store.py to understand its scope
    wc -l src/brainlayer/vector_store.py

Repository: EtanHey/brainlayer
Length of output: 96

🏁 Script executed:

    # Examine the specific lines 503-519 and surrounding context
    sed -n '495,525p' src/brainlayer/vector_store.py

Repository: EtanHey/brainlayer
Length of output: 1520

🏁 Script executed:

    # Look for the close() method and lines 666-671
    sed -n '660,680p' src/brainlayer/vector_store.py

Repository: EtanHey/brainlayer
Length of output: 677

🏁 Script executed:

    # Search for all references to _get_read_conn and close methods
    rg -n "def _get_read_conn|def close\(" src/brainlayer/vector_store.py -A 20

Repository: EtanHey/brainlayer
Length of output: 1693

🏁 Script executed:

    # Check if there are any existing registries or tracking of connections
    rg -n "_read_conns|WeakSet|connections.*dict|connection.*list" src/brainlayer/vector_store.py

Repository: EtanHey/brainlayer
Length of output: 44
close() does not close read connections created in other threads.
Current cleanup only touches the caller thread's self._local.read_conn. If this VectorStore instance is read from multiple threads, other per-thread connections can remain open until thread teardown/GC.
Suggested fix
class VectorStore(SearchMixin, KGMixin, SessionMixin):
@@
- self._local = threading.local()
+ self._local = threading.local()
+ self._read_conns: set[apsw.Connection] = set()
+ self._read_conns_lock = threading.Lock()
@@
def _get_read_conn(self) -> apsw.Connection:
"""Get or create a per-thread readonly connection."""
conn = getattr(self._local, "read_conn", None)
if conn is None:
conn = apsw.Connection(str(self.db_path), flags=apsw.SQLITE_OPEN_READONLY)
@@
conn.setbusytimeout(30_000)
self._local.read_conn = conn
+ with self._read_conns_lock:
+ self._read_conns.add(conn)
return conn
@@
def close(self) -> None:
"""Close database connections."""
- # Close thread-local read connection if it exists
- if hasattr(self, "_local"):
- read_conn = getattr(self._local, "read_conn", None)
- if read_conn is not None:
- read_conn.close()
- self._local.read_conn = None
+ if hasattr(self, "_local"):
+ read_conn = getattr(self._local, "read_conn", None)
+ if read_conn is not None:
+ read_conn.close()
+ self._local.read_conn = None
+ if hasattr(self, "_read_conns_lock") and hasattr(self, "_read_conns"):
+ with self._read_conns_lock:
+ for c in list(self._read_conns):
+ try:
+ c.close()
+ except Exception:
+ pass
+ self._read_conns.clear()
if hasattr(self, "conn"):
self.conn.close()
- Clean up kg_vec_entities when deleting false positive entities (orphan rows)
- Skip non-dict and empty chunk_id entries in parse_multi_chunk_response
- Deduplicate normalized tags before emitting entities (avoid redundant upserts)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ity data

Add KG overlay to brain-export graph.json:
- 109 entity nodes (person, project, technology, company, golem, topic) positioned at centroids of their connected sessions
- 436 entity→session edges based on kg_entity_chunks relevance
- 16 entity↔entity relation edges from kg_relations
- 3895/4378 sessions enriched with top_entities list
- New node fields: node_type, entity_type, top_entities, description
- New edge fields: edge_type, relation_type, fact
- Full backward compatibility — existing dashboard fields unchanged

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- use_llm=False, use_gliner=False, seed_entities={} (triple disabled)

Changes

Core fixes
- DEFAULT_SEED_ENTITIES + tag-based extraction in the enrichment hook
- explicit mention_type — don't let inferred overwrite it on conflict
- threading.local(), clear reference on close()
- extract_entities_from_tags(), skip non-string tags, normalize dots

New modules

Results

Test plan
- test_kg_rebuild.py covering tag extraction, Groq NER parsing, enrichment hook, mention_type preservation, close() cleanup

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests
Performance