feat: KG extraction pipeline — wires entities to KG tables #47
Conversation
New module `pipeline/kg_extraction.py`:
- `process_extraction_result`: entities + relations → KG tables
- `extract_kg_from_chunk`: full extraction flow for a single chunk
- Confidence propagation from extraction sources to KG entities
- `mention_type` tagging (explicit/inferred)
- `source_chunk_id` provenance on relations

Wired into enrichment pipeline (`_enrich_one`) as non-critical post-step. 11 new tests covering entity creation, relation creation, dedup, confidence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
> ⚠️ Review failed: the pull request is closed. (Review profile: ASSERTIVE; 3 files selected for processing.)
📝 Walkthrough

This pull request introduces KG extraction capabilities to the enrichment pipeline. A new `kg_extraction` module processes chunks to extract entities and relations, resolving them against a vector store and creating knowledge graph connections. The enrichment process now performs best-effort KG extraction after enriching each chunk.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Enrich as Enrichment Pipeline
    participant KGE as KG Extraction Module
    participant Extract as Entity Extraction
    participant VStore as VectorStore
    participant Resolve as Entity Resolution
    Enrich->>Enrich: Enrich chunk content
    Enrich->>KGE: extract_kg_from_chunk(chunk_id)
    KGE->>VStore: Read chunk content
    KGE->>Extract: extract_entities_combined(content, seed/llm/gliner)
    Extract-->>KGE: ExtractionResult (entities + relations)
    KGE->>KGE: process_extraction_result(result)
    loop For each extracted entity
        KGE->>Resolve: resolve_entity(entity_text)
        Resolve-->>KGE: entity_id or create new
        KGE->>VStore: Update entity confidence
        KGE->>VStore: Link entity to chunk
    end
    loop For each extracted relation
        KGE->>VStore: Resolve source/target entities
        KGE->>VStore: Store relation with metadata
    end
    KGE-->>Enrich: stats (entities/relations created)
```
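The flow in the diagram can be sketched end-to-end with a toy in-memory store. Everything below (the `Store` shape, the trivial substring "extraction", the stats keys) is illustrative only, not the actual module API:

```python
# Minimal in-memory sketch of the extract -> resolve -> link flow in the diagram.
# All names here are illustrative stand-ins, not the real pipeline/kg_extraction API.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    confidence: float

@dataclass
class Store:
    chunks: dict = field(default_factory=dict)    # chunk_id -> text
    entities: dict = field(default_factory=dict)  # normalized name -> Entity
    links: list = field(default_factory=list)     # (entity_key, chunk_id)

    def resolve_entity(self, text):
        # Case-insensitive resolution, creating the entity if it is new
        key = text.strip().lower()
        created = key not in self.entities
        if created:
            self.entities[key] = Entity(name=text, confidence=0.0)
        return key, created

def extract_kg_from_chunk(store, chunk_id, seed_entities):
    content = store.chunks.get(chunk_id)
    if not content:
        return {"entities_created": 0, "entities_linked": 0}
    stats = {"entities_created": 0, "entities_linked": 0}
    for name, conf in seed_entities.items():
        if name.lower() in content.lower():  # trivial stand-in for seed matching
            key, created = store.resolve_entity(name)
            ent = store.entities[key]
            ent.confidence = max(ent.confidence, conf)  # keep the higher confidence
            store.links.append((key, chunk_id))
            stats["entities_created"] += int(created)
            stats["entities_linked"] += 1
    return stats
```

A usage pass over one chunk creates the entity once, links it to the chunk, and propagates the seed confidence.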
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Cursor Bugbot has reviewed your changes and found 3 potential issues.
```python
        new_confidence = max(existing.get("confidence", 0) or 0, ext_entity.confidence)
    else:
        # Newly created — set confidence from extraction source
        new_confidence = ext_entity.confidence
```
Pre-existing entity check narrower than resolution logic
Medium Severity
The pre_existing check via store.get_entity_by_name() performs a case-sensitive SQL lookup (WHERE name = ?), but resolve_entity() uses case-insensitive matching (FTS5 + _normalize_name()) and alias resolution. When an entity exists as "Etan" and an extraction finds "etan" (different case), pre_existing is None while resolve_entity returns the existing entity's ID. This causes the code to take the "newly created" branch and set new_confidence = ext_entity.confidence, potentially overwriting a higher existing confidence with a lower one (e.g., 0.95 → 0.6).
```python
    seed_entities={},  # TODO: load seed entities from config
    use_llm=False,     # Seed-only for now (LLM extraction is expensive)
    use_gliner=False,
)
```
No-op extraction runs unnecessary DB query per chunk
Low Severity
The extract_kg_from_chunk call passes seed_entities={} with both use_llm=False and use_gliner=False, meaning no extraction strategy has anything to match. Yet each invocation still executes SELECT content FROM chunks WHERE id = ? before discovering there's nothing to extract. This unnecessary database read runs for every successfully enriched chunk.
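A cheap guard would skip the chunk read entirely when no strategy can match anything. A minimal sketch, with parameter names mirroring the snippet above (the guard itself is an assumption, not existing code):

```python
# Hypothetical guard: bail out before touching the DB when no extraction
# strategy (seed dictionary, LLM, or GLiNER) is enabled.
def should_extract(seed_entities, use_llm, use_gliner):
    return bool(seed_entities) or use_llm or use_gliner
```

Calling this at the top of `extract_kg_from_chunk` (or before invoking it from the enrichment loop) would avoid the per-chunk `SELECT` in the no-op configuration.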
```python
    ext_entity.entity_type,
    existing["name"],
    confidence=new_confidence,
)
```
Confidence update inadvertently resets importance and metadata
Medium Severity
The upsert_entity call intends to update only confidence, but because upsert_entity's ON CONFLICT clause unconditionally overwrites metadata, description, and importance, calling it without those arguments resets importance to the default 0.5, clears metadata to {}, and sets description to NULL. Any previously stored values for those fields on an existing entity are silently lost every time process_extraction_result processes it.
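One way to avoid this, sketched against an assumed `kg_entities` schema (the real columns and defaults may differ): use `COALESCE` in the `ON CONFLICT` clause so fields the caller did not supply keep their stored values instead of being reset:

```python
# Hypothetical sketch of a field-preserving upsert. The kg_entities schema here
# is assumed from the PR description, not taken from the actual migration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE kg_entities (
    name TEXT PRIMARY KEY, entity_type TEXT,
    confidence REAL, importance REAL DEFAULT 0.5, description TEXT)""")

UPSERT = """
INSERT INTO kg_entities (name, entity_type, confidence, importance, description)
VALUES (?, ?, ?, COALESCE(?, 0.5), ?)
ON CONFLICT(name) DO UPDATE SET
    confidence  = excluded.confidence,
    importance  = COALESCE(?, kg_entities.importance),
    description = COALESCE(?, kg_entities.description)
"""

def upsert_entity(name, entity_type, confidence, importance=None, description=None):
    # NULL (None) arguments fall through COALESCE and keep the stored values
    conn.execute(UPSERT, (name, entity_type, confidence, importance, description,
                          importance, description))

upsert_entity("Etan", "person", 0.9, importance=0.8, description="a person")
upsert_entity("Etan", "person", 0.95)  # confidence-only update
row = conn.execute(
    "SELECT confidence, importance, description FROM kg_entities WHERE name = 'Etan'"
).fetchone()
```

Here the confidence-only update bumps confidence to 0.95 while `importance` and `description` survive untouched.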


Summary

- `pipeline/kg_extraction.py` module wiring entity extraction → KG standard tables
- `process_extraction_result`: converts ExtractedEntity/Relation → `kg_entities` + `kg_relations` + `kg_entity_chunks`
- `extract_kg_from_chunk`: full extraction flow (seed/LLM/GLiNER → KG)
- Wired into `_enrich_one` as non-critical post-enrichment step (seed-only for now)

Test plan
🤖 Generated with Claude Code
Note
Medium Risk
Adds new post-enrichment writes into KG tables (entities, relations, chunk links) and runs automatically for each enriched chunk, which could impact runtime and KG data quality despite being best-effort and wrapped in a non-fatal try/except.
Overview

Adds a new KG extraction step that runs after chunk enrichment: extracted entities are resolved/upserted into `kg_entities`, linked to their source chunks via `kg_entity_chunks` with `mention_type` (explicit vs inferred), and extracted relations are persisted to `kg_relations` with `source_chunk_id` provenance.

Wires this into `_enrich_one` as a non-critical post-enrichment hook (currently seed-only with LLM/GLiNER disabled) and adds a comprehensive test suite covering entity creation/dedup, relation creation, chunk linking, and confidence propagation.

Written by Cursor Bugbot for commit ad4d328.