
feat: KG extraction pipeline — wires entities to KG tables#47

Merged
EtanHey merged 1 commit into main from feat/kg-extraction
Feb 27, 2026
Conversation


@EtanHey (Owner) commented Feb 27, 2026

Summary

  • New pipeline/kg_extraction.py module wiring entity extraction → KG standard tables
  • process_extraction_result: converts ExtractedEntity/Relation → kg_entities + kg_relations + kg_entity_chunks
  • extract_kg_from_chunk: full extraction flow (seed/LLM/GLiNER → KG)
  • Hooked into _enrich_one as non-critical post-enrichment step (seed-only for now)
  • Confidence propagation: seed entities get high confidence, LLM entities inherit extraction confidence
  • mention_type tagging (explicit/inferred) and source_chunk_id provenance
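A minimal sketch of the confidence-propagation rule described above (the `ExtractedEntity` shape and the `SEED_CONFIDENCE` constant are illustrative stand-ins, not the module's actual API):

```python
from dataclasses import dataclass

# Illustrative constant: seed entities are trusted, so they get high confidence.
SEED_CONFIDENCE = 0.95

@dataclass
class ExtractedEntity:
    name: str
    entity_type: str
    source: str        # "seed", "llm", or "gliner"
    confidence: float  # confidence reported by the extractor

def kg_confidence(entity: ExtractedEntity) -> float:
    """Seed entities get high confidence; LLM/GLiNER entities inherit theirs."""
    if entity.source == "seed":
        return SEED_CONFIDENCE
    return entity.confidence

print(kg_confidence(ExtractedEntity("Etan", "person", "seed", 0.5)))  # 0.95
print(kg_confidence(ExtractedEntity("ruff", "tool", "llm", 0.6)))     # 0.6
```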

Test plan

  • 11 new tests in test_kg_extraction.py
  • All 92 KG tests pass (schema + standard + extraction)
  • Lint clean (ruff check + format)

🤖 Generated with Claude Code


Note

Medium Risk
Adds new post-enrichment writes into KG tables (entities, relations, chunk links) and runs automatically for each enriched chunk, which could impact runtime and KG data quality despite being best-effort and wrapped in a non-fatal try/except.

Overview
Adds a new KG extraction step that runs after chunk enrichment: extracted entities are resolved/upserted into kg_entities, linked to their source chunks via kg_entity_chunks with mention_type (explicit vs inferred), and extracted relations are persisted to kg_relations with source_chunk_id provenance.

Wires this into _enrich_one as a non-critical post-enrichment hook (currently seed-only with LLM/GLiNER disabled) and adds a comprehensive test suite covering entity creation/dedup, relation creation, chunk linking, and confidence propagation.

Written by Cursor Bugbot for commit ad4d328. This will update automatically on new commits.

Summary by CodeRabbit

  • New Features

    • Added knowledge graph extraction capability that automatically processes enriched content to identify and link entities and relationships with confidence tracking and deduplication support.
  • Tests

    • Comprehensive test coverage added for knowledge graph extraction functionality, including validation of entity creation, relationship linking, and chunk-specific source attribution.

New module pipeline/kg_extraction.py:
- process_extraction_result: entities + relations → KG tables
- extract_kg_from_chunk: full extraction flow for a single chunk
- Confidence propagation from extraction sources to KG entities
- mention_type tagging (explicit/inferred)
- source_chunk_id provenance on relations

Wired into enrichment pipeline (_enrich_one) as non-critical post-step.
11 new tests covering entity creation, relation creation, dedup, confidence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@EtanHey EtanHey merged commit 488ab6e into main Feb 27, 2026
1 of 5 checks passed
coderabbitai Bot commented Feb 27, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86f097a and ad4d328.

📒 Files selected for processing (3)
  • src/brainlayer/pipeline/enrichment.py
  • src/brainlayer/pipeline/kg_extraction.py
  • tests/test_kg_extraction.py

📝 Walkthrough

This pull request introduces KG extraction capabilities to the enrichment pipeline. A new kg_extraction module processes chunks to extract entities and relations, resolving them against a vector store and creating knowledge graph connections. The enrichment process now performs best-effort KG extraction after enriching each chunk.

Changes

Enrichment Integration (src/brainlayer/pipeline/enrichment.py):
Adds a non-critical, exception-handled KG extraction call in _enrich_one after chunk enrichment. Logs failures at debug level without interrupting enrichment.

KG Extraction Pipeline (src/brainlayer/pipeline/kg_extraction.py):
New module implementing end-to-end KG extraction: extract_kg_from_chunk orchestrates chunk reading and extraction via combined entity extraction (seed/GLiNER/LLM); process_extraction_result handles entity resolution, creation, confidence updates, and relation storage with chunk linking; _mention_type_from_source maps extraction sources to mention types.

KG Extraction Tests (tests/test_kg_extraction.py):
Comprehensive test suite covering entity/relation creation, deduplication via resolution, confidence propagation, chunk linking, empty extraction handling, and full extraction flows with mocked LLM/GLiNER paths.

Sequence Diagram

sequenceDiagram
    participant Enrich as Enrichment Pipeline
    participant KGE as KG Extraction Module
    participant Extract as Entity Extraction
    participant VStore as VectorStore
    participant Resolve as Entity Resolution

    Enrich->>Enrich: Enrich chunk content
    Enrich->>KGE: extract_kg_from_chunk(chunk_id)
    KGE->>VStore: Read chunk content
    KGE->>Extract: extract_entities_combined(content, seed/llm/gliner)
    Extract-->>KGE: ExtractionResult (entities + relations)
    KGE->>KGE: process_extraction_result(result)
    loop For each extracted entity
        KGE->>Resolve: resolve_entity(entity_text)
        Resolve-->>KGE: entity_id or create new
        KGE->>VStore: Update entity confidence
        KGE->>VStore: Link entity to chunk
    end
    loop For each extracted relation
        KGE->>VStore: Resolve source/target entities
        KGE->>VStore: Store relation with metadata
    end
    KGE-->>Enrich: stats (entities/relations created)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Knowledge graphs grow with every chunk,
Entities dance and relations link,
Seeds and LLMs extract with trust,
Confidence flows through the semantic dust,
Building webs of wisdom, glistening-bright! ✨



@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.


    new_confidence = max(existing.get("confidence", 0) or 0, ext_entity.confidence)
else:
    # Newly created — set confidence from extraction source
    new_confidence = ext_entity.confidence

Pre-existing entity check narrower than resolution logic

Medium Severity

The pre_existing check via store.get_entity_by_name() performs a case-sensitive SQL lookup (WHERE name = ?), but resolve_entity() uses case-insensitive matching (FTS5 + _normalize_name()) and alias resolution. When an entity exists as "Etan" and an extraction finds "etan" (different case), pre_existing is None while resolve_entity returns the existing entity's ID. This causes the code to take the "newly created" branch and set new_confidence = ext_entity.confidence, potentially overwriting a higher existing confidence with a lower one (e.g., 0.95 → 0.6).
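One way to address this, sketched below with an illustrative in-memory store (the real project's `resolve_entity` uses FTS5 and alias resolution; only the branching idea is the point), is to decide "existing vs new" from the resolver itself so the confidence branch agrees with case-insensitive matching:

```python
# Illustrative store keyed by normalized name; stands in for the KG tables.
store = {"etan": {"name": "Etan", "confidence": 0.95}}

def _normalize_name(name: str) -> str:
    return name.strip().lower()

def resolve_entity(name: str):
    """Return the existing record for a case-insensitive match, else None."""
    return store.get(_normalize_name(name))

def propagate_confidence(name: str, ext_confidence: float) -> float:
    existing = resolve_entity(name)
    if existing is not None:
        # Existing entity (any casing): never lower its confidence.
        return max(existing.get("confidence", 0) or 0, ext_confidence)
    # Genuinely new entity: take the extractor's confidence.
    return ext_confidence

print(propagate_confidence("etan", 0.6))  # 0.95, not overwritten down to 0.6
```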


    seed_entities={},  # TODO: load seed entities from config
    use_llm=False,  # Seed-only for now (LLM extraction is expensive)
    use_gliner=False,
)

No-op extraction runs unnecessary DB query per chunk

Low Severity

The extract_kg_from_chunk call passes seed_entities={} with both use_llm=False and use_gliner=False, meaning no extraction strategy has anything to match. Yet each invocation still executes SELECT content FROM chunks WHERE id = ? before discovering there's nothing to extract. This unnecessary database read runs for every successfully enriched chunk.
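A simple guard avoids the wasted read; the helper name below is hypothetical, but the condition mirrors the reviewer's point that with no seeds and both model paths disabled, no strategy can match anything:

```python
def should_run_extraction(seed_entities: dict, use_llm: bool, use_gliner: bool) -> bool:
    """Skip the chunk read entirely when no strategy could match anything."""
    return bool(seed_entities) or use_llm or use_gliner

# With the current seed-only defaults and an empty seed dict, nothing runs:
print(should_run_extraction({}, use_llm=False, use_gliner=False))  # False
```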


    ext_entity.entity_type,
    existing["name"],
    confidence=new_confidence,
)

Confidence update inadvertently resets importance and metadata

Medium Severity

The upsert_entity call intends to update only confidence, but because upsert_entity's ON CONFLICT clause unconditionally overwrites metadata, description, and importance, calling it without those arguments resets importance to the default 0.5, clears metadata to {}, and sets description to NULL. Any previously stored values for those fields on an existing entity are silently lost every time process_extraction_result processes it.
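One mitigation, sketched here with a hypothetical argument-builder (field names and the 0.5 default follow the review text, not a verified schema), is to re-send the existing values so the unconditional ON CONFLICT overwrite becomes a no-op for everything except confidence:

```python
def build_upsert_args(existing: dict, new_confidence: float) -> dict:
    """Carry existing fields through so an unconditional upsert doesn't reset them."""
    return {
        "name": existing["name"],
        "confidence": new_confidence,
        # Pass through fields the upsert would otherwise reset to defaults.
        "importance": existing.get("importance", 0.5),
        "description": existing.get("description"),
        "metadata": existing.get("metadata", {}),
    }

existing = {"name": "Etan", "importance": 0.9, "description": "owner", "metadata": {"k": 1}}
args = build_upsert_args(existing, 0.97)
print(args["importance"], args["metadata"])  # 0.9 {'k': 1}
```

An alternative is a dedicated UPDATE that touches only the confidence column, leaving the upsert for genuinely new entities.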

