feat: KG extraction pipeline — wires entities to KG tables #47
Conversation
New module `pipeline/kg_extraction.py`:
- `process_extraction_result`: entities + relations → KG tables
- `extract_kg_from_chunk`: full extraction flow for a single chunk
- Confidence propagation from extraction sources to KG entities
- `mention_type` tagging (explicit/inferred)
- `source_chunk_id` provenance on relations

Wired into enrichment pipeline (`_enrich_one`) as non-critical post-step. 11 new tests covering entity creation, relation creation, dedup, confidence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
> ⚠️ Review failed: the pull request is closed. (Review profile: ASSERTIVE; 3 files selected for processing.)
📝 Walkthrough

This pull request introduces KG extraction capabilities to the enrichment pipeline. A new `kg_extraction` module processes chunks to extract entities and relations, resolving them against a vector store and creating knowledge graph connections. The enrichment process now performs best-effort KG extraction after enriching each chunk.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Enrich as Enrichment Pipeline
    participant KGE as KG Extraction Module
    participant Extract as Entity Extraction
    participant VStore as VectorStore
    participant Resolve as Entity Resolution
    Enrich->>Enrich: Enrich chunk content
    Enrich->>KGE: extract_kg_from_chunk(chunk_id)
    KGE->>VStore: Read chunk content
    KGE->>Extract: extract_entities_combined(content, seed/llm/gliner)
    Extract-->>KGE: ExtractionResult (entities + relations)
    KGE->>KGE: process_extraction_result(result)
    loop For each extracted entity
        KGE->>Resolve: resolve_entity(entity_text)
        Resolve-->>KGE: entity_id or create new
        KGE->>VStore: Update entity confidence
        KGE->>VStore: Link entity to chunk
    end
    loop For each extracted relation
        KGE->>VStore: Resolve source/target entities
        KGE->>VStore: Store relation with metadata
    end
    KGE-->>Enrich: stats (entities/relations created)
```
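The flow in the diagram can be sketched end-to-end with a toy in-memory store. Everything below (the `Store` shape, the trivial substring "extraction", the stats keys) is illustrative only, not the actual module API:

```python
# Minimal in-memory sketch of the extract -> resolve -> link flow in the diagram.
# All names here are illustrative stand-ins, not the real pipeline/kg_extraction API.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    confidence: float

@dataclass
class Store:
    chunks: dict = field(default_factory=dict)    # chunk_id -> text
    entities: dict = field(default_factory=dict)  # normalized name -> Entity
    links: list = field(default_factory=list)     # (entity_key, chunk_id)

    def resolve_entity(self, text):
        # Case-insensitive resolution, creating the entity if it is new
        key = text.strip().lower()
        created = key not in self.entities
        if created:
            self.entities[key] = Entity(name=text, confidence=0.0)
        return key, created

def extract_kg_from_chunk(store, chunk_id, seed_entities):
    content = store.chunks.get(chunk_id)
    if not content:
        return {"entities_created": 0, "entities_linked": 0}
    stats = {"entities_created": 0, "entities_linked": 0}
    for name, conf in seed_entities.items():
        if name.lower() in content.lower():  # trivial stand-in for seed matching
            key, created = store.resolve_entity(name)
            ent = store.entities[key]
            ent.confidence = max(ent.confidence, conf)  # keep the higher confidence
            store.links.append((key, chunk_id))
            stats["entities_created"] += int(created)
            stats["entities_linked"] += 1
    return stats
```

A usage pass over one chunk creates the entity once, links it to the chunk, and propagates the seed confidence.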
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Cursor Bugbot has reviewed your changes and found 3 potential issues.
```python
        new_confidence = max(existing.get("confidence", 0) or 0, ext_entity.confidence)
    else:
        # Newly created — set confidence from extraction source
        new_confidence = ext_entity.confidence
```
Pre-existing entity check narrower than resolution logic
Medium Severity
The pre_existing check via store.get_entity_by_name() performs a case-sensitive SQL lookup (WHERE name = ?), but resolve_entity() uses case-insensitive matching (FTS5 + _normalize_name()) and alias resolution. When an entity exists as "Etan" and an extraction finds "etan" (different case), pre_existing is None while resolve_entity returns the existing entity's ID. This causes the code to take the "newly created" branch and set new_confidence = ext_entity.confidence, potentially overwriting a higher existing confidence with a lower one (e.g., 0.95 → 0.6).
```python
    seed_entities={},  # TODO: load seed entities from config
    use_llm=False,     # Seed-only for now (LLM extraction is expensive)
    use_gliner=False,
)
```
No-op extraction runs unnecessary DB query per chunk
Low Severity
The extract_kg_from_chunk call passes seed_entities={} with both use_llm=False and use_gliner=False, meaning no extraction strategy has anything to match. Yet each invocation still executes SELECT content FROM chunks WHERE id = ? before discovering there's nothing to extract. This unnecessary database read runs for every successfully enriched chunk.
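A cheap guard would skip the chunk read entirely when no strategy can match anything. A minimal sketch, with parameter names mirroring the snippet above (the guard itself is an assumption, not existing code):

```python
# Hypothetical guard: bail out before touching the DB when no extraction
# strategy (seed dictionary, LLM, or GLiNER) is enabled.
def should_extract(seed_entities, use_llm, use_gliner):
    return bool(seed_entities) or use_llm or use_gliner
```

Calling this at the top of `extract_kg_from_chunk` (or before invoking it from the enrichment loop) would avoid the per-chunk `SELECT` in the no-op configuration.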
```python
    ext_entity.entity_type,
    existing["name"],
    confidence=new_confidence,
)
```
Confidence update inadvertently resets importance and metadata
Medium Severity
The upsert_entity call intends to update only confidence, but because upsert_entity's ON CONFLICT clause unconditionally overwrites metadata, description, and importance, calling it without those arguments resets importance to the default 0.5, clears metadata to {}, and sets description to NULL. Any previously stored values for those fields on an existing entity are silently lost every time process_extraction_result processes it.
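One way to avoid this, sketched against an assumed `kg_entities` schema (the real columns and defaults may differ): use `COALESCE` in the `ON CONFLICT` clause so fields the caller did not supply keep their stored values instead of being reset:

```python
# Hypothetical sketch of a field-preserving upsert. The kg_entities schema here
# is assumed from the PR description, not taken from the actual migration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE kg_entities (
    name TEXT PRIMARY KEY, entity_type TEXT,
    confidence REAL, importance REAL DEFAULT 0.5, description TEXT)""")

UPSERT = """
INSERT INTO kg_entities (name, entity_type, confidence, importance, description)
VALUES (?, ?, ?, COALESCE(?, 0.5), ?)
ON CONFLICT(name) DO UPDATE SET
    confidence  = excluded.confidence,
    importance  = COALESCE(?, kg_entities.importance),
    description = COALESCE(?, kg_entities.description)
"""

def upsert_entity(name, entity_type, confidence, importance=None, description=None):
    # NULL (None) arguments fall through COALESCE and keep the stored values
    conn.execute(UPSERT, (name, entity_type, confidence, importance, description,
                          importance, description))

upsert_entity("Etan", "person", 0.9, importance=0.8, description="a person")
upsert_entity("Etan", "person", 0.95)  # confidence-only update
row = conn.execute(
    "SELECT confidence, importance, description FROM kg_entities WHERE name = 'Etan'"
).fetchone()
```

Here the confidence-only update bumps confidence to 0.95 while `importance` and `description` survive untouched.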


Summary

- `pipeline/kg_extraction.py` module wiring entity extraction → KG standard tables
- `process_extraction_result`: converts ExtractedEntity/Relation → `kg_entities` + `kg_relations` + `kg_entity_chunks`
- `extract_kg_from_chunk`: full extraction flow (seed/LLM/GLiNER → KG)
- Wired into `_enrich_one` as non-critical post-enrichment step (seed-only for now)

Test plan
🤖 Generated with Claude Code
Note
Medium Risk
Adds new post-enrichment writes into KG tables (entities, relations, chunk links) and runs automatically for each enriched chunk, which could impact runtime and KG data quality despite being best-effort and wrapped in a non-fatal try/except.
Overview

Adds a new KG extraction step that runs after chunk enrichment: extracted entities are resolved/upserted into `kg_entities`, linked to their source chunks via `kg_entity_chunks` with `mention_type` (explicit vs inferred), and extracted relations are persisted to `kg_relations` with `source_chunk_id` provenance.

Wires this into `_enrich_one` as a non-critical post-enrichment hook (currently seed-only with LLM/GLiNER disabled) and adds a comprehensive test suite covering entity creation/dedup, relation creation, chunk linking, and confidence propagation.

Written by Cursor Bugbot for commit ad4d328.