
fix: persist digest LLM entities #290

Merged
EtanHey merged 1 commit into main from fix/kg-entity-persistence
May 17, 2026
@EtanHey EtanHey commented May 17, 2026

Summary

  • Re-enable LLM entity extraction in the brain_digest batch extraction wrapper.
  • Add a regression test that digests PEOPLE-ROLES-style content and verifies entity_lookup can find a newly extracted person with evidence.

Root Cause

src/brainlayer/pipeline/batch_extraction.py:87 passed use_llm=llm_caller is not None into extract_entities_combined(). The normal MCP/CLI brain_digest path does not pass an explicit test llm_caller, so default Gemini extraction was silently disabled and non-seed people were never materialized into kg_entities / kg_entity_chunks.
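
The failure mode described above can be reproduced with a minimal sketch. The functions below are simplified stand-ins, not the actual brainlayer implementations: this `extract_entities_combined` only models the `use_llm` gate, `default_llm` is a hypothetical placeholder for the default Gemini caller, and the person name is made up.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the default Gemini-backed extractor.
def default_llm(text: str) -> list[str]:
    return ["Dana Levi"]  # pretend the LLM found one non-seed person

def extract_entities_combined(
    text: str,
    seed_entities: list[str],
    llm_caller: Optional[Callable[[str], list[str]]] = None,
    use_llm: bool = True,
) -> list[str]:
    """Simplified model: seed matches plus optional LLM results."""
    found = [name for name in seed_entities if name in text]
    if use_llm:
        caller = llm_caller or default_llm  # fall back to the default caller
        found.extend(n for n in caller(text) if n not in found)
    return found

def process_chunk_buggy(text, seed_entities, llm_caller=None):
    # Bug: the LLM is enabled only when a caller is injected (i.e. in tests).
    return extract_entities_combined(
        text, seed_entities, llm_caller=llm_caller, use_llm=llm_caller is not None
    )

def process_chunk_fixed(text, seed_entities, llm_caller=None):
    # Fix: always enable LLM extraction; the default caller is used if none is given.
    return extract_entities_combined(
        text, seed_entities, llm_caller=llm_caller, use_llm=True
    )

text = "Dana Levi works with Etan on the pipeline."
buggy = process_chunk_buggy(text, ["Etan"])   # only the seed entity survives
fixed = process_chunk_fixed(text, ["Etan"])   # seed entity plus the LLM-extracted person
```

In this model the buggy wrapper never reaches the default caller on the no-argument path, which mirrors why non-seed people were missing from the knowledge graph.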

Test Plan

  • RED first: pytest -q tests/test_phase3_digest.py::test_digest_content_persists_llm_people_entities_for_lookup failed with entities_found == 0 before the fix.
  • pytest -q tests/test_phase3_digest.py::test_digest_content_persists_llm_people_entities_for_lookup
  • GOOGLE_API_KEY= GEMINI_API_KEY= pytest -q tests/test_phase3_digest.py tests/test_digest_pipeline_v2.py tests/test_mcp_digest_modes.py tests/test_kg_extraction.py tests/test_kg_rebuild.py tests/test_kg_relations.py tests/test_entity_extraction.py tests/test_entity_contracts.py tests/test_daemon_kg -> 185 passed, 6 skipped.
  • Pre-push test gate passed: 1980 passed, 9 skipped, 75 deselected, 1 xfailed; MCP registration 3 passed; isolated eval/hook routing 32 passed; bun 1 passed; FTS5 determinism shell passed.

Notes

A full pytest -q run under system Python, outside the repo venv, failed during collection due to unrelated local dependency state: deepchecks is missing, and numba (pulled in through ranx) rejects NumPy 2.4. The repo pre-push gate uses .venv and passed.


Note

Medium Risk
Changes digest/batch extraction behavior to always run LLM-based entity extraction, which can increase external LLM calls/cost and introduce new failure modes if credentials/rate limits are misconfigured.
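
One way to contain the failure modes mentioned above is to degrade to seed-only extraction when no key is configured or the LLM call raises. The sketch below is an illustration under assumptions, not how brainlayer handles it: `extract_with_fallback` and its status strings are hypothetical, and only the environment variable names `GEMINI_API_KEY` and `GOOGLE_API_KEY` come from the test plan.

```python
import os

def extract_with_fallback(text, seed_entities, llm_caller, env=None):
    """Return (entities, status); never let an LLM problem abort the digest."""
    env = os.environ if env is None else env
    found = [name for name in seed_entities if name in text]

    # No credentials: skip the external call instead of failing.
    if not (env.get("GEMINI_API_KEY") or env.get("GOOGLE_API_KEY")):
        return found, "skipped: no API key"

    try:
        llm_names = llm_caller(text)
    except Exception as exc:  # rate limits, network errors, bad credentials
        return found, f"failed: {exc}"

    found.extend(n for n in llm_names if n not in found)
    return found, "ok"

def rate_limited(text):
    # Hypothetical caller that simulates a provider-side rejection.
    raise RuntimeError("429 rate limited")

no_key = extract_with_fallback("Etan wrote this.", ["Etan"], rate_limited, env={})
limited = extract_with_fallback(
    "Etan wrote this.", ["Etan"], rate_limited, env={"GEMINI_API_KEY": "x"}
)
```

Returning a status alongside the entities keeps the skipped or failed case visible to callers rather than silent.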

Overview
Fixes brain_digest entity persistence by forcing process_chunk to call extract_entities_combined(..., use_llm=True) even when no explicit llm_caller is passed.

Adds a regression test that stubs Gemini extraction, digests PEOPLE-ROLES content, and asserts newly LLM-extracted person entities are stored with evidence and retrievable via entity_lookup.

Reviewed by Cursor Bugbot for commit 3b61cb1.

Note

Fix process_chunk to persist LLM-extracted entities during digest

Sets use_llm=True unconditionally in process_chunk when calling extract_entities_combined, fixing a bug where LLM-extracted entities were not persisted during digest. Previously, use_llm was True only when an explicit llm_caller was passed; the normal MCP/CLI brain_digest path passes none, so LLM extraction was silently skipped. A new test verifies that digest_content persists person entities from LLM extraction so they are retrievable via entity_lookup.

Macroscope summarized 3b61cb1.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Digest entity extraction now always runs the LLM extraction step, regardless of whether a custom LLM caller is supplied. Extracted entities are persisted to the knowledge graph and remain retrievable through entity lookup.


coderabbitai Bot commented May 17, 2026

Walkthrough

This PR decouples LLM entity extraction from the presence of a custom LLM caller by forcing process_chunk to always enable LLM usage, and validates the end-to-end behavior through an integration test that confirms LLM-extracted entities persist and are retrievable.

Changes

LLM Entity Extraction and Persistence

| Layer / File(s) | Summary |
| --- | --- |
| Force LLM usage in entity extraction (src/brainlayer/pipeline/batch_extraction.py) | process_chunk now unconditionally passes use_llm=True to extract_entities_combined, removing the prior condition that tied LLM execution to custom LLM caller presence. |
| Entity persistence and lookup validation (tests/test_phase3_digest.py) | New test monkeypatches Gemini extraction to return three predetermined person entities, runs digest_content to persist them in the knowledge graph, and validates entity_lookup retrieves a specific extracted person by name with evidence. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • EtanHey/brainlayer#188: Aligns with a prior refactor that decouples LLM execution from explicit caller provision, achieving consistent LLM usage control flow.
  • EtanHey/brainlayer#32: The forced LLM extraction behavior directly impacts the entity persistence pathway through digest_content and entity_lookup that this PR tests.

Poem

🐰 LLM entities now flow free,
No caller needed, extraction decree—
Through digest they hop and persist with care,
Lookup finds them everywhere! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix: persist digest LLM entities' directly summarizes the main change: re-enabling LLM entity extraction in the digest pipeline by fixing how use_llm is passed. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00%, which meets the required threshold of 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |


@EtanHey EtanHey force-pushed the fix/kg-entity-persistence branch from 9a0cbc4 to 3b61cb1 Compare May 17, 2026 20:33

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3b61cb1bc7


  seed_entities,
  llm_caller=llm_caller,
- use_llm=llm_caller is not None,
+ use_llm=True,

P1: Sanitize digest text before LLM extraction

When a Gemini key is configured, digest_content() reaches this path without an explicit llm_caller, so extract_entities_llm() falls back to call_gemini_for_extraction() and sends the raw chunk text to an external API. The existing digest Gemini enrichment path immediately below uses Sanitizer.from_env()/build_external_prompt() before calling Gemini, and pipeline/sanitize.py documents that PII is stripped before external LLM APIs; this change bypasses that guard for CLI/MCP brain_digest inputs containing names, emails, paths, or secrets. Please either keep default extraction local/opt-in or build the extraction prompt from sanitized content as well.
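
The second option, building the extraction prompt from sanitized content, could look roughly like the sketch below. It does not use the repo's `Sanitizer` or `build_external_prompt`, whose signatures are not shown in this thread; `redact` and `build_extraction_prompt` are hypothetical helpers, and the regular expressions are illustrative patterns rather than a complete PII filter.

```python
import re

# Illustrative patterns only; a real sanitizer would cover far more cases.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HOME_PATH_RE = re.compile(r"/(?:Users|home)/[^/\s]+")
SECRET_RE = re.compile(r"\b(?:sk|ghp|AIza)[A-Za-z0-9_-]{16,}\b")

def redact(text: str) -> str:
    """Mask emails, home-directory paths, and key-like tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = HOME_PATH_RE.sub("/[HOME]", text)
    text = SECRET_RE.sub("[SECRET]", text)
    return text

def build_extraction_prompt(chunk_text: str) -> str:
    # Sanitize first, so only redacted text can reach the external API.
    return "Extract person entities from the text below.\n\n" + redact(chunk_text)

raw = "Contact dana@example.com, notes in /Users/dana/notes.md, key sk_abcdefghijklmnop1234"
prompt = build_extraction_prompt(raw)
```

Redacting inside the prompt builder means no caller can send raw chunk text by accident, which is the property the existing enrichment path is described as having.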


@EtanHey EtanHey merged commit 6fd85eb into main May 17, 2026
7 checks passed
@EtanHey EtanHey deleted the fix/kg-entity-persistence branch May 17, 2026 21:15
EtanHey added a commit that referenced this pull request May 17, 2026
Extends the Recent Hardening window from 2026-05-02 to 2026-05-17 and adds a
"Phase 5 ship wave" subsection covering:

- PR #289 — reject MCP-unavailable diagnostics + PreCompact checkpoint noise at
  the watcher / drain / store ingest heads; demote (not remove) any chunk with
  precompact/quarantine signals in hybrid rerank so explicit include_checkpoints
  callers still see them.
- PR #290 — fix KG persistence regression in process_chunk where
  use_llm=llm_caller is not None silently disabled Gemini entity extraction on
  the MCP/CLI digest path. Non-seed person entities were never materialized into
  kg_entities. Second recurrence of the same 2026-04-06 root cause; RED-first
  regression test guards it.
- Enrichment LaunchAgent recovered after 2026-05-15 11:50 IDT unload;
  com.brainlayer.enrichment verified live (launchctl PID present) draining the
  56K-chunk backfill against the Gemini flex tier.

Every claim cites the merged PR by number.