fix: persist digest LLM entities#290
Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
📝 WalkthroughWalkthroughThis PR decouples LLM entity extraction from the presence of a custom LLM caller by forcing ChangesLLM Entity Extraction and Persistence
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
9a0cbc4 to
3b61cb1
Compare
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3b61cb1bc7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| seed_entities, | ||
| llm_caller=llm_caller, | ||
| use_llm=llm_caller is not None, | ||
| use_llm=True, |
There was a problem hiding this comment.
Sanitize digest text before LLM extraction
When a Gemini key is configured, digest_content() reaches this path without an explicit llm_caller, so extract_entities_llm() falls back to call_gemini_for_extraction() and sends the raw chunk text to an external API. The existing digest Gemini enrichment path immediately below uses Sanitizer.from_env()/build_external_prompt() before calling Gemini, and pipeline/sanitize.py documents that PII is stripped before external LLM APIs; this change bypasses that guard for CLI/MCP brain_digest inputs containing names, emails, paths, or secrets. Please either keep default extraction local/opt-in or build the extraction prompt from sanitized content as well.
Useful? React with 👍 / 👎.
Extends the Recent Hardening window from 2026-05-02 to 2026-05-17 and adds a "Phase 5 ship wave" subsection covering: - PR #289 — reject MCP-unavailable diagnostics + PreCompact checkpoint noise at the watcher / drain / store ingest heads; demote (not remove) any chunk with precompact/quarantine signals in hybrid rerank so explicit include_checkpoints callers still see them. - PR #290 — fix KG persistence regression in process_chunk where use_llm=llm_caller is not None silently disabled Gemini entity extraction on the MCP/CLI digest path. Non-seed person entities were never materialized into kg_entities. Second recurrence of the same 2026-04-06 root cause; RED-first regression test guards it. - Enrichment LaunchAgent recovered after 2026-05-15 11:50 IDT unload; com.brainlayer.enrichment verified live (launchctl PID present) draining the 56K-chunk backfill against the Gemini flex tier. Every claim cites the merged PR by number.
Summary
brain_digestbatch extraction wrapper.entity_lookupcan find a newly extracted person with evidence.Root Cause
src/brainlayer/pipeline/batch_extraction.py:87passeduse_llm=llm_caller is not Noneintoextract_entities_combined(). The normal MCP/CLIbrain_digestpath does not pass an explicit testllm_caller, so default Gemini extraction was silently disabled and non-seed people were never materialized intokg_entities/kg_entity_chunks.Test Plan
pytest -q tests/test_phase3_digest.py::test_digest_content_persists_llm_people_entities_for_lookupfailed withentities_found == 0before the fix.pytest -q tests/test_phase3_digest.py::test_digest_content_persists_llm_people_entities_for_lookupGOOGLE_API_KEY= GEMINI_API_KEY= pytest -q tests/test_phase3_digest.py tests/test_digest_pipeline_v2.py tests/test_mcp_digest_modes.py tests/test_kg_extraction.py tests/test_kg_rebuild.py tests/test_kg_relations.py tests/test_entity_extraction.py tests/test_entity_contracts.py tests/test_daemon_kg-> 185 passed, 6 skipped.Notes
A system-Python full
pytest -qoutside the repo venv failed during collection due unrelated local dependency state: missingdeepchecks, plusnumbarejecting NumPy 2.4 throughranx. The repo pre-push gate uses.venvand passed.Note
Medium Risk
Changes digest/batch extraction behavior to always run LLM-based entity extraction, which can increase external LLM calls/cost and introduce new failure modes if credentials/rate limits are misconfigured.
Overview
Fixes
brain_digestentity persistence by forcingprocess_chunkto callextract_entities_combined(..., use_llm=True)even when no explicitllm_calleris passed.Adds a regression test that stubs Gemini extraction, digests PEOPLE-ROLES content, and asserts newly LLM-extracted
personentities are stored with evidence and retrievable viaentity_lookup.Reviewed by Cursor Bugbot for commit 3b61cb1. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Fix
process_chunkto persist LLM-extracted entities during digestSets
use_llm=Trueunconditionally inprocess_chunkwhen callingextract_entities_combined, fixing a bug where LLM-extracted entities were not persisted during digest. Previously,use_llmwas only set whenllm_callerwas notNone, but the condition did not work as intended. A new test verifies thatdigest_contentpersistspersonentities from LLM extraction so they are retrievable viaentity_lookup.Macroscope summarized 3b61cb1.
Summary by CodeRabbit
Release Notes