feat(ingest): reject diagnostic + PreCompact noise + reranker demote#289
Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (9)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e2da6ab444
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "precompact", | ||
| "precompactcheckpoint", | ||
| } | ||
| ) |
There was a problem hiding this comment.
Recognize the quarantine tag emitted by the script
When scripts/quarantine_noise.py applies a row it appends NOISE_TAG = "quarantined/noise", but _is_noise_tag removes non-alphanumerics before comparing tags, so that tag becomes quarantinednoise and is not in this marker set. In include_archived searches, or for any row tagged the same way before/without archiving, the new rerank demotion will not treat the quarantined chunk as noise and it can rank normally.
Useful? React with 👍 / 👎.
| where_parts = [ | ||
| "content IS NOT NULL", | ||
| "(LOWER(content) LIKE :infra_phrase)", | ||
| ] |
There was a problem hiding this comment.
Avoid fetching every non-empty chunk as a candidate
Because _fetch_candidates later joins every where_parts entry with OR, this content IS NOT NULL arm makes the SQL match almost the entire chunks table; _candidate_map then filters in Python and --limit is applied only after all rows are loaded. On the canonical large DB, even a dry-run with a small limit can scan and materialize every ordinary chunk, so this should be an outer AND condition instead of one of the candidate predicates.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit e2da6ab. Configure here.
| FROM chunks | ||
| WHERE ( | ||
| {" OR ".join(where_parts)} | ||
| ) |
There was a problem hiding this comment.
SQL OR with content IS NOT NULL fetches all rows
High Severity
"content IS NOT NULL" is included in where_parts and joined with " OR ", producing WHERE (content IS NOT NULL OR ...). Since nearly every row has non-null content, this condition short-circuits all the actual noise-detection clauses and the query returns the entire chunks table. The intent is clearly for content IS NOT NULL to be an AND prerequisite, with the noise-matching conditions OR'd together. On a large database this will load all rows into memory, causing severe slowness or OOM, and the subsequent Python-level filtering in _candidate_map masks the problem for small test databases.
Reviewed by Cursor Bugbot for commit e2da6ab. Configure here.
| "precompactcheckpoint", | ||
| } | ||
| ) | ||
| _QUARANTINE_TAG_MARKERS = frozenset({"quarantine", "quarantined"}) |
There was a problem hiding this comment.
Reranker fails to detect quarantined/noise tag from script
Low Severity
The quarantine script applies the tag "quarantined/noise" (NOISE_TAG), but _is_noise_tag normalizes by stripping all non-alphanumeric chars, producing "quarantinednoise". This doesn't match any entry in _QUARANTINE_TAG_MARKERS ({"quarantine", "quarantined"}), so chunks tagged by the script aren't demoted by the reranker. Impact is mitigated because the script also archives chunks, but the reranker defense layer is silently ineffective for these tags.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit e2da6ab. Configure here.
|
|
||
| normalized_content = content.casefold() if content else "" | ||
| if any(pattern in normalized_content for pattern in _NOISE_CONTENT_PATTERNS): | ||
| has_precompact_signal = True |
There was a problem hiding this comment.
Substring match on generic patterns causes false demotion
Medium Severity
_NOISE_CONTENT_PATTERNS (e.g. "session-restore") is checked via substring in against the full content, but the equivalent check in is_precompact_checkpoint_content uses startswith. Any chunk merely mentioning "session-restore" anywhere in its text gets has_precompact_signal = True and a 95% score demotion (_NOISE_RERANK_DEMOTION = 0.05), effectively burying legitimate results that reference session-restore as a concept rather than being checkpoint noise.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit e2da6ab. Configure here.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a208b46cd0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| from brainlayer.chunk_origin import is_precompact_checkpoint_content | ||
| from brainlayer.paths import get_db_path |
There was a problem hiding this comment.
Add src to sys.path before script imports
When this new script is run exactly as documented from a checkout (python3 scripts/quarantine_noise.py, including --help), Python only puts scripts/ on sys.path, so the from brainlayer... imports fail with ModuleNotFoundError unless the package was separately installed. Other repo scripts that are intended to run this way insert ../src first; without the same bootstrap this quarantine tool is unusable in the common local/ops workflow.
Useful? React with 👍 / 👎.
Extends the Recent Hardening window from 2026-05-02 to 2026-05-17 and adds a "Phase 5 ship wave" subsection covering: - PR #289 — reject MCP-unavailable diagnostics + PreCompact checkpoint noise at the watcher / drain / store ingest heads; demote (not remove) any chunk with precompact/quarantine signals in hybrid rerank so explicit include_checkpoints callers still see them. - PR #290 — fix KG persistence regression in process_chunk where use_llm=llm_caller is not None silently disabled Gemini entity extraction on the MCP/CLI digest path. Non-seed person entities were never materialized into kg_entities. Second recurrence of the same 2026-04-06 root cause; RED-first regression test guards it. - Enrichment LaunchAgent recovered after 2026-05-15 11:50 IDT unload; com.brainlayer.enrichment verified live (launchctl PID present) draining the 56K-chunk backfill against the Gemini flex tier. Every claim cites the merged PR by number.


Summary
Safety
scripts/quarantine_noise.pydefaults to read-only dry-run.Test Plan
ruff check src/brainlayer/ingest_guard.py src/brainlayer/drain.py src/brainlayer/watcher_bridge.py src/brainlayer/search_repo.py scripts/quarantine_noise.py tests/test_ingest_guard.py tests/test_hybrid_search.py tests/test_quarantine_noise.py tests/test_precompact_chunk_origin.pypytest -q tests/test_ingest_guard.py tests/test_quarantine_noise.py tests/test_hybrid_search.py tests/test_precompact_chunk_origin.py tests/test_hybrid_search_decay.py tests/test_audit_search_quality.py tests/test_search_exact_chunk_id.py1995 passed, 9 skipped, 75 deselected, 1 xfailed; MCP registration 3 passed; isolated eval/hook routing 32 passed; Bun suite 1 passed; regression shelltest_fts5_determinism.shpassed.Note
Medium Risk
Changes ingestion filtering across watcher/drain/store paths and adjusts hybrid search reranking, which can affect what content is stored and what results surface. Also introduces an optional DB-mutating quarantine script (guarded by
--apply) that could archive data if misused.Overview
Prevents two new classes of “noise” from becoming durable memory by extending
recursive_mcp_output_reasonto detect BrainLayer MCP unavailability diagnostics and (optionally) PreCompact checkpoint content, and updating watcher/drain ingestion paths to reject them.Adds a new
scripts/quarantine_noise.pytool to scan for existing infra/PreCompact candidates and, when explicitly run with--apply, archive them and append aquarantined/noisetag.Updates
hybrid_searchreranking to demote (not remove) results with precompact/quarantine signals (via tags/metadata/content), while preservinginclude_checkpointsbehavior; tests are expanded/adjusted to cover the new guards and demotion logic.Reviewed by Cursor Bugbot for commit a208b46. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Reject precompact checkpoint and F-infra diagnostic noise during ingest and demote in reranker
recursive_mcp_output_reasonin ingest_guard.py to detect F-infra MCP unavailable diagnostics and precompact checkpoint content, causing them to be rejected at ingest time across drain, watcher bridge, and store handlers.SearchMixin.hybrid_searchin search_repo.py for chunks with precompact or quarantine signals in metadata/tags, keeping them discoverable but deprioritized.quarantined/noisetag.chunk_origin=precompact_checkpointare now rejected at ingest; existing tests updated to reflectchunk_origin=Nonefor these rows.Macroscope summarized a208b46.