
feat(ingest): reject diagnostic + PreCompact noise + reranker demote#289

Merged: EtanHey merged 2 commits into main from feat/noise-filter on May 17, 2026

@EtanHey EtanHey commented May 17, 2026

Summary

  • Reject BrainLayer MCP unavailable diagnostics at ingest head so boot/tooling failures do not become durable memory.
  • Reject PreCompact checkpoint noise from watcher/drain ingestion while preserving explicit checkpoint APIs used by brain_resume.
  • Add a dry-run-first quarantine script for existing F-infra and PreCompact candidates; live DB mutation requires explicit --apply and was not run.
  • Demote pre-compact/quarantined chunks in hybrid reranking for default search, while preserving explicit include_checkpoints behavior.

Safety

  • scripts/quarantine_noise.py defaults to read-only dry-run.
  • No live DB quarantine was applied.
  • Existing .gemini/ and .worktrees/ untracked files were left untouched.
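The dry-run-first contract described above can be sketched roughly as follows. This is a minimal illustration and not the actual scripts/quarantine_noise.py: only the --apply flag comes from the PR description, while the table layout, the archived column, and the helper names build_parser and run are assumptions made for the example.

```python
import argparse
import sqlite3


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Quarantine noise chunks (sketch).")
    # Mutation is opt-in: without --apply the tool only reports candidates.
    parser.add_argument("--apply", action="store_true", help="actually modify the DB")
    return parser


def run(conn: sqlite3.Connection, candidate_ids: list[str], apply: bool) -> int:
    """Return the number of rows that were (or would be) quarantined."""
    if not apply:
        return len(candidate_ids)  # dry-run: count only, no writes
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "UPDATE chunks SET archived = 1 WHERE id = ?",
            [(cid,) for cid in candidate_ids],
        )
    return len(candidate_ids)


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY, archived INTEGER DEFAULT 0)")
conn.execute("INSERT INTO chunks (id) VALUES ('c1')")
conn.commit()

args = build_parser().parse_args([])  # no flags: dry-run
run(conn, ["c1"], apply=args.apply)
archived_after_dry_run = conn.execute("SELECT archived FROM chunks").fetchone()[0]

args = build_parser().parse_args(["--apply"])  # explicit opt-in
run(conn, ["c1"], apply=args.apply)
archived_after_apply = conn.execute("SELECT archived FROM chunks").fetchone()[0]
```

Making the destructive path the non-default one means a mistyped or incomplete invocation can at worst produce a report.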

Test Plan

  • ruff check src/brainlayer/ingest_guard.py src/brainlayer/drain.py src/brainlayer/watcher_bridge.py src/brainlayer/search_repo.py scripts/quarantine_noise.py tests/test_ingest_guard.py tests/test_hybrid_search.py tests/test_quarantine_noise.py tests/test_precompact_chunk_origin.py
  • pytest -q tests/test_ingest_guard.py tests/test_quarantine_noise.py tests/test_hybrid_search.py tests/test_precompact_chunk_origin.py tests/test_hybrid_search_decay.py tests/test_audit_search_quality.py tests/test_search_exact_chunk_id.py
  • Pre-push gate: 1995 passed, 9 skipped, 75 deselected, 1 xfailed; MCP registration 3 passed; isolated eval/hook routing 32 passed; Bun suite 1 passed; regression shell test_fts5_determinism.sh passed.

Note

Medium Risk
Changes ingestion filtering across watcher/drain/store paths and adjusts hybrid search reranking, which can affect what content is stored and what results surface. Also introduces an optional DB-mutating quarantine script (guarded by --apply) that could archive data if misused.

Overview
Prevents two new classes of “noise” from becoming durable memory by extending recursive_mcp_output_reason to detect BrainLayer MCP unavailability diagnostics and (optionally) PreCompact checkpoint content, and updating watcher/drain ingestion paths to reject them.
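As a rough sketch of what a reject-at-ingest guard of this kind might look like: the function names noise_reason and ingest, the marker strings, and the reason labels below are all hypothetical, and the real recursive_mcp_output_reason in ingest_guard.py may be structured quite differently.

```python
from typing import Optional

# Hypothetical marker strings; the real patterns live in ingest_guard.py.
INFRA_PHRASES = ("brainlayer mcp unavailable",)
PRECOMPACT_PREFIXES = ("session-restore",)


def noise_reason(content: str, reject_precompact: bool = True) -> Optional[str]:
    """Return a rejection reason, or None if the content may be ingested."""
    text = content.casefold().lstrip()
    if any(phrase in text for phrase in INFRA_PHRASES):
        return "mcp_unavailable_diagnostic"
    # Prefix match only: a chunk that merely mentions the marker is kept.
    if reject_precompact and text.startswith(PRECOMPACT_PREFIXES):
        return "precompact_checkpoint"
    return None


def ingest(chunks: list[str]) -> list[str]:
    """Keep only chunks the guard does not reject."""
    return [c for c in chunks if noise_reason(c) is None]


kept = ingest([
    "BrainLayer MCP unavailable: connection refused",
    "session-restore checkpoint for session 123",
    "Decided to use FTS5 for keyword search",
])
```

The reject_precompact switch mirrors the "(optionally)" in the summary: explicit checkpoint APIs could call the guard with it turned off.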

Adds a new scripts/quarantine_noise.py tool to scan for existing infra/PreCompact candidates and, when explicitly run with --apply, archive them and append a quarantined/noise tag.

Updates hybrid_search reranking to demote (not remove) results with precompact/quarantine signals (via tags/metadata/content), while preserving include_checkpoints behavior; tests are expanded/adjusted to cover the new guards and demotion logic.
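The difference between demoting and removing can be illustrated with a small sketch. The Hit dataclass and rerank function are hypothetical stand-ins for the real SearchMixin.hybrid_search internals; the 0.05 multiplier is the _NOISE_RERANK_DEMOTION value quoted in the review thread.

```python
from dataclasses import dataclass, field

NOISE_RERANK_DEMOTION = 0.05  # multiplier quoted in the review thread


@dataclass
class Hit:
    chunk_id: str
    score: float
    tags: list[str] = field(default_factory=list)


def rerank(hits: list[Hit], noise_tags: frozenset[str]) -> list[Hit]:
    """Scale noisy hits down instead of dropping them, then sort by score."""
    rescored = []
    for hit in hits:
        is_noise = any(tag in noise_tags for tag in hit.tags)
        factor = NOISE_RERANK_DEMOTION if is_noise else 1.0
        rescored.append(Hit(hit.chunk_id, hit.score * factor, hit.tags))
    return sorted(rescored, key=lambda h: h.score, reverse=True)


hits = [Hit("checkpoint", 0.9, ["precompact"]), Hit("design-note", 0.4)]
ranked = rerank(hits, frozenset({"precompact", "quarantined"}))
order = [h.chunk_id for h in ranked]
```

Both hits survive the rerank; the checkpoint simply drops below the ordinary note, so a caller that explicitly wants checkpoints can still find it.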

Reviewed by Cursor Bugbot for commit a208b46.

Note

Reject precompact checkpoint and F-infra diagnostic noise during ingest and demote in reranker

  • Extends recursive_mcp_output_reason in ingest_guard.py to detect F-infra MCP unavailable diagnostics and precompact checkpoint content, causing them to be rejected at ingest time across drain, watcher bridge, and store handlers.
  • Adds score demotion in SearchMixin.hybrid_search in search_repo.py for chunks with precompact or quarantine signals in metadata/tags, keeping them discoverable but deprioritized.
  • Adds scripts/quarantine_noise.py, a CLI tool to dry-run or apply quarantine on existing noise chunks in the BrainLayer SQLite DB, updating archived fields and appending a quarantined/noise tag.
  • Behavioral Change: chunks previously stored with chunk_origin=precompact_checkpoint are now rejected at ingest; existing tests updated to reflect chunk_origin=None for these rows.

Macroscope summarized a208b46.




@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2da6ab444


"precompact",
"precompactcheckpoint",
}
)

P2: Recognize the quarantine tag emitted by the script

When scripts/quarantine_noise.py applies a row it appends NOISE_TAG = "quarantined/noise", but _is_noise_tag removes non-alphanumerics before comparing tags, so that tag becomes quarantinednoise and is not in this marker set. In include_archived searches, or for any row tagged the same way before/without archiving, the new rerank demotion will not treat the quarantined chunk as noise and it can rank normally.
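The mismatch is easy to reproduce in isolation. In the sketch below, the marker set and NOISE_TAG value are taken from the diff and the comment; normalize is an assumed stand-in for the normalization inside _is_noise_tag, and is_noise_tag_by_segment is just one possible fix, not necessarily the one the author adopted.

```python
import re

# Marker set and tag value as quoted in the review; the helpers are assumptions.
QUARANTINE_TAG_MARKERS = frozenset({"quarantine", "quarantined"})
NOISE_TAG = "quarantined/noise"


def normalize(tag: str) -> str:
    """Lowercase and strip every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", tag.casefold())


def is_noise_tag_exact(tag: str) -> bool:
    # Whole-tag comparison: "quarantinednoise" is not in the marker set.
    return normalize(tag) in QUARANTINE_TAG_MARKERS


def is_noise_tag_by_segment(tag: str) -> bool:
    # One possible fix: compare each separator-delimited segment instead.
    segments = re.split(r"[^a-z0-9]+", tag.casefold())
    return any(seg in QUARANTINE_TAG_MARKERS for seg in segments)


missed = is_noise_tag_exact(NOISE_TAG)
caught = is_noise_tag_by_segment(NOISE_TAG)
```

An alternative with the same effect would be to add the normalized form of the script's tag to the marker set directly.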


Comment on lines +167 to +170
where_parts = [
"content IS NOT NULL",
"(LOWER(content) LIKE :infra_phrase)",
]

P2: Avoid fetching every non-empty chunk as a candidate

Because _fetch_candidates later joins every where_parts entry with OR, this content IS NOT NULL arm makes the SQL match almost the entire chunks table; _candidate_map then filters in Python and --limit is applied only after all rows are loaded. On the canonical large DB, even a dry-run with a small limit can scan and materialize every ordinary chunk, so this should be an outer AND condition instead of one of the candidate predicates.
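The effect of placing the NULL guard among the OR arms can be seen on a tiny in-memory table. This is a sketch against an assumed two-column chunks schema with made-up rows; only the two predicate strings are taken from the diff.

```python
import sqlite3

# Hypothetical minimal schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO chunks (content) VALUES (?)",
    [("BrainLayer MCP unavailable",), ("ordinary note",), ("another note",), (None,)],
)

params = {"infra_phrase": "%mcp unavailable%"}
noise_predicates = ["(LOWER(content) LIKE :infra_phrase)"]

# Buggy shape: the NULL guard is one of the OR arms, so any non-null row matches.
buggy_where = " OR ".join(["content IS NOT NULL", *noise_predicates])
buggy = conn.execute(f"SELECT id FROM chunks WHERE ({buggy_where})", params).fetchall()

# Fixed shape: the NULL guard is an outer AND; only the noise predicates are OR'd.
fixed_where = f"content IS NOT NULL AND ({' OR '.join(noise_predicates)})"
fixed = conn.execute(f"SELECT id FROM chunks WHERE {fixed_where}", params).fetchall()
```

With the guard as an outer AND, the database can discard ordinary rows itself instead of handing them to Python for filtering.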


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.


Reviewed by Cursor Bugbot for commit e2da6ab.

FROM chunks
WHERE (
{" OR ".join(where_parts)}
)

SQL OR with content IS NOT NULL fetches all rows

High Severity

"content IS NOT NULL" is included in where_parts and joined with " OR ", producing WHERE (content IS NOT NULL OR ...). Since nearly every row has non-null content, this condition short-circuits all the actual noise-detection clauses and the query returns the entire chunks table. The intent is clearly for content IS NOT NULL to be an AND prerequisite, with the noise-matching conditions OR'd together. On a large database this will load all rows into memory, causing severe slowness or OOM, and the subsequent Python-level filtering in _candidate_map masks the problem for small test databases.


"precompactcheckpoint",
}
)
_QUARANTINE_TAG_MARKERS = frozenset({"quarantine", "quarantined"})

Reranker fails to detect quarantined/noise tag from script

Low Severity

The quarantine script applies the tag "quarantined/noise" (NOISE_TAG), but _is_noise_tag normalizes by stripping all non-alphanumeric chars, producing "quarantinednoise". This doesn't match any entry in _QUARANTINE_TAG_MARKERS ({"quarantine", "quarantined"}), so chunks tagged by the script aren't demoted by the reranker. Impact is mitigated because the script also archives chunks, but the reranker defense layer is silently ineffective for these tags.



normalized_content = content.casefold() if content else ""
if any(pattern in normalized_content for pattern in _NOISE_CONTENT_PATTERNS):
    has_precompact_signal = True

Substring match on generic patterns causes false demotion

Medium Severity

_NOISE_CONTENT_PATTERNS (e.g. "session-restore") is checked with a substring test (the in operator) against the full content, but the equivalent check in is_precompact_checkpoint_content uses startswith. Any chunk merely mentioning "session-restore" anywhere in its text gets has_precompact_signal = True and a 95% score demotion (_NOISE_RERANK_DEMOTION = 0.05), effectively burying legitimate results that reference session-restore as a concept rather than being checkpoint noise.
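The two matching strategies diverge on exactly the case described here. The sketch below uses hypothetical helper names and sample strings; only the "session-restore" pattern comes from the comment, and the lstrip call is an added assumption about tolerating leading whitespace.

```python
PATTERN = "session-restore"  # example pattern named in the review comment


def matches_anywhere(content: str) -> bool:
    """Substring test: fires on any mention of the pattern."""
    return PATTERN in content.casefold()


def matches_prefix(content: str) -> bool:
    """Prefix test: fires only when the content begins with the pattern."""
    return content.casefold().lstrip().startswith(PATTERN)


checkpoint = "session-restore: resuming from compacted state"
discussion = "We should document how session-restore interacts with search."

# Each value is (substring result, prefix result).
results = {
    "checkpoint": (matches_anywhere(checkpoint), matches_prefix(checkpoint)),
    "discussion": (matches_anywhere(discussion), matches_prefix(discussion)),
}
```

Both tests flag the checkpoint-style string, but only the substring test flags the ordinary sentence, which is the false demotion the comment warns about.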




@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a208b46cd0


Comment on lines +21 to +22
from brainlayer.chunk_origin import is_precompact_checkpoint_content
from brainlayer.paths import get_db_path

P2: Add src to sys.path before script imports

When this new script is run exactly as documented from a checkout (python3 scripts/quarantine_noise.py, including --help), Python only puts scripts/ on sys.path, so the from brainlayer... imports fail with ModuleNotFoundError unless the package was separately installed. Other repo scripts that are intended to run this way insert ../src first; without the same bootstrap this quarantine tool is unusable in the common local/ops workflow.
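A bootstrap of the kind the comment refers to might look like the following. The helper name ensure_src_on_path and the /repo path are hypothetical, and the layout (scripts/ and src/ as siblings under the repository root) is inferred from the paths in this PR rather than confirmed.

```python
import sys
from pathlib import Path


def ensure_src_on_path(script_file: str) -> Path:
    """Prepend <repo>/src to sys.path, assuming the script lives in <repo>/scripts/."""
    src_dir = Path(script_file).resolve().parent.parent / "src"
    if str(src_dir) not in sys.path:
        sys.path.insert(0, str(src_dir))
    return src_dir


# Demonstration with a made-up path; a real script would pass __file__ and call
# this before any `from brainlayer...` import.
src_dir = ensure_src_on_path("/repo/scripts/quarantine_noise.py")
```

Installing the package in editable mode would also resolve the imports, at the cost of an extra setup step before the script can be run from a fresh checkout.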


@EtanHey EtanHey merged commit 127717b into main May 17, 2026
7 checks passed
@EtanHey EtanHey deleted the feat/noise-filter branch May 17, 2026 21:15
EtanHey added a commit that referenced this pull request May 17, 2026
Extends the Recent Hardening window from 2026-05-02 to 2026-05-17 and adds a
"Phase 5 ship wave" subsection covering:

- PR #289 — reject MCP-unavailable diagnostics + PreCompact checkpoint noise at
  the watcher / drain / store ingest heads; demote (not remove) any chunk with
  precompact/quarantine signals in hybrid rerank so explicit include_checkpoints
  callers still see them.
- PR #290 — fix KG persistence regression in process_chunk where
  use_llm=llm_caller is not None silently disabled Gemini entity extraction on
  the MCP/CLI digest path. Non-seed person entities were never materialized into
  kg_entities. Second recurrence of the same 2026-04-06 root cause; RED-first
  regression test guards it.
- Enrichment LaunchAgent recovered after 2026-05-15 11:50 IDT unload;
  com.brainlayer.enrichment verified live (launchctl PID present) draining the
  56K-chunk backfill against the Gemini flex tier.

Every claim cites the merged PR by number.