
feat: tiered enrichment selectivity — T0/T1/T2/T3 classifier#82

Merged
EtanHey merged 4 commits into `main` from `fix/tiered-enrichment` on Mar 14, 2026
Conversation

@EtanHey (Owner) commented Mar 14, 2026

Summary

  • Adds EnrichmentTier (IntEnum T0–T3) and classify_chunk_tier() in src/brainlayer/pipeline/enrichment_tiers.py
  • Implements Finding 1 from the enrichment efficiency audit: current HIGH_VALUE_TYPES covers 97.75% of all chunks, making it effectively full-stream enrichment
  • Source-aware tier logic: T0=manual/digest (immediate), T1=recent claude_code≤7d (hourly), T2=old claude_code>7d (lazy/remote batch), T3=youtube (explicit only)
  • Adds pythonpath = ["src"] to pytest config so worktree src takes precedence over editable install during development
  • 24 new TDD tests, all green; 834 total tests pass with 0 regressions

Tier design

| Tier | What | When |
| --- | --- | --- |
| T0 IMMEDIATE | `manual`, `digest` sources | always |
| T1 HOURLY | recent `claude_code` (≤7 days) | hourly, local |
| T2 LAZY | old `claude_code` (>7 days) | lazy / remote batch |
| T3 EXPLICIT | `youtube` transcripts | explicit request only |
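The PR names `EnrichmentTier` and `classify_chunk_tier()`; a minimal sketch of that classifier, reconstructed from the tier table above, follows. The signature and internal branching are assumptions — the real function may take a chunk record rather than bare `source`/`created_at` arguments:

```python
from datetime import datetime, timedelta, timezone
from enum import IntEnum


class EnrichmentTier(IntEnum):
    T0_IMMEDIATE = 0
    T1_HOURLY = 1
    T2_LAZY = 2
    T3_EXPLICIT = 3


def classify_chunk_tier(source: str, created_at: datetime) -> EnrichmentTier:
    """Map a chunk's source and age onto an enrichment tier (illustrative sketch)."""
    if source in ("manual", "digest"):
        return EnrichmentTier.T0_IMMEDIATE        # always enriched immediately
    if source == "youtube":
        return EnrichmentTier.T3_EXPLICIT         # only on explicit request
    # claude_code (and, per the review discussion, possibly other sources):
    # recency decides between hourly-local and lazy/remote-batch enrichment.
    age = datetime.now(timezone.utc) - created_at
    if age <= timedelta(days=7):
        return EnrichmentTier.T1_HOURLY
    return EnrichmentTier.T2_LAZY
```

Because `EnrichmentTier` is an `IntEnum`, tiers compare and sort numerically, so selectors can express "everything at least this urgent" as `tier <= EnrichmentTier.T1_HOURLY`.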

Test plan

  • python3 -m pytest tests/test_enrichment_tiers.py -v — 24/24 pass
  • Full suite python3 -m pytest tests/ -q -m "not live and not slow and not integration" — 834 passed, 0 regressions
  • Integration: wire classify_chunk_tier into enrich_batch and export_unenriched_chunks selectors (follow-up PR)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added input validation for search parameters with clear error messages for invalid values and bounds.
    • Introduced a tiered enrichment system to classify content by priority and freshness.
    • Enhanced backup operations with checkpoint-based tracking and batch cost estimation.
  • Performance Improvements

    • Implemented caching for hybrid search results to reduce latency on repeated queries.


@EtanHey (Owner, Author) commented Mar 14, 2026

@codex review

@coderabbitai Bot commented Mar 14, 2026

Warning

Rate limit exceeded

@EtanHey has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 16 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 32c2b5f3-b5d6-424c-aae1-678c123f585c

📥 Commits

Reviewing files that changed from the base of the PR and between 8690de9 and 87819b8.

📒 Files selected for processing (3)
  • pyproject.toml
  • src/brainlayer/pipeline/enrichment_tiers.py
  • tests/test_enrichment_tiers.py
📝 Walkthrough

Walkthrough

This PR introduces a tiered enrichment system for prioritizing chunk processing, implements LRU caching for hybrid search results, adds input validation to the search handler, refactors the cloud backfill script with APSW-based sidecar checkpointing, and includes comprehensive test coverage for new features and validation scenarios.

Changes

  • Enrichment Tier System — src/brainlayer/pipeline/enrichment_tiers.py, tests/test_enrichment_tiers.py
    New module introducing the EnrichmentTier enum (T0_IMMEDIATE, T1_HOURLY, T2_LAZY, T3_EXPLICIT) and classification functions that determine chunk enrichment priority from source, content type, and recency. Comprehensive tests validate tier assignment across content sources and age thresholds.
  • Hybrid Search Caching — src/brainlayer/search_repo.py, tests/test_search_validation.py
    Adds LRU caching with TTL for hybrid_search results, keyed by store, query, and filter parameters. Implements defensive deep-copy semantics to prevent cache pollution. Tests verify cache-hit behavior, per-store scoping, and defensive-copy isolation.
  • Search Input Validation — src/brainlayer/mcp/search_handler.py, tests/test_search_validation.py
    Introduces validation constants (_VALID_SEARCH_DETAILS, _MAX_PUBLIC_NUM_RESULTS, _MIN_PUBLIC_NUM_RESULTS) and early error checking in _brain_search to reject invalid detail values and out-of-bounds num_results, returning descriptive errors before branching logic.
  • Cloud Backfill Checkpoint Refactor — scripts/cloud_backfill.py, tests/test_cloud_backfill.py
    Major refactor introducing an APSW-based sidecar checkpoint database that replaces in-process checkpoint handling. Adds a batch request builder (build_batch_request_line), a cost estimator (estimate_batch_cost_usd), a read-only fallback store pattern, and utilities for checkpoint migration, JSONL path tracking, and batch reuse. Tests validate checkpoint isolation, cost calculations, export filtering, and read-only fallback behavior.
  • Test Infrastructure — tests/conftest.py, pyproject.toml
    Adds pytest pythonpath configuration for src imports and introduces an eval_project fixture that provides isolated project namespaces per test to prevent cross-case data contamination.
  • Test Utilities — tests/test_eval_baselines.py
    Introduces hook-path resolution (_resolve_hook_path, REPO_HOOK_PATH, HOME_HOOK_PATH) that prefers repo-level hooks over home dotfiles, and adds runtime guards to skip hook tests when the production DB is unavailable.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~110 minutes

Poem

🐰 A rabbit hops through enrichment tiers so bright,
Caching searches swift, from morning to night,
Checkpoints in APSW, sorted with care,
Validation guards gate the search thoroughfare,
Three components aligned—a harmonious sight! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The PR title accurately reflects the main change: introduction of tiered enrichment selectivity with a T0/T1/T2/T3 classifier. It does not capture the full scope of the changeset, which also includes search validation hardening, cloud backfill improvements, hybrid search caching, and test infrastructure updates, so the title is specific to the primary objective but somewhat narrow relative to the total extent of changes.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 85.11%, above the required threshold of 80.00%.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9434be4cec


Comment thread: scripts/cloud_backfill.py

```python
DEFAULT_DB_PATH = get_db_path()
EXPORT_DIR = Path(__file__).resolve().parent / "backfill_data"
CHECKPOINT_TABLE = "enrichment_checkpoints"
CHECKPOINT_DB_PATH = DEFAULT_DB_PATH.with_name("enrichment_checkpoints.db")
```
P1 Badge Derive checkpoint sidecar path from the selected DB

CHECKPOINT_DB_PATH is fixed from DEFAULT_DB_PATH at import time, so even when the CLI passes a different --db path (see the runtime dispatch in run_full_backfill/resume_backfill), checkpoint reads and writes still target the default sidecar file. This mixes checkpoint state across databases and makes --resume/--status incorrect for non-default DBs.
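A minimal fix in the direction the review suggests is to derive the sidecar path from whichever main DB the CLI actually selected, instead of freezing it at import time. The helper name is hypothetical; the `with_name` pattern mirrors the module-level constant shown above:

```python
from pathlib import Path


def checkpoint_db_path(db_path: Path) -> Path:
    """Derive the sidecar checkpoint DB next to whichever main DB is in use."""
    return db_path.with_name("enrichment_checkpoints.db")
```

Call sites such as `run_full_backfill`/`resume_backfill` would then pass their resolved `--db` path through, so checkpoint state for different databases never mixes.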


Comment thread: scripts/cloud_backfill.py, lines 198–199

```sql
INSERT OR REPLACE INTO {CHECKPOINT_TABLE} ({', '.join(CHECKPOINT_COLUMNS)})
VALUES ({placeholders})
```

P1 Badge Prevent legacy migration from clobbering newer sidecar rows

The migration path runs on every ensure_checkpoint_table call and uses INSERT OR REPLACE for all legacy rows, which can overwrite newer sidecar statuses with stale values still stored in the main DB table. After one run updates a batch to completed/imported in the sidecar, a later startup can revert it back to an old submitted state and incorrectly treat finished jobs as pending.
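The insert-only migration the review asks for can be sketched with the standard-library `sqlite3` module (the script itself uses APSW; the table has a fixed two-column shape here purely for illustration):

```python
import sqlite3


def migrate_legacy_rows(conn: sqlite3.Connection, rows: list[tuple]) -> int:
    """Replay legacy checkpoint rows without clobbering newer sidecar state.

    INSERT OR IGNORE skips any batch_id that already exists in the sidecar,
    so a status a previous run advanced to completed/imported is preserved.
    """
    conn.executemany(
        "INSERT OR IGNORE INTO enrichment_checkpoints (batch_id, status) VALUES (?, ?)",
        rows,
    )
    return len(rows)
```

Because `ensure_checkpoint_table` runs the migration on every startup, the idempotence of `INSERT OR IGNORE` is what makes repeated replays safe.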



@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/cloud_backfill.py (1)

940-959: ⚠️ Potential issue | 🟠 Major

--status still hard-fails on a busy main DB.

The checkpoint query moved to the sidecar, but show_status() still constructs VectorStore before it reaches that block. If the main DB is locked, the status command never gets to the sidecar path even though it only needs read access plus best-effort stats.

Suggested change:

```diff
 def show_status(db_path: Path) -> None:
     """Show status of all checkpoint jobs."""
-    store = VectorStore(db_path)
+    store = open_backfill_store(db_path, allow_read_only_fallback=True)
     ensure_checkpoint_table(store)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/cloud_backfill.py` around lines 940 - 959, show_status() constructs
VectorStore (VectorStore, ensure_checkpoint_table) before opening the sidecar
checkpoint DB (_open_checkpoint_conn) so a locked main DB can abort the --status
path; make VectorStore creation lazy or move it after the sidecar branch so the
function first opens the sidecar connection and runs
_ensure_checkpoint_table_in_conn / SELECT on CHECKPOINT_TABLE (using
_open_checkpoint_conn and _ensure_checkpoint_table_in_conn) and only creates
VectorStore and calls ensure_checkpoint_table if/when the main DB is actually
needed for stats; keep function names unchanged and avoid touching the sidecar
read path so --status can succeed when the main DB is busy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/cloud_backfill.py`:
- Around line 189-203: _migrate_checkpoints_from_main_db currently uses "INSERT
OR REPLACE" which can overwrite newer sidecar rows with stale main-db data;
change the SQL in _migrate_checkpoints_from_main_db to perform an insert-only
migration (e.g., use "INSERT OR IGNORE INTO {CHECKPOINT_TABLE} ({',
'.join(CHECKPOINT_COLUMNS)}) VALUES (...)" or an INSERT ... SELECT ... WHERE NOT
EXISTS pattern) so existing sidecar rows are not replaced, and apply the same
insert-only change to the other migration block that uses
CHECKPOINT_TABLE/CHECKPOINT_COLUMNS (the similar code around the second
migration you pointed out). Ensure the executemany call still supplies the same
rows but the SQL prevents replacing existing records so save_checkpoint updates
in the sidecar are preserved.
- Around line 252-292: The save_checkpoint function performs a non-atomic SELECT
then INSERT/UPDATE which causes a race if two processes check-then-insert
concurrently; change it to perform a single atomic upsert against
CHECKPOINT_TABLE (e.g. use SQLite's INSERT ... ON CONFLICT(batch_id) DO UPDATE
SET ... or REPLACE INTO) instead of the separate existence check, building the
column list and the conflict-target update clause from kwargs so the write never
fails with a unique-constraint error; keep the existing retry logic around the
single upsert execution and ensure the operation is executed within the same
connection/transaction and the connection is closed in the finally block.
- Around line 358-366: The type hints are too narrow: functions like
export_unenriched_chunks and _init_sanitizer (called from run_full_backfill
which can return ReadOnlyBackfillStore via open_backfill_store) should accept
either VectorStore or ReadOnlyBackfillStore; update their annotations to a union
(VectorStore | ReadOnlyBackfillStore) or define a small StoreProtocol that
declares the used members (.conn, .get_enrichment_stats, .close) and use that
protocol as the parameter type, then update open_backfill_store,
run_full_backfill, export_unenriched_chunks, and _init_sanitizer signatures to
use the new union/protocol so type-checkers accept both store types.

In `@src/brainlayer/pipeline/enrichment_tiers.py`:
- Around line 46-50: classify_chunk_tier currently ignores the content_type
parameter which lets low-signal "noise" pass through; update classify_chunk_tier
to branch on content_type and handle types per guidelines: skip/return no tier
for "noise"; preserve verbatim (do not summarize or alter) for "ai_code",
"stack_trace", and "user_message"; treat "build_log" by summarizing it before
classification; for "dir_listing" extract only structural info (filenames/paths)
and drop file contents, then classify; ensure you reference and change logic
inside the classify_chunk_tier function (and the sibling classification block
around lines marked 63-75) so content_type drives the flow.
- Around line 71-74: classify_chunk_tier() and get_tier_source_filter()
disagree: classify_chunk_tier treats non-claude sources (e.g., "unknown") as
eligible for T1/T2 based on recency, but get_tier_source_filter() only includes
"claude_code" for those tiers; pick one consistent semantic and implement it:
either expand get_tier_source_filter() to include the same source categories
classify_chunk_tier() treats as T1/T2 (e.g., include "unknown" and other
non-claude sources) or restrict classify_chunk_tier() to only return T1/T2 for
"claude_code"; update both occurrences noted (the classification branch around
classify_chunk_tier and the source filter logic in get_tier_source_filter,
including the similar logic at lines ~116-120) so both functions use the same
source set for T1/T2. Ensure tests/usage expecting T1/T2 selection reflect the
chosen rule.

In `@src/brainlayer/search_repo.py`:
- Line 27: The module-level OrderedDict _hybrid_cache is accessed
non-atomically; add a module-level lock (e.g., _hybrid_cache_lock =
threading.RLock()) and wrap every compound check/read/move/update sequence that
touches _hybrid_cache in a with _hybrid_cache_lock: block so operations (lookup,
move_to_end, insert, pop) are atomic; update all places that manipulate
_hybrid_cache (including the initial definition site and the other usage sites
referenced by the comment) to acquire the lock before reading/modifying and
release after, ensuring you still use OrderedDict.move_to_end/popitem as before
but under the lock.
- Around line 30-65: The cache key builder _hybrid_cache_key is missing the
query_embedding so hybrid_search results can be incorrectly reused; update
_hybrid_cache_key to accept a query_embedding parameter and include a stable
representation of it (e.g., a deterministic hash like SHA-256 of the embedding
bytes or a tuple of rounded floats) in the returned tuple, and update any
callers (e.g., hybrid_search) to pass the embedding into _hybrid_cache_key so
cache entries are unique per embedding.

In `@tests/test_enrichment_tiers.py`:
- Around line 4-6: The docstring header "T0 (IMMEDIATE): manual/digest sources
and high-signal types (decision, mistake)" overstates T0 by referencing
memory_type values not accepted by the classifier/test API; update the docstring
in tests/test_enrichment_tiers.py (the header line starting with "T0
(IMMEDIATE):") to remove the parenthetical "decision, mistake" or replace it
with wording that matches the actual contract (e.g., "manual/digest sources and
high-signal types" or "manual/digest sources where classifier does not accept
memory_type"), ensuring the test description matches the classifier/test API
inputs.

---

Outside diff comments:
In `@scripts/cloud_backfill.py`:
- Around line 940-959: show_status() constructs VectorStore (VectorStore,
ensure_checkpoint_table) before opening the sidecar checkpoint DB
(_open_checkpoint_conn) so a locked main DB can abort the --status path; make
VectorStore creation lazy or move it after the sidecar branch so the function
first opens the sidecar connection and runs _ensure_checkpoint_table_in_conn /
SELECT on CHECKPOINT_TABLE (using _open_checkpoint_conn and
_ensure_checkpoint_table_in_conn) and only creates VectorStore and calls
ensure_checkpoint_table if/when the main DB is actually needed for stats; keep
function names unchanged and avoid touching the sidecar read path so --status
can succeed when the main DB is busy.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fce696de-dfb8-4f06-a221-02eb49d1706a

📥 Commits

Reviewing files that changed from the base of the PR and between 50d637d and 8690de9.

📒 Files selected for processing (10)
  • pyproject.toml
  • scripts/cloud_backfill.py
  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/pipeline/enrichment_tiers.py
  • src/brainlayer/search_repo.py
  • tests/conftest.py
  • tests/test_cloud_backfill.py
  • tests/test_enrichment_tiers.py
  • tests/test_eval_baselines.py
  • tests/test_search_validation.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.12)
🧰 Additional context used
📓 Path-based instructions (2)
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Python package structure should follow the layout: src/brainlayer/ for package code, with separate modules for vector_store.py, embeddings.py, daemon.py, dashboard/, and mcp/ for different concerns
Use paths.py:get_db_path() for all database path resolution instead of hardcoding paths; support environment variable overrides and canonical path fallback (~/.local/share/brainlayer/brainlayer.db)
Lint and format Python code using ruff check src/ and ruff format src/
Preserve verbatim content for ai_code, stack_trace, and user_message message types during classification and chunking; skip noise content entirely; summarize build_log content; extract structure-only for dir_listing
Use AST-aware chunking with tree-sitter; never split stack traces; mask large tool output during chunking
Handle SQLite concurrency by implementing retry logic on SQLITE_BUSY errors; ensure each worker uses its own database connection
Prioritize MLX (Qwen2.5-Coder-14B-Instruct-4bit) on Apple Silicon (port 8080) as the enrichment backend; fall back to Ollama (glm-4.7-flash on port 11434) after 3 consecutive MLX failures; support backend override via BRAINLAYER_ENRICH_BACKEND environment variable
Brain graph API must expose endpoints: /brain/graph, /brain/node/{node_id} (FastAPI)
Backlog API must support endpoints: /backlog/items with GET, POST, PATCH, DELETE operations (FastAPI)
Provide brainlayer brain-export command to export brain graph as JSON for dashboard consumption
Provide brainlayer export-obsidian command to export as Markdown vault with backlinks and tags
For bulk database operations: stop enrichment workers first, checkpoint WAL before and after operations, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks with checkpoint every 3 batches, never delete from chunks while FTS trigger is active

Files:

  • src/brainlayer/mcp/search_handler.py
  • src/brainlayer/pipeline/enrichment_tiers.py
  • src/brainlayer/search_repo.py
src/brainlayer/mcp/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

MCP tools must implement: brain_search, brain_store, brain_recall, brain_entity, brain_expand, brain_update, brain_digest, brain_get_person with legacy brainlayer_* aliases for backward compatibility

Files:

  • src/brainlayer/mcp/search_handler.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/**/*.py : For bulk database operations: stop enrichment workers first, checkpoint WAL before and after operations, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks with checkpoint every 3 batches, never delete from `chunks` while FTS trigger is active
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/mcp/**/*.py : MCP tools must implement: `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_get_person` with legacy `brainlayer_*` aliases for backward compatibility

Applied to files:

  • src/brainlayer/mcp/search_handler.py
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/**/*.py : Preserve verbatim content for `ai_code`, `stack_trace`, and `user_message` message types during classification and chunking; skip `noise` content entirely; summarize `build_log` content; extract structure-only for `dir_listing`

Applied to files:

  • src/brainlayer/pipeline/enrichment_tiers.py
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/**/*.py : Prioritize MLX (`Qwen2.5-Coder-14B-Instruct-4bit`) on Apple Silicon (port 8080) as the enrichment backend; fall back to Ollama (`glm-4.7-flash` on port 11434) after 3 consecutive MLX failures; support backend override via `BRAINLAYER_ENRICH_BACKEND` environment variable

Applied to files:

  • src/brainlayer/pipeline/enrichment_tiers.py
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/**/*.py : For bulk database operations: stop enrichment workers first, checkpoint WAL before and after operations, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks with checkpoint every 3 batches, never delete from `chunks` while FTS trigger is active

Applied to files:

  • scripts/cloud_backfill.py
🔇 Additional comments (7)
pyproject.toml (1)

107-110: Good pytest path precedence change.

pythonpath = ["src"] is a solid fix for source-first imports in local/dev test runs.

tests/conftest.py (1)

17-25: Nice isolation fixture for eval tests.

Unique per-invocation project names are the right approach to avoid cross-case data leakage.

src/brainlayer/mcp/search_handler.py (1)

130-135: Good boundary validation at the MCP edge.

Rejecting invalid detail/num_results before routing is the right behavior and prevents silent fallback/clamping.

tests/test_search_validation.py (1)

23-29: Strong regression coverage for validation and cache behavior.

These tests are well targeted and materially reduce risk for the new search boundary + warm-cache path.

Also applies to: 31-152, 303-411

tests/test_eval_baselines.py (1)

931-936: No action needed — get_db_path() is explicitly typed to return Path.

The function signature at src/brainlayer/paths.py:14 declares def get_db_path() -> Path: and both implementations (line 21 and 24) return Path objects. Calling .exists() on the result is safe and correct.

Likely an incorrect or invalid review comment.
scripts/cloud_backfill.py (1)

111-134: Nice extraction of the batch payload and pricing helpers.

Centralizing thinkingBudget=0 and the batch-rate math makes the export path and its regression tests much harder to drift.

tests/test_cloud_backfill.py (1)

38-250: Good regression coverage for the new sidecar flow.

These tests hit the highest-risk additions here: thinkingBudget=0, discounted cost math, main→sidecar migration, export reuse, and read-only fallback.

Comment thread: scripts/cloud_backfill.py, lines 189–203

```python
def _migrate_checkpoints_from_main_db(store: Optional[VectorStore], conn: apsw.Connection) -> int:
    """Copy legacy checkpoint rows into the sidecar DB so state is preserved."""
    rows = _main_checkpoint_rows(store)
    if not rows:
        return 0

    placeholders = ", ".join("?" for _ in CHECKPOINT_COLUMNS)
    conn.cursor().executemany(
        f"""
        INSERT OR REPLACE INTO {CHECKPOINT_TABLE} ({', '.join(CHECKPOINT_COLUMNS)})
        VALUES ({placeholders})
        """,
        rows,
    )
    return len(rows)
```

⚠️ Potential issue | 🟠 Major

Make the legacy migration insert-only.

ensure_checkpoint_table() runs on every invocation, but _migrate_checkpoints_from_main_db() currently replays legacy rows with INSERT OR REPLACE. Since save_checkpoint() no longer updates the main table, a later startup can overwrite a newer sidecar row back to the stale main-DB status.

Suggested change:

```diff
-    conn.cursor().executemany(
-        f"""
-        INSERT OR REPLACE INTO {CHECKPOINT_TABLE} ({', '.join(CHECKPOINT_COLUMNS)})
-        VALUES ({placeholders})
-        """,
-        rows,
-    )
+    conn.cursor().executemany(
+        f"""
+        INSERT OR IGNORE INTO {CHECKPOINT_TABLE} ({', '.join(CHECKPOINT_COLUMNS)})
+        VALUES ({placeholders})
+        """,
+        rows,
+    )
```

Also applies to: 242-249

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/cloud_backfill.py` around lines 189 - 203,
_migrate_checkpoints_from_main_db currently uses "INSERT OR REPLACE" which can
overwrite newer sidecar rows with stale main-db data; change the SQL in
_migrate_checkpoints_from_main_db to perform an insert-only migration (e.g., use
"INSERT OR IGNORE INTO {CHECKPOINT_TABLE} ({', '.join(CHECKPOINT_COLUMNS)})
VALUES (...)" or an INSERT ... SELECT ... WHERE NOT EXISTS pattern) so existing
sidecar rows are not replaced, and apply the same insert-only change to the
other migration block that uses CHECKPOINT_TABLE/CHECKPOINT_COLUMNS (the similar
code around the second migration you pointed out). Ensure the executemany call
still supplies the same rows but the SQL prevents replacing existing records so
save_checkpoint updates in the sidecar are preserved.

Comment thread: scripts/cloud_backfill.py, lines 252–292

```python
def save_checkpoint(store: VectorStore, batch_id: str, **kwargs) -> None:
    """Insert or update a checkpoint row in the sidecar checkpoint DB."""
    del store  # Legacy signature retained for callers; checkpoint writes avoid the main DB.
    last_error: Exception | None = None

    for attempt in range(CHECKPOINT_WRITE_MAX_RETRIES):
        conn = None
        try:
            conn = _open_checkpoint_conn()
            _ensure_checkpoint_table_in_conn(conn)
            cursor = conn.cursor()
            existing = list(
                cursor.execute(
                    f"SELECT batch_id FROM {CHECKPOINT_TABLE} WHERE batch_id = ?",
                    [batch_id],
                )
            )
            if existing:
                sets = ", ".join(f"{k} = ?" for k in kwargs)
                params = list(kwargs.values()) + [batch_id]
                cursor.execute(f"UPDATE {CHECKPOINT_TABLE} SET {sets} WHERE batch_id = ?", params)
            else:
                cols = ["batch_id"] + list(kwargs.keys())
                vals = [batch_id] + list(kwargs.values())
                placeholders = ", ".join("?" for _ in cols)
                cursor.execute(
                    f"INSERT INTO {CHECKPOINT_TABLE} ({', '.join(cols)}) VALUES ({placeholders})",
                    vals,
                )
            return
        except apsw.BusyError as exc:
            last_error = exc
            if attempt == CHECKPOINT_WRITE_MAX_RETRIES - 1:
                break
            time.sleep(CHECKPOINT_WRITE_BASE_DELAY * (2**attempt))
        finally:
            if conn is not None:
                conn.close()

    if last_error is not None:
        raise last_error
```

⚠️ Potential issue | 🟠 Major

Avoid the SELECT/INSERT race in save_checkpoint.

The existence check and follow-up write are not atomic. Two concurrent backfill processes can both miss the row, then one dies on the INSERT with a constraint error that this retry loop never handles.

Suggested change:

```diff
-            existing = list(
-                cursor.execute(
-                    f"SELECT batch_id FROM {CHECKPOINT_TABLE} WHERE batch_id = ?",
-                    [batch_id],
-                )
-            )
-            if existing:
-                sets = ", ".join(f"{k} = ?" for k in kwargs)
-                params = list(kwargs.values()) + [batch_id]
-                cursor.execute(f"UPDATE {CHECKPOINT_TABLE} SET {sets} WHERE batch_id = ?", params)
-            else:
-                cols = ["batch_id"] + list(kwargs.keys())
-                vals = [batch_id] + list(kwargs.values())
-                placeholders = ", ".join("?" for _ in cols)
-                cursor.execute(
-                    f"INSERT INTO {CHECKPOINT_TABLE} ({', '.join(cols)}) VALUES ({placeholders})",
-                    vals,
-                )
+            cols = ["batch_id"] + list(kwargs.keys())
+            vals = [batch_id] + list(kwargs.values())
+            placeholders = ", ".join("?" for _ in cols)
+            updates = ", ".join(f"{col} = excluded.{col}" for col in kwargs)
+            cursor.execute(
+                f"""
+                INSERT INTO {CHECKPOINT_TABLE} ({', '.join(cols)})
+                VALUES ({placeholders})
+                ON CONFLICT(batch_id) DO UPDATE SET {updates}
+                """,
+                vals,
+            )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/cloud_backfill.py` around lines 252 - 292, The save_checkpoint
function performs a non-atomic SELECT then INSERT/UPDATE which causes a race if
two processes check-then-insert concurrently; change it to perform a single
atomic upsert against CHECKPOINT_TABLE (e.g. use SQLite's INSERT ... ON
CONFLICT(batch_id) DO UPDATE SET ... or REPLACE INTO) instead of the separate
existence check, building the column list and the conflict-target update clause
from kwargs so the write never fails with a unique-constraint error; keep the
existing retry logic around the single upsert execution and ensure the operation
is executed within the same connection/transaction and the connection is closed
in the finally block.
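The atomic-upsert pattern from the suggested change can be demonstrated in isolation with the stdlib `sqlite3` module. The table and column names below are illustrative, not the real `CHECKPOINT_TABLE` schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkpoints (batch_id TEXT PRIMARY KEY, status TEXT, cost REAL)"
)

def save_checkpoint(conn: sqlite3.Connection, batch_id: str, **kwargs) -> None:
    # Single atomic upsert: no separate existence check, so two concurrent
    # writers can never race into a UNIQUE-constraint error on INSERT.
    cols = ["batch_id"] + list(kwargs)
    placeholders = ", ".join("?" for _ in cols)
    updates = ", ".join(f"{c} = excluded.{c}" for c in kwargs)
    conn.execute(
        f"INSERT INTO checkpoints ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON CONFLICT(batch_id) DO UPDATE SET {updates}",
        [batch_id] + list(kwargs.values()),
    )

save_checkpoint(conn, "b1", status="running", cost=0.0)
save_checkpoint(conn, "b1", status="done", cost=1.25)  # second call updates in place
print(conn.execute("SELECT * FROM checkpoints").fetchall())
# → [('b1', 'done', 1.25)]
```

Note that `ON CONFLICT ... DO UPDATE` requires SQLite 3.24+; `excluded.<col>` refers to the values the failed INSERT attempted to write.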

Comment thread scripts/cloud_backfill.py
Comment on lines +358 to +366
def open_backfill_store(db_path: Path, *, allow_read_only_fallback: bool = False):
"""Open the main DB for backfill work, with read-only fallback for export-only runs."""
try:
return VectorStore(db_path)
except apsw.BusyError:
if not allow_read_only_fallback:
raise
print("Main DB open hit BusyError; falling back to read-only export mode.")
return ReadOnlyBackfillStore(db_path)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== Sanitizer helper definition ==\n'
rg -n -C5 --type=py 'def build_name_dictionary\(' src scripts

printf '\n== Fallback/export call chain ==\n'
rg -n -C3 --type=py 'class ReadOnlyBackfillStore|def open_backfill_store\(|def _init_sanitizer\(|def export_unenriched_chunks\(' scripts

Repository: EtanHey/brainlayer

Length of output: 2741


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== ReadOnlyBackfillStore full implementation ==\n'
sed -n '315,356p' scripts/cloud_backfill.py

printf '\n== _init_sanitizer full implementation ==\n'
sed -n '371,403p' scripts/cloud_backfill.py

printf '\n== build_name_dictionary full implementation ==\n'
sed -n '466,485p' src/brainlayer/pipeline/sanitize.py

Repository: EtanHey/brainlayer

Length of output: 4207


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== VectorStore conn attribute ==\n'
rg -n -C8 'class VectorStore' src scripts | head -40

printf '\n== export_unenriched_chunks full implementation ==\n'
sed -n '405,500p' scripts/cloud_backfill.py

printf '\n== Lines 798-823 (second location) ==\n'
sed -n '798,823p' scripts/cloud_backfill.py

Repository: EtanHey/brainlayer

Length of output: 6158


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== VectorStore.__init__ and conn ==\n'
rg -n -A30 'class VectorStore' src/brainlayer/vector_store.py | grep -A30 '__init__'

printf '\n== VectorStore.conn attribute ==\n'
rg -n 'self\.conn' src/brainlayer/vector_store.py | head -20

Repository: EtanHey/brainlayer

Length of output: 1395


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== Verify both connection types support cursor() ==\n'
python3 << 'EOF'
import sqlite3
import sys

# Check sqlite3.Connection API
print("sqlite3.Connection methods:")
conn_methods = [m for m in dir(sqlite3.Connection) if not m.startswith('_')]
print(f"  - has cursor(): {'cursor' in conn_methods}")

# Check apsw.Connection API (document it)
print("\napsw.Connection documentation:")
try:
    import apsw
    apsw_methods = [m for m in dir(apsw.Connection) if not m.startswith('_')]
    print(f"  - has cursor(): {'cursor' in apsw_methods}")
except ImportError:
    print("  (apsw not available in sandbox, but it's a known library)")
EOF

printf '\n== Check all store method calls in export chain ==\n'
printf 'Methods called on store parameter:\n'
rg -n 'store\.' scripts/cloud_backfill.py | grep -E '(export_unenriched|_init_sanitizer|open_backfill|798|803)'

Repository: EtanHey/brainlayer

Length of output: 392


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== Check if ReadOnlyBackfillStore.get_enrichment_stats exists ==\n'
sed -n '315,356p' scripts/cloud_backfill.py | grep -A2 'def get_enrichment_stats'

printf '\n== Verify ReadOnlyBackfillStore has close method ==\n'
sed -n '315,356p' scripts/cloud_backfill.py | grep -A1 'def close'

printf '\n== Check if export flow actually uses store methods beyond conn/close ==\n'
printf 'All store.* calls in export_unenriched_chunks:\n'
sed -n '405,500p' scripts/cloud_backfill.py | grep 'store\.'

Repository: EtanHey/brainlayer

Length of output: 555


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== Check if VectorStore.close exists ==\n'
rg -n 'def close' src/brainlayer/vector_store.py

printf '\n== Check what type get_enrichment_stats returns ==\n'
rg -n -A5 'def get_enrichment_stats' src/brainlayer/vector_store.py

Repository: EtanHey/brainlayer

Length of output: 186


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== Search for get_enrichment_stats across codebase ==\n'
rg -n 'def get_enrichment_stats' src scripts

printf '\n== Check mixin classes for get_enrichment_stats ==\n'
rg -n 'class SearchMixin\|class KGMixin\|class SessionMixin' src/brainlayer/vector_store.py

printf '\n== Check line 803 context - what store is used ==\n'
sed -n '795,810p' scripts/cloud_backfill.py

Repository: EtanHey/brainlayer

Length of output: 340


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== Verify VectorStore inherits from SessionMixin ==\n'
sed -n '56p' src/brainlayer/vector_store.py

printf '\n== Check SessionMixin.get_enrichment_stats ==\n'
sed -n '186,210p' src/brainlayer/session_repo.py

printf '\n== Summary: All store methods used in export flow ==\n'
cat << 'EOF'
Methods called on store in export flow:
1. store.conn.cursor() - line 427 in export_unenriched_chunks
2. store.get_enrichment_stats() - line 803 in run_full_backfill

ReadOnlyBackfillStore provides:
- .conn (sqlite3.Connection)
- .get_enrichment_stats() (custom implementation)
- .close()

VectorStore provides (via inheritance):
- .conn (apsw.Connection)
- .get_enrichment_stats() (from SessionMixin)
- .close()

Connection compatibility:
- Both sqlite3.Connection and apsw.Connection support .cursor()
EOF

Repository: EtanHey/brainlayer

Length of output: 1844


Update type hints to properly reflect that functions accept either store type.

run_full_backfill() can pass a ReadOnlyBackfillStore into export_unenriched_chunks() and _init_sanitizer(), but these functions are typed as store: VectorStore. This works at runtime because both ReadOnlyBackfillStore and VectorStore provide .conn, .get_enrichment_stats(), and .close(), and both connection types support .cursor()—but the type mismatch creates a type-checking problem.

Consider using a Protocol or Union type to document the actual contract these functions require (e.g., store: VectorStore | ReadOnlyBackfillStore or a StoreProtocol with just the needed methods).
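A minimal sketch of that Protocol, covering only the members the review's analysis chain found the export flow actually uses (`.conn`, `.get_enrichment_stats()`, `.close()`); the return types here are assumptions, not the real signatures:

```python
from typing import Any, Protocol

class SupportsCursor(Protocol):
    """Common surface of sqlite3.Connection and apsw.Connection."""
    def cursor(self) -> Any: ...

class BackfillStoreProtocol(Protocol):
    """Structural type satisfied by both VectorStore and ReadOnlyBackfillStore."""
    conn: SupportsCursor

    def get_enrichment_stats(self) -> dict: ...
    def close(self) -> None: ...

def export_unenriched_chunks(store: BackfillStoreProtocol) -> None:
    cursor = store.conn.cursor()
    ...  # export logic unchanged; type-checkers now accept either store type
```

Because Protocols are structural, neither store class needs to be modified or import the protocol; annotating `open_backfill_store`, `run_full_backfill`, and `_init_sanitizer` against it documents the real contract without coupling the script to two concrete classes.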

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/cloud_backfill.py` around lines 358 - 366, The type hints are too
narrow: functions like export_unenriched_chunks and _init_sanitizer (called from
run_full_backfill which can return ReadOnlyBackfillStore via
open_backfill_store) should accept either VectorStore or ReadOnlyBackfillStore;
update their annotations to a union (VectorStore | ReadOnlyBackfillStore) or
define a small StoreProtocol that declares the used members (.conn,
.get_enrichment_stats, .close) and use that protocol as the parameter type, then
update open_backfill_store, run_full_backfill, export_unenriched_chunks, and
_init_sanitizer signatures to use the new union/protocol so type-checkers accept
both store types.

Comment thread src/brainlayer/search_repo.py
_HYBRID_CACHE_MAX = 128 # max entries (LRU eviction)

# Module-level LRU cache: {cache_key: (result, timestamp)}
_hybrid_cache: "OrderedDict[tuple, tuple[dict, float]]" = OrderedDict()

⚠️ Potential issue | 🟠 Major

Protect module-level LRU cache with a lock.

The current check/read/move/update sequence on _hybrid_cache is non-atomic. Under concurrent requests, this can race and produce inconsistent cache state or intermittent KeyError.

Suggested fix
 import copy
 import json
 import math
 import os
+import threading
 import time
 from collections import OrderedDict
@@
 _hybrid_cache: "OrderedDict[tuple, tuple[dict, float]]" = OrderedDict()
+_hybrid_cache_lock = threading.RLock()
@@
-        if cache_key in _hybrid_cache:
-            cached_result, cached_at = _hybrid_cache[cache_key]
-            if now - cached_at < _HYBRID_CACHE_TTL:
-                _hybrid_cache.move_to_end(cache_key)  # LRU touch
-                return _clone_hybrid_result(cached_result)
-            else:
-                del _hybrid_cache[cache_key]
+        with _hybrid_cache_lock:
+            if cache_key in _hybrid_cache:
+                cached_result, cached_at = _hybrid_cache[cache_key]
+                if now - cached_at < _HYBRID_CACHE_TTL:
+                    _hybrid_cache.move_to_end(cache_key)  # LRU touch
+                    return _clone_hybrid_result(cached_result)
+                else:
+                    del _hybrid_cache[cache_key]
@@
-        _hybrid_cache[cache_key] = (_clone_hybrid_result(result), time.monotonic())
-        _hybrid_cache.move_to_end(cache_key)
-        if len(_hybrid_cache) > _HYBRID_CACHE_MAX:
-            _hybrid_cache.popitem(last=False)  # evict oldest
+        with _hybrid_cache_lock:
+            _hybrid_cache[cache_key] = (_clone_hybrid_result(result), time.monotonic())
+            _hybrid_cache.move_to_end(cache_key)
+            if len(_hybrid_cache) > _HYBRID_CACHE_MAX:
+                _hybrid_cache.popitem(last=False)  # evict oldest

Also applies to: 443-449, 671-674

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/search_repo.py` at line 27, The module-level OrderedDict
_hybrid_cache is accessed non-atomically; add a module-level lock (e.g.,
_hybrid_cache_lock = threading.RLock()) and wrap every compound
check/read/move/update sequence that touches _hybrid_cache in a with
_hybrid_cache_lock: block so operations (lookup, move_to_end, insert, pop) are
atomic; update all places that manipulate _hybrid_cache (including the initial
definition site and the other usage sites referenced by the comment) to acquire
the lock before reading/modifying and release after, ensuring you still use
OrderedDict.move_to_end/popitem as before but under the lock.

Comment thread tests/test_enrichment_tiers.py
Comment on lines +4 to +6
T0 (IMMEDIATE): manual/digest sources and high-signal types (decision, mistake)
T1 (HOURLY): recent claude_code chunks (ai_code, stack_trace, user_message, assistant_text)
T2 (LAZY): old claude_code backlog (older than recency_days)

⚠️ Potential issue | 🟡 Minor

Test docstring overstates T0 criteria.

The header says T0 includes decision/mistake types, but the classifier/test API has no memory_type input. Please align the docstring to the actual contract.

Suggested fix
-T0 (IMMEDIATE): manual/digest sources and high-signal types (decision, mistake)
+T0 (IMMEDIATE): manual/digest sources
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_enrichment_tiers.py` around lines 4 - 6, The docstring header "T0
(IMMEDIATE): manual/digest sources and high-signal types (decision, mistake)"
overstates T0 by referencing memory_type values not accepted by the
classifier/test API; update the docstring in tests/test_enrichment_tiers.py (the
header line starting with "T0 (IMMEDIATE):") to remove the parenthetical
"decision, mistake" or replace it with wording that matches the actual contract
(e.g., "manual/digest sources and high-signal types" or "manual/digest sources
where classifier does not accept memory_type"), ensuring the test description
matches the classifier/test API inputs.

@EtanHey
Owner Author

EtanHey commented Mar 14, 2026

@cursor review
@BugBot review

EtanHey and others added 4 commits March 14, 2026 04:28
Add `EnrichmentTier` (IntEnum) and `classify_chunk_tier()` in
`src/brainlayer/pipeline/enrichment_tiers.py`.

Tier logic:
  T0 IMMEDIATE — manual/digest sources (always enrich, highest priority)
  T1 HOURLY    — recent claude_code within recency window (default 7 days)
  T2 LAZY      — old claude_code backlog (lazy/remote batch)
  T3 EXPLICIT  — youtube transcripts (never touched by default pipeline)

Implements Finding 1 from the enrichment efficiency audit:
"the current HIGH_VALUE_TYPES selector covers 97.75% of all chunks —
 source-aware tiers are needed".

Also adds `pythonpath = ["src"]` to pytest config so the worktree
src dir takes precedence over the editable install during development.

24 new TDD tests, all green. Full suite: 834 passed, 0 regressions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove unused `pytest` import from test_enrichment_tiers.py (F401)
- Sort import blocks in test_cloud_backfill.py and test_search_validation.py (I001)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues from CodeRabbit review:

1. content_type was accepted but never used, allowing 'noise' chunks to enter
   the tier pipeline. Add SKIP_CONTENT_TYPES = {"noise"} and check it first.

2. classify_chunk_tier() assigned unknown sources to T1/T2 based on recency,
   but get_tier_source_filter(T1) only returned {"claude_code"}, making the
   two functions inconsistent. Add T1_T2_SOURCES = {"claude_code"} and route
   unrecognised sources to T2 unconditionally (lazy backlog, not hourly T1).

Tests: 26/26 pass, 836 total suite passes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
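Putting the commit messages together, a minimal sketch of the classifier they describe. The names (`EnrichmentTier`, `classify_chunk_tier`, `SKIP_CONTENT_TYPES`, `T1_T2_SOURCES`, `recency_days`) come from the PR; the exact signature in `enrichment_tiers.py` may differ:

```python
from datetime import datetime, timedelta, timezone
from enum import IntEnum

RECENCY_DAYS = 7                      # default recency window from the PR
SKIP_CONTENT_TYPES = {"noise"}        # commit 4: noise never enters the pipeline
T1_T2_SOURCES = {"claude_code"}       # commit 4: only claude_code gets T1/T2 by recency

class EnrichmentTier(IntEnum):
    T0_IMMEDIATE = 0  # manual/digest: always enrich
    T1_HOURLY = 1     # recent claude_code (within recency window), hourly local
    T2_LAZY = 2       # old claude_code and unknown sources: lazy/remote batch
    T3_EXPLICIT = 3   # youtube transcripts: explicit request only

def classify_chunk_tier(
    source: str,
    created_at: datetime,
    content_type: str = "",
    recency_days: int = RECENCY_DAYS,
) -> "EnrichmentTier | None":
    if content_type in SKIP_CONTENT_TYPES:
        return None  # skipped entirely, no tier assigned
    if source in {"manual", "digest"}:
        return EnrichmentTier.T0_IMMEDIATE
    if source == "youtube":
        return EnrichmentTier.T3_EXPLICIT
    if source in T1_T2_SOURCES:
        age = datetime.now(timezone.utc) - created_at
        if age <= timedelta(days=recency_days):
            return EnrichmentTier.T1_HOURLY
        return EnrichmentTier.T2_LAZY
    return EnrichmentTier.T2_LAZY  # unrecognised sources go to lazy backlog, not T1

now = datetime.now(timezone.utc)
print(classify_chunk_tier("manual", now).name)                           # → T0_IMMEDIATE
print(classify_chunk_tier("claude_code", now - timedelta(days=30)).name) # → T2_LAZY
```

IntEnum (rather than plain Enum) lets tiers compare and sort numerically, so selectors like "everything at or above T1" reduce to `tier <= EnrichmentTier.T1_HOURLY`.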
@EtanHey EtanHey force-pushed the fix/tiered-enrichment branch from 36fb246 to 87819b8 Compare March 14, 2026 02:29
@EtanHey EtanHey merged commit 4e75ef4 into main Mar 14, 2026
1 check passed
@EtanHey EtanHey deleted the fix/tiered-enrichment branch March 14, 2026 02:30
