
fix: include NULL-project entries in search + embed BrainBar chunks #95

Merged

EtanHey merged 2 commits into main from fix/search-quality-null-project
Mar 19, 2026

Conversation

@EtanHey (Owner) commented Mar 19, 2026

Summary

  • BrainBar MCP entries had source='mcp', project=NULL, created_at=NULL, leaving them invisible to both semantic and keyword search
  • Changed c.project = ? to (c.project = ? OR c.project IS NULL) in 4 filter locations in search_repo.py (see the sketch after this list)
  • Expanded embed_pending_chunks to process source IN ('manual', 'mcp') so BrainBar entries get embeddings
  • Updated README: added Groq as primary enrichment backend, corrected chunk count to 224K
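
A minimal sketch of the predicate change, assuming a hypothetical build_project_filter helper; the actual search_repo.py function and schema names may differ:

```python
import sqlite3

def build_project_filter(project: str | None) -> tuple[str, list]:
    """WHERE fragment for project filtering.

    The old strict `c.project = ?` silently dropped rows whose project
    column is NULL (e.g. BrainBar MCP entries); the fix widens the
    predicate to also match NULL.
    """
    if project is None:
        return "1=1", []  # no filter requested
    return "(c.project = ? OR c.project IS NULL)", [project]

def text_search(conn: sqlite3.Connection, term: str, project: str | None):
    # Illustrative keyword-search path; the same fragment would be
    # reused by the semantic and FTS5 queries.
    where, params = build_project_filter(project)
    sql = f"SELECT c.id, c.text FROM chunks c WHERE c.text LIKE ? AND {where}"
    return conn.execute(sql, [f"%{term}%", *params]).fetchall()
```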

Test plan

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added Groq cloud as a new enrichment backend option.
  • Bug Fixes

    • Improved search filtering to include results with unset project values across all search types.
    • Extended embedding generation to cover chunks from additional data sources.
  • Documentation

    • Updated README with current indexing scale and example CLI configuration.

BrainBar MCP entries had source='mcp', project=NULL, created_at=NULL —
invisible to both semantic and keyword search due to strict project filter.

- search_repo.py: change `c.project = ?` to `(c.project = ? OR c.project IS NULL)`
  in 4 locations (semantic, FTS5, text search, post-RRF filter)
- store.py: expand embed_pending_chunks to `source IN ('manual', 'mcp')`
  so BrainBar entries get vector embeddings
- README.md: add Groq as primary enrichment backend, update chunk count to 224K

Verified: 4/5 test queries now surface BrainBar entries as #1 result (was 0/5).
923 tests pass, 0 regressions.
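
For illustration, one of those verification queries could be phrased as a regression test like the sketch below; the store fixture, search signature, and query string are assumptions, not the repo's actual test code:

```python
def test_brainbar_entries_surface_in_project_search(store):
    # BrainBar rows arrive with source='mcp' and project=NULL; before
    # this fix, any project-filtered search excluded them outright.
    results = store.search("brainbar reminder", project="brainlayer", limit=5)
    assert results, "expected at least one hit"
    top = results[0]
    # The fix holds if a NULL-project, MCP-sourced chunk can rank first.
    assert top.source == "mcp"
    assert top.project is None
```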

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>



coderabbitai Bot commented Mar 19, 2026


ℹ️ Review info
Review profile: ASSERTIVE · Plan: Pro · Configuration: Organization UI

📥 Commits
Reviewing files that changed from the base of the PR and between 943a592 and 15a212a.

📒 Files selected for processing (1)
  • README.md
📝 Walkthrough

Updated README with new backend information and revised indexing metrics. Modified project filtering logic in search queries to include NULL values, and expanded the chunk-embedding backfill to support MCP-sourced chunks alongside manual sources.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation Updates: README.md | Updated indexing scale from 317,000+ to 224,000+ chunks. Added Groq (cloud) as primary enrichment backend with ~1–2s/chunk timing (March 2026+). Updated the example CLI command to use BRAINLAYER_ENRICH_BACKEND=groq instead of mlx. |
| Database Query Logic: src/brainlayer/search_repo.py, src/brainlayer/store.py | Modified project filtering to include NULL project values across semantic vector search, text LIKE search, and FTS5 hybrid search. Updated post-RRF filtering to treat None projects as matching. Expanded embed_pending_chunks() to backfill embeddings for both 'manual' and 'mcp' sourced chunks. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Searching through the garden of chunks, we hop with glee,
NULL values now included in our queries so free!
Groq joins the feast, swift as a spring,
While MCP sources fill our embedding ring—
Brainlayer grows stronger, a rabbit's delight! 🥕✨

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately summarizes the main fixes: including NULL-project entries in search and embedding BrainBar (MCP) chunks, directly matching the core changes across search_repo.py and store.py. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, above the required threshold of 80.00%. |



@coderabbitai Bot left a comment

Actionable comments posted: 2


ℹ️ Review info
Review profile: ASSERTIVE · Plan: Pro · Configuration: Organization UI

📥 Commits
Reviewing files that changed from the base of the PR and between 1fc7ba7 and 943a592.

📒 Files selected for processing (3)
  • README.md
  • src/brainlayer/search_repo.py
  • src/brainlayer/store.py

⏰ Context from checks skipped due to the 90000ms timeout (3): GitHub Checks test (3.11), test (3.12), test (3.13). The timeout can be raised in the CodeRabbit configuration to a maximum of 15 minutes (900000ms).
🧰 Additional context used
📓 Path-based instructions (2)
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Python package structure should follow the layout: src/brainlayer/ for package code, with separate modules for vector_store.py, embeddings.py, daemon.py, dashboard/, and mcp/ for different concerns
Use paths.py:get_db_path() for all database path resolution instead of hardcoding paths; support environment variable overrides and canonical path fallback (~/.local/share/brainlayer/brainlayer.db)
Lint and format Python code using ruff check src/ and ruff format src/
Preserve verbatim content for ai_code, stack_trace, and user_message message types during classification and chunking; skip noise content entirely; summarize build_log content; extract structure-only for dir_listing
Use AST-aware chunking with tree-sitter; never split stack traces; mask large tool output during chunking
Handle SQLite concurrency by implementing retry logic on SQLITE_BUSY errors; ensure each worker uses its own database connection
Prioritize MLX (Qwen2.5-Coder-14B-Instruct-4bit) on Apple Silicon (port 8080) as the enrichment backend; fall back to Ollama (glm-4.7-flash on port 11434) after 3 consecutive MLX failures; support backend override via BRAINLAYER_ENRICH_BACKEND environment variable
Brain graph API must expose endpoints: /brain/graph, /brain/node/{node_id} (FastAPI)
Backlog API must support endpoints: /backlog/items with GET, POST, PATCH, DELETE operations (FastAPI)
Provide brainlayer brain-export command to export brain graph as JSON for dashboard consumption
Provide brainlayer export-obsidian command to export as Markdown vault with backlinks and tags
For bulk database operations: stop enrichment workers first, checkpoint WAL before and after operations, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks with checkpoint every 3 batches, never delete from chunks while FTS trigger is active

Files:

  • src/brainlayer/store.py
  • src/brainlayer/search_repo.py
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • src/brainlayer/store.py
  • src/brainlayer/search_repo.py
🧠 Learnings (3)
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/**/*.py : Prioritize MLX (`Qwen2.5-Coder-14B-Instruct-4bit`) on Apple Silicon (port 8080) as the enrichment backend; fall back to Ollama (`glm-4.7-flash` on port 11434) after 3 consecutive MLX failures; support backend override via `BRAINLAYER_ENRICH_BACKEND` environment variable

Applied to files:

  • README.md
📚 Learning: 2026-03-17T01:04:22.497Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-03-17T01:04:22.497Z
Learning: Applies to src/brainlayer/mcp/**/*.py and brain-bar/Sources/BrainBar/MCPRouter.swift: The 8 required MCP tools are `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_tags`. `brain_tags` is the 8th tool, replacing `brain_get_person`, as defined in the Phase B spec merged in PR `#72`. The Python MCP server already implements `brain_tags`. Legacy `brainlayer_*` aliases must be maintained for backward compatibility.

Applied to files:

  • README.md
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Be aware of known BrainLayer issues: DB locking during enrichment and WAL growth up to 4.7GB

Applied to files:

  • README.md
🔇 Additional comments (6)
README.md (1)

174-174: Groq override example looks good.

Explicitly showing BRAINLAYER_ENRICH_BACKEND=groq is clear and matches an opt-in backend flow.

src/brainlayer/store.py (1)

219-230: LGTM — correctly expands embedding backfill to include MCP-sourced chunks.

The SQL IN ('manual', 'mcp') clause is syntactically correct and the per-chunk error handling isolates failures. The function already runs in a dedicated background thread with its own VectorStore connection (per store_handler.py:336-357), so no new concurrency concerns are introduced.
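
A sketch of the widened backfill under those assumptions; the helper names and the embedding column are illustrative, not store.py's actual internals:

```python
import sqlite3
import struct

def _pack(vec: list[float]) -> bytes:
    # Naive float32 serialization, just for this illustration.
    return struct.pack(f"{len(vec)}f", *vec)

def embed_pending_chunks(conn: sqlite3.Connection, embed, batch_size: int = 256) -> int:
    """Backfill vectors for chunks that lack one. The IN clause is the
    change under review: previously only source='manual' was eligible."""
    rows = conn.execute(
        "SELECT id, text FROM chunks "
        "WHERE embedding IS NULL AND source IN ('manual', 'mcp') LIMIT ?",
        (batch_size,),
    ).fetchall()
    done = 0
    for chunk_id, text in rows:
        try:
            vec = embed(text)  # per-chunk call, so one failure cannot abort the batch
        except Exception:
            continue  # isolate failures, matching the error handling noted above
        conn.execute(
            "UPDATE chunks SET embedding = ? WHERE id = ?",
            (_pack(vec), chunk_id),
        )
        done += 1
    conn.commit()
    return done
```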

src/brainlayer/search_repo.py (4)

136-138: LGTM — consistent update across all search paths.

The SQL predicate (c.project = ? OR c.project IS NULL) correctly includes NULL-project entries when a project filter is specified. This change is consistently applied to semantic search, text LIKE search, and FTS5 search.


203-205: Consistent with semantic search update.


512-514: Consistent with semantic search update.


644-650: Post-RRF filter correctly aligned with SQL changes.

The tuple membership check not in (project_filter, None) ensures FTS-only results with project=None are retained, matching the behavior of the SQL predicate changes upstream.
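
For context, a minimal reciprocal-rank-fusion sketch paired with that NULL-tolerant post-filter; k=60 and the result shape are assumptions, not search_repo.py's actual code:

```python
def rrf_merge(semantic_ids: list[str], fts_ids: list[str], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank).
    scores: dict[str, float] = {}
    for ranked in (semantic_ids, fts_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def post_filter(results, project_filter):
    # Mirrors the SQL predicate: keep a row when its project matches the
    # filter or is None, i.e. drop it only if project not in (filter, None).
    if project_filter is None:
        return results
    return [r for r in results if r.project in (project_filter, None)]
```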

Comment thread: README.md

- **317,000+ chunks indexed** · **1,002 Python + 28 Swift tests** · **Real-time indexing hooks** · **9 MCP tools** · **BrainBar daemon (209KB)** · **Zero cloud dependencies**
+ **224,000+ chunks indexed** · **1,002 Python + 28 Swift tests** · **Real-time indexing hooks** · **9 MCP tools** · **BrainBar daemon (209KB)** · **Zero cloud dependencies**

⚠️ Potential issue | 🟠 Major

Update MCP tool count to match the current contract.

Line 15 says 9 MCP tools, but current Phase B spec requires 8 canonical tools with brain_tags replacing brain_get_person (which should be documented as a legacy alias only).

Based on learnings: The 8 required MCP tools are brain_search, brain_store, brain_recall, brain_entity, brain_expand, brain_update, brain_digest, brain_tags, with legacy aliases kept for backward compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 15, Update the README text that currently states "9 MCP
tools" to reflect the Phase B contract by changing it to "8 MCP tools" and
document the required canonical tool names: brain_search, brain_store,
brain_recall, brain_entity, brain_expand, brain_update, brain_digest,
brain_tags; also note that brain_get_person should be documented as a legacy
alias (backward compatibility) replaced by brain_tags.

Comment thread: README.md (outdated)
Comment on lines 164 to 170:
Three enrichment backends:

| Backend | Best for | Speed |
|---------|----------|-------|
| **Groq** (cloud) | Primary backend (March 2026+) | ~1-2s/chunk |
| **MLX** (Apple Silicon) | M1/M2/M3 Macs | 21-87% faster than Ollama |
| **Ollama** | Any platform | ~1s/chunk (short), ~13s (long) |

⚠️ Potential issue | 🟠 Major

Backend precedence is internally inconsistent in docs.

Declaring Groq as “Primary backend” conflicts with the README’s default behavior (auto-detect MLX on Apple Silicon, else Ollama) and local-first/zero-cloud positioning. This should be reframed as an optional cloud backend unless runtime default selection changed accordingly.

Based on learnings: Prioritize MLX on Apple Silicon, fall back to Ollama after 3 consecutive MLX failures, and allow override via BRAINLAYER_ENRICH_BACKEND.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 164 - 170, Update the README's backend precedence to
reframe Groq as an optional cloud backend (not the default) and document the
runtime selection: auto-detect MLX on Apple Silicon, fall back to Ollama after 3
consecutive MLX failures, and allow explicit override via the
BRAINLAYER_ENRICH_BACKEND environment variable; also update any wording that
currently calls Groq the “Primary backend” to avoid implying it is the default.
Ensure the README mentions the failure-count fallback behavior and the override
env var so docs match the implemented auto-detect logic.

Reframe Groq as fallback (not primary) to match auto-detect logic:
MLX → Ollama → Groq. Addresses CodeRabbit review on #95.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
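
A hedged sketch of the selection order that reframing implies (env override first, then MLX on Apple Silicon with an Ollama fallback after 3 consecutive failures, Groq last); the function and parameter names are illustrative:

```python
import os
import platform

MLX_FAILURE_LIMIT = 3  # consecutive failures before falling back to Ollama

def pick_enrich_backend(mlx_consecutive_failures: int = 0,
                        ollama_available: bool = True) -> str:
    # 1. Explicit override wins, e.g. BRAINLAYER_ENRICH_BACKEND=groq.
    override = os.environ.get("BRAINLAYER_ENRICH_BACKEND")
    if override:
        return override
    # 2. Prefer MLX on Apple Silicon until it has failed repeatedly.
    on_apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
    if on_apple_silicon and mlx_consecutive_failures < MLX_FAILURE_LIMIT:
        return "mlx"     # Qwen2.5-Coder-14B-Instruct-4bit, port 8080
    # 3. Ollama is the universal local fallback.
    if ollama_available:
        return "ollama"  # glm-4.7-flash, port 11434
    # 4. Groq as the cloud fallback of last resort, per this commit.
    return "groq"
```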
@EtanHey EtanHey merged commit 4af55ff into main Mar 19, 2026
6 checks passed
@EtanHey EtanHey deleted the fix/search-quality-null-project branch March 19, 2026 07:01
