fix: Phase 3 core fixes — DB paths, date filtering, search metadata #1
Critical fixes:
- Centralize DB path resolution in paths.py (fixes empty DB bug)
- All modules now resolve to ~/.local/share/zikaron/zikaron.db
- ENV override: BRAINLAYER_DB for custom paths

New features:
- Date filtering: created_at column, date_from/date_to search params
- Backfill script for existing chunks (searches archives)
- Search results include created_at and source metadata
- Project name normalization (decode Claude path encoding)
- Sentence-aware text chunking for oversized paragraphs
- Comprehensive data locations documentation

Files: 10 modified, 3 new (paths.py, backfill-created-at.py, data-locations.md)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
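The sentence-aware chunking item can be sketched like this (a minimal illustration; the function name, the 200-character limit, and the boundary regex are assumptions, not the PR's actual code):

```python
import re

def split_oversized(paragraph: str, max_chars: int = 200) -> list[str]:
    """Split a paragraph that exceeds max_chars on sentence boundaries,
    packing consecutive sentences into chunks that stay under the limit."""
    if len(paragraph) <= max_chars:
        return [paragraph]
    # Naive sentence boundary: sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', paragraph)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Packing consecutive sentences keeps each chunk as large as possible without crossing the limit, so embeddings see coherent spans instead of mid-sentence cuts; a single sentence longer than the limit is kept whole here.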
Cursor Bugbot has reviewed your changes and found 3 potential issues.
```python
if os.path.isdir(candidate_path):
    return candidate_name
# Fallback: return first segment (best guess)
return remaining[0]
```
Filesystem fallback truncates multi-segment project names
Medium Severity
normalize_project_name falls back to remaining[0] (only the first segment after "Gits") when the os.path.isdir() filesystem lookup fails. For multi-segment project names like rudy-monorepo, this returns "rudy" instead of "rudy-monorepo" when the original directory path no longer exists. Since the PR documents that repos moved from ~/Desktop/Gits/ to ~/Gits/, old encoded paths will fail the lookup against the old gits_dir and produce truncated names, breaking project filtering.
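One hedged sketch of a fix (not the PR's code; the probing strategy and dash-joining are assumptions about Claude's path encoding): probe progressively longer dash-joined prefixes against the filesystem, and when nothing matches, fall back to the full dashed name rather than the first segment.

```python
import os

def resolve_project_name(remaining: list[str], gits_dir: str) -> str:
    """Resolve an encoded project name from path segments under gits_dir.

    Claude's encoding replaces '/' with '-', so a dashed directory name like
    'rudy-monorepo' is indistinguishable from two nested segments. Probe the
    filesystem with progressively longer dash-joined prefixes; if nothing
    matches (e.g. the repo moved), fall back to joining ALL segments rather
    than truncating to the first one.
    """
    for end in range(len(remaining), 0, -1):
        candidate = "-".join(remaining[:end])
        if os.path.isdir(os.path.join(gits_dir, candidate)):
            return candidate
    # Filesystem lookup failed: keep the full dashed name (best guess)
    # instead of the truncated remaining[0].
    return "-".join(remaining)
```

With this fallback, a moved rudy-monorepo still normalizes to "rudy-monorepo" instead of the truncated "rudy".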
```python
filter_params.append(date_from)
if date_to:
    where_clauses.append("c.created_at <= ?")
    filter_params.append(date_to)
```
Date-only date_to excludes entries from that day
Medium Severity
The date_to filter uses created_at <= ? with string comparison, but the MCP tool schema encourages date-only values like '2026-02-19'. Since created_at stores full ISO 8601 timestamps (e.g. "2026-02-19T10:30:00+00:00"), and in lexicographic ordering "2026-02-19T..." > "2026-02-19", all entries actually on the date_to date are excluded. date_from is unaffected because the inequality goes the other direction.
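The boundary-day exclusion and one possible fix fit in a few lines (the padding helper is an assumption, not the PR's code, and it assumes timestamps are stored with a +00:00 offset):

```python
# Demonstrates the boundary-day exclusion described above.
ts = "2026-02-19T10:30:00+00:00"  # stored created_at (full ISO 8601)
date_to = "2026-02-19"            # date-only value from the MCP tool

# Lexicographically the timestamp sorts AFTER the bare date, so
# "created_at <= ?" drops every entry actually on the boundary day:
assert ts > date_to

# One possible fix (an assumption, not the PR's actual code): pad
# date-only upper bounds to the end of the day before binding them.
def pad_date_to(value: str) -> str:
    """Extend a YYYY-MM-DD upper bound so it covers the whole day."""
    if len(value) == 10:  # date-only, no time component
        return value + "T23:59:59.999999+00:00"
    return value

assert ts <= pad_date_to(date_to)  # boundary-day entries now included
```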
Additional Locations (2)
```python
import re
name = re.sub(r'-(?:nightshift|haiku|worktree)-\d+$', '', name)

return name
```
Duplicate project normalization diverges from CLI implementation
Low Severity
normalize_project_name in mcp/__init__.py reimplements project name normalization that already exists in cli/__init__.py (_clean_project_name + _normalize_project_name). The two implementations differ: the CLI version handles additional path markers (Desktop, projects, config), supports monorepo package mappings and project aliases, and has no filesystem dependency. Having two divergent implementations risks inconsistent behavior between indexing (CLI) and searching (MCP).
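One hedged remedy (the module path and function body here are hypothetical): extract a single canonical normalizer that both entry points import, keeping the CLI's extra mappings as layers on top of the shared core.

```python
# Hypothetical shared module, e.g. src/brainlayer/project_names.py.
import re

def normalize_project_name(name: str) -> str:
    """Single source of truth for project-name normalization, imported by
    both the CLI (indexing) and the MCP server (searching) so the two
    code paths cannot drift apart. No filesystem dependency."""
    # Strip worktree/agent suffixes like "-worktree-2" (from the diff above).
    return re.sub(r'-(?:nightshift|haiku|worktree)-\d+$', '', name)
```

cli/__init__.py and mcp/__init__.py would then both do `from brainlayer.project_names import normalize_project_name`, with monorepo package mappings and aliases applied on top in the CLI only.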
For ~160K chunks whose source JSONL files no longer exist (pre-archiver era), estimate dates using rowid proximity to chunks with known dates. Chunks are indexed sequentially, so nearby rowids correlate with similar timestamps.

Result: 268,864/268,864 chunks (100%) now have created_at.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BrainBar MCP entries had source='mcp', project=NULL, created_at=NULL —
invisible to both semantic and keyword search due to strict project filter.
- search_repo.py: change `c.project = ?` to `(c.project = ? OR c.project IS NULL)`
in 4 locations (semantic, FTS5, text search, post-RRF filter)
- store.py: expand embed_pending_chunks to `source IN ('manual', 'mcp')`
so BrainBar entries get vector embeddings
- README.md: add Groq as primary enrichment backend, update chunk count to 224K
Verified: 4/5 test queries now surface BrainBar entries as #1 result (was 0/5).
923 tests pass, 0 regressions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
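The filter change can be demonstrated against a toy schema (an illustration, not the real chunks table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (text TEXT, project TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [
        ("indexed note",  "golems", "claude_code"),
        ("brainbar note", None,     "mcp"),  # project=NULL: invisible before
    ],
)

# Old strict filter: NULL never satisfies `=`, so the MCP entry vanishes.
strict = conn.execute(
    "SELECT text FROM chunks c WHERE c.project = ?", ("golems",)
).fetchall()

# Fixed filter: NULL-project rows are kept alongside the project match.
fixed = conn.execute(
    "SELECT text FROM chunks c WHERE (c.project = ? OR c.project IS NULL)",
    ("golems",),
).fetchall()
```

The relaxation has to be applied at each of the 4 locations the commit lists; NULL-project rows drop out again at whichever filter is missed.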
* fix: include NULL-project entries in search + embed BrainBar chunks

* docs: clarify enrichment backend precedence in README

  Reframe Groq as fallback (not primary) to match auto-detect logic:
  MLX → Ollama → Groq. Addresses CodeRabbit review on #95.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes:
- Removed merge conflict marker from BUGBOT_REVIEW_FTS_RECALL.md
- Added critical bug warning to exact chunk-ID bypass section

Final Status Report:
- 3 critical issues confirmed by Cursor Bugbot (independent verification)
- Issue #1 (P0): Cross-project data leakage via exact bypass
- Issue #2 (P2): Recall regression from phrase matching
- Issue #3 (P0): Trigram-only results bypass filters

Verdict: MERGE BLOCKED until critical issues fixed

Credit: Issues independently confirmed by Cursor's own Bugbot system with actionable fix links provided

Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>
* fix: improve FTS recall across exact IDs and aliases

* docs: add Bugbot review for FTS recall hardening

  - Comprehensive review of retrieval correctness across 3 layers
  - Write safety analysis for schema migrations
  - MCP stability verification
  - Performance observations (storage +1.8GB, query latency)
  - Edge case analysis and recommendations
  - Approve with confidence for merge

  Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>

* fix: default exact chunk lookup project label

* fix: preserve search routing and trigram repair semantics

* docs: update Bugbot review - fix markdown issues and add re-review addendum

  Fixes:
  - Corrected chunk-id regex false positive examples (whitespace issue)
  - Fixed markdown heading spacing (MD022, MD031 compliance)
  - Added blank lines around headings and code blocks

  Re-review addendum (commit bcddd14):
  - Reviewed 6 additional fixes since initial review
  - All fixes approved: KG error handling, trigram repair, lifecycle filtering, sender/language filters
  - Critical correctness fixes: exact bypass lifecycle + FTS filter completeness
  - Updated verdict: APPROVED with increased confidence
  - Production-ready, ship with confidence

  Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>

* docs: add Bugbot re-review summary

  Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>

* fix: satisfy lint and review follow-ups

* docs: Bugbot critical issues - 3 bugs must be fixed before merge

  Critical Issues Identified:
  1. P0: Trigram-only results bypass post-RRF filters (search_repo.py:1141)
  2. P1: Exact chunk-ID bypass ignores project/filter scope (search_handler.py:389)
  3. P2: Alias expansion breaks FTS token-level semantics (search_handler.py:131)

  All three issues represent real correctness bugs:
  - Cross-project data leakage via exact bypass
  - Filter contract violations for trigram hits
  - Potential recall regression on multi-word queries

  Verdict: APPROVE WITH MANDATORY FIXES
  Fixes are straightforward (5-15 min each), must be completed before merge

  Credit: Issues identified by Macroscope and Codex reviews

  Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>

* fix: harden scoped exact-id and alias expansion search

* docs: fix merge conflict and add final Bugbot status report

* style: format alias expansion regression test

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>


Summary
- paths.py — all 8 modules now resolve to the correct ~/.local/share/zikaron/zikaron.db instead of the empty brainlayer.db
- created_at column with date_from/date_to search params in MCP tool, plus a backfill script that searches archives
- Search results include created_at timestamps and source (claude_code/whatsapp/youtube)
- Project name normalization (-Users-etanheyman-Gits-golems → golems) for cleaner filtering
- docs/data-locations.md — where all data lives, archive strategy, migration history

Backfill Results
Ran scripts/backfill-created-at.py — 107,935/268,864 chunks (40.1%) now have created_at.

Test plan
- date_from/date_to params
- BRAINLAYER_DB env var override works

🤖 Generated with Claude Code
Note
Medium Risk
Touches core persistence/search paths and extends the SQLite schema with a new indexed created_at field, which can affect query results and indexing behavior across multiple entrypoints. Risk is mitigated by additive schema changes and backfill tooling, but rollout should verify searches still hit the intended DB and filters behave as expected.

Overview
Centralizes DB path resolution via new src/brainlayer/paths.py (supports BRAINLAYER_DB override and prefers the legacy ~/.local/share/zikaron/zikaron.db when present), and updates daemon/dashboard/MCP/pipelines/scripts to import DEFAULT_DB_PATH instead of hardcoding paths.

Adds temporal metadata and filtering by introducing a created_at column + index on chunks, populating it on new ingests (index_new.py), surfacing it (and source) in search metadata, and extending MCP brainlayer_search with date_from/date_to that flow through VectorStore.search/hybrid_search.

Adds a new scripts/backfill-created-at.py to backfill timestamps from existing metadata, session JSONL files (including archives/manifests), file mtimes, and a final rowid-based estimate pass; also improves text chunking to split oversized paragraphs on sentence boundaries and documents data/archive locations in docs/data-locations.md.

Written by Cursor Bugbot for commit 68869bb. This will update automatically on new commits.
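The path-resolution precedence from the overview can be sketched as follows (a hedged reconstruction, not the actual paths.py; the non-legacy default location is an assumption):

```python
import os
from pathlib import Path

def resolve_db_path() -> Path:
    """Resolve the SQLite DB path with the precedence this PR describes:
    1. the BRAINLAYER_DB environment override, 2. the legacy zikaron.db
    when it already exists, 3. a default location (assumed here)."""
    override = os.environ.get("BRAINLAYER_DB")
    if override:
        return Path(override).expanduser()
    legacy = Path("~/.local/share/zikaron/zikaron.db").expanduser()
    if legacy.exists():
        return legacy
    # Fallback default is an assumption; the summary only names the legacy path.
    return Path("~/.local/share/brainlayer/brainlayer.db").expanduser()

DEFAULT_DB_PATH = resolve_db_path()
```

Every entrypoint importing DEFAULT_DB_PATH from one module is what prevents the empty-DB bug from recurring: there is exactly one place where the answer can change.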