fix: Phase 3 core fixes — DB paths, date filtering, search metadata#1

Merged
EtanHey merged 2 commits into main from fix/db-path-resolution
Feb 19, 2026

@EtanHey EtanHey commented Feb 19, 2026

Summary

  • DB path fix: Centralized path resolution in paths.py — all 8 modules now resolve to the correct ~/.local/share/zikaron/zikaron.db instead of the empty brainlayer.db
  • Date filtering: Added a created_at column with date_from/date_to search params in the MCP tool, plus a backfill script that searches archives
  • Search metadata: Results now include created_at timestamps and source (claude_code/whatsapp/youtube)
  • Project normalization: Decode Claude Code path encoding (-Users-etanheyman-Gits-golems → golems) for cleaner filtering
  • Chunker fix: Sentence-aware splitting for oversized paragraphs instead of mid-sentence truncation
  • Data documentation: Comprehensive docs/data-locations.md — where all data lives, archive strategy, migration history

Backfill Results

Ran scripts/backfill-created-at.py — 107,935/268,864 chunks (40.1%) now have created_at:

  • 35,339 from metadata timestamps (WhatsApp/YouTube)
  • 72,596 from archived JSONL session files
  • 160,929 remaining are pre-archiver sessions whose JSONL files no longer exist

Test plan

  • Verify MCP search works with date_from/date_to params
  • Verify project normalization in search results
  • Verify BRAINLAYER_DB env var override works
  • Run backfill script on fresh DB to verify archive path scanning

🤖 Generated with Claude Code


Note

Medium Risk
Touches core persistence/search paths and extends the SQLite schema with a new indexed created_at field, which can affect query results and indexing behavior across multiple entrypoints. Risk is mitigated by additive schema changes and backfill tooling, but rollout should verify searches still hit the intended DB and filters behave as expected.

Overview
Centralizes DB path resolution via new src/brainlayer/paths.py (supports BRAINLAYER_DB override and prefers the legacy ~/.local/share/zikaron/zikaron.db when present), and updates daemon/dashboard/MCP/pipelines/scripts to import DEFAULT_DB_PATH instead of hardcoding paths.
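
The resolution order described above (environment override, then the legacy file if it exists, then a default) could be sketched roughly as follows. This is not the contents of paths.py: the function name `resolve_db_path` and the `NEW_DB` location are assumptions made for illustration; only `BRAINLAYER_DB`, `DEFAULT_DB_PATH`, and the legacy zikaron path come from the PR text.

```python
import os
from pathlib import Path

# Legacy path is taken from the PR description.
LEGACY_DB = Path.home() / ".local" / "share" / "zikaron" / "zikaron.db"
# Hypothetical new-style location; the PR only names the file "brainlayer.db".
NEW_DB = Path.home() / ".local" / "share" / "brainlayer" / "brainlayer.db"


def resolve_db_path(env=None, legacy=LEGACY_DB, default=NEW_DB):
    """Return the DB path: env override first, then legacy if present, else default."""
    env = os.environ if env is None else env
    override = env.get("BRAINLAYER_DB")
    if override:
        return Path(override).expanduser()
    if legacy.exists():
        return legacy
    return default


# Modules would import this constant instead of hardcoding a path.
DEFAULT_DB_PATH = resolve_db_path()
```

Passing `env`, `legacy`, and `default` as parameters is a design choice of this sketch so the precedence can be tested without touching the real home directory.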

Adds temporal metadata and filtering by introducing a created_at column + index on chunks, populating it on new ingests (index_new.py), surfacing it (and source) in search metadata, and extending MCP brainlayer_search with date_from/date_to that flow through VectorStore.search/hybrid_search.
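
An additive schema change of this kind can be made idempotent so that every entrypoint may run it safely. The sketch below assumes a `chunks` table with a TEXT `created_at` column; the function name `ensure_created_at` and the index name `idx_chunks_created_at` are hypothetical, not taken from the PR.

```python
import sqlite3


def ensure_created_at(conn):
    """Add chunks.created_at and its index if they are missing (idempotent)."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(chunks)")}
    if "created_at" not in columns:
        conn.execute("ALTER TABLE chunks ADD COLUMN created_at TEXT")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_chunks_created_at ON chunks(created_at)"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY, content TEXT)")
ensure_created_at(conn)
ensure_created_at(conn)  # second call changes nothing
```

Existing rows keep `created_at` as NULL after the ALTER, which is why a separate backfill step is needed.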

Adds a new scripts/backfill-created-at.py to backfill timestamps from existing metadata, session JSONL files (including archives/manifests), file mtimes, and a final rowid-based estimate pass; also improves text chunking to split oversized paragraphs on sentence boundaries and documents data/archive locations in docs/data-locations.md.
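
Splitting an oversized paragraph on sentence boundaries can be illustrated with a greedy packer. This is a minimal sketch under assumptions, not the project's chunker: the function name `split_on_sentences`, the `max_chars` parameter, and the simple punctuation-based sentence regex are all hypothetical.

```python
import re


def split_on_sentences(paragraph, max_chars):
    """Greedily pack whole sentences into pieces of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    pieces, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces


print(split_on_sentences("One two. Three four. Five six.", 20))
# → ['One two. Three four.', 'Five six.']
```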

Written by Cursor Bugbot for commit 68869bb.

Critical fixes:
- Centralize DB path resolution in paths.py (fixes empty DB bug)
- All modules now resolve to ~/.local/share/zikaron/zikaron.db
- ENV override: BRAINLAYER_DB for custom paths

New features:
- Date filtering: created_at column, date_from/date_to search params
- Backfill script for existing chunks (searches archives)
- Search results include created_at and source metadata
- Project name normalization (decode Claude path encoding)
- Sentence-aware text chunking for oversized paragraphs
- Comprehensive data locations documentation

Files: 10 modified, 3 new (paths.py, backfill-created-at.py, data-locations.md)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

    if os.path.isdir(candidate_path):
        return candidate_name
# Fallback: return first segment (best guess)
return remaining[0]

Filesystem fallback truncates multi-segment project names

Medium Severity

normalize_project_name falls back to remaining[0] (only the first segment after "Gits") when the os.path.isdir() filesystem lookup fails. For multi-segment project names like rudy-monorepo, this returns "rudy" instead of "rudy-monorepo" when the original directory path no longer exists. Since the PR documents that repos moved from ~/Desktop/Gits/ to ~/Gits/, old encoded paths will fail the lookup against the old gits_dir and produce truncated names, breaking project filtering.
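
One possible way to avoid the truncation is to try the longest on-disk match first and, when nothing matches, keep every segment after "Gits" instead of only the first. The sketch below is an illustration under assumptions, not the fix that was applied: the function name, its signature, and the choice of fallback are hypothetical.

```python
import os


def normalize_project_name_sketch(encoded, gits_dir):
    """Decode a '-'-joined path such as '-Users-me-Gits-rudy-monorepo'."""
    segments = [s for s in encoded.split("-") if s]
    if "Gits" not in segments:
        return encoded
    remaining = segments[segments.index("Gits") + 1:]
    if not remaining:
        return encoded
    # Prefer the longest prefix that exists on disk, e.g. 'rudy-monorepo'.
    for end in range(len(remaining), 0, -1):
        candidate_name = "-".join(remaining[:end])
        if os.path.isdir(os.path.join(gits_dir, candidate_name)):
            return candidate_name
    # Assumed fallback: keep every segment rather than only remaining[0].
    return "-".join(remaining)
```

The trade-off is that an encoded path pointing into a subdirectory of a repo that no longer exists would keep the subdirectory segments in the name, so this fallback over-includes where the original under-includes.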

    filter_params.append(date_from)
if date_to:
    where_clauses.append("c.created_at <= ?")
    filter_params.append(date_to)

Date-only date_to excludes entries from that day

Medium Severity

The date_to filter uses created_at <= ? with string comparison, but the MCP tool schema encourages date-only values like '2026-02-19'. Since created_at stores full ISO 8601 timestamps (e.g. "2026-02-19T10:30:00+00:00"), and in lexicographic ordering "2026-02-19T..." > "2026-02-19", all entries actually on the date_to date are excluded. date_from is unaffected because the inequality goes the other direction.
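
The string comparison at issue, and one way to make a date-only upper bound inclusive, can be shown in a few lines. The helper name `inclusive_date_to` is hypothetical, and the sketch assumes all stored timestamps share one UTC offset so that lexicographic order matches chronological order.

```python
from datetime import date, timedelta


def inclusive_date_to(date_to):
    """Turn a date-only upper bound into an exclusive next-day bound."""
    if len(date_to) == 10:  # 'YYYY-MM-DD'
        next_day = date.fromisoformat(date_to) + timedelta(days=1)
        return "c.created_at < ?", next_day.isoformat()
    # A full timestamp is already precise, so keep the original comparison.
    return "c.created_at <= ?", date_to


stamp = "2026-02-19T10:30:00+00:00"
print(stamp <= "2026-02-19")  # → False: the original filter drops this row
clause, bound = inclusive_date_to("2026-02-19")
print(clause, bound, stamp < bound)
```

Comparing against the start of the next day avoids having to guess a last-instant suffix such as "T23:59:59" for the bound.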

Additional Locations (2)

import re
name = re.sub(r'-(?:nightshift|haiku|worktree)-\d+$', '', name)

return name

Duplicate project normalization diverges from CLI implementation

Low Severity

normalize_project_name in mcp/__init__.py reimplements project name normalization that already exists in cli/__init__.py (_clean_project_name + _normalize_project_name). The two implementations differ: the CLI version handles additional path markers (Desktop, projects, config), supports monorepo package mappings and project aliases, and has no filesystem dependency. Having two divergent implementations risks inconsistent behavior between indexing (CLI) and searching (MCP).

For ~160K chunks whose source JSONL files no longer exist (pre-archiver
era), estimate dates using rowid proximity to chunks with known dates.
Chunks are indexed sequentially, so nearby rowids correlate with similar
timestamps.

Result: 268,864/268,864 chunks (100%) now have created_at.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
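
The commit message does not say how rowid proximity is turned into a date. A simple reading is a nearest-neighbour lookup, sketched below as an assumption rather than the script's actual method; the function name `estimate_created_at` and the tuple layout are hypothetical.

```python
from bisect import bisect_left


def estimate_created_at(rowid, known):
    """Return the timestamp of the known row whose rowid is closest.

    `known` is a list of (rowid, iso_timestamp) tuples sorted by rowid.
    """
    if not known:
        return None
    rowids = [r for r, _ in known]
    pos = bisect_left(rowids, rowid)
    if pos == 0:
        return known[0][1]
    if pos == len(known):
        return known[-1][1]
    before, after = known[pos - 1], known[pos]
    # Ties go to the earlier row.
    return before[1] if rowid - before[0] <= after[0] - rowid else after[1]
```

Dates produced this way are estimates, so a date-filtered search may include or exclude such chunks near the edge of the range.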
@EtanHey EtanHey merged commit 700f829 into main Feb 19, 2026
0 of 4 checks passed
@EtanHey EtanHey deleted the fix/db-path-resolution branch February 19, 2026 20:22
EtanHey added a commit that referenced this pull request Mar 19, 2026
BrainBar MCP entries had source='mcp', project=NULL, created_at=NULL —
invisible to both semantic and keyword search due to strict project filter.

- search_repo.py: change `c.project = ?` to `(c.project = ? OR c.project IS NULL)`
  in 4 locations (semantic, FTS5, text search, post-RRF filter)
- store.py: expand embed_pending_chunks to `source IN ('manual', 'mcp')`
  so BrainBar entries get vector embeddings
- README.md: add Groq as primary enrichment backend, update chunk count to 224K

Verified: 4/5 test queries now surface BrainBar entries as #1 result (was 0/5).
923 tests pass, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
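
The effect of the relaxed project filter can be reproduced with an in-memory table. In SQL, `NULL = ?` evaluates to NULL rather than true, so a strict equality filter silently drops rows whose project is NULL. The table layout and the helper `ids_for` below are simplified assumptions, not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id TEXT, project TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [
        ("a", "golems", "claude_code"),
        ("b", None, "mcp"),  # entry with project=NULL
        ("c", "other", "claude_code"),
    ],
)


def ids_for(where):
    """Run the project filter for 'golems' and return matching ids."""
    query = f"SELECT c.id FROM chunks c WHERE {where} ORDER BY c.id"
    return [row[0] for row in conn.execute(query, ("golems",))]


strict = ids_for("c.project = ?")
relaxed = ids_for("(c.project = ? OR c.project IS NULL)")
print(strict, relaxed)  # → ['a'] ['a', 'b']
```

Note that the relaxed form returns NULL-project rows for every project filter, which is the intended behaviour here but widens each scoped search.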
EtanHey added a commit that referenced this pull request Mar 19, 2026
)

* fix: include NULL-project entries in search + embed BrainBar chunks

BrainBar MCP entries had source='mcp', project=NULL, created_at=NULL —
invisible to both semantic and keyword search due to strict project filter.

- search_repo.py: change `c.project = ?` to `(c.project = ? OR c.project IS NULL)`
  in 4 locations (semantic, FTS5, text search, post-RRF filter)
- store.py: expand embed_pending_chunks to `source IN ('manual', 'mcp')`
  so BrainBar entries get vector embeddings
- README.md: add Groq as primary enrichment backend, update chunk count to 224K

Verified: 4/5 test queries now surface BrainBar entries as #1 result (was 0/5).
923 tests pass, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: clarify enrichment backend precedence in README

Reframe Groq as fallback (not primary) to match auto-detect logic:
MLX → Ollama → Groq. Addresses CodeRabbit review on #95.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cursor Bot pushed a commit that referenced this pull request Apr 30, 2026
Fixes:
- Removed merge conflict marker from BUGBOT_REVIEW_FTS_RECALL.md
- Added critical bug warning to exact chunk-ID bypass section

Final Status Report:
- 3 critical issues confirmed by Cursor Bugbot (independent verification)
- Issue #1 (P0): Cross-project data leakage via exact bypass
- Issue #2 (P2): Recall regression from phrase matching
- Issue #3 (P0): Trigram-only results bypass filters

Verdict: MERGE BLOCKED until critical issues fixed

Credit: Issues independently confirmed by Cursor's own Bugbot system
with actionable fix links provided

Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>
EtanHey added a commit that referenced this pull request Apr 30, 2026
* fix: improve FTS recall across exact IDs and aliases

* docs: add Bugbot review for FTS recall hardening

- Comprehensive review of retrieval correctness across 3 layers
- Write safety analysis for schema migrations
- MCP stability verification
- Performance observations (storage +1.8GB, query latency)
- Edge case analysis and recommendations
- Approve with confidence for merge

Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>

* fix: default exact chunk lookup project label

* fix: preserve search routing and trigram repair semantics

* docs: update Bugbot review - fix markdown issues and add re-review addendum

Fixes:
- Corrected chunk-id regex false positive examples (whitespace issue)
- Fixed markdown heading spacing (MD022, MD031 compliance)
- Added blank lines around headings and code blocks

Re-review addendum (commit bcddd14):
- Reviewed 6 additional fixes since initial review
- All fixes approved: KG error handling, trigram repair, lifecycle filtering, sender/language filters
- Critical correctness fixes: exact bypass lifecycle + FTS filter completeness
- Updated verdict: APPROVED with increased confidence
- Production-ready, ship with confidence

Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>

* docs: add Bugbot re-review summary

Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>

* fix: satisfy lint and review follow-ups

* docs: Bugbot critical issues - 3 bugs must be fixed before merge

Critical Issues Identified:
1. P0: Trigram-only results bypass post-RRF filters (search_repo.py:1141)
2. P1: Exact chunk-ID bypass ignores project/filter scope (search_handler.py:389)
3. P2: Alias expansion breaks FTS token-level semantics (search_handler.py:131)

All three issues represent real correctness bugs:
- Cross-project data leakage via exact bypass
- Filter contract violations for trigram hits
- Potential recall regression on multi-word queries

Verdict: APPROVE WITH MANDATORY FIXES
Fixes are straightforward (5-15 min each), must be completed before merge

Credit: Issues identified by Macroscope and Codex reviews

Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>

* fix: harden scoped exact-id and alias expansion search

* docs: fix merge conflict and add final Bugbot status report

Fixes:
- Removed merge conflict marker from BUGBOT_REVIEW_FTS_RECALL.md
- Added critical bug warning to exact chunk-ID bypass section

Final Status Report:
- 3 critical issues confirmed by Cursor Bugbot (independent verification)
- Issue #1 (P0): Cross-project data leakage via exact bypass
- Issue #2 (P2): Recall regression from phrase matching
- Issue #3 (P0): Trigram-only results bypass filters

Verdict: MERGE BLOCKED until critical issues fixed

Credit: Issues independently confirmed by Cursor's own Bugbot system
with actionable fix links provided

Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>

* style: format alias expansion regression test

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>