feat: diarization pipeline + diarized transcript re-indexing #34
Conversation
- Add scripts/diarize_episodes.py: batch pipeline for WhisperX speaker
diarization of Huberman guest episodes (download → transcribe → align →
diarize → map speakers → save JSON → re-index)
- Add --diarized-transcript and --replace flags to index_youtube.py for
re-indexing with speaker-labeled transcripts
- Fix speaker mapping: content-based detection ("I'm Andrew Huberman")
instead of speaking-time-only heuristic which produced inverted labels
- Add data/ to .gitignore (audio WAVs, diarized JSONs)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
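The content-based speaker mapping described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function name `map_speakers`, the segment dict shape, and the 120-second intro window are assumptions based on the PR description and review comments.

```python
def map_speakers(segments, intro_phrases=("i'm andrew huberman",), window_s=120):
    """Label diarized SPEAKER_XX segments as Host/Guest by scanning the
    first couple of minutes for a self-introduction, instead of relying
    on a speaking-time heuristic (which can invert the labels)."""
    host_label = None
    for seg in segments:
        if seg.get("start", 0) > window_s:
            break  # only inspect the intro window
        text = seg.get("text", "").lower()
        if any(phrase in text for phrase in intro_phrases):
            host_label = seg.get("speaker")
            break
    labeled = []
    for seg in segments:
        is_host = host_label is not None and seg.get("speaker") == host_label
        labeled.append({**seg, "speaker_name": "Host" if is_host else "Guest"})
    return labeled
```

The key design choice is that the intro phrase, not total talk time, anchors the Host label; in interview formats the guest often speaks more than the host.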
📝 Walkthrough: Adds a speaker-diarization pipeline and indexing support: a new script to download audio, transcribe, diarize, map speakers, and re-index episodes; the indexer is extended to accept diarized transcripts and to delete/replace existing video chunks; plus a .gitignore update to ignore `data/`.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as User CLI
    participant YTDL as Audio Downloader (yt-dlp)
    participant YouTube as YouTube
    participant WhisperX as WhisperX (transcribe & align)
    participant Pyannote as Pyannote (diarize)
    participant Mapper as Speaker Mapper
    participant Indexer as Indexing Script
    participant Store as VectorStore
    CLI->>YTDL: request audio for video_id
    YTDL->>YouTube: fetch media
    YouTube-->>YTDL: audio file (wav)
    YTDL->>WhisperX: transcribe & align audio
    WhisperX-->>YTDL: aligned segments
    YTDL->>Pyannote: diarize audio
    Pyannote-->>YTDL: segments with SPEAKER_00/01
    YTDL->>Mapper: map SPEAKER_XX -> identities
    Mapper-->>YTDL: segments labeled (Host/Guest)
    YTDL->>Indexer: send diarized segments for chunking
    Indexer->>Store: embed & upsert chunks (or delete if replace)
    Store-->>Indexer: confirm upsert/delete
    Indexer-->>CLI: re-indexing complete
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Possibly related PRs
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/diarize_episodes.py`:
- Around line 195-204: The inline comment "Only check first 60 seconds" is
inconsistent with the conditional that uses seg.get("start", 0) > 120; update
either the comment or the conditional for consistency — e.g., change the
threshold in the loop from 120 to 60 (seg.get("start", 0) > 60) so the code
checks the first 60 seconds, or adjust the comment to "Only check first 120
seconds" if 2 minutes is intended; locate the block using intro_phrases and the
loop over segments to make the change.
- Around line 138-140: The cache directory path is built incorrectly causing a
duplicated "data" segment; change the cache_dir computation to remove the extra
"data" component so it points to data/aligned relative to the audio file (i.e.,
set cache_dir using audio_path.parent.parent / "aligned" instead of
audio_path.parent.parent / "data" / "aligned"); update any related uses of
cache_dir/cache_path (variables cache_dir and cache_path) to match this
corrected path and ensure it still works when audio_path is like
data/audio/VIDEO_ID.wav.
- Around line 267-268: Replace the hardcoded virtualenv Python path used in the
variable venv_python with the current interpreter by using sys.executable;
locate the variables script_path and venv_python in diarize_episodes.py and set
venv_python to sys.executable (import sys at top if missing) so the script
invokes the same Python that's running this module rather than assuming
".venv/bin/python3".
In `@scripts/index_youtube.py`:
- Around line 724-753: The block in index_youtube.py duplicates embedding/upsert
logic from index_single_video; extract a helper function (e.g.
_embed_and_upsert_chunks(chunks, video_id, project, upload_date, store)) that
encapsulates calling embed_chunks, building chunk_data entries (use source_file
= f"youtube:{video_id}", include metadata upload_date and created_at
formatting), collecting embeddings, and calling store.upsert_chunks; replace the
duplicated loops in both index_single_video and the current block with calls to
this new helper to remove duplication and keep behavior identical (preserve
fields: id, content, metadata, source_file, project, content_type, value_type,
char_count, source, conversation_id, position, created_at).
- Around line 697-698: The video ID extraction only matches standard query URLs
and falls back to the whole URL; update the logic around video_id_match/vid to
also detect youtu.be short links and extract their path segment as the ID (e.g.,
check for hostname "youtu.be" or match r"(?:youtu\.be/)([A-Za-z0-9_-]+)"), or
parse args.url with urllib.parse to handle both formats and set vid to the
extracted ID rather than the raw URL (adjust the variables video_id_match and
vid accordingly).
- Around line 639-657: The delete_video_chunks function runs DELETE statements
on store.conn via cursor but never commits, so ensure the deletions are
persisted by calling a commit on the same connection after the DELETEs complete
(e.g., call store.conn.commit() right after the second cursor.execute and before
returning); locate the function delete_video_chunks and add the commit using the
existing store.conn (or use a transaction/context manager if preferred) so the
DELETEs to chunk_vectors and chunks are durable.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- .gitignore
- scripts/diarize_episodes.py
- scripts/index_youtube.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Cursor Bugbot
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/index_youtube.py (1)
- src/brainlayer/vector_store.py (1): upsert_chunks (490-549)

scripts/diarize_episodes.py (4)
- src/brainlayer/storage.py (1): exists (64-66)
- src/brainlayer/pipeline/unified_timeline.py (1): load (561-569)
- src/brainlayer/pipeline/semantic_style.py (1): model (157-162)
- scripts/index_youtube.py (1): main (660-806)
🔇 Additional comments (8)
.gitignore (1)
28-28: LGTM! Ignoring `data/` is appropriate for excluding generated artifacts (audio WAVs, diarized JSONs) from version control.

scripts/index_youtube.py (2)
611-636: LGTM! The function cleanly handles both JSON formats (a list, or a dict with a segments key), gracefully handles missing speaker labels, and correctly computes duration.
675-678: LGTM! CLI arguments are appropriately typed and clearly documented.

scripts/diarize_episodes.py (5)
46-59: LGTM! Episode list and directory path definitions are well-structured. The relative path calculation correctly resolves to `{repo}/data/{audio|diarized}`.
62-73: LGTM! Token handling follows standard HuggingFace patterns with clear error messaging.
292-358: LGTM! The function has good structure with clear stages, appropriate error handling, and effective caching to avoid redundant work.
361-408: LGTM! Main function has sensible defaults (batch size for thermal management), secure token handling (logs length, not value), and clear progress reporting.
86-93: No action needed. The `--remote-components ejs:github` flag is a legitimate, documented yt-dlp option for fetching EJS JavaScript components from the yt-dlp-ejs GitHub repository. It is not outdated, renamed, or fork-specific. Likely an incorrect or invalid review comment.
```python
def delete_video_chunks(store: VectorStore, video_id: str) -> int:
    """Delete all chunks for a video ID. Returns count deleted."""
    source_file = f"youtube:{video_id}"
    cursor = store.conn.cursor()
    # Count first
    count = list(cursor.execute(
        "SELECT COUNT(*) FROM chunks WHERE source_file = ?", (source_file,)
    ))[0][0]
    if count == 0:
        return 0
    # Delete from vectors first (FK dependency)
    cursor.execute(
        "DELETE FROM chunk_vectors WHERE chunk_id IN "
        "(SELECT id FROM chunks WHERE source_file = ?)",
        (source_file,),
    )
    cursor.execute("DELETE FROM chunks WHERE source_file = ?", (source_file,))
    log.info(f" Deleted {count} existing chunks for {video_id}")
    return count
```
Missing transaction commit after delete operations.
The function executes DELETE statements but doesn't commit the transaction. If store.conn uses the default SQLite isolation (no autocommit), these deletions won't persist unless a subsequent operation commits. This could lead to data inconsistency if the script fails between delete and upsert.
🔧 Proposed fix: commit after deletions

```diff
 cursor.execute("DELETE FROM chunks WHERE source_file = ?", (source_file,))
+store.conn.commit()
 log.info(f" Deleted {count} existing chunks for {video_id}")
 return count
```

🏁 Verification script:

```bash
#!/bin/bash
# Verify VectorStore's connection autocommit behavior and commit patterns
ast-grep --pattern $'class VectorStore {
  $$$
  def __init__($$$) {
    $$$
  }
  $$$
}'
# Check if connection uses autocommit or explicit commits
rg -n "isolation_level|autocommit|\.commit\(\)" src/brainlayer/vector_store.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/index_youtube.py` around lines 639 - 657, The delete_video_chunks
function runs DELETE statements on store.conn via cursor but never commits, so
ensure the deletions are persisted by calling a commit on the same connection
after the DELETEs complete (e.g., call store.conn.commit() right after the
second cursor.execute and before returning); locate the function
delete_video_chunks and add the commit using the existing store.conn (or use a
transaction/context manager if preferred) so the DELETEs to chunk_vectors and
chunks are durable.
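The durability point can be demonstrated in isolation with stdlib sqlite3 (an illustrative stand-in; the project reportedly uses APSW, whose transaction semantics differ). A second connection does not see an uncommitted DELETE:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, source_file TEXT)")
writer.executemany(
    "INSERT INTO chunks (source_file) VALUES (?)",
    [("youtube:abc",), ("youtube:xyz",)],
)
writer.commit()

# DELETE inside an open (uncommitted) transaction
writer.execute("DELETE FROM chunks WHERE source_file = ?", ("youtube:abc",))

# A separate connection still sees the old data
reader = sqlite3.connect(path)
before = reader.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]  # 2
reader.close()

writer.commit()  # the fix: persist the DELETE

reader2 = sqlite3.connect(path)
after = reader2.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]  # 1
reader2.close()
```

If the process crashes between the DELETE and a later implicit commit, the deletion is rolled back, which is exactly the inconsistency the review warns about.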
```python
video_id_match = re.search(r"[?&]v=([a-zA-Z0-9_-]+)", args.url)
vid = video_id_match.group(1) if video_id_match else args.url
```
Video ID extraction doesn't handle youtu.be short URLs.
The regex [?&]v=([a-zA-Z0-9_-]+) handles standard YouTube URLs but not short URLs like https://youtu.be/VIDEO_ID. The fallback uses the raw URL as the video ID, which would produce incorrect source_file values.
🔧 Proposed fix: handle both URL formats
```diff
-video_id_match = re.search(r"[?&]v=([a-zA-Z0-9_-]+)", args.url)
-vid = video_id_match.group(1) if video_id_match else args.url
+# Handle both youtube.com/watch?v=ID and youtu.be/ID formats
+video_id_match = re.search(r"(?:[?&]v=|youtu\.be/)([a-zA-Z0-9_-]+)", args.url)
+vid = video_id_match.group(1) if video_id_match else args.url
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
# Handle both youtube.com/watch?v=ID and youtu.be/ID formats
video_id_match = re.search(r"(?:[?&]v=|youtu\.be/)([a-zA-Z0-9_-]+)", args.url)
vid = video_id_match.group(1) if video_id_match else args.url
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/index_youtube.py` around lines 697 - 698, The video ID extraction
only matches standard query URLs and falls back to the whole URL; update the
logic around video_id_match/vid to also detect youtu.be short links and extract
their path segment as the ID (e.g., check for hostname "youtu.be" or match
r"(?:youtu\.be/)([A-Za-z0-9_-]+)"), or parse args.url with urllib.parse to
handle both formats and set vid to the extracted ID rather than the raw URL
(adjust the variables video_id_match and vid accordingly).
```python
else:
    log.info(f" Embedding {len(chunks)} chunks...")
    embedded = embed_chunks(chunks)
    source_file = f"youtube:{vid}"
    chunk_data = []
    embeddings = []
    for i, ec in enumerate(embedded):
        c = ec.chunk
        meta = dict(c.metadata)
        if upload_date:
            meta["upload_date"] = upload_date
        created_at = None
        if upload_date and len(upload_date) == 8:
            created_at = f"{upload_date[:4]}-{upload_date[4:6]}-{upload_date[6:8]}T00:00:00+00:00"
        chunk_data.append({
            "id": f"{source_file}:{i}",
            "content": c.content,
            "metadata": meta,
            "source_file": source_file,
            "project": args.project,
            "content_type": c.content_type.value,
            "value_type": c.value.value,
            "char_count": c.char_count,
            "source": "youtube",
            "conversation_id": source_file,
            "position": i,
            "created_at": created_at,
        })
        embeddings.append(ec.embedding)
    total_chunks = store.upsert_chunks(chunk_data, embeddings)
```
🧹 Nitpick | 🔵 Trivial
Significant code duplication with index_single_video.
The embedding and upsert logic (lines 724-753) duplicates code from index_single_video (lines 564-603). This increases maintenance burden and risk of divergence.
Consider extracting a helper function like _embed_and_upsert_chunks(chunks, video_id, project, upload_date, store).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/index_youtube.py` around lines 724 - 753, The block in
index_youtube.py duplicates embedding/upsert logic from index_single_video;
extract a helper function (e.g. _embed_and_upsert_chunks(chunks, video_id,
project, upload_date, store)) that encapsulates calling embed_chunks, building
chunk_data entries (use source_file = f"youtube:{video_id}", include metadata
upload_date and created_at formatting), collecting embeddings, and calling
store.upsert_chunks; replace the duplicated loops in both index_single_video and
the current block with calls to this new helper to remove duplication and keep
behavior identical (preserve fields: id, content, metadata, source_file,
project, content_type, value_type, char_count, source, conversation_id,
position, created_at).
- New MCP tool brain_update with 3 actions:
  - update: change content (re-embeds), tags, importance on existing chunks
  - archive: soft-delete (sets value_type=ARCHIVED, removes from vector index)
  - merge: keep one memory, archive duplicates
- VectorStore: add update_chunk(), archive_chunk(), get_chunk() methods
- Update tool count: 6 → 7
- Fix test_tool_count assertion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix cache dir double "data" prefix (data/data/aligned → data/aligned)
- Fix comment/code mismatch (60s → 2 minutes, matching the actual 120s check)
- Use sys.executable instead of hardcoded .venv path for portability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/brainlayer/mcp/__init__.py`:
- Around line 1607-1692: _brain_update calls json.dumps() but json is not
imported, causing a NameError at runtime; add a local import inside the function
(e.g., add "import json" near the start of _brain_update) or import json at
module scope to match the pattern used by _brain_digest/_brain_entity so
json.dumps calls in _brain_update succeed.
- Around line 1690-1691: The module calls logger.error and logger.warning but
never defines logger; add an import and module-level logger initialization by
importing the standard logging module and creating logger =
logging.getLogger(__name__) near the top of the file so calls like
logger.error("brain_update failed: %s", e) and logger.warning(...) work without
raising NameError; place this initialization alongside other top-level
imports/initializations in the src/brainlayer/mcp/__init__.py module.
- Around line 1623-1626: Several handlers (_brain_update, _file_timeline,
_context, _sessions and others) have inconsistent return types—some return
list[TextContent] on success but CallToolResult on error; choose one consistent
contract and implement it: either (A) change all handler annotations to "->
CallToolResult" and wrap successful list[TextContent] payloads into a
CallToolResult success variant, or (B) change annotations to "->
list[TextContent] | CallToolResult" to reflect reality; locate functions
_brain_update, _file_timeline, _context, _sessions, _brain_digest, _brain_entity
and the helper _error_result, then update their type hints and return statements
accordingly and update any callers to handle the unified return type.
In `@src/brainlayer/vector_store.py`:
- Around line 603-620: get_chunk currently opens a write cursor via
self.conn.cursor() even though it's a read-only operation; change it to use the
read-only helper by acquiring the cursor from self._read_cursor() (i.e., replace
the cursor = self.conn.cursor() call with cursor = self._read_cursor()) so the
method get_chunk uses the same non-blocking read connection pattern as other
read methods and avoids blocking concurrent writes.
- Around line 551-601: The update_chunk and archive_chunk methods lack the
explicit BUSY-retry loop used by update_enrichment/update_sentiment; wrap their
write operations (the SELECT check and subsequent UPDATE/DELETE/INSERT calls
inside update_chunk, and the SELECT plus UPDATE/DELETE in archive_chunk) in the
same retry-on-BUSY pattern (catch sqlite3.OperationalError/BusyError or check
error message for "database is locked"), retry a few times with short
sleeps/backoff before failing, and ensure the cursor/connection handling and
final return behavior remain identical to the existing methods so behavior and
commits remain consistent with the rest of the file.
- Line 566: Remove the unused local variable now in
src/brainlayer/vector_store.py: delete the line that assigns now =
datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ") (or replace it with
the intended timestamp usage if you actually meant to set/update a timestamp
variable), so no unused variable remains in the scope where it's defined.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- src/brainlayer/mcp/__init__.py
- src/brainlayer/vector_store.py
- tests/test_think_recall_integration.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
- GitHub Check: test (3.11)
- GitHub Check: Cursor Bugbot
🧰 Additional context used
📓 Path-based instructions (4)
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use `ruff check src/` for linting and `ruff format src/` for code formatting
Files:
src/brainlayer/mcp/__init__.py, src/brainlayer/vector_store.py
src/brainlayer/mcp/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names
Files:
src/brainlayer/mcp/__init__.py
src/brainlayer/vector_store.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use sqlite-vec with APSW for vector storage, WAL mode, and `PRAGMA busy_timeout = 5000` for concurrent multi-process safety
Files:
src/brainlayer/vector_store.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Run tests using `pytest` from the project root
Files:
tests/test_think_recall_integration.py
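The WAL + busy_timeout guideline above amounts to roughly the following connection setup (stdlib sqlite3 shown for illustration; the guideline itself names APSW, whose API differs):

```python
import sqlite3

def open_store(path: str) -> sqlite3.Connection:
    """Open a SQLite connection configured for multi-process access."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer appends to the log
    conn.execute("PRAGMA journal_mode = WAL")
    # Wait up to 5s for a lock instead of failing immediately with BUSY
    conn.execute("PRAGMA busy_timeout = 5000")
    return conn
```

WAL mode persists in the database file, but `busy_timeout` is per-connection and must be set every time a connection is opened.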
🧠 Learnings (3)
📚 Learning: 2026-02-23T16:51:38.317Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/mcp/**/*.py : Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names
Applied to files:
src/brainlayer/mcp/__init__.py, tests/test_think_recall_integration.py
📚 Learning: 2026-02-23T16:51:38.317Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/cli/**/*.py : Enable project-specific indexing with `brainlayer index --project <project_name>` and incremental indexing with `brainlayer index-fast`
Applied to files:
src/brainlayer/mcp/__init__.py
📚 Learning: 2026-02-23T16:51:38.317Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/pipeline/enrichment.py : Enrich chunks with 10 metadata fields: summary, tags, importance (1-10), intent, primary_symbols, resolved_query, epistemic_level, version_scope, debt_impact, and external_deps
Applied to files:
src/brainlayer/vector_store.py
🧬 Code graph analysis (3)
src/brainlayer/mcp/__init__.py (1)
- src/brainlayer/vector_store.py (3): archive_chunk (590-601), get_chunk (603-620), update_chunk (551-588)

src/brainlayer/vector_store.py (2)
- src/brainlayer/clustering.py (1): serialize_f32 (53-60)
- scripts/reembed_bge_m3.py (1): serialize_f32 (39-41)

tests/test_think_recall_integration.py (1)
- src/brainlayer/mcp/__init__.py (1): list_tools (529-922)
🔇 Additional comments (4)
tests/test_think_recall_integration.py (1)
246-255: LGTM! The test correctly updates the expected tool count from 6 to 7, and the docstring accurately lists all tools including the new `brain_update` tool. This aligns with the `list_tools()` implementation in `src/brainlayer/mcp/__init__.py`.

src/brainlayer/mcp/__init__.py (3)
50-68: LGTM! Server instructions correctly updated to reflect 7 tools and include the new `brain_update` tool description. The description accurately covers all three actions (update, archive, merge).
872-921: LGTM! The `brain_update` tool definition is well-structured:
- Correctly uses `_WRITE` annotations for a destructive operation
- Clear enum for actions with appropriate descriptions
- Input schema properly documents which fields are used for which actions
1067-1075: LGTM! The routing correctly passes all parameters to `_brain_update`. It appropriately omits the timeout wrapper, consistent with other write operations (see AIDEV-NOTE at line 1013).
```python
async def _brain_update(
    action: str,
    chunk_id: str,
    content: str | None = None,
    tags: list[str] | None = None,
    importance: int | None = None,
    merge_chunk_ids: list[str] | None = None,
):
    """Update, archive, or merge memories."""
    try:
        store = _get_vector_store()

        if action == "archive":
            ok = store.archive_chunk(chunk_id)
            if not ok:
                return _error_result(f"Chunk not found: {chunk_id}")
            return [TextContent(
                type="text",
                text=json.dumps({"action": "archived", "chunk_id": chunk_id}),
            )]

        elif action == "update":
            # Verify chunk exists
            existing = store.get_chunk(chunk_id)
            if not existing:
                return _error_result(f"Chunk not found: {chunk_id}")

            # Re-embed if content changed
            embedding = None
            if content is not None:
                loop = asyncio.get_running_loop()
                model = _get_embedding_model()
                embedding = await loop.run_in_executor(None, model.embed_query, content)

            ok = store.update_chunk(
                chunk_id=chunk_id,
                content=content,
                tags=tags,
                importance=float(importance) if importance is not None else None,
                embedding=embedding,
            )
            if not ok:
                return _error_result(f"Update failed for: {chunk_id}")

            result = {"action": "updated", "chunk_id": chunk_id, "fields": []}
            if content is not None:
                result["fields"].append("content")
            if tags is not None:
                result["fields"].append("tags")
            if importance is not None:
                result["fields"].append("importance")
            return [TextContent(type="text", text=json.dumps(result))]

        elif action == "merge":
            if not merge_chunk_ids:
                return _error_result("merge requires merge_chunk_ids (the duplicates to archive)")

            # Verify the keeper exists
            keeper = store.get_chunk(chunk_id)
            if not keeper:
                return _error_result(f"Keeper chunk not found: {chunk_id}")

            archived = []
            failed = []
            for dup_id in merge_chunk_ids:
                ok = store.archive_chunk(dup_id)
                if ok:
                    archived.append(dup_id)
                else:
                    failed.append(dup_id)

            result = {
                "action": "merged",
                "kept": chunk_id,
                "archived": archived,
                "failed": failed,
            }
            return [TextContent(type="text", text=json.dumps(result))]

        else:
            return _error_result(f"Unknown action: {action}. Use update, archive, or merge.")

    except Exception as e:
        logger.error("brain_update failed: %s", e)
        return _error_result(f"brain_update error: {e}")
```
Missing json import will cause NameError at runtime.
The _brain_update function uses json.dumps() at lines 1625, 1658, and 1684, but doesn't import the json module. Other handlers in this file (e.g., _brain_digest at line 1201, _brain_entity at line 1234) import json locally within the function.
🐛 Proposed fix

```diff
 async def _brain_update(
     action: str,
     chunk_id: str,
     content: str | None = None,
     tags: list[str] | None = None,
     importance: int | None = None,
     merge_chunk_ids: list[str] | None = None,
 ):
     """Update, archive, or merge memories."""
+    import json
+
     try:
         store = _get_vector_store()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/mcp/__init__.py` around lines 1607 - 1692, _brain_update calls
json.dumps() but json is not imported, causing a NameError at runtime; add a
local import inside the function (e.g., add "import json" near the start of
_brain_update) or import json at module scope to match the pattern used by
_brain_digest/_brain_entity so json.dumps calls in _brain_update succeed.
```python
return [TextContent(
    type="text",
    text=json.dumps({"action": "archived", "chunk_id": chunk_id}),
)]
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check return patterns in the MCP file
echo "=== Searching for return [TextContent patterns ==="
rg -n "return \[TextContent" src/brainlayer/mcp/__init__.py
echo -e "\n=== Searching for return CallToolResult patterns ==="
rg -n "return CallToolResult" src/brainlayer/mcp/__init__.py
echo -e "\n=== Checking _error_result function ==="
rg -n "_error_result" src/brainlayer/mcp/__init__.py -A 3 | head -30
Repository: EtanHey/brainlayer
Length of output: 2888
🏁 Script executed:
#!/bin/bash
# Check the _brain_update function and its decorator
echo "=== Checking _brain_update function signature and decorator ==="
rg -n "async def _brain_update" src/brainlayer/mcp/__init__.py -B 2 -A 20 | head -50
echo -e "\n=== Checking _brain_digest function signature ==="
rg -n "async def _brain_digest" src/brainlayer/mcp/__init__.py -B 2 -A 5
echo -e "\n=== Checking _search function signature ==="
rg -n "async def _search" src/brainlayer/mcp/__init__.py -B 2 -A 5
echo -e "\n=== Checking call_tool implementation ==="
rg -n "async def call_tool" src/brainlayer/mcp/__init__.py -A 30 | head -60
Repository: EtanHey/brainlayer
Length of output: 2993
🏁 Script executed:
#!/bin/bash
# Check _with_timeout implementation
echo "=== Checking _with_timeout implementation ==="
rg -n "def _with_timeout" src/brainlayer/mcp/__init__.py -A 15
echo -e "\n=== Checking _brain_search return type ==="
rg -n "async def _brain_search" src/brainlayer/mcp/__init__.py -B 2 -A 2
echo -e "\n=== Checking _brain_store return type ==="
rg -n "async def _brain_store" src/brainlayer/mcp/__init__.py -B 2 -A 2
echo -e "\n=== Checking imports for MCP types ==="
head -50 src/brainlayer/mcp/__init__.py | grep -E "from|import"
Repository: EtanHey/brainlayer
Length of output: 1332
🏁 Script executed:
#!/bin/bash
# Get full MCP type imports
echo "=== Full MCP imports ==="
head -100 src/brainlayer/mcp/__init__.py | grep -A 10 "from mcp.types import"
echo -e "\n=== Return type annotations for handlers ==="
rg -n "async def .*->.*:" src/brainlayer/mcp/__init__.py | head -20
echo -e "\n=== Check what _brain_search returns ==="
rg -n "return.*TextContent\|return.*CallToolResult" src/brainlayer/mcp/__init__.py | grep -A 2 -B 2 "1360:"
sed -n '1360,1950p' src/brainlayer/mcp/__init__.py | rg -n "return " | head -20
echo -e "\n=== Check actual return statements in _brain_search ==="
sed -n '1362,1900p' src/brainlayer/mcp/__init__.py | tail -100 | head -50
Repository: EtanHey/brainlayer
Length of output: 4228
🏁 Script executed:
#!/bin/bash
# Check the complete _brain_search function to see what it returns
echo "=== Full _brain_search implementation (lines 1362-1450) ==="
sed -n '1362,1450p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check line endings of _brain_search ==="
sed -n '1850,1900p' src/brainlayer/mcp/__init__.py
Repository: EtanHey/brainlayer
Length of output: 5388
🏁 Script executed:
#!/bin/bash
# Let's verify what types are actually returned by examining the complete pattern
echo "=== Check what _brain_digest actually returns ==="
sed -n '1194,1228p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check what _search returns (full) ==="
sed -n '1730,1760p' src/brainlayer/mcp/__init__.py | tail -20
echo -e "\n=== Check MCP framework usage and tool registration ==="
rg -n "register_tool\|add_tool" src/brainlayer/mcp/__init__.py | head -10
Repository: EtanHey/brainlayer
Length of output: 2035
🏁 Script executed:
#!/bin/bash
# Check how call_tool is used and what the server expects
echo "=== Check MCP server setup and tool_handler registration ==="
rg -n "on_call_tool\|call_tool_handler\|Tool\(" src/brainlayer/mcp/__init__.py | head -20
echo -e "\n=== Check if there's a wrapper around call_tool ==="
rg -n "@server\|@app" src/brainlayer/mcp/__init__.py | head -10
echo -e "\n=== Full call_tool wrapper check ==="
sed -n '970,1010p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check server initialization ==="
rg -n "server = Server\|Server(" src/brainlayer/mcp/__init__.py -A 3
Repository: EtanHey/brainlayer
Length of output: 2105
🏁 Script executed:
#!/bin/bash
# Check server initialization with proper regex
echo "=== Check server initialization ==="
rg -n "server = Server" src/brainlayer/mcp/__init__.py -A 3
echo -e "\n=== Check if call_tool decorator returns are handled ==="
sed -n '980,1020p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Verify return type of _search function ==="
sed -n '1697,1800p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return " | head -20
echo -e "\n=== Check what _search actually returns at end ==="
sed -n '1800,1850p' src/brainlayer/mcp/__init__.py
Repository: EtanHey/brainlayer
Length of output: 4794
🏁 Script executed:
#!/bin/bash
# Check what the actual MCP spec expects
echo "=== Check MCP library version and CallToolResult usage ==="
rg -n "CallToolResult" src/brainlayer/mcp/__init__.py | wc -l
echo -e "\n=== All CallToolResult usages ==="
rg -n "CallToolResult" src/brainlayer/mcp/__init__.py
echo -e "\n=== Check _store_new return type ==="
rg -n "async def _store_new" src/brainlayer/mcp/__init__.py -A 2
echo -e "\n=== Check what _store_new returns ==="
sed -n '2400,2500p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return " | head -20
echo -e "\n=== Check the actual return from _brain_update to be sure ==="
sed -n '1605,1700p' src/brainlayer/mcp/__init__.py | grep -n "return"
🏁 Script executed:
#!/bin/bash
# Check the full _brain_update function to see all returns
echo "=== Full _brain_update function ==="
sed -n '1607,1692p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check how _brain_search handles different return types from nested calls ==="
sed -n '1410,1450p' src/brainlayer/mcp/__init__.py
🏁 Script executed:
#!/bin/bash
# Check MCP framework expectations and how it handles returns
echo "=== Check imports and MCP version ==="
grep -E "mcp|MCP" src/brainlayer/mcp/__init__.py | head -20
echo -e "\n=== Check if there's any wrapping logic in the decorator or elsewhere ==="
rg -n "class.*Server|def.*call_tool" src/brainlayer/mcp/__init__.py | head -10
echo -e "\n=== Check if there's a README or docs about MCP setup ==="
find . -name "README*" -o -name "*.md" | head -10 | xargs grep -l "MCP\|CallToolResult" 2>/dev/null
echo -e "\n=== Check requirements or pyproject for MCP version ==="
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | xargs grep -i mcp 2>/dev/null
🏁 Script executed:
#!/bin/bash
# Check if there's any test or documentation about return types
echo "=== Check test files for MCP tool handlers ==="
find . -name "*test*" -type f | head -10
echo -e "\n=== Check how _brain_search is being called from call_tool ==="
sed -n '989,1011p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check if _with_timeout does any transformations ==="
sed -n '26,45p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check what happens after call_tool returns ==="
sed -n '1100,1140p' src/brainlayer/mcp/__init__.py | head -40
🏁 Script executed:
#!/bin/bash
# Check if there are any runtime errors or issues logged related to return types
echo "=== Check if tests exist for MCP return types ==="
find . -name "*test*mcp*" -o -name "*mcp*test*" 2>/dev/null | head -10
echo -e "\n=== Check CLAUDE.md or docs for MCP return type guidance ==="
grep -A 5 -B 5 "CallToolResult\|return type" ./CLAUDE.md 2>/dev/null | head -30
echo -e "\n=== Check what _brain_search nested calls actually return ==="
rg -n "async def _regression|async def _recall|async def _file_timeline" src/brainlayer/mcp/__init__.py -A 1
🏁 Script executed:
#!/bin/bash
# Check what these nested functions return
echo "=== Check _file_timeline return type ==="
sed -n '1922,1960p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return \[|return \(|-> "
echo -e "\n=== Check _regression return type ==="
sed -n '1991,2030p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return \[|return \(|-> "
echo -e "\n=== Check _recall return type ==="
sed -n '2153,2200p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return \[|return \(|-> "
echo -e "\n=== Look at actual return statements in _file_timeline ==="
sed -n '1922,1970p' src/brainlayer/mcp/__init__.py | tail -30
echo -e "\n=== Look at actual return statements in _recall ==="
sed -n '2153,2250p' src/brainlayer/mcp/__init__.py | tail -50
Return types are inconsistent across multiple handlers, not just _brain_update.
While the specific concern about _brain_update (lines 1623–1626, 1658, 1684) returning raw [TextContent] on success but CallToolResult on error is valid, this is part of a broader type annotation mismatch. Many handlers like _file_timeline, _context, and _sessions are annotated -> list[TextContent] but can actually return CallToolResult via _error_result() on error. Only _brain_digest and _brain_entity consistently follow the -> CallToolResult pattern.
To address this properly: either (1) unify all handlers to return CallToolResult consistently and update type annotations, or (2) fix all type annotations to reflect actual return types (list[TextContent] | CallToolResult). The current mixed pattern creates implicit contracts that don't match actual behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/mcp/__init__.py` around lines 1623 - 1626, Several handlers
(_brain_update, _file_timeline, _context, _sessions and others) have
inconsistent return types—some return list[TextContent] on success but
CallToolResult on error; choose one consistent contract and implement it: either
(A) change all handler annotations to "-> CallToolResult" and wrap successful
list[TextContent] payloads into a CallToolResult success variant, or (B) change
annotations to "-> list[TextContent] | CallToolResult" to reflect reality;
locate functions _brain_update, _file_timeline, _context, _sessions,
_brain_digest, _brain_entity and the helper _error_result, then update their
type hints and return statements accordingly and update any callers to handle
the unified return type.
logger.error("brain_update failed: %s", e)
return _error_result(f"brain_update error: {e}")
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if logger is imported/defined in the MCP module
rg -n "^logger|^import.*logging|from.*logging" src/brainlayer/mcp/__init__.py | head -20
🏁 Script executed:
#!/bin/bash
# Search for logger definition/import anywhere in the file
echo "=== Searching for 'logger' in file ==="
rg -n "logger" src/brainlayer/mcp/__init__.py | head -10
echo "=== Searching for 'logging' imports ==="
rg -n "logging" src/brainlayer/mcp/__init__.py | head -10
echo "=== Check first 50 lines for imports ==="
head -50 src/brainlayer/mcp/__init__.py | cat -n
🏁 Script executed:
#!/bin/bash
# Check file size and look for logger initialization anywhere
echo "=== Total lines and check middle sections ==="
wc -l src/brainlayer/mcp/__init__.py
echo "=== Search all occurrences of 'logger' ==="
rg -n "logger" src/brainlayer/mcp/__init__.py
echo "=== Search for any logging-related assignments ==="
rg -n "logging\|logger\s*=" src/brainlayer/mcp/__init__.py
echo "=== Check if logger is a parameter or global ==="
rg -n "def.*logger|logger\s*:" src/brainlayer/mcp/__init__.py
Add missing logger import and initialization.
logger.error() is called at line 1690 (and logger.warning() at line 1340), but logger is not imported or defined anywhere in the module. This will raise a NameError at runtime when these code paths execute. Add import logging and initialize logger = logging.getLogger(__name__) at the module level.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/mcp/__init__.py` around lines 1690 - 1691, The module calls
logger.error and logger.warning but never defines logger; add an import and
module-level logger initialization by importing the standard logging module and
creating logger = logging.getLogger(__name__) near the top of the file so calls
like logger.error("brain_update failed: %s", e) and logger.warning(...) work
without raising NameError; place this initialization alongside other top-level
imports/initializations in the src/brainlayer/mcp/__init__.py module.
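The fix itself is a two-line addition at the top of the module:

```python
import logging

# Module-level logger; getLogger(__name__) returns one logger per module
# and inherits handlers/levels from whatever logging config the host sets up.
logger = logging.getLogger(__name__)

# Calls like these no longer raise NameError:
logger.warning("example warning: %s", "fallback path taken")
```

`logging.getLogger(__name__)` is idempotent, so repeated imports of the module reuse the same logger instance.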
def update_chunk(
    self,
    chunk_id: str,
    content: Optional[str] = None,
    tags: Optional[List[str]] = None,
    importance: Optional[float] = None,
    embedding: Optional[List[float]] = None,
) -> bool:
    """Update fields on an existing chunk. Returns True if chunk was found."""
    cursor = self.conn.cursor()
    # Check chunk exists
    rows = list(cursor.execute("SELECT id FROM chunks WHERE id = ?", (chunk_id,)))
    if not rows:
        return False

    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    if content is not None:
        cursor.execute(
            "UPDATE chunks SET content = ?, char_count = ?, summary = ? WHERE id = ?",
            (content, len(content), content[:200], chunk_id),
        )
    if tags is not None:
        cursor.execute(
            "UPDATE chunks SET tags = ? WHERE id = ?",
            (json.dumps(tags), chunk_id),
        )
    if importance is not None:
        cursor.execute(
            "UPDATE chunks SET importance = ? WHERE id = ?",
            (float(max(1, min(10, importance))), chunk_id),
        )
    if embedding is not None:
        cursor.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
        cursor.execute(
            "INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)",
            (chunk_id, serialize_f32(embedding)),
        )
    return True

def archive_chunk(self, chunk_id: str) -> bool:
    """Soft-delete a chunk by setting value_type to ARCHIVED."""
    cursor = self.conn.cursor()
    rows = list(cursor.execute("SELECT id FROM chunks WHERE id = ?", (chunk_id,)))
    if not rows:
        return False
    cursor.execute(
        "UPDATE chunks SET value_type = 'ARCHIVED' WHERE id = ?", (chunk_id,)
    )
    # Remove from vector index so it doesn't appear in searches
    cursor.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
    return True
🧹 Nitpick | 🔵 Trivial
Consider adding BUSY retry for consistency with other write methods.
update_chunk and archive_chunk lack the BUSY retry logic that update_enrichment and update_sentiment implement (Lines 1336-1344, 1365-1376). While PRAGMA busy_timeout = 5000 provides base protection, other write methods in this file use explicit retry to handle edge cases under heavy contention.
For consistency, consider adding the same retry pattern, or document why these methods rely solely on the busy_timeout pragma.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/vector_store.py` around lines 551 - 601, The update_chunk and
archive_chunk methods lack the explicit BUSY-retry loop used by
update_enrichment/update_sentiment; wrap their write operations (the SELECT
check and subsequent UPDATE/DELETE/INSERT calls inside update_chunk, and the
SELECT plus UPDATE/DELETE in archive_chunk) in the same retry-on-BUSY pattern
(catch sqlite3.OperationalError/BusyError or check error message for "database
is locked"), retry a few times with short sleeps/backoff before failing, and
ensure the cursor/connection handling and final return behavior remain identical
to the existing methods so behavior and commits remain consistent with the rest
of the file.
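The shared retry pattern could live in one small helper (a sketch; the helper name and backoff parameters are hypothetical, and it assumes the "database is locked" message that SQLite raises via sqlite3.OperationalError):

```python
import sqlite3
import time

def execute_with_busy_retry(cursor, sql, params=(), retries=3, base_delay=0.05):
    """Run a statement, retrying on 'database is locked' with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return cursor.execute(sql, params)
        except sqlite3.OperationalError as e:
            if "locked" not in str(e) or attempt == retries:
                raise  # non-BUSY error, or retries exhausted
            time.sleep(base_delay * (2 ** attempt))

# Usage against an in-memory database:
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY, value_type TEXT)")
execute_with_busy_retry(cur, "INSERT INTO chunks VALUES (?, ?)", ("c1", "GOLD"))
```

Routing the writes in `update_chunk` and `archive_chunk` through such a helper would make them behave like `update_enrichment`/`update_sentiment` under contention.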
if not rows:
    return False

now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
Remove unused variable now.
The variable now is assigned but never used. This appears to be leftover from a planned timestamp update that wasn't implemented.
🧹 Proposed fix
- now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/vector_store.py` at line 566, Remove the unused local variable
now in src/brainlayer/vector_store.py: delete the line that assigns now =
datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ") (or replace it with
the intended timestamp usage if you actually meant to set/update a timestamp
variable), so no unused variable remains in the scope where it's defined.
def get_chunk(self, chunk_id: str) -> Optional[Dict[str, Any]]:
    """Get a single chunk by ID."""
    cursor = self.conn.cursor()
    rows = list(cursor.execute(
        """SELECT id, content, metadata, source_file, project, content_type,
                  value_type, tags, importance, created_at, summary
           FROM chunks WHERE id = ?""",
        (chunk_id,),
    ))
    if not rows:
        return None
    r = rows[0]
    return {
        "id": r[0], "content": r[1], "metadata": r[2],
        "source_file": r[3], "project": r[4], "content_type": r[5],
        "value_type": r[6], "tags": r[7], "importance": r[8],
        "created_at": r[9], "summary": r[10],
    }
Use _read_cursor() for read-only operation.
get_chunk is a read operation but uses the write connection (self.conn.cursor()). For consistency with other read methods in this file and to avoid blocking concurrent writes, use _read_cursor().
♻️ Proposed fix
def get_chunk(self, chunk_id: str) -> Optional[Dict[str, Any]]:
"""Get a single chunk by ID."""
- cursor = self.conn.cursor()
+ cursor = self._read_cursor()
  rows = list(cursor.execute(

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/vector_store.py` around lines 603 - 620, get_chunk currently
opens a write cursor via self.conn.cursor() even though it's a read-only
operation; change it to use the read-only helper by acquiring the cursor from
self._read_cursor() (i.e., replace the cursor = self.conn.cursor() call with
cursor = self._read_cursor()) so the method get_chunk uses the same non-blocking
read connection pattern as other read methods and avoids blocking concurrent
writes.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
if not rows:
    return False

now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
Unused now variable in update_chunk method
Low Severity
The now variable is computed via datetime.now(timezone.utc).strftime(...) but never referenced in any of the subsequent SQL UPDATE statements. This looks like it was intended to set a modification timestamp (e.g., enriched_at or an updated_at field) when a chunk is updated, but the actual update was never wired in. Updates silently lose their modification timestamps.
vid = video_id_match.group(1) if video_id_match else args.url

if args.replace:
    delete_video_chunks(store, vid)
Regex video ID extraction mismatches index_single_video extraction
Medium Severity
The vid variable is extracted via a regex matching [?&]v= in the URL. For non-standard YouTube URL formats (e.g., youtu.be/ID or /embed/ID), the regex won't match and vid falls back to the entire URL string. When --replace is used without --diarized-transcript, delete_video_chunks uses this possibly-wrong vid, while index_single_video correctly extracts the real video ID via yt-dlp. This mismatch causes old chunks to not be deleted while new ones are created — resulting in duplicate data.
Additional Locations (1)
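A sketch of a shared extractor covering the common URL shapes before falling back to yt-dlp (the function name and pattern list are hypothetical; the reviewed code only matches `[?&]v=`):

```python
import re

# YouTube IDs are 11 chars of [A-Za-z0-9_-]; cover watch?v=, youtu.be/,
# and /embed/ forms. Returning None signals "fall back to yt-dlp metadata".
_YT_ID_PATTERNS = [
    r"[?&]v=([A-Za-z0-9_-]{11})",
    r"youtu\.be/([A-Za-z0-9_-]{11})",
    r"/embed/([A-Za-z0-9_-]{11})",
]

def extract_video_id(url):
    for pattern in _YT_ID_PATTERNS:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    return None
```

Using the same extractor in both the `--replace` path and `index_single_video` keeps the deleted and re-indexed chunk IDs in agreement.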
    })
    embeddings.append(ec.embedding)
total_chunks = store.upsert_chunks(chunk_data, embeddings)
log.info(f" Indexed {total_chunks} chunks for '{title}'")
Duplicated embed-and-upsert logic from index_single_video
Low Severity
The embed → build chunk_data → upsert_chunks block in the diarized-transcript path (inside main()) is nearly a line-for-line copy of the same logic in index_single_video. Any future bug fix or schema change to the chunk-building loop needs to be applied in both places, increasing risk of divergence.
Additional Locations (1)
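One way to de-duplicate is a single helper shared by both call sites; everything below (the helper signature, the stub embedder/store, field names) is illustrative, not the project's actual API:

```python
# Hypothetical shared helper: both index_single_video and the
# diarized-transcript path in main() would call this instead of
# each carrying its own copy of the embed/upsert loop.
def embed_and_upsert(store, embedder, chunks, base_meta):
    chunk_data, embeddings = [], []
    for i, text in enumerate(chunks):
        ec = embedder.embed(text)
        chunk_data.append({**base_meta, "chunk_index": i, "content": text})
        embeddings.append(ec.embedding)
    return store.upsert_chunks(chunk_data, embeddings)

# Minimal stand-ins to show the call shape:
class _Emb:
    def __init__(self, v): self.embedding = v

class StubEmbedder:
    def embed(self, text): return _Emb([float(len(text))])

class StubStore:
    def upsert_chunks(self, chunk_data, embeddings): return len(chunk_data)

n = embed_and_upsert(StubStore(), StubEmbedder(),
                     ["HUBERMAN: welcome", "GALPIN: thanks"],
                     {"source": "youtube:example"})
```

Any future change to the chunk schema then happens in one place.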
cursor.execute("UPDATE chunks SET value_type = 'ARCHIVED' WHERE id = ?", (chunk_id,))
# Remove from vector index so it doesn't appear in searches
cursor.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
return True
Archived chunks still appear in FTS search results
Medium Severity
archive_chunk removes the chunk from chunk_vectors (preventing vector search hits) but does not remove it from chunks_fts. The FTS delete trigger only fires on DELETE FROM chunks, not on UPDATE. Since archive_chunk only updates value_type, the FTS entry persists. The hybrid_search FTS path has no value_type != 'ARCHIVED' filter, so archived/merged memories can still surface in keyword-based search results — defeating the purpose of archiving.


Summary

- diarize_episodes.py — batch WhisperX diarization pipeline for 8 Huberman guest episodes (download → transcribe → align → diarize → map speakers → re-index)
- index_youtube.py — added --diarized-transcript and --replace flags for re-indexing with speaker-labeled transcripts
- Added data/ to .gitignore (audio WAVs, diarized JSONs)

Test plan

- HUBERMAN: prefix on intro segments, GALPIN: on guest responses
- load_diarized_transcript() correctly converts WhisperX JSON to indexing format
- delete_video_chunks() properly removes existing chunks before re-index

🤖 Generated with Claude Code
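The conversion the test plan exercises can be sketched as follows; field names ("segments", "start", "end", "speaker", "text") follow WhisperX's diarized JSON output, while the function body and speaker_map argument are illustrative, not the script's actual implementation (which presumably also reads the JSON file from disk):

```python
# Illustrative core of load_diarized_transcript(): WhisperX diarized
# segments -> speaker-prefixed text segments ready for indexing.
def diarized_segments(data, speaker_map):
    segments = []
    for seg in data.get("segments", []):
        name = speaker_map.get(seg.get("speaker"), "UNKNOWN")
        segments.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": f"{name}: {seg['text'].strip()}",
        })
    return segments

demo = diarized_segments(
    {"segments": [
        {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00",
         "text": " Welcome, I'm Andrew Huberman. "},
        {"start": 2.5, "end": 4.0, "speaker": "SPEAKER_01",
         "text": " Thanks for having me. "},
    ]},
    {"SPEAKER_00": "HUBERMAN", "SPEAKER_01": "GALPIN"},
)
```

The speaker_map itself would come from the content-based detection step (matching intro phrases like "I'm Andrew Huberman") described in the PR.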
Note
Medium Risk
Medium risk because it adds new write/delete paths against the vector store (chunk replacement, update/archive/merge) which can alter or remove indexed data if misused, though changes are localized and opt-in via flags/tool calls.
Overview
Adds scripts/diarize_episodes.py, a batch pipeline that downloads episode audio, runs WhisperX transcription/alignment/diarization, maps speaker IDs to named speakers (intro phrase detection with speaking-time fallback), saves diarized JSON, and can trigger re-indexing.

Extends scripts/index_youtube.py to accept --diarized-transcript (load WhisperX diarized JSON and prefix speaker labels into segment text) and --replace (delete existing youtube:{video_id} chunks/vectors before indexing).

Introduces a new MCP write tool brain_update to update content/tags/importance (with re-embedding), archive chunks (soft-delete + remove vectors), or merge duplicates; adds supporting VectorStore methods and updates the MCP tool-count test. Also ignores data/ outputs in .gitignore.

Written by Cursor Bugbot for commit 85336e0.
Summary by CodeRabbit
New Features
Chores
Tests