
feat: diarization pipeline + diarized transcript re-indexing#34

Merged
EtanHey merged 4 commits into main from
feat/diarization-pipeline
Feb 26, 2026
Merged

feat: diarization pipeline + diarized transcript re-indexing#34
EtanHey merged 4 commits intomainfrom
feat/diarization-pipeline

Conversation


@EtanHey EtanHey commented Feb 26, 2026

Summary

  • New script: diarize_episodes.py — batch WhisperX diarization pipeline for 8 Huberman guest episodes (download → transcribe → align → diarize → map speakers → re-index)
  • Updated: index_youtube.py — added --diarized-transcript and --replace flags for re-indexing with speaker-labeled transcripts
  • Fixed: Speaker mapping now uses content-based detection ("I'm Andrew Huberman" in the intro) instead of the speaking-time-only heuristic, which produced inverted labels on episode 1
  • Housekeeping: Added data/ to .gitignore (audio WAVs, diarized JSONs)
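
The content-based speaker mapping described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the function name, intro-phrase list, and output labels are assumptions based on the summary (intro-phrase match in the first two minutes, with a speaking-time fallback).

```python
def map_speakers(segments, intro_phrases=("i'm andrew huberman", "i am andrew huberman")):
    """Map raw diarization labels (SPEAKER_00/01) to named speakers.

    Prefer content-based detection: whoever says an intro phrase within
    the first two minutes is the host. Fall back to total speaking time.
    """
    host_label = None
    for seg in segments:
        if seg.get("start", 0) > 120:  # only check the first 2 minutes
            break
        text = seg.get("text", "").lower()
        if "speaker" in seg and any(p in text for p in intro_phrases):
            host_label = seg["speaker"]
            break
    if host_label is None:
        # Fallback heuristic: the speaker with the most speaking time is the host
        totals = {}
        for seg in segments:
            if "speaker" in seg:
                dur = seg.get("end", 0) - seg.get("start", 0)
                totals[seg["speaker"]] = totals.get(seg["speaker"], 0) + dur
        host_label = max(totals, key=totals.get) if totals else None
    return {
        seg["speaker"]: ("HUBERMAN" if seg["speaker"] == host_label else "GALPIN")
        for seg in segments if "speaker" in seg
    }
```

The point of the fix: on episode 1 the guest spoke more than the host, so the time-based heuristic alone assigned the labels backwards.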

Test plan

  • Episode 1 (zEYE-vcVKy8) diarized, labels fixed, re-indexed with 1760 chunks
  • Verified HUBERMAN: prefix on intro segments, GALPIN: on guest responses
  • load_diarized_transcript() correctly converts WhisperX JSON to indexing format
  • delete_video_chunks() properly removes existing chunks before re-index
  • Remaining 7 episodes currently processing (CPU-only, ~2h each)

🤖 Generated with Claude Code


Note

Medium Risk
Medium risk because it adds new write/delete paths against the vector store (chunk replacement, update/archive/merge), which can alter or remove indexed data if misused; the changes are localized and opt-in via flags/tool calls.

Overview
Adds scripts/diarize_episodes.py, a batch pipeline that downloads episode audio, runs WhisperX transcription/alignment/diarization, maps speaker IDs to named speakers (intro phrase detection with speaking-time fallback), saves diarized JSON, and can trigger re-indexing.

Extends scripts/index_youtube.py to accept --diarized-transcript (load WhisperX diarized JSON and prefix speaker labels into segment text) and --replace (delete existing youtube:{video_id} chunks/vectors before indexing).

Introduces a new MCP write tool brain_update to update content/tags/importance (with re-embedding), archive chunks (soft-delete + remove vectors), or merge duplicates; adds supporting VectorStore methods and updates the MCP tool-count test. Also ignores data/ outputs in .gitignore.

Written by Cursor Bugbot for commit 85336e0. This will update automatically on new commits.

Summary by CodeRabbit

  • New Features

    • Added speaker diarization pipeline and workflow to process and re-index episodes with speaker-identified transcripts.
    • Enabled replacing previously indexed episode content during re-indexing.
    • Added memory management tool to update, archive, and merge stored chunks.
    • Added chunk-level APIs: update, archive, and fetch individual chunks.
  • Chores

    • Ignore data/ directory in repository.
  • Tests

    • Updated test expectations to reflect the new tool count.

- Add scripts/diarize_episodes.py: batch pipeline for WhisperX speaker
  diarization of Huberman guest episodes (download → transcribe → align →
  diarize → map speakers → save JSON → re-index)
- Add --diarized-transcript and --replace flags to index_youtube.py for
  re-indexing with speaker-labeled transcripts
- Fix speaker mapping: content-based detection ("I'm Andrew Huberman")
  instead of speaking-time-only heuristic which produced inverted labels
- Add data/ to .gitignore (audio WAVs, diarized JSONs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Feb 26, 2026

📝 Walkthrough

Walkthrough

Adds a speaker-diarization pipeline and indexing support: new script to download audio, transcribe, diarize, map speakers, and re-index episodes; indexer extended to accept diarized transcripts and to delete/replace existing video chunks; plus .gitignore update to ignore data/.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Git config: `.gitignore` | Added `data/` to ignored paths. |
| Diarization pipeline: `scripts/diarize_episodes.py` | New end-to-end diarization script: downloads audio, transcribes/aligns (WhisperX), diarizes (pyannote), maps speaker labels to identities, caches outputs, and triggers re-indexing. CLI supports batch, dry-run, skip-download, and HF token handling. |
| Indexing changes: `scripts/index_youtube.py` | Added `load_diarized_transcript()` to parse WhisperX JSON and `delete_video_chunks()` to remove a video's chunks; extended the CLI and main flow to accept `--diarized-transcript` and `--replace`, enabling re-indexing from diarized segments. |
| Memory control plane: `src/brainlayer/mcp/__init__.py` | Added new MCP tool `brain_update` with actions update, archive, merge; wired dispatch to the `_brain_update()` handler and updated the tool registry/count. |
| Vector store API: `src/brainlayer/vector_store.py` | Added `update_chunk()`, `archive_chunk()`, and `get_chunk()` to support targeted chunk updates, archival (soft-delete + vector cleanup), and single-chunk retrieval. |
| Tests: `tests/test_think_recall_integration.py` | Adjusted expected MCP tool count from 6 to 7 to account for the new `brain_update` tool. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as User CLI
    participant Pipeline as diarize_episodes.py
    participant YouTube as YouTube (via yt-dlp)
    participant WhisperX as WhisperX (transcribe & align)
    participant Pyannote as Pyannote (diarize)
    participant Mapper as Speaker Mapper
    participant Indexer as index_youtube.py
    participant Store as VectorStore

    CLI->>Pipeline: request diarization for video_id
    Pipeline->>YouTube: download audio (yt-dlp)
    YouTube-->>Pipeline: audio file (wav)
    Pipeline->>WhisperX: transcribe & align audio
    WhisperX-->>Pipeline: aligned segments
    Pipeline->>Pyannote: diarize audio
    Pyannote-->>Pipeline: segments with SPEAKER_00/01
    Pipeline->>Mapper: map SPEAKER_XX -> identities
    Mapper-->>Pipeline: labeled segments (host/guest)
    Pipeline->>Indexer: re-index diarized segments
    Indexer->>Store: delete existing chunks (if --replace)
    Indexer->>Store: embed & upsert chunks
    Store-->>Indexer: confirm upsert
    Indexer-->>CLI: re-indexing complete
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐇 I hopped through audio, stitched each part,
Speakers split like carrot and tart.
Host and guest, now neatly paired,
Indexes built, data shared—
Hooray! My diary's smart. 🥕✨

🚥 Pre-merge checks | ✅ 3 passed

  • Title check ✅ Passed: the title accurately summarizes the main changes (diarization pipeline implementation and diarized transcript re-indexing support).
  • Docstring Coverage ✅ Passed: docstring coverage is 90.91%, above the required 80.00% threshold.
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/diarize_episodes.py`:
- Around line 195-204: The inline comment "Only check first 60 seconds" is
inconsistent with the conditional that uses seg.get("start", 0) > 120; update
either the comment or the conditional for consistency — e.g., change the
threshold in the loop from 120 to 60 (seg.get("start", 0) > 60) so the code
checks the first 60 seconds, or adjust the comment to "Only check first 120
seconds" if 2 minutes is intended; locate the block using intro_phrases and the
loop over segments to make the change.
- Around line 138-140: The cache directory path is built incorrectly causing a
duplicated "data" segment; change the cache_dir computation to remove the extra
"data" component so it points to data/aligned relative to the audio file (i.e.,
set cache_dir using audio_path.parent.parent / "aligned" instead of
audio_path.parent.parent / "data" / "aligned"); update any related uses of
cache_dir/cache_path (variables cache_dir and cache_path) to match this
corrected path and ensure it still works when audio_path is like
data/audio/VIDEO_ID.wav.
- Around line 267-268: Replace the hardcoded virtualenv Python path used in the
variable venv_python with the current interpreter by using sys.executable;
locate the variables script_path and venv_python in diarize_episodes.py and set
venv_python to sys.executable (import sys at top if missing) so the script
invokes the same Python that's running this module rather than assuming
".venv/bin/python3".

In `@scripts/index_youtube.py`:
- Around line 724-753: The block in index_youtube.py duplicates embedding/upsert
logic from index_single_video; extract a helper function (e.g.
_embed_and_upsert_chunks(chunks, video_id, project, upload_date, store)) that
encapsulates calling embed_chunks, building chunk_data entries (use source_file
= f"youtube:{video_id}", include metadata upload_date and created_at
formatting), collecting embeddings, and calling store.upsert_chunks; replace the
duplicated loops in both index_single_video and the current block with calls to
this new helper to remove duplication and keep behavior identical (preserve
fields: id, content, metadata, source_file, project, content_type, value_type,
char_count, source, conversation_id, position, created_at).
- Around line 697-698: The video ID extraction only matches standard query URLs
and falls back to the whole URL; update the logic around video_id_match/vid to
also detect youtu.be short links and extract their path segment as the ID (e.g.,
check for hostname "youtu.be" or match r"(?:youtu\.be/)([A-Za-z0-9_-]+)"), or
parse args.url with urllib.parse to handle both formats and set vid to the
extracted ID rather than the raw URL (adjust the variables video_id_match and
vid accordingly).
- Around line 639-657: The delete_video_chunks function runs DELETE statements
on store.conn via cursor but never commits, so ensure the deletions are
persisted by calling a commit on the same connection after the DELETEs complete
(e.g., call store.conn.commit() right after the second cursor.execute and before
returning); locate the function delete_video_chunks and add the commit using the
existing store.conn (or use a transaction/context manager if preferred) so the
DELETEs to chunk_vectors and chunks are durable.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ebe5df and ef606df.

📒 Files selected for processing (3)
  • .gitignore
  • scripts/diarize_episodes.py
  • scripts/index_youtube.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/index_youtube.py (1)
src/brainlayer/vector_store.py (1)
  • upsert_chunks (490-549)
scripts/diarize_episodes.py (4)
src/brainlayer/storage.py (1)
  • exists (64-66)
src/brainlayer/pipeline/unified_timeline.py (1)
  • load (561-569)
src/brainlayer/pipeline/semantic_style.py (1)
  • model (157-162)
scripts/index_youtube.py (1)
  • main (660-806)
🔇 Additional comments (8)
.gitignore (1)

28-28: LGTM!

Ignoring data/ is appropriate for excluding generated artifacts (audio WAVs, diarized JSONs) from version control.

scripts/index_youtube.py (2)

611-636: LGTM!

The function cleanly handles both JSON formats (list and dict with segments key), gracefully handles missing speaker labels, and correctly computes duration.


675-678: LGTM!

CLI arguments are appropriately typed and clearly documented.

scripts/diarize_episodes.py (5)

46-59: LGTM!

Episode list and directory path definitions are well-structured. The relative path calculation correctly resolves to {repo}/data/{audio|diarized}.


62-73: LGTM!

Token handling follows standard HuggingFace patterns with clear error messaging.


292-358: LGTM!

The function has good structure with clear stages, appropriate error handling, and effective caching to avoid redundant work.


361-408: LGTM!

Main function has sensible defaults (batch size for thermal management), secure token handling (logs length, not value), and clear progress reporting.


86-93: No action needed. The --remote-components ejs:github flag is a legitimate, documented yt-dlp option for fetching EJS JavaScript components from the yt-dlp-ejs GitHub repository. It is not outdated, renamed, or fork-specific.

Likely an incorrect or invalid review comment.

Comment thread scripts/index_youtube.py
Comment on lines +639 to +657
```python
def delete_video_chunks(store: VectorStore, video_id: str) -> int:
    """Delete all chunks for a video ID. Returns count deleted."""
    source_file = f"youtube:{video_id}"
    cursor = store.conn.cursor()
    # Count first
    count = list(cursor.execute(
        "SELECT COUNT(*) FROM chunks WHERE source_file = ?", (source_file,)
    ))[0][0]
    if count == 0:
        return 0
    # Delete from vectors first (FK dependency)
    cursor.execute(
        "DELETE FROM chunk_vectors WHERE chunk_id IN "
        "(SELECT id FROM chunks WHERE source_file = ?)",
        (source_file,),
    )
    cursor.execute("DELETE FROM chunks WHERE source_file = ?", (source_file,))
    log.info(f"  Deleted {count} existing chunks for {video_id}")
    return count
```
⚠️ Potential issue | 🟠 Major

Missing transaction commit after delete operations.

The function executes DELETE statements but doesn't commit the transaction. If store.conn uses the default SQLite isolation (no autocommit), these deletions won't persist unless a subsequent operation commits. This could lead to data inconsistency if the script fails between delete and upsert.

🔧 Proposed fix: commit after deletions
```diff
     cursor.execute("DELETE FROM chunks WHERE source_file = ?", (source_file,))
+    store.conn.commit()
     log.info(f"  Deleted {count} existing chunks for {video_id}")
     return count
```
```bash
#!/bin/bash
# Verify VectorStore's connection autocommit behavior and commit patterns
ast-grep --pattern $'class VectorStore {
  $$$
  def __init__($$$) {
    $$$
  }
  $$$
}'

# Check if connection uses autocommit or explicit commits
rg -n "isolation_level|autocommit|\.commit\(\)" src/brainlayer/vector_store.py
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/index_youtube.py` around lines 639 - 657, The delete_video_chunks
function runs DELETE statements on store.conn via cursor but never commits, so
ensure the deletions are persisted by calling a commit on the same connection
after the DELETEs complete (e.g., call store.conn.commit() right after the
second cursor.execute and before returning); locate the function
delete_video_chunks and add the commit using the existing store.conn (or use a
transaction/context manager if preferred) so the DELETEs to chunk_vectors and
chunks are durable.
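
As an alternative to a bare `commit()`, the transaction/context-manager variant the comment alludes to could look like this sketch. It assumes standard `sqlite3` semantics (the project's actual `store.conn` may be an APSW connection, whose context manager has similar transaction behavior); the function name here is hypothetical.

```python
import sqlite3

def delete_video_chunks_durable(conn: sqlite3.Connection, video_id: str) -> int:
    """Delete a video's chunks and vectors atomically; returns count deleted."""
    source_file = f"youtube:{video_id}"
    count = conn.execute(
        "SELECT COUNT(*) FROM chunks WHERE source_file = ?", (source_file,)
    ).fetchone()[0]
    if count == 0:
        return 0
    with conn:  # commits both DELETEs on success, rolls back on exception
        conn.execute(
            "DELETE FROM chunk_vectors WHERE chunk_id IN "
            "(SELECT id FROM chunks WHERE source_file = ?)",
            (source_file,),
        )
        conn.execute("DELETE FROM chunks WHERE source_file = ?", (source_file,))
    return count
```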

Comment thread scripts/index_youtube.py
Comment on lines +697 to +698
```python
        video_id_match = re.search(r"[?&]v=([a-zA-Z0-9_-]+)", args.url)
        vid = video_id_match.group(1) if video_id_match else args.url
```
⚠️ Potential issue | 🟡 Minor

Video ID extraction doesn't handle youtu.be short URLs.

The regex [?&]v=([a-zA-Z0-9_-]+) handles standard YouTube URLs but not short URLs like https://youtu.be/VIDEO_ID. The fallback uses the raw URL as the video ID, which would produce incorrect source_file values.

🔧 Proposed fix: handle both URL formats
```diff
-        video_id_match = re.search(r"[?&]v=([a-zA-Z0-9_-]+)", args.url)
-        vid = video_id_match.group(1) if video_id_match else args.url
+        # Handle both youtube.com/watch?v=ID and youtu.be/ID formats
+        video_id_match = re.search(r"(?:[?&]v=|youtu\.be/)([a-zA-Z0-9_-]+)", args.url)
+        vid = video_id_match.group(1) if video_id_match else args.url
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/index_youtube.py` around lines 697 - 698, The video ID extraction
only matches standard query URLs and falls back to the whole URL; update the
logic around video_id_match/vid to also detect youtu.be short links and extract
their path segment as the ID (e.g., check for hostname "youtu.be" or match
r"(?:youtu\.be/)([A-Za-z0-9_-]+)"), or parse args.url with urllib.parse to
handle both formats and set vid to the extracted ID rather than the raw URL
(adjust the variables video_id_match and vid accordingly).
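
The `urllib.parse`-based variant the comment suggests might look like this (hypothetical helper name; behavior follows the review's description):

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url: str) -> str:
    """Extract a YouTube video ID from watch?v= URLs, youtu.be short links,
    or pass through a bare ID unchanged."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        # Short link: the ID is the first path segment
        return parsed.path.lstrip("/").split("/")[0] or url
    if parsed.hostname and "youtube" in parsed.hostname:
        qs = parse_qs(parsed.query)
        if "v" in qs:
            return qs["v"][0]
    return url  # assume the caller passed a bare video ID
```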

Comment thread scripts/index_youtube.py
Comment on lines +724 to +753
```python
    else:
        log.info(f"  Embedding {len(chunks)} chunks...")
        embedded = embed_chunks(chunks)
        source_file = f"youtube:{vid}"
        chunk_data = []
        embeddings = []
        for i, ec in enumerate(embedded):
            c = ec.chunk
            meta = dict(c.metadata)
            if upload_date:
                meta["upload_date"] = upload_date
            created_at = None
            if upload_date and len(upload_date) == 8:
                created_at = f"{upload_date[:4]}-{upload_date[4:6]}-{upload_date[6:8]}T00:00:00+00:00"
            chunk_data.append({
                "id": f"{source_file}:{i}",
                "content": c.content,
                "metadata": meta,
                "source_file": source_file,
                "project": args.project,
                "content_type": c.content_type.value,
                "value_type": c.value.value,
                "char_count": c.char_count,
                "source": "youtube",
                "conversation_id": source_file,
                "position": i,
                "created_at": created_at,
            })
            embeddings.append(ec.embedding)
        total_chunks = store.upsert_chunks(chunk_data, embeddings)
```
🧹 Nitpick | 🔵 Trivial

Significant code duplication with index_single_video.

The embedding and upsert logic (lines 724-753) duplicates code from index_single_video (lines 564-603). This increases maintenance burden and risk of divergence.

Consider extracting a helper function like _embed_and_upsert_chunks(chunks, video_id, project, upload_date, store).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/index_youtube.py` around lines 724 - 753, The block in
index_youtube.py duplicates embedding/upsert logic from index_single_video;
extract a helper function (e.g. _embed_and_upsert_chunks(chunks, video_id,
project, upload_date, store)) that encapsulates calling embed_chunks, building
chunk_data entries (use source_file = f"youtube:{video_id}", include metadata
upload_date and created_at formatting), collecting embeddings, and calling
store.upsert_chunks; replace the duplicated loops in both index_single_video and
the current block with calls to this new helper to remove duplication and keep
behavior identical (preserve fields: id, content, metadata, source_file,
project, content_type, value_type, char_count, source, conversation_id,
position, created_at).

EtanHey and others added 2 commits February 26, 2026 17:06
- New MCP tool brain_update with 3 actions:
  - update: change content (re-embeds), tags, importance on existing chunks
  - archive: soft-delete (sets value_type=ARCHIVED, removes from vector index)
  - merge: keep one memory, archive duplicates
- VectorStore: add update_chunk(), archive_chunk(), get_chunk() methods
- Update tool count: 6 → 7
- Fix test_tool_count assertion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix cache dir double "data" prefix (data/data/aligned → data/aligned)
- Fix comment/code mismatch (60s → 2 minutes matches actual 120s check)
- Use sys.executable instead of hardcoded .venv path for portability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/mcp/__init__.py`:
- Around line 1607-1692: _brain_update calls json.dumps() but json is not
imported, causing a NameError at runtime; add a local import inside the function
(e.g., add "import json" near the start of _brain_update) or import json at
module scope to match the pattern used by _brain_digest/_brain_entity so
json.dumps calls in _brain_update succeed.
- Around line 1690-1691: The module calls logger.error and logger.warning but
never defines logger; add an import and module-level logger initialization by
importing the standard logging module and creating logger =
logging.getLogger(__name__) near the top of the file so calls like
logger.error("brain_update failed: %s", e) and logger.warning(...) work without
raising NameError; place this initialization alongside other top-level
imports/initializations in the src/brainlayer/mcp/__init__.py module.
- Around line 1623-1626: Several handlers (_brain_update, _file_timeline,
_context, _sessions and others) have inconsistent return types—some return
list[TextContent] on success but CallToolResult on error; choose one consistent
contract and implement it: either (A) change all handler annotations to "->
CallToolResult" and wrap successful list[TextContent] payloads into a
CallToolResult success variant, or (B) change annotations to "->
list[TextContent] | CallToolResult" to reflect reality; locate functions
_brain_update, _file_timeline, _context, _sessions, _brain_digest, _brain_entity
and the helper _error_result, then update their type hints and return statements
accordingly and update any callers to handle the unified return type.

In `@src/brainlayer/vector_store.py`:
- Around line 603-620: get_chunk currently opens a write cursor via
self.conn.cursor() even though it's a read-only operation; change it to use the
read-only helper by acquiring the cursor from self._read_cursor() (i.e., replace
the cursor = self.conn.cursor() call with cursor = self._read_cursor()) so the
method get_chunk uses the same non-blocking read connection pattern as other
read methods and avoids blocking concurrent writes.
- Around line 551-601: The update_chunk and archive_chunk methods lack the
explicit BUSY-retry loop used by update_enrichment/update_sentiment; wrap their
write operations (the SELECT check and subsequent UPDATE/DELETE/INSERT calls
inside update_chunk, and the SELECT plus UPDATE/DELETE in archive_chunk) in the
same retry-on-BUSY pattern (catch sqlite3.OperationalError/BusyError or check
error message for "database is locked"), retry a few times with short
sleeps/backoff before failing, and ensure the cursor/connection handling and
final return behavior remain identical to the existing methods so behavior and
commits remain consistent with the rest of the file.
- Line 566: Remove the unused local variable now in
src/brainlayer/vector_store.py: delete the line that assigns now =
datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ") (or replace it with
the intended timestamp usage if you actually meant to set/update a timestamp
variable), so no unused variable remains in the scope where it's defined.
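
The BUSY-retry pattern the review references could be factored into a small wrapper like the sketch below (illustrative; the actual retry loop in `update_enrichment`/`update_sentiment` may differ in retry count, backoff, and exception types):

```python
import sqlite3
import time

def with_busy_retry(fn, retries=5, base_delay=0.05):
    """Call fn(), retrying with exponential backoff when SQLite reports
    'database is locked'; re-raise any other OperationalError immediately."""
    for attempt in range(retries):
        try:
            return fn()
        except sqlite3.OperationalError as e:
            if "database is locked" not in str(e) or attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying
```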

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef606df and bcecee0.

📒 Files selected for processing (3)
  • src/brainlayer/mcp/__init__.py
  • src/brainlayer/vector_store.py
  • tests/test_think_recall_integration.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.11)
  • GitHub Check: Cursor Bugbot
🧰 Additional context used
📓 Path-based instructions (4)
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use ruff check src/ for linting and ruff format src/ for code formatting

Files:

  • src/brainlayer/mcp/__init__.py
  • src/brainlayer/vector_store.py
src/brainlayer/mcp/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names

Files:

  • src/brainlayer/mcp/__init__.py
src/brainlayer/vector_store.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use sqlite-vec with APSW for vector storage, WAL mode, and PRAGMA busy_timeout = 5000 for concurrent multi-process safety

Files:

  • src/brainlayer/vector_store.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Run tests using pytest from the project root

Files:

  • tests/test_think_recall_integration.py
🧠 Learnings (3)
📚 Learning: 2026-02-23T16:51:38.317Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/mcp/**/*.py : Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names

Applied to files:

  • src/brainlayer/mcp/__init__.py
  • tests/test_think_recall_integration.py
📚 Learning: 2026-02-23T16:51:38.317Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/cli/**/*.py : Enable project-specific indexing with `brainlayer index --project <project_name>` and incremental indexing with `brainlayer index-fast`

Applied to files:

  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-02-23T16:51:38.317Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/pipeline/enrichment.py : Enrich chunks with 10 metadata fields: summary, tags, importance (1-10), intent, primary_symbols, resolved_query, epistemic_level, version_scope, debt_impact, and external_deps

Applied to files:

  • src/brainlayer/vector_store.py
🧬 Code graph analysis (3)
src/brainlayer/mcp/__init__.py (1)
src/brainlayer/vector_store.py (3)
  • archive_chunk (590-601)
  • get_chunk (603-620)
  • update_chunk (551-588)
src/brainlayer/vector_store.py (2)
src/brainlayer/clustering.py (1)
  • serialize_f32 (53-60)
scripts/reembed_bge_m3.py (1)
  • serialize_f32 (39-41)
tests/test_think_recall_integration.py (1)
src/brainlayer/mcp/__init__.py (1)
  • list_tools (529-922)
🔇 Additional comments (4)
tests/test_think_recall_integration.py (1)

246-255: LGTM!

The test correctly updates the expected tool count from 6 to 7, and the docstring accurately lists all tools including the new brain_update tool. This aligns with the list_tools() implementation in src/brainlayer/mcp/__init__.py.

src/brainlayer/mcp/__init__.py (3)

50-68: LGTM!

Server instructions correctly updated to reflect 7 tools and include the new brain_update tool description. The description accurately covers all three actions (update, archive, merge).


872-921: LGTM!

The brain_update tool definition is well-structured:

  • Correctly uses _WRITE annotations for a destructive operation
  • Clear enum for actions with appropriate descriptions
  • Input schema properly documents which fields are used for which actions

1067-1075: LGTM!

The routing correctly passes all parameters to _brain_update. Appropriately omits the timeout wrapper, consistent with other write operations (see AIDEV-NOTE at line 1013).

Comment on lines +1607 to +1692
```python
async def _brain_update(
    action: str,
    chunk_id: str,
    content: str | None = None,
    tags: list[str] | None = None,
    importance: int | None = None,
    merge_chunk_ids: list[str] | None = None,
):
    """Update, archive, or merge memories."""
    try:
        store = _get_vector_store()

        if action == "archive":
            ok = store.archive_chunk(chunk_id)
            if not ok:
                return _error_result(f"Chunk not found: {chunk_id}")
            return [TextContent(
                type="text",
                text=json.dumps({"action": "archived", "chunk_id": chunk_id}),
            )]

        elif action == "update":
            # Verify chunk exists
            existing = store.get_chunk(chunk_id)
            if not existing:
                return _error_result(f"Chunk not found: {chunk_id}")

            # Re-embed if content changed
            embedding = None
            if content is not None:
                loop = asyncio.get_running_loop()
                model = _get_embedding_model()
                embedding = await loop.run_in_executor(None, model.embed_query, content)

            ok = store.update_chunk(
                chunk_id=chunk_id,
                content=content,
                tags=tags,
                importance=float(importance) if importance is not None else None,
                embedding=embedding,
            )
            if not ok:
                return _error_result(f"Update failed for: {chunk_id}")

            result = {"action": "updated", "chunk_id": chunk_id, "fields": []}
            if content is not None:
                result["fields"].append("content")
            if tags is not None:
                result["fields"].append("tags")
            if importance is not None:
                result["fields"].append("importance")
            return [TextContent(type="text", text=json.dumps(result))]

        elif action == "merge":
            if not merge_chunk_ids:
                return _error_result("merge requires merge_chunk_ids (the duplicates to archive)")

            # Verify the keeper exists
            keeper = store.get_chunk(chunk_id)
            if not keeper:
                return _error_result(f"Keeper chunk not found: {chunk_id}")

            archived = []
            failed = []
            for dup_id in merge_chunk_ids:
                ok = store.archive_chunk(dup_id)
                if ok:
                    archived.append(dup_id)
                else:
                    failed.append(dup_id)

            result = {
                "action": "merged",
                "kept": chunk_id,
                "archived": archived,
                "failed": failed,
            }
            return [TextContent(type="text", text=json.dumps(result))]

        else:
            return _error_result(f"Unknown action: {action}. Use update, archive, or merge.")

    except Exception as e:
        logger.error("brain_update failed: %s", e)
        return _error_result(f"brain_update error: {e}")
```

⚠️ Potential issue | 🔴 Critical

Missing json import will cause NameError at runtime.

The _brain_update function uses json.dumps() at lines 1625, 1658, and 1684, but doesn't import the json module. Other handlers in this file (e.g., _brain_digest at line 1201, _brain_entity at line 1234) import json locally within the function.

🐛 Proposed fix
```diff
 async def _brain_update(
     action: str,
     chunk_id: str,
     content: str | None = None,
     tags: list[str] | None = None,
     importance: int | None = None,
     merge_chunk_ids: list[str] | None = None,
 ):
     """Update, archive, or merge memories."""
+    import json
+
     try:
         store = _get_vector_store()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/mcp/__init__.py` around lines 1607 - 1692, _brain_update calls
json.dumps() but json is not imported, causing a NameError at runtime; add a
local import inside the function (e.g., add "import json" near the start of
_brain_update) or import json at module scope to match the pattern used by
_brain_digest/_brain_entity so json.dumps calls in _brain_update succeed.

Comment thread src/brainlayer/mcp/__init__.py Outdated
Comment on lines +1623 to +1626
            return [TextContent(
                type="text",
                text=json.dumps({"action": "archived", "chunk_id": chunk_id}),
            )]

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check return patterns in the MCP file
echo "=== Searching for return [TextContent patterns ==="
rg -n "return \[TextContent" src/brainlayer/mcp/__init__.py

echo -e "\n=== Searching for return CallToolResult patterns ==="
rg -n "return CallToolResult" src/brainlayer/mcp/__init__.py

echo -e "\n=== Checking _error_result function ==="
rg -n "_error_result" src/brainlayer/mcp/__init__.py -A 3 | head -30

Repository: EtanHey/brainlayer

Length of output: 2888


🏁 Script executed:

#!/bin/bash
# Check the _brain_update function and its decorator
echo "=== Checking _brain_update function signature and decorator ==="
rg -n "async def _brain_update" src/brainlayer/mcp/__init__.py -B 2 -A 20 | head -50

echo -e "\n=== Checking _brain_digest function signature ==="
rg -n "async def _brain_digest" src/brainlayer/mcp/__init__.py -B 2 -A 5

echo -e "\n=== Checking _search function signature ==="
rg -n "async def _search" src/brainlayer/mcp/__init__.py -B 2 -A 5

echo -e "\n=== Checking call_tool implementation ==="
rg -n "async def call_tool" src/brainlayer/mcp/__init__.py -A 30 | head -60

Repository: EtanHey/brainlayer

Length of output: 2993


🏁 Script executed:

#!/bin/bash
# Check _with_timeout implementation
echo "=== Checking _with_timeout implementation ==="
rg -n "def _with_timeout" src/brainlayer/mcp/__init__.py -A 15

echo -e "\n=== Checking _brain_search return type ==="
rg -n "async def _brain_search" src/brainlayer/mcp/__init__.py -B 2 -A 2

echo -e "\n=== Checking _brain_store return type ==="
rg -n "async def _brain_store" src/brainlayer/mcp/__init__.py -B 2 -A 2

echo -e "\n=== Checking imports for MCP types ==="
head -50 src/brainlayer/mcp/__init__.py | grep -E "from|import"

Repository: EtanHey/brainlayer

Length of output: 1332


🏁 Script executed:

#!/bin/bash
# Get full MCP type imports
echo "=== Full MCP imports ==="
head -100 src/brainlayer/mcp/__init__.py | grep -A 10 "from mcp.types import"

echo -e "\n=== Return type annotations for handlers ==="
rg -n "async def .*->.*:" src/brainlayer/mcp/__init__.py | head -20

echo -e "\n=== Check what _brain_search returns ==="
rg -n "return.*TextContent\|return.*CallToolResult" src/brainlayer/mcp/__init__.py | grep -A 2 -B 2 "1360:"
sed -n '1360,1950p' src/brainlayer/mcp/__init__.py | rg -n "return " | head -20

echo -e "\n=== Check actual return statements in _brain_search ==="
sed -n '1362,1900p' src/brainlayer/mcp/__init__.py | tail -100 | head -50

Repository: EtanHey/brainlayer

Length of output: 4228


🏁 Script executed:

#!/bin/bash
# Check the complete _brain_search function to see what it returns
echo "=== Full _brain_search implementation (lines 1362-1450) ==="
sed -n '1362,1450p' src/brainlayer/mcp/__init__.py

echo -e "\n=== Check line endings of _brain_search ==="
sed -n '1850,1900p' src/brainlayer/mcp/__init__.py

Repository: EtanHey/brainlayer

Length of output: 5388


🏁 Script executed:

#!/bin/bash
# Let's verify what types are actually returned by examining the complete pattern
echo "=== Check what _brain_digest actually returns ==="
sed -n '1194,1228p' src/brainlayer/mcp/__init__.py

echo -e "\n=== Check what _search returns (full) ==="
sed -n '1730,1760p' src/brainlayer/mcp/__init__.py | tail -20

echo -e "\n=== Check MCP framework usage and tool registration ==="
rg -n "register_tool\|add_tool" src/brainlayer/mcp/__init__.py | head -10

Repository: EtanHey/brainlayer

Length of output: 2035


🏁 Script executed:

#!/bin/bash
# Check how call_tool is used and what the server expects
echo "=== Check MCP server setup and tool_handler registration ==="
rg -n "on_call_tool\|call_tool_handler\|Tool\(" src/brainlayer/mcp/__init__.py | head -20

echo -e "\n=== Check if there's a wrapper around call_tool ==="
rg -n "@server\|@app" src/brainlayer/mcp/__init__.py | head -10

echo -e "\n=== Full call_tool wrapper check ==="
sed -n '970,1010p' src/brainlayer/mcp/__init__.py

echo -e "\n=== Check server initialization ==="
rg -n "server = Server\|Server(" src/brainlayer/mcp/__init__.py -A 3

Repository: EtanHey/brainlayer

Length of output: 2105


🏁 Script executed:

#!/bin/bash
# Check server initialization with proper regex
echo "=== Check server initialization ==="
rg -n "server = Server" src/brainlayer/mcp/__init__.py -A 3

echo -e "\n=== Check if call_tool decorator returns are handled ==="
sed -n '980,1020p' src/brainlayer/mcp/__init__.py

echo -e "\n=== Verify return type of _search function ==="
sed -n '1697,1800p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return " | head -20

echo -e "\n=== Check what _search actually returns at end ==="
sed -n '1800,1850p' src/brainlayer/mcp/__init__.py

Repository: EtanHey/brainlayer

Length of output: 4794


🏁 Script executed:

#!/bin/bash
# Check what the actual MCP spec expects
echo "=== Check MCP library version and CallToolResult usage ==="
rg -n "CallToolResult" src/brainlayer/mcp/__init__.py | wc -l

echo -e "\n=== All CallToolResult usages ==="
rg -n "CallToolResult" src/brainlayer/mcp/__init__.py

echo -e "\n=== Check _store_new return type ==="
rg -n "async def _store_new" src/brainlayer/mcp/__init__.py -A 2

echo -e "\n=== Check what _store_new returns ==="
sed -n '2400,2500p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return " | head -20

echo -e "\n=== Check the actual return from _brain_update to be sure ==="
sed -n '1605,1700p' src/brainlayer/mcp/__init__.py | grep -n "return"

Repository: EtanHey/brainlayer

Length of output: 2291


🏁 Script executed:

#!/bin/bash
# Check the full _brain_update function to see all returns
echo "=== Full _brain_update function ==="
sed -n '1607,1692p' src/brainlayer/mcp/__init__.py

echo -e "\n=== Check how _brain_search handles different return types from nested calls ==="
sed -n '1410,1450p' src/brainlayer/mcp/__init__.py

Repository: EtanHey/brainlayer

Length of output: 4876


🏁 Script executed:

#!/bin/bash
# Check MCP framework expectations and how it handles returns
echo "=== Check imports and MCP version ==="
grep -E "mcp|MCP" src/brainlayer/mcp/__init__.py | head -20

echo -e "\n=== Check if there's any wrapping logic in the decorator or elsewhere ==="
rg -n "class.*Server|def.*call_tool" src/brainlayer/mcp/__init__.py | head -10

echo -e "\n=== Check if there's a README or docs about MCP setup ==="
find . -name "README*" -o -name "*.md" | head -10 | xargs grep -l "MCP\|CallToolResult" 2>/dev/null

echo -e "\n=== Check requirements or pyproject for MCP version ==="
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | xargs grep -i mcp 2>/dev/null

Repository: EtanHey/brainlayer

Length of output: 1422


🏁 Script executed:

#!/bin/bash
# Check if there's any test or documentation about return types
echo "=== Check test files for MCP tool handlers ==="
find . -name "*test*" -type f | head -10

echo -e "\n=== Check how _brain_search is being called from call_tool ==="
sed -n '989,1011p' src/brainlayer/mcp/__init__.py

echo -e "\n=== Check if _with_timeout does any transformations ==="
sed -n '26,45p' src/brainlayer/mcp/__init__.py

echo -e "\n=== Check what happens after call_tool returns ==="
sed -n '1100,1140p' src/brainlayer/mcp/__init__.py | head -40

Repository: EtanHey/brainlayer

Length of output: 3623


🏁 Script executed:

#!/bin/bash
# Check if there are any runtime errors or issues logged related to return types
echo "=== Check if tests exist for MCP return types ==="
find . -name "*test*mcp*" -o -name "*mcp*test*" 2>/dev/null | head -10

echo -e "\n=== Check CLAUDE.md or docs for MCP return type guidance ==="
grep -A 5 -B 5 "CallToolResult\|return type" ./CLAUDE.md 2>/dev/null | head -30

echo -e "\n=== Check what _brain_search nested calls actually return ==="
rg -n "async def _regression|async def _recall|async def _file_timeline" src/brainlayer/mcp/__init__.py -A 1

Repository: EtanHey/brainlayer

Length of output: 414


🏁 Script executed:

#!/bin/bash
# Check what these nested functions return
echo "=== Check _file_timeline return type ==="
sed -n '1922,1960p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return \[|return \(|-> "

echo -e "\n=== Check _regression return type ==="
sed -n '1991,2030p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return \[|return \(|-> "

echo -e "\n=== Check _recall return type ==="
sed -n '2153,2200p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return \[|return \(|-> "

echo -e "\n=== Look at actual return statements in _file_timeline ==="
sed -n '1922,1970p' src/brainlayer/mcp/__init__.py | tail -30

echo -e "\n=== Look at actual return statements in _recall ==="
sed -n '2153,2250p' src/brainlayer/mcp/__init__.py | tail -50

Repository: EtanHey/brainlayer

Length of output: 3114


Return types are inconsistent across multiple handlers, not just _brain_update.

While the specific concern about _brain_update (lines 1623–1626, 1658, 1684) returning raw [TextContent] on success but CallToolResult on error is valid, this is part of a broader type annotation mismatch. Many handlers like _file_timeline, _context, and _sessions are annotated -> list[TextContent] but can actually return CallToolResult via _error_result() on error. Only _brain_digest and _brain_entity consistently follow the -> CallToolResult pattern.

To address this properly: either (1) unify all handlers to return CallToolResult consistently and update type annotations, or (2) fix all type annotations to reflect actual return types (list[TextContent] | CallToolResult). The current mixed pattern creates implicit contracts that don't match actual behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/mcp/__init__.py` around lines 1623 - 1626, Several handlers
(_brain_update, _file_timeline, _context, _sessions and others) have
inconsistent return types—some return list[TextContent] on success but
CallToolResult on error; choose one consistent contract and implement it: either
(A) change all handler annotations to "-> CallToolResult" and wrap successful
list[TextContent] payloads into a CallToolResult success variant, or (B) change
annotations to "-> list[TextContent] | CallToolResult" to reflect reality;
locate functions _brain_update, _file_timeline, _context, _sessions,
_brain_digest, _brain_entity and the helper _error_result, then update their
type hints and return statements accordingly and update any callers to handle
the unified return type.

Comment on lines +1690 to +1691
        logger.error("brain_update failed: %s", e)
        return _error_result(f"brain_update error: {e}")

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if logger is imported/defined in the MCP module
rg -n "^logger|^import.*logging|from.*logging" src/brainlayer/mcp/__init__.py | head -20

Repository: EtanHey/brainlayer

Length of output: 44


🏁 Script executed:

#!/bin/bash
# Search for logger definition/import anywhere in the file
echo "=== Searching for 'logger' in file ==="
rg -n "logger" src/brainlayer/mcp/__init__.py | head -10

echo "=== Searching for 'logging' imports ==="
rg -n "logging" src/brainlayer/mcp/__init__.py | head -10

echo "=== Check first 50 lines for imports ==="
head -50 src/brainlayer/mcp/__init__.py | cat -n

Repository: EtanHey/brainlayer

Length of output: 2243


🏁 Script executed:

#!/bin/bash
# Check file size and look for logger initialization anywhere
echo "=== Total lines and check middle sections ==="
wc -l src/brainlayer/mcp/__init__.py

echo "=== Search all occurrences of 'logger' ==="
rg -n "logger" src/brainlayer/mcp/__init__.py

echo "=== Search for any logging-related assignments ==="
rg -n "logging\|logger\s*=" src/brainlayer/mcp/__init__.py

echo "=== Check if logger is a parameter or global ==="
rg -n "def.*logger|logger\s*:" src/brainlayer/mcp/__init__.py

Repository: EtanHey/brainlayer

Length of output: 426


Add missing logger import and initialization.

logger.error() is called at line 1690 (and logger.warning() at line 1340), but logger is not imported or defined anywhere in the module. This will raise a NameError at runtime when these code paths execute. Add import logging and initialize logger = logging.getLogger(__name__) at the module level.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/mcp/__init__.py` around lines 1690 - 1691, The module calls
logger.error and logger.warning but never defines logger; add an import and
module-level logger initialization by importing the standard logging module and
creating logger = logging.getLogger(__name__) near the top of the file so calls
like logger.error("brain_update failed: %s", e) and logger.warning(...) work
without raising NameError; place this initialization alongside other top-level
imports/initializations in the src/brainlayer/mcp/__init__.py module.

Comment on lines +551 to +601
    def update_chunk(
        self,
        chunk_id: str,
        content: Optional[str] = None,
        tags: Optional[List[str]] = None,
        importance: Optional[float] = None,
        embedding: Optional[List[float]] = None,
    ) -> bool:
        """Update fields on an existing chunk. Returns True if chunk was found."""
        cursor = self.conn.cursor()
        # Check chunk exists
        rows = list(cursor.execute("SELECT id FROM chunks WHERE id = ?", (chunk_id,)))
        if not rows:
            return False

        now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
        if content is not None:
            cursor.execute(
                "UPDATE chunks SET content = ?, char_count = ?, summary = ? WHERE id = ?",
                (content, len(content), content[:200], chunk_id),
            )
        if tags is not None:
            cursor.execute(
                "UPDATE chunks SET tags = ? WHERE id = ?",
                (json.dumps(tags), chunk_id),
            )
        if importance is not None:
            cursor.execute(
                "UPDATE chunks SET importance = ? WHERE id = ?",
                (float(max(1, min(10, importance))), chunk_id),
            )
        if embedding is not None:
            cursor.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
            cursor.execute(
                "INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)",
                (chunk_id, serialize_f32(embedding)),
            )
        return True

    def archive_chunk(self, chunk_id: str) -> bool:
        """Soft-delete a chunk by setting value_type to ARCHIVED."""
        cursor = self.conn.cursor()
        rows = list(cursor.execute("SELECT id FROM chunks WHERE id = ?", (chunk_id,)))
        if not rows:
            return False
        cursor.execute(
            "UPDATE chunks SET value_type = 'ARCHIVED' WHERE id = ?", (chunk_id,)
        )
        # Remove from vector index so it doesn't appear in searches
        cursor.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
        return True

🧹 Nitpick | 🔵 Trivial

Consider adding BUSY retry for consistency with other write methods.

update_chunk and archive_chunk lack the BUSY retry logic that update_enrichment and update_sentiment implement (Lines 1336-1344, 1365-1376). While PRAGMA busy_timeout = 5000 provides base protection, other write methods in this file use explicit retry to handle edge cases under heavy contention.

For consistency, consider adding the same retry pattern, or document why these methods rely solely on the busy_timeout pragma.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/vector_store.py` around lines 551 - 601, The update_chunk and
archive_chunk methods lack the explicit BUSY-retry loop used by
update_enrichment/update_sentiment; wrap their write operations (the SELECT
check and subsequent UPDATE/DELETE/INSERT calls inside update_chunk, and the
SELECT plus UPDATE/DELETE in archive_chunk) in the same retry-on-BUSY pattern
(catch sqlite3.OperationalError/BusyError or check error message for "database
is locked"), retry a few times with short sleeps/backoff before failing, and
ensure the cursor/connection handling and final return behavior remain identical
to the existing methods so behavior and commits remain consistent with the rest
of the file.
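The retry pattern the comment refers to can be sketched as a small decorator. The attempt count, backoff values, and the simplified `archive_chunk` body below are illustrative assumptions, not the file's actual implementation.

```python
import sqlite3
import time
from functools import wraps


def retry_on_busy(attempts: int = 3, base_delay: float = 0.1):
    """Retry a SQLite write when the database is locked (sketch; the real
    retry constants in vector_store.py may differ)."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except sqlite3.OperationalError as e:
                    # Re-raise anything that isn't a lock, or the final attempt.
                    if "locked" not in str(e) or attempt == attempts - 1:
                        raise
                    time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        return wrapper
    return decorator


@retry_on_busy()
def archive_chunk(conn: sqlite3.Connection, chunk_id: str) -> bool:
    """Simplified archive under the retry wrapper (hypothetical wiring)."""
    cur = conn.cursor()
    if not list(cur.execute("SELECT id FROM chunks WHERE id = ?", (chunk_id,))):
        return False
    cur.execute("UPDATE chunks SET value_type = 'ARCHIVED' WHERE id = ?", (chunk_id,))
    return True
```

Applied as a decorator, the same wrapper could cover `update_chunk`, `update_enrichment`, and `update_sentiment` without duplicating the loop in each method.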

        if not rows:
            return False

        now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")

⚠️ Potential issue | 🟡 Minor

Remove unused variable now.

The variable now is assigned but never used. This appears to be leftover from a planned timestamp update that wasn't implemented.

🧹 Proposed fix
-        now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/vector_store.py` at line 566, Remove the unused local variable
now in src/brainlayer/vector_store.py: delete the line that assigns now =
datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ") (or replace it with
the intended timestamp usage if you actually meant to set/update a timestamp
variable), so no unused variable remains in the scope where it's defined.

Comment on lines +603 to +620
    def get_chunk(self, chunk_id: str) -> Optional[Dict[str, Any]]:
        """Get a single chunk by ID."""
        cursor = self.conn.cursor()
        rows = list(cursor.execute(
            """SELECT id, content, metadata, source_file, project, content_type,
                      value_type, tags, importance, created_at, summary
               FROM chunks WHERE id = ?""",
            (chunk_id,),
        ))
        if not rows:
            return None
        r = rows[0]
        return {
            "id": r[0], "content": r[1], "metadata": r[2],
            "source_file": r[3], "project": r[4], "content_type": r[5],
            "value_type": r[6], "tags": r[7], "importance": r[8],
            "created_at": r[9], "summary": r[10],
        }

⚠️ Potential issue | 🟡 Minor

Use _read_cursor() for read-only operation.

get_chunk is a read operation but uses the write connection (self.conn.cursor()). For consistency with other read methods in this file and to avoid blocking concurrent writes, use _read_cursor().

♻️ Proposed fix
     def get_chunk(self, chunk_id: str) -> Optional[Dict[str, Any]]:
         """Get a single chunk by ID."""
-        cursor = self.conn.cursor()
+        cursor = self._read_cursor()
         rows = list(cursor.execute(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/vector_store.py` around lines 603 - 620, get_chunk currently
opens a write cursor via self.conn.cursor() even though it's a read-only
operation; change it to use the read-only helper by acquiring the cursor from
self._read_cursor() (i.e., replace the cursor = self.conn.cursor() call with
cursor = self._read_cursor()) so the method get_chunk uses the same non-blocking
read connection pattern as other read methods and avoids blocking concurrent
writes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@EtanHey EtanHey merged commit 9e1ab46 into main Feb 26, 2026
6 checks passed
@EtanHey EtanHey deleted the feat/diarization-pipeline branch February 26, 2026 15:17

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

        if not rows:
            return False

        now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")

Unused now variable in update_chunk method

Low Severity

The now variable is computed via datetime.now(timezone.utc).strftime(...) but never referenced in any of the subsequent SQL UPDATE statements. This looks like it was intended to set a modification timestamp (e.g., enriched_at or an updated_at field) when a chunk is updated, but the actual update was never wired in. Updates silently lose their modification timestamps.

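If the dangling timestamp was meant to record modifications, wiring it in is a one-line change per branch. The `updated_at` column below is an assumption — the schema quoted in this PR doesn't show one — so treat this as a sketch of the intended wiring, not the fix.

```python
from datetime import datetime, timezone


def utc_now_iso() -> str:
    """Same timestamp format as the one computed (but unused) in update_chunk."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")


# Hypothetical wiring inside update_chunk, assuming an `updated_at` column:
#
#     now = utc_now_iso()
#     cursor.execute(
#         "UPDATE chunks SET content = ?, char_count = ?, summary = ?, "
#         "updated_at = ? WHERE id = ?",
#         (content, len(content), content[:200], now, chunk_id),
#     )
```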

Comment thread scripts/index_youtube.py
    vid = video_id_match.group(1) if video_id_match else args.url

    if args.replace:
        delete_video_chunks(store, vid)

Regex video ID extraction mismatches index_single_video extraction

Medium Severity

The vid variable is extracted via a regex matching [?&]v= in the URL. For non-standard YouTube URL formats (e.g., youtu.be/ID or /embed/ID), the regex won't match and vid falls back to the entire URL string. When --replace is used without --diarized-transcript, delete_video_chunks uses this possibly-wrong vid, while index_single_video correctly extracts the real video ID via yt-dlp. This mismatch causes old chunks to not be deleted while new ones are created — resulting in duplicate data.

Additional Locations (1)


Comment thread scripts/index_youtube.py
            })
            embeddings.append(ec.embedding)
        total_chunks = store.upsert_chunks(chunk_data, embeddings)
        log.info(f"  Indexed {total_chunks} chunks for '{title}'")

Duplicated embed-and-upsert logic from index_single_video

Low Severity

The embed → build chunk_data → upsert_chunks block in the diarized-transcript path (inside main()) is nearly a line-for-line copy of the same logic in index_single_video. Any future bug fix or schema change to the chunk-building loop needs to be applied in both places, increasing the risk of divergence.

Additional Locations (1)

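The duplication could be factored into one helper shared by both paths. Every name below (`embed_fn`, `upsert_fn`, the chunk dict shape) is a stand-in for whatever the script actually exposes, so this is a structural sketch rather than a drop-in refactor.

```python
from typing import Any, Callable, Dict, List, Sequence


def embed_and_upsert(
    chunks: Sequence[Dict[str, Any]],
    embed_fn: Callable[[str], List[float]],
    upsert_fn: Callable[[List[Dict[str, Any]], List[List[float]]], int],
) -> int:
    """Embed each chunk's content and upsert in one pass (sketch).

    Both the diarized-transcript path and index_single_video could call this
    instead of duplicating the loop; the callable signatures are assumptions.
    """
    chunk_data: List[Dict[str, Any]] = []
    embeddings: List[List[float]] = []
    for chunk in chunks:
        chunk_data.append(chunk)
        embeddings.append(embed_fn(chunk["content"]))
    return upsert_fn(chunk_data, embeddings)
```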

        cursor.execute("UPDATE chunks SET value_type = 'ARCHIVED' WHERE id = ?", (chunk_id,))
        # Remove from vector index so it doesn't appear in searches
        cursor.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
        return True

Archived chunks still appear in FTS search results

Medium Severity

archive_chunk removes the chunk from chunk_vectors (preventing vector search hits) but does not remove it from chunks_fts. The FTS delete trigger only fires on DELETE FROM chunks, not on UPDATE. Since archive_chunk only updates value_type, the FTS entry persists. The hybrid_search FTS path has no value_type != 'ARCHIVED' filter, so archived/merged memories can still surface in keyword-based search results — defeating the purpose of archiving.

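One way to close the FTS gap is to delete the FTS row on archive and, belt-and-braces, filter on `value_type` in the keyword-search join. The schema below is a minimal stand-in (a plain FTS5 table, not the file's real `chunks_fts` with its triggers), so treat it as a demonstration of the filtering idea, assuming FTS5 is available in the SQLite build.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chunks (id TEXT PRIMARY KEY, content TEXT, value_type TEXT);
CREATE VIRTUAL TABLE chunks_fts USING fts5(id UNINDEXED, content);
""")


def index_chunk(chunk_id: str, content: str) -> None:
    conn.execute("INSERT INTO chunks VALUES (?, ?, 'ACTIVE')", (chunk_id, content))
    conn.execute("INSERT INTO chunks_fts VALUES (?, ?)", (chunk_id, content))


def archive_chunk(chunk_id: str) -> None:
    conn.execute("UPDATE chunks SET value_type = 'ARCHIVED' WHERE id = ?", (chunk_id,))
    # Mirror the archive into FTS: the delete trigger only fires on DELETE
    # FROM chunks, so an UPDATE-based soft delete must clean up explicitly.
    conn.execute("DELETE FROM chunks_fts WHERE id = ?", (chunk_id,))


def keyword_search(query: str) -> list:
    # Filter value_type on the join as well, so even a stale FTS row
    # can never surface an archived memory.
    return [
        row[0]
        for row in conn.execute(
            """SELECT c.id FROM chunks_fts
               JOIN chunks c ON c.id = chunks_fts.id
               WHERE chunks_fts MATCH ? AND c.value_type != 'ARCHIVED'""",
            (query,),
        )
    ]
```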
