feat: diarization pipeline + diarized transcript re-indexing #34
Conversation
- Add scripts/diarize_episodes.py: batch pipeline for WhisperX speaker
diarization of Huberman guest episodes (download → transcribe → align →
diarize → map speakers → save JSON → re-index)
- Add --diarized-transcript and --replace flags to index_youtube.py for
re-indexing with speaker-labeled transcripts
- Fix speaker mapping: content-based detection ("I'm Andrew Huberman")
instead of speaking-time-only heuristic which produced inverted labels
- Add data/ to .gitignore (audio WAVs, diarized JSONs)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
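The content-based speaker mapping described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function name `map_speakers`, the segment dict shape, and the 120-second intro window are assumptions based on the PR description and review comments.

```python
def map_speakers(segments, intro_phrases=("i'm andrew huberman",), window_s=120):
    """Label diarized SPEAKER_XX segments as Host/Guest by scanning the
    first couple of minutes for a self-introduction, instead of relying
    on a speaking-time heuristic (which can invert the labels)."""
    host_label = None
    for seg in segments:
        if seg.get("start", 0) > window_s:
            break  # only inspect the intro window
        text = seg.get("text", "").lower()
        if any(phrase in text for phrase in intro_phrases):
            host_label = seg.get("speaker")
            break
    labeled = []
    for seg in segments:
        is_host = host_label is not None and seg.get("speaker") == host_label
        labeled.append({**seg, "speaker_name": "Host" if is_host else "Guest"})
    return labeled
```

The key design choice is that the intro phrase, not total talk time, anchors the Host label; in interview formats the guest often speaks more than the host.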
📝 Walkthrough: Adds a speaker-diarization pipeline and indexing support: a new script to download audio, transcribe, diarize, map speakers, and re-index episodes; the indexer is extended to accept diarized transcripts and to delete/replace existing video chunks; plus a .gitignore update to ignore `data/`.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as User CLI
    participant YTDL as Audio Downloader (yt-dlp)
    participant YouTube as YouTube
    participant WhisperX as WhisperX (transcribe & align)
    participant Pyannote as Pyannote (diarize)
    participant Mapper as Speaker Mapper
    participant Indexer as Indexing Script
    participant Store as VectorStore
    CLI->>YTDL: request audio for video_id
    YTDL->>YouTube: fetch media
    YouTube-->>YTDL: audio file (wav)
    YTDL->>WhisperX: transcribe & align audio
    WhisperX-->>YTDL: aligned segments
    YTDL->>Pyannote: diarize audio
    Pyannote-->>YTDL: segments with SPEAKER_00/01
    YTDL->>Mapper: map SPEAKER_XX -> identities
    Mapper-->>YTDL: segments labeled (Host/Guest)
    YTDL->>Indexer: send diarized segments for chunking
    Indexer->>Store: embed & upsert chunks (or delete if replace)
    Store-->>Indexer: confirm upsert/delete
    Indexer-->>CLI: re-indexing complete
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Possibly related PRs
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/diarize_episodes.py`:
- Around line 195-204: The inline comment "Only check first 60 seconds" is
inconsistent with the conditional that uses seg.get("start", 0) > 120; update
either the comment or the conditional for consistency — e.g., change the
threshold in the loop from 120 to 60 (seg.get("start", 0) > 60) so the code
checks the first 60 seconds, or adjust the comment to "Only check first 120
seconds" if 2 minutes is intended; locate the block using intro_phrases and the
loop over segments to make the change.
- Around line 138-140: The cache directory path is built incorrectly causing a
duplicated "data" segment; change the cache_dir computation to remove the extra
"data" component so it points to data/aligned relative to the audio file (i.e.,
set cache_dir using audio_path.parent.parent / "aligned" instead of
audio_path.parent.parent / "data" / "aligned"); update any related uses of
cache_dir/cache_path (variables cache_dir and cache_path) to match this
corrected path and ensure it still works when audio_path is like
data/audio/VIDEO_ID.wav.
- Around line 267-268: Replace the hardcoded virtualenv Python path used in the
variable venv_python with the current interpreter by using sys.executable;
locate the variables script_path and venv_python in diarize_episodes.py and set
venv_python to sys.executable (import sys at top if missing) so the script
invokes the same Python that's running this module rather than assuming
".venv/bin/python3".
In `@scripts/index_youtube.py`:
- Around line 724-753: The block in index_youtube.py duplicates embedding/upsert
logic from index_single_video; extract a helper function (e.g.
_embed_and_upsert_chunks(chunks, video_id, project, upload_date, store)) that
encapsulates calling embed_chunks, building chunk_data entries (use source_file
= f"youtube:{video_id}", include metadata upload_date and created_at
formatting), collecting embeddings, and calling store.upsert_chunks; replace the
duplicated loops in both index_single_video and the current block with calls to
this new helper to remove duplication and keep behavior identical (preserve
fields: id, content, metadata, source_file, project, content_type, value_type,
char_count, source, conversation_id, position, created_at).
- Around line 697-698: The video ID extraction only matches standard query URLs
and falls back to the whole URL; update the logic around video_id_match/vid to
also detect youtu.be short links and extract their path segment as the ID (e.g.,
check for hostname "youtu.be" or match r"(?:youtu\.be/)([A-Za-z0-9_-]+)"), or
parse args.url with urllib.parse to handle both formats and set vid to the
extracted ID rather than the raw URL (adjust the variables video_id_match and
vid accordingly).
- Around line 639-657: The delete_video_chunks function runs DELETE statements
on store.conn via cursor but never commits, so ensure the deletions are
persisted by calling a commit on the same connection after the DELETEs complete
(e.g., call store.conn.commit() right after the second cursor.execute and before
returning); locate the function delete_video_chunks and add the commit using the
existing store.conn (or use a transaction/context manager if preferred) so the
DELETEs to chunk_vectors and chunks are durable.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- .gitignore
- scripts/diarize_episodes.py
- scripts/index_youtube.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Cursor Bugbot
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/index_youtube.py (1)
- src/brainlayer/vector_store.py (1): upsert_chunks (490-549)

scripts/diarize_episodes.py (4)
- src/brainlayer/storage.py (1): exists (64-66)
- src/brainlayer/pipeline/unified_timeline.py (1): load (561-569)
- src/brainlayer/pipeline/semantic_style.py (1): model (157-162)
- scripts/index_youtube.py (1): main (660-806)
🔇 Additional comments (8)
.gitignore (1)
28-28: LGTM! Ignoring `data/` is appropriate for excluding generated artifacts (audio WAVs, diarized JSONs) from version control.

scripts/index_youtube.py (2)
611-636: LGTM! The function cleanly handles both JSON formats (a list, or a dict with a segments key), gracefully handles missing speaker labels, and correctly computes duration.
675-678: LGTM! CLI arguments are appropriately typed and clearly documented.

scripts/diarize_episodes.py (5)
46-59: LGTM! Episode list and directory path definitions are well-structured. The relative path calculation correctly resolves to `{repo}/data/{audio|diarized}`.
62-73: LGTM! Token handling follows standard HuggingFace patterns with clear error messaging.
292-358: LGTM! The function has good structure with clear stages, appropriate error handling, and effective caching to avoid redundant work.
361-408: LGTM! Main function has sensible defaults (batch size for thermal management), secure token handling (logs length, not value), and clear progress reporting.
86-93: No action needed. The `--remote-components ejs:github` flag is a legitimate, documented yt-dlp option for fetching EJS JavaScript components from the yt-dlp-ejs GitHub repository. It is not outdated, renamed, or fork-specific. Likely an incorrect or invalid review comment.
```python
def delete_video_chunks(store: VectorStore, video_id: str) -> int:
    """Delete all chunks for a video ID. Returns count deleted."""
    source_file = f"youtube:{video_id}"
    cursor = store.conn.cursor()
    # Count first
    count = list(cursor.execute(
        "SELECT COUNT(*) FROM chunks WHERE source_file = ?", (source_file,)
    ))[0][0]
    if count == 0:
        return 0
    # Delete from vectors first (FK dependency)
    cursor.execute(
        "DELETE FROM chunk_vectors WHERE chunk_id IN "
        "(SELECT id FROM chunks WHERE source_file = ?)",
        (source_file,),
    )
    cursor.execute("DELETE FROM chunks WHERE source_file = ?", (source_file,))
    log.info(f" Deleted {count} existing chunks for {video_id}")
    return count
```
Missing transaction commit after delete operations.
The function executes DELETE statements but doesn't commit the transaction. If store.conn uses the default SQLite isolation (no autocommit), these deletions won't persist unless a subsequent operation commits. This could lead to data inconsistency if the script fails between delete and upsert.
🔧 Proposed fix: commit after deletions

```diff
 cursor.execute("DELETE FROM chunks WHERE source_file = ?", (source_file,))
+store.conn.commit()
 log.info(f" Deleted {count} existing chunks for {video_id}")
 return count
```

🏁 Verification script:

```bash
#!/bin/bash
# Verify VectorStore's connection autocommit behavior and commit patterns
ast-grep --pattern $'class VectorStore {
  $$$
  def __init__($$$) {
    $$$
  }
  $$$
}'
# Check if connection uses autocommit or explicit commits
rg -n "isolation_level|autocommit|\.commit\(\)" src/brainlayer/vector_store.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/index_youtube.py` around lines 639 - 657, The delete_video_chunks
function runs DELETE statements on store.conn via cursor but never commits, so
ensure the deletions are persisted by calling a commit on the same connection
after the DELETEs complete (e.g., call store.conn.commit() right after the
second cursor.execute and before returning); locate the function
delete_video_chunks and add the commit using the existing store.conn (or use a
transaction/context manager if preferred) so the DELETEs to chunk_vectors and
chunks are durable.
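The durability point can be demonstrated in isolation with stdlib sqlite3 (an illustrative stand-in; the project reportedly uses APSW, whose transaction semantics differ). A second connection does not see an uncommitted DELETE:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, source_file TEXT)")
writer.executemany(
    "INSERT INTO chunks (source_file) VALUES (?)",
    [("youtube:abc",), ("youtube:xyz",)],
)
writer.commit()

# DELETE inside an open (uncommitted) transaction
writer.execute("DELETE FROM chunks WHERE source_file = ?", ("youtube:abc",))

# A separate connection still sees the old data
reader = sqlite3.connect(path)
before = reader.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]  # 2
reader.close()

writer.commit()  # the fix: persist the DELETE

reader2 = sqlite3.connect(path)
after = reader2.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]  # 1
reader2.close()
```

If the process crashes between the DELETE and a later implicit commit, the deletion is rolled back, which is exactly the inconsistency the review warns about.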
```python
video_id_match = re.search(r"[?&]v=([a-zA-Z0-9_-]+)", args.url)
vid = video_id_match.group(1) if video_id_match else args.url
```
Video ID extraction doesn't handle youtu.be short URLs.
The regex [?&]v=([a-zA-Z0-9_-]+) handles standard YouTube URLs but not short URLs like https://youtu.be/VIDEO_ID. The fallback uses the raw URL as the video ID, which would produce incorrect source_file values.
🔧 Proposed fix: handle both URL formats
```diff
-video_id_match = re.search(r"[?&]v=([a-zA-Z0-9_-]+)", args.url)
-vid = video_id_match.group(1) if video_id_match else args.url
+# Handle both youtube.com/watch?v=ID and youtu.be/ID formats
+video_id_match = re.search(r"(?:[?&]v=|youtu\.be/)([a-zA-Z0-9_-]+)", args.url)
+vid = video_id_match.group(1) if video_id_match else args.url
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
# Handle both youtube.com/watch?v=ID and youtu.be/ID formats
video_id_match = re.search(r"(?:[?&]v=|youtu\.be/)([a-zA-Z0-9_-]+)", args.url)
vid = video_id_match.group(1) if video_id_match else args.url
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/index_youtube.py` around lines 697 - 698, The video ID extraction
only matches standard query URLs and falls back to the whole URL; update the
logic around video_id_match/vid to also detect youtu.be short links and extract
their path segment as the ID (e.g., check for hostname "youtu.be" or match
r"(?:youtu\.be/)([A-Za-z0-9_-]+)"), or parse args.url with urllib.parse to
handle both formats and set vid to the extracted ID rather than the raw URL
(adjust the variables video_id_match and vid accordingly).
```python
else:
    log.info(f" Embedding {len(chunks)} chunks...")
    embedded = embed_chunks(chunks)
    source_file = f"youtube:{vid}"
    chunk_data = []
    embeddings = []
    for i, ec in enumerate(embedded):
        c = ec.chunk
        meta = dict(c.metadata)
        if upload_date:
            meta["upload_date"] = upload_date
        created_at = None
        if upload_date and len(upload_date) == 8:
            created_at = f"{upload_date[:4]}-{upload_date[4:6]}-{upload_date[6:8]}T00:00:00+00:00"
        chunk_data.append({
            "id": f"{source_file}:{i}",
            "content": c.content,
            "metadata": meta,
            "source_file": source_file,
            "project": args.project,
            "content_type": c.content_type.value,
            "value_type": c.value.value,
            "char_count": c.char_count,
            "source": "youtube",
            "conversation_id": source_file,
            "position": i,
            "created_at": created_at,
        })
        embeddings.append(ec.embedding)
    total_chunks = store.upsert_chunks(chunk_data, embeddings)
```
🧹 Nitpick | 🔵 Trivial
Significant code duplication with index_single_video.
The embedding and upsert logic (lines 724-753) duplicates code from index_single_video (lines 564-603). This increases maintenance burden and risk of divergence.
Consider extracting a helper function like _embed_and_upsert_chunks(chunks, video_id, project, upload_date, store).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/index_youtube.py` around lines 724 - 753, The block in
index_youtube.py duplicates embedding/upsert logic from index_single_video;
extract a helper function (e.g. _embed_and_upsert_chunks(chunks, video_id,
project, upload_date, store)) that encapsulates calling embed_chunks, building
chunk_data entries (use source_file = f"youtube:{video_id}", include metadata
upload_date and created_at formatting), collecting embeddings, and calling
store.upsert_chunks; replace the duplicated loops in both index_single_video and
the current block with calls to this new helper to remove duplication and keep
behavior identical (preserve fields: id, content, metadata, source_file,
project, content_type, value_type, char_count, source, conversation_id,
position, created_at).
- New MCP tool brain_update with 3 actions:
  - update: change content (re-embeds), tags, importance on existing chunks
  - archive: soft-delete (sets value_type=ARCHIVED, removes from vector index)
  - merge: keep one memory, archive duplicates
- VectorStore: add update_chunk(), archive_chunk(), get_chunk() methods
- Update tool count: 6 → 7
- Fix test_tool_count assertion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix cache dir double "data" prefix (data/data/aligned → data/aligned)
- Fix comment/code mismatch (60s → 2 minutes, matching the actual 120s check)
- Use sys.executable instead of hardcoded .venv path for portability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/brainlayer/mcp/__init__.py`:
- Around line 1607-1692: _brain_update calls json.dumps() but json is not
imported, causing a NameError at runtime; add a local import inside the function
(e.g., add "import json" near the start of _brain_update) or import json at
module scope to match the pattern used by _brain_digest/_brain_entity so
json.dumps calls in _brain_update succeed.
- Around line 1690-1691: The module calls logger.error and logger.warning but
never defines logger; add an import and module-level logger initialization by
importing the standard logging module and creating logger =
logging.getLogger(__name__) near the top of the file so calls like
logger.error("brain_update failed: %s", e) and logger.warning(...) work without
raising NameError; place this initialization alongside other top-level
imports/initializations in the src/brainlayer/mcp/__init__.py module.
- Around line 1623-1626: Several handlers (_brain_update, _file_timeline,
_context, _sessions and others) have inconsistent return types—some return
list[TextContent] on success but CallToolResult on error; choose one consistent
contract and implement it: either (A) change all handler annotations to "->
CallToolResult" and wrap successful list[TextContent] payloads into a
CallToolResult success variant, or (B) change annotations to "->
list[TextContent] | CallToolResult" to reflect reality; locate functions
_brain_update, _file_timeline, _context, _sessions, _brain_digest, _brain_entity
and the helper _error_result, then update their type hints and return statements
accordingly and update any callers to handle the unified return type.
In `@src/brainlayer/vector_store.py`:
- Around line 603-620: get_chunk currently opens a write cursor via
self.conn.cursor() even though it's a read-only operation; change it to use the
read-only helper by acquiring the cursor from self._read_cursor() (i.e., replace
the cursor = self.conn.cursor() call with cursor = self._read_cursor()) so the
method get_chunk uses the same non-blocking read connection pattern as other
read methods and avoids blocking concurrent writes.
- Around line 551-601: The update_chunk and archive_chunk methods lack the
explicit BUSY-retry loop used by update_enrichment/update_sentiment; wrap their
write operations (the SELECT check and subsequent UPDATE/DELETE/INSERT calls
inside update_chunk, and the SELECT plus UPDATE/DELETE in archive_chunk) in the
same retry-on-BUSY pattern (catch sqlite3.OperationalError/BusyError or check
error message for "database is locked"), retry a few times with short
sleeps/backoff before failing, and ensure the cursor/connection handling and
final return behavior remain identical to the existing methods so behavior and
commits remain consistent with the rest of the file.
- Line 566: Remove the unused local variable now in
src/brainlayer/vector_store.py: delete the line that assigns now =
datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ") (or replace it with
the intended timestamp usage if you actually meant to set/update a timestamp
variable), so no unused variable remains in the scope where it's defined.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- src/brainlayer/mcp/__init__.py
- src/brainlayer/vector_store.py
- tests/test_think_recall_integration.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
- GitHub Check: test (3.11)
- GitHub Check: Cursor Bugbot
🧰 Additional context used
📓 Path-based instructions (4)
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use `ruff check src/` for linting and `ruff format src/` for code formatting
Files:
src/brainlayer/mcp/__init__.py, src/brainlayer/vector_store.py
src/brainlayer/mcp/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names
Files:
src/brainlayer/mcp/__init__.py
src/brainlayer/vector_store.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use sqlite-vec with APSW for vector storage, WAL mode, and `PRAGMA busy_timeout = 5000` for concurrent multi-process safety
Files:
src/brainlayer/vector_store.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Run tests using `pytest` from the project root
Files:
tests/test_think_recall_integration.py
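The WAL + busy_timeout guideline above amounts to roughly the following connection setup (stdlib sqlite3 shown for illustration; the guideline itself names APSW, whose API differs):

```python
import sqlite3

def open_store(path: str) -> sqlite3.Connection:
    """Open a SQLite connection configured for multi-process access."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer appends to the log
    conn.execute("PRAGMA journal_mode = WAL")
    # Wait up to 5s for a lock instead of failing immediately with BUSY
    conn.execute("PRAGMA busy_timeout = 5000")
    return conn
```

WAL mode persists in the database file, but `busy_timeout` is per-connection and must be set every time a connection is opened.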
🧠 Learnings (3)
📚 Learning: 2026-02-23T16:51:38.317Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/mcp/**/*.py : Implement MCP server with brain_search, brain_store, and brain_recall tools, maintaining backward compatibility with old brainlayer_* tool names
Applied to files:
src/brainlayer/mcp/__init__.py, tests/test_think_recall_integration.py
📚 Learning: 2026-02-23T16:51:38.317Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/cli/**/*.py : Enable project-specific indexing with `brainlayer index --project <project_name>` and incremental indexing with `brainlayer index-fast`
Applied to files:
src/brainlayer/mcp/__init__.py
📚 Learning: 2026-02-23T16:51:38.317Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T16:51:38.317Z
Learning: Applies to src/brainlayer/pipeline/enrichment.py : Enrich chunks with 10 metadata fields: summary, tags, importance (1-10), intent, primary_symbols, resolved_query, epistemic_level, version_scope, debt_impact, and external_deps
Applied to files:
src/brainlayer/vector_store.py
🧬 Code graph analysis (3)
src/brainlayer/mcp/__init__.py (1)
- src/brainlayer/vector_store.py (3): archive_chunk (590-601), get_chunk (603-620), update_chunk (551-588)

src/brainlayer/vector_store.py (2)
- src/brainlayer/clustering.py (1): serialize_f32 (53-60)
- scripts/reembed_bge_m3.py (1): serialize_f32 (39-41)

tests/test_think_recall_integration.py (1)
- src/brainlayer/mcp/__init__.py (1): list_tools (529-922)
🔇 Additional comments (4)
tests/test_think_recall_integration.py (1)
246-255: LGTM! The test correctly updates the expected tool count from 6 to 7, and the docstring accurately lists all tools including the new `brain_update` tool. This aligns with the `list_tools()` implementation in `src/brainlayer/mcp/__init__.py`.

src/brainlayer/mcp/__init__.py (3)
50-68: LGTM! Server instructions correctly updated to reflect 7 tools and include the new `brain_update` tool description. The description accurately covers all three actions (update, archive, merge).
872-921: LGTM! The `brain_update` tool definition is well-structured:
- Correctly uses `_WRITE` annotations for a destructive operation
- Clear enum for actions with appropriate descriptions
- Input schema properly documents which fields are used for which actions
1067-1075: LGTM! The routing correctly passes all parameters to `_brain_update`. It appropriately omits the timeout wrapper, consistent with other write operations (see AIDEV-NOTE at line 1013).
```python
async def _brain_update(
    action: str,
    chunk_id: str,
    content: str | None = None,
    tags: list[str] | None = None,
    importance: int | None = None,
    merge_chunk_ids: list[str] | None = None,
):
    """Update, archive, or merge memories."""
    try:
        store = _get_vector_store()

        if action == "archive":
            ok = store.archive_chunk(chunk_id)
            if not ok:
                return _error_result(f"Chunk not found: {chunk_id}")
            return [TextContent(
                type="text",
                text=json.dumps({"action": "archived", "chunk_id": chunk_id}),
            )]

        elif action == "update":
            # Verify chunk exists
            existing = store.get_chunk(chunk_id)
            if not existing:
                return _error_result(f"Chunk not found: {chunk_id}")

            # Re-embed if content changed
            embedding = None
            if content is not None:
                loop = asyncio.get_running_loop()
                model = _get_embedding_model()
                embedding = await loop.run_in_executor(None, model.embed_query, content)

            ok = store.update_chunk(
                chunk_id=chunk_id,
                content=content,
                tags=tags,
                importance=float(importance) if importance is not None else None,
                embedding=embedding,
            )
            if not ok:
                return _error_result(f"Update failed for: {chunk_id}")

            result = {"action": "updated", "chunk_id": chunk_id, "fields": []}
            if content is not None:
                result["fields"].append("content")
            if tags is not None:
                result["fields"].append("tags")
            if importance is not None:
                result["fields"].append("importance")
            return [TextContent(type="text", text=json.dumps(result))]

        elif action == "merge":
            if not merge_chunk_ids:
                return _error_result("merge requires merge_chunk_ids (the duplicates to archive)")

            # Verify the keeper exists
            keeper = store.get_chunk(chunk_id)
            if not keeper:
                return _error_result(f"Keeper chunk not found: {chunk_id}")

            archived = []
            failed = []
            for dup_id in merge_chunk_ids:
                ok = store.archive_chunk(dup_id)
                if ok:
                    archived.append(dup_id)
                else:
                    failed.append(dup_id)

            result = {
                "action": "merged",
                "kept": chunk_id,
                "archived": archived,
                "failed": failed,
            }
            return [TextContent(type="text", text=json.dumps(result))]

        else:
            return _error_result(f"Unknown action: {action}. Use update, archive, or merge.")

    except Exception as e:
        logger.error("brain_update failed: %s", e)
        return _error_result(f"brain_update error: {e}")
```
Missing json import will cause NameError at runtime.
The _brain_update function uses json.dumps() at lines 1625, 1658, and 1684, but doesn't import the json module. Other handlers in this file (e.g., _brain_digest at line 1201, _brain_entity at line 1234) import json locally within the function.
🐛 Proposed fix

```diff
 async def _brain_update(
     action: str,
     chunk_id: str,
     content: str | None = None,
     tags: list[str] | None = None,
     importance: int | None = None,
     merge_chunk_ids: list[str] | None = None,
 ):
     """Update, archive, or merge memories."""
+    import json
+
     try:
         store = _get_vector_store()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/mcp/__init__.py` around lines 1607 - 1692, _brain_update calls
json.dumps() but json is not imported, causing a NameError at runtime; add a
local import inside the function (e.g., add "import json" near the start of
_brain_update) or import json at module scope to match the pattern used by
_brain_digest/_brain_entity so json.dumps calls in _brain_update succeed.
```python
return [TextContent(
    type="text",
    text=json.dumps({"action": "archived", "chunk_id": chunk_id}),
)]
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check return patterns in the MCP file
echo "=== Searching for return [TextContent patterns ==="
rg -n "return \[TextContent" src/brainlayer/mcp/__init__.py
echo -e "\n=== Searching for return CallToolResult patterns ==="
rg -n "return CallToolResult" src/brainlayer/mcp/__init__.py
echo -e "\n=== Checking _error_result function ==="
rg -n "_error_result" src/brainlayer/mcp/__init__.py -A 3 | head -30
Repository: EtanHey/brainlayer
Length of output: 2888
🏁 Script executed:
#!/bin/bash
# Check the _brain_update function and its decorator
echo "=== Checking _brain_update function signature and decorator ==="
rg -n "async def _brain_update" src/brainlayer/mcp/__init__.py -B 2 -A 20 | head -50
echo -e "\n=== Checking _brain_digest function signature ==="
rg -n "async def _brain_digest" src/brainlayer/mcp/__init__.py -B 2 -A 5
echo -e "\n=== Checking _search function signature ==="
rg -n "async def _search" src/brainlayer/mcp/__init__.py -B 2 -A 5
echo -e "\n=== Checking call_tool implementation ==="
rg -n "async def call_tool" src/brainlayer/mcp/__init__.py -A 30 | head -60
Repository: EtanHey/brainlayer
Length of output: 2993
🏁 Script executed:
#!/bin/bash
# Check _with_timeout implementation
echo "=== Checking _with_timeout implementation ==="
rg -n "def _with_timeout" src/brainlayer/mcp/__init__.py -A 15
echo -e "\n=== Checking _brain_search return type ==="
rg -n "async def _brain_search" src/brainlayer/mcp/__init__.py -B 2 -A 2
echo -e "\n=== Checking _brain_store return type ==="
rg -n "async def _brain_store" src/brainlayer/mcp/__init__.py -B 2 -A 2
echo -e "\n=== Checking imports for MCP types ==="
head -50 src/brainlayer/mcp/__init__.py | grep -E "from|import"
Repository: EtanHey/brainlayer
Length of output: 1332
🏁 Script executed:
#!/bin/bash
# Get full MCP type imports
echo "=== Full MCP imports ==="
head -100 src/brainlayer/mcp/__init__.py | grep -A 10 "from mcp.types import"
echo -e "\n=== Return type annotations for handlers ==="
rg -n "async def .*->.*:" src/brainlayer/mcp/__init__.py | head -20
echo -e "\n=== Check what _brain_search returns ==="
rg -n "return.*TextContent\|return.*CallToolResult" src/brainlayer/mcp/__init__.py | grep -A 2 -B 2 "1360:"
sed -n '1360,1950p' src/brainlayer/mcp/__init__.py | rg -n "return " | head -20
echo -e "\n=== Check actual return statements in _brain_search ==="
sed -n '1362,1900p' src/brainlayer/mcp/__init__.py | tail -100 | head -50
Repository: EtanHey/brainlayer
Length of output: 4228
🏁 Script executed:
#!/bin/bash
# Check the complete _brain_search function to see what it returns
echo "=== Full _brain_search implementation (lines 1362-1450) ==="
sed -n '1362,1450p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check line endings of _brain_search ==="
sed -n '1850,1900p' src/brainlayer/mcp/__init__.py
Repository: EtanHey/brainlayer
Length of output: 5388
🏁 Script executed:
#!/bin/bash
# Let's verify what types are actually returned by examining the complete pattern
echo "=== Check what _brain_digest actually returns ==="
sed -n '1194,1228p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check what _search returns (full) ==="
sed -n '1730,1760p' src/brainlayer/mcp/__init__.py | tail -20
echo -e "\n=== Check MCP framework usage and tool registration ==="
rg -n "register_tool\|add_tool" src/brainlayer/mcp/__init__.py | head -10
Repository: EtanHey/brainlayer
Length of output: 2035
🏁 Script executed:
#!/bin/bash
# Check how call_tool is used and what the server expects
echo "=== Check MCP server setup and tool_handler registration ==="
rg -n "on_call_tool\|call_tool_handler\|Tool\(" src/brainlayer/mcp/__init__.py | head -20
echo -e "\n=== Check if there's a wrapper around call_tool ==="
rg -n "@server\|@app" src/brainlayer/mcp/__init__.py | head -10
echo -e "\n=== Full call_tool wrapper check ==="
sed -n '970,1010p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check server initialization ==="
rg -n "server = Server\|Server(" src/brainlayer/mcp/__init__.py -A 3
Repository: EtanHey/brainlayer
Length of output: 2105
🏁 Script executed:
#!/bin/bash
# Check server initialization with proper regex
echo "=== Check server initialization ==="
rg -n "server = Server" src/brainlayer/mcp/__init__.py -A 3
echo -e "\n=== Check if call_tool decorator returns are handled ==="
sed -n '980,1020p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Verify return type of _search function ==="
sed -n '1697,1800p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return " | head -20
echo -e "\n=== Check what _search actually returns at end ==="
sed -n '1800,1850p' src/brainlayer/mcp/__init__.py
Repository: EtanHey/brainlayer
Length of output: 4794
🏁 Script executed:
#!/bin/bash
# Check what the actual MCP spec expects
echo "=== Check MCP library version and CallToolResult usage ==="
rg -n "CallToolResult" src/brainlayer/mcp/__init__.py | wc -l
echo -e "\n=== All CallToolResult usages ==="
rg -n "CallToolResult" src/brainlayer/mcp/__init__.py
echo -e "\n=== Check _store_new return type ==="
rg -n "async def _store_new" src/brainlayer/mcp/__init__.py -A 2
echo -e "\n=== Check what _store_new returns ==="
sed -n '2400,2500p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return " | head -20
echo -e "\n=== Check the actual return from _brain_update to be sure ==="
sed -n '1605,1700p' src/brainlayer/mcp/__init__.py | grep -n "return"
🏁 Script executed:
#!/bin/bash
# Check the full _brain_update function to see all returns
echo "=== Full _brain_update function ==="
sed -n '1607,1692p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check how _brain_search handles different return types from nested calls ==="
sed -n '1410,1450p' src/brainlayer/mcp/__init__.py
🏁 Script executed:
#!/bin/bash
# Check MCP framework expectations and how it handles returns
echo "=== Check imports and MCP version ==="
grep -E "mcp|MCP" src/brainlayer/mcp/__init__.py | head -20
echo -e "\n=== Check if there's any wrapping logic in the decorator or elsewhere ==="
rg -n "class.*Server|def.*call_tool" src/brainlayer/mcp/__init__.py | head -10
echo -e "\n=== Check if there's a README or docs about MCP setup ==="
find . -name "README*" -o -name "*.md" | head -10 | xargs grep -l "MCP\|CallToolResult" 2>/dev/null
echo -e "\n=== Check requirements or pyproject for MCP version ==="
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | xargs grep -i mcp 2>/dev/null
🏁 Script executed:
#!/bin/bash
# Check if there's any test or documentation about return types
echo "=== Check test files for MCP tool handlers ==="
find . -name "*test*" -type f | head -10
echo -e "\n=== Check how _brain_search is being called from call_tool ==="
sed -n '989,1011p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check if _with_timeout does any transformations ==="
sed -n '26,45p' src/brainlayer/mcp/__init__.py
echo -e "\n=== Check what happens after call_tool returns ==="
sed -n '1100,1140p' src/brainlayer/mcp/__init__.py | head -40
🏁 Script executed:
#!/bin/bash
# Check if there are any runtime errors or issues logged related to return types
echo "=== Check if tests exist for MCP return types ==="
find . -name "*test*mcp*" -o -name "*mcp*test*" 2>/dev/null | head -10
echo -e "\n=== Check CLAUDE.md or docs for MCP return type guidance ==="
grep -A 5 -B 5 "CallToolResult\|return type" ./CLAUDE.md 2>/dev/null | head -30
echo -e "\n=== Check what _brain_search nested calls actually return ==="
rg -n "async def _regression|async def _recall|async def _file_timeline" src/brainlayer/mcp/__init__.py -A 1
🏁 Script executed:
#!/bin/bash
# Check what these nested functions return
echo "=== Check _file_timeline return type ==="
sed -n '1922,1960p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return \[|return \(|-> "
echo -e "\n=== Check _regression return type ==="
sed -n '1991,2030p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return \[|return \(|-> "
echo -e "\n=== Check _recall return type ==="
sed -n '2153,2200p' src/brainlayer/mcp/__init__.py | grep -E "^async def|return \[|return \(|-> "
echo -e "\n=== Look at actual return statements in _file_timeline ==="
sed -n '1922,1970p' src/brainlayer/mcp/__init__.py | tail -30
echo -e "\n=== Look at actual return statements in _recall ==="
sed -n '2153,2250p' src/brainlayer/mcp/__init__.py | tail -50
Return types are inconsistent across multiple handlers, not just _brain_update.
While the specific concern about _brain_update (lines 1623–1626, 1658, 1684) returning raw [TextContent] on success but CallToolResult on error is valid, this is part of a broader type annotation mismatch. Many handlers like _file_timeline, _context, and _sessions are annotated -> list[TextContent] but can actually return CallToolResult via _error_result() on error. Only _brain_digest and _brain_entity consistently follow the -> CallToolResult pattern.
To address this properly: either (1) unify all handlers to return CallToolResult consistently and update type annotations, or (2) fix all type annotations to reflect actual return types (list[TextContent] | CallToolResult). The current mixed pattern creates implicit contracts that don't match actual behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/mcp/__init__.py` around lines 1623 - 1626, Several handlers
(_brain_update, _file_timeline, _context, _sessions and others) have
inconsistent return types—some return list[TextContent] on success but
CallToolResult on error; choose one consistent contract and implement it: either
(A) change all handler annotations to "-> CallToolResult" and wrap successful
list[TextContent] payloads into a CallToolResult success variant, or (B) change
annotations to "-> list[TextContent] | CallToolResult" to reflect reality;
locate functions _brain_update, _file_timeline, _context, _sessions,
_brain_digest, _brain_entity and the helper _error_result, then update their
type hints and return statements accordingly and update any callers to handle
the unified return type.
logger.error("brain_update failed: %s", e)
return _error_result(f"brain_update error: {e}")
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if logger is imported/defined in the MCP module
rg -n "^logger|^import.*logging|from.*logging" src/brainlayer/mcp/__init__.py | head -20
🏁 Script executed:
#!/bin/bash
# Search for logger definition/import anywhere in the file
echo "=== Searching for 'logger' in file ==="
rg -n "logger" src/brainlayer/mcp/__init__.py | head -10
echo "=== Searching for 'logging' imports ==="
rg -n "logging" src/brainlayer/mcp/__init__.py | head -10
echo "=== Check first 50 lines for imports ==="
head -50 src/brainlayer/mcp/__init__.py | cat -n
🏁 Script executed:
#!/bin/bash
# Check file size and look for logger initialization anywhere
echo "=== Total lines and check middle sections ==="
wc -l src/brainlayer/mcp/__init__.py
echo "=== Search all occurrences of 'logger' ==="
rg -n "logger" src/brainlayer/mcp/__init__.py
echo "=== Search for any logging-related assignments ==="
rg -n "logging\|logger\s*=" src/brainlayer/mcp/__init__.py
echo "=== Check if logger is a parameter or global ==="
rg -n "def.*logger|logger\s*:" src/brainlayer/mcp/__init__.py
Add missing logger import and initialization.
logger.error() is called at line 1690 (and logger.warning() at line 1340), but logger is not imported or defined anywhere in the module. This will raise a NameError at runtime when these code paths execute. Add import logging and initialize logger = logging.getLogger(__name__) at the module level.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/mcp/__init__.py` around lines 1690 - 1691, The module calls
logger.error and logger.warning but never defines logger; add an import and
module-level logger initialization by importing the standard logging module and
creating logger = logging.getLogger(__name__) near the top of the file so calls
like logger.error("brain_update failed: %s", e) and logger.warning(...) work
without raising NameError; place this initialization alongside other top-level
imports/initializations in the src/brainlayer/mcp/__init__.py module.
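The fix itself is a two-line addition at the top of the module:

```python
import logging

# Module-level logger; getLogger(__name__) returns one logger per module
# and inherits handlers/levels from whatever logging config the host sets up.
logger = logging.getLogger(__name__)

# Calls like these no longer raise NameError:
logger.warning("example warning: %s", "fallback path taken")
```

`logging.getLogger(__name__)` is idempotent, so repeated imports of the module reuse the same logger instance.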
def update_chunk(
    self,
    chunk_id: str,
    content: Optional[str] = None,
    tags: Optional[List[str]] = None,
    importance: Optional[float] = None,
    embedding: Optional[List[float]] = None,
) -> bool:
    """Update fields on an existing chunk. Returns True if chunk was found."""
    cursor = self.conn.cursor()
    # Check chunk exists
    rows = list(cursor.execute("SELECT id FROM chunks WHERE id = ?", (chunk_id,)))
    if not rows:
        return False

    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    if content is not None:
        cursor.execute(
            "UPDATE chunks SET content = ?, char_count = ?, summary = ? WHERE id = ?",
            (content, len(content), content[:200], chunk_id),
        )
    if tags is not None:
        cursor.execute(
            "UPDATE chunks SET tags = ? WHERE id = ?",
            (json.dumps(tags), chunk_id),
        )
    if importance is not None:
        cursor.execute(
            "UPDATE chunks SET importance = ? WHERE id = ?",
            (float(max(1, min(10, importance))), chunk_id),
        )
    if embedding is not None:
        cursor.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
        cursor.execute(
            "INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)",
            (chunk_id, serialize_f32(embedding)),
        )
    return True

def archive_chunk(self, chunk_id: str) -> bool:
    """Soft-delete a chunk by setting value_type to ARCHIVED."""
    cursor = self.conn.cursor()
    rows = list(cursor.execute("SELECT id FROM chunks WHERE id = ?", (chunk_id,)))
    if not rows:
        return False
    cursor.execute(
        "UPDATE chunks SET value_type = 'ARCHIVED' WHERE id = ?", (chunk_id,)
    )
    # Remove from vector index so it doesn't appear in searches
    cursor.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
    return True
🧹 Nitpick | 🔵 Trivial
Consider adding BUSY retry for consistency with other write methods.
update_chunk and archive_chunk lack the BUSY retry logic that update_enrichment and update_sentiment implement (Lines 1336-1344, 1365-1376). While PRAGMA busy_timeout = 5000 provides base protection, other write methods in this file use explicit retry to handle edge cases under heavy contention.
For consistency, consider adding the same retry pattern, or document why these methods rely solely on the busy_timeout pragma.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/vector_store.py` around lines 551 - 601, The update_chunk and
archive_chunk methods lack the explicit BUSY-retry loop used by
update_enrichment/update_sentiment; wrap their write operations (the SELECT
check and subsequent UPDATE/DELETE/INSERT calls inside update_chunk, and the
SELECT plus UPDATE/DELETE in archive_chunk) in the same retry-on-BUSY pattern
(catch sqlite3.OperationalError/BusyError or check error message for "database
is locked"), retry a few times with short sleeps/backoff before failing, and
ensure the cursor/connection handling and final return behavior remain identical
to the existing methods so behavior and commits remain consistent with the rest
of the file.
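The shared retry pattern could live in one small helper (a sketch; the helper name and backoff parameters are hypothetical, and it assumes the "database is locked" message that SQLite raises via sqlite3.OperationalError):

```python
import sqlite3
import time

def execute_with_busy_retry(cursor, sql, params=(), retries=3, base_delay=0.05):
    """Run a statement, retrying on 'database is locked' with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return cursor.execute(sql, params)
        except sqlite3.OperationalError as e:
            if "locked" not in str(e) or attempt == retries:
                raise  # non-BUSY error, or retries exhausted
            time.sleep(base_delay * (2 ** attempt))

# Usage against an in-memory database:
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY, value_type TEXT)")
execute_with_busy_retry(cur, "INSERT INTO chunks VALUES (?, ?)", ("c1", "GOLD"))
```

Routing the writes in `update_chunk` and `archive_chunk` through such a helper would make them behave like `update_enrichment`/`update_sentiment` under contention.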
if not rows:
    return False

now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
Remove unused variable now.
The variable now is assigned but never used. This appears to be leftover from a planned timestamp update that wasn't implemented.
🧹 Proposed fix
- now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/vector_store.py` at line 566, Remove the unused local variable
now in src/brainlayer/vector_store.py: delete the line that assigns now =
datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ") (or replace it with
the intended timestamp usage if you actually meant to set/update a timestamp
variable), so no unused variable remains in the scope where it's defined.
def get_chunk(self, chunk_id: str) -> Optional[Dict[str, Any]]:
    """Get a single chunk by ID."""
    cursor = self.conn.cursor()
    rows = list(cursor.execute(
        """SELECT id, content, metadata, source_file, project, content_type,
                  value_type, tags, importance, created_at, summary
           FROM chunks WHERE id = ?""",
        (chunk_id,),
    ))
    if not rows:
        return None
    r = rows[0]
    return {
        "id": r[0], "content": r[1], "metadata": r[2],
        "source_file": r[3], "project": r[4], "content_type": r[5],
        "value_type": r[6], "tags": r[7], "importance": r[8],
        "created_at": r[9], "summary": r[10],
    }
Use _read_cursor() for read-only operation.
get_chunk is a read operation but uses the write connection (self.conn.cursor()). For consistency with other read methods in this file and to avoid blocking concurrent writes, use _read_cursor().
♻️ Proposed fix
def get_chunk(self, chunk_id: str) -> Optional[Dict[str, Any]]:
"""Get a single chunk by ID."""
- cursor = self.conn.cursor()
+ cursor = self._read_cursor()
  rows = list(cursor.execute(

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/vector_store.py` around lines 603 - 620, get_chunk currently
opens a write cursor via self.conn.cursor() even though it's a read-only
operation; change it to use the read-only helper by acquiring the cursor from
self._read_cursor() (i.e., replace the cursor = self.conn.cursor() call with
cursor = self._read_cursor()) so the method get_chunk uses the same non-blocking
read connection pattern as other read methods and avoids blocking concurrent
writes.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
if not rows:
    return False

now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
Unused now variable in update_chunk method
Low Severity
The now variable is computed via datetime.now(timezone.utc).strftime(...) but never referenced in any of the subsequent SQL UPDATE statements. This looks like it was intended to set a modification timestamp (e.g., enriched_at or an updated_at field) when a chunk is updated, but the actual update was never wired in. Updates silently lose their modification timestamps.
vid = video_id_match.group(1) if video_id_match else args.url

if args.replace:
    delete_video_chunks(store, vid)
Regex video ID extraction mismatches index_single_video extraction
Medium Severity
The vid variable is extracted via a regex matching [?&]v= in the URL. For non-standard YouTube URL formats (e.g., youtu.be/ID or /embed/ID), the regex won't match and vid falls back to the entire URL string. When --replace is used without --diarized-transcript, delete_video_chunks uses this possibly-wrong vid, while index_single_video correctly extracts the real video ID via yt-dlp. This mismatch causes old chunks to not be deleted while new ones are created — resulting in duplicate data.
Additional Locations (1)
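A sketch of a shared extractor covering the common URL shapes before falling back to yt-dlp (the function name and pattern list are hypothetical; the reviewed code only matches `[?&]v=`):

```python
import re

# YouTube IDs are 11 chars of [A-Za-z0-9_-]; cover watch?v=, youtu.be/,
# and /embed/ forms. Returning None signals "fall back to yt-dlp metadata".
_YT_ID_PATTERNS = [
    r"[?&]v=([A-Za-z0-9_-]{11})",
    r"youtu\.be/([A-Za-z0-9_-]{11})",
    r"/embed/([A-Za-z0-9_-]{11})",
]

def extract_video_id(url):
    for pattern in _YT_ID_PATTERNS:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    return None
```

Using the same extractor in both the `--replace` path and `index_single_video` keeps the deleted and re-indexed chunk IDs in agreement.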
    })
    embeddings.append(ec.embedding)
total_chunks = store.upsert_chunks(chunk_data, embeddings)
log.info(f" Indexed {total_chunks} chunks for '{title}'")
Duplicated embed-and-upsert logic from index_single_video
Low Severity
The embed → build chunk_data → upsert_chunks block in the diarized-transcript path (inside main()) is nearly a line-for-line copy of the same logic in index_single_video. Any future bug fix or schema change to the chunk-building loop needs to be applied in both places, increasing risk of divergence.
Additional Locations (1)
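One way to de-duplicate is a single helper shared by both call sites; everything below (the helper signature, the stub embedder/store, field names) is illustrative, not the project's actual API:

```python
# Hypothetical shared helper: both index_single_video and the
# diarized-transcript path in main() would call this instead of
# each carrying its own copy of the embed/upsert loop.
def embed_and_upsert(store, embedder, chunks, base_meta):
    chunk_data, embeddings = [], []
    for i, text in enumerate(chunks):
        ec = embedder.embed(text)
        chunk_data.append({**base_meta, "chunk_index": i, "content": text})
        embeddings.append(ec.embedding)
    return store.upsert_chunks(chunk_data, embeddings)

# Minimal stand-ins to show the call shape:
class _Emb:
    def __init__(self, v): self.embedding = v

class StubEmbedder:
    def embed(self, text): return _Emb([float(len(text))])

class StubStore:
    def upsert_chunks(self, chunk_data, embeddings): return len(chunk_data)

n = embed_and_upsert(StubStore(), StubEmbedder(),
                     ["HUBERMAN: welcome", "GALPIN: thanks"],
                     {"source": "youtube:example"})
```

Any future change to the chunk schema then happens in one place.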
cursor.execute("UPDATE chunks SET value_type = 'ARCHIVED' WHERE id = ?", (chunk_id,))
# Remove from vector index so it doesn't appear in searches
cursor.execute("DELETE FROM chunk_vectors WHERE chunk_id = ?", (chunk_id,))
return True
Archived chunks still appear in FTS search results
Medium Severity
archive_chunk removes the chunk from chunk_vectors (preventing vector search hits) but does not remove it from chunks_fts. The FTS delete trigger only fires on DELETE FROM chunks, not on UPDATE. Since archive_chunk only updates value_type, the FTS entry persists. The hybrid_search FTS path has no value_type != 'ARCHIVED' filter, so archived/merged memories can still surface in keyword-based search results — defeating the purpose of archiving.


Summary

- diarize_episodes.py — batch WhisperX diarization pipeline for 8 Huberman guest episodes (download → transcribe → align → diarize → map speakers → re-index)
- index_youtube.py — added --diarized-transcript and --replace flags for re-indexing with speaker-labeled transcripts
- Added data/ to .gitignore (audio WAVs, diarized JSONs)

Test plan

- HUBERMAN: prefix on intro segments, GALPIN: on guest responses
- load_diarized_transcript() correctly converts WhisperX JSON to indexing format
- delete_video_chunks() properly removes existing chunks before re-index

🤖 Generated with Claude Code
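The conversion the test plan exercises can be sketched as follows; field names ("segments", "start", "end", "speaker", "text") follow WhisperX's diarized JSON output, while the function body and speaker_map argument are illustrative, not the script's actual implementation (which presumably also reads the JSON file from disk):

```python
# Illustrative core of load_diarized_transcript(): WhisperX diarized
# segments -> speaker-prefixed text segments ready for indexing.
def diarized_segments(data, speaker_map):
    segments = []
    for seg in data.get("segments", []):
        name = speaker_map.get(seg.get("speaker"), "UNKNOWN")
        segments.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": f"{name}: {seg['text'].strip()}",
        })
    return segments

demo = diarized_segments(
    {"segments": [
        {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00",
         "text": " Welcome, I'm Andrew Huberman. "},
        {"start": 2.5, "end": 4.0, "speaker": "SPEAKER_01",
         "text": " Thanks for having me. "},
    ]},
    {"SPEAKER_00": "HUBERMAN", "SPEAKER_01": "GALPIN"},
)
```

The speaker_map itself would come from the content-based detection step (matching intro phrases like "I'm Andrew Huberman") described in the PR.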
Note
Medium Risk
Medium risk because it adds new write/delete paths against the vector store (chunk replacement, update/archive/merge) which can alter or remove indexed data if misused, though changes are localized and opt-in via flags/tool calls.
Overview
Adds scripts/diarize_episodes.py, a batch pipeline that downloads episode audio, runs WhisperX transcription/alignment/diarization, maps speaker IDs to named speakers (intro phrase detection with speaking-time fallback), saves diarized JSON, and can trigger re-indexing.

Extends scripts/index_youtube.py to accept --diarized-transcript (load WhisperX diarized JSON and prefix speaker labels into segment text) and --replace (delete existing youtube:{video_id} chunks/vectors before indexing).

Introduces a new MCP write tool brain_update to update content/tags/importance (with re-embedding), archive chunks (soft-delete + remove vectors), or merge duplicates; adds supporting VectorStore methods and updates the MCP tool-count test. Also ignores data/ outputs in .gitignore.

Written by Cursor Bugbot for commit 85336e0.
Summary by CodeRabbit
New Features
Chores
Tests