feat(youtube): transcript-api v1.2.4 + Brave cookies#2

Merged
EtanHey merged 1 commit into main from fix/youtube-indexer-v124
Feb 19, 2026

Conversation

EtanHey (Owner) commented Feb 19, 2026

Summary

  • Use youtube-transcript-api v1.2.4 (instance-based) as primary transcript source — different YouTube endpoint, avoids yt-dlp 429 rate limits
  • Add Brave browser cookies to all yt-dlp calls
  • Fix DB path to use centralized paths.py
  • Add created_at from upload_date to chunk data

Test plan

  • python3 scripts/index_youtube.py --dry-run https://youtube.com/watch?v=<id> works
  • Transcript extraction falls back gracefully if API unavailable

🤖 Generated with Claude Code


Note

Medium Risk
Moderate risk because it changes transcript-fetch order/behavior and chunk fields written to the DB (created_at), which can affect indexing outcomes and downstream filtering.

Overview
Updates the YouTube indexing script to prefer youtube-transcript-api transcripts (manual, then generated), with a fallback chain to yt-dlp subtitle download and then info-dict extraction, reducing failures from yt-dlp 429 rate limits.

All yt-dlp calls now load Brave browser cookies (and extract_info uses process=False), the default --db path is switched to brainlayer.paths.DEFAULT_DB_PATH, and indexed YouTube chunks now include a created_at timestamp derived from upload_date.

Adds a new optional dependency group youtube in pyproject.toml for yt-dlp, youtube-transcript-api, and certifi.
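
The three-tier fallback described above can be sketched as follows. The helper names match those named in the review (get_transcript_via_api, get_transcript_via_download, get_transcript_from_info), but the bodies here are stand-in stubs, not the script's real implementations:

```python
# Stubs standing in for the real helpers in scripts/index_youtube.py.
def get_transcript_via_api(video_id):
    return None  # pretend youtube-transcript-api found no transcript

def get_transcript_via_download(video_id):
    return None  # pretend the yt-dlp subtitle download hit a 429

def get_transcript_from_info(info):
    # pretend raw extractor captions were present in the info dict
    return [{"text": "hello world", "start": 0.0}]

def fetch_transcript(video_id, info):
    """Try each transcript source in order; first non-empty result wins."""
    for source in (
        lambda: get_transcript_via_api(video_id),       # 1. API (different endpoint, avoids 429s)
        lambda: get_transcript_via_download(video_id),  # 2. yt-dlp subtitle download
        lambda: get_transcript_from_info(info),         # 3. raw info-dict captions
    ):
        segments = source()
        if segments:
            return segments
    return None

print(fetch_transcript("abc123", {}))  # [{'text': 'hello world', 'start': 0.0}]
```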

Written by Cursor Bugbot for commit ce008ee. This comment will update automatically on new commits.

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced YouTube indexing with improved transcript extraction leveraging multiple sources for better reliability and broader content coverage
    • Added creation date metadata to YouTube video chunks for improved organization and temporal tracking of indexed content
  • Chores

    • Added optional dependencies for YouTube transcript extraction and content processing capabilities

EtanHey force-pushed the fix/youtube-indexer-v124 branch from 07eeaa2 to e48f198 on February 19, 2026 at 18:30
feat(youtube): transcript-api v1.2.4 + Brave cookies

- Add youtube-transcript-api (instance-based v1.2.4) as primary transcript
  source — uses different YouTube endpoint, avoids yt-dlp 429 rate limits
- Add Brave browser cookies to all yt-dlp calls for bypass
- Add process=False to extract_video_info (avoids format selection errors)
- Fix DB path to use centralized paths.py
- Add created_at from upload_date to chunk data
- Add youtube optional dependency group in pyproject.toml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EtanHey force-pushed the fix/youtube-indexer-v124 branch from e48f198 to ce008ee on February 19, 2026 at 18:36
coderabbitai (bot) commented Feb 19, 2026

📝 Walkthrough


Added YouTube as an optional dependency group with three packages (yt-dlp, youtube-transcript-api, certifi). Enhanced the YouTube indexing script with API-based transcript fetching, fallback mechanisms to download/info extraction, Brave browser cookies for IP ban bypass, and ISO-formatted creation timestamps in chunk metadata.

Changes

Dependency Management (pyproject.toml):
Added new youtube optional dependency group containing yt-dlp (≥2024.1.0), youtube-transcript-api (≥1.2.4), and certifi packages.

YouTube Indexing Enhancement (scripts/index_youtube.py):
Introduced get_transcript_via_api() function for API-based transcript fetching; implemented three-tier transcript sourcing (API → download → info-based extraction); added Brave browser cookies across yt-dlp operations to bypass IP bans; integrated created_at metadata field (ISO-formatted from upload date) into chunks; replaced default database path with DEFAULT_DB_PATH constant.

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as YouTube API
    participant YTDlp as yt-dlp
    participant DB as Database
    
    Client->>Client: index_single_video(video_id)
    
    rect rgba(100, 150, 200, 0.5)
    Note over API,YTDlp: Transcript Extraction (Fallback Chain)
    Client->>API: get_transcript_via_api(video_id)
    alt API Success
        API-->>Client: transcript
    else API Fails
        Client->>YTDlp: get_transcript_via_download(video_id)
        alt Download Success
            YTDlp-->>Client: transcript
        else Download Fails
            Client->>Client: get_transcript_from_info(info)
            Client-->>Client: transcript
        end
    end
    end
    
    rect rgba(150, 200, 100, 0.5)
    Note over Client,DB: Metadata Enrichment & Storage
    Client->>Client: Extract created_at from upload_date
    Client->>Client: Attach created_at to chunk metadata
    Client->>DB: Upsert chunks with created_at
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 With API first and fallbacks tight,
Transcripts flow from YouTube's might,
Brave cookies guard the path we take,
While timestamps bloom for data's sake,
Hop hop, this script's now more complete! 🎬✨

🚥 Pre-merge checks | ✅ 3 passed
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main changes: introducing youtube-transcript-api v1.2.4 and adding Brave browser cookies to the indexer script.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 85.71%, which meets the required threshold of 80.00%.



@EtanHey EtanHey merged commit e69e287 into main Feb 19, 2026
4 of 5 checks passed
coderabbitai (bot) left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/index_youtube.py (1)

580-597: 🧹 Nitpick | 🔵 Trivial

created_at is recomputed on every iteration despite being constant for the video.

upload_date doesn't change between chunks. Moving the conversion outside the loop avoids repeated string slicing on every chunk upsert.

♻️ Proposed fix
+    # Convert upload_date (YYYYMMDD) to ISO timestamp for created_at
+    created_at = None
+    if upload_date and len(upload_date) == 8:
+        created_at = f"{upload_date[:4]}-{upload_date[4:6]}-{upload_date[6:8]}T00:00:00+00:00"
+
     for i, ec in enumerate(embedded):
         c = ec.chunk
         # Add upload_date to metadata
         meta = dict(c.metadata)
         if upload_date:
             meta["upload_date"] = upload_date
 
-        # Convert upload_date (YYYYMMDD) to ISO timestamp for created_at
-        created_at = None
-        if upload_date and len(upload_date) == 8:
-            created_at = f"{upload_date[:4]}-{upload_date[4:6]}-{upload_date[6:8]}T00:00:00+00:00"
-
         chunk_data.append({
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/index_youtube.py` around lines 580 - 597, Move the upload_date →
created_at conversion out of the per-chunk loop so it’s computed once per video
instead of on every iteration; compute created_at once (using the same YYYYMMDD
to ISO logic) before the loop that builds chunk_data, then reuse that created_at
variable inside chunk_data.append (where variables like c, i, meta, source_file,
project, content_type, value, char_count are referenced) to avoid repeated
string slicing and allocations.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1bcff8b and ce008ee.

📒 Files selected for processing (2)
  • pyproject.toml
  • scripts/index_youtube.py
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/index_youtube.py (1)
src/brainlayer/embeddings.py (2)
  • embed_chunks (44-85)
  • embed_chunks (117-125)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
  • GitHub Check: Cursor Bugbot
🔇 Additional comments (1)
scripts/index_youtube.py (1)

619-619: LGTM — centralised DEFAULT_DB_PATH correctly replaces the previous hardcoded path.


Comment thread pyproject.toml
Comment on lines +60 to +64
youtube = [
    "yt-dlp>=2024.1.0",  # YouTube video metadata + subtitle download
    "youtube-transcript-api>=1.2.4",  # Primary transcript API (instance-based)
    "certifi",  # SSL certificates for macOS
]

🧹 Nitpick | 🔵 Trivial

certifi lacks a version constraint.

All other packages in this group specify a minimum version. certifi without any bound can pull in any release, including future ones with breaking changes, reducing reproducibility.

📌 Proposed fix
-    "certifi",  # SSL certificates for macOS
+    "certifi>=2023.0.0",  # SSL certificates for macOS
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 60 - 64, The certifi entry in the youtube
dependency list has no version constraint; update the pyproject.toml so the
"youtube" array's "certifi" element includes a minimum version (e.g., change
"certifi" to a pinned minimum like "certifi>=<appropriate-version>") to match
the other dependencies and ensure reproducible installs; locate the "youtube"
list in pyproject.toml and modify the "certifi" item accordingly.

Comment thread scripts/index_youtube.py
# Transcript extraction via youtube-transcript-api (primary — avoids yt-dlp 429s)
# ---------------------------------------------------------------------------

_yt_api = YouTubeTranscriptApi() # Reuse single instance across calls

⚠️ Potential issue | 🟡 Minor

_yt_api doesn't benefit from the certifi SSL fix it's supposed to use.

The PR explicitly adds certifi to fix macOS SSL certificate errors. However, _yt_api uses requests internally with the default system SSL context. The certifi-backed context is only applied to the urllib.request.urlopen calls in _fetch_json3_transcript / _fetch_vtt_transcript. On macOS without proper certificates, the primary API path will still fail with SSL errors.

The library supports passing a custom requests.Session via http_client to fix this:

🛡️ Proposed fix to wire certifi into the primary API path
-_yt_api = YouTubeTranscriptApi()  # Reuse single instance across calls
+try:
+    import certifi
+    from requests import Session as _Session
+    _http = _Session()
+    _http.verify = certifi.where()
+    _yt_api = YouTubeTranscriptApi(http_client=_http)
+except ImportError:
+    _yt_api = YouTubeTranscriptApi()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/index_youtube.py` at line 58, The YouTubeTranscriptApi instance
(_yt_api) is still using the system SSL context and ignores the certifi fix;
create a requests.Session configured to use certifi (set session.verify =
certifi.where()) and pass that session into YouTubeTranscriptApi via its
http_client parameter when instantiating _yt_api so the primary API path uses
the certifi-backed SSL; reuse this session for the singleton _yt_api and keep
the existing urllib.request-based certifi usage in
_fetch_json3_transcript/_fetch_vtt_transcript unchanged.

Comment thread scripts/index_youtube.py
Comment on lines +68 to +77
    try:
        transcript_list = _yt_api.list(video_id)
        transcript = None
        try:
            transcript = transcript_list.find_manually_created_transcript(["en"])
        except Exception:
            try:
                transcript = transcript_list.find_generated_transcript(["en"])
            except Exception:
                pass

🧹 Nitpick | 🔵 Trivial

Bare except Exception in the inner blocks masks network errors and other non-transcript-availability failures.

If find_manually_created_transcript raises a network error (e.g., ConnectionError, Timeout), execution silently falls through to find_generated_transcript instead of propagating the real failure. The correct exceptions to catch are NoTranscriptFound (and optionally TranscriptsDisabled), available via from youtube_transcript_api import NoTranscriptFound.

Additionally, since find_transcript already prefers manually created transcripts over generated ones by default, the entire nested try/except can be collapsed significantly.

♻️ Proposed refactor
+from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound
 
 def get_transcript_via_api(video_id: str) -> list[dict] | None:
     try:
-        transcript_list = _yt_api.list(video_id)
-        transcript = None
-        try:
-            transcript = transcript_list.find_manually_created_transcript(["en"])
-        except Exception:
-            try:
-                transcript = transcript_list.find_generated_transcript(["en"])
-            except Exception:
-                pass
-
-        if not transcript:
-            return None
-
-        fetched = transcript.fetch()
+        fetched = _yt_api.fetch(video_id, languages=["en"])
         segments = []
         for entry in fetched:

If you need to distinguish manual vs. generated for logging, use NoTranscriptFound specifically:

+        try:
+            transcript = transcript_list.find_manually_created_transcript(["en"])
+        except NoTranscriptFound:
+            transcript = transcript_list.find_generated_transcript(["en"])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/index_youtube.py` around lines 68 - 77, The nested bare "except
Exception" blocks around transcript_list.find_manually_created_transcript and
find_generated_transcript mask real errors; update the code to either call
transcript_list.find_transcript (which already prefers manual over generated) or
explicitly catch only youtube_transcript_api exceptions like NoTranscriptFound
(and optionally TranscriptsDisabled) from "from youtube_transcript_api import
NoTranscriptFound, TranscriptsDisabled", then handle NoTranscriptFound by
falling back to the generated transcript or logging, but let other exceptions
(e.g., ConnectionError/Timeout) propagate; locate references to _yt_api.list,
transcript_list.find_manually_created_transcript,
transcript_list.find_generated_transcript, and transcript_list.find_transcript
to implement this change.

Comment thread scripts/index_youtube.py
Comment on lines 102 to 122
def extract_video_info(video_url: str) -> dict[str, Any] | None:
    """Extract metadata + subtitles for a single video.

    Uses Brave browser cookies to bypass YouTube IP bans on subtitle requests.
    Uses process=False to skip format selection (avoids 'format not available' errors).
    """
    opts = {
        "skip_download": True,
        "writesubtitles": True,
        "writeautomaticsub": True,
        "subtitleslangs": ["en", "en-orig", "en.*"],
        "quiet": True,
        "no_warnings": True,
        "cookiesfrombrowser": ("brave",),
    }
    try:
        with yt_dlp.YoutubeDL(opts) as ydl:
            return ydl.extract_info(video_url, download=False, process=False)
    except Exception as e:
        log.warning(f"Failed to extract {video_url}: {e}")
        return None

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, check the repository structure and find yt-dlp integration
find . -type f -name "*.py" | grep -E "(youtube|yt_dlp)" | head -20

Repository: EtanHey/brainlayer



🏁 Script executed:

# Read the full context of scripts/index_youtube.py to understand how extract_video_info is used
wc -l scripts/index_youtube.py



🏁 Script executed:

# Look at the get_transcript_from_info function at line 543
sed -n '540,570p' scripts/index_youtube.py



🏁 Script executed:

# Check if yt-dlp is vendored or external
find . -type d -name "yt_dlp" | head -5



🏁 Script executed:

# Look for requirements or dependencies
find . -type f -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" | xargs grep -l "yt-dlp" 2>/dev/null



🏁 Script executed:

# Check pyproject.toml for yt-dlp dependency
cat pyproject.toml | grep -A5 -B5 "yt-dlp"



🏁 Script executed:

# Look at the full get_transcript_from_info function
sed -n '543,600p' scripts/index_youtube.py



🏁 Script executed:

# Check how extract_video_info is called and what info is used
grep -n "extract_video_info" scripts/index_youtube.py



🏁 Script executed:

# Get more context around the function at lines 102-122 and 543
sed -n '95,130p' scripts/index_youtube.py



🏁 Script executed:

# Find the get_transcript_from_info function definition
grep -n "def get_transcript_from_info" scripts/index_youtube.py



🏁 Script executed:

# Get the full implementation of get_transcript_from_info
sed -n '470,510p' scripts/index_youtube.py



🏁 Script executed:

# Check if there are any other references or documentation about process=False
grep -n "process" scripts/index_youtube.py | head -20



🏁 Script executed:

# Look at the imports to understand what yt_dlp is being used
head -50 scripts/index_youtube.py | grep -E "^import|^from"



🏁 Script executed:

# Get the full implementation of get_transcript_from_info
sed -n '170,220p' scripts/index_youtube.py



🏁 Script executed:

# Check how the info dict is used in the entire flow
grep -n "info\[" scripts/index_youtube.py | head -20



🏁 Script executed:

# Search for how extract_info result is used elsewhere
grep -n "\.extract_info" scripts/index_youtube.py -A5



🏁 Script executed:

# Check the full context of the index_video function where extract_video_info is called
sed -n '505,560p' scripts/index_youtube.py



🌐 Web query:

yt-dlp process=False subtitles automatic_captions raw extractor output

💡 Result:

In the Python API, YoutubeDL.extract_info(..., process=False) returns the raw extractor result (i.e., it does not “resolve” / post-process URL references, playlist items, etc.). yt-dlp’s own docstring describes process as: “resolve all unresolved references (URLs, playlist items)” and notes it “must be True for download to work.” [1]

What you get for subtitles / automatic captions in “raw extractor output”

Even with process=False, extractors commonly include these keys in the returned dict:

  • info_dict["subtitles"]: human-provided subtitles (if any)
  • info_dict["automatic_captions"]: auto-generated captions (if any)

These are typically shaped like:

{
  "en": [{"ext": "vtt", "url": "...", ...}, {"ext": "ttml", "url": "...", ...}],
  "ja": [...],
}

Why process=False can change what you see/receive

Subtitle selection/merging (e.g., honoring writesubtitles, writeautomaticsub, subtitleslangs, regex selection, etc.) is performed by yt-dlp’s subtitle processing logic (it merges normal subs + auto subs depending on params, then filters to requested languages). [2] If you’re trying to see exactly what the extractor surfaced, process=False is appropriate; if you want yt-dlp to apply your subtitle options, use the default process=True.

Minimal example: “raw extractor output” for subs/captions

import yt_dlp

with yt_dlp.YoutubeDL({"skip_download": True}) as ydl:
    info = ydl.extract_info(URL, download=False, process=False)
    subs = info.get("subtitles", {})
    autos = info.get("automatic_captions", {})

(Also note yt-dlp itself uses process=False internally when resolving some url_transparent results, specifically to avoid resolving/processing at that step.) [3]

Sources: [1] [3] [2]


🌐 Web query:

yt-dlp YouTube extractor subtitles automatic_captions raw output without process

💡 Result:

To get YouTube auto-captions in a “raw” (YouTube-native) format with yt-dlp (i.e., download as-served, without converting/post-processing), use an auto-sub format like json3 or srv3, and do not use --convert-subs.

1) Download raw auto-captions file (no video)

yt-dlp --skip-download --write-auto-subs --sub-lang en --sub-format json3 URL

--write-auto-subs downloads auto-generated captions, and YouTube auto-captions commonly support vtt, ttml, srv3, srv2, srv1, json3 formats (use json3/srv3 for the most “raw” feel). [1]

2) See what subtitle formats/languages are available (incl. auto)

yt-dlp --list-subs URL

This prints available subtitles/automatic captions and their formats. [2]

3) Get the direct “timedtext” URL (so you can fetch it yourself)

If you want the extractor’s direct caption URL (and then you can curl/wget it), dump metadata and look at automatic_captions:

yt-dlp --dump-json URL

The JSON includes an automatic_captions object when available. [3]

Sources:
[1] yt-dlp issue showing available automatic caption formats (incl. json3, srv3) yt-dlp/yt-dlp#4090
[2] yt-dlp README (subtitle options incl. --list-subs, --write-auto-subs) https://github.com/yt-dlp/yt-dlp
[3] yt-dlp issue discussing --dump-json and automatic_captions field yt-dlp/yt-dlp#5414


🏁 Script executed:

# Check if there's any error handling or logging about missing subtitles
grep -n "subtitles\|automatic_captions" scripts/index_youtube.py



The subtitle options set in opts won't actually be applied with process=False, requiring manual language selection in get_transcript_from_info() to work.

With process=False, yt-dlp skips post-processing including subtitle filtering/selection logic. The options writesubtitles, writeautomaticsub, and subtitleslangs won't filter the output. However, the subtitles and automatic_captions fields are typically included in the raw extractor output. The current implementation handles this by manually iterating through language variants (en, en-orig, en-US, en-GB) in get_transcript_from_info(), which is appropriate. The process=False usage is reasonable since it avoids format selection errors, but the subtitle options set in the opts dict are superfluous and could be removed for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/index_youtube.py` around lines 102 - 122, The opts dict in
extract_video_info contains subtitle-related keys that are ignored when using
process=False; remove writesubtitles, writeautomaticsub, and subtitleslangs from
opts to avoid confusion and keep only effective options (e.g., skip_download,
quiet, no_warnings, cookiesfrombrowser), while preserving process=False in
ydl.extract_info and relying on the existing language-selection logic in
get_transcript_from_info to pick subtitles from the raw extractor fields.

Comment thread scripts/index_youtube.py
"subtitleslangs": ["en", "en-orig", "en.*"],
"quiet": True,
"no_warnings": True,
"cookiesfrombrowser": ("brave",),

⚠️ Potential issue | 🟠 Major

Hardcoded cookiesfrombrowser: ("brave",) silently breaks all yt-dlp operations on non-Brave systems.

Every yt-dlp code path — extract_video_info, get_transcript_via_download, list_channel_videos, and list_playlist_videos — hardcodes Brave. yt-dlp supports multiple browsers including chrome, chromium, edge, firefox, opera, safari, vivaldi, and whale, but a system without Brave installed will raise an error during cookie extraction. The downstream effects are:

  • extract_video_info → catches exception → returns None → video_id = "unknown", no metadata
  • get_transcript_via_download → catches exception → returns None (expected fallback)
  • list_channel_videos / list_playlist_videos → catches exception → returns [], causing the entire channel/playlist batch run to process zero videos with no actionable error

The last point is the most severe: the tool appears to succeed but indexes nothing when run against a channel/playlist on a machine without Brave.

🔧 Proposed fix — make Brave optional with graceful fallback

Extract cookie opts into a helper and fall back to no cookies:

+def _cookie_opts() -> dict:
+    """Return cookiesfrombrowser opts when Brave is available, else an empty dict."""
+    import shutil
+    # Quick heuristic: yt-dlp raises during cookie extraction when the browser
+    # profile is missing, so only request Brave cookies if a brave binary is on
+    # PATH; callers fall back to cookie-less yt-dlp options otherwise.
+    if shutil.which("brave") or shutil.which("brave-browser"):
+        return {"cookiesfrombrowser": ("brave",)}
+    return {}

Or alternatively, expose a --cookies-from-browser CLI flag so the caller can choose their browser:

+parser.add_argument(
+    "--cookies-from-browser",
+    metavar="BROWSER",
+    default=None,
+    help="Browser to extract cookies from (brave/chrome/firefox/etc). Default: none.",
+)

and pass {"cookiesfrombrowser": (args.cookies_from_browser,)} if args.cookies_from_browser else {} to yt-dlp opts.

Also applies to: 144-144, 320-320, 347-347

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/index_youtube.py` at line 115, The hardcoded cookiesfrombrowser
option ("brave") causes failures on systems without Brave; update
scripts/index_youtube.py so cookie options are optional and gracefully fall back
to no cookies: extract the cookie option into a helper (e.g., build_cookie_opts
or get_cookie_opts) and return {"cookiesfrombrowser": (browser,)} only when a
browser is configured/available (or when a new CLI flag
args.cookies_from_browser is set), otherwise return {}; then use that helper in
extract_video_info, get_transcript_via_download, list_channel_videos and
list_playlist_videos so yt-dlp calls get an empty options dict instead of always
forcing "brave".

@EtanHey EtanHey deleted the fix/youtube-indexer-v124 branch February 19, 2026 20:22
