Turn any YouTube video into a VLM-ready context pack — a timed transcript, the frames that actually matter, and the cinematic grammar underneath, compiled into copy-paste artifacts your coding agents can build from.
yt2ctx is a pipeline that watches a YouTube video the way a film editor
would, then writes down what it learned. It is not just a transcript tool — the
goal is to turn reference cinema into executable production grammar for
coding agents and downstream generation systems.
Given a URL, it:
- downloads the video and extracts a compressed audio track
- transcribes speech with per-segment timestamps
- samples candidate frames across the timeline
- describes and scores every frame with OpenAI vision + embeddings
- selects the most representative frames (top-k or salience-density)
- compiles a style bible, Blender/Remotion-ready shot specs, a Codex/Claude implementation prompt, and anti-slop validators
- writes Markdown, JSON, the selected frame JPGs, and a ZIP bundle
It ships as three interfaces over one pipeline — a web app, a CLI, and an MCP stdio server.
URL ─▶ download ─▶ audio ─▶ transcribe ─▶ sample frames ─▶ vision + embeddings
─▶ score & select ─▶ compile cinematic grammar ─▶ artifacts ( md · json · jpg · zip )
- Agent-ready outputs: produces Markdown, JSON, selected frames, shot specs, and implementation prompts instead of a transcript alone.
- One core pipeline: the web app, CLI, HTTP API, and MCP server all share the same analysis logic.
- Portable artifacts: every run writes a self-contained job folder and ZIP bundle that can be shared with humans or passed to downstream tools.
- OSS-friendly by default: typed TypeScript, explicit environment config, issue templates, CI, Dependabot, and contribution guidance are included.
| Stage | What happens |
|---|---|
| Download | yt-dlp (bundled via youtube-dl-exec) fetches the best MP4. |
| Audio | Bundled ffmpeg demuxes a 16 kHz mono speech track. |
| Transcribe | OpenAI transcription returns verbose JSON with segment timestamps. |
| Sample | ffmpeg extracts candidate frames at a configurable interval. |
| Vision | Each frame is described, tagged, and scored for salience. |
| Embed | Frame descriptions are embedded to measure semantic novelty. |
| Select | A weighted score picks top-k or density-sampled frames. |
| Compile | A vision model extracts the reusable cinematic grammar. |
| Package | Everything is written to disk and zipped. |
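To illustrate the Embed stage, semantic novelty can be approximated as one minus the highest cosine similarity between a frame's description embedding and the embeddings of earlier frames. This is a minimal sketch under that assumption. The function names and the novelty definition are hypothetical and are not taken from the yt2ctx source.

```typescript
// Cosine similarity between two equal-length vectors (0 when either is all zeros).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical novelty definition: 1 - max similarity to any earlier frame,
// clamped to [0, 1]. The first frame has nothing to compare against, so it is 1.
function semanticNovelty(embeddings: number[][]): number[] {
  return embeddings.map((vec, i) => {
    if (i === 0) return 1;
    let maxSim = -1;
    for (const earlier of embeddings.slice(0, i)) {
      maxSim = Math.max(maxSim, cosineSimilarity(vec, earlier));
    }
    return Math.min(1, Math.max(0, 1 - maxSim));
  });
}
```

A frame whose description repeats an earlier one scores near 0, while a frame that introduces new content scores near 1.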
- Node.js 20+
- An OPENAI_API_KEY
- A Postgres DATABASE_URL for the authenticated web app
- A Vercel Blob BLOB_READ_WRITE_TOKEN for generated frames and ZIPs
- Network access to YouTube and OpenAI
ffmpeg and ffprobe are bundled — no system install required.
git clone https://github.com/JacobFV/yt2ctx.git
cd yt2ctx
npm install
cp .env.example .env
# open .env and set OPENAI_API_KEY, DATABASE_URL, and BLOB_READ_WRITE_TOKEN
npm run dev # web app at http://localhost:3000
Or skip the browser and go straight to the CLI:
npm run cli -- "https://www.youtube.com/watch?v=VIDEO_ID"
npm install
cp .env.example .env
npm run dev
Useful checks:
npm run typecheck
npm run lint
npm run build
The build runs the Next.js app and compiles the CLI/MCP binaries into dist/.
Generated analysis output is written to .yt2ctx/ by default and should not be
committed.
| Path | Purpose |
|---|---|
| src/core/ | Shared download, transcription, frame analysis, scoring, rendering, and packaging logic. |
| src/app/ | Next.js web app, docs pages, and HTTP API route. |
| src/cli.ts | Command-line interface over the core pipeline. |
| src/mcp.ts | MCP stdio server exposing watch_youtube. |
| assets/ | README and product artwork. |
| .github/ | Issue forms, PR template, CI, Dependabot, labels, and repo assets. |
npm run dev
Open http://localhost:3000, create an account or sign in, paste a YouTube URL,
and run analysis. The interface is a single editorial experience — "The
Reference Monograph" — styled as a printed film publication: warm paper, ink,
one printer's red, and the two moments you watch (the processing frame and the
lightbox) drop to theater black.
- a URL composer that detects the video and shows its thumbnail before you run
- account auth with HttpOnly sessions, a Postgres-backed video library, and Blob-backed frame/ZIP storage
- a collapsible Tuning panel for frame count, selection mode, and sampling
- live pipeline progress — every stage reports in real time with an overall percentage, an elapsed clock, and per-frame counts, instead of a blind spinner
- a result view with tabs for the watch pack, frames, style bible, shot specs, Codex prompt, and slop warnings
- rendered Markdown with a Reading/Raw toggle, per-tab copy, and .md download
- a frame gallery with a keyboard-navigable lightbox and per-frame downloads
- a one-click artifact ZIP download
- responsive across desktop and mobile, with reduced-motion support
npm run cli -- "https://www.youtube.com/watch?v=VIDEO_ID" -k 8 --mode all
Useful options:
npm run cli -- "<url>" \
--output .yt2ctx \
--top-k 10 \
--selection-mode density \
--mode style \
--candidate-interval 6 \
--max-candidates 48 \
--frame-width 768 \
--cookies-from-browser chrome \
--quiet
| Option | Default | Description |
|---|---|---|
| -k, --top-k <n> | 8 | Number of frames to select. |
| -m, --mode <mode> | all | Output: watch, style, shot-specs, prompt, all. |
| --selection-mode <mode> | density | Frame selection: density or top-k. |
| --candidate-interval <s> | 8 | Seconds between sampled frames. |
| --max-candidates <n> | 36 | Candidate frames sent to vision analysis. |
| --frame-width <px> | 768 | Extracted frame width. |
| --cookies <path> | - | Netscape cookies.txt file to pass to yt-dlp. |
| --cookies-from-browser <browser> | - | Browser cookie source to pass to yt-dlp, such as chrome or firefox. |
| -o, --output <dir> | .yt2ctx | Output directory. |
| --json | - | Print JSON metadata instead of Markdown. |
| --with-data-urls | - | Include base64 data URLs in JSON output. |
| --quiet | - | Suppress the live progress display. |
The CLI renders a live progress bar on stderr as it moves through each pipeline stage. stdout only ever receives the requested artifact text or JSON, so it stays safe to pipe.
If YouTube returns "Sign in to confirm you're not a bot", use a signed-in browser session:
npm run cli -- "<url>" --cookies-from-browser chrome
For server, API, or MCP runs, set YT2CTX_YTDLP_COOKIES to a cookies.txt path
or YT2CTX_YTDLP_COOKIES_FROM_BROWSER to a browser name in the environment.
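The two environment variables correspond to yt-dlp's --cookies and --cookies-from-browser flags. The sketch below shows one way such a mapping could look. The helper name ytdlpCookieArgs and the choice to prefer the cookies file when both variables are set are assumptions for illustration, not documented yt2ctx behavior.

```typescript
// Environment passed in explicitly so the helper stays pure and easy to test.
type Env = Record<string, string | undefined>;

// Hypothetical helper: turn the documented variables into yt-dlp arguments.
// Assumption: a cookies.txt path takes precedence over a browser name.
function ytdlpCookieArgs(env: Env): string[] {
  const cookiesFile = env["YT2CTX_YTDLP_COOKIES"]?.trim();
  const browser = env["YT2CTX_YTDLP_COOKIES_FROM_BROWSER"]?.trim();
  if (cookiesFile) return ["--cookies", cookiesFile];
  if (browser) return ["--cookies-from-browser", browser];
  return []; // no cookie source configured
}
```

With only YT2CTX_YTDLP_COOKIES_FROM_BROWSER=chrome set, the helper yields the same flag pair as the CLI example above.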
yt2ctx exposes the pipeline to MCP clients (Claude Desktop, Claude Code, and
any other agent that speaks MCP) as a single tool: watch_youtube.
npm install # if you have not already
npm run build:bin # produces dist/mcp.js
This compiles a standalone stdio server to dist/mcp.js.
The server needs OPENAI_API_KEY in its environment. It will read a .env file
in its working directory if one exists, but because MCP clients launch the
process from an arbitrary directory, passing the key explicitly in the client
config is recommended.
Claude Desktop
Edit the config file:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
{
"mcpServers": {
"yt2ctx": {
"command": "node",
"args": ["/absolute/path/to/yt2ctx/dist/mcp.js"],
"env": {
"OPENAI_API_KEY": "sk-..."
}
}
}
}
Restart Claude Desktop. watch_youtube will appear in the tools list.
Claude Code
claude mcp add yt2ctx \
--env OPENAI_API_KEY=sk-... \
-- node /absolute/path/to/yt2ctx/dist/mcp.js
Verify with claude mcp list.
Any other MCP client
Launch this command as a stdio MCP server, with OPENAI_API_KEY in its
environment:
node /absolute/path/to/yt2ctx/dist/mcp.js
| Argument | Default | Description |
|---|---|---|
| url | (required) | YouTube video URL. |
| topK | 8 | Number of frames to select. |
| mode | density | Frame selection: density or top-k. |
| outputMode | all | watch, style, prompt, shot-specs, or all. |
| candidateIntervalSeconds | 8 | Seconds between sampled frames. |
| maxCandidateFrames | 36 | Candidate frames sent to vision analysis. |
| frameWidth | 768 | Extracted frame width. |
| outputDir | (optional) | Where to write artifacts. |
The tool returns the requested text artifact plus the selected frames as MCP image content, and also writes the full artifact set to disk.
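For client authors, the argument table can be expressed as a TypeScript type with the documented defaults filled in. This is a sketch derived from the table only. The names WatchYoutubeArgs and resolveWatchYoutubeArgs are hypothetical, and the server's own schema may differ.

```typescript
// Arguments accepted by watch_youtube, as listed in the table above.
interface WatchYoutubeArgs {
  url: string;
  topK?: number;
  mode?: "density" | "top-k";
  outputMode?: "watch" | "style" | "prompt" | "shot-specs" | "all";
  candidateIntervalSeconds?: number;
  maxCandidateFrames?: number;
  frameWidth?: number;
  outputDir?: string;
}

// Same shape with every defaulted field required; outputDir has no documented default.
type ResolvedWatchYoutubeArgs =
  Required<Omit<WatchYoutubeArgs, "outputDir">> & { outputDir: string | undefined };

// Hypothetical helper: apply the documented defaults to a partial argument set.
function resolveWatchYoutubeArgs(args: WatchYoutubeArgs): ResolvedWatchYoutubeArgs {
  if (!args.url) throw new Error("url is required");
  return {
    url: args.url,
    topK: args.topK ?? 8,
    mode: args.mode ?? "density",
    outputMode: args.outputMode ?? "all",
    candidateIntervalSeconds: args.candidateIntervalSeconds ?? 8,
    maxCandidateFrames: args.maxCandidateFrames ?? 36,
    frameWidth: args.frameWidth ?? 768,
    outputDir: args.outputDir,
  };
}
```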
GET /api/analyze returns the endpoint contract as JSON, so the API is
self-documenting. POST /api/analyze requires an authenticated web session,
runs the pipeline, saves the completed analysis to the signed-in user's Postgres
video library, and content-negotiates its response so the same endpoint
serves the browser and headless agents:
- Accept: application/x-ndjson streams newline-delimited JSON: zero or more {"type":"progress","stage":"vision","pct":0.71,...} events, then one {"type":"result","result":{...}} line. Failures arrive as {"type":"error","message":"..."}. The web app uses this for live progress.
- Any other Accept value returns a single buffered JSON result object, or {"error":"..."} with HTTP 400. This is the simplest option for an agent to consume with fetch().then(r => r.json()).
Request body (JSON): url (required), topK, mode (density or top-k),
candidateIntervalSeconds, maxCandidateFrames, frameWidth.
Result: metadata, markdown, frames (each with imageUrl and
imageDownloadUrl), cinematic artifacts, zipUrl, and zipDownloadUrl.
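A client reading the NDJSON stream has to cope with network chunks that end mid-line. The sketch below buffers incoming text and emits one parsed event per complete line. The AnalyzeEvent type covers only the fields shown above (progress events carry additional fields that are elided here), and the NdjsonBuffer class name is hypothetical.

```typescript
// Event shapes as documented above; extra progress fields are not modeled.
type AnalyzeEvent =
  | { type: "progress"; stage: string; pct: number }
  | { type: "result"; result: unknown }
  | { type: "error"; message: string };

// Accumulates text chunks and returns the events completed by each chunk.
class NdjsonBuffer {
  private pending = "";

  push(chunk: string): AnalyzeEvent[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is either "" or an incomplete line; keep it for later.
    this.pending = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as AnalyzeEvent);
  }
}
```

In practice each chunk would come from decoding the response body stream (for example with TextDecoder) before calling push.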
# Discover the contract
curl -s http://localhost:3000/api/analyze
# Headless agent — one buffered JSON result
curl -s -X POST http://localhost:3000/api/analyze \
-H 'Content-Type: application/json' \
-d '{"url":"https://youtu.be/VIDEO_ID"}'
# Live streaming progress
curl -N -X POST http://localhost:3000/api/analyze \
-H 'Content-Type: application/json' \
-H 'Accept: application/x-ndjson' \
-d '{"url":"https://youtu.be/VIDEO_ID"}'
Every run writes a job folder under the output directory (.yt2ctx/<job-id>/):
| File | Contents |
|---|---|
| watch.md | Timed transcript plus representative frame metadata. |
| style-bible.md | The extracted cinematic production grammar. |
| shot-specs.md / shot-specs.json | Blender/Remotion-ready shot specs. |
| codex-prompt.md | A direct implementation prompt for coding agents. |
| metadata.json | The full structured analysis result. |
| frames/*.jpg | The selected frame images. |
| yt2ctx-artifacts.zip | Everything above, bundled. |
The extra outputs are designed for downstream generation systems.
style-bible.md extracts the production grammar: cinematic ontology,
reference lineage, camera/lens/lighting/material/edit/typography/sound language,
narration register and forbidden phrases, reusable shot patterns, and transfer
rules for new products.
shot-specs.json makes the reference executable: source frame and
timestamp, shot type and purpose, lens/focal length/aperture/rig/movement/focus
behavior, lighting setup, material emphasis, Blender render passes, diffusion
finishing intent, Remotion role, and anti-slop forbidden moves.
codex-prompt.md is a direct implementation prompt for coding agents. It
tells Codex/Claude to build a physically grounded Blender-first pipeline with
diffusion as finishing and Remotion as editorial assembly — not as the visual
substrate.
slopWarnings are validator-ready rules that catch presentation-deck
failure modes: arbitrary floating UI, LinkedIn announcement language, missing
lens metadata, and ungrounded abstract AI visuals.
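To show what "validator-ready" could mean in practice, the sketch below checks a list of shot specs for two of the failure modes named above: missing lens metadata and forbidden narration phrases. The ShotSpec fields (id, lens, narration) and the function name are hypothetical. The real shot-specs.json schema is not reproduced here.

```typescript
// Hypothetical, reduced shot spec: only the fields this check needs.
interface ShotSpec {
  id: string;
  lens?: { focalLengthMm?: number; aperture?: string };
  narration?: string;
}

// Returns one human-readable warning per violation found.
function findSlopWarnings(shots: ShotSpec[], forbiddenPhrases: string[]): string[] {
  const warnings: string[] = [];
  for (const shot of shots) {
    // Rule 1: every shot should carry lens metadata.
    if (!shot.lens || shot.lens.focalLengthMm === undefined || !shot.lens.aperture) {
      warnings.push(`${shot.id}: missing lens metadata`);
    }
    // Rule 2: narration must not contain forbidden phrases (case-insensitive).
    const narration = (shot.narration ?? "").toLowerCase();
    for (const phrase of forbiddenPhrases) {
      if (narration.includes(phrase.toLowerCase())) {
        warnings.push(`${shot.id}: forbidden phrase "${phrase}"`);
      }
    }
  }
  return warnings;
}
```

The forbidden phrase list would come from the narration rules in the style bible rather than being hard-coded.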
- top-k sorts candidate frames by score and returns the highest scoring.
- density treats salience as a timeline density and samples across weighted buckets. This usually gives a more representative sequence across the whole video while still preferring information-rich moments.
The score combines OpenAI vision salience, semantic novelty from frame descriptions, visual scene-change novelty, nearby transcript density, and colorfulness.
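The difference between the two selection modes is easier to see in code. The sketch below combines the five signals with a weighted sum and implements both modes. The weights, the field names, and the bucketing scheme (equal slices of cumulative score along the timeline, best frame per slice) are assumptions chosen for illustration, not the values or algorithm yt2ctx actually uses.

```typescript
// Per-frame signals, each assumed to be normalized to [0, 1]. Names are hypothetical.
interface FrameSignals {
  timeSeconds: number;
  salience: number;
  semanticNovelty: number;
  sceneChange: number;
  transcriptDensity: number;
  colorfulness: number;
}

// Hypothetical weights; they sum to 1 so the combined score stays in [0, 1].
const WEIGHTS = { salience: 0.4, semanticNovelty: 0.2, sceneChange: 0.2, transcriptDensity: 0.1, colorfulness: 0.1 };

function combinedScore(f: FrameSignals): number {
  return WEIGHTS.salience * f.salience + WEIGHTS.semanticNovelty * f.semanticNovelty +
    WEIGHTS.sceneChange * f.sceneChange + WEIGHTS.transcriptDensity * f.transcriptDensity +
    WEIGHTS.colorfulness * f.colorfulness;
}

const byTime = (a: FrameSignals, b: FrameSignals): number => a.timeSeconds - b.timeSeconds;

// top-k: take the k highest-scoring frames, then restore timeline order.
function selectTopK(frames: FrameSignals[], k: number): FrameSignals[] {
  return [...frames]
    .sort((a, b) => combinedScore(b) - combinedScore(a))
    .slice(0, Math.max(0, k))
    .sort(byTime);
}

// density: split the cumulative score along the timeline into k equal-mass
// buckets and keep the best frame in each. May return fewer than k frames
// when a bucket receives no frame.
function selectDensity(frames: FrameSignals[], k: number): FrameSignals[] {
  if (k <= 0 || frames.length === 0) return [];
  const ordered = [...frames].sort(byTime);
  const total = ordered.reduce((sum, f) => sum + combinedScore(f), 0);
  if (total === 0) return ordered.slice(0, k);
  const best = new Map<number, FrameSignals>();
  let cumulative = 0;
  for (const frame of ordered) {
    const s = combinedScore(frame);
    const midpoint = cumulative + s / 2; // where this frame sits in the score mass
    const bucket = Math.min(k - 1, Math.floor((midpoint / total) * k));
    const current = best.get(bucket);
    if (!current || s > combinedScore(current)) best.set(bucket, frame);
    cumulative += s;
  }
  return Array.from(best.values()).sort(byTime);
}
```

With several strong frames clustered at the start of a video, selectTopK returns only that cluster, while selectDensity is pushed to pick at least one frame from later in the timeline.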
Environment variables (see .env.example):
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | Your OpenAI API key. |
| DATABASE_URL | (required for web) | Postgres connection string for accounts, sessions, and saved video analyses. |
| BLOB_READ_WRITE_TOKEN | (required for web) | Vercel Blob token for generated frame JPGs and artifact ZIPs. |
| OPENAI_TRANSCRIBE_MODEL | whisper-1 | Transcription model. |
| OPENAI_VISION_MODEL | gpt-4.1-mini | Vision + grammar model. |
| OPENAI_EMBEDDING_MODEL | text-embedding-3-small | Embedding model. |
| YT2CTX_OUTPUT_DIR | .yt2ctx | Default artifact directory. |
| YT2CTX_YTDLP_COOKIES | - | Optional Netscape cookies.txt path for yt-dlp. |
| YT2CTX_YTDLP_COOKIES_FROM_BROWSER | - | Optional browser cookie source for yt-dlp, such as chrome or firefox. |
whisper-1 is the default because it supports verbose JSON with segment
timestamps.
This repo is intended to deploy through a linked GitHub repository on Vercel. Push to the configured production branch and let Vercel build automatically.
Set OPENAI_API_KEY in the Vercel project settings before relying on automatic
deployments. The web app also requires DATABASE_URL and
BLOB_READ_WRITE_TOKEN; the project is designed to use Vercel Marketplace
Postgres plus Vercel Blob, which inject connection environment variables when
connected to the project. The analyze route is configured for the Node.js
runtime with a 300 second function duration. Serverless limits still apply —
long videos are better processed through the CLI or MCP server; short videos and
clips fit the hosted web path.
Important
The hosted web app requires authentication, but each analysis still costs real
OpenAI usage. Before widening access to a public deployment, add usage limits
and billing controls — see TODO.md.
Planned work — including gating the public web app behind authentication and
billing so it cannot run up unbounded OpenAI spend — is tracked in
TODO.md.
Issues and pull requests are welcome. Good contributions include bug fixes, documentation improvements, sharper prompts/artifact schemas, better frame selection heuristics, and MCP/client compatibility work.
Before opening a PR:
- Search existing issues to avoid duplicate work.
- Keep the change focused and include screenshots or sample artifacts for UI and output changes.
- Run npm run typecheck, npm run lint, and, when practical, npm run build.
- Do not commit .env, API keys, private video URLs, or generated .yt2ctx/ output.
Use the issue templates for bugs and feature requests, and the PR template for review context. Security reports should be opened privately through GitHub Security Advisories rather than public issues.
This repository includes:
- bug and feature request issue forms
- a pull request template
- CI for typecheck, lint, and build
- Dependabot configuration for npm and GitHub Actions
- a label set for triage
- a funding placeholder for future sponsorship setup
Only process videos you have the right to download and analyze. YouTube
availability and extractor behavior can change; youtube-dl-exec bundles
yt-dlp, which is more robust than browser-only download libraries.