GitHub - JacobFV/yt2ctx

The yt2ctx web app — “The Reference Monograph”

Turn any YouTube video into a VLM-ready context pack — a timed transcript, the frames that actually matter, and the cinematic grammar underneath, compiled into copy-paste artifacts your coding agents can build from.

Overview · Quick start · Web app · CLI · MCP server · HTTP API · Contributing · Roadmap

Overview

yt2ctx is a pipeline that watches a YouTube video the way a film editor would, then writes down what it learned. It is not just a transcript tool — the goal is to turn reference cinema into executable production grammar for coding agents and downstream generation systems.

Given a URL, it:

downloads the video and extracts a compressed audio track
transcribes speech with per-segment timestamps
samples candidate frames across the timeline
describes and scores every frame with OpenAI vision + embeddings
selects the most representative frames (top-k or salience-density)
compiles a style bible, Blender/Remotion-ready shot specs, a Codex/Claude implementation prompt, and anti-slop validators
writes Markdown, JSON, the selected frame JPGs, and a ZIP bundle

It ships as three interfaces over one pipeline — a web app, a CLI, and an MCP stdio server.

URL ─▶ download ─▶ audio ─▶ transcribe ─▶ sample frames ─▶ vision + embeddings
    ─▶ score & select ─▶ compile cinematic grammar ─▶ artifacts ( md · json · jpg · zip )

Why use it?

Agent-ready outputs: produces Markdown, JSON, selected frames, shot specs, and implementation prompts instead of a transcript alone.
One core pipeline: the web app, CLI, HTTP API, and MCP server all share the same analysis logic.
Portable artifacts: every run writes a self-contained job folder and ZIP bundle that can be shared with humans or passed to downstream tools.
OSS-friendly by default: typed TypeScript, explicit environment config, issue templates, CI, Dependabot, and contribution guidance are included.

How it works

Stage	What happens
Download	`yt-dlp` (bundled via `youtube-dl-exec`) fetches the best MP4.
Audio	Bundled `ffmpeg` demuxes a 16 kHz mono speech track.
Transcribe	OpenAI transcription returns verbose JSON with segment timestamps.
Sample	`ffmpeg` extracts candidate frames at a configurable interval.
Vision	Each frame is described, tagged, and scored for salience.
Embed	Frame descriptions are embedded to measure semantic novelty.
Select	A weighted score picks top-k or density-sampled frames.
Compile	A vision model extracts the reusable cinematic grammar.
Package	Everything is written to disk and zipped.

Requirements

Node.js 20+
An OPENAI_API_KEY
A Postgres DATABASE_URL for the authenticated web app
A Vercel Blob BLOB_READ_WRITE_TOKEN for generated frames and ZIPs
Network access to YouTube and OpenAI

ffmpeg and ffprobe are bundled — no system install required.

Quick start

git clone https://github.com/JacobFV/yt2ctx.git
cd yt2ctx
npm install

cp .env.example .env
# open .env and set OPENAI_API_KEY, DATABASE_URL, and BLOB_READ_WRITE_TOKEN

npm run dev        # web app at http://localhost:3000

Or skip the browser and go straight to the CLI:

npm run cli -- "https://www.youtube.com/watch?v=VIDEO_ID"

Local development

npm install
cp .env.example .env
npm run dev

Useful checks:

npm run typecheck
npm run lint
npm run build

The build runs the Next.js app and compiles the CLI/MCP binaries into dist/. Generated analysis output is written to .yt2ctx/ by default and should not be committed.

Project layout

Path	Purpose
`src/core/`	Shared download, transcription, frame analysis, scoring, rendering, and packaging logic.
`src/app/`	Next.js web app, docs pages, and HTTP API route.
`src/cli.ts`	Command-line interface over the core pipeline.
`src/mcp.ts`	MCP stdio server exposing `watch_youtube`.
`assets/`	README and product artwork.
`.github/`	Issue forms, PR template, CI, Dependabot, labels, and repo assets.

Web app

npm run dev

Open http://localhost:3000, create an account or sign in, paste a YouTube URL, and run analysis. The interface is a single editorial experience — "The Reference Monograph" — styled as a printed film publication: warm paper, ink, one printer's red, and the two moments you watch (the processing frame and the lightbox) drop to theater black.

a URL composer that detects the video and shows its thumbnail before you run
account auth with HttpOnly sessions, a Postgres-backed video library, and Blob-backed frame/ZIP storage
a collapsible Tuning panel for frame count, selection mode, and sampling
live pipeline progress — every stage reports in real time with an overall percentage, an elapsed clock, and per-frame counts, instead of a blind spinner
a result view with tabs for the watch pack, frames, style bible, shot specs, Codex prompt, and slop warnings
rendered Markdown with a Reading/Raw toggle, per-tab copy, and .md download
a frame gallery with a keyboard-navigable lightbox and per-frame downloads
a one-click artifact ZIP download
responsive across desktop and mobile, with reduced-motion support

CLI

npm run cli -- "https://www.youtube.com/watch?v=VIDEO_ID" -k 8 --mode all

Useful options:

npm run cli -- "<url>" \
  --output .yt2ctx \
  --top-k 10 \
  --selection-mode density \
  --mode style \
  --candidate-interval 6 \
  --max-candidates 48 \
  --frame-width 768 \
  --cookies-from-browser chrome \
  --quiet

Option	Default	Description
`-k, --top-k <n>`	`8`	Number of frames to select.
`-m, --mode <mode>`	`all`	Output: `watch`, `style`, `shot-specs`, `prompt`, `all`.
`--selection-mode <mode>`	`density`	Frame selection: `density` or `top-k`.
`--candidate-interval <s>`	`8`	Seconds between sampled frames.
`--max-candidates <n>`	`36`	Candidate frames sent to vision analysis.
`--frame-width <px>`	`768`	Extracted frame width.
`--cookies <path>`	—	Netscape cookies.txt file to pass to `yt-dlp`.
`--cookies-from-browser <browser>`	—	Browser cookie source to pass to `yt-dlp`, such as `chrome` or `firefox`.
`-o, --output <dir>`	`.yt2ctx`	Output directory.
`--json`	—	Print JSON metadata instead of Markdown.
`--with-data-urls`	—	Include base64 data URLs in JSON output.
`--quiet`	—	Suppress the live progress display.

The CLI renders a live progress bar on stderr as it moves through each pipeline stage. stdout only ever receives the requested artifact text or JSON, so it stays safe to pipe.

If YouTube returns "Sign in to confirm you're not a bot", use a signed-in browser session:

npm run cli -- "<url>" --cookies-from-browser chrome

For server, API, or MCP runs, set YT2CTX_YTDLP_COOKIES to a cookies.txt path or YT2CTX_YTDLP_COOKIES_FROM_BROWSER to a browser name in the environment.

MCP server

yt2ctx exposes the pipeline to MCP clients (Claude Desktop, Claude Code, and any other agent that speaks MCP) as a single tool: watch_youtube.

1. Build the server

npm install        # if you have not already
npm run build:bin  # produces dist/mcp.js

This compiles a standalone stdio server to dist/mcp.js.

2. Register it with a client

The server needs OPENAI_API_KEY in its environment. It will read a .env file in its working directory if one exists, but because MCP clients launch the process from an arbitrary directory, passing the key explicitly in the client config is recommended.

Claude Desktop

Edit the config file:

macOS — ~/Library/Application Support/Claude/claude_desktop_config.json
Windows — %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "yt2ctx": {
      "command": "node",
      "args": ["/absolute/path/to/yt2ctx/dist/mcp.js"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Restart Claude Desktop. watch_youtube will appear in the tools list.

Claude Code

claude mcp add yt2ctx \
  --env OPENAI_API_KEY=sk-... \
  -- node /absolute/path/to/yt2ctx/dist/mcp.js

Verify with claude mcp list.

Any other MCP client

Launch this command as a stdio MCP server, with OPENAI_API_KEY in its environment:

node /absolute/path/to/yt2ctx/dist/mcp.js

3. `watch_youtube` arguments

Argument	Default	Description
`url`	(required)	YouTube video URL.
`topK`	`8`	Number of frames to select.
`mode`	`density`	Frame selection: `density` or `top-k`.
`outputMode`	`all`	`watch`, `style`, `prompt`, `shot-specs`, or `all`.
`candidateIntervalSeconds`	`8`	Seconds between sampled frames.
`maxCandidateFrames`	`36`	Candidate frames sent to vision analysis.
`frameWidth`	`768`	Extracted frame width.
`outputDir`	(optional)	Where to write artifacts.

The tool returns the requested text artifact plus the selected frames as MCP image content, and also writes the full artifact set to disk.

HTTP API

GET /api/analyze returns the endpoint contract as JSON, so the API is self-documenting. POST /api/analyze requires an authenticated web session, runs the pipeline, saves the completed analysis to the signed-in user's Postgres video library, and content-negotiates its response so the same endpoint serves the browser and headless agents:

Accept: application/x-ndjson — streams newline-delimited JSON. Zero or more {"type":"progress","stage":"vision","pct":0.71,...} events, then one {"type":"result","result":{...}} line. Failures arrive as {"type":"error","message":"..."}. The web app uses this for live progress.
Any other Accept — returns a single buffered JSON result object, or {"error":"..."} with HTTP 400. The simplest thing for an agent to fetch().then(r => r.json()).

Request body (JSON): url (required), topK, mode (density | top-k), candidateIntervalSeconds, maxCandidateFrames, frameWidth.

Result: metadata, markdown, frames (each with imageUrl and imageDownloadUrl), cinematic artifacts, zipUrl, and zipDownloadUrl.

# Discover the contract
curl -s http://localhost:3000/api/analyze

# Headless agent — one buffered JSON result
curl -s -X POST http://localhost:3000/api/analyze \
  -H 'Content-Type: application/json' \
  -d '{"url":"https://youtu.be/VIDEO_ID"}'

# Live streaming progress
curl -N -X POST http://localhost:3000/api/analyze \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/x-ndjson' \
  -d '{"url":"https://youtu.be/VIDEO_ID"}'

Artifacts

Every run writes a job folder under the output directory (.yt2ctx/<job-id>/):

File	Contents
`watch.md`	Timed transcript plus representative frame metadata.
`style-bible.md`	The extracted cinematic production grammar.
`shot-specs.md` / `shot-specs.json`	Blender/Remotion-ready shot specs.
`codex-prompt.md`	A direct implementation prompt for coding agents.
`metadata.json`	The full structured analysis result.
`frames/*.jpg`	The selected frame images.
`yt2ctx-artifacts.zip`	Everything above, bundled.

Cinematic grammar compiler

The extra outputs are designed for downstream generation systems.

style-bible.md extracts the production grammar: cinematic ontology, reference lineage, camera/lens/lighting/material/edit/typography/sound language, narration register and forbidden phrases, reusable shot patterns, and transfer rules for new products.

shot-specs.json makes the reference executable: source frame and timestamp, shot type and purpose, lens/focal length/aperture/rig/movement/focus behavior, lighting setup, material emphasis, Blender render passes, diffusion finishing intent, Remotion role, and anti-slop forbidden moves.

codex-prompt.md is a direct implementation prompt for coding agents. It tells Codex/Claude to build a physically grounded Blender-first pipeline with diffusion as finishing and Remotion as editorial assembly — not as the visual substrate.

slopWarnings are validator-ready rules that catch presentation-deck failure modes: arbitrary floating UI, LinkedIn announcement language, missing lens metadata, and ungrounded abstract AI visuals.

Frame selection

top-k sorts candidate frames by score and returns the highest scoring.
density treats salience as a timeline density and samples across weighted buckets — usually a more representative sequence across the whole video while still preferring information-rich moments.

The score combines OpenAI vision salience, semantic novelty from frame descriptions, visual scene-change novelty, nearby transcript density, and colorfulness.

Configuration

Environment variables (see .env.example):

Variable	Default	Description
`OPENAI_API_KEY`	(required)	Your OpenAI API key.
`DATABASE_URL`	(required for web)	Postgres connection string for accounts, sessions, and saved video analyses.
`BLOB_READ_WRITE_TOKEN`	(required for web)	Vercel Blob token for generated frame JPGs and artifact ZIPs.
`OPENAI_TRANSCRIBE_MODEL`	`whisper-1`	Transcription model.
`OPENAI_VISION_MODEL`	`gpt-4.1-mini`	Vision + grammar model.
`OPENAI_EMBEDDING_MODEL`	`text-embedding-3-small`	Embedding model.
`YT2CTX_OUTPUT_DIR`	`.yt2ctx`	Default artifact directory.
`YT2CTX_YTDLP_COOKIES`	—	Optional Netscape cookies.txt path for `yt-dlp`.
`YT2CTX_YTDLP_COOKIES_FROM_BROWSER`	—	Optional browser cookie source for `yt-dlp`, such as `chrome` or `firefox`.

whisper-1 is the default because it supports verbose JSON with segment timestamps.

Deployment

This repo is intended to deploy through a linked GitHub repository on Vercel. Push to the configured production branch and let Vercel build automatically.

Set OPENAI_API_KEY in the Vercel project settings before relying on automatic deployments. The web app also requires DATABASE_URL and BLOB_READ_WRITE_TOKEN; the project is designed to use Vercel Marketplace Postgres plus Vercel Blob, which inject connection environment variables when connected to the project. The analyze route is configured for the Node.js runtime with a 300 second function duration. Serverless limits still apply — long videos are better processed through the CLI or MCP server; short videos and clips fit the hosted web path.

Important

The hosted web app requires authentication, but each analysis still costs real OpenAI usage. Before widening access to a public deployment, add usage limits and billing controls — see TODO.md.

Roadmap

Planned work — including gating the public web app behind authentication and billing so it cannot run up unbounded OpenAI spend — is tracked in TODO.md.

Contributing

Issues and pull requests are welcome. Good contributions include bug fixes, documentation improvements, sharper prompts/artifact schemas, better frame selection heuristics, and MCP/client compatibility work.

Before opening a PR:

Search existing issues to avoid duplicate work.
Keep the change focused and include screenshots or sample artifacts for UI and output changes.
Run npm run typecheck, npm run lint, and, when practical, npm run build.
Do not commit .env, API keys, private video URLs, or generated .yt2ctx/ output.

Use the issue templates for bugs and feature requests, and the PR template for review context. Security reports should be opened privately through GitHub Security Advisories rather than public issues.

Community health

This repository includes:

bug and feature request issue forms
a pull request template
CI for typecheck, lint, and build
Dependabot configuration for npm and GitHub Actions
a label set for triage
a funding placeholder for future sponsorship setup

Notes

Only process videos you have the right to download and analyze. YouTube availability and extractor behavior can change; youtube-dl-exec bundles yt-dlp, which is more robust than browser-only download libraries.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.github		.github
assets		assets
scripts		scripts
src		src
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
TODO.md		TODO.md
eslint.config.mjs		eslint.config.mjs
next-env.d.ts		next-env.d.ts
next.config.mjs		next.config.mjs
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json
vercel.json		vercel.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Overview

Why use it?

How it works

Requirements

Quick start

Local development

Project layout

Web app

CLI

MCP server

1. Build the server

2. Register it with a client

3. `watch_youtube` arguments

HTTP API

Artifacts

Cinematic grammar compiler

Frame selection

Configuration

Deployment

Roadmap

Contributing

Community health

Notes

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Overview

Why use it?

How it works

Requirements

Quick start

Local development

Project layout

Web app

CLI

MCP server

1. Build the server

2. Register it with a client

3. watch_youtube arguments

HTTP API

Artifacts

Cinematic grammar compiler

Frame selection

Configuration

Deployment

Roadmap

Contributing

Community health

Notes

License

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

3. `watch_youtube` arguments

Packages