AI-powered video analysis pipeline. Give it a YouTube link or a local file — get back a structured, readable summary with chapters, key frames, and optional semantic search.
video-analyzer run https://youtube.com/watch?v=dQw4w9WgXcQ
- Profiles
- Setup by Profile
- Quick Start
- Setup Wizard
- Running the Pipeline
- Configuration Reference
- Output Formats
- Semantic Search
- Advanced Usage
- Development
The pipeline scales from a fast transcript-only summary up to a full visual-analysis RAG system.
| Profile | What it does | Time per video |
|---|---|---|
| simple | Transcript → AI summary | ~30 s |
| standard | + VLM frame descriptions + chapter detection | ~2–5 min |
| full | + image/text embeddings + semantic search | ~5–15 min |
Start with simple. Upgrade when you need more.
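One way to picture how the profiles stack is as a cumulative list of stages. The sketch below is illustrative only: the stage names and the `stages_for` helper are hypothetical and are not taken from the package.

```python
# Hypothetical sketch: each profile adds stages on top of the previous one.
# Stage names are illustrative, not the package's internal identifiers.
PROFILE_ORDER = ["simple", "standard", "full"]

STAGES_ADDED = {
    "simple": ["transcript", "summary"],
    "standard": ["vlm_descriptions", "chapter_detection"],
    "full": ["embeddings", "semantic_search"],
}


def stages_for(profile: str) -> list[str]:
    """Return every stage that runs for a profile, including inherited ones."""
    if profile not in PROFILE_ORDER:
        raise ValueError(f"unknown profile: {profile!r}")
    stages: list[str] = []
    for name in PROFILE_ORDER[: PROFILE_ORDER.index(profile) + 1]:
        stages.extend(STAGES_ADDED[name])
    return stages


print(stages_for("standard"))
# ['transcript', 'summary', 'vlm_descriptions', 'chapter_detection']
```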
The simple profile is the lightest setup. No local models required.
pip install video-analyzer
You need one of:
| Summarizer | What you need |
|---|---|
| Claude Code (recommended) | claude CLI already installed — nothing else |
| Anthropic API | ANTHROPIC_API_KEY environment variable |
| OpenAI API | OPENAI_API_KEY environment variable |
| Ollama | ollama serve running locally |
video-analyzer setup # wizard guides you through the rest
video-analyzer run https://youtube.com/watch?v=...
The standard profile describes what's happening in key frames before summarising. This produces better summaries for visual content (slides, demos, diagrams).
pip install "video-analyzer[standard]"
Additional requirement: a VLM to describe frames.
| VLM option | Setup |
|---|---|
| Ollama (recommended) | ollama pull qwen2.5-vl:7b then ollama serve |
| Anthropic Vision | Reuses your ANTHROPIC_API_KEY — no extra setup |
| HuggingFace transformers | Downloads ~6 GB model on first run, no Ollama needed |
video-analyzer setup # wizard detects your VLM choice and verifies the connection
video-analyzer run https://youtube.com/watch?v=...
The full profile includes everything in standard, plus image and text embeddings stored in LanceDB so you can query across videos later.
pip install "video-analyzer[full]"
Additional requirements:
- Everything from standard above
- ~1.5 GB embedding model download (SigLIP2 + BGE-M3); the setup wizard offers to do this for you
video-analyzer setup # offers to pre-download the embedding models
video-analyzer run https://youtube.com/watch?v=...
video-analyzer query ./output/My_Video "what did they say about the roadmap?"
pip install "video-analyzer[full,cuda]" # or [standard,cuda]
The cuda extra adds the CUDA libraries that faster-whisper needs. PyTorch will use CUDA automatically if a GPU is detected.
If you have Claude Code installed, this is the zero-config path:
pip install video-analyzer
video-analyzer setup # choose: simple → claude_code
video-analyzer run https://youtube.com/watch?v=...
No API key, no Ollama. Claude Code acts as the summarizer and, because it supports vision, removes the need for a separate VLM, so the simple profile gives you frame-aware summaries automatically.
pip install video-analyzer # or [standard] / [full]
video-analyzer setup
video-analyzer run https://youtube.com/watch?v=...
Output lands in ./output/<video-title>/report.md.
video-analyzer setup
The wizard walks you through profile → summarizer → VLM (if needed) → model download (if needed), and verifies all connections before writing the config.
╔══════════════════════════════════╗
║ video-analyzer · setup wizard ║
╚══════════════════════════════════╝
Step 1 · Choose a profile
1) simple transcript → AI summary (API key only, ~30 s/video)
2) standard + visual frame analysis via VLM (~2–5 min/video)
3) full + semantic search / RAG (~5–15 min + ~1.5 GB)
Profile [1]: _
If the required packages aren't installed for your chosen profile, the wizard tells you exactly what to run before continuing.
Config is written to ./video-analyzer.yaml by default. To choose a different location:
video-analyzer setup --output ~/.config/video-analyzer/config.yaml
Re-run at any time to reconfigure.
video-analyzer run https://youtube.com/watch?v=...
video-analyzer run /path/to/video.mp4
# Override the analysis goal (what you want from the video)
video-analyzer run <url> --goal "extract all action items and decisions"
# Override profile for this run only
video-analyzer run <url> --profile standard
# Output format
video-analyzer run <url> --format html # markdown | html | json
# Output directory
video-analyzer run <url> --output ./reports
# All-local models (no API key)
video-analyzer run <url> --local
# Estimate cost before committing to LLM calls
video-analyzer run <url> --preflight
Stages are cached automatically, so re-running the same video skips completed work.
# Invalidate specific stages and re-run them
video-analyzer run <url> --redo vlm # redo frame descriptions only
video-analyzer run <url> --redo vlm,transcript # redo two stages
video-analyzer run <url> --redo embeddings # recompute embeddings
# Valid stages: frames, transcript, vlm, embeddings
# Bypass all caches
video-analyzer run <url> --no-cache
video-analyzer check
video-analyzer check --config ./my-config.yaml
The check command verifies installed packages, API keys, Ollama connectivity, and device (CPU/CUDA/MPS) for your current profile.
Config is loaded in this order (first match wins):
1. --config <path> (explicit CLI flag)
2. ./video-analyzer.yaml
3. ./config.yaml
4. ~/.config/video-analyzer/config.yaml
5. Built-in simple defaults (with a hint to run setup)
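The lookup order above could be implemented roughly as follows. This is a minimal sketch, not the package's code: `resolve_config_path` and its `candidates` parameter are hypothetical names, and how the real CLI reacts to a missing explicit path is not covered here.

```python
from pathlib import Path
from typing import Optional

# Search path in the documented order (used when no --config flag is given).
DEFAULT_CANDIDATES = [
    Path("./video-analyzer.yaml"),
    Path("./config.yaml"),
    Path("~/.config/video-analyzer/config.yaml"),
]


def resolve_config_path(
    explicit: Optional[str] = None,
    candidates: Optional[list[Path]] = None,
) -> Optional[Path]:
    """Return the config file to load, or None to use the built-in defaults."""
    if explicit is not None:
        # An explicit --config value always wins over the search path.
        return Path(explicit).expanduser()
    for candidate in candidates if candidates is not None else DEFAULT_CANDIDATES:
        path = candidate.expanduser()
        if path.is_file():
            return path  # first match wins
    return None  # caller falls back to the built-in simple defaults
```

Returning `None` rather than raising keeps the "built-in defaults plus a hint" behaviour in the caller, where the hint can be printed.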
# video-analyzer.yaml
profile: standard # simple | standard | full
goal: "extract key decisions and action items"
transcript:
  providers: # tried in order; first success wins
    - youtube_captions # instant for YouTube content
    - whisper # local fallback, works on any audio
  whisper_model: base # tiny, base, small, medium, large-v3
summarizer:
  backend: anthropic # anthropic | openai_compat | claude_code
  model: claude-sonnet-4-6
  vision_mode: auto # auto | text | vision
  chunk_size: 10 # frames per map-reduce chunk
  max_context_fraction: 0.8
vlm: # standard / full profiles only
  enabled: true
  backend: openai_compat # openai_compat | anthropic | transformers
  model: qwen2.5-vl:7b
  base_url: http://localhost:11434/v1
  concurrency: 4
embeddings: # full profile only
  image_encoder: siglip2 # siglip2 | clip
  text_encoder: bge-m3 # bge-m3 | sentence-transformers
frames:
  extractor: pixel_diff # pixel_diff | pyscenedetect | fixed_interval
  pixel_diff_threshold: 0.05
  min_interval_secs: 5.0
dedup:
  phash: true
  phash_threshold: 5
chapters:
  breakpoint_threshold: 0.85
  min_chapter_duration: 30.0
output:
  format: markdown # markdown | html | json
  path: ./output
  images: files # files | inline (html only)
storage: # full profile only
  enabled: true
  backend: lancedb
profile is the master switch. It controls which pipeline stages run regardless of other flags:
- simple: transcript → summary. VLM and embeddings never run.
- standard: adds VLM frame descriptions and content-based chapter detection.
- full: adds embeddings and a LanceDB store for semantic search.
vision_mode controls how frames reach the summarizer:
- auto: if the LLM supports vision (Anthropic, Claude Code), send frames directly and skip the VLM stage. Otherwise use VLM text descriptions.
- text: always use VLM text descriptions, never send images to the summarizer.
- vision: always send frames directly to the summarizer, skip VLM entirely.
auto is the recommended setting. With claude_code or anthropic as your summarizer, auto detects that the LLM supports vision and sends key frames directly, so no Ollama and no separate VLM step are needed. This means the simple profile with a vision-capable summarizer gets the same frame-level visual understanding as standard, with none of the extra setup.
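The three vision_mode settings reduce to a small decision. The sketch below only restates the rules listed above; the function name and the two return labels are hypothetical, not part of the package.

```python
def resolve_vision_mode(vision_mode: str, llm_supports_vision: bool) -> str:
    """Return "vision" (frames go to the summarizer) or "text" (VLM descriptions)."""
    if vision_mode == "auto":
        # auto defers to the summarizer's capabilities.
        return "vision" if llm_supports_vision else "text"
    if vision_mode in ("text", "vision"):
        # Explicit settings are taken as-is.
        return vision_mode
    raise ValueError(f"unknown vision_mode: {vision_mode!r}")


print(resolve_vision_mode("auto", llm_supports_vision=True))   # vision
print(resolve_vision_mode("auto", llm_supports_vision=False))  # text
```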
transcript.providers — tried in order, first success wins. youtube_captions is instant for YouTube; whisper transcribes locally from audio and works on any file.
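A "first success wins" chain like this can be sketched in a few lines. Everything named here is hypothetical: the `Provider` type, the helper, and the stub providers stand in for the real caption and Whisper backends, and treating any exception as "try the next provider" is an assumption.

```python
from typing import Callable, Optional

# A provider takes a source (URL or file path) and returns transcript text,
# or None when it has nothing to offer for that source.
Provider = Callable[[str], Optional[str]]


def first_successful_transcript(
    source: str, providers: dict[str, Provider], order: list[str]
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, transcript) from the first hit."""
    for name in order:
        try:
            text = providers[name](source)
        except Exception:
            continue  # a failing provider just hands over to the next one
        if text:
            return name, text
    raise RuntimeError("no transcript provider succeeded")


# Stub providers for illustration only.
stubs: dict[str, Provider] = {
    "youtube_captions": lambda src: "captions text" if "youtube.com" in src else None,
    "whisper": lambda src: "whisper text",
}

print(first_successful_transcript("/path/to/video.mp4", stubs, ["youtube_captions", "whisper"]))
# ('whisper', 'whisper text')
```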
./output/<video-title>/
├── report.md
└── frames/
    ├── ch01_Introduction/
    │   └── frame_00000_0m05s.jpg
    └── ch02_Main_Content/
        └── frame_00042_3m12s.jpg
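If you post-process the frames directory, the file names carry the frame index and timestamp. The parser below is a sketch that assumes the pattern visible in the two example names (`frame_<index>_<minutes>m<seconds>s.jpg`); the pattern is inferred from this listing, not from a documented naming contract.

```python
import re
from pathlib import PurePosixPath

# Pattern inferred from the example names in the tree above.
FRAME_NAME = re.compile(r"frame_(\d+)_(\d+)m(\d{2})s\.jpg$")


def parse_frame_path(path: str) -> tuple[str, int, int]:
    """Return (chapter_dir, frame_index, timestamp_in_seconds) for a frame path."""
    p = PurePosixPath(path)
    match = FRAME_NAME.search(p.name)
    if match is None:
        raise ValueError(f"unrecognised frame name: {p.name!r}")
    index, minutes, seconds = (int(group) for group in match.groups())
    return p.parent.name, index, minutes * 60 + seconds


print(parse_frame_path("frames/ch02_Main_Content/frame_00042_3m12s.jpg"))
# ('ch02_Main_Content', 42, 192)
```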
video-analyzer run <url> --format html
Frame images are saved to frames/ alongside the HTML, or inlined as base64 for a single portable file:
output:
  format: html
  images: inline
The JSON format is a full machine-readable export: metadata, chapters, per-frame data (embeddings included), and summaries.
video-analyzer run <url> --format json
The full profile writes a LanceDB vector store you can query across videos.
video-analyzer run <url> --profile full
video-analyzer query ./output/My_Video "what did they say about pricing?"
video-analyzer query ./output/My_Video "show me the architecture diagram" --top-k 3
# --top-k: number of results
# --context: expand each result with ±N adjacent frames
video-analyzer query <path> <question> \
  --top-k 5 \
  --context 2
query accepts either the output directory or its store/ subdirectory.
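To make the --context option concrete: expanding a hit by ±N adjacent frames amounts to widening each result index and clamping at the ends of the video. The helper below is a hypothetical sketch; in particular, merging overlapping windows into one sorted list is an assumption, and the real command may group context per result instead.

```python
def expand_with_context(hit_indices: list[int], context: int, total_frames: int) -> list[int]:
    """Widen each hit by ±context frames, clamp to valid indices, and merge."""
    expanded: set[int] = set()
    for idx in hit_indices:
        lo = max(0, idx - context)                 # do not run past the first frame
        hi = min(total_frames - 1, idx + context)  # or past the last one
        expanded.update(range(lo, hi + 1))
    return sorted(expanded)


print(expand_with_context([0, 10], context=2, total_frames=12))
# [0, 1, 2, 8, 9, 10, 11]
```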
pip install video-analyzer
video-analyzer setup # choose: simple → claude_code
Uses your existing claude session for summarization. No ANTHROPIC_API_KEY is needed because auth is handled by Claude Code.
Because Claude supports vision, vision_mode: auto (the default) detects this and sends key frames directly to Claude alongside the transcript. You get the same frame-level visual understanding as the standard profile without any extra setup — no Ollama, no VLM download, no additional config.
Generated config:
profile: simple
summarizer:
  backend: claude_code
  vision_mode: auto # Claude Code supports vision → frames sent directly, VLM skipped
This is the recommended starting point if you already have Claude Code installed.
No internet required after the initial model download:
pip install "video-analyzer[standard]"
video-analyzer run <url> --local
The --local flag sets the VLM to HuggingFace transformers (Qwen2.5-VL-3B) and the summarizer to Ollama. It requires ollama serve with a text model pulled (ollama pull qwen2.5:7b).
video-analyzer run <url> --preflight
=== Preflight Report ===
Video: My Conference Talk
Duration: 3600s (60.0 min)
Frames: 284 (after dedup)
Chapters: 8
Map chunks: 29
Transcription: youtube_captions ✓
Transcript tok: ~24,500
LLM: anthropic / claude-sonnet-4-6
Vision mode: text (VLM descriptions → summarizer)
Est input tok: ~38,200
Est output tok: ~8,600
Est LLM cost: ~$0.244
========================
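The numbers in the sample report hang together arithmetically, which is a useful sanity check when reading your own preflight output. The sketch below is not the tool's estimator: the function is hypothetical, chunk_size 10 is the value from the example config, and the per-token rates (3 USD and 15 USD per million input and output tokens) are assumptions chosen because they reproduce the report's figure. Check current pricing for your model.

```python
import math


def estimate_preflight(
    frames: int,
    chunk_size: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> tuple[int, float]:
    """Return (map_chunks, estimated_cost_usd) for one run."""
    map_chunks = math.ceil(frames / chunk_size)  # the last chunk may be partial
    cost = (
        input_tokens / 1_000_000 * usd_per_m_input
        + output_tokens / 1_000_000 * usd_per_m_output
    )
    return map_chunks, cost


# Values from the sample report; the rates are assumed, not documented.
chunks, cost = estimate_preflight(284, 10, 38_200, 8_600, 3.0, 15.0)
print(chunks, f"${cost:.3f}")
# 29 $0.244
```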
Override the built-in Jinja2 summarization prompts:
prompts/
├── map_extract.j2 # per-chunk extraction
├── chapter_reduce.j2 # per-chapter summary
├── chapter_name.j2 # chapter title generation
└── final_reduce.j2 # final overall summary
from pathlib import Path

from video_analyzer.summarizer.map_reduce import map_reduce

# chapters, llm, and metadata are assumed to be defined already
chapter_summaries, final_summary = map_reduce(
    chapters=chapters,
    llm=llm,
    metadata=metadata,
    goal="extract all code examples",
    user_prompts_dir=Path("./prompts"),  # directory holding the .j2 overrides
)
git clone https://github.com/anthropics/video-analyzer
cd video-analyzer
pip install -e ".[full,dev]"
pytest tests/
video_analyzer/
├── cli.py ← Typer CLI (run, setup, query, check)
├── setup_wizard.py ← Interactive setup wizard
├── pipeline.py ← Top-level orchestration
├── config.py ← Pydantic config tree + profile logic
├── models.py ← Shared dataclasses
├── ingestion/ ← Video download + loading
├── extraction/ ← Frame extraction + deduplication
├── transcript/ ← Whisper + YouTube captions
├── alignment/ ← Transcript→frame alignment, chapter detection
├── vlm/ ← Visual frame description (VLM backends)
├── embeddings/ ← Image (SigLIP2/CLIP) + text (BGE-M3) encoders
├── summarizer/ ← Map-reduce summarization (Anthropic/OpenAI/Claude Code)
├── store/ ← LanceDB vector store + retrieval
└── output/ ← Markdown, HTML, JSON writers
To add a new summarizer (LLM) backend:
1. Subclass video_analyzer.summarizer.base.LLM
2. Implement complete(prompt) and optionally complete_with_images(prompt, images)
3. Set supports_vision = True if vision is supported
4. Register in build_llm() in summarizer/base.py
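The steps for a new summarizer backend can be sketched as below. To keep the example runnable on its own, `LLM` here is a stand-in written for illustration rather than the real video_analyzer.summarizer.base.LLM, whose signatures may differ; `EchoLLM` and the `BACKENDS` registry are likewise hypothetical.

```python
from abc import ABC, abstractmethod


class LLM(ABC):
    """Stand-in for video_analyzer.summarizer.base.LLM (real signatures may differ)."""

    supports_vision = False

    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Return the model's text completion for a prompt."""

    def complete_with_images(self, prompt: str, images: list[bytes]) -> str:
        raise NotImplementedError("this backend does not support vision")


class EchoLLM(LLM):
    """Toy backend that echoes its input instead of calling a model."""

    supports_vision = True  # step 3: advertise vision support

    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"

    def complete_with_images(self, prompt: str, images: list[bytes]) -> str:
        return f"[echo, {len(images)} image(s)] {prompt}"


# Hypothetical registry standing in for the dispatch inside build_llm().
BACKENDS = {"echo": EchoLLM}


def build_llm(backend: str) -> LLM:
    if backend not in BACKENDS:
        raise ValueError(f"unknown summarizer backend: {backend!r}")
    return BACKENDS[backend]()


print(build_llm("echo").complete("summarise chapter 1"))
# [echo] summarise chapter 1
```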
To add a new VLM backend:
1. Subclass video_analyzer.vlm.base.VLM
2. Implement describe_batch(images)
3. Register in build_vlm() in vlm/base.py
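A VLM backend follows the same shape, with a batch contract: one description per input image, in order. As before, this is a self-contained sketch: `VLM` is a stand-in for video_analyzer.vlm.base.VLM, passing images as raw bytes is an assumption, and `SizeOnlyVLM` and `VLM_BACKENDS` are hypothetical.

```python
from abc import ABC, abstractmethod


class VLM(ABC):
    """Stand-in for video_analyzer.vlm.base.VLM; the real signature may differ."""

    @abstractmethod
    def describe_batch(self, images: list[bytes]) -> list[str]:
        """Return one description per image, in the same order."""


class SizeOnlyVLM(VLM):
    """Toy backend: reports each image's byte size instead of running a model."""

    def describe_batch(self, images: list[bytes]) -> list[str]:
        return [f"image {i}: {len(data)} bytes" for i, data in enumerate(images)]


# Hypothetical registry standing in for the dispatch inside build_vlm().
VLM_BACKENDS = {"size_only": SizeOnlyVLM}


def build_vlm(backend: str) -> VLM:
    if backend not in VLM_BACKENDS:
        raise ValueError(f"unknown VLM backend: {backend!r}")
    return VLM_BACKENDS[backend]()


print(build_vlm("size_only").describe_batch([b"\x00" * 3, b"\x00" * 10]))
# ['image 0: 3 bytes', 'image 1: 10 bytes']
```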