video-analyzer

AI-powered video analysis pipeline. Give it a YouTube link or a local file — get back a structured, readable summary with chapters, key frames, and optional semantic search.

video-analyzer run https://youtube.com/watch?v=dQw4w9WgXcQ

Profiles

The pipeline scales from a fast transcript-only summary up to a full visual-analysis RAG system.

Profile	What it does	Time per video
`simple`	Transcript → AI summary	~30 s
`standard`	+ VLM frame descriptions + chapter detection	~2–5 min
`full`	+ image/text embeddings + semantic search	~5–15 min

Start with simple. Upgrade when you need more.

Setup by Profile

`simple` — transcript + AI summary

The lightest setup. No local models required.

pip install video-analyzer

You need one of:

Summarizer	What you need
Claude Code (recommended)	`claude` CLI already installed — nothing else
Anthropic API	`ANTHROPIC_API_KEY` environment variable
OpenAI API	`OPENAI_API_KEY` environment variable
Ollama	`ollama serve` running locally

video-analyzer setup    # wizard guides you through the rest
video-analyzer run https://youtube.com/watch?v=...

`standard` — + visual frame analysis

Describes what's happening in key frames before summarizing. Produces better summaries for visual content (slides, demos, diagrams).

pip install "video-analyzer[standard]"

Additional requirement — a VLM to describe frames:

VLM option	Setup
Ollama (recommended)	`ollama pull qwen2.5-vl:7b` then `ollama serve`
Anthropic Vision	Reuses your `ANTHROPIC_API_KEY` — no extra setup
HuggingFace transformers	Downloads ~6 GB model on first run, no Ollama needed

video-analyzer setup    # wizard detects your VLM choice and verifies the connection
video-analyzer run https://youtube.com/watch?v=...

`full` — + embeddings + semantic search

Everything in standard, plus image and text embeddings stored in LanceDB so you can query across videos later.

pip install "video-analyzer[full]"

Additional requirements:

Everything from standard above
~1.5 GB embedding model download (SigLIP2 + BGE-M3) — the setup wizard offers to do this for you

video-analyzer setup    # offers to pre-download the embedding models
video-analyzer run https://youtube.com/watch?v=...
video-analyzer query ./output/My_Video "what did they say about the roadmap?"

GPU acceleration (any profile)

pip install "video-analyzer[full,cuda]"    # or [standard,cuda]

Adds the CUDA libraries that faster-whisper needs. PyTorch will use CUDA automatically if a GPU is detected.

Quick Start

Fastest possible start — Claude Code users

If you have Claude Code installed, this is the zero-config path:

pip install video-analyzer
video-analyzer setup        # choose: simple → claude_code
video-analyzer run https://youtube.com/watch?v=...

No API key. No Ollama. Claude Code acts as the summarizer and, because it supports vision, also removes the need for a separate VLM, so the simple profile gives you frame-aware summaries automatically.

Everyone else

pip install video-analyzer   # or [standard] / [full]
video-analyzer setup
video-analyzer run https://youtube.com/watch?v=...

Output lands in ./output/<video-title>/report.md.

Setup Wizard

video-analyzer setup

Walks you through profile → summarizer → VLM (if needed) → model download (if needed). Verifies all connections before writing the config.

╔══════════════════════════════════╗
║   video-analyzer  ·  setup wizard ║
╚══════════════════════════════════╝

Step 1 · Choose a profile
  1) simple     transcript → AI summary (API key only, ~30 s/video)
  2) standard   + visual frame analysis via VLM (~2–5 min/video)
  3) full       + semantic search / RAG (~5–15 min + ~1.5 GB)

Profile [1]: _

If the required packages aren't installed for your chosen profile, the wizard tells you exactly what to run before continuing.

Config is written to ./video-analyzer.yaml by default. To choose a different location:

video-analyzer setup --output ~/.config/video-analyzer/config.yaml

Re-run at any time to reconfigure.

Running the Pipeline

Basic

video-analyzer run https://youtube.com/watch?v=...
video-analyzer run /path/to/video.mp4

Common options

# Override the analysis goal (what you want extracted from the video)
video-analyzer run <url> --goal "extract all action items and decisions"

# Override profile for this run only
video-analyzer run <url> --profile standard

# Output format
video-analyzer run <url> --format html       # markdown | html | json

# Output directory
video-analyzer run <url> --output ./reports

# All-local models (no API key)
video-analyzer run <url> --local

# Estimate cost before committing to LLM calls
video-analyzer run <url> --preflight

Cache and re-runs

Stages are cached automatically — re-running the same video skips completed work.

# Invalidate specific stages and re-run them
video-analyzer run <url> --redo vlm                # redo frame descriptions only
video-analyzer run <url> --redo vlm,transcript     # redo two stages
video-analyzer run <url> --redo embeddings         # recompute embeddings

# Valid stages: frames, transcript, vlm, embeddings

# Bypass all caches
video-analyzer run <url> --no-cache
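The interaction between the cache, `--redo`, and `--no-cache` can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not the package's code: `parse_redo` and `stages_to_run` are hypothetical helpers, and whether redoing one stage also invalidates downstream stages is not modelled here.

```python
VALID_STAGES = ("frames", "transcript", "vlm", "embeddings")


def parse_redo(value: str) -> list:
    """Split a comma-separated --redo value and reject unknown stage names."""
    stages = [s.strip() for s in value.split(",") if s.strip()]
    unknown = [s for s in stages if s not in VALID_STAGES]
    if unknown:
        raise ValueError(f"unknown stage(s): {', '.join(unknown)}")
    return stages


def stages_to_run(cached: set, redo: str = "", no_cache: bool = False) -> list:
    """Return the stages that still need work, in pipeline order."""
    if no_cache:
        # --no-cache ignores every cached result.
        return list(VALID_STAGES)
    invalidated = set(parse_redo(redo))
    # A stage runs if it has no cached result or was explicitly invalidated.
    return [s for s in VALID_STAGES if s not in cached or s in invalidated]


everything_cached = set(VALID_STAGES)
rerun = stages_to_run(everything_cached, redo="vlm,transcript")
```

With every stage cached, `--redo vlm,transcript` leaves only those two stages to run, and an empty `--redo` leaves nothing.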

Health check

video-analyzer check
video-analyzer check --config ./my-config.yaml

Verifies installed packages, API keys, Ollama connectivity, and device (CPU/CUDA/MPS) for your current profile.

Configuration Reference

Config is loaded in this order (first match wins):

--config <path> (explicit CLI flag)
./video-analyzer.yaml
./config.yaml
~/.config/video-analyzer/config.yaml
Built-in simple defaults (with a hint to run setup)
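The lookup order above can be sketched as a small resolver. This is a minimal illustration, not the package's actual loader: `resolve_config_path` and its parameters are hypothetical names, and treating an explicit path as authoritative even when the file is missing is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_config_path(
    explicit: Optional[Path] = None,
    cwd: Optional[Path] = None,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first matching config path, or None for built-in defaults."""
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    if explicit is not None:
        # Assumption: an explicit --config path wins without further checks.
        return explicit
    candidates = [
        cwd / "video-analyzer.yaml",
        cwd / "config.yaml",
        home / ".config" / "video-analyzer" / "config.yaml",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate  # first match wins
    return None  # caller falls back to the built-in simple defaults
```

Passing `cwd` and `home` explicitly keeps the function easy to exercise against a temporary directory.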

Full example

# video-analyzer.yaml

profile: standard          # simple | standard | full

goal: "extract key decisions and action items"

transcript:
  providers:               # tried in order; first success wins
    - youtube_captions     # instant for YouTube content
    - whisper              # local fallback, works on any audio
  whisper_model: base      # tiny, base, small, medium, large-v3

summarizer:
  backend: anthropic       # anthropic | openai_compat | claude_code
  model: claude-sonnet-4-6
  vision_mode: auto        # auto | text | vision
  chunk_size: 10           # frames per map-reduce chunk
  max_context_fraction: 0.8

vlm:                       # standard / full profiles only
  enabled: true
  backend: openai_compat   # openai_compat | anthropic | transformers
  model: qwen2.5-vl:7b
  base_url: http://localhost:11434/v1
  concurrency: 4

embeddings:                # full profile only
  image_encoder: siglip2   # siglip2 | clip
  text_encoder: bge-m3     # bge-m3 | sentence-transformers

frames:
  extractor: pixel_diff    # pixel_diff | pyscenedetect | fixed_interval
  pixel_diff_threshold: 0.05
  min_interval_secs: 5.0
  dedup:
    phash: true
    phash_threshold: 5

chapters:
  breakpoint_threshold: 0.85
  min_chapter_duration: 30.0

output:
  format: markdown         # markdown | html | json
  path: ./output
  images: files            # files | inline (html only)

storage:                   # full profile only
  enabled: true
  backend: lancedb

Key settings

profile is the master switch. It controls which pipeline stages run regardless of other flags:

simple — transcript → summary. VLM and embeddings never run.
standard — adds VLM frame descriptions and content-based chapter detection.
full — adds embeddings and a LanceDB store for semantic search.
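One way to picture the "master switch" is as a gate applied on top of every other setting. The sketch below is illustrative only: the stage names and the `stage_enabled` helper are hypothetical, chosen to match the descriptions above rather than the package's internals.

```python
# Optional stages each profile switches on. Transcript and summary always run,
# so they are not listed. Stage names here are assumptions for illustration.
PROFILE_STAGES = {
    "simple": frozenset(),
    "standard": frozenset({"vlm", "chapters"}),
    "full": frozenset({"vlm", "chapters", "embeddings", "store"}),
}


def stage_enabled(profile: str, stage: str, requested: bool = True) -> bool:
    """A stage runs only if it was requested AND the profile allows it."""
    if profile not in PROFILE_STAGES:
        raise ValueError(f"unknown profile: {profile!r}")
    return requested and stage in PROFILE_STAGES[profile]
```

Under this model, setting `vlm.enabled: true` in a `simple` config has no effect, because the profile gate is checked regardless of the per-stage flag.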

vision_mode controls how frames reach the summarizer:

auto — if the LLM supports vision (Anthropic, Claude Code), send frames directly and skip the VLM stage. Otherwise use VLM text descriptions.
text — always use VLM text descriptions, never send images to the summarizer.
vision — always send frames directly to the summarizer, skip VLM entirely.

auto is the recommended setting. With claude_code or anthropic as your summarizer, auto detects that the LLM supports vision and sends key frames directly — no Ollama, no separate VLM step needed. This means the simple profile with a vision-capable summarizer gets the same frame-level visual understanding as standard, with none of the extra setup.
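The three vision_mode values reduce to a small decision function. This is a sketch of the rule as described above, with a hypothetical `frame_route` helper; it ignores the profile gate and any other checks the real pipeline may apply.

```python
def frame_route(vision_mode: str, llm_supports_vision: bool) -> str:
    """Return 'direct' (images go to the summarizer) or 'vlm' (text descriptions)."""
    if vision_mode == "vision":
        return "direct"  # always send frames, skip the VLM entirely
    if vision_mode == "text":
        return "vlm"     # never send images to the summarizer
    if vision_mode == "auto":
        # Prefer direct frames when the summarizer can read images.
        return "direct" if llm_supports_vision else "vlm"
    raise ValueError(f"unknown vision_mode: {vision_mode!r}")
```

With `claude_code` or `anthropic` as the summarizer, `llm_supports_vision` would be true, so `auto` resolves to the direct route.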

transcript.providers — tried in order, first success wins. youtube_captions is instant for YouTube; whisper transcribes locally from audio and works on any file.
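"Tried in order, first success wins" is a plain fallback chain. The sketch below shows the pattern with stub providers; `first_transcript` and the stub functions are hypothetical and do not call the real captions or Whisper backends.

```python
from typing import Callable, Optional, Sequence, Tuple

Provider = Tuple[str, Callable[[str], Optional[str]]]


def first_transcript(source: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, transcript)."""
    errors = []
    for name, fetch in providers:
        try:
            text = fetch(source)
        except Exception as exc:  # a failing provider should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if text:
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("no transcript provider succeeded: " + "; ".join(errors))


# Stub providers standing in for the real backends.
def no_captions(source: str) -> Optional[str]:
    return None  # e.g. a local file has no YouTube captions


def fake_whisper(source: str) -> Optional[str]:
    return "hello world"


used, transcript = first_transcript(
    "video.mp4",
    [("youtube_captions", no_captions), ("whisper", fake_whisper)],
)
```

Here the captions provider returns nothing for a local file, so the chain falls through to the Whisper stand-in.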

Output Formats

Markdown (default)

./output/<video-title>/
├── report.md
└── frames/
    ├── ch01_Introduction/
    │   └── frame_00000_0m05s.jpg
    └── ch02_Main_Content/
        └── frame_00042_3m12s.jpg

HTML

video-analyzer run <url> --format html

Frame images are saved to frames/ alongside the HTML, or can be inlined as base64 to produce a single portable file:

output:
  format: html
  images: inline
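Inlining means each image is embedded in the HTML as a base64 data URI instead of being referenced by path. The sketch below shows the general technique using only the standard library; `inline_image_tag` is a hypothetical helper, not the writer this package uses.

```python
import base64
import html


def inline_image_tag(jpeg_bytes: bytes, alt: str = "") -> str:
    """Build an <img> tag whose source is the image itself, base64-encoded."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    safe_alt = html.escape(alt, quote=True)
    return f'<img src="data:image/jpeg;base64,{encoded}" alt="{safe_alt}">'
```

Base64 grows the payload by roughly a third, so inline mode trades file size for portability.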

JSON

Full machine-readable export — metadata, chapters, per-frame data (embeddings included), summaries.

video-analyzer run <url> --format json

Semantic Search

The full profile writes a LanceDB vector store you can query across videos.

video-analyzer run <url> --profile full

video-analyzer query ./output/My_Video "what did they say about pricing?"
video-analyzer query ./output/My_Video "show me the architecture diagram" --top-k 3

# --top-k    number of results
# --context  expand each result with ±N adjacent frames
video-analyzer query <path> <question> --top-k 5 --context 2

query accepts either the output directory or its store/ subdirectory.
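The `--context` option widens each hit into a window of neighbouring frames. A minimal sketch of that expansion, assuming frames are addressed by a zero-based index (the `expand_with_context` helper and the index scheme are illustrative, not the store's actual retrieval code):

```python
def expand_with_context(hits, context, total_frames):
    """Expand each hit index into the window [hit - context, hit + context]."""
    windows = []
    for hit in hits:
        start = max(0, hit - context)               # clip at the first frame
        end = min(total_frames - 1, hit + context)  # clip at the last frame
        windows.append(list(range(start, end + 1)))
    return windows


windows = expand_with_context(hits=[0, 10], context=2, total_frames=12)
```

Windows are clipped at both ends of the video, so a hit on the first or last frame yields fewer than 2N+1 frames.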

Advanced Usage

Claude Code backend (no API key needed)

pip install video-analyzer
video-analyzer setup    # choose: simple → claude_code

Uses your existing claude session for summarization. No ANTHROPIC_API_KEY needed — auth is handled by Claude Code.

Because Claude supports vision, vision_mode: auto (the default) detects this and sends key frames directly to Claude alongside the transcript. You get the same frame-level visual understanding as the standard profile without any extra setup — no Ollama, no VLM download, no additional config.

Generated config:

profile: simple
summarizer:
  backend: claude_code
  vision_mode: auto    # Claude Code supports vision → frames sent directly, VLM skipped

This is the recommended starting point if you already have Claude Code installed.

Fully local pipeline

No internet required after the initial model download (when the input is a local file):

pip install "video-analyzer[standard]"
video-analyzer run <url> --local

Sets VLM to HuggingFace transformers (Qwen2.5-VL-3B) and summarizer to Ollama. Requires ollama serve with a text model pulled (ollama pull qwen2.5:7b).

Cost estimation

video-analyzer run <url> --preflight

=== Preflight Report ===
Video:          My Conference Talk
Duration:       3600s (60.0 min)
Frames:         284 (after dedup)
Chapters:       8
Map chunks:     29

Transcription:  youtube_captions ✓
Transcript tok: ~24,500

LLM:            anthropic / claude-sonnet-4-6
Vision mode:    text (VLM descriptions → summarizer)
Est input tok:  ~38,200
Est output tok: ~8,600
Est LLM cost:   ~$0.244
========================
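The figures in the report can be cross-checked by hand. The sketch below assumes per-million-token rates of $3 input and $15 output; these rates are an assumption that happens to reproduce the report's total, so check your provider's current pricing. The chunk count assumes a flat split of frames by `chunk_size`, which may differ from how the pipeline actually chunks per chapter.

```python
import math


def estimate_llm_cost(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Token counts times per-million-token rates, in USD."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


# Assumed rates (USD per million tokens); not taken from the tool itself.
cost = estimate_llm_cost(38_200, 8_600, usd_per_m_input=3.0, usd_per_m_output=15.0)
print(f"Est LLM cost: ~${cost:.3f}")  # ~$0.244

# 284 frames at chunk_size 10 gives 29 chunks under a flat split.
map_chunks = math.ceil(284 / 10)
print(f"Map chunks: {map_chunks}")  # 29
```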

Custom prompts

Override the built-in Jinja2 summarization prompts:

prompts/
├── map_extract.j2       # per-chunk extraction
├── chapter_reduce.j2    # per-chapter summary
├── chapter_name.j2      # chapter title generation
└── final_reduce.j2      # final overall summary

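from pathlib import Path

from video_analyzer.summarizer.map_reduce import map_reduce

# chapters, llm, and metadata are assumed to be defined already (not shown here).
chapter_summaries, final_summary = map_reduce(
    chapters=chapters,
    llm=llm,
    metadata=metadata,
    goal="extract all code examples",
    user_prompts_dir=Path("./prompts"),  # directory holding the .j2 overrides
)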

Development

git clone https://github.com/anthropics/video-analyzer
cd video-analyzer
pip install -e ".[full,dev]"
pytest tests/

Project structure

video_analyzer/
├── cli.py                  ← Typer CLI (run, setup, query, check)
├── setup_wizard.py         ← Interactive setup wizard
├── pipeline.py             ← Top-level orchestration
├── config.py               ← Pydantic config tree + profile logic
├── models.py               ← Shared dataclasses
├── ingestion/              ← Video download + loading
├── extraction/             ← Frame extraction + deduplication
├── transcript/             ← Whisper + YouTube captions
├── alignment/              ← Transcript→frame alignment, chapter detection
├── vlm/                    ← Visual frame description (VLM backends)
├── embeddings/             ← Image (SigLIP2/CLIP) + text (BGE-M3) encoders
├── summarizer/             ← Map-reduce summarization (Anthropic/OpenAI/Claude Code)
├── store/                  ← LanceDB vector store + retrieval
└── output/                 ← Markdown, HTML, JSON writers

Adding a new LLM backend

Subclass video_analyzer.summarizer.base.LLM
Implement complete(prompt) and optionally complete_with_images(prompt, images)
Set supports_vision = True if vision is supported
Register in build_llm() in summarizer/base.py
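The steps above might look roughly like this. To keep the sketch runnable on its own, it defines a stand-in `LLM` base class; the real `video_analyzer.summarizer.base.LLM` may have a different constructor or extra methods, and `EchoLLM` is a hypothetical backend. Registration in `build_llm()` is not shown because its shape is not documented here.

```python
from typing import Sequence


class LLM:
    """Stand-in for video_analyzer.summarizer.base.LLM (real interface may differ)."""

    supports_vision = False

    def complete(self, prompt: str) -> str:
        raise NotImplementedError

    def complete_with_images(self, prompt: str, images: Sequence[bytes]) -> str:
        raise NotImplementedError


class EchoLLM(LLM):
    """Toy backend that echoes its input instead of calling a model."""

    supports_vision = True  # advertise that images can be sent directly

    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"

    def complete_with_images(self, prompt: str, images: Sequence[bytes]) -> str:
        return f"[echo, {len(images)} image(s)] {prompt}"
```

A real backend would replace the echo bodies with calls to its provider's client.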

Adding a new VLM backend

Subclass video_analyzer.vlm.base.VLM
Implement describe_batch(images)
Register in build_vlm() in vlm/base.py
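A VLM backend follows the same pattern. The sketch below uses a stand-in `VLM` base class so it runs without the package installed; the real `video_analyzer.vlm.base.VLM` may differ, and the assumption that `describe_batch` returns one string per image, in input order, is not confirmed by the text above. `SizeVLM` is a hypothetical toy backend.

```python
from typing import List, Sequence


class VLM:
    """Stand-in for video_analyzer.vlm.base.VLM (real interface may differ)."""

    def describe_batch(self, images: Sequence[bytes]) -> List[str]:
        raise NotImplementedError


class SizeVLM(VLM):
    """Toy backend that 'describes' each image by its byte length."""

    def describe_batch(self, images: Sequence[bytes]) -> List[str]:
        # Assumed contract: one description per image, same order as the input.
        return [f"image of {len(image)} bytes" for image in images]
```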
