Ouroboroz/Video-Analyzer


video-analyzer

AI-powered video analysis pipeline. Give it a YouTube link or a local file — get back a structured, readable summary with chapters, key frames, and optional semantic search.

video-analyzer run https://youtube.com/watch?v=dQw4w9WgXcQ


Profiles

The pipeline scales from a fast transcript-only summary up to a full visual-analysis RAG system.

| Profile | What it does | Time per video |
|---|---|---|
| simple | Transcript → AI summary | ~30 s |
| standard | + VLM frame descriptions + chapter detection | ~2–5 min |
| full | + image/text embeddings + semantic search | ~5–15 min |

Start with simple. Upgrade when you need more.


Setup by Profile

simple — transcript + AI summary

The lightest setup. No local models required.

pip install video-analyzer

You need one of:

| Summarizer | What you need |
|---|---|
| Claude Code (recommended) | claude CLI already installed; nothing else |
| Anthropic API | ANTHROPIC_API_KEY environment variable |
| OpenAI API | OPENAI_API_KEY environment variable |
| Ollama | ollama serve running locally |

video-analyzer setup    # wizard guides you through the rest
video-analyzer run https://youtube.com/watch?v=...

standard — + visual frame analysis

Describes what's happening in key frames before summarizing. Produces better summaries for visual content (slides, demos, diagrams).

pip install "video-analyzer[standard]"

Additional requirement — a VLM to describe frames:

| VLM option | Setup |
|---|---|
| Ollama (recommended) | ollama pull qwen2.5-vl:7b then ollama serve |
| Anthropic Vision | Reuses your ANTHROPIC_API_KEY; no extra setup |
| HuggingFace transformers | Downloads ~6 GB model on first run, no Ollama needed |

video-analyzer setup    # wizard detects your VLM choice and verifies the connection
video-analyzer run https://youtube.com/watch?v=...

full — + embeddings + semantic search

Everything in standard, plus image and text embeddings stored in LanceDB so you can query across videos later.

pip install "video-analyzer[full]"

Additional requirements:

  • Everything from standard above
  • ~1.5 GB embedding model download (SigLIP2 + BGE-M3); the setup wizard offers to do this for you

video-analyzer setup    # offers to pre-download the embedding models
video-analyzer run https://youtube.com/watch?v=...
video-analyzer query ./output/My_Video "what did they say about the roadmap?"

GPU acceleration (any profile)

pip install "video-analyzer[full,cuda]"    # or [standard,cuda]

Adds the CUDA libraries that faster-whisper needs. PyTorch will use CUDA automatically if a GPU is detected.


Quick Start

Fastest possible start — Claude Code users

If you have Claude Code installed, this is the zero-config path:

pip install video-analyzer
video-analyzer setup        # choose: simple → claude_code
video-analyzer run https://youtube.com/watch?v=...

No API key. No Ollama. Claude Code serves as the summarizer and, because it supports vision, also removes the need for a separate VLM, so the simple profile gives you frame-aware summaries automatically.

Everyone else

pip install video-analyzer   # or [standard] / [full]
video-analyzer setup
video-analyzer run https://youtube.com/watch?v=...

Output lands in ./output/<video-title>/report.md.


Setup Wizard

video-analyzer setup

Walks you through profile → summarizer → VLM (if needed) → model download (if needed). Verifies all connections before writing the config.

╔══════════════════════════════════╗
║   video-analyzer  ·  setup wizard ║
╚══════════════════════════════════╝

Step 1 · Choose a profile
  1) simple     transcript → AI summary (API key only, ~30 s/video)
  2) standard   + visual frame analysis via VLM (~2–5 min/video)
  3) full       + semantic search / RAG (~5–15 min + ~1.5 GB)

Profile [1]: _

If the required packages aren't installed for your chosen profile, the wizard tells you exactly what to run before continuing.

Config is written to ./video-analyzer.yaml by default. To choose a different location:

video-analyzer setup --output ~/.config/video-analyzer/config.yaml

Re-run at any time to reconfigure.


Running the Pipeline

Basic

video-analyzer run https://youtube.com/watch?v=...
video-analyzer run /path/to/video.mp4

Common options

# Override the configured goal (what to extract from the video)
video-analyzer run <url> --goal "extract all action items and decisions"

# Override profile for this run only
video-analyzer run <url> --profile standard

# Output format
video-analyzer run <url> --format html       # markdown | html | json

# Output directory
video-analyzer run <url> --output ./reports

# All-local models (no API key)
video-analyzer run <url> --local

# Estimate cost before committing to LLM calls
video-analyzer run <url> --preflight

Cache and re-runs

Stages are cached automatically — re-running the same video skips completed work.

# Invalidate specific stages and re-run them
video-analyzer run <url> --redo vlm                # redo frame descriptions only
video-analyzer run <url> --redo vlm,transcript     # redo two stages
video-analyzer run <url> --redo embeddings         # recompute embeddings

# Valid stages: frames, transcript, vlm, embeddings

# Bypass all caches
video-analyzer run <url> --no-cache
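
To make the --redo syntax concrete, here is a minimal Python sketch of how a comma-separated value could be split and checked against the four valid stages listed above. The function name parse_redo is hypothetical and is not taken from the video-analyzer codebase; the real CLI may validate differently.

```python
# Stage names come from the "Valid stages" comment above.
VALID_STAGES = ("frames", "transcript", "vlm", "embeddings")


def parse_redo(value: str) -> list[str]:
    """Split a comma-separated --redo value and reject unknown stage names.

    Hypothetical helper for illustration only.
    """
    stages = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [stage for stage in stages if stage not in VALID_STAGES]
    if unknown:
        raise ValueError(f"unknown stage(s): {', '.join(unknown)}")
    return stages


if __name__ == "__main__":
    print(parse_redo("vlm,transcript"))
```

Under this sketch, a typo such as --redo vml fails fast instead of silently re-running nothing.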

Health check

video-analyzer check
video-analyzer check --config ./my-config.yaml

Verifies installed packages, API keys, Ollama connectivity, and device (CPU/CUDA/MPS) for your current profile.


Configuration Reference

Config is loaded in this order (first match wins):

  1. --config <path> (explicit CLI flag)
  2. ./video-analyzer.yaml
  3. ./config.yaml
  4. ~/.config/video-analyzer/config.yaml
  5. Built-in simple defaults (with a hint to run setup)
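
The lookup order above can be sketched in a few lines of Python. This is an illustration of first-match-wins resolution, not the project's actual loader: the function name resolve_config_path and its parameters are assumptions, and whether an explicit --config path is checked for existence is not stated in this README.

```python
from pathlib import Path
from typing import Optional


def resolve_config_path(explicit: Optional[Path], cwd: Path, home: Path) -> Optional[Path]:
    """Return the first config file found, following the documented order.

    Hypothetical helper: None means "use the built-in simple defaults".
    """
    if explicit is not None:
        # 1. An explicit --config path always wins.
        return explicit
    candidates = [
        cwd / "video-analyzer.yaml",                          # 2.
        cwd / "config.yaml",                                  # 3.
        home / ".config" / "video-analyzer" / "config.yaml",  # 4.
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # 5. nothing found


if __name__ == "__main__":
    print(resolve_config_path(None, Path.cwd(), Path.home()))
```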

Full example

# video-analyzer.yaml

profile: standard          # simple | standard | full

goal: "extract key decisions and action items"

transcript:
  providers:               # tried in order; first success wins
    - youtube_captions     # instant for YouTube content
    - whisper              # local fallback, works on any audio
  whisper_model: base      # tiny, base, small, medium, large-v3

summarizer:
  backend: anthropic       # anthropic | openai_compat | claude_code
  model: claude-sonnet-4-6
  vision_mode: auto        # auto | text | vision
  chunk_size: 10           # frames per map-reduce chunk
  max_context_fraction: 0.8

vlm:                       # standard / full profiles only
  enabled: true
  backend: openai_compat   # openai_compat | anthropic | transformers
  model: qwen2.5-vl:7b
  base_url: http://localhost:11434/v1
  concurrency: 4

embeddings:                # full profile only
  image_encoder: siglip2   # siglip2 | clip
  text_encoder: bge-m3     # bge-m3 | sentence-transformers

frames:
  extractor: pixel_diff    # pixel_diff | pyscenedetect | fixed_interval
  pixel_diff_threshold: 0.05
  min_interval_secs: 5.0
  dedup:
    phash: true
    phash_threshold: 5

chapters:
  breakpoint_threshold: 0.85
  min_chapter_duration: 30.0

output:
  format: markdown         # markdown | html | json
  path: ./output
  images: files            # files | inline (html only)

storage:                   # full profile only
  enabled: true
  backend: lancedb

Key settings

profile is the master switch. It controls which pipeline stages run regardless of other flags:

  • simple — transcript → summary. VLM and embeddings never run.
  • standard — adds VLM frame descriptions and content-based chapter detection.
  • full — adds embeddings and a LanceDB store for semantic search.

vision_mode controls how frames reach the summarizer:

  • auto — if the LLM supports vision (Anthropic, Claude Code), send frames directly and skip the VLM stage. Otherwise use VLM text descriptions.
  • text — always use VLM text descriptions, never send images to the summarizer.
  • vision — always send frames directly to the summarizer, skip VLM entirely.

auto is the recommended setting. With claude_code or anthropic as your summarizer, auto detects that the LLM supports vision and sends key frames directly — no Ollama, no separate VLM step needed. This means the simple profile with a vision-capable summarizer gets the same frame-level visual understanding as standard, with none of the extra setup.
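
The three vision_mode values reduce to a small decision function. The Python sketch below restates the rules above; the name frame_route and its return labels are hypothetical, and it does not model how the profile setting interacts with the result.

```python
def frame_route(vision_mode: str, llm_supports_vision: bool) -> str:
    """Decide how frames reach the summarizer (illustrative sketch).

    Returns "direct" (images go to the summarizer, VLM skipped)
    or "vlm_text" (the summarizer sees VLM text descriptions).
    """
    if vision_mode == "vision":
        return "direct"
    if vision_mode == "text":
        return "vlm_text"
    if vision_mode == "auto":
        return "direct" if llm_supports_vision else "vlm_text"
    raise ValueError(f"unknown vision_mode: {vision_mode}")


if __name__ == "__main__":
    print(frame_route("auto", llm_supports_vision=True))
```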

transcript.providers — tried in order, first success wins. youtube_captions is instant for YouTube; whisper transcribes locally from audio and works on any file.
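
"Tried in order, first success wins" is a plain fallback chain. A minimal Python sketch follows, assuming each provider is a callable that either returns transcript text or raises; the stub providers and the function name first_transcript are invented for illustration and do not reflect the real provider interface.

```python
from typing import Callable, Sequence

Provider = Callable[[str], str]


def first_transcript(source: str, providers: Sequence[tuple[str, Provider]]) -> tuple[str, str]:
    """Try each (name, provider) pair in order; return the first success."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(source)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all transcript providers failed: " + "; ".join(errors))


def stub_captions(source: str) -> str:
    # Stand-in for youtube_captions: local files have no captions.
    raise LookupError("no captions available")


def stub_whisper(source: str) -> str:
    # Stand-in for local Whisper transcription.
    return f"transcript of {source}"


if __name__ == "__main__":
    chain = [("youtube_captions", stub_captions), ("whisper", stub_whisper)]
    print(first_transcript("talk.mp4", chain))
```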


Output Formats

Markdown (default)

./output/<video-title>/
├── report.md
└── frames/
    ├── ch01_Introduction/
    │   └── frame_00000_0m05s.jpg
    └── ch02_Main_Content/
        └── frame_00042_3m12s.jpg

HTML

video-analyzer run <url> --format html

Frame images are saved to frames/ alongside the HTML, or inlined as base64 for a single portable file:

output:
  format: html
  images: inline
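
For readers unfamiliar with inlined images, the general technique is a base64 data URI in the img tag. The Python sketch below shows that technique only; the helper name inline_img_tag is hypothetical, and the HTML that video-analyzer actually emits may differ.

```python
import base64
import html


def inline_img_tag(jpeg_bytes: bytes, alt: str = "") -> str:
    """Embed JPEG bytes directly in an <img> tag as a base64 data URI."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    safe_alt = html.escape(alt, quote=True)
    return f'<img alt="{safe_alt}" src="data:image/jpeg;base64,{encoded}">'


if __name__ == "__main__":
    # Placeholder bytes; a real call would read a frame file from frames/.
    print(inline_img_tag(b"abc", alt="frame at 0m05s"))
```

Base64 encoding grows the payload by about a third, which is the usual trade-off for a single portable file.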

JSON

Full machine-readable export — metadata, chapters, per-frame data (embeddings included), summaries.

video-analyzer run <url> --format json

Semantic Search

The full profile writes a LanceDB vector store you can query across videos.

video-analyzer run <url> --profile full

video-analyzer query ./output/My_Video "what did they say about pricing?"
video-analyzer query ./output/My_Video "show me the architecture diagram" --top-k 3

# --top-k: number of results
# --context: expand each result with ±N adjacent frames
video-analyzer query <path> <question> \
  --top-k 5 \
  --context 2

query accepts either the output directory or its store/ subdirectory.
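
The --context option widens each hit by ±N neighbouring frames. One plausible reading of that behaviour is sketched below in Python. The function expand_hits is hypothetical, and merging overlapping windows and clipping at the video's edges are assumptions, since this README does not specify either.

```python
def expand_hits(hits: list[int], context: int, total_frames: int) -> list[int]:
    """Expand each hit index by ±context frames, clipped to valid indices.

    Overlapping windows are merged, so each frame appears at most once.
    """
    expanded: set[int] = set()
    for idx in hits:
        first = max(0, idx - context)
        last = min(total_frames - 1, idx + context)
        expanded.update(range(first, last + 1))
    return sorted(expanded)


if __name__ == "__main__":
    print(expand_hits([0, 10], context=2, total_frames=12))
```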


Advanced Usage

Claude Code backend (no API key needed)

pip install video-analyzer
video-analyzer setup    # choose: simple → claude_code

Uses your existing claude session for summarization. No ANTHROPIC_API_KEY needed — auth is handled by Claude Code.

Because Claude supports vision, vision_mode: auto (the default) detects this and sends key frames directly to Claude alongside the transcript. You get the same frame-level visual understanding as the standard profile without any extra setup — no Ollama, no VLM download, no additional config.

Generated config:

profile: simple
summarizer:
  backend: claude_code
  vision_mode: auto    # Claude Code supports vision → frames sent directly, VLM skipped

This is the recommended starting point if you already have Claude Code installed.

Fully local pipeline

No internet required after the initial model download:

pip install "video-analyzer[standard]"
video-analyzer run <url> --local

The --local flag sets the VLM to HuggingFace transformers (Qwen2.5-VL-3B) and the summarizer to Ollama. It requires ollama serve to be running with a text model pulled (ollama pull qwen2.5:7b).

Cost estimation

video-analyzer run <url> --preflight

=== Preflight Report ===
Video:          My Conference Talk
Duration:       3600s (60.0 min)
Frames:         284 (after dedup)
Chapters:       8
Map chunks:     29

Transcription:  youtube_captions ✓
Transcript tok: ~24,500

LLM:            anthropic / claude-sonnet-4-6
Vision mode:    text (VLM descriptions → summarizer)
Est input tok:  ~38,200
Est output tok: ~8,600
Est LLM cost:   ~$0.244
========================
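
The numbers in this sample report are consistent with simple arithmetic, which the Python sketch below reproduces. It assumes chunk_size: 10 (the value in the full config example) applied across all frames, and per-million-token rates of $3 for input and $15 for output. Both rates are assumptions chosen because they reproduce the reported estimate; they are not values taken from the tool.

```python
import math

# Values copied from the sample preflight report above.
frames = 284
input_tokens = 38_200
output_tokens = 8_600

# Assumptions (not stated in the report).
chunk_size = 10           # frames per map-reduce chunk
usd_per_m_input = 3.00    # assumed USD per million input tokens
usd_per_m_output = 15.00  # assumed USD per million output tokens

map_chunks = math.ceil(frames / chunk_size)
cost = (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

if __name__ == "__main__":
    print(f"Map chunks: {map_chunks}")
    print(f"Est LLM cost: ~${cost:.3f}")
```

Under these assumptions the input side contributes about $0.115 and the output side about $0.129, so output tokens account for over half of the estimate despite being far fewer.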

Custom prompts

Override the built-in Jinja2 summarization prompts:

prompts/
├── map_extract.j2       # per-chunk extraction
├── chapter_reduce.j2    # per-chapter summary
├── chapter_name.j2      # chapter title generation
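└── final_reduce.j2      # final overall summary

To use them from Python, pass the directory to map_reduce:

from pathlib import Path

from video_analyzer.summarizer.map_reduce import map_reduce

# chapters, llm, and metadata are assumed to exist already (not shown here)
chapter_summaries, final_summary = map_reduce(
    chapters=chapters,
    llm=llm,
    metadata=metadata,
    goal="extract all code examples",
    user_prompts_dir=Path("./prompts"),  # directory holding the .j2 overrides
)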

Development

git clone https://github.com/Ouroboroz/Video-Analyzer
cd Video-Analyzer
pip install -e ".[full,dev]"
pytest tests/

Project structure

video_analyzer/
├── cli.py                  ← Typer CLI (run, setup, query, check)
├── setup_wizard.py         ← Interactive setup wizard
├── pipeline.py             ← Top-level orchestration
├── config.py               ← Pydantic config tree + profile logic
├── models.py               ← Shared dataclasses
├── ingestion/              ← Video download + loading
├── extraction/             ← Frame extraction + deduplication
├── transcript/             ← Whisper + YouTube captions
├── alignment/              ← Transcript→frame alignment, chapter detection
├── vlm/                    ← Visual frame description (VLM backends)
├── embeddings/             ← Image (SigLIP2/CLIP) + text (BGE-M3) encoders
├── summarizer/             ← Map-reduce summarization (Anthropic/OpenAI/Claude Code)
├── store/                  ← LanceDB vector store + retrieval
└── output/                 ← Markdown, HTML, JSON writers

Adding a new LLM backend

  1. Subclass video_analyzer.summarizer.base.LLM
  2. Implement complete(prompt) and optionally complete_with_images(prompt, images)
  3. Set supports_vision = True if vision is supported
  4. Register in build_llm() in summarizer/base.py
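
The four steps above might look like the following Python sketch. Because the real video_analyzer.summarizer.base module is not shown in this README, the sketch defines a stand-in LLM base class and a stand-in build_llm() so it runs on its own. The method signatures, the registry shape, and the CannedLLM backend are all assumptions.

```python
class LLM:
    """Stand-in for video_analyzer.summarizer.base.LLM (real interface may differ)."""

    supports_vision = False

    def complete(self, prompt: str) -> str:
        raise NotImplementedError


class CannedLLM(LLM):
    """Toy backend that returns a fixed reply; useful only for wiring tests."""

    supports_vision = True  # step 3: advertise vision support

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:  # step 2 (required)
        return self.reply

    def complete_with_images(self, prompt: str, images: list[bytes]) -> str:  # step 2 (optional)
        return f"{self.reply} ({len(images)} images)"


def build_llm(backend: str) -> LLM:
    """Stand-in factory (step 4); the real build_llm() likely takes a config object."""
    registry = {"canned": lambda: CannedLLM("canned reply")}
    if backend not in registry:
        raise ValueError(f"unknown backend: {backend}")
    return registry[backend]()


if __name__ == "__main__":
    print(build_llm("canned").complete("Summarize this chapter."))
```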

Adding a new VLM backend

  1. Subclass video_analyzer.vlm.base.VLM
  2. Implement describe_batch(images)
  3. Register in build_vlm() in vlm/base.py
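
A VLM backend follows the same pattern with a single method. The Python sketch below uses a stand-in VLM base class because the real video_analyzer.vlm.base module is not shown here. It assumes describe_batch receives image bytes and returns one description per image in the same order; the actual image type may differ.

```python
class VLM:
    """Stand-in for video_analyzer.vlm.base.VLM (real interface may differ)."""

    def describe_batch(self, images: list[bytes]) -> list[str]:
        raise NotImplementedError


class ByteCountVLM(VLM):
    """Toy backend that "describes" each frame by its size in bytes."""

    def describe_batch(self, images: list[bytes]) -> list[str]:
        # One description per input image, preserving order.
        return [f"frame {i}: {len(img)} bytes" for i, img in enumerate(images)]


if __name__ == "__main__":
    print(ByteCountVLM().describe_batch([b"\xff\xd8", b"\xff\xd8\xff"]))
```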

About

AI tool that analyzes videos by breaking them down into key frames and using various models to extract information for summaries and Q&A.
