A two-stage pipeline for monitoring YouTube channels and producing AI-generated summaries.
Stage 1 — fetch (script/pipeline.py): discovers videos published within the configured lookback window, downloads metadata and transcripts, and writes data/pending.json.
Stage 2 — analyze (script/ai_pipeline.py): for each video in pending.json, invokes a headless Claude Code instance with the /video-analysis skill, which analyzes the transcript (falling back to web research when unavailable) and writes a structured markdown summary to data/summaries/<video_id>.md. A PreToolUse hook validates the full required structure before each file is written; if validation fails, the agent retries automatically.
- Install uv
- From the workspace root (`my_workflow/`): `uv sync`
# Validate prerequisites
uv run python run.py validate
# Run the full pipeline (fetch → analyze)
uv run python run.py run
# Run a single stage
uv run python run.py run --stage fetch
uv run python run.py run --stage analyze
# Check status after a run
uv run python run.py status

| Command | Description |
|---|---|
| `uv run python run.py run` | Run all stages in order |
| `uv run python run.py run --stage <name>` | Run a single stage (`fetch` or `analyze`) |
| `uv run python run.py status` | Print `data/state.json` (run history, stage results) |
| `uv run python run.py config` | Print resolved `channels.yaml` as JSON |
| `uv run python run.py info` | Print workflow metadata as JSON |
| `uv run python run.py validate` | Check that config and stage scripts exist |
Exit codes: 0 = success, 1 = partial failure, 2 = complete failure, 3 = config error, 4 = prerequisite missing.
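A wrapper script can branch on these codes. A minimal sketch of the mapping (the function name is illustrative, not part of run.py):

```python
# Exit-code meanings, as documented for run.py.
EXIT_MEANINGS = {
    0: "success",
    1: "partial failure",
    2: "complete failure",
    3: "config error",
    4: "prerequisite missing",
}

def describe_exit(code: int) -> str:
    """Translate a run.py exit code into a human-readable status."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")
```

For example, a monitoring wrapper could log `describe_exit(result.returncode)` after each scheduled run instead of a bare number.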
.
├── pyproject.toml # Project dependencies
├── run.py # Standard CLI entry point
├── workflow.yaml # Workflow manifest (identity, stages, schedule)
├── channels.yaml # Channel configuration
├── config.schema.json # JSON Schema for channels.yaml
├── script/
│ ├── pipeline.py # Stage 1: deterministic fetch
│ ├── ai_pipeline.py # Stage 2: AI analysis
│ └── yt_transcript.py # Transcript utilities
├── data/
│ ├── state.json # Machine-readable run state (auto-created)
│ ├── pending.json # Videos awaiting analysis (stage 1 → stage 2)
│ ├── processed.json # Deduplication tracker
│ ├── summaries/ # One markdown summary per video
│ └── logs/ # Per-video Claude invocation logs
└── .claude/
├── skills/video-analysis/ # AI skill definition
└── hooks/validate_write.py # Pre-write validation hook
Edit channels.yaml to add or remove channels:
defaults:
  languages: [en]
  lookback_hours: 48

channels:
  - name: "Fireship"
    url: "https://www.youtube.com/@Fireship"
    enabled: true

Set `enabled: false` to pause a channel without removing it. Per-channel overrides for `languages` and `lookback_hours` are supported.
The full schema is defined in config.schema.json.
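The defaults/override merge can be sketched as a plain dict merge (a simplification — pipeline.py's actual resolution logic may differ):

```python
def resolve_channel(defaults: dict, channel: dict) -> dict:
    """Apply per-channel overrides (e.g. languages, lookback_hours) on top of defaults."""
    return {**defaults, **channel}

defaults = {"languages": ["en"], "lookback_hours": 48}
channel = {
    "name": "Fireship",
    "url": "https://www.youtube.com/@Fireship",
    "enabled": True,
    "lookback_hours": 24,  # per-channel override
}
resolved = resolve_channel(defaults, channel)
```

Here `resolved["lookback_hours"]` is 24 (the override wins), while `languages` falls back to the default `["en"]`.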
- `channels.yaml` — channels to monitor
- `data/processed.json` — deduplication tracker (updated after each video)
| Field | Type | Description |
|---|---|---|
| `schema_version` | int | Schema version (currently 1) |
| `pipeline_run_at` | str | ISO-8601 UTC timestamp of the pipeline run |
| `url` | str | Full YouTube watch URL |
| `video_id` | str | 11-character YouTube video ID |
| `channel` | str | Channel name from channels.yaml |
| `channel_id` | str | YouTube channel ID |
| `channel_url` | str | YouTube channel URL |
| `title` | str | Video title |
| `description` | str | Video description |
| `upload_date` | str | Upload date as YYYYMMDD |
| `duration_seconds` | int | Duration in seconds |
| `duration_string` | str | Duration as MM:SS or HH:MM:SS |
| `view_count` | int | View count at fetch time |
| `like_count` | int | Like count at fetch time |
| `transcript_status` | str | See below |
| `transcript_language` | str | BCP-47 language code |
| `transcript_is_generated` | bool | true if auto-generated |
| `transcript` | str | Full transcript text |
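For illustration, a single pending.json entry with these fields might look like the following (all values are made up):

```json
{
  "schema_version": 1,
  "pipeline_run_at": "2024-06-01T12:00:00+00:00",
  "url": "https://www.youtube.com/watch?v=abc123def45",
  "video_id": "abc123def45",
  "channel": "Fireship",
  "channel_id": "UCxxxxxxxxxxxxxxxxxxxxxx",
  "channel_url": "https://www.youtube.com/@Fireship",
  "title": "Example video title",
  "description": "Example description",
  "upload_date": "20240601",
  "duration_seconds": 312,
  "duration_string": "5:12",
  "view_count": 100000,
  "like_count": 5000,
  "transcript_status": "ok",
  "transcript_language": "en",
  "transcript_is_generated": true,
  "transcript": "Full transcript text..."
}
```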
| Value | Meaning |
|---|---|
| `ok` | Transcript fetched successfully |
| `no_transcript` | No transcript in requested languages |
| `transcripts_disabled` | Transcripts disabled for this video |
| `video_unavailable` | Video is unavailable |
| `error` | Unexpected error during transcript fetch |
Videos are tracked by URL in data/processed.json. A video is only processed once; re-running fetch skips already-processed videos.
script/ai_pipeline.py
└── for each video in data/pending.json (parallel, up to 4 workers):
    claude -p "/video-analysis\n<video fields>"
        --tools WebSearch,WebFetch,Write
        --model claude-sonnet-4-6
        --max-turns 20
    ├── .claude/skills/video-analysis/SKILL.md (instructions + format)
    ├── PreToolUse(Write) → .claude/hooks/validate_write.py
    └── writes data/summaries/<video_id>.md
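In outline, Stage 2's fan-out could look like the following sketch (flag names mirror the invocation above; the prompt formatting and error handling are simplified guesses at what ai_pipeline.py does):

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_command(video: dict) -> list[str]:
    """Assemble the headless Claude Code invocation for one video."""
    prompt = "/video-analysis\n" + json.dumps(video)
    return [
        "claude", "-p", prompt,
        "--tools", "WebSearch,WebFetch,Write",
        "--model", "claude-sonnet-4-6",
        "--max-turns", "20",
    ]

def analyze_all(videos: list[dict], max_workers: int = 4) -> None:
    """Fan out one Claude invocation per pending video, up to 4 at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda v: subprocess.run(build_command(v)), videos))
```

A thread pool (rather than a process pool) is enough here because each worker just blocks on a subprocess.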
Each summary contains seven required sections (in order):
| Section | Content |
|---|---|
| `# <Title>` | H1 matching the video title |
| `## Overview` | 2-4 sentence summary of the video's main thesis |
| `## Key Concepts` | Core ideas, technologies, and acronyms defined |
| `## How It Works` | Mechanisms, architecture, or step-by-step process |
| `## Key Takeaways` | Bullet-point list of actionable insights |
| `## Use Cases` | Practical applications and scenarios |
| `## Further Reading` | Links and resources mentioned or relevant |
.claude/hooks/validate_write.py runs as a PreToolUse hook before every write to data/summaries/*.md. It blocks the write (exit 2) if:
- Content is empty or under 500 characters
- Document doesn't start with `#` (H1)
- Any of the 6 required `##` sections is missing
- `## Key Concepts` is missing required `###` subsections (`### Technologies`, `### Terms`)
- `## Key Takeaways` has no bullet points (`-`)
The agent sees the error list and retries automatically. Videos whose .md already exists are skipped, making re-runs safe.
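The hook's checks can be approximated as follows — a sketch of the rules listed above, not the actual validate_write.py (the real hook also reads the tool input from stdin and exits 2 to block the write):

```python
import re

REQUIRED_SECTIONS = [
    "## Overview", "## Key Concepts", "## How It Works",
    "## Key Takeaways", "## Use Cases", "## Further Reading",
]

def validate_summary(content: str) -> list[str]:
    """Return a list of validation errors; an empty list means the write is allowed."""
    errors = []
    if len(content) < 500:
        errors.append("content is empty or under 500 characters")
    if not content.lstrip().startswith("# "):
        errors.append("document doesn't start with an H1")
    for section in REQUIRED_SECTIONS:
        if section not in content:
            errors.append(f"missing required section: {section}")
    if "## Key Concepts" in content:
        for sub in ("### Technologies", "### Terms"):
            if sub not in content:
                errors.append(f"## Key Concepts is missing {sub}")
    if "## Key Takeaways" in content:
        # Only inspect the text between this heading and the next ## heading.
        body = content.split("## Key Takeaways", 1)[1].split("\n## ", 1)[0]
        if not re.search(r"^- ", body, re.MULTILINE):
            errors.append("## Key Takeaways has no bullet points")
    return errors
```

Returning the full error list (rather than failing on the first problem) is what lets the agent fix everything in a single retry.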
workflow.yaml is a machine-readable manifest that describes this workflow for dashboards, cron, and AI agents. It declares the stages, config location, data layout, and suggested schedule — enabling any consumer to discover, validate, and run the pipeline without workflow-specific knowledge.
The manifest suggests 0 */6 * * * (every 6 hours). To set up with cron:
0 */6 * * * cd /path/to/yt_workflow && uv run python run.py run >> data/cron.log 2>&1

Run state is tracked in data/state.json with timestamps, exit codes, and the last 50 run records.