A two-stage pipeline for monitoring YouTube channels and producing AI-generated summaries.
Stage 1 — fetch (script/pipeline.py): discovers videos published within the configured lookback window, downloads metadata and transcripts, and writes data/pending.json.
Stage 2 — analyze (script/ai_pipeline.py): for each video in pending.json, invokes a headless Claude Code instance with the /video-analysis skill, which analyzes the transcript (falling back to web research when unavailable) and writes a structured markdown summary to data/summaries/<video_id>.md. A PreToolUse hook validates the full required structure before each file is written; if validation fails, the agent retries automatically.
- Install uv
- From the workspace root (`my_workflow/`): `uv sync`
# Validate prerequisites
uv run python run.py validate
# Run the full pipeline (fetch → analyze)
uv run python run.py run
# Run a single stage
uv run python run.py run --stage fetch
uv run python run.py run --stage analyze
# Check status after a run
uv run python run.py status

| Command | Description |
|---|---|
| `uv run python run.py run` | Run all stages in order |
| `uv run python run.py run --stage <name>` | Run a single stage (`fetch` or `analyze`) |
| `uv run python run.py status` | Print `data/state.json` (run history, stage results) |
| `uv run python run.py config` | Print resolved `channels.yaml` as JSON |
| `uv run python run.py info` | Print workflow metadata as JSON |
| `uv run python run.py validate` | Check that config and stage scripts exist |
Exit codes: 0 = success, 1 = partial failure, 2 = complete failure, 3 = config error, 4 = prerequisite missing.
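A wrapper script can branch on these codes. A minimal sketch of the mapping (the function name is illustrative, not part of run.py):

```python
# Exit-code meanings, as documented for run.py.
EXIT_MEANINGS = {
    0: "success",
    1: "partial failure",
    2: "complete failure",
    3: "config error",
    4: "prerequisite missing",
}

def describe_exit(code: int) -> str:
    """Translate a run.py exit code into a human-readable status."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")
```

For example, a monitoring wrapper could log `describe_exit(result.returncode)` after each scheduled run instead of a bare number.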
.
├── pyproject.toml # Project dependencies
├── run.py # Standard CLI entry point
├── workflow.yaml # Workflow manifest (identity, stages, schedule)
├── channels.yaml # Channel configuration
├── config.schema.json # JSON Schema for channels.yaml
├── script/
│ ├── pipeline.py # Stage 1: deterministic fetch
│ ├── ai_pipeline.py # Stage 2: AI analysis
│ └── yt_transcript.py # Transcript utilities
├── data/
│ ├── state.json # Machine-readable run state (auto-created)
│ ├── pending.json # Videos awaiting analysis (stage 1 → stage 2)
│ ├── processed.json # Deduplication tracker
│ ├── summaries/ # One markdown summary per video
│ └── logs/ # Per-video Claude invocation logs
└── .claude/
├── skills/video-analysis/ # AI skill definition
└── hooks/validate_write.py # Pre-write validation hook
Edit channels.yaml to add or remove channels:
defaults:
  languages: [en]
  lookback_hours: 48

channels:
  - name: "Fireship"
    url: "https://www.youtube.com/@Fireship"
    enabled: true

Set `enabled: false` to pause a channel without removing it. Per-channel overrides for `languages` and `lookback_hours` are supported.
The full schema is defined in config.schema.json.
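The defaults/override merge can be sketched as a plain dict merge (a simplification — pipeline.py's actual resolution logic may differ):

```python
def resolve_channel(defaults: dict, channel: dict) -> dict:
    """Apply per-channel overrides (e.g. languages, lookback_hours) on top of defaults."""
    return {**defaults, **channel}

defaults = {"languages": ["en"], "lookback_hours": 48}
channel = {
    "name": "Fireship",
    "url": "https://www.youtube.com/@Fireship",
    "enabled": True,
    "lookback_hours": 24,  # per-channel override
}
resolved = resolve_channel(defaults, channel)
```

Here `resolved["lookback_hours"]` is 24 (the override wins), while `languages` falls back to the default `["en"]`.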
- `channels.yaml` — channels to monitor
- `data/processed.json` — deduplication tracker (updated after each video)
| Field | Type | Description |
|---|---|---|
| `schema_version` | int | Schema version (currently 1) |
| `pipeline_run_at` | str | ISO-8601 UTC timestamp of the pipeline run |
| `url` | str | Full YouTube watch URL |
| `video_id` | str | 11-character YouTube video ID |
| `channel` | str | Channel name from channels.yaml |
| `channel_id` | str | YouTube channel ID |
| `channel_url` | str | YouTube channel URL |
| `title` | str | Video title |
| `description` | str | Video description |
| `upload_date` | str | Upload date as YYYYMMDD |
| `duration_seconds` | int | Duration in seconds |
| `duration_string` | str | Duration as MM:SS or HH:MM:SS |
| `view_count` | int | View count at fetch time |
| `like_count` | int | Like count at fetch time |
| `transcript_status` | str | See below |
| `transcript_language` | str | BCP-47 language code |
| `transcript_is_generated` | bool | true if auto-generated |
| `transcript` | str | Full transcript text |
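For illustration, a single pending.json entry with these fields might look like the following (all values are made up):

```json
{
  "schema_version": 1,
  "pipeline_run_at": "2024-06-01T12:00:00+00:00",
  "url": "https://www.youtube.com/watch?v=abc123def45",
  "video_id": "abc123def45",
  "channel": "Fireship",
  "channel_id": "UCxxxxxxxxxxxxxxxxxxxxxx",
  "channel_url": "https://www.youtube.com/@Fireship",
  "title": "Example video title",
  "description": "Example description",
  "upload_date": "20240601",
  "duration_seconds": 312,
  "duration_string": "5:12",
  "view_count": 100000,
  "like_count": 5000,
  "transcript_status": "ok",
  "transcript_language": "en",
  "transcript_is_generated": true,
  "transcript": "Full transcript text..."
}
```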
| Value | Meaning |
|---|---|
| `ok` | Transcript fetched successfully |
| `no_transcript` | No transcript in requested languages |
| `transcripts_disabled` | Transcripts disabled for this video |
| `video_unavailable` | Video is unavailable |
| `error` | Unexpected error during transcript fetch |
Videos are tracked by URL in data/processed.json. A video is only processed once; re-running fetch skips already-processed videos.
script/ai_pipeline.py
└── for each video in data/pending.json (parallel, up to 4 workers):
    claude -p "/video-analysis\n<video fields>"
        --tools WebSearch,WebFetch,Write
        --model claude-sonnet-4-6
        --max-turns 20
    ├── .claude/skills/video-analysis/SKILL.md (instructions + format)
    ├── PreToolUse(Write) → .claude/hooks/validate_write.py
    └── writes data/summaries/<video_id>.md
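In outline, Stage 2's fan-out could look like the following sketch (flag names mirror the invocation above; the prompt formatting and error handling are simplified guesses at what ai_pipeline.py does):

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_command(video: dict) -> list[str]:
    """Assemble the headless Claude Code invocation for one video."""
    prompt = "/video-analysis\n" + json.dumps(video)
    return [
        "claude", "-p", prompt,
        "--tools", "WebSearch,WebFetch,Write",
        "--model", "claude-sonnet-4-6",
        "--max-turns", "20",
    ]

def analyze_all(videos: list[dict], max_workers: int = 4) -> None:
    """Fan out one Claude invocation per pending video, up to 4 at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda v: subprocess.run(build_command(v)), videos))
```

A thread pool (rather than a process pool) is enough here because each worker just blocks on a subprocess.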
Each summary contains seven required sections (in order):
| Section | Content |
|---|---|
| `# <Title>` | H1 matching the video title |
| `## Overview` | 2-4 sentence summary of the video's main thesis |
| `## Key Concepts` | Core ideas, technologies, and acronyms defined |
| `## How It Works` | Mechanisms, architecture, or step-by-step process |
| `## Key Takeaways` | Bullet-point list of actionable insights |
| `## Use Cases` | Practical applications and scenarios |
| `## Further Reading` | Links and resources mentioned or relevant |
.claude/hooks/validate_write.py runs as a PreToolUse hook before every write to data/summaries/*.md. It blocks the write (exit 2) if:
- Content is empty or under 500 characters
- Document doesn't start with `#` (H1)
- Any of the 6 required `##` sections is missing
- `## Key Concepts` is missing required `###` subsections (`### Technologies`, `### Terms`)
- `## Key Takeaways` has no bullet points (`-`)
The agent sees the error list and retries automatically. Videos whose .md already exists are skipped, making re-runs safe.
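The hook's checks can be approximated as follows — a sketch of the rules listed above, not the actual validate_write.py (the real hook also reads the tool input from stdin and exits 2 to block the write):

```python
import re

REQUIRED_SECTIONS = [
    "## Overview", "## Key Concepts", "## How It Works",
    "## Key Takeaways", "## Use Cases", "## Further Reading",
]

def validate_summary(content: str) -> list[str]:
    """Return a list of validation errors; an empty list means the write is allowed."""
    errors = []
    if len(content) < 500:
        errors.append("content is empty or under 500 characters")
    if not content.lstrip().startswith("# "):
        errors.append("document doesn't start with an H1")
    for section in REQUIRED_SECTIONS:
        if section not in content:
            errors.append(f"missing required section: {section}")
    if "## Key Concepts" in content:
        for sub in ("### Technologies", "### Terms"):
            if sub not in content:
                errors.append(f"## Key Concepts is missing {sub}")
    if "## Key Takeaways" in content:
        # Only inspect the text between this heading and the next ## heading.
        body = content.split("## Key Takeaways", 1)[1].split("\n## ", 1)[0]
        if not re.search(r"^- ", body, re.MULTILINE):
            errors.append("## Key Takeaways has no bullet points")
    return errors
```

Returning the full error list (rather than failing on the first problem) is what lets the agent fix everything in a single retry.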
workflow.yaml is a machine-readable manifest that describes this workflow for dashboards, cron, and AI agents. It declares the stages, config location, data layout, and suggested schedule — enabling any consumer to discover, validate, and run the pipeline without workflow-specific knowledge.
The manifest suggests 0 */6 * * * (every 6 hours). To set up with cron:
0 */6 * * * cd /path/to/yt_workflow && uv run python run.py run >> data/cron.log 2>&1

Run state is tracked in data/state.json with timestamps, exit codes, and the last 50 run records.