EDEN757/YT_Workflow
yt-workflow

A two-stage pipeline for monitoring YouTube channels and producing AI-generated summaries.

Stage 1 — fetch (script/pipeline.py): discovers videos published within the configured lookback window, downloads metadata and transcripts, and writes data/pending.json.

Stage 2 — analyze (script/ai_pipeline.py): for each video in pending.json, invokes a headless Claude Code instance with the /video-analysis skill, which analyzes the transcript (falling back to web research when no transcript is available) and writes a structured markdown summary to data/summaries/<video_id>.md. A PreToolUse hook validates the full required structure before each file is written; if validation fails, the agent retries automatically.

Requirements

  1. Install uv
  2. From the workspace root (my_workflow/):
    uv sync

Quick Start

# Validate prerequisites
uv run python run.py validate

# Run the full pipeline (fetch → analyze)
uv run python run.py run

# Run a single stage
uv run python run.py run --stage fetch
uv run python run.py run --stage analyze

# Check status after a run
uv run python run.py status

CLI Reference

Command                                    Description
uv run python run.py run                   Run all stages in order
uv run python run.py run --stage <name>    Run a single stage (fetch or analyze)
uv run python run.py status                Print data/state.json (run history, stage results)
uv run python run.py config                Print resolved channels.yaml as JSON
uv run python run.py info                  Print workflow metadata as JSON
uv run python run.py validate              Check that config and stage scripts exist

Exit codes: 0 = success, 1 = partial failure, 2 = complete failure, 3 = config error, 4 = prerequisite missing.
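Orchestration scripts can branch on these exit codes. A minimal sketch (the mapping mirrors the table above; wrapping run.py in a subprocess call is left to the caller):

```python
# Sketch: interpret run.py exit codes in an orchestration script.
# The mapping mirrors the documented exit codes.
EXIT_MEANINGS = {
    0: "success",
    1: "partial failure",
    2: "complete failure",
    3: "config error",
    4: "prerequisite missing",
}

def describe_exit(code: int) -> str:
    """Return a human-readable meaning for a run.py exit code."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")
```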

Project Structure

.
├── pyproject.toml          # Project dependencies
├── run.py                  # Standard CLI entry point
├── workflow.yaml           # Workflow manifest (identity, stages, schedule)
├── channels.yaml           # Channel configuration
├── config.schema.json      # JSON Schema for channels.yaml
├── script/
│   ├── pipeline.py         # Stage 1: deterministic fetch
│   ├── ai_pipeline.py      # Stage 2: AI analysis
│   └── yt_transcript.py    # Transcript utilities
├── data/
│   ├── state.json          # Machine-readable run state (auto-created)
│   ├── pending.json        # Videos awaiting analysis (stage 1 → stage 2)
│   ├── processed.json      # Deduplication tracker
│   ├── summaries/          # One markdown summary per video
│   └── logs/               # Per-video Claude invocation logs
└── .claude/
    ├── skills/video-analysis/  # AI skill definition
    └── hooks/validate_write.py # Pre-write validation hook

Configuration

Edit channels.yaml to add or remove channels:

defaults:
  languages: [en]
  lookback_hours: 48

channels:
  - name: "Fireship"
    url: "https://www.youtube.com/@Fireship"
    enabled: true

Set enabled: false to pause a channel without removing it. Per-channel overrides for languages and lookback_hours are supported.

The full schema is defined in config.schema.json.
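Per-channel overrides are merged over the defaults block. A minimal sketch of that merge, assuming the YAML has already been parsed into a dict (the resolve_channel helper is illustrative, not the actual implementation):

```python
def resolve_channel(defaults: dict, channel: dict) -> dict:
    """Merge a channel entry over the defaults block.

    Keys present on the channel (e.g. languages, lookback_hours)
    override the corresponding defaults; everything else is inherited.
    """
    resolved = {**defaults, **channel}
    resolved.setdefault("enabled", True)
    return resolved

config = {
    "defaults": {"languages": ["en"], "lookback_hours": 48},
    "channels": [
        {"name": "Fireship",
         "url": "https://www.youtube.com/@Fireship",
         "enabled": True,
         "lookback_hours": 24},  # per-channel override
    ],
}

resolved = [resolve_channel(config["defaults"], c) for c in config["channels"]]
```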

Stage 1 — Fetch

Inputs

  • channels.yaml — channels to monitor
  • data/processed.json — deduplication tracker (updated after each video)

Output: data/pending.json

Field                    Type  Description
schema_version           int   Schema version (currently 1)
pipeline_run_at          str   ISO-8601 UTC timestamp of the pipeline run
url                      str   Full YouTube watch URL
video_id                 str   11-character YouTube video ID
channel                  str   Channel name from channels.yaml
channel_id               str   YouTube channel ID
channel_url              str   YouTube channel URL
title                    str   Video title
description              str   Video description
upload_date              str   Upload date as YYYYMMDD
duration_seconds         int   Duration in seconds
duration_string          str   Duration as MM:SS or HH:MM:SS
view_count               int   View count at fetch time
like_count               int   Like count at fetch time
transcript_status        str   See below
transcript_language      str   BCP-47 language code
transcript_is_generated  bool  true if auto-generated
transcript               str   Full transcript text
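A record with the fields above might look like the following (every value here is illustrative, not from a real fetch):

```python
# Illustrative pending.json entry; all values are made up.
record = {
    "schema_version": 1,
    "pipeline_run_at": "2025-01-01T00:00:00+00:00",
    "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "video_id": "dQw4w9WgXcQ",
    "channel": "Fireship",
    "channel_id": "UCxxxxxxxxxxxxxxxxxxxxxx",
    "channel_url": "https://www.youtube.com/@Fireship",
    "title": "Example video",
    "description": "Example description.",
    "upload_date": "20250101",
    "duration_seconds": 215,
    "duration_string": "03:35",
    "view_count": 12345,
    "like_count": 678,
    "transcript_status": "ok",
    "transcript_language": "en",
    "transcript_is_generated": True,
    "transcript": "Full transcript text...",
}
```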

Transcript Status Values

Value                 Meaning
ok                    Transcript fetched successfully
no_transcript         No transcript in requested languages
transcripts_disabled  Transcripts disabled for this video
video_unavailable     Video is unavailable
error                 Unexpected error during transcript fetch

Deduplication

Videos are tracked by URL in data/processed.json. A video is only processed once; re-running fetch skips already-processed videos.
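A minimal sketch of that check, assuming processed.json holds a flat list of URLs (the actual file layout may differ):

```python
import json
import tempfile
from pathlib import Path

def load_processed(path: Path) -> set[str]:
    """Load already-processed video URLs; a missing file means an empty set."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))

def mark_processed(path: Path, url: str) -> None:
    """Record a URL so later fetch runs skip it."""
    seen = load_processed(path)
    seen.add(url)
    path.write_text(json.dumps(sorted(seen)))

# Usage: record one video, then confirm a re-run would skip it.
tmp = Path(tempfile.mkdtemp()) / "processed.json"
mark_processed(tmp, "https://www.youtube.com/watch?v=dQw4w9WgXcQ")
already_seen = "https://www.youtube.com/watch?v=dQw4w9WgXcQ" in load_processed(tmp)
```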

Stage 2 — AI Analysis

Architecture

script/ai_pipeline.py
  └── for each video in data/pending.json (parallel, up to 4 workers):
        claude -p "/video-analysis\n<video fields>"
               --tools WebSearch,WebFetch,Write
               --model claude-sonnet-4-6
               --max-turns 20
              ├── .claude/skills/video-analysis/SKILL.md  (instructions + format)
              ├── PreToolUse(Write) → .claude/hooks/validate_write.py
              └── writes data/summaries/<video_id>.md
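The per-video invocation can be sketched as command construction (flag names copied from the diagram above; treat the exact CLI surface as an assumption):

```python
def build_claude_cmd(video: dict, max_turns: int = 20) -> list[str]:
    """Assemble the headless Claude Code invocation for one video.

    The prompt is the /video-analysis skill trigger followed by the
    video's fields, one per line.
    """
    prompt = "/video-analysis\n" + "\n".join(
        f"{key}: {value}" for key, value in video.items()
    )
    return [
        "claude", "-p", prompt,
        "--tools", "WebSearch,WebFetch,Write",
        "--model", "claude-sonnet-4-6",
        "--max-turns", str(max_turns),
    ]

cmd = build_claude_cmd({"video_id": "dQw4w9WgXcQ", "title": "Example"})
```

ai_pipeline.py runs these commands in parallel (up to 4 workers), so each invocation must be independent: one video, one prompt, one summary file.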

Summary Format

Each summary contains seven required sections (in order):

Section             Content
# <Title>           H1 matching the video title
## Overview         2-4 sentence summary of the video's main thesis
## Key Concepts     Core ideas, technologies, and acronyms defined
## How It Works     Mechanisms, architecture, or step-by-step process
## Key Takeaways    Bullet-point list of actionable insights
## Use Cases        Practical applications and scenarios
## Further Reading  Links and resources mentioned or relevant

Validation Hook

.claude/hooks/validate_write.py runs as a PreToolUse hook before every write to data/summaries/*.md. It blocks the write (exit 2) if:

  • Content is empty or under 500 characters
  • Document doesn't start with # (H1)
  • Any of the 6 required ## sections is missing
  • ## Key Concepts is missing required ### subsections (### Technologies, ### Terms)
  • ## Key Takeaways has no bullet points (- )

The agent sees the error list and retries automatically. Videos whose .md already exists are skipped, making re-runs safe.
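The hook's checks can be sketched as follows (section names come from the summary format above; the function is illustrative, not the hook's actual code):

```python
REQUIRED_SECTIONS = [
    "## Overview", "## Key Concepts", "## How It Works",
    "## Key Takeaways", "## Use Cases", "## Further Reading",
]
REQUIRED_SUBSECTIONS = ["### Technologies", "### Terms"]

def validate_summary(content: str) -> list[str]:
    """Return a list of validation errors; an empty list means the write may proceed."""
    errors = []
    if len(content) < 500:
        errors.append("content empty or under 500 characters")
    if not content.lstrip().startswith("# "):
        errors.append("document must start with an H1 (#)")
    errors += [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in content]
    errors += [f"missing subsection: {s}" for s in REQUIRED_SUBSECTIONS if s not in content]
    # Check that the Key Takeaways body (up to the next ## heading) has bullets.
    parts = content.split("## Key Takeaways", 1)
    if len(parts) == 2 and "- " not in parts[1].split("## ", 1)[0]:
        errors.append("## Key Takeaways has no bullet points")
    return errors
```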

Workflow Manifest

workflow.yaml is a machine-readable manifest that describes this workflow for dashboards, cron, and AI agents. It declares the stages, config location, data layout, and suggested schedule — enabling any consumer to discover, validate, and run the pipeline without workflow-specific knowledge.

Scheduling

The manifest suggests 0 */6 * * * (every 6 hours). To set up with cron:

0 */6 * * * cd /path/to/yt_workflow && uv run python run.py run >> data/cron.log 2>&1

Run state is tracked in data/state.json with timestamps, exit codes, and the last 50 run records.
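Appending a run record while keeping only the last 50 can be sketched as follows (the record fields are assumptions based on the description above, not the actual state.json schema):

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

MAX_RECORDS = 50

def record_run(state_path: Path, exit_code: int) -> dict:
    """Append a run record to state.json, keeping only the last 50 entries."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {"runs": []}
    state["runs"].append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "exit_code": exit_code,
    })
    state["runs"] = state["runs"][-MAX_RECORDS:]
    state_path.write_text(json.dumps(state, indent=2))
    return state

# Usage: after 55 runs, only the most recent 50 records remain.
path = Path(tempfile.mkdtemp()) / "state.json"
for code in [0] * 55:
    state = record_run(path, code)
```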

About

Automated YouTube channel monitor that fetches new videos and generates AI-powered summaries using Claude.
