███████╗ ██████╗ ██████╗ █████╗ ██╗ ██╗ ██████╗ ██████╗
██╔════╝██╔═══██╗██╔══██╗██╔══██╗ ██║ ██║██╔════╝ ██╔════╝
███████╗██║ ██║██████╔╝███████║ ██║ ██║██║ ███╗██║
╚════██║██║ ██║██╔══██╗██╔══██║ ██║ ██║██║ ██║██║
███████║╚██████╔╝██║ ██║██║ ██║ ╚██████╔╝╚██████╔╝╚██████╗
╚══════╝ ╚═════╝ ╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝ ╚═════╝ ╚═════╝
AI-generated UGC videos that look like a real person pulled out their phone and started talking.
Brands pay $500–$5,000 per UGC video from human creators. This skill produces them for $5–$50 in API costs.
Most AI video looks like AI video. Cinematic drone shots. Perfect lighting. Orchestral energy. Nobody scrolls past that thinking "real person" — they think "ad" and keep moving.
UGC works because it looks like someone pulled out their phone and talked.
This skill makes AI produce that.
Not by adding filters after. By building the entire pipeline around anti-polish: persona research that steals real customer language, first frames that look like iPhone photos, motion prompts that produce handheld energy, audio that sounds like a kitchen not a studio, and post-production that adds grain instead of removing it.
Brand + Audience
↓
┌─────────────────────────────┐
│ 1. Persona research │ ← mines real reviews for exact language
│ 2. Creator profile │ ← persistent AI "creator" with locked identity
│ 3. Format + script │ ← 6 formats with shot-by-shot enforcement
│ 4. First frame (Nano Banana)│ ← iPhone-realistic, not AI-looking
│ 5. A-roll (Sora 2) │ ← talking head with synced voice + lip sync
│ 6. B-roll (Kling 3) │ ← fast contextual scenes, env-matched
│ 7. Audio orchestration │ ← native voice, continuous over B-roll
│ 8. Post-production │ ← color grade, grain, frame rate
│ 9. Captions │ ← native platform-style overlays
│ 10. Virality scoring │ ← 7-criteria gate, only 70+ publishes
└─────────────────────────────┘
↓
Scroll-stopping UGC video
| Format | What it is | Duration |
|---|---|---|
| Talking Head | One person, one camera, honest review | 15-25s |
| Hook Face + Demo | Emotive face stops scroll → product demo | 15s max |
| Podcast Clip | Fake podcast guest — mic, headphones, authority | 8-20s |
| Wall of Text | Animated person + dense text overlay | 4-8s |
| Visual Transformation | Named concept ("The Scroll Trap") + before/after | 10-25s |
| Hybrid Transformation | Talking head bookends + slideshow mechanism bridge | 20-30s |
| Key | Required | What it does |
|---|---|---|
| `FAL_KEY` | Yes | Sora 2 (talking head video) + Kling 3 (B-roll) via fal.ai |
| `REPLICATE_API_TOKEN` | Yes | Nano Banana (first frame image generation) |
| `OPENROUTER_API_KEY` | Recommended | Gemini via OpenRouter (virality scoring + analysis) |
| `ELEVENLABS_API_KEY` | Optional | Only for multi-clip voice consistency (S2S) |
Get keys: fal.ai · Replicate · OpenRouter · ElevenLabs
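If you want to confirm the keys from Python rather than the shell, a minimal sketch (key names come from the table above; `scripts/check-deps.sh` remains the canonical checker):

```python
import os

# Key names taken from the table above; levels are this sketch's labels.
KEYS = {
    "FAL_KEY": "required",
    "REPLICATE_API_TOKEN": "required",
    "OPENROUTER_API_KEY": "recommended",
    "ELEVENLABS_API_KEY": "optional",
}

for key, level in KEYS.items():
    status = "set" if os.environ.get(key) else f"missing ({level})"
    print(f"{key}: {status}")
```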
- Python 3 with PIL (`sudo apt install python3-pil`)
- ffmpeg with libfreetype (`sudo apt install ffmpeg`; NOT the Homebrew build)
- Inter font (auto-downloaded by the dependency checker)
- Run the dependency check (`bash scripts/check-deps.sh`) to make sure everything's configured
- Tell the skill what brand and who the audience is — it handles persona research
- Pick a format — the skill recommends one based on your goal
- Approve the script — it writes one mapped to the format's shot breakdown
- Generate the first frame — review it before committing to video
- Generate video + B-roll — Sora for talking head, Kling for B-roll scenes
- Post-production + captions — automated color grade, grain, caption overlay
- Virality score — only publish if it scores 70+
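The post-production step is, at its core, an ffmpeg filter chain. A hedged sketch of what such a pass might look like; the filenames and filter values here are illustrative assumptions, not the skill's actual settings (its own scripts run the real pass):

```python
def grade_cmd(src: str, dst: str, contrast: float = 1.05,
              saturation: float = 1.1, grain: int = 6, fps: int = 24) -> list[str]:
    """Build an ffmpeg command: slight contrast/saturation lift,
    temporal film grain, and a frame-rate conform."""
    vf = (f"eq=contrast={contrast}:saturation={saturation},"
          f"noise=alls={grain}:allf=t,fps={fps}")
    return ["ffmpeg", "-y", "-i", src, "-vf", vf, "-c:a", "copy", dst]

print(" ".join(grade_cmd("clip-raw.mp4", "clip-graded.mp4")))
```

`noise=alls=…:allf=t` is ffmpeg's film-grain-style temporal noise; adding it (rather than denoising) is the anti-polish move described above.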
- Sora's native voice always beats ElevenLabs TTS for talking-head clips. TTS sounds fake; Sora generates voice and lip sync together.
- B-roll must be environment-matched. Extract a frame from the A-roll → feed to Kling. Generic B-roll looks like stock footage.
- Captions go LAST — after post-production. Grain degrades caption pills.
- AI cannot generate realistic UI/app screens. Use real screenshots for demos.
- Describe audio by how it sounds, not the gear. "Clean, natural, close and present" works. "Shure SM7B" doesn't.
- ~1 in 3 Sora generations have hand artifacts. Reroll, don't fix the prompt.
- Multi-frame formats: chain from frame 1. Generate the canonical face first, then reference it for every subsequent frame. Parallel generation causes face drift.
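For the environment-matching rule above, grabbing a frame from the end of the A-roll to hand to Kling is a one-liner. A sketch, with assumed filenames:

```python
import subprocess

def last_frame(a_roll: str, out_png: str) -> list[str]:
    """ffmpeg command that grabs one frame ~0.1s before the end of the
    A-roll, to feed Kling as an environment reference."""
    return ["ffmpeg", "-y", "-sseof", "-0.1", "-i", a_roll,
            "-frames:v", "1", "-update", "1", out_png]

# Once an A-roll exists:
# subprocess.run(last_frame("a-roll.mp4", "env-frame.png"), check=True)
```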
- Baseline comparison: `evals/baseline-vs-scrollclaw.md`
- Execution checks: `evals/execution-evals.md`
- Trigger checks: `evals/trigger-evals.md`
scrollclaw/
├── SKILL.md Router — orchestrates the full suite
├── README.md You are here
├── _system/ Shared doctrine + context protocol
│ ├── SKILL.md Core doctrine, format selection, pipeline routing
│ └── references/
│ ├── brand-campaign-context.md Brand memory + campaign workspace protocol
│ ├── color-reference-system.md
│ ├── creator-system.md
│ ├── format-library.md
│ ├── hook-emotions.md
│ └── taste-calibration.md
├── persona/ Step 1: Persona research + scripting
│ ├── SKILL.md
│ └── references/
│ ├── persona-research.md
│ └── script-voice.md
├── first-frame/ Step 2: Canonical frame generation
│ ├── SKILL.md
│ └── references/
│ ├── first-frame-prompting.md
│ └── first-frame-psychology.md
├── animate/ Step 3: A-roll (Sora 2)
│ ├── SKILL.md
│ └── references/
│ ├── motion-prompting.md
│ └── sora-api.md
├── b-roll/ Step 4: B-roll (Kling 3)
│ ├── SKILL.md
│ └── references/
│ ├── kling-api.md
│ └── orchestrator.md
├── assemble/ Step 5: Stitch, post, captions, or full-assemble
│ ├── SKILL.md
│ └── references/
│ ├── audio-orchestration.md
│ ├── green-zone.md
│ ├── orchestrator.md
│ ├── post-production.md
│ └── voice-system.md
├── score/ Step 6: Virality scoring gate
│ ├── SKILL.md
│ └── references/
│ └── virality-scoring.md
├── scripts/ 10 automation scripts
├── evals/ Baseline, trigger, and execution benchmarks
└── assets/ Campaign brief template
ScrollClaw persists work across sessions, so campaign 10 takes a fraction of the effort campaign 1 did.
workspace/
├── brand/ ← Read-only for ScrollClaw (GrowthClaw or manual)
│ ├── voice-profile.md ← Informs script tone
│ ├── positioning.md ← Informs persona research direction
│ └── audience.md ← Anchors creator archetype selection
├── creators/ ← Global creator profiles (reuse across campaigns)
└── campaigns/<slug>/
├── brief.md ← Campaign brief
├── persona-research.md ← Extracted customer language
├── creators/ ← Campaign-specific creator overrides
├── scripts/ ← Approved scripts
├── frames/ ← First frames + context frames
├── clips/ ← A-roll, B-roll, assembled finals
├── scores/ ← Virality score cards
├── output-log.md ← All prompt params (append-only)
└── learnings.md ← What worked, what didn't (append-only)
Full context protocol: _system/references/brand-campaign-context.md
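If you want to pre-create the workspace by hand, the layout above is a few lines of Python. A sketch; the directory names are taken from the tree, and the slug is a hypothetical example:

```python
from pathlib import Path

def scaffold(workspace: str, slug: str) -> None:
    """Pre-create the workspace tree described above (idempotent)."""
    camp = Path(workspace, "campaigns", slug)
    for d in (Path(workspace, "brand"), Path(workspace, "creators"),
              camp / "creators", camp / "scripts", camp / "frames",
              camp / "clips", camp / "scores"):
        d.mkdir(parents=True, exist_ok=True)

scaffold("workspace", "demo-campaign")
```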
Built by Matt Berman · Emerald Digital · Big Players Newsletter
Full documentation: _system/SKILL.md