AgenticVBench is a 100-task benchmark for evaluating AI agents on real-world video post-production workflows: Assembly, Repair, Sequencing, and Repurpose. Tasks are authored by 20 industry experts (avg. 6 years of professional experience) and scored on a 0 to 1 scale, mixing programmatic verifiers with rubric-based LLM judges.
Built on Harbor: agent installation, sandboxed execution, concurrency, and trial scoring are handled for you.
2026.06.01: Our paper is now public on arXiv. Read the full description of the benchmark, task families, and verifier design.
| Family | Count | What the agent does |
|---|---|---|
| agentic_vbench_repair | 18 | Restore a localized corruption (color shift, blur, low-res, swapped object, glitch, content cut, disfluency, or audio defect) in a clip. |
| agentic_vbench_assembly | 18 | Pick 4 candidate clips from a pool and place them in the correct slot order to satisfy a prompt. |
| agentic_vbench_sequencing | 28 | Re-order a set of shuffled clips into the correct narrative sequence. |
| agentic_vbench_repurpose | 36 | Re-cut a long-form source video into a short vertical clip that satisfies a per-task creative brief. |
The repair, assembly, and sequencing families score with deterministic per-family judges and ship a bundled oracle solver plus a broken/random baseline; by construction the oracle scores 1.0 and the baseline scores 0.0. See docs/VERIFIER_DESIGN.md for the per-family scoring math. The repurpose family uses a rubric-based LLM-as-judge against a per-task creative brief (deterministic format checks + Gemini/Opus content judges; same 0 to 1 reward shape). Running it requires both GEMINI_API_KEY and ANTHROPIC_API_KEY on the verifier side.
Typical model scores live on the leaderboard; use it to gauge whether your run is in the expected range.
git clone https://github.com/PhiloLabs/agentic-vbench.git
cd agentic-vbench
./scripts/install-harbor.sh
python3 -m venv .venv && .venv/bin/pip install --upgrade pip
Pick any agent supported by Harbor (claude-code, codex, gemini-cli, opencode, ...), export the matching API key, and run via the ./avb CLI (a thin wrapper that auto-injects per-agent and per-family env vars into harbor run):
# Claude Code (Anthropic):
export ANTHROPIC_API_KEY=...
./avb run exp-codec-restore-task01 -a claude-code -m anthropic/claude-sonnet-4-6
# Codex (OpenAI):
export OPENAI_API_KEY=...
./avb run exp-codec-restore-task01 -a codex -m openai/gpt-5.5
For agentic_vbench_repurpose tasks the verifier additionally needs GEMINI_API_KEY (the rubric LLM judge uses Gemini for audio/video grading). Export it and avb will forward it via Harbor's --ve flag. Run ./avb tasks env <task> to see which credentials a given task and agent combination needs.
Bringing your own agent. The four agents listed above are vendor-native Harbor agents that work against the task set out of the box. Custom agents, including open-source harnesses and proprietary stacks, plug into Harbor through a small adapter. See Harbor's agents docs for the adapter contract.
Free smoke test (no agent API spend). Every repair/assembly/sequencing task ships a bundled oracle solver. Use it to confirm the harness is wired up end-to-end:
./avb run exp-codec-restore-task01 -a oracle -e docker
# reward.json → ≈ 1.0, ~30 s on a cached image, zero agent cost
Time + cost budget. A real-agent rollout on Modal typically takes ~10 min of wall-clock time per task. Cost depends entirely on the agent and model: on the order of $0.10 to $2 per task with mid-tier models, scaling linearly with the agent's token use. Plan accordingly for a 100-task sweep.
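To turn those per-task figures into a rough sweep budget, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not a measurement: it assumes every task takes the same ~10 min, that tasks run in full waves, and a concurrency of 20 chosen purely for illustration. The function name `estimate_sweep` is hypothetical.

```python
import math


def estimate_sweep(num_tasks=100, minutes_per_task=10.0, max_parallel=20,
                   cost_low=0.10, cost_high=2.00):
    """Rough wall-clock and spend for a sweep, assuming uniform task duration."""
    # Tasks run in waves of `max_parallel`; each wave lasts about one task's wall clock.
    waves = math.ceil(num_tasks / max_parallel)
    return {
        "wall_clock_min": waves * minutes_per_task,
        "cost_low_usd": round(num_tasks * cost_low, 2),
        "cost_high_usd": round(num_tasks * cost_high, 2),
    }


if __name__ == "__main__":
    # 100 tasks, 20 at a time: 5 waves of ~10 min, and $10 to $200 in agent spend.
    print(estimate_sweep())
```

Real runs will deviate: task durations vary, and verifier-side judge calls for the repurpose family are not included in the per-task agent cost above.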
Here's what a task prompt actually looks like (exp-codec-restore-task01):
I have a short mono speech recording at /workspace/materials/noisy.wav. For a stretch in it, the audio sounds muffled, like the high end has been chopped off and the voice lost its sparkle. The rest of the recording sounds clean and full.
Please restore the muffled stretch so it sounds as clear and full as the rest of the recording. Leave the already-clean parts unchanged.
/workspace/output/enhanced.wav: 16-bit PCM mono at 16 kHz, same total length (sample count) as the input.
Each task ships its own such brief at tasks/<family>/<task>/steps/solve/instruction.md.
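The output constraints in the sample brief (16-bit PCM, mono, 16 kHz, same sample count as the input) are easy to check locally before a trial ends. The sketch below uses only Python's standard `wave` module; `check_output_wav` is a hypothetical helper, not the benchmark's verifier, and it checks the container format only, not audio quality.

```python
import wave


def check_output_wav(input_path, output_path):
    """Return a list of format problems; an empty list means the format matches the brief."""
    problems = []
    with wave.open(input_path, "rb") as src, wave.open(output_path, "rb") as out:
        if out.getnchannels() != 1:
            problems.append(f"expected mono, got {out.getnchannels()} channels")
        if out.getsampwidth() != 2:  # sample width is in bytes: 2 bytes = 16-bit PCM
            problems.append(f"expected 16-bit samples, got {8 * out.getsampwidth()}-bit")
        if out.getframerate() != 16000:
            problems.append(f"expected 16000 Hz, got {out.getframerate()} Hz")
        if out.getnframes() != src.getnframes():
            problems.append(
                f"expected {src.getnframes()} frames, got {out.getnframes()}"
            )
    return problems
```

For the sample task this would be called as `check_output_wav("/workspace/materials/noisy.wav", "/workspace/output/enhanced.wav")`.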
Inspect the result. Each trial drops four artifacts under jobs/<job-name>/<trial-id>/:
| File | What it is |
|---|---|
| steps/solve/verifier/reward.json | Final score + per-metric breakdown. |
| agent/trajectory.json | Full event stream Harbor captured for the agent (tool calls, tool results, model messages, final output). |
| result.json | Per-trial Harbor summary (timings, exit codes, exception info). |
| trial.log | Combined stdout/stderr stream for the whole trial. |
./avb results show # rewards from the latest job
cat jobs/<job-name>/*/steps/solve/verifier/reward.json
# {
# "reward": 0.55,
# "details": { "reason": "ok", ... }
# }
While a trial is running, ./avb run prints tail -F jobs/<job>/*/trial.log; copy that command to watch progress in another shell.
Run ./avb -h to see every subcommand (tasks list / check / env, run, rollout, results show).
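If you want the rewards in a script rather than on the terminal, the reward.json files can be read directly. The sketch below assumes the layout described above (jobs/<job-name>/<trial-id>/steps/solve/verifier/reward.json with a top-level "reward" field); `collect_rewards` and `summarize` are hypothetical helper names, not part of the avb CLI.

```python
import json
from pathlib import Path
from statistics import mean


def collect_rewards(job_dir):
    """Map trial id -> reward for every trial under job_dir that has a reward.json."""
    rewards = {}
    for path in sorted(Path(job_dir).glob("*/steps/solve/verifier/reward.json")):
        # <trial-id>/steps/solve/verifier/reward.json: the trial dir is 4 levels up.
        trial_id = path.parents[3].name
        rewards[trial_id] = float(json.loads(path.read_text())["reward"])
    return rewards


def summarize(rewards):
    """Basic aggregate statistics over a {trial_id: reward} mapping."""
    values = list(rewards.values())
    if not values:
        return {"trials": 0}
    return {
        "trials": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }
```

Trials that crashed before the verifier ran have no reward.json and are silently skipped here, so compare the trial count against the number of tasks you launched.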
export ANTHROPIC_API_KEY=...
export MODAL_TOKEN_ID=... MODAL_TOKEN_SECRET=...
./avb rollout --family repair --agent claude-code --env modal --max-parallel 20
Same pattern for the other families. Per-task rewards land in logs/rollout-results.tsv; full per-trial artifacts (agent trajectory, verifier breakdown) land under jobs/<job-name>/.
Once you've run all 100 tasks, zip the jobs/ directory (which contains the reward.json, trajectory.json, and result.json for each trial) and follow the submission flow at agenticvbench.com; that page has the email template and a Google Drive link prompt. Reviewers verify that every task in the suite has an intact trajectory and that scores fall in [0, 1], then publish to the leaderboard.
Any executor that Harbor supports works: pass -e <executor> to harbor run (or ./avb run). Common picks:
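Before zipping, it can help to run the same two checks the reviewers apply. The sketch below is a local pre-check under stated assumptions: every directory two levels below jobs/ is treated as a trial (jobs/<job-name>/<trial-id>/), and the three required files sit at the paths listed in the artifact table. `precheck` and `make_archive` are hypothetical names; the official review process may check more than this.

```python
import json
import shutil
from pathlib import Path

# Relative paths every trial directory is expected to contain.
REQUIRED = (
    "steps/solve/verifier/reward.json",
    "agent/trajectory.json",
    "result.json",
)


def precheck(jobs_dir):
    """Return a list of problems: missing artifacts or rewards outside [0, 1]."""
    problems = []
    trials = sorted(p for p in Path(jobs_dir).glob("*/*") if p.is_dir())
    for trial in trials:
        for rel in REQUIRED:
            if not (trial / rel).is_file():
                problems.append(f"{trial}: missing {rel}")
        reward_file = trial / REQUIRED[0]
        if reward_file.is_file():
            reward = json.loads(reward_file.read_text()).get("reward")
            if not isinstance(reward, (int, float)) or not 0.0 <= reward <= 1.0:
                problems.append(f"{trial}: reward {reward!r} not in [0, 1]")
    return problems


def make_archive(jobs_dir, out_base="submission"):
    """Zip the jobs directory (keeping its top-level folder name); return the zip path."""
    jobs_dir = Path(jobs_dir).resolve()
    return shutil.make_archive(
        out_base, "zip", root_dir=str(jobs_dir.parent), base_dir=jobs_dir.name
    )
```

A typical use is to call `precheck("jobs")`, fix or re-run anything it reports, and only then call `make_archive("jobs")`.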
| Executor | Use when | Required env |
|---|---|---|
| docker | Local sanity checks, single-task debugging | none |
| modal | Large parallel runs across the suite | MODAL_TOKEN_ID, MODAL_TOKEN_SECRET |
| daytona | Cloud sandboxes (alternative to Modal) | DAYTONA_API_KEY |
| e2b | Sandbox-as-a-service for code execution | E2B_API_KEY |
| runloop | Long-running cloud workspaces | RUNLOOP_API_KEY |
Plus apple_container, gke, singularity, tensorlake, and anything else Harbor adds; see the --env choices in harbor run -h for the live list.
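A missing credential usually surfaces only after a sandbox has started, so a quick check up front can save a failed run. The sketch below encodes the executor table above as a dictionary; `missing_env` is a hypothetical helper, and executors not in the table (apple_container, gke, and so on) are deliberately rejected rather than guessed at.

```python
import os

# Required environment variables per executor, copied from the table above.
REQUIRED_ENV = {
    "docker": [],
    "modal": ["MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET"],
    "daytona": ["DAYTONA_API_KEY"],
    "e2b": ["E2B_API_KEY"],
    "runloop": ["RUNLOOP_API_KEY"],
}


def missing_env(executor, environ=None):
    """Return the required variables that are unset or empty for this executor."""
    environ = os.environ if environ is None else environ
    if executor not in REQUIRED_ENV:
        raise KeyError(f"no known requirements for executor {executor!r}")
    return [name for name in REQUIRED_ENV[executor] if not environ.get(name)]


if __name__ == "__main__":
    for name in REQUIRED_ENV:
        gaps = missing_env(name)
        print(f"{name}: {'ready' if not gaps else 'missing ' + ', '.join(gaps)}")
```

This covers executor credentials only; agent keys such as ANTHROPIC_API_KEY and verifier keys such as GEMINI_API_KEY are reported by ./avb tasks env <task>.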
All task materials are hosted on Hugging Face under ameddserM/agentic_vbench_video_* and baked into each task's Docker image at build time, so the same image runs on any executor without provider-specific configuration.
agentic-vbench/
├── tasks/                             # 100 Harbor task directories
│   ├── agentic_vbench_repair/         # 18 repair tasks
│   ├── agentic_vbench_assembly/       # 18 assembly tasks
│   ├── agentic_vbench_sequencing/     # 28 sequencing tasks
│   └── agentic_vbench_repurpose/      # 36 repurpose tasks
├── scripts/
│   ├── install-harbor.sh              # Harbor CLI pin
│   ├── parallel_rollout.py            # batched rollout + reward collection
│   ├── monitor_job.py                 # tail a running trial
│   └── _task_paths.py                 # task-name → path resolver
├── docs/VERIFIER_DESIGN.md            # per-family scoring math
└── README.md, LICENSE, AGENTS.md
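Given this layout, mapping a task name to its directory and brief takes only a few lines. The sketch below is an illustration of the idea, not the contents of scripts/_task_paths.py: it assumes task directories sit at tasks/<family>/<task>/ and that task names are unique across families. `resolve_task` and `instruction_path` are hypothetical names.

```python
from pathlib import Path


def resolve_task(task_name, tasks_root="tasks"):
    """Find the single tasks/<family>/<task_name>/ directory, or raise LookupError."""
    matches = sorted(
        p for p in Path(tasks_root).glob(f"*/{task_name}") if p.is_dir()
    )
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one match for {task_name!r}, found {len(matches)}"
        )
    return matches[0]


def instruction_path(task_name, tasks_root="tasks"):
    """Path to the task's brief: tasks/<family>/<task>/steps/solve/instruction.md."""
    return resolve_task(task_name, tasks_root) / "steps" / "solve" / "instruction.md"
```

For example, `instruction_path("exp-codec-restore-task01")` would point at the brief quoted earlier, provided the repository root is the working directory.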
