
PhiloLabs/agentic-vbench


agentic-vbench

Website Paper Leaderboard

AgenticVBench: four task families (Assembly, Repair, Sequencing, Repurpose)

AgenticVBench is a 100-task benchmark for evaluating AI agents on real-world video post-production workflows: Assembly, Repair, Sequencing, and Repurpose. Tasks are authored by 20 industry experts (avg. 6 years of professional experience) and scored on a 0–1 scale, mixing programmatic verifiers with rubric-based LLM judges.

Built on Harbor: agent installation, sandboxed execution, concurrency, and trial scoring are handled for you.


🔥 News

  • 2026.06.01 📖 Our paper is now public on arXiv. Read it for the full description of the benchmark, task families, and verifier design.

📊 What's in the suite

| Family | Count | What the agent does |
| --- | --- | --- |
| agentic_vbench_repair | 18 | Restore a localized corruption (color shift, blur, low-res, swapped object, glitch, content cut, disfluency, or audio defect) in a clip. |
| agentic_vbench_assembly | 18 | Pick 4 candidate clips from a pool and place them in the correct slot order to satisfy a prompt. |
| agentic_vbench_sequencing | 28 | Re-order a set of shuffled clips into the correct narrative sequence. |
| agentic_vbench_repurpose | 36 | Re-cut a long-form source video into a short vertical clip that satisfies a per-task creative brief. |

The repair, assembly, and sequencing families score with deterministic per-family judges and ship a bundled oracle solver plus a broken/random baseline; by construction the oracle scores 1.0 and the baseline scores 0.0. See docs/VERIFIER_DESIGN.md for the per-family scoring math. The repurpose family uses a rubric-based LLM-as-judge against a per-task creative brief (deterministic format checks plus Gemini/Opus content judges; same 0–1 reward shape). Running it requires both GEMINI_API_KEY and ANTHROPIC_API_KEY on the verifier side.

Typical model scores live on the leaderboard; use it to gauge whether your run is in the expected range.


🚀 Quick start

1. Install

git clone https://github.com/PhiloLabs/agentic-vbench.git
cd agentic-vbench
./scripts/install-harbor.sh
python3 -m venv .venv && .venv/bin/pip install --upgrade pip

2. Run one task with an agent

Pick any agent supported by Harbor (claude-code, codex, gemini-cli, opencode, …), export the matching API key, and run via the ./avb CLI (a thin wrapper that auto-injects per-agent + per-family env vars into harbor run):

# Claude Code (Anthropic):
export ANTHROPIC_API_KEY=...
./avb run exp-codec-restore-task01 -a claude-code -m anthropic/claude-sonnet-4-6

# Codex (OpenAI):
export OPENAI_API_KEY=...
./avb run exp-codec-restore-task01 -a codex -m openai/gpt-5.5

For agentic_vbench_repurpose tasks the verifier additionally needs GEMINI_API_KEY (the rubric LLM judge uses Gemini for audio/video grading); export it and avb will forward it via Harbor's --ve flag. The Opus content judge also needs ANTHROPIC_API_KEY on the verifier side, so export that too if your agent does not already use it. Run ./avb tasks env <task> to see which credentials a given task and agent combination needs.

Bringing your own agent. The four agents listed above are vendor-native Harbor agents that work against the task set out of the box. Custom agents, including open-source harnesses and proprietary stacks, plug into Harbor through a small adapter. See Harbor's agents docs for the adapter contract.

Free smoke test (no agent API spend). Every repair/assembly/sequencing task ships a bundled oracle solver. Use it to confirm the harness is wired up end-to-end:

./avb run exp-codec-restore-task01 -a oracle -e docker
# reward.json → ≈ 1.0, ~30 s on a cached image, zero agent cost

Time + cost budget. A real-agent rollout on Modal typically takes ~10 min of wall-clock time per task. Cost depends entirely on agent + model: as an order of magnitude, expect $0.10–$2 per task with mid-tier models, scaling linearly with the agent's token use. Plan accordingly for a 100-task sweep.
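
To make the budget concrete, here is a rough back-of-the-envelope sketch in Python. It uses the figures quoted above (~10 min and $0.10–$2 per task) together with the --max-parallel 20 setting from the rollout example. The function name sweep_budget and the assumption that tasks run in full, equal-length waves are illustrative, not part of the avb tooling.

```python
import math


def sweep_budget(n_tasks=100, minutes_per_task=10.0, max_parallel=20,
                 cost_low=0.10, cost_high=2.00):
    """Rough wall-clock and spend estimate for a sweep.

    Assumes every task takes the same time and that tasks run in
    waves of `max_parallel`; real runs will vary.
    """
    waves = math.ceil(n_tasks / max_parallel)
    return {
        "wall_clock_min": waves * minutes_per_task,
        "cost_low_usd": round(n_tasks * cost_low, 2),
        "cost_high_usd": round(n_tasks * cost_high, 2),
    }


print(sweep_budget())
# {'wall_clock_min': 50.0, 'cost_low_usd': 10.0, 'cost_high_usd': 200.0}
```

Under these assumptions a full 100-task sweep lands at roughly 50 minutes of wall-clock time and somewhere between $10 and $200 in agent spend, before any verifier-side judge costs.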

Here's what a task prompt actually looks like (exp-codec-restore-task01):

Restore A Muffled Stretch Of Audio

I have a short mono speech recording at /workspace/materials/noisy.wav. For a stretch in it, the audio sounds muffled, like the high end has been chopped off and the voice lost its sparkle. The rest of the recording sounds clean and full.

Please restore the muffled stretch so it sounds as clear and full as the rest of the recording. Leave the already-clean parts unchanged.

What to deliver

  • /workspace/output/enhanced.wav: 16-bit PCM mono at 16 kHz, same total length (sample count) as the input.

Each task ships its own such brief at tasks/<family>/<task>/steps/solve/instruction.md.
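
The deliverable spec in the example brief (16-bit PCM, mono, 16 kHz, same sample count as the input) is easy to sanity-check before the verifier runs. The sketch below uses only Python's standard wave module; check_deliverable is a hypothetical helper, not the benchmark's verifier, and it checks the container format only, not audio quality.

```python
import wave


def check_deliverable(input_path, output_path):
    """Return a list of format problems; an empty list means the format matches.

    Note: the wave module only reads PCM WAV files, so a non-PCM file
    raises wave.Error instead of returning a problem list.
    """
    problems = []
    with wave.open(input_path, "rb") as src, wave.open(output_path, "rb") as out:
        if out.getnchannels() != 1:
            problems.append(f"expected mono, got {out.getnchannels()} channels")
        if out.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(f"expected 16-bit samples, got {8 * out.getsampwidth()}-bit")
        if out.getframerate() != 16000:
            problems.append(f"expected 16000 Hz, got {out.getframerate()} Hz")
        if out.getnframes() != src.getnframes():
            problems.append(
                f"expected {src.getnframes()} samples, got {out.getnframes()}"
            )
    return problems
```

Inside the task sandbox you could call check_deliverable("/workspace/materials/noisy.wav", "/workspace/output/enhanced.wav") and treat any non-empty result as a reason to re-export the file.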

Inspect the result. Each trial drops four artifacts under jobs/<job-name>/<trial-id>/:

| File | What it is |
| --- | --- |
| steps/solve/verifier/reward.json | Final score + per-metric breakdown. |
| agent/trajectory.json | Full event stream Harbor captured for the agent (tool calls, tool results, model messages, final output). |
| result.json | Per-trial Harbor summary (timings, exit codes, exception info). |
| trial.log | Combined stdout/stderr stream for the whole trial. |

./avb results show          # rewards from the latest job
cat jobs/<job-name>/*/steps/solve/verifier/reward.json
# {
#   "reward": 0.55,
#   "details": { "reason": "ok", ... }
# }
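
If you want the rewards in a script rather than on the terminal, a small sketch along these lines can walk one job directory and average the scores. It relies only on the jobs/<job-name>/<trial-id>/steps/solve/verifier/reward.json layout and the "reward" key shown above; collect_rewards and summarize are hypothetical helper names, not part of avb.

```python
import json
from pathlib import Path
from statistics import mean


def collect_rewards(job_dir):
    """Map trial id -> reward for every trial under one job directory."""
    rewards = {}
    pattern = "*/steps/solve/verifier/reward.json"
    for path in sorted(Path(job_dir).glob(pattern)):
        # <trial-id>/steps/solve/verifier/reward.json -> the trial dir is 4 levels up
        trial_id = path.parents[3].name
        with path.open(encoding="utf-8") as f:
            rewards[trial_id] = float(json.load(f)["reward"])
    return rewards


def summarize(rewards):
    """Return the trial count and mean reward (None when there are no trials)."""
    if not rewards:
        return {"trials": 0, "mean_reward": None}
    return {"trials": len(rewards), "mean_reward": mean(rewards.values())}
```

Calling summarize(collect_rewards("jobs/<job-name>")) gives a quick single-number view of a job; trials that crashed before writing reward.json are simply absent from the result, so compare the trial count against the number of tasks you launched.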

While a trial is running, ./avb run prints a tail -F jobs/<job>/*/trial.log command; copy it to watch progress in another shell.

Run ./avb -h to see every subcommand (tasks list / check / env, run, rollout, results show).

3. Run a full family in parallel

export ANTHROPIC_API_KEY=...
export MODAL_TOKEN_ID=... MODAL_TOKEN_SECRET=...

./avb rollout --family repair --agent claude-code --env modal --max-parallel 20

Same pattern for the other families. Per-task rewards land in logs/rollout-results.tsv; full per-trial artifacts (agent trajectory, verifier breakdown) under jobs/<job-name>/.


🏆 Submitting to the leaderboard

Once you've run all 100 tasks, zip the jobs/ directory (which contains the reward.json, trajectory.json, and result.json for each trial) and follow the submission flow at agenticvbench.com; that page has the email template and a Google Drive link prompt. Reviewers verify that every task in the suite has an intact trajectory and that scores fall in [0, 1], then publish to the leaderboard.
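
Before zipping, it can help to run a local check that mirrors what reviewers look for. The sketch below is an assumption about what "intact" means (each expected file exists and parses as JSON, and the reward is a number in [0, 1]); the actual review procedure is not specified here, and check_trial / check_jobs are hypothetical helper names. It also assumes every subdirectory of a job directory is a trial.

```python
import json
from pathlib import Path

# Per-trial files named in the artifact table above.
REQUIRED = (
    "steps/solve/verifier/reward.json",
    "agent/trajectory.json",
    "result.json",
)


def check_trial(trial_dir):
    """Return a list of problems for one trial directory (empty means OK)."""
    trial_dir = Path(trial_dir)
    problems = []
    for rel in REQUIRED:
        path = trial_dir / rel
        if not path.is_file():
            problems.append(f"missing {rel}")
            continue
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append(f"unparseable {rel}")
            continue
        if rel.endswith("reward.json"):
            reward = data.get("reward") if isinstance(data, dict) else None
            if not isinstance(reward, (int, float)) or not 0.0 <= reward <= 1.0:
                problems.append(f"reward out of [0, 1]: {reward!r}")
    return problems


def check_jobs(jobs_root):
    """Map trial path -> problems for every failing trial under jobs/."""
    report = {}
    for trial_dir in sorted(Path(jobs_root).glob("*/*")):
        if trial_dir.is_dir():
            problems = check_trial(trial_dir)
            if problems:
                report[str(trial_dir)] = problems
    return report
```

An empty dictionary from check_jobs("jobs") means no trial failed these checks; it does not confirm that all 100 tasks are present, so count the trial directories separately.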


⚙️ Supported executors

Any executor that Harbor supports works; pass -e <executor> to harbor run (or ./avb run). Common picks:

| Executor | Use when | Required env |
| --- | --- | --- |
| docker | Local sanity checks, single-task debugging | none |
| modal | Large parallel runs across the suite | MODAL_TOKEN_ID, MODAL_TOKEN_SECRET |
| daytona | Cloud sandboxes (alternative to Modal) | DAYTONA_API_KEY |
| e2b | Sandbox-as-a-service for code execution | E2B_API_KEY |
| runloop | Long-running cloud workspaces | RUNLOOP_API_KEY |

Plus apple_container, gke, singularity, tensorlake, and anything else Harbor adds; see the --env choices in harbor run -h for the live list.
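
A launch script can fail fast when an executor's credentials are missing. The sketch below hard-codes the "Required env" column from the table above; EXECUTOR_ENV and missing_env are hypothetical names, and the mapping covers only the five executors listed there, so treat it as a starting point rather than Harbor's own validation.

```python
import os

# Required environment variables per executor, copied from the table above.
EXECUTOR_ENV = {
    "docker": [],
    "modal": ["MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET"],
    "daytona": ["DAYTONA_API_KEY"],
    "e2b": ["E2B_API_KEY"],
    "runloop": ["RUNLOOP_API_KEY"],
}


def missing_env(executor, env=None):
    """Return the required variables that are unset or empty for an executor."""
    env = os.environ if env is None else env
    if executor not in EXECUTOR_ENV:
        raise KeyError(f"no env requirements recorded for executor {executor!r}")
    return [name for name in EXECUTOR_ENV[executor] if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env("modal")
    if missing:
        print("export these before running on modal:", ", ".join(missing))
```

Passing an explicit dictionary as env makes the check easy to unit-test without touching the real environment.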

All task materials are hosted on Hugging Face under ameddserM/agentic_vbench_video_* and baked into each task's Docker image at build time, so the same image runs on any executor without provider-specific configuration.


📁 Repo layout

agentic-vbench/
├── tasks/                              # 100 Harbor task directories
│   ├── agentic_vbench_repair/          # 18 repair tasks
│   ├── agentic_vbench_assembly/        # 18 assembly tasks
│   ├── agentic_vbench_sequencing/      # 28 sequencing tasks
│   └── agentic_vbench_repurpose/       # 36 repurpose tasks
├── scripts/
│   ├── install-harbor.sh               # Harbor CLI pin
│   ├── parallel_rollout.py             # batched rollout + reward collection
│   ├── monitor_job.py                  # tail a running trial
│   └── _task_paths.py                  # task-name → path resolver
├── docs/VERIFIER_DESIGN.md             # per-family scoring math
└── README.md, LICENSE, AGENTS.md

About

AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?
