Skip to content

feat: immutable run bundle / reproducibility artifact #1133

@christso

Description

@christso

Objective

Add an agentv bundle command that compiles an EVAL.yaml and all its referenced assets into a self-contained, portable directory that can be run without access to the original repo.

Problem

An EVAL.yaml references external assets scattered across the filesystem:

  • tests: ./cases.jsonl — external test data
  • workspace: ./template/ — workspace template directory
  • use_target: default — delegation chains in targets.yaml
  • ${{ ENV_VAR }} — environment variable interpolation
  • command: ./scripts/verify.sh — code-grader scripts
  • hooks.before_each.command: ./setup.sh — setup/teardown scripts

Reproducing a run requires the entire repo at the correct commit, the right env vars, and knowledge of which CLI flags were used. Sharing results with a colleague means "clone my repo, checkout commit abc123, set these env vars, run this command."

Design

agentv bundle is a compiler step that resolves and inlines all external references into a self-contained output directory.

agentv bundle my-eval.yaml --output ./bundled-eval/

What it does

  1. Inlines test data — resolves tests: ./cases.jsonl into inline test cases
  2. Copies workspace assets — copies workspace template directory contents into the bundle
  3. Flattens target config — resolves use_target chains into concrete target definitions
  4. Copies scripts — copies code-grader scripts, setup/teardown hooks, prompt files
  5. Records provenance — AgentV version, git commit (if in a repo), timestamp, CLI flags
  6. Optionally resolves env vars--capture-env flag to snapshot resolved ${{ VAR }} values (redacted by default for secrets)

Output structure

bundled-eval/
  eval.yaml              # fully resolved, all references inlined or relative to bundle
  workspace/             # copied workspace template
  scripts/               # copied grader and hook scripts
  data/                  # inlined test data (if external)
  manifest.json          # provenance: agentv version, git commit, timestamp, source paths

Usage

# Bundle an eval
agentv bundle evals/my-benchmark.eval.yaml -o benchmark-v1/

# Run from a bundle (just a normal eval run — the bundle IS an eval directory)
agentv eval benchmark-v1/eval.yaml

# Share it
zip -r benchmark-v1.zip benchmark-v1/

# Inspect provenance
cat benchmark-v1/manifest.json

Key design decisions

  1. The bundle is just a directory with a resolved EVAL.yaml — no new format, no special runtime. agentv eval runs it like any other eval.
  2. Opt-in, not required — normal agentv eval continues to work with repo-based eval files. Bundling is for sharing and reproducibility.
  3. No secrets by default — env var values are NOT captured unless --capture-env is explicitly passed. Manifest records env var names only.
  4. Idempotent — bundling an already-bundled directory is a no-op.

Non-goals

  • Not a Docker image or container
  • Not a replacement for EVAL.yaml as the authoring format
  • Not required for normal eval workflows

Acceptance signals

  • agentv bundle produces a self-contained directory from an EVAL.yaml
  • The bundled directory can be run with agentv eval on a different machine without the original repo
  • manifest.json records provenance (agentv version, git commit, timestamp)
  • External test data, workspace templates, and scripts are all resolved into the bundle
  • Target delegation chains are flattened
  • Env var values are redacted by default

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions