Description
Objective
Add a git-native storage backend for eval run artifacts, inspired by entireio/cli. Eval results (metadata, transcripts, scores) are committed to a configurable branch and remote, making runs self-contained, versionable, and independent of cloud storage.
Design
Storage model
Each eval run produces one commit containing:
<runId[:2]>/<runId[2:]>/
  metadata.json      # run config, model, timestamps, scores, source commit/repo
  transcript.jsonl   # full session transcript (redacted)
  summary.json       # condensed stats for indexing
Sharding by first two hex chars of run ID (up to 256 buckets) prevents directory bloat.
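The sharding rule can be sketched as a small path builder. This is only an illustration: the function name `shard_path`, its `prefix` parameter, and the assumption that run IDs are lowercase hex strings (as in the `a3b2c4d5e6f78901` example in this issue) are not part of the proposal.

```python
import re
from typing import Optional

# Assumption: run IDs are lowercase hex and long enough to split after two chars.
_HEX_RUN_ID = re.compile(r"[0-9a-f]{3,}")


def shard_path(run_id: str, prefix: Optional[str] = None) -> str:
    """Return '<runId[:2]>/<runId[2:]>', optionally under a directory prefix."""
    if not _HEX_RUN_ID.fullmatch(run_id):
        raise ValueError(f"run ID must be lowercase hex: {run_id!r}")
    sharded = f"{run_id[:2]}/{run_id[2:]}"
    if prefix:
        # Normalise so '.agentv/runs' and '.agentv/runs/' behave the same.
        return f"{prefix.strip('/')}/{sharded}"
    return sharded
```

With two hex characters as the shard key there are at most 16 * 16 = 256 top-level directories, which matches the bucket count stated above.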
Branch creation logic
Branch exists?
  YES → commit to it (orphan, regular, or main; it doesn't matter)
  NO → is it the repo's default branch?
    YES → error (refuse to create main/master)
    NO → create as orphan, commit to it
Once a branch exists, all operations are identical regardless of how it was created.
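The decision tree above could be expressed as a pure function, which keeps it easy to unit test separately from any git calls. The names `BranchAction` and `decide_branch_action` are hypothetical, and how the set of existing branches and the default branch are discovered is left out of this sketch.

```python
from enum import Enum
from typing import Set


class BranchAction(Enum):
    USE_EXISTING = "use_existing"    # branch exists: commit to it as-is
    CREATE_ORPHAN = "create_orphan"  # new, non-default branch: create as orphan
    REFUSE = "refuse"                # new default branch: never create main/master


def decide_branch_action(
    branch: str, existing_branches: Set[str], default_branch: str
) -> BranchAction:
    """Apply the exists / orphan / error rule from the design."""
    if branch in existing_branches:
        return BranchAction.USE_EXISTING
    if branch == default_branch:
        return BranchAction.REFUSE
    return BranchAction.CREATE_ORPHAN
```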
Configuration
# .agentv/config.yaml
artifacts:
  backend: git                      # "git" | "local" (default: local)
  git:
    remote: agentv-evals            # git remote name or URL
    branch: agentv/checkpoints/v1   # branch name (default)
    path: .agentv/runs              # optional subdirectory prefix (useful on shared branches like main)
- Default backend: local (current behavior, no change)
- git backend: commits to the configured branch, pushes to the configured remote
- Remote can be the same repo (origin) or a separate repo, at the user's choice
- path: when committing to a shared branch like main, scopes artifacts under a subdirectory to avoid polluting the root
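One possible shape for the config schema is sketched below, working on the already-parsed YAML mapping. The class and function names are hypothetical, and treating `remote` as required whenever the backend is `git` is an assumption; the issue does not say what happens when it is missing.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

DEFAULT_BRANCH_NAME = "agentv/checkpoints/v1"  # default branch name from the config above


@dataclass(frozen=True)
class GitArtifactsConfig:
    remote: str
    branch: str = DEFAULT_BRANCH_NAME
    path: Optional[str] = None  # optional subdirectory prefix


@dataclass(frozen=True)
class ArtifactsConfig:
    backend: str = "local"
    git: Optional[GitArtifactsConfig] = None


def parse_artifacts_config(raw: Mapping[str, Any]) -> ArtifactsConfig:
    """Build an ArtifactsConfig from the parsed contents of .agentv/config.yaml."""
    section = raw.get("artifacts") or {}
    backend = section.get("backend", "local")
    if backend not in ("git", "local"):
        raise ValueError(f"unknown artifacts.backend: {backend!r}")
    if backend == "local":
        return ArtifactsConfig()  # current behavior, no change
    git_section = section.get("git") or {}
    remote = git_section.get("remote")
    if not remote:
        # Assumption: a git backend without a remote is a configuration error.
        raise ValueError("artifacts.git.remote is required when backend is 'git'")
    return ArtifactsConfig(
        backend="git",
        git=GitArtifactsConfig(
            remote=remote,
            branch=git_section.get("branch", DEFAULT_BRANCH_NAME),
            path=git_section.get("path"),
        ),
    )
```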
Example configurations
Dedicated eval repo (recommended)
artifacts:
  backend: git
  git:
    remote: git@github.com:org/agentv-evals.git
    branch: agentv/checkpoints/v1
Same repo, orphan branch
artifacts:
  backend: git
  git:
    remote: origin
    branch: agentv/checkpoints/v1
Same repo, main branch (mixed human + machine artifacts)
artifacts:
  backend: git
  git:
    remote: origin
    branch: main
    path: .agentv/runs
Write flow
- Eval run completes → runner has result payload
- Git storage backend:
  - Branch exists → fetch latest
  - Branch doesn't exist and isn't the default branch → create as orphan
  - Branch doesn't exist and is the default branch → error
- Build tree object with sharded path (under path prefix if configured)
- Commit with message "Run: <runId>" and trailers (AgentV-Eval, AgentV-Model, Source-Commit)
- Push to configured remote
- On conflict (concurrent runs): fetch, rebase, retry (runs are append-only and each writes only to its own path, so the rebase always applies cleanly)
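The commit message from the write flow could be assembled as below. The trailer keys come from this issue, but what `AgentV-Eval` should carry is not specified; this sketch assumes it holds the eval file path. The function name `build_commit_message` is hypothetical.

```python
def build_commit_message(run_id: str, eval_file: str, model: str, source_commit: str) -> str:
    """Build 'Run: <runId>' plus a trailer block in git's 'Key: value' format."""
    trailers = [
        ("AgentV-Eval", eval_file),       # assumption: path of the eval definition
        ("AgentV-Model", model),
        ("Source-Commit", source_commit),
    ]
    lines = [f"Run: {run_id}", ""]  # a blank line separates the subject from the trailers
    lines += [f"{key}: {value}" for key, value in trailers]
    return "\n".join(lines) + "\n"
```

Keeping the trailers together in the final paragraph of the message is what lets git's own trailer tooling (for example `git interpret-trailers --parse`) recognise them.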
Read flow
- agentv results list → git log <branch> --oneline
- agentv results show <runId> → git show <branch>:<path>/<shard>/<id>/metadata.json
- Dashboard / web UI reads from the git remote directly
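Two small helpers illustrate the read side: one builds the `<branch>:<path>` object spec that `git show` expects, the other pulls run IDs out of commit subjects. Both names are hypothetical. The second assumes the log was produced with `git log --format=%s` and that run commits use the `Run: <runId>` subject from the write flow.

```python
from typing import List, Optional


def metadata_object_spec(
    branch: str, run_id: str, prefix: Optional[str] = None, filename: str = "metadata.json"
) -> str:
    """Return the '<branch>:<path>' argument for `git show`."""
    parts = [run_id[:2], run_id[2:], filename]
    if prefix:
        # Only add the prefix when one is configured, to avoid a leading '/'.
        parts.insert(0, prefix.strip("/"))
    return f"{branch}:{'/'.join(parts)}"


def run_ids_from_log(log_output: str) -> List[str]:
    """Extract run IDs from commit subjects, in the order git printed them."""
    marker = "Run: "
    return [
        line[len(marker):].strip()
        for line in log_output.splitlines()
        if line.startswith(marker)  # skips human commits on shared branches
    ]
```

Filtering on the subject prefix matters mostly for the shared-branch setup, where eval commits are interleaved with ordinary commits on main.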
Cross-repo linking
Each metadata.json includes:
{
  "sourceRepo": "org/repo",
  "sourceCommit": "abc123def",
  "evalFile": "evals/my-eval.yaml",
  "runId": "a3b2c4d5e6f78901",
  "model": "claude-sonnet-4-6",
  "scores": { ... },
  "timestamp": "2026-03-25T12:00:00Z"
}
This solves the multi-repo eval problem: runs from different codebases all land in one eval results repo with provenance.
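As an illustration of how a consumer might use the provenance fields, the sketch below groups run IDs by source repository. The function name is hypothetical, and treating every listed field as required is an assumption; the issue does not define which metadata fields are mandatory.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping

# Assumption: these provenance fields are always present in metadata.json.
REQUIRED_FIELDS = ("sourceRepo", "sourceCommit", "evalFile", "runId", "model", "timestamp")


def group_runs_by_repo(records: Iterable[Mapping[str, Any]]) -> Dict[str, List[str]]:
    """Map each sourceRepo to the run IDs recorded for it, preserving input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"metadata is missing fields: {missing}")
        grouped[record["sourceRepo"]].append(record["runId"])
    return dict(grouped)
```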
Why git-native
- No cloud dependency: works offline, self-hosted, air-gapped
- Familiar tooling: git log, git show, git diff for querying results
- Access control: inherits git remote permissions
- Auditability: immutable append-only history
- CI-friendly: runners just need git push access to the eval repo
- Separation of concerns: eval data scales independently of source code
Why separate repo (recommended default)
- Source repo stays lean (eval transcripts are large, append-only)
- Different retention policies (prune old runs without touching code)
- Scoped CI permissions (eval runners don't need code repo write access)
- Natural home for cross-repo evals
Using main on the same repo is fully supported for teams that prefer a single repo with human-editable artifacts alongside automated results.
Implementation plan
Phase 1: Git storage backend
- Add artifacts.git config schema to config loader
- Implement GitArtifactStore class with write(runResult) and list()/get(runId) methods
- Branch creation logic: exists → use it, new + non-default → orphan, new + default → error
- Sharded path builder: runId → <id[:2]>/<id[2:]>/
- Commit with trailers, push to remote
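A rough sketch of the local half of the write path, using git plumbing commands through a temporary index so that neither the working tree nor the checked-out branch is touched. The helper names `git` and `write_run` are hypothetical, the optional `path` prefix is omitted, and pushing is left out. When the branch does not exist yet, the commit is created without a parent, which is what makes the new branch an orphan.

```python
import os
import subprocess
import tempfile


def git(repo, *args, env=None, stdin=None):
    """Run a git command inside `repo` and return its stripped stdout."""
    result = subprocess.run(
        ["git", "-C", repo, *args],
        input=stdin, env=env, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def write_run(repo, branch, run_id, files, message):
    """Commit `files` (name -> text) under the sharded run path on `branch`.

    Returns the new commit hash. The branch is created if it is missing.
    """
    ref = f"refs/heads/{branch}"
    probe = subprocess.run(
        ["git", "-C", repo, "rev-parse", "--verify", "--quiet", ref],
        capture_output=True, text=True,
    )
    parent = probe.stdout.strip() if probe.returncode == 0 else None

    with tempfile.TemporaryDirectory() as tmp:
        # A private index file keeps the user's staging area untouched.
        env = dict(os.environ, GIT_INDEX_FILE=os.path.join(tmp, "index"))
        if parent:
            git(repo, "read-tree", parent, env=env)  # start from the existing runs
        for name, text in files.items():
            blob = git(repo, "hash-object", "-w", "--stdin", stdin=text)
            path = f"{run_id[:2]}/{run_id[2:]}/{name}"
            git(repo, "update-index", "--add", "--cacheinfo",
                f"100644,{blob},{path}", env=env)
        tree = git(repo, "write-tree", env=env)

    commit_args = ["commit-tree", tree, "-m", message]
    if parent:
        commit_args += ["-p", parent]  # no parent means an orphan root commit
    commit = git(repo, *commit_args)

    update = ["update-ref", ref, commit]
    if parent:
        update.append(parent)  # only move the ref if it still points at `parent`
    git(repo, *update)
    return commit
```

Passing the old tip to `git update-ref` gives a local compare-and-swap, so two processes sharing one clone cannot silently overwrite each other's commit. Conflicts with the remote still have to be handled at push time.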
Phase 2: CLI integration
- Wire GitArtifactStore into eval runner via backend config
- agentv results list: read from the git branch
- agentv results show <runId>: read metadata/transcript from the git branch
Phase 3: Concurrency & robustness
- Fetch-rebase-retry loop for concurrent pushes
- Graceful handling of missing remote, auth failures, network errors (fall back to local with warning)
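The fetch-rebase-retry loop could look roughly like this. The push and rebase steps are passed in as callables so the control flow can be tested without a remote. `PushRejected`, the attempt limit, and the exponential backoff are all assumptions made for the sketch, not decisions recorded in this issue.

```python
import time
from typing import Callable


class PushRejected(Exception):
    """Hypothetical error raised by `push` when the remote tip has moved."""


def push_with_retry(
    push: Callable[[], None],
    fetch_and_rebase: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Push, rebasing onto the remote tip after each rejection.

    Returns the number of push attempts that were needed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            push()
            return attempt
        except PushRejected:
            if attempt == max_attempts:
                raise  # give up; the caller may fall back to the local backend
            fetch_and_rebase()
            # Back off so concurrent runners do not retry in lockstep.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Only push rejections are retried here. Auth failures and network errors would surface as other exceptions, which is where the fall-back-to-local behaviour described above would apply.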
Phase 4: Dashboard integration
- Dashboard reads results from git remote (extends feat: self-hosted dashboard — historical trends, dataset management, YAML editor #563)
Prior art
- entireio/cli: two-tier model with shadow branches + orphan checkpoint branch, checkpoint_remote for separate repo support
- Git notes: similar concept but limited to annotating existing commits
Acceptance signals
- artifacts.backend: git config option is respected
- Branch creation follows the exists/orphan/error logic
- Eval results written to sharded paths on the branch
- path prefix respected when configured (for shared branches)
- Push to configured remote after each run
- agentv results list/show reads from the git branch
- Concurrent runs don't corrupt the branch
- Existing local backend unchanged (default)
Non-goals
- Shadow branches / mid-run checkpointing (entireio's Tier 1) — not needed since we write after run completion
- Git hooks integration — eval runs are triggered by CLI, not git commit
- Transcript deduplication across runs — git's object dedup handles this naturally
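To make the deduplication point concrete: git names each blob by a hash of its content, so two byte-identical files map to a single stored object. The sketch below reproduces that ID computation, assuming the default SHA-1 object format; `git_blob_id` is a hypothetical name.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID git assigns to a blob with this content (SHA-1 repos)."""
    header = b"blob %d\0" % len(content)  # git hashes 'blob <size>\0' plus the bytes
    return hashlib.sha1(header + content).hexdigest()
```

This covers byte-identical files only. Transcripts that differ even slightly become separate blobs, and any further space saving comes from git's packfile delta compression.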
Related
- feat(eval): session recording and deterministic replay for offline evaluation #333 — session recording and replay
- feat: self-hosted dashboard — historical trends, dataset management, YAML editor #563 — self-hosted dashboard (could read from this branch)
- docs: document post-processing patterns for eval results (markdown summary, CI integration) #700 — post-processing patterns