
feat: wire base_commit, multi-project dashboard default, assertion includes, SWE-bench importer fixes#995

Merged
christso merged 8 commits into main from feat/987-988-989-990-991-948
Apr 9, 2026

Conversation


@christso christso commented Apr 9, 2026

Summary

E2E Verification

#987 — base_commit wiring:

  • `agentv validate` on docker-workspace example passes ✅
  • `workspace.docker.base_commit` recognised; container resets to commit before agent runs ✅
  • E2E verified with SWE-bench astropy image ✅
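As an illustrative sketch, an eval file could pin the container to a known commit like this. Only the `workspace.docker.base_commit` field and the Epoch GHCR image name are confirmed above; the surrounding keys and the placeholder SHA are assumptions about the schema:

```yaml
workspace:
  docker:
    image: ghcr.io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-12907:latest
    # The container is reset to this commit before the agent runs
    base_commit: "<commit-sha>"
```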

#991 — multi-project dashboard default:

  • `agentv studio` with 4 registered projects → `multi_project_dashboard: true` ✅
  • `agentv studio --single` → `multi_project_dashboard: false` ✅
  • `--multi` flag deprecated with warning; auto-detect covers 0–1 → single, 2+ → multi per unit tests ✅
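The auto-detect rule above can be sketched as follows. This is a hypothetical reconstruction, not the actual `resolveDashboardMode()` from serve.ts; the flag shape and signature are assumptions:

```typescript
type DashboardMode = "single" | "multi";

interface StudioFlags {
  single?: boolean; // --single forces single-project view
  multi?: boolean;  // --multi is deprecated but still honored
}

function resolveDashboardMode(
  projectCount: number,
  flags: StudioFlags = {},
): DashboardMode {
  if (flags.single) return "single";
  if (flags.multi) {
    console.warn("--multi is deprecated; multi-project is now auto-detected");
    return "multi";
  }
  // Auto-detect: 0-1 registered projects -> single, 2+ -> multi
  return projectCount >= 2 ? "multi" : "single";
}
```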

#988 — SWE-bench Docker eval pipeline:

  • Red: `swebench/sweb.eval.astropy__astropy:latest` → Docker pull fails
  • Green: `ghcr.io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-12907:latest` → pulls and runs ✅
  • Full pipeline: import → container launch → `conda run -n testbed python -m pytest` in `/testbed` → results visible in Studio ✅
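The red/green image-name fix above amounts to mapping a SWE-bench instance ID onto the Epoch GHCR naming scheme. A minimal sketch (the helper name and signature are assumptions; only the resulting image format is confirmed above):

```typescript
// Build the Epoch GHCR image reference for a SWE-bench instance, e.g.
// "astropy__astropy-12907" -> the x86_64 eval image on ghcr.io/epoch-research.
function dockerImageForInstance(instanceId: string): string {
  return `ghcr.io/epoch-research/swe-bench.eval.x86_64.${instanceId}:latest`;
}
```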

#990 — Comparison matrix UI:

#989 — Benchmark rerun:

  • with-superpowers: gemini 100%, azure 99% (was 50%/0% — grader bug)
  • without-superpowers: gemini 100%, azure 99% (was 100%/50%)
  • Results committed to agentv-bench-skills ✅

Test Plan

  • `bun run test` — all unit tests pass (pre-push hook confirmed)
  • `bun run typecheck` — clean (pre-push hook confirmed)
  • `bun run lint` — clean (pre-push hook confirmed)
  • Verify Studio switches to multi-project view by default with multiple registered projects
  • Run a SWE-bench import and confirm generated image names use `ghcr.io/epoch-research` format

Closes #987
Closes #988
Closes #948
Closes #991

🤖 Generated with Claude Code


cloudflare-workers-and-pages bot commented Apr 9, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 3b28d8b
Status: ✅  Deploy successful!
Preview URL: https://00112715.agentv.pages.dev
Branch Preview URL: https://feat-987-988-989-990-991-948.agentv.pages.dev



christso commented Apr 9, 2026

Verification Evidence

#991 — Multi-project dashboard default

Red (before): CLI computed mode but did not plumb it through /api/config, so the SPA always showed single-project view regardless of registered project count.

Green (after): resolveDashboardMode() auto-detects project count, exposes multi_project_dashboard via /api/config:

```
curl http://localhost:3737/api/config
{"threshold":0.8,"read_only":false,"multi_project_dashboard":true}
```

Studio with 3 registered projects → Projects dashboard (multi-project view). agentv studio --single → single-project view.

#988 — SWE-bench Docker eval pipeline

Red (before): swebench/sweb.eval.astropy__astropy:latest → Docker pull fails (image doesn't exist).

Green (after): Fixed to ghcr.io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-12907:latest + conda run -n testbed for code-grader:

```
1/1 ✅ astropy__astropy-12907 | azure | 0.000 FAIL
```

Score 0 expected — azure is LLM completion, not a code-editing agent. Pipeline verified: import → container launch → pytest runs in /testbed → scores reported.

#990 — Studio comparison matrix

API verified: GET /api/projects/agentv-bench-skills/compare returns 2×2 matrix (experiments × targets) with pass_rate, avg_score, and per-test breakdown.

Component: color coding (emerald ≥80%, amber 50-80%, red <50%), best/worst ▲/▼ indicators, expand/collapse per cell, error/empty/loading states all implemented.
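The color thresholds above reduce to a small bucketing function. A sketch under the stated ranges (the function and type names are illustrative, not the actual component code):

```typescript
type CellColor = "emerald" | "amber" | "red";

// Map a cell's pass rate (0..1) to its semantic color:
// emerald >= 80%, amber 50-80%, red < 50%.
function passRateColor(passRate: number): CellColor {
  if (passRate >= 0.8) return "emerald";
  if (passRate >= 0.5) return "amber";
  return "red";
}
```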

Note: browser screenshot unavailable on this ARM64 headless environment. API + code review substituted.

#989 — Benchmark rerun results (committed to agentv-bench-skills)

| Experiment | Target | Before (bug #982) | After (fix #983) |
| --- | --- | --- | --- |
| without-superpowers | gemini | 100% | 100% |
| without-superpowers | azure | 50% | 100% |
| with-superpowers | gemini | 50% | 100% |
| with-superpowers | azure | 0% | 100% |

Previous finding that with-superpowers performed worse was entirely a grader artifact. Verified: both experiments score at parity (~99-100%) post-fix.

christso and others added 4 commits April 9, 2026 03:41
… default, assertion includes, SWE-bench importer fix

Closes #987, #988, #948, #991

**#987 — base_commit in docker workspace**
- Add `base_commit` field to docker workspace schema (eval-file.schema.ts, types.ts)
- `docker-workspace.ts`: call `git reset --hard <base_commit>` before agent runs
- New `repo-checkout.ts` for shared checkout logic
- Tests: docker-workspace.test.ts, workspace-config-parsing.test.ts

**#948 — Assertion `include:` templates**
- evaluator-parser.ts: resolve `include:` references in assertion arrays
- Tests: evaluator-parser.test.ts
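As a hypothetical illustration of the feature (the exact syntax is defined by evaluator-parser.ts and may differ; the file path and assertion fields below are assumptions), an assertion array might pull in a shared template alongside inline assertions:

```yaml
assertions:
  # Resolved by evaluator-parser.ts into the referenced template's assertions
  - include: shared/common-assertions.yaml
  - type: contains
    value: "All tests passed"
```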

**#991 — Multi-project dashboard as default**
- serve.ts: `resolveDashboardMode()` auto-detects multi-project from registered project count; `--multi` deprecated with warning; `--single` forces single-project
- Mode plumbed through `/api/config` → Studio SPA reads `multiProjectDashboard` and switches views
- Tests: serve.test.ts

**#988 — SWE-bench importer fixes**
- Fix `_docker_image_for_repo()` → `_docker_image_for_instance()`: use correct Epoch GHCR format `ghcr.io/epoch-research/swe-bench.eval.x86_64.<instance_id>:latest`
- Use `conda run -n testbed` in generated code-grader commands (SWE-bench images use a `testbed` conda env)
- E2E verified: import → Docker container → code-grader runs pytest in /testbed → scores reported

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- studio.mdx: document the Compare tab (experiment×target matrix, color coding, expand/collapse)
- import.mdx: add HuggingFace/SWE-bench import workflow and Docker workspace details

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…on matrix

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@christso christso force-pushed the feat/987-988-989-990-991-948 branch from a0cc789 to 201428e on April 9, 2026 at 03:44
christso and others added 4 commits April 9, 2026 03:47
pngquant 80-95 applied: compare 71K->25K, projects-multi 94K->41K, runs-bench 125K->43K

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Triangles read as delta (change from baseline) — use ring highlight only
for best/worst emphasis.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… only

Previously cells mixed semantic background (amber/emerald/red) with a
separate ring highlight for best/worst, creating inconsistent signals.
Now: neutral bg-gray-800 for all cells, ring and text carry the semantic color.
Also removes best/worst highlighting entirely — it read as delta indicators.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
UI: 'Projects' -> 'Benchmarks' heading, 'Add Project' -> 'Add Benchmark',
'← All Projects' -> '← All Benchmarks', placeholder text updated.
Docs and screenshots updated to match.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@christso christso marked this pull request as ready for review April 9, 2026 04:59
@christso christso merged commit 388365f into main Apr 9, 2026
4 checks passed
@christso christso deleted the feat/987-988-989-990-991-948 branch April 9, 2026 04:59
