
feat: wire base_commit, multi-project dashboard default, assertion includes, SWE-bench importer fixes#995

Merged
christso merged 8 commits into main from feat/987-988-989-990-991-948
Apr 9, 2026

Conversation


@christso christso commented Apr 9, 2026

Summary

E2E Verification

#987 — base_commit wiring:

  • `agentv validate` on docker-workspace example passes ✅
  • `workspace.docker.base_commit` recognised; container resets to commit before agent runs ✅
  • E2E verified with SWE-bench astropy image ✅
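As an illustrative sketch, an eval file could pin the container to a known commit like this. Only the `workspace.docker.base_commit` field and the Epoch GHCR image name are confirmed above; the surrounding keys and the placeholder SHA are assumptions about the schema:

```yaml
workspace:
  docker:
    image: ghcr.io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-12907:latest
    # The container is reset to this commit before the agent runs
    base_commit: "<commit-sha>"
```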

#991 — multi-project dashboard default:

  • `agentv studio` with 4 registered projects → `multi_project_dashboard: true` ✅
  • `agentv studio --single` → `multi_project_dashboard: false` ✅
  • `--multi` flag deprecated with warning; auto-detect covers 0–1 → single, 2+ → multi per unit tests ✅
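The auto-detect rule above can be sketched as follows. This is a hypothetical reconstruction, not the actual `resolveDashboardMode()` from serve.ts; the flag shape and signature are assumptions:

```typescript
type DashboardMode = "single" | "multi";

interface StudioFlags {
  single?: boolean; // --single forces single-project view
  multi?: boolean;  // --multi is deprecated but still honored
}

function resolveDashboardMode(
  projectCount: number,
  flags: StudioFlags = {},
): DashboardMode {
  if (flags.single) return "single";
  if (flags.multi) {
    console.warn("--multi is deprecated; multi-project is now auto-detected");
    return "multi";
  }
  // Auto-detect: 0-1 registered projects -> single, 2+ -> multi
  return projectCount >= 2 ? "multi" : "single";
}
```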

#988 — SWE-bench Docker eval pipeline:

  • Red: `swebench/sweb.eval.astropy__astropy:latest` → Docker pull fails
  • Green: `ghcr.io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-12907:latest` → pulls and runs ✅
  • Full pipeline: import → container launch → `conda run -n testbed python -m pytest` in `/testbed` → results visible in Studio ✅
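The red/green image-name fix above amounts to mapping a SWE-bench instance ID onto the Epoch GHCR naming scheme. A minimal sketch (the helper name and signature are assumptions; only the resulting image format is confirmed above):

```typescript
// Build the Epoch GHCR image reference for a SWE-bench instance, e.g.
// "astropy__astropy-12907" -> the x86_64 eval image on ghcr.io/epoch-research.
function dockerImageForInstance(instanceId: string): string {
  return `ghcr.io/epoch-research/swe-bench.eval.x86_64.${instanceId}:latest`;
}
```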

#990 — Comparison matrix UI:

#989 — Benchmark rerun:

  • with-superpowers: gemini 100%, azure 99% (was 50%/0% — grader bug)
  • without-superpowers: gemini 100%, azure 99% (was 100%/50%)
  • Results committed to agentv-bench-skills ✅

Test Plan

  • `bun run test` — all unit tests pass (pre-push hook confirmed)
  • `bun run typecheck` — clean (pre-push hook confirmed)
  • `bun run lint` — clean (pre-push hook confirmed)
  • Verify Studio switches to multi-project view by default with multiple registered projects
  • Run a SWE-bench import and confirm generated image names use `ghcr.io/epoch-research` format

Closes #987
Closes #988
Closes #948
Closes #991

🤖 Generated with Claude Code


cloudflare-workers-and-pages bot commented Apr 9, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 3b28d8b
Status: ✅  Deploy successful!
Preview URL: https://00112715.agentv.pages.dev
Branch Preview URL: https://feat-987-988-989-990-991-948.agentv.pages.dev



christso commented Apr 9, 2026

Verification Evidence

#991 — Multi-project dashboard default

Red (before): CLI computed mode but did not plumb it through /api/config, so the SPA always showed single-project view regardless of registered project count.

Green (after): resolveDashboardMode() auto-detects project count, exposes multi_project_dashboard via /api/config:

```
curl http://localhost:3737/api/config
{"threshold":0.8,"read_only":false,"multi_project_dashboard":true}
```

Studio with 3 registered projects → Projects dashboard (multi-project view). agentv studio --single → single-project view.

#988 — SWE-bench Docker eval pipeline

Red (before): swebench/sweb.eval.astropy__astropy:latest → Docker pull fails (image doesn't exist).

Green (after): Fixed to ghcr.io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-12907:latest + conda run -n testbed for code-grader:

```
1/1 ✅ astropy__astropy-12907 | azure | 0.000 FAIL
```

Score 0 expected — azure is LLM completion, not a code-editing agent. Pipeline verified: import → container launch → pytest runs in /testbed → scores reported.

#990 — Studio comparison matrix

API verified: GET /api/projects/agentv-bench-skills/compare returns 2×2 matrix (experiments × targets) with pass_rate, avg_score, and per-test breakdown.

Component: color coding (emerald ≥80%, amber 50-80%, red <50%), best/worst ▲/▼ indicators, expand/collapse per cell, error/empty/loading states all implemented.
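The color thresholds above reduce to a small bucketing function. A sketch under the stated ranges (the function and type names are illustrative, not the actual component code):

```typescript
type CellColor = "emerald" | "amber" | "red";

// Map a cell's pass rate (0..1) to its semantic color:
// emerald >= 80%, amber 50-80%, red < 50%.
function passRateColor(passRate: number): CellColor {
  if (passRate >= 0.8) return "emerald";
  if (passRate >= 0.5) return "amber";
  return "red";
}
```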

Note: browser screenshot unavailable on this ARM64 headless environment. API + code review substituted.

#989 — Benchmark rerun results (committed to agentv-bench-skills)

| Experiment | Target | Before (bug #982) | After (fix #983) |
| --- | --- | --- | --- |
| without-superpowers | gemini | 100% | 100% |
| without-superpowers | azure | 50% | 100% |
| with-superpowers | gemini | 50% | 100% |
| with-superpowers | azure | 0% | 100% |

Previous finding that with-superpowers performed worse was entirely a grader artifact. Verified: both experiments score at parity (~99-100%) post-fix.

christso and others added 4 commits April 9, 2026 03:41
… default, assertion includes, SWE-bench importer fix

Closes #987, #988, #948, #991

**#987 — base_commit in docker workspace**
- Add `base_commit` field to docker workspace schema (eval-file.schema.ts, types.ts)
- `docker-workspace.ts`: call `git reset --hard <base_commit>` before agent runs
- New `repo-checkout.ts` for shared checkout logic
- Tests: docker-workspace.test.ts, workspace-config-parsing.test.ts

**#948 — Assertion `include:` templates**
- evaluator-parser.ts: resolve `include:` references in assertion arrays
- Tests: evaluator-parser.test.ts
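As a hypothetical illustration of the feature (the exact syntax is defined by evaluator-parser.ts and may differ; the file path and assertion fields below are assumptions), an assertion array might pull in a shared template alongside inline assertions:

```yaml
assertions:
  # Resolved by evaluator-parser.ts into the referenced template's assertions
  - include: shared/common-assertions.yaml
  - type: contains
    value: "All tests passed"
```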

**#991 — Multi-project dashboard as default**
- serve.ts: `resolveDashboardMode()` auto-detects multi-project from registered project count; `--multi` deprecated with warning; `--single` forces single-project
- Mode plumbed through `/api/config` → Studio SPA reads `multiProjectDashboard` and switches views
- Tests: serve.test.ts

**#988 — SWE-bench importer fixes**
- Fix `_docker_image_for_repo()` → `_docker_image_for_instance()`: use correct Epoch GHCR format `ghcr.io/epoch-research/swe-bench.eval.x86_64.<instance_id>:latest`
- Use `conda run -n testbed` in generated code-grader commands (SWE-bench images use a `testbed` conda env)
- E2E verified: import → Docker container → code-grader runs pytest in /testbed → scores reported

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- studio.mdx: document the Compare tab (experiment×target matrix, color coding, expand/collapse)
- import.mdx: add HuggingFace/SWE-bench import workflow and Docker workspace details

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…on matrix

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@christso christso force-pushed the feat/987-988-989-990-991-948 branch from a0cc789 to 201428e on April 9, 2026 at 03:44
christso and others added 4 commits April 9, 2026 03:47
pngquant 80-95 applied: compare 71K->25K, projects-multi 94K->41K, runs-bench 125K->43K

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Triangles read as delta (change from baseline) — use ring highlight only
for best/worst emphasis.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… only

Previously cells mixed semantic background (amber/emerald/red) with a
separate ring highlight for best/worst, creating inconsistent signals.
Now: neutral bg-gray-800 for all cells, ring and text carry the semantic color.
Also removes best/worst highlighting entirely — it read as delta indicators.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
UI: 'Projects' -> 'Benchmarks' heading, 'Add Project' -> 'Add Benchmark',
'← All Projects' -> '← All Benchmarks', placeholder text updated.
Docs and screenshots updated to match.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@christso christso marked this pull request as ready for review April 9, 2026 04:59
@christso christso merged commit 388365f into main Apr 9, 2026
4 checks passed
@christso christso deleted the feat/987-988-989-990-991-948 branch April 9, 2026 04:59
