Deploying agentv with Cloudflare Pages
| | |
| --- | --- |
| Latest commit: | 3b28d8b |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://00112715.agentv.pages.dev |
| Branch Preview URL: | https://feat-987-988-989-990-991-948.agentv.pages.dev |
**Verification Evidence**

**#991 — Multi-project dashboard default**
- Red (before): CLI computed the mode but did not plumb it through.
- Green (after): Studio with 3 registered projects → Projects dashboard (multi-project view).

**#988 — SWE-bench Docker eval pipeline**
- Red (before):
- Green (after): Fixed to
- Score 0 expected —

**#990 — Studio comparison matrix**
- API verified. Component: color coding (emerald ≥80%, amber 50-80%, red <50%), best/worst ▲/▼ indicators, expand/collapse per cell, error/empty/loading states all implemented.
- Note: browser screenshot unavailable in this ARM64 headless environment; API + code review substituted.

**#989 — Benchmark rerun results (committed to agentv-bench-skills)**
- The previous finding that with-superpowers performed worse was entirely a grader artifact.
- Verified: both experiments score at parity (~99-100%) post-fix.
… default, assertion includes, SWE-bench importer fix

Closes #987, #988, #948, #991

**#987 — base_commit in docker workspace**
- Add `base_commit` field to docker workspace schema (eval-file.schema.ts, types.ts)
- `docker-workspace.ts`: call `git reset --hard <base_commit>` before agent runs
- New `repo-checkout.ts` for shared checkout logic
- Tests: docker-workspace.test.ts, workspace-config-parsing.test.ts

**#948 — Assertion include: templates**
- evaluator-parser.ts: resolve `include:` references in assertion arrays
- Tests: evaluator-parser.test.ts

**#991 — Multi-project dashboard as default**
- serve.ts: `resolveDashboardMode()` auto-detects multi-project from registered project count; `--multi` deprecated with warning; `--single` forces single-project
- Mode plumbed through `/api/config` → Studio SPA reads `multiProjectDashboard` and switches views
- Tests: serve.test.ts

**#988 — SWE-bench importer fixes**
- Fix `_docker_image_for_repo()` → `_docker_image_for_instance()`: use correct Epoch GHCR format `ghcr.io/epoch-research/swe-bench.eval.x86_64.<instance_id>:latest`
- Use `conda run -n testbed` in generated code-grader commands (SWE-bench images use a `testbed` conda env)
- E2E verified: import → Docker container → code-grader runs pytest in /testbed → scores reported

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
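The #991 auto-detection described in this commit might look roughly like the following. This is a hypothetical sketch: the real `resolveDashboardMode()` lives in serve.ts, and the `ServeOptions` field names here are assumptions, not the actual CLI surface.

```typescript
// Hypothetical sketch of the #991 mode-resolution logic.
// Assumed: explicit flags override auto-detection, and more than one
// registered project selects the multi-project dashboard.
type DashboardMode = "multi" | "single";

interface ServeOptions {
  multi?: boolean;   // deprecated flag, kept for compatibility
  single?: boolean;  // forces the single-project view
}

function resolveDashboardMode(
  registeredProjectCount: number,
  opts: ServeOptions = {}
): DashboardMode {
  if (opts.single) return "single"; // explicit override wins
  if (opts.multi) {
    console.warn("--multi is deprecated; multi-project mode is auto-detected");
    return "multi";
  }
  // Auto-detect: multiple registered projects imply the multi-project view
  return registeredProjectCount > 1 ? "multi" : "single";
}
```

Per the commit, the resolved mode is then exposed through `/api/config`, where the Studio SPA reads `multiProjectDashboard` and switches views accordingly.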
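The #988 image-naming fix amounts to keying the GHCR image by instance id rather than by repo. A sketch of the corrected format follows; the actual importer helper is the Python `_docker_image_for_instance()`, so these TypeScript names (and the `graderCommand` helper) are illustrative stand-ins only.

```typescript
// Illustrative stand-in for the Python _docker_image_for_instance() fix.
// Epoch's GHCR images are keyed by SWE-bench instance id, not by repo.
function dockerImageForInstance(instanceId: string): string {
  return `ghcr.io/epoch-research/swe-bench.eval.x86_64.${instanceId}:latest`;
}

// SWE-bench images ship dependencies in a "testbed" conda env, so generated
// code-grader commands must run inside it (hypothetical helper name).
function graderCommand(pytestArgs: string): string {
  return `conda run -n testbed pytest ${pytestArgs}`;
}
```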
- studio.mdx: document the Compare tab (experiment×target matrix, color coding, expand/collapse)
- import.mdx: add HuggingFace/SWE-bench import workflow and Docker workspace details

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…on matrix

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
a0cc789 to 201428e
pngquant 80-95 applied: compare 71K->25K, projects-multi 94K->41K, runs-bench 125K->43K

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Triangles read as delta (change from baseline); use the ring highlight only for best/worst emphasis.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… only

Previously, cells mixed a semantic background (amber/emerald/red) with a separate ring highlight for best/worst, creating inconsistent signals. Now all cells use a neutral bg-gray-800, and the ring and text carry the semantic color. Best/worst highlighting is also removed entirely, since it read as a delta indicator.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
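The semantic color thresholds from #990 (emerald ≥80%, amber 50-80%, red <50%) reduce to a small mapping. A minimal sketch, assuming scores are percentages in [0, 100]; `scoreColor` is a stand-in name, not the component's actual helper:

```typescript
// Sketch of the #990 score-to-color mapping described above.
// Assumed thresholds: emerald >= 80, amber in [50, 80), red < 50.
type CellColor = "emerald" | "amber" | "red";

function scoreColor(scorePct: number): CellColor {
  if (scorePct >= 80) return "emerald";
  if (scorePct >= 50) return "amber";
  return "red";
}
```

Under the change in this commit, that color would be applied to the cell's ring and text only, over a uniform neutral bg-gray-800 background.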
UI: 'Projects' -> 'Benchmarks' heading, 'Add Project' -> 'Add Benchmark', '← All Projects' -> '← All Benchmarks', placeholder text updated. Docs and screenshots updated to match.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Summary
E2E Verification
- #987 — base_commit wiring:
- #991 — multi-project dashboard default:
- #988 — SWE-bench Docker eval pipeline:
- #990 — Comparison matrix UI:
- #989 — Benchmark rerun:
Test Plan
Closes #987
Closes #988
Closes #948
Closes #991
🤖 Generated with Claude Code