feat: align repeat config with attempt artifacts#1608

Merged: christso merged 3 commits into main from feat/av-d64j-repeat-config on Jul 2, 2026

Conversation

@christso christso commented Jul 2, 2026

Collaborator

Summary

Repeat authoring now follows the Promptfoo-style shape while AgentV keeps its richer produced-attempt artifacts. YAML authors can set evaluate_options.repeat to either a number (for example, evaluate_options.repeat: 3) or a richer repeat object, and individual cases can override the repeat policy with tests[].options.repeat.

This also finishes the vocabulary split: repeat is configuration, attempts[] is produced execution metadata, and per-execution sidecars live in attempt-N/ directories. Result readers keep compatibility fallbacks for older trials[] / run_path manifests while new writers emit the canonical attempt shape.
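The resolution order described above can be sketched roughly as follows. This is a minimal illustration, not AgentV's parser: only the evaluate_options.repeat and tests[].options.repeat paths come from this PR, while the `count` field name, the default of one attempt, and the choice to replace (rather than merge) the global object are assumptions made for the example.

```typescript
// Hypothetical types. Only the two config paths are taken from the PR text;
// the `count` field on the repeat object is an assumed name.
type RepeatObject = { count: number; [key: string]: unknown };
type RepeatConfig = number | RepeatObject;

interface EvalFile {
  evaluate_options?: { repeat?: RepeatConfig };
  tests: Array<{ id: string; options?: { repeat?: RepeatConfig } }>;
}

// Normalize the numeric shorthand into the object form.
// Defaulting to a single attempt when nothing is configured is an assumption.
function normalizeRepeat(repeat: RepeatConfig | undefined): RepeatObject {
  if (repeat === undefined) return { count: 1 };
  if (typeof repeat === "number") return { count: repeat };
  return repeat;
}

// A per-case repeat, when present, wins over the global one.
// This sketch replaces the global value wholesale; the real code may merge.
function resolveRepeat(file: EvalFile, testId: string): RepeatObject {
  const test = file.tests.find((t) => t.id === testId);
  if (!test) throw new Error(`unknown test: ${testId}`);
  return normalizeRepeat(test.options?.repeat ?? file.evaluate_options?.repeat);
}
```

Under these assumptions, a file with a global repeat of 3 and one case that sets its own repeat object would run that case with its own count and every other case three times.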

| Area | Outcome |
| --- | --- |
| YAML/schema | Accepts numeric and object evaluate_options.repeat; rejects removed top-level repeat with migration guidance |
| Per-case overrides | Applies tests[].options.repeat over the global repeat object/count |
| Artifacts | Emits attempts[], attempt_path, attempt-N/, and total_attempts/passed_attempts summaries |
| Readers | CLI results and Dashboard prefer attempts while still reading legacy trials and run_path |
| Docs/examples | Updates public docs, examples, verification notes, and eval-writer schema guidance |
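The reader behaviour in the Readers row amounts to a "prefer new, fall back to legacy" lookup. The sketch below shows that pattern only; the `ManifestRow` shape and the placement of attempt_path / run_path on the row are assumptions for illustration, not the actual CLI or Dashboard types.

```typescript
// Hypothetical manifest row. The four field names come from the PR text;
// where exactly they sit in a real manifest is an assumption.
interface AttemptRecord {
  [key: string]: unknown;
}

interface ManifestRow {
  attempts?: AttemptRecord[]; // canonical, written by new bundles
  trials?: AttemptRecord[]; // legacy, read for compatibility only
  attempt_path?: string; // canonical
  run_path?: string; // legacy
}

// Prefer the canonical field, then the legacy one, then an empty list.
function readAttempts(row: ManifestRow): AttemptRecord[] {
  return row.attempts ?? row.trials ?? [];
}

// Same fallback order for the sidecar path.
function readAttemptPath(row: ManifestRow): string | undefined {
  return row.attempt_path ?? row.run_path;
}
```

Keeping the fallback in small reader helpers like these means writers never need to emit the legacy names, which matches the split the PR describes.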

Validation

  • bun run build
  • bun run lint
  • bun run typecheck
  • bun run test
  • bun run validate:examples
  • Focused parser/schema/artifact/dashboard/sdk tests during development:
    • bun test packages/core/test/evaluation/experiment.test.ts packages/core/test/evaluation/eval-inline-experiment.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/serve.test.ts apps/dashboard/src/lib/result-table.test.ts packages/sdk/test/eval-authoring.test.ts
    • bun test apps/cli/test/eval.integration.test.ts apps/cli/test/commands/prepare/prepare.test.ts

Evidence

  • Live dogfood run through the local OpenAI-compatible endpoint: http://127.0.0.1:10531/v1
  • Live model/target and grader model: gpt-5.3-codex-spark
  • Evidence branch: EntityProcess/agentv-private:evidence/av-d64j-repeat-config
  • Evidence commit: d42dceb
  • Evidence contents: source eval/targets, canonical run bundle, artifact tree, and contract-check.json showing:
    • global-repeat-shorthand emitted 2 attempts with attempt-1, attempt-2
    • per-case-repeat-object emitted 3 attempts with attempt-1, attempt-2, attempt-3
    • manifest rows use attempts[] and do not emit legacy top-level trials
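The kind of check summarized in contract-check.json could look something like the sketch below. It is not the script that produced the evidence; the function names, the row shape, and the returned problem strings are all hypothetical, and only the attempt-N naming and the attempts[] / trials distinction come from the PR.

```typescript
// Build the directory names expected for a given repeat count:
// attempt-1, attempt-2, ... attempt-N.
function expectedAttemptDirs(count: number): string[] {
  return Array.from({ length: count }, (_, i) => `attempt-${i + 1}`);
}

// Return a list of contract problems for one manifest row (hypothetical shape)
// and the sidecar directory names found next to it. Empty list means pass.
function checkAttemptContract(
  row: Record<string, unknown>,
  dirs: string[],
  expectedCount: number,
): string[] {
  const problems: string[] = [];
  const attempts = row["attempts"];
  if (!Array.isArray(attempts)) {
    problems.push("row is missing attempts[]");
  } else if (attempts.length !== expectedCount) {
    problems.push(`expected ${expectedCount} attempts, found ${attempts.length}`);
  }
  if ("trials" in row) {
    problems.push("row emits legacy top-level trials");
  }
  for (const dir of expectedAttemptDirs(expectedCount)) {
    if (!dirs.includes(dir)) problems.push(`missing ${dir}/`);
  }
  return problems;
}
```

For the global-repeat-shorthand case above, this would be called with an expected count of 2 and the directory names found in the run bundle.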

Code Review

Simplify/code-review pass completed before PR. No actionable residual findings.

Post-Deploy Monitoring & Validation

No production service deployment is required for this package/schema/docs change. After publishing or merging, validate by watching:

  • CI checks for schema sync, example validation, package build, CLI tests, and Dashboard tests.
  • Any docs/example validation failures mentioning removed top-level repeat.
  • Any consumer reports where Dashboard or agentv results serve/validate/export cannot read older manifests with trials[] or run_path.

Healthy signal: new eval files with evaluate_options.repeat validate, new repeat runs write attempts[] and attempt-N/, and older run bundles remain readable. Rollback trigger: CI or dogfood shows newly written bundles missing attempt sidecars or existing legacy bundles becoming unreadable.

Related: Bead av-d64j.


Compound Engineering
GPT-5_Codex

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jul 2, 2026


Deploying agentv with Cloudflare Pages

Latest commit: 35456a0
Status: ✅  Deploy successful!
Preview URL: https://68d7783a.agentv.pages.dev
Branch Preview URL: https://feat-av-d64j-repeat-config.agentv.pages.dev

@christso

christso commented Jul 2, 2026

Collaborator Author

Review verdict: changes requested.

Findings:

  • P2: Remaining user-facing repeat-attempt vocabulary still says runs/trials. This PR establishes repeat as authored configuration and attempts / attempt-N as produced executions, but the Dashboard repeat UI still renders Run success, Passed runs, runs passed, and runs at apps/dashboard/src/components/EvalDetail.tsx:621, apps/dashboard/src/components/EvalDetail.tsx:623, apps/dashboard/src/components/ResultTable.tsx:714, and apps/dashboard/src/components/ResultTable.tsx:721. The CLI also exposes the internal name in warnings, e.g. trials.count at packages/core/src/evaluation/orchestrator.ts:816 and packages/core/src/evaluation/orchestrator.ts:880. These should use attempts / repeat count / evaluate_options.repeat wording so new Dashboard and CLI output does not reintroduce the legacy public vocabulary.

Checks run: local diff and targeted rg inspection across schema, parser, artifact writer, CLI readers, Dashboard, docs/examples/skills; git diff --check origin/main...HEAD.

@christso christso force-pushed the feat/av-d64j-repeat-config branch from 3b1642c to 36e9680 on July 2, 2026 15:57
@christso

christso commented Jul 2, 2026

Collaborator Author

Review verdict: changes requested.

Findings:

  • P2: Remaining public repeat-attempt vocabulary still says run/trial in the attempt-facing surfaces. The previous Dashboard aggregate labels and CLI warnings are fixed, and CI is green for 36e9680, but the selected attempt detail still renders Run score in apps/dashboard/src/components/EvalDetail.tsx:732 and says This run does not include a transcript artifact from the selected-attempt transcript tab at apps/dashboard/src/components/EvalDetail.tsx:897. Public docs/API comments also still describe the new attempt artifact layout with old terms: apps/web/src/content/docs/docs/tools/results.mdx:130 says attempt details live under run-N/, and the exported core types still say Configuration for running multiple trials per eval case, Result of a single trial attempt, and run-N folders at packages/core/src/evaluation/types.ts:1075, packages/core/src/evaluation/types.ts:1086, and packages/core/src/evaluation/types.ts:1103. These should use attempts / attempt-N wording so the new public contract does not continue leaking the legacy vocabulary.

Checks run: git fetch origin --prune; git status --short --branch; gh pr view 1608 --json ...; fetched PR head into refs/review/pr-1608-head; inspected git diff origin/main...refs/review/pr-1608-head; targeted git grep/git show inspection across Dashboard, CLI warnings, docs, artifacts, tests, and exported type comments; git diff --check origin/main...refs/review/pr-1608-head. No tests/builds run per research-only review instructions and because the fresh PR CI is green.

@christso

christso commented Jul 2, 2026

Collaborator Author

Addressed the re-review vocabulary finding in 35456a0.

Changes:

  • The selected-attempt Dashboard detail now uses Attempt score and This attempt does not include a transcript artifact.
  • The results docs now describe per-attempt artifacts under attempt-N/ instead of run-N/.
  • The exported core comments now describe repeated attempts / attempt-N folders while preserving compatibility type names.

Local validation:

  • bun run --cwd apps/dashboard test
  • bun run --cwd apps/dashboard build
  • bun run --cwd packages/core lint
  • bun run --cwd packages/core typecheck
  • bun run --cwd apps/web build
  • bun run lint
  • git diff --check
  • targeted greps for reviewed stale strings

@christso

christso commented Jul 2, 2026

Collaborator Author

Final re-review verdict: clean.

The prior vocabulary finding is resolved at head 35456a00e1b6be9c1b91d5a19851bece34a576ea. The selected-attempt Dashboard detail now says Attempt score and This attempt does not include a transcript artifact; the tools/results docs now use attempt-N/ for per-attempt artifacts; and the exported core comments now describe repeated attempts / attempt-N folders.

I did not find new blockers in the latest three-file fix. Fresh PR CI is green for this head: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages all succeeded. The orchestrator may proceed to ready/merge.

Checks run: git fetch origin --prune; git status --short --branch; inspected origin/main...origin/feat/av-d64j-repeat-config; reviewed 36e96801..35456a00 for EvalDetail.tsx, tools/results.mdx, and packages/core/src/evaluation/types.ts; targeted git grep for the stale strings from the prior finding; git diff --check origin/main...origin/feat/av-d64j-repeat-config; gh pr view 1608 --json .... No local builds/tests/evals run per research-only instructions and because CI evidence was sufficient.

@christso christso marked this pull request as ready for review July 2, 2026 16:16
@christso christso merged commit 74b961c into main Jul 2, 2026
8 checks passed
@christso christso deleted the feat/av-d64j-repeat-config branch July 2, 2026 16:16