feat(eval): normalize transcript artifacts #1600

Merged: christso merged 2 commits into main from feat/av-kfik-8-transcripts on Jul 2, 2026
Conversation

@christso

@christso christso commented Jul 2, 2026

Collaborator

Summary

  • write normalized transcript.json documents while preserving raw provider evidence as transcript-raw.jsonl
  • add canonical transcript tool_name normalization and inline transcript_summary into result rows/artifacts
  • update CLI/Dashboard readers, docs, and tests for the new transcript contract with legacy JSONL fallback where needed
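As an illustration of the tool_name normalization described above, here is a minimal sketch of a provider-aware lookup. The canonical values, the provider keys, the alias entries, and the `raw_tool_name` field are all hypothetical; the PR description does not list the real enum or field names.

```typescript
// Hypothetical canonical names; the PR does not list the real enum values.
type CanonicalToolName = "shell" | "read_file" | "write_file" | "unknown";

// Hypothetical alias tables: provider -> raw tool name -> canonical name.
const TOOL_ALIASES = new Map<string, Map<string, CanonicalToolName>>([
  [
    "provider-a",
    new Map<string, CanonicalToolName>([
      ["Bash", "shell"],
      ["Read", "read_file"],
      ["Write", "write_file"],
    ]),
  ],
  [
    "provider-b",
    new Map<string, CanonicalToolName>([
      ["exec_command", "shell"],
      ["open_file", "read_file"],
    ]),
  ],
]);

interface NormalizedToolCall {
  tool_name: CanonicalToolName; // canonical value that consumers compare against
  raw_tool_name: string; // assumed field: the provider's original name, kept for traceability
}

// Look up the canonical name for a provider's raw tool name.
// Unknown providers and unknown tools both fall back to "unknown".
function normalizeToolCall(provider: string, rawName: string): NormalizedToolCall {
  const canonical = TOOL_ALIASES.get(provider)?.get(rawName) ?? "unknown";
  return { tool_name: canonical, raw_tool_name: rawName };
}
```

Keeping the lookup keyed by provider means the same raw name can map to different canonical values per provider, while the raw evidence stays untouched in transcript-raw.jsonl.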

Verification

  • bun run lint
  • bun run build
  • bun run test
  • bun run validate:examples
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts packages/core/test/evaluation/orchestrator.test.ts apps/dashboard/src/components/transcript-timeline.test.tsx
  • bun test apps/cli/test/commands/results/export.test.ts apps/cli/test/commands/results/serve.test.ts

Dogfood

  • Live local OpenAI-compatible run against http://127.0.0.1:10531/v1, model gpt-5.4-mini, passed 1/1.
  • Local run bundle: .agentv/results/2026-07-02T10-00-04-331Z
  • Private evidence: agentv-private branch evidence/av-kfik8-transcripts-2026-07-02, commit 6c17be5.

Bead: av-kfik.8

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jul 2, 2026


Deploying agentv with Cloudflare Pages

Latest commit: 1f602d0
Status: ✅  Deploy successful!
Preview URL: https://19494009.agentv.pages.dev
Branch Preview URL: https://feat-av-kfik-8-transcripts.agentv.pages.dev

@christso

christso commented Jul 2, 2026

Collaborator Author

Independent review for Bead av-kfik.8.1.

Finding:

  • P2: transcript_summary is not projected for repeat-run result rows. The Bead requires the precomputed summary to be inlined into each result row so consumers can inspect trajectory/metrics without reparsing transcript artifacts. The writer only attaches transcript_summary when a single-run transcriptPath exists (buildIndexArtifactEntry, packages/core/src/evaluation/run-artifacts.ts:1790) or when isSingleRun && hasTranscript is true (buildResultIndexArtifact, packages/core/src/evaluation/run-artifacts.ts:1879). For repeated runs, isSingleRun is false, so the root index.jsonl row has trials and per-attempt transcript paths but no transcript_summary; the Dashboard repeat read model also reconstructs attempt paths without adding transcript_summary (apps/cli/src/commands/results/serve.ts:1153). This leaves repeated eval consumers parsing run-N/transcript.json despite the new row-level contract. Please either project per-attempt summaries into trials[] / an aggregate row summary, or explicitly narrow the contract/docs if repeat rows are intentionally excluded.
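To make the first suggested option concrete, here is a rough sketch of a row builder that projects per-attempt summaries into trials[]. It is not the code in run-artifacts.ts: the `AttemptResult` and `IndexRow` shapes, the `transcript_path` field, and the function name are assumptions for illustration only.

```typescript
type Summary = Record<string, unknown>; // real summary shape is not shown in this thread

// Hypothetical in-memory view of one attempt's persisted outputs.
interface AttemptResult {
  transcriptPath?: string;
  transcriptSummary?: Summary;
}

interface TrialEntry {
  transcript_path?: string;
  transcript_summary?: Summary;
}

interface IndexRow {
  transcript_summary?: Summary; // single-run rows carry the summary at the top level
  trials?: TrialEntry[]; // repeat-run rows carry one entry per attempt
}

function buildIndexRow(attempts: AttemptResult[]): IndexRow {
  const first = attempts[0];
  // Single run: inline the summary on the row itself.
  if (attempts.length === 1 && first !== undefined) {
    return first.transcriptSummary !== undefined
      ? { transcript_summary: first.transcriptSummary }
      : {};
  }
  // Repeat run: project each attempt's summary into its trial entry,
  // omitting the field when the attempt has no persisted transcript.
  return {
    trials: attempts.map((attempt) => {
      const entry: TrialEntry = {};
      if (attempt.transcriptPath !== undefined) entry.transcript_path = attempt.transcriptPath;
      if (attempt.transcriptSummary !== undefined) entry.transcript_summary = attempt.transcriptSummary;
      return entry;
    }),
  };
}
```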

Other checks:

  • Raw provider evidence remains transcript-raw.jsonl; normalized transcript is transcript.json.
  • Canonical tool_name enum is routed through provider-aware normalization.
  • I did not find ordinary sidecars being added to artifact_pointers.
  • Local targeted tests could not run in this review worktree because dependencies/build links were missing (react/jsx-dev-runtime, micromatch, @agentv/core), but PR checks are passing on GitHub.

@christso

christso commented Jul 2, 2026

Collaborator Author

Addressed the repeat-run transcript summary blocker in commit 1f602d0 (fix(eval): project repeat transcript summaries).

What changed:

  • index.jsonl trial entries now include trials[].transcript_summary when the trial has a persisted transcript-bearing result.
  • Dashboard/API repeat trial read models preserve trials[].transcript_summary from the manifest and hydrate it from run-N/result.json for older rows that do not inline it.
  • Added focused writer and serve/API regressions for repeat trial projection.
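The read-side behavior above (prefer the inlined summary, fall back to run-N/result.json for older rows) can be sketched roughly as follows. The `attempt` field, the loader callback, and the function name are assumptions; the actual read model in serve.ts is not shown in this thread.

```typescript
type TranscriptSummary = Record<string, unknown>; // real shape is not shown in this thread

interface TrialEntry {
  attempt: number; // assumed 1-based attempt number, matching run-N
  transcript_summary?: TranscriptSummary;
}

// Hypothetical loader: returns the parsed run-N/result.json for an attempt,
// or undefined when the file is missing or unreadable.
type LoadAttemptResult = (
  attempt: number,
) => { transcript_summary?: TranscriptSummary } | undefined;

function hydrateTrialSummaries(
  trials: TrialEntry[],
  loadAttemptResult: LoadAttemptResult,
): TrialEntry[] {
  return trials.map((trial) => {
    // Newer rows already inline the summary; keep it as-is.
    if (trial.transcript_summary !== undefined) return trial;
    // Older rows: try the per-attempt result sidecar.
    const summary = loadAttemptResult(trial.attempt)?.transcript_summary;
    return summary === undefined ? trial : { ...trial, transcript_summary: summary };
  });
}
```

Injecting the loader keeps the projection logic testable without touching the filesystem; a caller would pass a function that reads and parses run-N/result.json.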

Validation:

  • bun --filter @agentv/core build passed
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts passed (66/66)
  • bun test apps/cli/test/commands/results/serve.test.ts passed (120/120)
  • bun --filter @agentv/core typecheck passed
  • bun --filter agentv typecheck passed
  • bunx biome check packages/core/src/evaluation/run-artifacts.ts apps/cli/src/commands/results/serve.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/serve.test.ts passed
  • bun --filter agentv build passed
  • git diff --check passed

Live dogfood:

  • Attempted OpenAI direct repeat-run dogfood; blocked because the local OPENAI_API_KEY is a dummy key and returned HTTP 401.
  • Attempted GitHub Models live repeat-run dogfood with openai/gpt-4o-mini and openai/gpt-4.1-mini; both reached the live provider path, but the LLM grader failed with malformed JSON parse errors.
  • The resulting repeat bundle at /tmp/agentv-kfik8-dogfood/run still validated the artifact projection under review: root index.jsonl had 2 trials, and both trial entries had transcript_summary matching each run-N/result.json.

@christso

christso commented Jul 2, 2026

Collaborator Author

CI is green for commit 1f602d05, but I am not marking the PR ready or merging it.

Blocker: the repo verification gate asks for live dogfood on repeat-run/artifact contract changes. I attempted that gate:

  • Direct OpenAI target: blocked by local OPENAI_API_KEY returning HTTP 401 (dummy key).
  • GitHub Models target with openai/gpt-4o-mini and openai/gpt-4.1-mini: live provider calls ran, but the LLM grader failed repeatedly with malformed JSON parse errors.

Useful evidence from the failed live run: /tmp/agentv-kfik8-dogfood/run/index.jsonl had 2 repeat trials, and both trials[] entries included transcript_summary matching the corresponding run-N/result.json.

Leaving PR #1600 as draft/unmerged until a clean live grader dogfood run is available or this gate is explicitly waived.

@christso

christso commented Jul 2, 2026

Collaborator Author

Live dogfood gate is now unblocked and passing for commit 1f602d05.

Configuration used:

  • Endpoint: http://127.0.0.1:10531/v1
  • Model: gpt-5.3-codex-spark
  • Target provider: openai, api_format: chat
  • No real API key was used. I verified the endpoint accepts chat completions without auth. AgentV target validation requires an api_key value for this provider, so I set DOGFOOD_LOCAL_API_KEY=unused-local-placeholder only for client initialization.

Dogfood command:
DOGFOOD_LOCAL_API_KEY=unused-local-placeholder DOGFOOD_LOCAL_MODEL=gpt-5.3-codex-spark bun apps/cli/src/cli.ts eval run /tmp/agentv-kfik8-local-dogfood/repeat-transcript.eval.yaml --targets /tmp/agentv-kfik8-local-dogfood/targets.yaml --target local-openai-agent --workers 1

Result:

  • Passed: 1/1
  • Score: 100%
  • Run bundle: .agentv/results/2026-07-02T14-31-30-578Z

Artifact contract evidence:

  • Root index.jsonl row: test_id=repeat-transcript-summary-local, execution_status=ok, score=1.
  • trials.length === 2.
  • Both trials[] entries include transcript_summary.
  • Both trial summaries match the corresponding run-N/result.json transcript_summary.
  • Attempt sidecars exist under run-1/ and run-2/ with result.json, grading.json, metrics.json, timing.json, transcript.json, and transcript-raw.jsonl.
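A check like the one described in this evidence list could be scripted along these lines. This is a sketch, not the procedure actually used: it assumes trials[i] corresponds to run-(i+1), and it takes already-loaded file contents rather than reading the bundle from disk.

```typescript
type Json = Record<string, unknown>;

// Parse JSONL text into one object per non-empty line.
function parseJsonl(text: string): Json[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Json);
}

// Structural equality for JSON values, independent of key order.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aObj = a as Json;
  const bObj = b as Json;
  const aKeys = Object.keys(aObj);
  if (aKeys.length !== Object.keys(bObj).length) return false;
  return aKeys.every(
    (key) => Object.prototype.hasOwnProperty.call(bObj, key) && deepEqual(aObj[key], bObj[key]),
  );
}

// Compare each trials[] summary in an index row against the parsed
// run-N/result.json objects (assumed order: attemptResults[i] is run-(i+1)).
function checkTrialSummaries(row: Json, attemptResults: Json[]): string[] {
  const problems: string[] = [];
  const trials = Array.isArray(row["trials"]) ? (row["trials"] as Json[]) : [];
  if (trials.length !== attemptResults.length) {
    problems.push(`expected ${attemptResults.length} trials, found ${trials.length}`);
  }
  trials.forEach((trial, i) => {
    const label = `run-${i + 1}`;
    const inlined = trial["transcript_summary"];
    if (inlined === undefined) {
      problems.push(`${label}: transcript_summary missing from trials[]`);
      return;
    }
    const expected = attemptResults[i]?.["transcript_summary"];
    if (!deepEqual(inlined, expected)) {
      problems.push(`${label}: transcript_summary differs from result.json`);
    }
  });
  return problems;
}
```

An empty array from `checkTrialSummaries` means every trial entry carries a summary that matches its attempt's result.json.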

CI was already green for 1f602d05; I am marking the PR ready and proceeding with the GitHub PR merge if it remains clean.

@christso christso marked this pull request as ready for review July 2, 2026 14:32
@christso christso merged commit e9ab59b into main Jul 2, 2026
8 checks passed
@christso christso deleted the feat/av-kfik-8-transcripts branch July 2, 2026 14:32