feat(eval): normalize transcript artifacts by christso · Pull Request #1600 · EntityProcess/agentv

christso · 2026-07-02T10:20:02Z

Summary

Write normalized transcript.json documents while preserving raw provider evidence as transcript-raw.jsonl.
Add canonical transcript tool_name normalization and inline transcript_summary into result rows/artifacts.
Update CLI/Dashboard readers, docs, and tests for the new transcript contract, with legacy JSONL fallback where needed.
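
For readers unfamiliar with the fallback, here is a minimal sketch of how a reader might prefer the normalized document and fall back to legacy JSONL. This is not the code in this PR: the `events` field, the `LoadedTranscript` shape, and both function names are assumptions for illustration only.

```typescript
// Hypothetical shapes; the real AgentV transcript schema may differ.
interface TranscriptEvent {
  [key: string]: unknown;
}

interface LoadedTranscript {
  source: "normalized" | "legacy-jsonl";
  events: TranscriptEvent[];
}

// Parse one JSON object per non-empty line (legacy JSONL layout).
function parseJsonl(text: string): TranscriptEvent[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as TranscriptEvent);
}

// Prefer the normalized document; fall back to legacy JSONL when it is absent.
// Callers pass file contents (or undefined when the file does not exist).
function loadTranscript(
  normalizedJson: string | undefined,
  legacyJsonl: string | undefined,
): LoadedTranscript | undefined {
  if (normalizedJson !== undefined) {
    // Assumption: the normalized document keeps its events under `events`.
    const doc = JSON.parse(normalizedJson) as { events?: TranscriptEvent[] };
    return { source: "normalized", events: doc.events ?? [] };
  }
  if (legacyJsonl !== undefined) {
    return { source: "legacy-jsonl", events: parseJsonl(legacyJsonl) };
  }
  return undefined;
}
```

Taking file contents rather than paths keeps the sketch free of filesystem calls, so the precedence rule can be checked in isolation.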

Verification

bun run lint
bun run build
bun run test
bun run validate:examples
bun test apps/cli/test/commands/eval/artifact-writer.test.ts packages/core/test/evaluation/orchestrator.test.ts apps/dashboard/src/components/transcript-timeline.test.tsx
bun test apps/cli/test/commands/results/export.test.ts apps/cli/test/commands/results/serve.test.ts

Dogfood

Live local OpenAI-compatible run against http://127.0.0.1:10531/v1, model gpt-5.4-mini, passed 1/1.
Local run bundle: .agentv/results/2026-07-02T10-00-04-331Z
Private evidence: agentv-private branch evidence/av-kfik8-transcripts-2026-07-02, commit 6c17be5.

Bead: av-kfik.8

cloudflare-workers-and-pages · 2026-07-02T10:20:15Z

Deploying agentv with Cloudflare Pages

Latest commit:	`1f602d0`
Status:	✅ Deploy successful!
Preview URL:	https://19494009.agentv.pages.dev
Branch Preview URL:	https://feat-av-kfik-8-transcripts.agentv.pages.dev

christso · 2026-07-02T13:44:30Z

Independent review for Bead av-kfik.8.1.

Finding:

P2: transcript_summary is not projected for repeat-run result rows.

The Bead requires the precomputed summary to be inlined into each result row so consumers can inspect trajectory/metrics without reparsing transcript artifacts. The writer attaches transcript_summary only when a single-run transcriptPath exists (buildIndexArtifactEntry, packages/core/src/evaluation/run-artifacts.ts:1790) or when isSingleRun && hasTranscript is true (buildResultIndexArtifact, packages/core/src/evaluation/run-artifacts.ts:1879). For repeated runs isSingleRun is false, so the root index.jsonl row has trials and per-attempt transcript paths but no transcript_summary. The Dashboard repeat read model also reconstructs attempt paths without adding transcript_summary (apps/cli/src/commands/results/serve.ts:1153). As a result, repeated-eval consumers still have to parse run-N/transcript.json despite the new row-level contract.

Please either project the summaries (per attempt into trials[], or as an aggregate row summary) or explicitly narrow the contract/docs if repeat rows are intentionally excluded.

Other checks:

Raw provider evidence remains transcript-raw.jsonl; normalized transcript is transcript.json.
Canonical tool_name enum is routed through provider-aware normalization.
I did not find ordinary sidecars being added to artifact_pointers.
Local targeted tests could not run in this review worktree because dependencies/build links were missing (react/jsx-dev-runtime, micromatch, @agentv/core), but PR checks are passing on GitHub.
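
To illustrate what "provider-aware normalization" means in the checks above, here is a small sketch. The canonical names, provider keys, and alias tables are all hypothetical; the actual AgentV enum and mappings are not shown in this thread.

```typescript
// Hypothetical canonical names; not the real AgentV enum.
type CanonicalToolName = "shell" | "read_file" | "write_file" | "unknown";

// Hypothetical per-provider alias tables (raw tool name -> canonical name).
const TOOL_ALIASES = new Map<string, Map<string, CanonicalToolName>>([
  [
    "provider-a",
    new Map<string, CanonicalToolName>([
      ["Bash", "shell"],
      ["Read", "read_file"],
      ["Write", "write_file"],
    ]),
  ],
  [
    "provider-b",
    new Map<string, CanonicalToolName>([
      ["exec_command", "shell"],
      ["view", "read_file"],
    ]),
  ],
]);

// Look up the raw name in the provider's own table, so that the same raw
// name can map differently (or not at all) for another provider.
function normalizeToolName(provider: string, rawName: string): CanonicalToolName {
  return TOOL_ALIASES.get(provider)?.get(rawName) ?? "unknown";
}
```

Mapping unrecognized names to a single `unknown` value keeps the enum closed while the raw name stays available in transcript-raw.jsonl.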

christso · 2026-07-02T14:26:18Z

Addressed the repeat-run transcript summary blocker in commit 1f602d0 (fix(eval): project repeat transcript summaries).

What changed:

index.jsonl trial entries now include trials[].transcript_summary when the trial has a persisted transcript-bearing result.
Dashboard/API repeat trial read models preserve trials[].transcript_summary from the manifest and hydrate it from run-N/result.json for older rows that do not inline it.
Added focused writer and serve/API regressions for repeat trial projection.
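
A rough sketch of the read-model behavior described above, for reviewers who want the shape without opening the diff. This is not the code in commit 1f602d0: `TrialEntry`, `AttemptResult`, `hydrateTrials`, the `attempt` field, and the loader callback are hypothetical names.

```typescript
// Hypothetical shapes for illustration only.
interface TranscriptSummary {
  [key: string]: unknown;
}

interface TrialEntry {
  attempt: number; // 1-based, corresponding to run-N/
  transcript_summary?: TranscriptSummary;
}

interface AttemptResult {
  transcript_summary?: TranscriptSummary;
}

// Keep a summary that is already inlined in the manifest row; otherwise try
// to hydrate it from the attempt's result.json via the supplied loader.
function hydrateTrials(
  trials: TrialEntry[],
  loadAttemptResult: (attempt: number) => AttemptResult | undefined,
): TrialEntry[] {
  return trials.map((trial) => {
    if (trial.transcript_summary !== undefined) return trial;
    const summary = loadAttemptResult(trial.attempt)?.transcript_summary;
    // Leave the trial untouched when no summary can be found.
    return summary === undefined ? trial : { ...trial, transcript_summary: summary };
  });
}
```

Injecting the loader keeps the precedence rule (inlined first, result.json second) testable without touching the filesystem.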

Validation:

bun --filter @agentv/core build passed
bun test apps/cli/test/commands/eval/artifact-writer.test.ts passed (66/66)
bun test apps/cli/test/commands/results/serve.test.ts passed (120/120)
bun --filter @agentv/core typecheck passed
bun --filter agentv typecheck passed
bunx biome check packages/core/src/evaluation/run-artifacts.ts apps/cli/src/commands/results/serve.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/serve.test.ts passed
bun --filter agentv build passed
git diff --check passed

Live dogfood:

Attempted OpenAI direct repeat-run dogfood; blocked because the local OPENAI_API_KEY is a dummy key and returned HTTP 401.
Attempted GitHub Models live repeat-run dogfood with openai/gpt-4o-mini and openai/gpt-4.1-mini; both reached the live provider path, but the LLM grader failed with malformed JSON parse errors.
The resulting repeat bundle at /tmp/agentv-kfik8-dogfood/run still validated the artifact projection under review: root index.jsonl had 2 trials, and both trial entries had transcript_summary matching each run-N/result.json.

christso · 2026-07-02T14:27:00Z

CI is green for commit 1f602d05, but I am not marking the PR ready or merging it.

Blocker: the repo verification gate asks for live dogfood on repeat-run/artifact contract changes. I attempted that gate:

Direct OpenAI target: blocked by local OPENAI_API_KEY returning HTTP 401 (dummy key).
GitHub Models target with openai/gpt-4o-mini and openai/gpt-4.1-mini: live provider calls ran, but the LLM grader failed repeatedly with malformed JSON parse errors.

Useful evidence from the failed live run: /tmp/agentv-kfik8-dogfood/run/index.jsonl had 2 repeat trials, and both trials[] entries included transcript_summary matching the corresponding run-N/result.json.

Leaving PR #1600 as draft/unmerged until a clean live grader dogfood run is available or this gate is explicitly waived.

christso · 2026-07-02T14:32:22Z

Live dogfood gate is now unblocked and passing for commit 1f602d05.

Configuration used:

Endpoint: http://127.0.0.1:10531/v1
Model: gpt-5.3-codex-spark
Target provider: openai, api_format: chat
No real API key was used. I verified the endpoint accepts chat completions without auth. AgentV target validation requires an api_key value for this provider, so I set DOGFOOD_LOCAL_API_KEY=unused-local-placeholder only for client initialization.

Dogfood command:
DOGFOOD_LOCAL_API_KEY=unused-local-placeholder DOGFOOD_LOCAL_MODEL=gpt-5.3-codex-spark bun apps/cli/src/cli.ts eval run /tmp/agentv-kfik8-local-dogfood/repeat-transcript.eval.yaml --targets /tmp/agentv-kfik8-local-dogfood/targets.yaml --target local-openai-agent --workers 1

Result:

Passed: 1/1
Score: 100%
Run bundle: .agentv/results/2026-07-02T14-31-30-578Z

Artifact contract evidence:

Root index.jsonl row: test_id=repeat-transcript-summary-local, execution_status=ok, score=1.
trials.length === 2.
Both trials[] entries include transcript_summary.
Both trial summaries match the corresponding run-N/result.json transcript_summary.
Attempt sidecars exist under run-1/ and run-2/ with result.json, grading.json, metrics.json, timing.json, transcript.json, and transcript-raw.jsonl.
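
The summary-match check above can be sketched roughly as follows. This is an illustration, not the script used for the dogfood run: `findMismatchedTrials` and `stableStringify` are hypothetical helpers, and the caller is assumed to have already parsed the index.jsonl row and each run-N/result.json.

```typescript
// Serialize with sorted object keys so key order does not affect equality.
function stableStringify(value: unknown): string {
  if (value === undefined) return "undefined";
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value);
}

interface TrialRow {
  transcript_summary?: unknown;
}

// Returns the 1-based attempt numbers whose inlined summary is missing or
// differs from the transcript_summary read from run-N/result.json.
function findMismatchedTrials(trials: TrialRow[], attemptSummaries: unknown[]): number[] {
  const mismatched: number[] = [];
  trials.forEach((trial, index) => {
    const inlined = trial.transcript_summary;
    const expected = attemptSummaries[index];
    if (inlined === undefined || stableStringify(inlined) !== stableStringify(expected)) {
      mismatched.push(index + 1);
    }
  });
  return mismatched;
}
```

An empty result means every trial row carries a summary equal to its attempt's result.json.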

CI was already green for 1f602d05; I am marking the PR ready and proceeding with the GitHub PR merge if it remains clean.

feat(eval): normalize transcript artifacts

dca97ef

fix(eval): project repeat transcript summaries

1f602d0

christso marked this pull request as ready for review July 2, 2026 14:32

christso merged commit e9ab59b into main Jul 2, 2026
8 checks passed

christso deleted the feat/av-kfik-8-transcripts branch July 2, 2026 14:32

christso merged 2 commits into main from feat/av-kfik-8-transcripts