Add *.eval.ts auto-discovery so TS-first eval authors don't need a hand-wired runner #1116

@christso

Problem

Running evals authored in TypeScript today requires writing a run.ts that imports case modules and calls evaluate(...) explicitly. YAML evals are auto-discovered by the CLI; TS evals aren't.

Evidence: apps/cli/src/commands/eval/commands/run.ts:23 describes the positional args as "Path(s) or glob(s) to evaluation .yaml file(s)". Globbing works for YAML / JSONL / JSON only. The sdk-config-file example covers agentv.config.ts discovery but that's the config, not eval case files.

That hand-written runner is the main reason the TS authoring path feels second-class next to YAML.

Proposal

A discovery convention:

  • CLI discovers **/*.eval.ts (and **/*.eval.js after a build step, if relevant) the same way it discovers EVAL.yaml / *.eval.yaml.
  • Each discovered module default-exports (or named-exports) an EvalConfig value, and the CLI runs it with the same reporter, inspect, and compare tooling as YAML evals.
  • Config discovery precedence and --filter / --tag / --only flags apply uniformly across YAML and TS evals.
  • Runtime: use whatever loader the runtime supports for TS modules (Bun direct import, or tsx / jiti for Node). Document the expectation.
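The convention above could be sketched roughly as follows. This is a minimal illustration, not agentv's implementation: `classifyEvalFile`, `resolveEvalConfig`, and `EvalConfigLike` are hypothetical names, and the real `EvalConfig` shape is not given in this issue, so a one-field stand-in is used. The sketch assumes the default export wins over named exports when both are present.

```typescript
// Hypothetical sketch: none of these names are taken from the agentv codebase.

// File-name patterns treated as eval files. The YAML patterns come from the
// proposal text; the .ts/.js patterns are the proposed additions.
const EVAL_FILE_PATTERNS: RegExp[] = [
  /(^|\/)EVAL\.yaml$/,
  /\.eval\.yaml$/,
  /\.eval\.ts$/,
  /\.eval\.js$/,
];

type EvalSourceKind = "yaml" | "module";

// Classify a path as a YAML eval, a TS/JS eval module, or not an eval file (null).
function classifyEvalFile(path: string): EvalSourceKind | null {
  const normalized = path.replace(/\\/g, "/"); // tolerate Windows separators
  if (!EVAL_FILE_PATTERNS.some((pattern) => pattern.test(normalized))) {
    return null;
  }
  return normalized.endsWith(".yaml") ? "yaml" : "module";
}

// Stand-in for the real EvalConfig type (assumption: it is an object with a name).
interface EvalConfigLike {
  name: string;
}

function isEvalConfigLike(value: unknown): value is EvalConfigLike {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { name?: unknown }).name === "string"
  );
}

// Pick the config from an imported module namespace: the default export first,
// then the first named export that looks like a config. Returns null if none does.
function resolveEvalConfig(mod: Record<string, unknown>): EvalConfigLike | null {
  const defaultExport = mod["default"];
  if (isEvalConfigLike(defaultExport)) {
    return defaultExport;
  }
  for (const [key, value] of Object.entries(mod)) {
    if (key !== "default" && isEvalConfigLike(value)) {
      return value;
    }
  }
  return null;
}
```

In a real loader, the `mod` argument would be the result of `await import(path)` under Bun, or of whichever Node loader (tsx, jiti) is chosen. Keeping classification and export resolution as pure functions makes the discovery rules easy to unit-test without touching the file system.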

Acceptance criteria

  • agentv run picks up *.eval.ts files with no extra flags.
  • A TS eval and a YAML eval in the same workspace produce identical trace / inspect output.
  • Example under examples/features/ demonstrates a mixed YAML + TS suite.
  • Docs updated with discovery rules and the runtime expectation for executing .ts files.
  • --workers, --threshold, --tag, --exclude-tag, --filter, --retry-errors, --output, and the cache/output-dir behaviour all work identically for TS-authored suites.
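One way to make the flags behave identically is to normalize YAML and TS evals into a single descriptor before any filtering runs, so the filter never sees the source format. The sketch below illustrates that idea for `--tag` and `--exclude-tag` only. `DiscoveredEval`, `TagFilter`, and `selectEvals` are hypothetical names, and the matching rules (include if any tag matches, exclude if any excluded tag matches) are assumptions, not agentv's documented semantics.

```typescript
// Hypothetical sketch: the real agentv types and flag semantics may differ.

// Common descriptor produced for every discovered eval, whatever its source.
interface DiscoveredEval {
  id: string;
  source: "yaml" | "ts";
  tags: string[];
}

interface TagFilter {
  include: string[]; // assumed meaning of --tag: keep evals carrying any of these
  exclude: string[]; // assumed meaning of --exclude-tag: drop evals carrying any of these
}

// Apply the tag filter. The `source` field is deliberately never consulted,
// which is what makes the behaviour uniform across YAML and TS suites.
function selectEvals(evals: DiscoveredEval[], filter: TagFilter): DiscoveredEval[] {
  return evals.filter((candidate) => {
    const included =
      filter.include.length === 0 ||
      filter.include.some((tag) => candidate.tags.includes(tag));
    const excluded = filter.exclude.some((tag) => candidate.tags.includes(tag));
    return included && !excluded;
  });
}
```

A regression test for the acceptance criterion could then run the same filter over a mixed list and assert that two evals with the same tags are selected or dropped together, regardless of `source`.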

Non-goals

Depends on

Motivation

Closes the DX gap with hand-rolled TS harnesses while keeping agentv's framework advantages — cost tracking, inspect, compare, grader variety. The current cliff is: "use YAML, or write your own runner script." Neither is what a TS-first user wants.

Metadata

Assignees

No one assigned

    Labels

    in-progress: Claimed by an agent, do not duplicate work
    sdk: Relates to the TypeScript SDK (programmatic API, CLI flags, schema)

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
