
Proposal: support external agent eval fixture packs #12

Description

@bennewell35

Problem

Small Harness has a useful agent eval path (small-harness --eval ..., /eval agent ..., and /play ...), but eval fixtures are currently registered in Rust source via the built-in fixture list. That makes custom or domain-specific eval packs awkward: adding one requires changing and rebuilding Small Harness itself instead of dropping a fixture into a project or sharing a small fixture directory.

This limits use cases like:

  • comparing local models against the same project-specific tasks
  • shipping reusable eval packs alongside a repo
  • testing domain-specific agent behavior without forking Small Harness
  • using /play-style sandboxes for non-built-in tasks

Proposal

Add support for external AgentEvalFixture definitions loaded from a file or project directory while keeping the current built-ins unchanged.

A minimal first version could support:

small-harness --eval ./evals/fixtures/fix-readme-badge.json --model qwen2.5:7b

and optionally later:

/eval agent ./evals/fixtures/fix-readme-badge.json
/eval agent project
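One way the same `--eval` argument could accept either a built-in ID or a fixture path is sketched below. This is only an illustration: `EvalTarget`, `resolve_eval_target`, and the path heuristic are hypothetical names and assumptions, not existing Small Harness code. Built-in IDs are checked first so that current invocations never change meaning.

```rust
use std::path::PathBuf;

/// What an `--eval` argument refers to (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum EvalTarget {
    /// ID of a fixture from the built-in list.
    Builtin(String),
    /// Path to an external fixture JSON file.
    External(PathBuf),
}

/// Classify an `--eval` argument. `builtin_ids` stands in for the existing
/// built-in fixture list; how that list is exposed is an assumption.
fn resolve_eval_target(arg: &str, builtin_ids: &[&str]) -> Result<EvalTarget, String> {
    // Built-ins win, so existing commands keep their current behavior.
    if builtin_ids.contains(&arg) {
        return Ok(EvalTarget::Builtin(arg.to_string()));
    }
    // Assumed heuristic: a `.json` suffix or a path separator means "file path".
    let looks_like_path = arg.ends_with(".json") || arg.contains('/') || arg.contains('\\');
    if looks_like_path {
        return Ok(EvalTarget::External(PathBuf::from(arg)));
    }
    Err(format!(
        "unknown eval fixture '{arg}': not a built-in ID and not a path to a .json fixture"
    ))
}
```

With a placeholder built-in list such as `["some-builtin"]`, the argument `some-builtin` resolves to `EvalTarget::Builtin`, `./evals/fixtures/fix-readme-badge.json` resolves to `EvalTarget::External`, and a bare unknown word yields an error that names the argument.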

The suggested fixture shape can mirror the existing serialized Rust structs:

{
  "id": "fix-readme-badge",
  "prompt": "Update the README version badge to match Cargo.toml.",
  "workspace": "workspaces/fix-readme-badge",
  "checks": [
    { "type": "fileContains", "path": "README.md", "needle": "version-1.0.4" },
    { "type": "testsPass" }
  ]
}
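Each entry in `checks` carries a `type` tag, so a loader would need to reject tags it does not recognize with a readable message. The std-only sketch below shows that validation step in isolation; `KNOWN_CHECK_TYPES` and `validate_check_type` are hypothetical names, the tag list is the set of check types named in this proposal, and a real loader would presumably do this during JSON deserialization instead.

```rust
/// Check type tags this proposal names as already existing.
const KNOWN_CHECK_TYPES: [&str; 5] = [
    "testsPass",
    "fileContains",
    "gitClean",
    "toolUsed",
    "assistantMentions",
];

/// Return an error naming the fixture and the offending tag when the
/// check type is not one of the known ones (hypothetical helper).
fn validate_check_type(fixture_id: &str, check_type: &str) -> Result<(), String> {
    if KNOWN_CHECK_TYPES.contains(&check_type) {
        Ok(())
    } else {
        Err(format!(
            "fixture '{fixture_id}': unknown check type '{check_type}' (expected one of: {})",
            KNOWN_CHECK_TYPES.join(", ")
        ))
    }
}
```

Including the fixture ID and the list of accepted tags in the message keeps the error actionable when a pack contains many fixtures.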

Design constraints

  • Preserve all existing built-in fixtures and fixture IDs.
  • Keep external fixtures data-only at first; no arbitrary commands in fixture files.
  • Resolve fixture workspaces relative to the fixture file or fixture-pack root.
  • Reject workspace paths that escape the fixture pack/root.
  • Reuse existing check types initially (testsPass, fileContains, gitClean, toolUsed, assistantMentions).
  • Produce clear errors for unknown fixture paths, malformed JSON, unknown checks, or missing workspaces.
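The two workspace constraints above (resolve relative to a root, reject paths that escape it) could be enforced roughly as follows. This is a sketch under assumptions: `resolve_workspace` is a hypothetical helper, the caller decides whether `root` is the fixture file's directory or the pack root, and the check is purely lexical, so it does not catch escapes through symlinks (that would need a canonicalized comparison on the real filesystem).

```rust
use std::path::{Component, Path, PathBuf};

/// Join `workspace` onto `root`, refusing absolute paths and any `..`
/// sequence that would climb above `root` (hypothetical helper).
fn resolve_workspace(root: &Path, workspace: &str) -> Result<PathBuf, String> {
    let mut resolved = root.to_path_buf();
    // Number of components pushed below `root` so far.
    let mut depth = 0usize;

    for component in Path::new(workspace).components() {
        match component {
            Component::Normal(part) => {
                resolved.push(part);
                depth += 1;
            }
            Component::CurDir => {}
            Component::ParentDir => {
                if depth == 0 {
                    return Err(format!(
                        "workspace '{workspace}' escapes the fixture root"
                    ));
                }
                resolved.pop();
                depth -= 1;
            }
            Component::RootDir | Component::Prefix(_) => {
                return Err(format!(
                    "workspace '{workspace}' must be a relative path"
                ));
            }
        }
    }
    Ok(resolved)
}
```

For a root of `evals/fixtures`, the workspace `workspaces/fix-readme-badge` resolves inside the root, while `../outside` and `/etc` are both rejected. A stricter variant could simply refuse every `..` component.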

Acceptance criteria

  • small-harness --eval <builtin-id> continues to work unchanged.
  • small-harness --eval <path-to-fixture.json> runs an external fixture.
  • External fixture workspace copy behavior matches built-ins.
  • External fixture loading has unit tests for happy path, malformed fixture, missing workspace, and path traversal rejection.
  • README/Quickstart document a small external fixture example.

This would make Small Harness more useful as a general local-agent benchmark harness without forcing every eval scenario into the main repository.
