Regression Guard, Agent Card, Smart Recommend & External Datasets #15

@himmi-01

Adds four developer-experience improvements to make EvalMonkey feel native to the agent development workflow:

  • Regression Guard: evalmonkey guard exits with code 1 if your agent's score drops compared with the last baseline, so it can act as a CI/CD gate; a warning is also issued automatically on every run-benchmark run.
  • Agent Card: evalmonkey report generates a shareable Markdown file with a shields.io badge and a per-scenario score table, ready to paste into your README.
  • Smart Recommend: evalmonkey recommend reads agent_type from evalmonkey.yaml and shows only the relevant benchmark subset (e.g. research_agent → hotpotqa, drop, gaia-benchmark) instead of all 22 benchmarks.
  • External & Private Datasets: bring your own data via --dataset my_cases.jsonl, hf::org/dataset (any Hugging Face dataset), the confident-ai::id, braintrust::ref, and langsmith::id prefixes (which run the harness on top of datasets in your existing eval platform), or a generic REST endpoint configured in evalmonkey.yaml.
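
To make two of the behaviours above concrete, here is a minimal Python sketch of how a --dataset value with a `prefix::identifier` form might be routed, and how a guard could map a score comparison to an exit code. It is an illustration under stated assumptions, not EvalMonkey's actual implementation: `DatasetRef`, `parse_dataset_spec`, `guard_exit_code`, and the source labels are hypothetical names, and treating an unprefixed value as a local file and a missing baseline as a pass are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Prefixes taken from the description above; the labels on the right are
# hypothetical names for whichever loader would handle each source.
KNOWN_PREFIXES = {
    "hf": "huggingface",
    "confident-ai": "confident-ai",
    "braintrust": "braintrust",
    "langsmith": "langsmith",
}


@dataclass(frozen=True)
class DatasetRef:
    source: str      # which loader would handle the reference
    identifier: str  # file path, dataset name, or platform id


def parse_dataset_spec(spec: str) -> DatasetRef:
    """Split a --dataset value into a (source, identifier) pair."""
    prefix, sep, rest = spec.partition("::")
    if sep:
        if prefix not in KNOWN_PREFIXES:
            raise ValueError(f"unknown dataset prefix: {prefix!r}")
        if not rest:
            raise ValueError("missing identifier after '::'")
        return DatasetRef(KNOWN_PREFIXES[prefix], rest)
    # Assumption: a value without a '::' prefix is a local file path.
    return DatasetRef("local", spec)


def guard_exit_code(current: float, baseline: Optional[float]) -> int:
    """Return 1 when the score dropped below the baseline, else 0."""
    if baseline is None:
        # Assumption: with no stored baseline there is nothing to regress from.
        return 0
    return 1 if current < baseline else 0
```

In a CI job, a wrapper script could end with `sys.exit(guard_exit_code(score, baseline))` so that a regression fails the pipeline step, which matches the exit-code-1 behaviour described for the guard.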
