Release v0.10.0 · Fango2007/InferHarness

Run functional failure clue — Run now surfaces a benchmark assertion failure line when quality metrics fail despite a technically completed run, with categories such as invalid tool arguments or missing tool calls.
Duplicate tool-call argument scoring — tool_arguments_valid now consumes matched tool calls so repeated calls to the same function with different arguments are scored consistently with tool_call_assertion_pass.
Legacy Runs API cleanup — removed the orphaned public /runs list/delete routes, their route-specific service, and route-only tests now that Results deletion uses /results-view/runs/:runId, while retaining the underlying run/result tables for active benchmark, evaluation, retention, and cleanup flows.
Datasets editor checkpoint — added a Datasets page and JSONL dataset-file API for creating, editing, saving, and deleting dataset item files under INFERHARNESS_BENCHMARK_DATASET_ROOT, with synced dataset_manifest documents, copy-down editing for repeated fields, and clamped long-prompt display.
Benchmark plan cleanup — removed the transitional inline /benchmark/plans/run execution API and stale INFERHARNESS_TEST_TEMPLATES_DIR example so plan execution goes through persisted benchmark_plan documents.
Tool-call assertion metric — benchmark tool-call templates now include tool_call_assertion_pass, a single-turn pass/fail metric requiring exact expected tool selection and structurally matching arguments while keeping assertion failures as quality metrics rather than execution failures.
Tool-call assertion UI — Run now promotes tool-call assertion pass/fail as the primary correctness verdict, and Templates groups metrics with readable labels while auto-adding the assertion metric when tool calling is enabled.
Onboarding prompt scope — the Run page completion handoff now appears only for the onboarding first-run step, canceling the onboarding-launched server drawer stops setup with an explicit normal-mode notice, and the three-step welcome layout is centered.
Built-in template reload after DB clear — clearing the database from Settings now reloads the built-in benchmark library immediately, keeping shipped templates available without restarting the backend.
Catalog empty server card — the empty Servers catalog now presents a dashed first-server card with the add action instead of a centered empty-state panel.
Catalog model auto-selection — opening the Models catalog without a server filter now selects the first available inference server so discovered models render immediately.
Run empty preview — the empty Run workspace now shows a dummy benchmark result with sample model, prompt, metrics, and audit rows instead of a generic empty panel.
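The two tool-call scoring entries above (duplicate tool-call argument scoring and the tool_call_assertion_pass metric) can be illustrated with a small sketch. This is not InferHarness's implementation: the `ToolCall` type and both function names are hypothetical, and the rule that extra calls fail the assertion is an assumption. It only shows why consuming each matched call matters when the same function is called more than once.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical shape of one tool call: a function name plus its arguments."""
    name: str
    arguments: dict[str, Any]


def match_with_consumption(expected: list[ToolCall], actual: list[ToolCall]) -> list[bool]:
    """For each expected call, report whether a not-yet-used actual call matches it."""
    remaining = list(actual)  # work on a copy so the caller's list is untouched
    results: list[bool] = []
    for want in expected:
        # Dict equality is structural: key order is ignored, nested values are compared.
        hit = next(
            (i for i, got in enumerate(remaining)
             if got.name == want.name and got.arguments == want.arguments),
            None,
        )
        if hit is None:
            results.append(False)
        else:
            remaining.pop(hit)  # consume the match so it cannot satisfy a later expectation
            results.append(True)
    return results


def assertion_pass(expected: list[ToolCall], actual: list[ToolCall]) -> bool:
    """Single pass/fail verdict. Assumption: extra or missing calls both fail."""
    return len(expected) == len(actual) and all(match_with_consumption(expected, actual))


paris = ToolCall("get_weather", {"city": "Paris"})
rome = ToolCall("get_weather", {"city": "Rome"})

# Two expected Paris calls, but the model called Paris once and Rome once.
# Without consumption the single Paris call would satisfy both expectations.
print(match_with_consumption([paris, paris], [paris, rome]))  # [True, False]
print(assertion_pass([paris, rome], [rome, paris]))           # True
```

Because both functions share the same consuming matcher, a per-call score and the overall verdict cannot disagree about repeated calls, which is the consistency the release note describes.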

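For the Datasets editor entry, the following is a minimal sketch of reading and writing JSONL item files confined to a root directory. Only the INFERHARNESS_BENCHMARK_DATASET_ROOT variable name comes from the release notes. The helper names, the `./datasets` fallback, the item fields, and the path check are assumptions made for illustration, not the project's actual API.

```python
import json
import os
from pathlib import Path
from typing import Any


def dataset_root() -> Path:
    """Read the root from the environment. The './datasets' fallback is hypothetical."""
    return Path(os.environ.get("INFERHARNESS_BENCHMARK_DATASET_ROOT", "./datasets")).resolve()


def resolve_dataset_path(root: Path, relative_name: str) -> Path:
    """Resolve a file name under root and reject anything that escapes it."""
    root = root.resolve()
    candidate = (root / relative_name).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{relative_name!r} is outside the dataset root")
    return candidate


def save_items(path: Path, items: list[dict[str, Any]]) -> None:
    """Write one JSON object per line, which is the JSONL convention."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for item in items:
            fh.write(json.dumps(item, ensure_ascii=False) + "\n")


def load_items(path: Path) -> list[dict[str, Any]]:
    """Parse every non-blank line as a JSON object."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A caller would typically combine these as `save_items(resolve_dataset_path(dataset_root(), "demo/items.jsonl"), items)`. Resolving the path before comparing it to the root is one common way to keep names such as `../other.jsonl` from reaching files outside the configured directory.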