v0.9.0

github-actions released this 17 Jun 19:27


2e9f8c6

Added

File-backed benchmark document library — benchmark templates, datasets, runtime profiles, and plans now load from built-in JSON documents plus a writable local library, with API saves persisted to files so documents can be reconstructed after SQLite loss.
Native Anthropic and Gemini benchmark tool calls — benchmark execution now resolves Anthropic Messages and Gemini GenerateContent operations, maps dataset tools and tool_choice into provider-native payloads, and normalizes returned tool calls and usage metrics.
Adaptive Results performance views — the Results dashboard now has an Auto performance view with manual modes for cold-start comparison, latency trend, pass-rate trend, latency histogram, and model-summary table comparisons backed by filtered model aggregates.
Project workflow guardrails — AGENTS.md now combines the main branch workflow, Node 25 rules, challenge-and-skill behavior instructions, and a static-data rule that keeps prompts, schemas, fixtures, and examples out of application code.
Benchmark template agent — Templates now includes a review-first benchmark-template agent that uses a database-persisted Settings model, challenges underspecified requests, loads its prompt from Markdown with the full test_template schema and example injected, validates generated drafts server-side, and applies drafts to the existing editor without auto-saving.
Run-page persisted benchmark plan checkpoint — Run can now select saved chat benchmark templates, prepare inline or server-side dataset manifests, persist unique runtime/dataset/plan artifacts per click, execute /benchmark/plans/:id/run, and render per-target results including failed targets without result documents.
Run smoke chat benchmark template — the built-in benchmark document library now includes a real "Run smoke chat" test_template for first-run prompt checks.
Templates LLM-first layout — Templates now uses an AI-first authoring split with live JSON, Advanced form, and Raw JSON tabs, plus a redesigned preview/list layout for JSON-only test_template documents.
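
As an illustration of the native tool-call mapping described above, here is a minimal sketch of how dataset tools and tool_choice might be translated into Anthropic Messages and Gemini GenerateContent payloads. The `DatasetTool` and `DatasetToolChoice` shapes and the function names `toAnthropicTools` and `toGeminiTools` are hypothetical and are not taken from this project's code; only the provider-side field names (`input_schema`, `tool_choice`, `functionDeclarations`, `toolConfig.functionCallingConfig`) follow the public provider APIs.

```typescript
// Hypothetical dataset-side shapes (assumed, not the project's actual schema).
type DatasetTool = {
  name: string;
  description?: string;
  parameters: Record<string, unknown>; // JSON Schema for the tool arguments
};
type DatasetToolChoice = "auto" | "required" | "none" | { name: string };

type AnthropicToolChoice =
  | { type: "auto" | "any" | "none" }
  | { type: "tool"; name: string };

// Anthropic Messages: each tool carries `input_schema`; tool_choice is a tagged object.
function toAnthropicTools(tools: DatasetTool[], choice: DatasetToolChoice) {
  let tool_choice: AnthropicToolChoice;
  if (typeof choice === "object") {
    tool_choice = { type: "tool", name: choice.name };
  } else if (choice === "required") {
    tool_choice = { type: "any" };
  } else {
    tool_choice = { type: choice }; // "auto" or "none"
  }
  return {
    tools: tools.map((t) => ({
      name: t.name,
      description: t.description ?? "",
      input_schema: t.parameters,
    })),
    tool_choice,
  };
}

type GeminiFunctionCallingConfig = {
  mode: "AUTO" | "ANY" | "NONE";
  allowedFunctionNames?: string[];
};

// Gemini GenerateContent: tools are grouped under `functionDeclarations`;
// the choice becomes a calling mode plus an optional allow-list.
function toGeminiTools(tools: DatasetTool[], choice: DatasetToolChoice) {
  let functionCallingConfig: GeminiFunctionCallingConfig;
  if (typeof choice === "object") {
    functionCallingConfig = { mode: "ANY", allowedFunctionNames: [choice.name] };
  } else if (choice === "required") {
    functionCallingConfig = { mode: "ANY" };
  } else {
    functionCallingConfig = { mode: choice === "none" ? "NONE" : "AUTO" };
  }
  return {
    tools: [
      {
        functionDeclarations: tools.map((t) => ({
          name: t.name,
          description: t.description ?? "",
          parameters: t.parameters,
        })),
      },
    ],
    toolConfig: { functionCallingConfig },
  };
}
```

In this sketch a forced single tool maps to `{ type: "tool", name }` on Anthropic and to `mode: "ANY"` with a one-element `allowedFunctionNames` list on Gemini. The release's actual mapping may differ.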

Changed

Human-readable agent workflow guidance — AGENTS.md now groups workflow rules into clearer sections and documents parallel worktree expectations, including origin/main checks before commit/push requests and resync timing before validation or merge. It also directs agents to create a focused branch without pausing for confirmation and to explicitly request commit approval with a suggested message and details.
Templates agent composer — the Templates authoring panel now gives the freeform request field more space and removes the preset suggestion chip.
Run selection rail polish — the Run page model chips, benchmark template selector, and response header now use clearer selected-state borders/backgrounds and denser mono text, with redundant server summary and model helper copy removed.

Fixed

Results sidebar count — the Results navigation badge now reads the benchmark-native results total instead of the legacy runs endpoint.
Catalog sidebar count — the Catalog navigation item now shows the available model count in the sidebar badge.
Run smoke template selection — the Run page now selects the built-in "Run smoke chat" template by default and requires a real template document before starting a benchmark.
Run multi-model response labels — multi-model Run detail headers now reuse the same letter and accent color assigned in the selected model chips.
Run multi-model layout — multi-model benchmark details now use an auto-fitting grid that shows more cards per row on wide screens while keeping each card readable.
Run metrics placement — per-model metrics now sit directly under the model header in compact fields, with raw benchmark JSON moved beneath the benchmark audit.
Run metric emphasis — metric values in Run result cards now use bold mono text for faster scanning.
Run benchmark audit presentation — audit metadata now renders as compact 11px status lines with check, pending, and failure markers.
Run placeholder actions — disabled "Open in Evaluate" and "Copy as cURL" buttons were removed from the Run metrics panel.
Templates authoring draft preservation — switching between Live JSON, Advanced form, and Raw JSON now preserves the agent-inferred benchmark draft instead of reverting to the starter document.
Benchmark-only Results history — Results dashboard, history, detail drawers, and deletion now read benchmark test run records instead of legacy run/result tables, so benchmark smoke runs appear after completion.
Template agent starter drafting — the benchmark-template agent now drafts conservative starter templates for recognizable benchmark families such as tool-call compliance instead of blocking on follow-up questions when reasonable assumptions are available.
Built-in template onboarding — first-run onboarding now tracks only server connection, model selection, and first successful run, auto-selects installed chat templates on Run, and no longer asks users to create a starter template.
Benchmark foundation stress test timeout — the indexed lookup stress test now has an explicit timeout that matches its own 10-second performance budget, avoiding Vitest preemption on slower CI runners.
Agent guidance tracking — restored the tracked AGENTS.md project workflow rules while keeping the Node 25.x native-module guidance, restored CLAUDE.md tracking, and aligned Claude-specific project guidance with the enforced Node 25.x runtime.
Template agent settings rate limiting — /system/settings and /system/settings/template-agent-model now use an in-memory per-client rate limit before reading or updating app settings.
Template agent message contrast — assistant replies and validated draft previews now render with readable text on their light message backgrounds.
Production token bootstrap — production build and start scripts now run the local API token bootstrap so Vite has VITE_INFERHARNESS_API_TOKEN before bundling or previewing the frontend.
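
For readers unfamiliar with the in-memory per-client rate limit mentioned for the settings routes above, the following is a minimal fixed-window sketch. The class name `InMemoryRateLimiter`, the fixed-window algorithm, the client key, and the limit values are all assumptions for illustration. The release notes do not say which algorithm or thresholds the project uses.

```typescript
// Hypothetical fixed-window limiter; counters live only in process memory,
// so they reset whenever the server restarts.
type RateWindow = { start: number; count: number };

class InMemoryRateLimiter {
  private readonly windows = new Map<string, RateWindow>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the client is over its limit.
  allow(clientKey: string, now: number = Date.now()): boolean {
    const current = this.windows.get(clientKey);
    // Start a fresh window for new clients or when the previous window has expired.
    if (!current || now - current.start >= this.windowMs) {
      this.windows.set(clientKey, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) {
      return false;
    }
    current.count += 1;
    return true;
  }
}
```

A route handler would call `allow(clientKey)` before reading or updating settings and answer with HTTP 429 when it returns false. The client key (for example an IP address or API token) is left to the caller here.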
