Releases: Fango2007/InferHarness

v0.10.0

19 Jun 13:52

  • Run functional failure clue — Run now surfaces a benchmark assertion failure line when quality metrics fail despite a technically completed run, with categories such as invalid tool arguments or missing tool calls.
  • Duplicate tool-call argument scoring — tool_arguments_valid now consumes matched tool calls so repeated calls to the same function with different arguments are scored consistently with tool_call_assertion_pass.
  • Legacy Runs API cleanup — removed the orphaned public /runs list/delete routes, their route-specific service, and route-only tests now that Results deletion uses /results-view/runs/:runId, while retaining the underlying run/result tables for active benchmark, evaluation, retention, and cleanup flows.
  • Datasets editor checkpoint — added a Datasets page and JSONL dataset-file API for creating, editing, saving, and deleting dataset item files under INFERHARNESS_BENCHMARK_DATASET_ROOT, with synced dataset_manifest documents, copy-down editing for repeated fields, and clamped long-prompt display.
  • Benchmark plan cleanup — removed the transitional inline /benchmark/plans/run execution API and stale INFERHARNESS_TEST_TEMPLATES_DIR example so plan execution goes through persisted benchmark_plan documents.
  • Tool-call assertion metric — benchmark tool-call templates now include tool_call_assertion_pass, a single-turn pass/fail metric requiring exact expected tool selection and structurally matching arguments while keeping assertion failures as quality metrics rather than execution failures.
  • Tool-call assertion UI — Run now promotes tool-call assertion pass/fail as the primary correctness verdict, and Templates groups metrics with readable labels while auto-adding the assertion metric when tool calling is enabled.
  • Onboarding prompt scope — the Run page completion handoff now appears only for the onboarding first-run step; canceling the onboarding-launched server drawer now stops setup with an explicit normal-mode notice; and the three-step welcome layout is centered.
  • Built-in template reload after DB clear — clearing the database from Settings now reloads the built-in benchmark library immediately, keeping shipped templates available without restarting the backend.
  • Catalog empty server card — the empty Servers catalog now presents a dashed first-server card with the add action instead of a centered empty-state panel.
  • Catalog model auto-selection — opening the Models catalog without a server filter now selects the first available inference server so discovered models render immediately.
  • Run empty preview — the empty Run workspace now shows a dummy benchmark result with sample model, prompt, metrics, and audit rows instead of a generic empty panel.
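
The "consume matched tool calls" behavior in the duplicate tool-call scoring entry above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual scorer: the `ToolCall` shape and the `deepEqual` and `allExpectedCallsMatched` helpers are hypothetical names. Each expected call removes the actual call it matches, so two calls to the same function with different arguments must each find their own counterpart.

```typescript
// Hypothetical shape; the real benchmark documents may differ.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Structural equality for JSON-like values (object key order does not matter).
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aKeys = Object.keys(a as object);
  const bKeys = Object.keys(b as object);
  if (aKeys.length !== bKeys.length) return false;
  return aKeys.every(
    (k) =>
      Object.prototype.hasOwnProperty.call(b, k) &&
      deepEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k])
  );
}

// Returns true only if every expected call is matched by a distinct actual call.
function allExpectedCallsMatched(expected: ToolCall[], actual: ToolCall[]): boolean {
  const remaining = [...actual];
  for (const want of expected) {
    const idx = remaining.findIndex(
      (got) => got.name === want.name && deepEqual(got.arguments, want.arguments)
    );
    if (idx === -1) return false;
    remaining.splice(idx, 1); // consume the match so it cannot be reused
  }
  return true;
}
```

Without the `splice`, a response that repeats one call twice could satisfy two different expected calls, which is the inconsistency the entry describes.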

v0.9.0

17 Jun 19:27
2e9f8c6

Added

  • File-backed benchmark document library — benchmark templates, datasets, runtime profiles, and plans now load from built-in JSON documents plus a writable local library, with API saves persisted to files so documents can be reconstructed after SQLite loss.
  • Native Anthropic and Gemini benchmark tool calls — benchmark execution now resolves Anthropic Messages and Gemini GenerateContent operations, maps dataset tools and tool_choice into provider-native payloads, and normalizes returned tool calls and usage metrics.
  • Adaptive Results performance views — the Results dashboard now has an Auto performance view with manual modes for cold-start comparison, latency trend, pass-rate trend, latency histogram, and model-summary table comparisons backed by filtered model aggregates.
  • Project workflow guardrails — AGENTS.md now combines the main branch workflow, Node 25 rules, challenge-and-skill behavior instructions, and a static-data rule that keeps prompts, schemas, fixtures, and examples out of application code.
  • Benchmark template agent — Templates now includes a review-first benchmark-template agent that uses a database-persisted Settings model, challenges underspecified requests, loads its prompt from Markdown with the full test_template schema and example injected, validates generated drafts server-side, and applies drafts to the existing editor without auto-saving.
  • Run-page persisted benchmark plan checkpoint — Run can now select saved chat benchmark templates, prepare inline or server-side dataset manifests, persist unique runtime/dataset/plan artifacts per click, execute /benchmark/plans/:id/run, and render per-target results including failed targets without result documents.
  • Run smoke chat benchmark template — the built-in benchmark document library now includes a real "Run smoke chat" test_template for first-run prompt checks.
  • Templates LLM-first layout — Templates now uses an AI-first authoring split with live JSON, Advanced form, and Raw JSON tabs, plus a redesigned preview/list layout for JSON-only test_template documents.
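
To make the provider-native tool mapping in the Anthropic and Gemini entry above concrete, here is a rough sketch. It assumes dataset tools are stored in the OpenAI-style `{ type: "function", function: {...} }` form, which the release notes do not confirm, and the function names are hypothetical. The target shapes follow the public Anthropic Messages (`input_schema`) and Gemini GenerateContent (`functionDeclarations`) request formats.

```typescript
// Assumed dataset-side format (OpenAI-style); the real dataset schema may differ.
interface DatasetTool {
  type: "function";
  function: {
    name: string;
    description?: string;
    parameters: Record<string, unknown>; // JSON Schema for the arguments
  };
}

// Tool entry as accepted by the Anthropic Messages API.
interface AnthropicTool {
  name: string;
  description?: string;
  input_schema: Record<string, unknown>;
}

// Tool entry as accepted by Gemini GenerateContent.
interface GeminiTool {
  functionDeclarations: {
    name: string;
    description?: string;
    parameters: Record<string, unknown>;
  }[];
}

function toAnthropicTools(tools: DatasetTool[]): AnthropicTool[] {
  return tools.map((t) => ({
    name: t.function.name,
    description: t.function.description,
    input_schema: t.function.parameters,
  }));
}

function toGeminiTools(tools: DatasetTool[]): GeminiTool[] {
  // Gemini groups all function declarations under a single tools entry.
  return [
    {
      functionDeclarations: tools.map((t) => ({
        name: t.function.name,
        description: t.function.description,
        parameters: t.function.parameters,
      })),
    },
  ];
}
```

Gemini accepts only a subset of JSON Schema in `parameters`, so a real mapper would likely need to strip unsupported keywords; that step is omitted here.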

Changed

  • Human-readable agent workflow guidance — AGENTS.md now groups workflow rules into clearer sections, documents parallel worktree expectations including origin/main checks before commit/push requests and resync timing before validation or merge, directs agents to create a focused branch without pausing for confirmation, and asks agents to explicitly request commit approval with a suggested message and details.
  • Templates agent composer — The Templates authoring panel now gives the freeform request field more space and removes the preset suggestion chip.
  • Run selection rail polish — the Run page model chips, benchmark template selector, and response header now use clearer selected-state borders/backgrounds and denser mono text, with redundant server summary and model helper copy removed.

Fixed

  • Results sidebar count — the Results navigation badge now reads the benchmark-native results total instead of the legacy runs endpoint.
  • Catalog sidebar count — the Catalog navigation item now shows the available model count in the sidebar badge.
  • Run smoke template selection — the Run page now selects the built-in "Run smoke chat" template by default and requires a real template document before starting a benchmark.
  • Run multi-model response labels — multi-model Run detail headers now reuse the same letter and accent color assigned in the selected model chips.
  • Run multi-model layout — multi-model benchmark details now use an auto-fitting grid that shows more cards per row on wide screens while keeping each card readable.
  • Run metrics placement — per-model metrics now sit directly under the model header in compact fields, with raw benchmark JSON moved beneath the benchmark audit.
  • Run metric emphasis — metric values in Run result cards now use bold mono text for faster scanning.
  • Run benchmark audit presentation — audit metadata now renders as compact 11px status lines with check, pending, and failure markers.
  • Run placeholder actions — disabled "Open in Evaluate" and "Copy as cURL" buttons were removed from the Run metrics panel.
  • Templates authoring draft preservation — Switching between Live JSON, Advanced form, and Raw JSON now preserves the agent-inferred benchmark draft instead of reverting to the starter document.
  • Benchmark-only Results history — Results dashboard, history, detail drawers, and deletion now read benchmark test run records instead of legacy run/result tables, so benchmark smoke runs appear after completion.
  • Template agent starter drafting — The benchmark-template agent now drafts conservative starter templates for recognizable benchmark families such as tool-call compliance instead of blocking on follow-up questions when reasonable assumptions are available.
  • Built-in template onboarding — first-run onboarding now tracks only server connection, model selection, and first successful run, auto-selects installed chat templates on Run, and no longer asks users to create a starter template.
  • Benchmark foundation stress test timeout — the indexed lookup stress test now has an explicit timeout that matches its own 10-second performance budget, avoiding Vitest preemption on slower CI runners.
  • Restored tracked AGENTS.md project workflow rules while keeping the Node 25.x native-module guidance, restored CLAUDE.md tracking, and aligned Claude-specific project guidance with the enforced Node 25.x runtime.
  • Template agent settings rate limiting — /system/settings and /system/settings/template-agent-model now use an in-memory per-client rate limit before reading or updating app settings.
  • Template agent message contrast — Assistant replies and validated draft previews now render with readable text on their light message backgrounds.
  • Production token bootstrap — Production build and start scripts now run the local API token bootstrap so Vite has VITE_INFERHARNESS_API_TOKEN before bundling or previewing the frontend.
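
An in-memory per-client rate limit like the one mentioned for the settings routes above could look roughly like this. It is a sketch only: the fixed-window strategy, the `createRateLimiter` name, and the limit values are assumptions, since the release notes do not say which algorithm the backend uses.

```typescript
// Fixed-window limiter keyed by a client identifier (for example an IP address).
// State lives in process memory, so it resets when the server restarts.
function createRateLimiter(limit: number, windowMs: number) {
  const windows = new Map<string, { start: number; count: number }>();

  // Returns true if the request is allowed; `now` is injectable for testing.
  return function allow(clientId: string, now: number = Date.now()): boolean {
    let w = windows.get(clientId);
    if (!w || now - w.start >= windowMs) {
      // First request from this client, or the previous window has expired.
      w = { start: now, count: 0 };
      windows.set(clientId, w);
    }
    if (w.count >= limit) return false;
    w.count += 1;
    return true;
  };
}

// Usage: allow at most 2 requests per client per second.
const allowSettingsRequest = createRateLimiter(2, 1000);
```

A route handler would call `allowSettingsRequest(clientId)` before touching app settings and reject the request (typically with HTTP 429) when it returns false.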

v0.8.0

14 Jun 20:03

Added

  • Real first-run onboarding — added a frontend-only guided setup path that uses existing server, model, benchmark document, and run/result APIs to help users create their first production-ready server, model selection, starter benchmark template, and successful run without demo data or new backend endpoints.
  • Automatic local API token bootstrap — first local startup now creates or syncs INFERHARNESS_API_TOKEN and VITE_INFERHARNESS_API_TOKEN in .env when missing, keeping the frontend and backend able to communicate on a fresh install.
  • README project badges — the root README now shows version, Node.js, Python, CI, and MIT license badges.
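
The "create or sync" behavior of the local API token bootstrap above can be illustrated with a small pure function over the contents of a `.env` file. This is a sketch under assumptions: the helper names, the 32-byte hex token, and the rule that an existing backend token wins over the frontend one are all hypothetical, and quoted values are not handled.

```typescript
import { randomBytes } from "node:crypto";

const BACKEND_KEY = "INFERHARNESS_API_TOKEN";
const FRONTEND_KEY = "VITE_INFERHARNESS_API_TOKEN";

// Extracts the variable name from a dotenv-style line, or undefined if none.
function lineKey(line: string): string | undefined {
  return /^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=/.exec(line)?.[1];
}

// Returns the non-empty value for `key`, or undefined when missing or blank.
function readEnvValue(envText: string, key: string): string | undefined {
  for (const line of envText.split(/\r?\n/)) {
    if (lineKey(line) !== key) continue;
    const value = line.slice(line.indexOf("=") + 1).trim();
    if (value !== "") return value;
  }
  return undefined;
}

// Replaces the line for `key` if present, otherwise appends a new one.
function upsertEnvValue(envText: string, key: string, value: string): string {
  const lines = envText.split(/\r?\n/);
  if (lines[lines.length - 1] === "") lines.pop(); // drop the tail left by a trailing newline
  const idx = lines.findIndex((l) => lineKey(l) === key);
  if (idx >= 0) lines[idx] = `${key}=${value}`;
  else lines.push(`${key}=${value}`);
  return lines.join("\n") + "\n";
}

// Ensures both keys exist and carry the same token.
function bootstrapTokens(
  envText: string,
  generate: () => string = () => randomBytes(32).toString("hex")
): string {
  const token =
    readEnvValue(envText, BACKEND_KEY) ?? readEnvValue(envText, FRONTEND_KEY) ?? generate();
  return upsertEnvValue(upsertEnvValue(envText, BACKEND_KEY, token), FRONTEND_KEY, token);
}
```

Keeping the logic as a string-to-string function makes it easy to test without touching the file system; a caller would read `.env`, pass the text through `bootstrapTokens`, and write the result back only if it changed.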

Changed

  • Onboarding-aware shell — added the setup pill, welcome page, progress ribbons, handoff prompts, Settings tour controls, sidebar setup locking, and first-run completion prompt while keeping users on the Run page after a successful benchmark.
  • Starter benchmark creation — the Run page can create a valid reusable test_template starter benchmark through the existing benchmark document API.

v0.7.0

14 Jun 13:38

Added

  • Settings side-shell model selection — Settings now uses a categorized side-shell with a dedicated local-only Model Selection picker backed by active /models records, plus foldable environment sections scoped to Runtime, Providers & Auth, Connectivity, Frontend, and Advanced instead of a duplicated catch-all environment tab.
  • README settings alignment — the root README now calls out active development status and Settings-managed environment values.
  • Paired benchmark stage runner checkpoint — paired_request_loop templates now validate and run with pair-member preservation, pair metric paths such as pair.cold.elapsed_ms, and simple difference derived metrics while keeping paired-stage authoring in the Templates Raw JSON drawer.
  • Complete benchmark-template stage authoring — the Templates editor now exposes paired-stage fields including pair delays, pair members, simple difference derived metrics, stage observability JSON, and custom metric IDs while retaining Raw JSON as an escape hatch.
  • Templates benchmark-template authoring checkpoint — the existing Templates page now creates, reads, updates, and deletes benchmark test_template documents through /benchmark/documents, while keeping benchmark_plan creation out of the UI for the later Run-page flow.
  • BenchmarkPlan ref-document checkpoint — benchmark-native documents can now be persisted through /benchmark/documents, stored benchmark_plan documents can be created/read through /benchmark/plans, and /benchmark/plans/:id/run resolves template/dataset/runtime/model refs into the existing multi-model plan runner while keeping the inline /benchmark/plans/run route transitional.
  • Model load time metric — load_duration_ms extracted from server-native response metadata (Ollama reports exact load time in nanoseconds on every /api/chat and /api/generate response). Exposed as a first-class metric in computeItemMetrics and aggregated as max (load only fires on the cold request). Run page metrics panel shows a "model load" row when the value is non-null and > 0; hidden for servers that don't report it (llama.cpp, vLLM, TGI).
  • Ollama protocol timing metrics — total_duration (ns) feeds server_total_time_ms (server-measured total including load+prefill+decode); prompt_eval_duration (ns) → server_prompt_eval_ms; eval_duration (ns) → server_eval_ms. Applies to all Ollama-compatible servers (Ollama, Inferencer, etc.). Run page shows "server prefill" and "server decode" rows when non-null; server-reported, no red.
  • oMLX native metrics — usage.model_load_duration (seconds) now feeds load_duration_ms alongside Ollama's load_duration (nanoseconds); usage.total_time (seconds) surfaces as new server_total_time_ms metric representing server-measured processing time (excludes network, comparable to elapsed_ms). Run page shows "server time" row when non-null.
  • Request-triggered load estimator — estimateRequestTriggeredLoad() computes a heuristic load_estimate from ordered metric_results when ≥ 3 samples exist: compares first-request latency against the median of warm requests; detects a load event when the cold spike exceeds max(50% of warm baseline, 3× warm stddev). Prefers first_token_ms over elapsed_ms when streaming data is present. Stored as load_estimate on the result document. Run page shows "model load (est.)" in bold red when detected and no native load_duration_ms is available — signals heuristic rather than server-reported value.
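
The load-estimation heuristic described in the last entry above can be sketched as follows. This is an illustration, not the shipped estimateRequestTriggeredLoad(): the `Sample` and `LoadEstimate` shapes, the `estimateLoadSketch` name, the use of population standard deviation, and the rule that streaming timing is used only when every sample has it are all assumptions.

```typescript
interface Sample {
  elapsed_ms: number;
  first_token_ms?: number | null; // present only on streaming runs
}

interface LoadEstimate {
  detected: boolean;
  estimated_load_ms: number | null; // cold spike above the warm baseline
  warm_baseline_ms: number;
  basis: "first_token_ms" | "elapsed_ms";
}

function median(values: number[]): number {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Population standard deviation (assumption; a sample stddev would also be reasonable).
function stddev(values: number[]): number {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// `samples` must be in request order: the first entry is the potentially cold request.
function estimateLoadSketch(samples: Sample[]): LoadEstimate | null {
  if (samples.length < 3) return null; // not enough data for a warm baseline
  const streaming = samples.every((s) => typeof s.first_token_ms === "number");
  const basis = streaming ? "first_token_ms" : "elapsed_ms";
  const latencies = samples.map((s) =>
    streaming ? (s.first_token_ms as number) : s.elapsed_ms
  );
  const [cold, ...warm] = latencies;
  const baseline = median(warm);
  const spike = cold - baseline;
  const threshold = Math.max(0.5 * baseline, 3 * stddev(warm));
  const detected = spike > threshold;
  return {
    detected,
    estimated_load_ms: detected ? spike : null,
    warm_baseline_ms: baseline,
    basis,
  };
}
```

For example, elapsed times of 2000, 100, 110, and 90 ms give a warm baseline of 100 ms and a threshold of 50 ms, so the 1900 ms spike is reported as a load event.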

Fixed

  • Stream dropdown in Run page Step 4 options grid now matches the height of number inputs (font-size: 12px and explicit height: 35px applied uniformly via .run-options-grid selector).
  • Derived/estimated metrics in the Run page metrics panel (tok / s (decode), tok / s (overall), prefill tok / s, model load (est.)) now render in bold red via .is-estimated class, consistently distinguishing computed values from directly measured or server-reported ones.

v0.6.0

10 Jun 17:01

Added

  • Benchmark metrics & aggregation — new benchmark-metrics service computing the full schema-advertised metric set per item (tokens_per_second, output_input_token_ratio, exact_match, contains_required_terms, json_valid, schema_valid, regex_match, and tool-call metrics) plus run-level aggregations (mean/median/min/max/sum/count/p50/p90/p95/p99/stddev/variance), with boolean metrics surfaced as success_rate and partial-execution sample accounting.
  • Run page right-side metrics panel now shows tokens-per-second, duration p95 and item count for multi-item runs, and a correctness section (per-metric success rate) when the template requests correctness metrics.
  • Generation parameters (temperature, top_p, max_tokens, stream) editable inline in the Run page Step 4 options grid; previously hardcoded to defaults.
  • Decode-aware throughput metrics decode_tokens_per_second (output_tokens / (elapsed_ms − first_token_ms)) and prefill_tokens_per_second (input_tokens / first_token_ms), isolating generation speed from prompt prefill on streaming runs; both null on non-streaming runs. Metrics panel shows decode / overall / prefill tok/s separately.
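
The decode-aware throughput formulas in the last entry above translate into code along these lines. This is a sketch: the `ItemTiming` shape and `computeThroughput` name are hypothetical, the overall rate is assumed to be output tokens over total elapsed time, and the millisecond inputs are scaled so the results are in tokens per second.

```typescript
interface ItemTiming {
  input_tokens: number;
  output_tokens: number;
  elapsed_ms: number;
  first_token_ms: number | null; // null on non-streaming runs
}

interface Throughput {
  tokens_per_second: number | null;
  decode_tokens_per_second: number | null;
  prefill_tokens_per_second: number | null;
}

// Tokens per second for a duration in milliseconds; null when the duration is not positive.
function rate(tokens: number, ms: number): number | null {
  return ms > 0 ? (tokens * 1000) / ms : null;
}

function computeThroughput(t: ItemTiming): Throughput {
  const overall = rate(t.output_tokens, t.elapsed_ms);
  if (t.first_token_ms === null) {
    // Without time-to-first-token, prefill and decode cannot be separated.
    return {
      tokens_per_second: overall,
      decode_tokens_per_second: null,
      prefill_tokens_per_second: null,
    };
  }
  return {
    tokens_per_second: overall,
    decode_tokens_per_second: rate(t.output_tokens, t.elapsed_ms - t.first_token_ms),
    prefill_tokens_per_second: rate(t.input_tokens, t.first_token_ms),
  };
}
```

With 200 input tokens, 90 output tokens, 2000 ms elapsed, and a 200 ms first token, the overall rate is 45 tok/s while the decode rate is 50 tok/s, which shows how prompt prefill drags down the blended number.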

Changed

  • Benchmark runner replaces the stub aggregator (count/elapsed_ms_mean/output_tokens_sum) with template-driven metric computation and aggregation; metric_version bumped from basic-v1 to metrics-v1.
  • Response normalizer now surfaces tool_calls so tool-call metrics can be computed.
  • Run page smoke template requests tokens_per_second, decode_tokens_per_second, prefill_tokens_per_second, and p95/count aggregations.
  • Run page metrics panel labels clarified: latency → duration (total request time, distinct from ttft).

Security

  • Upgraded shell-quote to ^1.8.4 via a root override to remediate a known advisory.

v0.5.0

08 Jun 13:38

Added

  • Benchmark test pipeline (phase 1) — new POST /benchmark route accepts structured benchmark plans and dispatches dataset-backed test runs against registered inference servers.
  • Eight JSON schemas for benchmark documents: model_profile, model_snapshot, runtime_profile, dataset_manifest, test_template, test_instantiation, test_run_result, and benchmark_plan, with schema-version-based kind inference.
  • benchmark-schemas service exposing validateBenchmarkDocument, benchmarkKindFromDocument, and benchmarkSchemaPath for typed document validation.
  • benchmark-datasets service for loading, validating, and caching dataset manifests, with support for embedded, compressed-blob, and manifest-only dataset formats.
  • benchmark-foundation service for creating, storing, and reloading structured benchmark results against the SQLite schema.
  • benchmark-runner service orchestrating full benchmark plan execution: instantiation, dataset injection, per-model inference dispatch, and result persistence.
  • INFERHARNESS_BENCHMARK_DATASET_ROOT environment variable for server-side benchmark dataset file resolution.
  • INFERHARNESS_INFERENCE_TLS_INSECURE environment variable (default false) to disable TLS certificate verification for outbound inference requests, equivalent to curl --insecure.
  • POST /inference-servers/probe endpoint tests connection and lists models without writing to DB, used by the server creation drawer before saving.
  • Per-server refresh icon button on server cards triggers refreshInferenceServerDiscovery for that server on demand.
  • Refresh-all icon button in the servers section header re-probes all active servers in parallel.
  • probeServer() now accepts parseModels: false for lightweight health checks that confirm reachability without parsing the model list.
  • Capabilities filter (thinking / coding / instruct / MoE) on the Catalog model rail, with URL-backed capabilities query parameter.
  • Parameter count upper-bound slider on the Catalog model rail, with URL-backed maxParams query parameter and inline label.
  • Parameter count label pill displayed on model cards.
  • GPU cores field added to the inference server create/edit drawer, collected through the extended server schema.

Changed

  • Server creation drawer now uses a test-first workflow: "Test connection" probes the endpoint and shows discovered models before any DB write; "Save to Catalog" then creates the record and runs discovery.
  • Health checks (GET /inference-servers/health) pass parseModels: false to avoid redundant model parsing during periodic polling.
  • Automatic TTL-based discovery refresh removed from Catalog — model lists are refreshed only on explicit user action (per-card icon, refresh-all, or server save).
  • CONNECTIVITY_POLL_INTERVAL_MS renamed to INFERHARNESS_HEALTH_POLL_INTERVAL and now accepts seconds instead of milliseconds (default: 30).
  • probeServer() extracted into a dedicated inference-server-probe.ts service, eliminating duplicated HTTP probe logic across refreshDiscovery and checkInferenceServerHealth.
  • "Last probe" timestamp removed from server cards and the server detail rail.
  • Capabilities and maxParams filters cleared on server deselect and rail clear.
  • Server create/edit drawer now uses dropdown fields and a two-column layout.
  • Mistral /v1/models discovery now keeps only canonical entries where id == name, dropping alias rows before DB persistence.
  • Run-groups endpoints and data model removed; benchmark pipeline replaces the former grouped-run concept.
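
Because INFERHARNESS_HEALTH_POLL_INTERVAL now takes seconds rather than milliseconds, a value carried over from the old CONNECTIVITY_POLL_INTERVAL_MS setting needs dividing by 1000 (for example 30000 becomes 30). A reader of the variable might look like the sketch below. The `healthPollIntervalMs` name and the fallback to the default on invalid input are assumptions; only the variable name, the unit, and the default of 30 come from the notes above.

```typescript
const DEFAULT_HEALTH_POLL_SECONDS = 30;

// Reads the poll interval (in seconds) from an environment map and returns
// milliseconds, ready for setInterval. Missing, blank, non-numeric, or
// non-positive values fall back to the default (assumed behavior).
function healthPollIntervalMs(env: Record<string, string | undefined>): number {
  const raw = env.INFERHARNESS_HEALTH_POLL_INTERVAL;
  const seconds =
    raw === undefined || raw.trim() === "" ? DEFAULT_HEALTH_POLL_SECONDS : Number(raw);
  if (!Number.isFinite(seconds) || seconds <= 0) {
    return DEFAULT_HEALTH_POLL_SECONDS * 1000;
  }
  return Math.round(seconds * 1000);
}
```

In a Node process this would be called as `healthPollIntervalMs(process.env)`.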

Fixed

  • Deleting an inference server no longer throws a FOREIGN KEY constraint error; child records (metric samples, test results, runs, evaluations, models) are now deleted in dependency order within a transaction.
  • Contract and integration tests for benchmark schemas now reference committed fixture files instead of the gitignored specs/ directory, fixing all 26 CI failures.
  • Root-level vitest run no longer fails due to missing or misrouted test configuration.

v0.4.1

11 May 14:59

Added

  • Results dashboard now compares raw cold-start performance across servers and models with sample-backed summary rows and box plots for cold penalty, cold total, and hot total metrics.
  • Results run detail drawers now support guarded hard deletion of completed runs, removing result documents, metric samples, queue skips, and run-group item links while preserving linked evaluations.
  • Server discovery now upserts discovered models with persisted parser-derived metadata, including clean base names, quantized providers, parameter labels, active MoE labels, formats, quantization bits, and use-case tags.

Changed

  • Catalog and Models metadata filters/details now use persisted /models records as their source of truth instead of inferring provider, format, quantized provider, or use cases from raw model IDs.
  • Catalog Servers now keeps Filter, Archived, and + Add server in the section header, opens the filter rail only on demand, defaults to active servers, and starts server cards unselected with click-to-toggle detail rails.
  • Catalog model inspection now uses the routed /catalog/models/:id handoff layout while preserving the Catalog header, Servers/Models sub-tabs, and inference context bar.

Fixed

  • Catalog server archive actions now keep the selected server available in the archived view so the detail rail immediately exposes the matching Unarchive action.

v0.4.0

10 May 16:10
2972389

Added

  • Backend run groups now persist grouped Run executions, instantiate selected templates per target, launch child runs concurrently, expose /run-groups create/read/cancel endpoints, and isolate per-target failures.
  • Results now has a run-backed /results-view/query API and /results-view/runs/:runId detail API for the merged Dashboard/History experience, including filter metadata, scorecards, chart series, recent runs, dense history rows, and drawer data.
  • Evaluation detail is now available at GET /evaluations/:evaluationId so leaderboard rows can open a detail drawer for the representative evaluation.
  • Inference parameter presets are now persisted through /inference-param-presets CRUD endpoints and exposed in the shared frontend context bar.
  • Evaluate now has a queue API backed by completed test_results, with source-linked scoring and skip persistence while preserving the existing five 1-5 leaderboard score fields.

Changed

  • CI, release, and local Node version guidance now target Node.js 25 while declaring the supported runtime range as >=22.19 <26, matching Undici 8 requirements without claiming Node 26 support before native SQLite dependencies allow it.
  • better-sqlite3 is now pinned to the latest verified 12.9 release line for the current Node runtime window.
  • Frontend styling now loads the new design-system foundation tokens, vendored IBM Plex fonts, and shared component primitives for cards, buttons, inputs, health pills, metrics, and architecture-tree surfaces.
  • The frontend shell now uses React Router with a 220px always-expanded five-item sidebar, URL-backed Catalog/Results sub-tabs, legacy route redirects, and sidebar health/count status instead of the former global metric-card header.
  • Catalog now replaces the legacy Inference Servers and Models bodies with a merged Servers/Models funnel, URL-backed server/model filters, server health view, slide-over add/edit drawer, card grids, and a full-width model inspector layout.
  • Run now uses a unified 1-8 model workflow with query-backed model chips, shared template/options controls, single-target detail rendering, multi-target comparison columns, and summary aggregation.
  • Results now uses a single merged Dashboard/Leaderboard/History page with a shared 240px filter rail, URL-owned tab/filter/sort/pagination/detail state, export/share/reset actions, run detail drawers for Dashboard and History, and evaluation detail drawers for Leaderboard.
  • Package 06 polish adds shared reg-lights, a persistent inference context bar on Run/Templates/Results/Evaluate, a two-pane Templates layout, and a manual Evaluate scoring queue.
  • Run, Templates, Results, and Evaluate now share merged page headers with the inference context bar aligned directly below the page header.
  • Results now uses a full-width staged funnel with relationship-aware Servers -> Models -> Tests/range filtering, a full-width empty dashboard state, and downstream pruning when upstream selections change.
  • Results and Catalog Models funnels now share numbered stages, aligned Clear/Collapse controls, Catalog-style collapsible rail treatment, and persisted collapse state.
  • Results Tests/range and Catalog Models filter rails now use scoped Clear actions that preserve upstream selections while clearing only the filters owned by that rail.
  • Leaderboard remains backed by evaluations while accepting server, model, score range, sort, and group query parameters, including grouping by server and inference_config.quantization_level.
  • Inference server authentication can now use stored raw bearer/custom-header tokens for backend probes and runs while preserving the existing token_env fallback.

Fixed

  • Backend Vitest runs now ignore production SQLite database defaults, use a dedicated backend-test.sqlite by default, and fail fast if a backend test tries to open the production DB.
  • Backend proxy support now sends plain HTTP outbound requests to the configured proxy in absolute-form while retaining CONNECT tunneling for HTTPS targets, routes backend outbound fetches through the configured Undici dispatcher directly, and no longer lets process-level NO_PROXY bypass backend proxy routing unless AITESTBENCH_INFERENCE_NO_PROXY is set.
  • Inference server API responses now mask stored raw auth tokens and expose only token presence metadata.
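
Masking stored auth tokens in API responses, as described in the last entry above, can be sketched like this. The record shapes and the `has_auth_token` field name are hypothetical; only token_env is named in the release notes. The idea is that the raw token is dropped during serialization and replaced with a boolean presence flag.

```typescript
// Hypothetical database-side record.
interface StoredServer {
  id: string;
  base_url: string;
  auth_token?: string | null; // raw secret, must never leave the backend
  token_env?: string | null; // name of an environment variable, not a secret
}

// Hypothetical API-side view.
interface PublicServer {
  id: string;
  base_url: string;
  token_env: string | null;
  has_auth_token: boolean;
}

function toPublicServer(server: StoredServer): PublicServer {
  // Destructure the secret away so it cannot be spread into the response.
  const { auth_token, ...rest } = server;
  return {
    id: rest.id,
    base_url: rest.base_url,
    token_env: rest.token_env ?? null,
    has_auth_token: typeof auth_token === "string" && auth_token.length > 0,
  };
}
```

Building the response from an explicit list of fields, rather than deleting the secret from a copy, means a newly added sensitive column stays hidden until someone deliberately exposes it.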

v0.3.2

05 May 18:50

Added

  • Backend inference-server calls can now be routed through an optional Undici proxy configured with AITESTBENCH_INFERENCE_PROXY and AITESTBENCH_INFERENCE_NO_PROXY, without exposing proxy settings to the frontend.

Changed

  • CI and release workflows now run on Node.js 22 to match current backend dependency requirements.

Fixed

  • Results dashboard performance graphs now link repeated runs from the same template/model into one series even when generated active test IDs differ.
  • Results dashboard merged metric graphs now keep different models as separate lines instead of collapsing same-test metrics together.
  • Results dashboard default date ranges now include the newest result even when its timestamp has seconds or milliseconds, preventing single-run dashboards from appearing empty.
  • Settings Empty database now clears all application SQLite tables, including evaluation prompts and evaluations that feed the leaderboard.
  • Leaderboard view now clears stale displayed rows immediately after the database is emptied from settings.
  • Architecture inspection errors now show visible, non-empty diagnostics in the model detail page instead of leaving only a red button state.
  • MLX architecture inspection now uses config-backed estimation directly, avoiding PyTorch-dependent AutoModel construction and allowing models such as /inferencerlabs/Qwen3-Coder-30B-A3B-Instruct-MLX-6.5bit to inspect successfully from config.json.
  • Architecture inspector subprocess failures now include captured output or an explicit timeout diagnostic when the Python process exits or is killed without a structured error.
  • Models page filters now infer provider, quantized provider, format, quantization bit-depth, and use-case metadata from discovered model IDs, and collapse provider-prefixed aliases so the model filter shows clean base model names only.

v0.3.1

03 May 15:51

Changed

  • Model format handling now accepts GCUF as a compatibility alias for canonical GGUF.
  • Architecture inspection now supports local GGUF files, MLX models with local config.json directories, and local-server MLX IDs that point back to HF-style repos, including leading-slash IDs such as /lmstudio-community/...-MLX-6bit.
  • Architecture inspection now uses a layered pipeline: exact Transformers construction first, then format-aware config/header fallback with explicit provenance and accuracy metadata.
  • Config fallback now normalizes nested decoder configs, estimates dense decoder, multimodal projector, and MoE structures, respects tied embeddings, and returns a clear unsupported error when required dimensions are missing.
  • GPTQ, AWQ, SafeTensors, MLX, and GGUF inspection targets now route through the appropriate exact, config-backed, or header-only strategy without downloading weight tensors.
  • Architecture cache entries now include inspector metadata and invalidate stale zero-parameter root-only results.