Releases · Fango2007/InferHarness

19 Jun 13:52

github-actions

v0.10.0

77c25be

v0.10.0 Latest

Run functional failure clue — Run now surfaces a benchmark assertion failure line when quality metrics fail despite a technically completed run, with categories such as invalid tool arguments or missing tool calls.
Duplicate tool-call argument scoring — tool_arguments_valid now consumes matched tool calls so repeated calls to the same function with different arguments are scored consistently with tool_call_assertion_pass.
Legacy Runs API cleanup — removed the orphaned public /runs list/delete routes, their route-specific service, and route-only tests now that Results deletion uses /results-view/runs/:runId, while retaining the underlying run/result tables for active benchmark, evaluation, retention, and cleanup flows.
Datasets editor checkpoint — added a Datasets page and JSONL dataset-file API for creating, editing, saving, and deleting dataset item files under INFERHARNESS_BENCHMARK_DATASET_ROOT, with synced dataset_manifest documents, copy-down editing for repeated fields, and clamped long-prompt display.
Benchmark plan cleanup — removed the transitional inline /benchmark/plans/run execution API and stale INFERHARNESS_TEST_TEMPLATES_DIR example so plan execution goes through persisted benchmark_plan documents.
Tool-call assertion metric — benchmark tool-call templates now include tool_call_assertion_pass, a single-turn pass/fail metric requiring exact expected tool selection and structurally matching arguments while keeping assertion failures as quality metrics rather than execution failures.
Tool-call assertion UI — Run now promotes tool-call assertion pass/fail as the primary correctness verdict, and Templates groups metrics with readable labels while auto-adding the assertion metric when tool calling is enabled.
Onboarding prompt scope — the Run page completion handoff now appears only for the onboarding first-run step; canceling the onboarding-launched server drawer now stops setup and shows an explicit normal-mode notice; and the three-step welcome layout is centered.
Built-in template reload after DB clear — clearing the database from Settings now reloads the built-in benchmark library immediately, keeping shipped templates available without restarting the backend.
Catalog empty server card — the empty Servers catalog now presents a dashed first-server card with the add action instead of a centered empty-state panel.
Catalog model auto-selection — opening the Models catalog without a server filter now selects the first available inference server so discovered models render immediately.
Run empty preview — the empty Run workspace now shows a dummy benchmark result with sample model, prompt, metrics, and audit rows instead of a generic empty panel.
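To illustrate the duplicate tool-call argument scoring item above, here is a minimal sketch of matching that "consumes" matched calls, so two calls to the same function with different arguments cannot both be satisfied by one actual call. The `ToolCall` shape and the names `deepEqual` and `toolCallsMatch` are hypothetical; the release notes do not show the project's real types or scoring code.

```typescript
// Hypothetical shape: the real InferHarness types are not shown in the release notes.
type ToolCall = { name: string; arguments: unknown };

// Structural equality for JSON-like values (object key order is ignored).
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  if (Array.isArray(a) && Array.isArray(b)) {
    return a.length === b.length && a.every((item, i) => deepEqual(item, b[i]));
  }
  const objA = a as Record<string, unknown>;
  const objB = b as Record<string, unknown>;
  const keysA = Object.keys(objA);
  const keysB = Object.keys(objB);
  return (
    keysA.length === keysB.length &&
    keysA.every((k) => Object.prototype.hasOwnProperty.call(objB, k) && deepEqual(objA[k], objB[k]))
  );
}

// Each expected call must be satisfied by a distinct actual call. A matched
// actual call is removed from the pool so it cannot satisfy a second expectation.
function toolCallsMatch(expected: ToolCall[], actual: ToolCall[]): boolean {
  if (expected.length !== actual.length) return false;
  const remaining = [...actual];
  for (const want of expected) {
    const idx = remaining.findIndex(
      (got) => got.name === want.name && deepEqual(got.arguments, want.arguments),
    );
    if (idx === -1) return false;
    remaining.splice(idx, 1);
  }
  return true;
}
```

Because exact structural equality is used, greedy matching is sufficient here; a looser comparison (for example, partial argument matching) would need a more careful assignment step.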

17 Jun 19:27

github-actions

v0.9.0

2e9f8c6

v0.9.0

Added

File-backed benchmark document library — benchmark templates, datasets, runtime profiles, and plans now load from built-in JSON documents plus a writable local library, with API saves persisted to files so documents can be reconstructed after SQLite loss.
Native Anthropic and Gemini benchmark tool calls — benchmark execution now resolves Anthropic Messages and Gemini GenerateContent operations, maps dataset tools and tool_choice into provider-native payloads, and normalizes returned tool calls and usage metrics.
Adaptive Results performance views — the Results dashboard now has an Auto performance view with manual modes for cold-start comparison, latency trend, pass-rate trend, latency histogram, and model-summary table comparisons backed by filtered model aggregates.
Project workflow guardrails — AGENTS.md now combines the main branch workflow, Node 25 rules, challenge-and-skill behavior instructions, and a static-data rule that keeps prompts, schemas, fixtures, and examples out of application code.
Benchmark template agent — Templates now includes a review-first benchmark-template agent that uses a database-persisted Settings model, challenges underspecified requests, loads its prompt from Markdown with the full test_template schema and example injected, validates generated drafts server-side, and applies drafts to the existing editor without auto-saving.
Run-page persisted benchmark plan checkpoint — Run can now select saved chat benchmark templates, prepare inline or server-side dataset manifests, persist unique runtime/dataset/plan artifacts per click, execute /benchmark/plans/:id/run, and render per-target results including failed targets without result documents.
Run smoke chat benchmark template — the built-in benchmark document library now includes a real "Run smoke chat" test_template for first-run prompt checks.
Templates LLM-first layout — Templates now uses an AI-first authoring split with live JSON, Advanced form, and Raw JSON tabs, plus a redesigned preview/list layout for JSON-only test_template documents.

Changed

Human-readable agent workflow guidance — AGENTS.md now groups workflow rules into clearer sections, documents parallel worktree expectations including origin/main checks before commit/push requests and resync timing before validation or merge, directs agents to create a focused branch without pausing for confirmation, and asks agents to explicitly request commit approval with a suggested message and details.
Templates agent composer — The Templates authoring panel now gives the freeform request field more space and removes the preset suggestion chip.
Run selection rail polish — the Run page model chips, benchmark template selector, and response header now use clearer selected-state borders/backgrounds and denser mono text, with redundant server summary and model helper copy removed.

Fixed

Results sidebar count — the Results navigation badge now reads the benchmark-native results total instead of the legacy runs endpoint.
Catalog sidebar count — the Catalog navigation item now shows the available model count in the sidebar badge.
Run smoke template selection — the Run page now selects the built-in "Run smoke chat" template by default and requires a real template document before starting a benchmark.
Run multi-model response labels — multi-model Run detail headers now reuse the same letter and accent color assigned in the selected model chips.
Run multi-model layout — multi-model benchmark details now use an auto-fitting grid that shows more cards per row on wide screens while keeping each card readable.
Run metrics placement — per-model metrics now sit directly under the model header in compact fields, with raw benchmark JSON moved beneath the benchmark audit.
Run metric emphasis — metric values in Run result cards now use bold mono text for faster scanning.
Run benchmark audit presentation — audit metadata now renders as compact 11px status lines with check, pending, and failure markers.
Run placeholder actions — disabled "Open in Evaluate" and "Copy as cURL" buttons were removed from the Run metrics panel.
Templates authoring draft preservation — Switching between Live JSON, Advanced form, and Raw JSON now preserves the agent-inferred benchmark draft instead of reverting to the starter document.
Benchmark-only Results history — Results dashboard, history, detail drawers, and deletion now read benchmark test run records instead of legacy run/result tables, so benchmark smoke runs appear after completion.
Template agent starter drafting — The benchmark-template agent now drafts conservative starter templates for recognizable benchmark families such as tool-call compliance instead of blocking on follow-up questions when reasonable assumptions are available.
Built-in template onboarding — first-run onboarding now tracks only server connection, model selection, and first successful run, auto-selects installed chat templates on Run, and no longer asks users to create a starter template.
Benchmark foundation stress test timeout — the indexed lookup stress test now has an explicit timeout that matches its own 10-second performance budget, avoiding Vitest preemption on slower CI runners.
Restored tracked AGENTS.md project workflow rules while keeping the Node 25.x native-module guidance, restored CLAUDE.md tracking, and aligned Claude-specific project guidance with the enforced Node 25.x runtime.
Template agent settings rate limiting — /system/settings and /system/settings/template-agent-model now use an in-memory per-client rate limit before reading or updating app settings.
Template agent message contrast — Assistant replies and validated draft previews now render with readable text on their light message backgrounds.
Production token bootstrap — Production build and start scripts now run the local API token bootstrap so Vite has VITE_INFERHARNESS_API_TOKEN before bundling or previewing the frontend.
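The in-memory per-client rate limit mentioned for /system/settings can be pictured with a small fixed-window limiter. This is only a sketch under assumptions: the class name, the fixed-window strategy, and the limit and window values are illustrative, and the notes do not say which algorithm the project actually uses.

```typescript
// Fixed-window limiter keyed by a client identifier (for example an IP address).
// The limit and window size are illustrative, not the project's configuration.
class FixedWindowRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { startedAt: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true when the request is allowed, false when the client is over the limit.
  allow(clientId: string, now: number = Date.now()): boolean {
    const current = this.windows.get(clientId);
    if (current === undefined || now - current.startedAt >= this.windowMs) {
      // First request from this client, or the previous window has expired.
      this.windows.set(clientId, { startedAt: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

A route handler would call `allow(clientId)` before reading or updating settings and return an HTTP 429 response when it yields false. State kept in a `Map` is lost on restart and is not shared across processes, which is the usual trade-off of an in-memory limiter.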

14 Jun 20:03

github-actions

v0.8.0

42ea01a

v0.8.0

Added

Real first-run onboarding — added a frontend-only guided setup path that uses existing server, model, benchmark document, and run/result APIs to help users create their first production-ready server, model selection, starter benchmark template, and successful run without demo data or new backend endpoints.
Automatic local API token bootstrap — first local startup now creates or syncs INFERHARNESS_API_TOKEN and VITE_INFERHARNESS_API_TOKEN in .env when missing, keeping the frontend and backend able to communicate on a fresh install.
README project badges — the root README now shows version, Node.js, Python, CI, and MIT license badges.
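The automatic token bootstrap described above can be sketched as a pure function over the text of a .env file. The function name `syncApiTokens`, the injected `generateToken` callback, and the precedence rule (reuse the backend token, then the frontend token, then generate a new one) are assumptions made for illustration; only the two variable names come from the release notes.

```typescript
// Takes the current .env text and returns text in which both token variables
// are present and equal. generateToken is injected so the sketch needs no
// platform imports; real code would use a cryptographically secure generator.
function syncApiTokens(envText: string, generateToken: () => string): string {
  const trimmed = envText.replace(/\n+$/, "");
  const lines = trimmed === "" ? [] : trimmed.split("\n");

  const read = (key: string): string | undefined => {
    const line = lines.find((l) => l.startsWith(`${key}=`));
    const value = line?.slice(key.length + 1).trim();
    return value ? value : undefined; // treat an empty value as missing
  };
  const upsert = (key: string, value: string): void => {
    const idx = lines.findIndex((l) => l.startsWith(`${key}=`));
    if (idx === -1) lines.push(`${key}=${value}`);
    else lines[idx] = `${key}=${value}`;
  };

  const backendKey = "INFERHARNESS_API_TOKEN";
  const frontendKey = "VITE_INFERHARNESS_API_TOKEN";
  // Assumed precedence: an existing backend token wins, then an existing frontend token.
  const token = read(backendKey) ?? read(frontendKey) ?? generateToken();
  upsert(backendKey, token);
  upsert(frontendKey, token);
  return lines.join("\n") + "\n";
}
```

Keeping the function pure (text in, text out) makes the create-or-sync behavior easy to test without touching the file system; the caller would read .env, pass the text through, and write the result back only if it changed.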

Changed

Onboarding-aware shell — added the setup pill, welcome page, progress ribbons, handoff prompts, Settings tour controls, sidebar setup locking, and first-run completion prompt while keeping users on the Run page after a successful benchmark.
Starter benchmark creation — the Run page can create a valid reusable test_template starter benchmark through the existing benchmark document API.

14 Jun 13:38

github-actions

v0.7.0

3043435

v0.7.0

Added

Settings side-shell model selection — Settings now uses a categorized side-shell with a dedicated local-only Model Selection picker backed by active /models records, plus foldable environment sections scoped to Runtime, Providers & Auth, Connectivity, Frontend, and Advanced instead of a duplicated catch-all environment tab.
README settings alignment — the root README now calls out active development status and Settings-managed environment values.
Paired benchmark stage runner checkpoint — paired_request_loop templates now validate and run with pair-member preservation, pair metric paths such as pair.cold.elapsed_ms, and simple difference derived metrics while keeping paired-stage authoring in the Templates Raw JSON drawer.
Complete benchmark-template stage authoring — the Templates editor now exposes paired-stage fields including pair delays, pair members, simple difference derived metrics, stage observability JSON, and custom metric IDs while retaining Raw JSON as an escape hatch.
Templates benchmark-template authoring checkpoint — the existing Templates page now authors benchmark test_template document CRUD through /benchmark/documents, while keeping benchmark_plan creation out of the UI for the later Run-page flow.
BenchmarkPlan ref-document checkpoint — benchmark-native documents can now be persisted through /benchmark/documents, stored benchmark_plan documents can be created/read through /benchmark/plans, and /benchmark/plans/:id/run resolves template/dataset/runtime/model refs into the existing multi-model plan runner while keeping the inline /benchmark/plans/run route transitional.
Model load time metric — load_duration_ms is now extracted from server-native response metadata (Ollama reports exact load time in nanoseconds on every /api/chat and /api/generate response). It is exposed as a first-class metric in computeItemMetrics and aggregated as max, because a load only occurs on the cold request. The Run page metrics panel shows a "model load" row when the value is non-null and > 0; the row is hidden for servers that don't report it (llama.cpp, vLLM, TGI).
Ollama protocol timing metrics — total_duration (ns) feeds server_total_time_ms (server-measured total including load + prefill + decode); prompt_eval_duration (ns) → server_prompt_eval_ms; eval_duration (ns) → server_eval_ms. This applies to all Ollama-compatible servers (Ollama, Inferencer, etc.). The Run page shows "server prefill" and "server decode" rows when the values are non-null; because they are server-reported, they are not rendered in red.
oMLX native metrics — usage.model_load_duration (seconds) now feeds load_duration_ms alongside Ollama's load_duration (nanoseconds); usage.total_time (seconds) surfaces as the new server_total_time_ms metric, which represents server-measured processing time (it excludes network time and is comparable to elapsed_ms). The Run page shows a "server time" row when the value is non-null.
Request-triggered load estimator — estimateRequestTriggeredLoad() computes a heuristic load_estimate from ordered metric_results when ≥ 3 samples exist: it compares first-request latency against the median of warm requests and detects a load event when the cold spike exceeds max(50% of warm baseline, 3× warm stddev). It prefers first_token_ms over elapsed_ms when streaming data is present. The result is stored as load_estimate on the result document. The Run page shows "model load (est.)" in bold red when a load is detected and no native load_duration_ms is available, signaling that the value is heuristic rather than server-reported.
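The request-triggered load heuristic described above can be sketched as follows. This is not the project's estimateRequestTriggeredLoad() implementation: the function name `sketchLoadEstimate`, the `LoadEstimate` shape, the use of population standard deviation, and reporting the spike itself as the estimated load time are all assumptions. Only the threshold rule, max(50% of warm baseline, 3× warm stddev), and the ≥ 3 sample requirement come from the notes.

```typescript
// Hypothetical result shape; the real load_estimate document is not shown in the notes.
type LoadEstimate = { detected: boolean; estimatedLoadMs: number | null; warmBaselineMs: number };

function medianOf(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Population standard deviation (an assumption; the notes only say "stddev").
function stddevOf(values: number[]): number {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// latenciesMs must be in request order: the first entry is the cold request,
// the rest are treated as warm requests.
function sketchLoadEstimate(latenciesMs: number[]): LoadEstimate | null {
  if (latenciesMs.length < 3) return null; // not enough samples
  const [first, ...warm] = latenciesMs;
  const warmBaselineMs = medianOf(warm);
  const spike = first - warmBaselineMs;
  const threshold = Math.max(0.5 * warmBaselineMs, 3 * stddevOf(warm));
  const detected = spike > threshold;
  return { detected, estimatedLoadMs: detected ? spike : null, warmBaselineMs };
}
```

For example, with latencies of 5000, 1000, 1100, and 900 ms the warm median is 1000 ms, the spike is 4000 ms, and the threshold is 500 ms, so a load event is detected. With a first request of 1050 ms instead, the 50 ms spike stays under the threshold and nothing is reported.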

Fixed

Stream dropdown in Run page Step 4 options grid now matches the height of number inputs (font-size: 12px and explicit height: 35px applied uniformly via .run-options-grid selector).
Derived/estimated metrics in the Run page metrics panel (tok / s (decode), tok / s (overall), prefill tok / s, model load (est.)) now render in bold red via .is-estimated class, consistently distinguishing computed values from directly measured or server-reported ones.

10 Jun 17:01

github-actions

v0.6.0

0c03fa3

v0.6.0

Added

Benchmark metrics & aggregation — new benchmark-metrics service computing the full schema-advertised metric set per item (tokens_per_second, output_input_token_ratio, exact_match, contains_required_terms, json_valid, schema_valid, regex_match, and tool-call metrics) plus run-level aggregations (mean/median/min/max/sum/count/p50/p90/p95/p99/stddev/variance), with boolean metrics surfaced as success_rate and partial-execution sample accounting.
Run page right-side metrics panel now shows tokens-per-second, duration p95 and item count for multi-item runs, and a correctness section (per-metric success rate) when the template requests correctness metrics.
Generation parameters (temperature, top_p, max_tokens, stream) editable inline in the Run page Step 4 options grid; previously hardcoded to defaults.
Decode-aware throughput metrics decode_tokens_per_second (output_tokens / (elapsed_ms − first_token_ms)) and prefill_tokens_per_second (input_tokens / first_token_ms), isolating generation speed from prompt prefill on streaming runs; both null on non-streaming runs. Metrics panel shows decode / overall / prefill tok/s separately.
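The decode-aware throughput formulas above translate directly into code. In this sketch the type and function names are hypothetical, the overall rate is assumed to be output tokens over total elapsed time, and the factor of 1000 is added to turn a per-millisecond ratio into tokens per second, since the timings are in milliseconds.

```typescript
type ItemTiming = {
  inputTokens: number;
  outputTokens: number;
  elapsedMs: number;
  firstTokenMs: number | null; // null on non-streaming runs
};
type Throughput = { overall: number | null; decode: number | null; prefill: number | null };

// Tokens per second from a token count and a duration in milliseconds.
// Returns null when the duration is not positive, to avoid dividing by zero.
function perSecond(tokens: number, ms: number): number | null {
  return ms > 0 ? (tokens * 1000) / ms : null;
}

function computeThroughput(t: ItemTiming): Throughput {
  const overall = perSecond(t.outputTokens, t.elapsedMs);
  // Without a first-token timestamp, prefill and decode cannot be separated.
  if (t.firstTokenMs === null) return { overall, decode: null, prefill: null };
  return {
    overall,
    decode: perSecond(t.outputTokens, t.elapsedMs - t.firstTokenMs),
    prefill: perSecond(t.inputTokens, t.firstTokenMs),
  };
}
```

With 200 input tokens, 100 output tokens, a 2500 ms total, and a first token at 500 ms, this yields 50 tok/s decode, 400 tok/s prefill, and 40 tok/s overall, which shows how a long prefill drags down the overall figure.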

Changed

Benchmark runner replaces the stub aggregator (count/elapsed_ms_mean/output_tokens_sum) with template-driven metric computation and aggregation; metric_version bumped from basic-v1 to metrics-v1.
Response normalizer now surfaces tool_calls so tool-call metrics can be computed.
Run page smoke template requests tokens_per_second, decode_tokens_per_second, prefill_tokens_per_second, and p95/count aggregations.
Run page metrics panel labels clarified: latency → duration (total request time, distinct from ttft).
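As a rough picture of the run-level aggregation that replaced the stub aggregator, the sketch below computes a few of the listed aggregations and a success rate for boolean metrics. The names are hypothetical and the percentile uses the nearest-rank method, which is an assumption; the notes do not say which percentile definition the benchmark-metrics service uses.

```typescript
type Aggregates = {
  count: number;
  sum: number;
  mean: number;
  min: number;
  max: number;
  p50: number;
  p95: number;
};

// Nearest-rank percentile over an ascending-sorted array: the smallest value
// such that at least p percent of the samples are at or below it.
function nearestRankPercentile(sorted: number[], p: number): number {
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function aggregateMetric(values: number[]): Aggregates | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const sum = sorted.reduce((total, v) => total + v, 0);
  return {
    count: sorted.length,
    sum,
    mean: sum / sorted.length,
    min: sorted[0],
    max: sorted[sorted.length - 1],
    p50: nearestRankPercentile(sorted, 50),
    p95: nearestRankPercentile(sorted, 95),
  };
}

// Boolean metrics (for example exact_match) are summarized as the fraction that passed.
function successRate(flags: boolean[]): number | null {
  if (flags.length === 0) return null;
  return flags.filter(Boolean).length / flags.length;
}
```

Returning null for empty inputs keeps partially executed runs from reporting misleading zeros.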

Security

Upgraded shell-quote to ^1.8.4 via a root override to remediate a known advisory.

08 Jun 13:38

github-actions

v0.5.0

cb3e693

v0.5.0

Added

Benchmark test pipeline (phase 1) — new POST /benchmark route accepts structured benchmark plans and dispatches dataset-backed test runs against registered inference servers.
Eight JSON schemas for benchmark documents: model_profile, model_snapshot, runtime_profile, dataset_manifest, test_template, test_instantiation, test_run_result, and benchmark_plan, with schema-version-based kind inference.
benchmark-schemas service exposing validateBenchmarkDocument, benchmarkKindFromDocument, and benchmarkSchemaPath for typed document validation.
benchmark-datasets service for loading, validating, and caching dataset manifests, with support for embedded, compressed-blob, and manifest-only dataset formats.
benchmark-foundation service for creating, storing, and reloading structured benchmark results against the SQLite schema.
benchmark-runner service orchestrating full benchmark plan execution: instantiation, dataset injection, per-model inference dispatch, and result persistence.
INFERHARNESS_BENCHMARK_DATASET_ROOT environment variable for server-side benchmark dataset file resolution.
INFERHARNESS_INFERENCE_TLS_INSECURE environment variable (default false) to disable TLS certificate verification for outbound inference requests, equivalent to curl --insecure.
POST /inference-servers/probe endpoint tests connection and lists models without writing to DB, used by the server creation drawer before saving.
Per-server refresh icon button on server cards triggers refreshInferenceServerDiscovery for that server on demand.
Refresh-all icon button in the servers section header re-probes all active servers in parallel.
probeServer() now accepts parseModels: false for lightweight health checks that confirm reachability without parsing the model list.
Capabilities filter (thinking / coding / instruct / MoE) on the Catalog model rail, with URL-backed capabilities query parameter.
Parameter count upper-bound slider on the Catalog model rail, with URL-backed maxParams query parameter and inline label.
Parameter count label pill displayed on model cards.
GPU cores field added to the inference server create/edit drawer, collected through the extended server schema.

Changed

Server creation drawer now uses a test-first workflow: "Test connection" probes the endpoint and shows discovered models before any DB write; "Save to Catalog" then creates the record and runs discovery.
Health checks (GET /inference-servers/health) pass parseModels: false to avoid redundant model parsing during periodic polling.
Automatic TTL-based discovery refresh removed from Catalog — model lists are refreshed only on explicit user action (per-card icon, refresh-all, or server save).
CONNECTIVITY_POLL_INTERVAL_MS renamed to INFERHARNESS_HEALTH_POLL_INTERVAL and now accepts seconds instead of milliseconds (default: 30).
probeServer() extracted into a dedicated inference-server-probe.ts service, eliminating duplicated HTTP probe logic across refreshDiscovery and checkInferenceServerHealth.
"Last probe" timestamp removed from server cards and the server detail rail.
Capabilities and maxParams filters cleared on server deselect and rail clear.
Server create/edit drawer now uses dropdown fields and a two-column layout.
Mistral /v1/models discovery now keeps only canonical entries where id == name, dropping alias rows before DB persistence.
Run-groups endpoints and data model removed; benchmark pipeline replaces the former grouped-run concept.
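Two of the changes above are easy to picture in code: the health poll interval now being read in seconds, and Mistral discovery keeping only entries where id == name. The sketch below is illustrative only. The function names and the `DiscoveredModel` shape are hypothetical, the fallback to the default on invalid input is an assumption, and the model IDs in the usage example are made up.

```typescript
// Converts the INFERHARNESS_HEALTH_POLL_INTERVAL value (seconds, default 30)
// into milliseconds for a timer. Falling back to the default on missing or
// invalid input is an assumption; the notes only state the unit and default.
function healthPollIntervalMs(raw: string | undefined): number {
  const seconds = Number(raw);
  const valid =
    raw !== undefined && raw.trim() !== "" && Number.isFinite(seconds) && seconds > 0;
  return (valid ? seconds : 30) * 1000;
}

// Hypothetical shape of one entry from a /v1/models listing.
type DiscoveredModel = { id: string; name?: string };

// Keeps only canonical entries (id equals name), dropping alias rows.
function keepCanonicalModels(models: DiscoveredModel[]): DiscoveredModel[] {
  return models.filter((model) => model.id === model.name);
}

// Usage with made-up IDs: the "-latest" row is an alias of the dated model.
const discovered: DiscoveredModel[] = [
  { id: "example-large-2501", name: "example-large-2501" },
  { id: "example-large-latest", name: "example-large-2501" },
];
const canonicalOnly = keepCanonicalModels(discovered);
```

In this example `canonicalOnly` contains just the dated entry, so the alias row never reaches persistence.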

Fixed

Deleting an inference server no longer throws a FOREIGN KEY constraint error; child records (metric samples, test results, runs, evaluations, models) are now deleted in dependency order within a transaction.
Contract and integration tests for benchmark schemas now reference committed fixture files instead of the gitignored specs/ directory, fixing all 26 CI failures.
Root-level vitest run no longer fails due to missing or misrouted test configuration.

11 May 14:59

github-actions

v0.4.1

6854e3d

v0.4.1

Added

Results dashboard now compares raw cold-start performance across servers and models with sample-backed summary rows and box plots for cold penalty, cold total, and hot total metrics.
Results run detail drawers now support guarded hard deletion of completed runs, removing result documents, metric samples, queue skips, and run-group item links while preserving linked evaluations.
Server discovery now upserts discovered models with persisted parser-derived metadata, including clean base names, quantized providers, parameter labels, active MoE labels, formats, quantization bits, and use-case tags.

Changed

Catalog and Models metadata filters/details now use persisted /models records as their source of truth instead of inferring provider, format, quantized provider, or use cases from raw model IDs.
Catalog Servers now keeps Filter, Archived, and + Add server in the section header, opens the filter rail only on demand, defaults to active servers, and starts server cards unselected with click-to-toggle detail rails.
Catalog model inspection now uses the routed /catalog/models/:id handoff layout while preserving the Catalog header, Servers/Models sub-tabs, and inference context bar.

Fixed

Catalog server archive actions now keep the selected server available in the archived view so the detail rail immediately exposes the matching Unarchive action.

10 May 16:10

github-actions

v0.4.0

2972389

v0.4.0

Added

Backend run groups now persist grouped Run executions, instantiate selected templates per target, launch child runs concurrently, expose /run-groups create/read/cancel endpoints, and isolate per-target failures.
Results now has a run-backed /results-view/query API and /results-view/runs/:runId detail API for the merged Dashboard/History experience, including filter metadata, scorecards, chart series, recent runs, dense history rows, and drawer data.
Evaluation detail is now available at GET /evaluations/:evaluationId so leaderboard rows can open a detail drawer for the representative evaluation.
Inference parameter presets are now persisted through /inference-param-presets CRUD endpoints and exposed in the shared frontend context bar.
Evaluate now has a queue API backed by completed test_results, with source-linked scoring and skip persistence while preserving the existing five 1-5 leaderboard score fields.

Changed

CI, release, and local Node version guidance now target Node.js 25 while declaring the supported runtime range as >=22.19 <26, matching Undici 8 requirements without claiming Node 26 support before native SQLite dependencies allow it.
better-sqlite3 is now pinned to the latest verified 12.9 release line for the current Node runtime window.
Frontend styling now loads the new design-system foundation tokens, vendored IBM Plex fonts, and shared component primitives for cards, buttons, inputs, health pills, metrics, and architecture-tree surfaces.
The frontend shell now uses React Router with a 220px always-expanded five-item sidebar, URL-backed Catalog/Results sub-tabs, legacy route redirects, and sidebar health/count status instead of the former global metric-card header.
Catalog now replaces the legacy Inference Servers and Models bodies with a merged Servers/Models funnel, URL-backed server/model filters, server health view, slide-over add/edit drawer, card grids, and a full-width model inspector layout.
Run now uses a unified 1-8 model workflow with query-backed model chips, shared template/options controls, single-target detail rendering, multi-target comparison columns, and summary aggregation.
Results now uses a single merged Dashboard/Leaderboard/History page with a shared 240px filter rail, URL-owned tab/filter/sort/pagination/detail state, export/share/reset actions, run detail drawers for Dashboard and History, and evaluation detail drawers for Leaderboard.
Package 06 polish adds shared reg-lights, a persistent inference context bar on Run/Templates/Results/Evaluate, a two-pane Templates layout, and a manual Evaluate scoring queue.
Run, Templates, Results, and Evaluate now share merged page headers with the inference context bar aligned directly below the page header.
Results now uses a full-width staged funnel with relationship-aware Servers -> Models -> Tests/range filtering, a full-width empty dashboard state, and downstream pruning when upstream selections change.
Results and Catalog Models funnels now share numbered stages, aligned Clear/Collapse controls, Catalog-style collapsible rail treatment, and persisted collapse state.
Results Tests/range and Catalog Models filter rails now use scoped Clear actions that preserve upstream selections while clearing only the filters owned by that rail.
Leaderboard remains backed by evaluations while accepting server, model, score range, sort, and group query parameters, including grouping by server and inference_config.quantization_level.
Inference server authentication can now use stored raw bearer/custom-header tokens for backend probes and runs while preserving the existing token_env fallback.

Fixed

Backend Vitest runs now ignore production SQLite database defaults, use a dedicated backend-test.sqlite by default, and fail fast if a backend test tries to open the production DB.
Backend proxy support now sends plain HTTP outbound requests to the configured proxy in absolute-form while retaining CONNECT tunneling for HTTPS targets, routes backend outbound fetches through the configured Undici dispatcher directly, and no longer lets process-level NO_PROXY bypass backend proxy routing unless AITESTBENCH_INFERENCE_NO_PROXY is set.
Inference server API responses now mask stored raw auth tokens and expose only token presence metadata.
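The token-masking fix above amounts to mapping the stored record to a public shape that never carries the secret. The field and type names below are hypothetical (only token_env appears in these notes); the sketch just shows the pattern of exposing presence metadata instead of the raw value.

```typescript
// Hypothetical stored record, including the raw secret.
type StoredServer = {
  id: string;
  baseUrl: string;
  authToken: string | null;
  tokenEnv: string | null;
};

// Hypothetical API response shape: no raw token, only a presence flag.
type PublicServer = {
  id: string;
  baseUrl: string;
  tokenEnv: string | null;
  hasAuthToken: boolean;
};

// Builds the response object field by field so that a newly added secret
// field on StoredServer cannot leak through an object spread.
function toPublicServer(server: StoredServer): PublicServer {
  return {
    id: server.id,
    baseUrl: server.baseUrl,
    tokenEnv: server.tokenEnv,
    hasAuthToken: server.authToken !== null && server.authToken !== "",
  };
}
```

Listing fields explicitly instead of spreading the stored record and deleting the secret is the safer design choice: the response type acts as an allowlist.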

05 May 18:50

github-actions

v0.3.2

af8c388

v0.3.2

Added

Backend inference-server calls can now be routed through an optional Undici proxy configured with AITESTBENCH_INFERENCE_PROXY and AITESTBENCH_INFERENCE_NO_PROXY, without exposing proxy settings to the frontend.

Changed

CI and release workflows now run on Node.js 22 to match current backend dependency requirements.

Fixed

Results dashboard performance graphs now link repeated runs from the same template/model into one series even when generated active test IDs differ.
Results dashboard merged metric graphs now keep different models as separate lines instead of collapsing same-test metrics together.
Results dashboard default date ranges now include the newest result even when its timestamp has seconds or milliseconds, preventing single-run dashboards from appearing empty.
Settings Empty database now clears all application SQLite tables, including evaluation prompts and evaluations that feed the leaderboard.
Leaderboard view now clears stale displayed rows immediately after the database is emptied from settings.
Architecture inspection errors now show visible, non-empty diagnostics in the model detail page instead of leaving only a red button state.
MLX architecture inspection now uses config-backed estimation directly, avoiding PyTorch-dependent AutoModel construction and allowing models such as /inferencerlabs/Qwen3-Coder-30B-A3B-Instruct-MLX-6.5bit to inspect successfully from config.json.
Architecture inspector subprocess failures now include captured output or an explicit timeout diagnostic when the Python process exits or is killed without a structured error.
Models page filters now infer provider, quantized provider, format, quantization bit-depth, and use-case metadata from discovered model IDs, and collapse provider-prefixed aliases so the model filter shows clean base model names only.

03 May 15:51

github-actions

v0.3.1

07f6f7b

v0.3.1

Changed

Model format handling now accepts GCUF as a compatibility alias for canonical GGUF.
Architecture inspection now supports local GGUF files, MLX models with local config.json directories, and local-server MLX IDs that point back to HF-style repos, including leading-slash IDs such as /lmstudio-community/...-MLX-6bit.
Architecture inspection now uses a layered pipeline: exact Transformers construction first, then format-aware config/header fallback with explicit provenance and accuracy metadata.
Config fallback now normalizes nested decoder configs, estimates dense decoder, multimodal projector, and MoE structures, respects tied embeddings, and returns a clear unsupported error when required dimensions are missing.
GPTQ, AWQ, SafeTensors, MLX, and GGUF inspection targets now route through the appropriate exact, config-backed, or header-only strategy without downloading weight tensors.
Architecture cache entries now include inspector metadata and invalidate stale zero-parameter root-only results.
