Releases: Fango2007/InferHarness
v0.10.0
- Run functional failure clue — Run now surfaces a benchmark assertion failure line when quality metrics fail despite a technically completed run, with categories such as invalid tool arguments or missing tool calls.
- Duplicate tool-call argument scoring — `tool_arguments_valid` now consumes matched tool calls so repeated calls to the same function with different arguments are scored consistently with `tool_call_assertion_pass`.
- Legacy Runs API cleanup — removed the orphaned public `/runs` list/delete routes, their route-specific service, and route-only tests now that Results deletion uses `/results-view/runs/:runId`, while retaining the underlying run/result tables for active benchmark, evaluation, retention, and cleanup flows.
- Datasets editor checkpoint — added a Datasets page and JSONL dataset-file API for creating, editing, saving, and deleting dataset item files under `INFERHARNESS_BENCHMARK_DATASET_ROOT`, with synced `dataset_manifest` documents, copy-down editing for repeated fields, and clamped long-prompt display.
- Benchmark plan cleanup — removed the transitional inline `/benchmark/plans/run` execution API and stale `INFERHARNESS_TEST_TEMPLATES_DIR` example so plan execution goes through persisted `benchmark_plan` documents.
- Tool-call assertion metric — benchmark tool-call templates now include `tool_call_assertion_pass`, a single-turn pass/fail metric requiring exact expected tool selection and structurally matching arguments while keeping assertion failures as quality metrics rather than execution failures.
- Tool-call assertion UI — Run now promotes tool-call assertion pass/fail as the primary correctness verdict, and Templates groups metrics with readable labels while auto-adding the assertion metric when tool calling is enabled.
- Onboarding prompt scope — the Run page completion handoff now appears only for the onboarding first-run step, canceling the onboarding-launched server drawer stops setup with an explicit normal-mode notice, and the three-step welcome layout is centered.
- Built-in template reload after DB clear — clearing the database from Settings now reloads the built-in benchmark library immediately, keeping shipped templates available without restarting the backend.
- Catalog empty server card — the empty Servers catalog now presents a dashed first-server card with the add action instead of a centered empty-state panel.
- Catalog model auto-selection — opening the Models catalog without a server filter now selects the first available inference server so discovered models render immediately.
- Run empty preview — the empty Run workspace now shows a dummy benchmark result with sample model, prompt, metrics, and audit rows instead of a generic empty panel.
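The tool-call scoring entries above describe matching expected calls against returned calls while consuming each match. The sketch below illustrates that idea only; the `ToolCall` shape, the `canonical` helper, the order-insensitive matching, and the rule that leftover calls fail the assertion are all assumptions, not the project's actual implementation.

```typescript
// Hypothetical shape; the project's real types are not shown in these notes.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Serialize with sorted object keys so key order does not affect equality.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonical).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((key) => JSON.stringify(key) + ":" + canonical(obj[key]));
    return "{" + entries.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Pass only if every expected call is matched by a distinct actual call
// and no actual calls are left over (assumed reading of "exact selection").
function toolCallAssertionPass(expected: ToolCall[], actual: ToolCall[]): boolean {
  const remaining = [...actual];
  for (const want of expected) {
    const index = remaining.findIndex(
      (got) =>
        got.name === want.name &&
        canonical(got.arguments) === canonical(want.arguments),
    );
    if (index === -1) return false;
    remaining.splice(index, 1); // consume the match so it cannot be reused
  }
  return remaining.length === 0;
}
```

Consuming each match is what keeps two expected calls to the same function from both being satisfied by a single returned call.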
v0.9.0
Added
- File-backed benchmark document library — benchmark templates, datasets, runtime profiles, and plans now load from built-in JSON documents plus a writable local library, with API saves persisted to files so documents can be reconstructed after SQLite loss.
- Native Anthropic and Gemini benchmark tool calls — benchmark execution now resolves Anthropic Messages and Gemini GenerateContent operations, maps dataset tools and `tool_choice` into provider-native payloads, and normalizes returned tool calls and usage metrics.
- Adaptive Results performance views — the Results dashboard now has an Auto performance view with manual modes for cold-start comparison, latency trend, pass-rate trend, latency histogram, and model-summary table comparisons backed by filtered model aggregates.
- Project workflow guardrails — `AGENTS.md` now combines the main branch workflow, Node 25 rules, challenge-and-skill behavior instructions, and a static-data rule that keeps prompts, schemas, fixtures, and examples out of application code.
- Benchmark template agent — Templates now includes a review-first benchmark-template agent that uses a database-persisted Settings model, challenges underspecified requests, loads its prompt from Markdown with the full `test_template` schema and example injected, validates generated drafts server-side, and applies drafts to the existing editor without auto-saving.
- Run-page persisted benchmark plan checkpoint — Run can now select saved chat benchmark templates, prepare inline or server-side dataset manifests, persist unique runtime/dataset/plan artifacts per click, execute `/benchmark/plans/:id/run`, and render per-target results including failed targets without result documents.
- Run smoke chat benchmark template — the built-in benchmark document library now includes a real "Run smoke chat" `test_template` for first-run prompt checks.
- Templates LLM-first layout — Templates now uses an AI-first authoring split with live JSON, Advanced form, and Raw JSON tabs, plus a redesigned preview/list layout for JSON-only `test_template` documents.
Changed
- Human-readable agent workflow guidance — `AGENTS.md` now groups workflow rules into clearer sections, documents parallel worktree expectations including `origin/main` checks before commit/push requests and resync timing before validation or merge, directs agents to create a focused branch without pausing for confirmation, and asks agents to explicitly request commit approval with a suggested message and details.
- Templates agent composer — The Templates authoring panel now gives the freeform request field more space and removes the preset suggestion chip.
- Run selection rail polish — the Run page model chips, benchmark template selector, and response header now use clearer selected-state borders/backgrounds and denser mono text, with redundant server summary and model helper copy removed.
Fixed
- Results sidebar count — the Results navigation badge now reads the benchmark-native results total instead of the legacy runs endpoint.
- Catalog sidebar count — the Catalog navigation item now shows the available model count in the sidebar badge.
- Run smoke template selection — the Run page now selects the built-in "Run smoke chat" template by default and requires a real template document before starting a benchmark.
- Run multi-model response labels — multi-model Run detail headers now reuse the same letter and accent color assigned in the selected model chips.
- Run multi-model layout — multi-model benchmark details now use an auto-fitting grid that shows more cards per row on wide screens while keeping each card readable.
- Run metrics placement — per-model metrics now sit directly under the model header in compact fields, with raw benchmark JSON moved beneath the benchmark audit.
- Run metric emphasis — metric values in Run result cards now use bold mono text for faster scanning.
- Run benchmark audit presentation — audit metadata now renders as compact 11px status lines with check, pending, and failure markers.
- Run placeholder actions — disabled "Open in Evaluate" and "Copy as cURL" buttons were removed from the Run metrics panel.
- Templates authoring draft preservation — Switching between Live JSON, Advanced form, and Raw JSON now preserves the agent-inferred benchmark draft instead of reverting to the starter document.
- Benchmark-only Results history — Results dashboard, history, detail drawers, and deletion now read benchmark test run records instead of legacy run/result tables, so benchmark smoke runs appear after completion.
- Template agent starter drafting — The benchmark-template agent now drafts conservative starter templates for recognizable benchmark families such as tool-call compliance instead of blocking on follow-up questions when reasonable assumptions are available.
- Built-in template onboarding — first-run onboarding now tracks only server connection, model selection, and first successful run, auto-selects installed chat templates on Run, and no longer asks users to create a starter template.
- Benchmark foundation stress test timeout — the indexed lookup stress test now has an explicit timeout that matches its own 10-second performance budget, avoiding Vitest preemption on slower CI runners.
- Restored tracked `AGENTS.md` project workflow rules while keeping the Node 25.x native-module guidance, restored `CLAUDE.md` tracking, and aligned Claude-specific project guidance with the enforced Node 25.x runtime.
- Template agent settings rate limiting — `/system/settings` and `/system/settings/template-agent-model` now use an in-memory per-client rate limit before reading or updating app settings.
- Template agent message contrast — Assistant replies and validated draft previews now render with readable text on their light message backgrounds.
- Production token bootstrap — Production build and start scripts now run the local API token bootstrap so Vite has `VITE_INFERHARNESS_API_TOKEN` before bundling or previewing the frontend.
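The settings rate-limiting entry above mentions an in-memory per-client limit. The notes do not state the algorithm, so the following is only a sketch of one common approach, a fixed-window counter keyed by client. The class name, the limits, and the client key are hypothetical.

```typescript
// Fixed-window, in-memory limiter keyed by a client identifier.
// Names and limits are illustrative, not taken from the project.
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed; `now` is injectable for testing.
  allow(clientKey: string, now: number = Date.now()): boolean {
    const entry = this.windows.get(clientKey);
    if (entry === undefined || now - entry.start >= this.windowMs) {
      // First request from this client, or the previous window has expired.
      this.windows.set(clientKey, { start: now, count: 1 });
      return true;
    }
    if (entry.count >= this.maxRequests) return false;
    entry.count += 1;
    return true;
  }
}
```

A route handler would call `allow()` with something like the client address before touching settings, and respond with HTTP 429 when it returns `false`.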
v0.8.0
Added
- Real first-run onboarding — added a frontend-only guided setup path that uses existing server, model, benchmark document, and run/result APIs to help users create their first production-ready server, model selection, starter benchmark template, and successful run without demo data or new backend endpoints.
- Automatic local API token bootstrap — first local startup now creates or syncs `INFERHARNESS_API_TOKEN` and `VITE_INFERHARNESS_API_TOKEN` in `.env` when missing, keeping the frontend and backend able to communicate on a fresh install.
- README project badges — the root README now shows version, Node.js, Python, CI, and MIT license badges.
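The token bootstrap entry above describes creating or syncing two variables in `.env`. A minimal sketch of that behavior as a pure text transformation follows. The function name `syncApiTokens`, the injected `generateToken` callback, and the rule that an existing backend value wins over a frontend value are assumptions; the real bootstrap script is not shown in these notes.

```typescript
// Ensure both variables exist in the .env text and carry the same token.
// `generateToken` is injected so the sketch stays free of platform imports.
function syncApiTokens(envText: string, generateToken: () => string): string {
  const backendKey = "INFERHARNESS_API_TOKEN";
  const frontendKey = "VITE_INFERHARNESS_API_TOKEN";
  const lines = envText === "" ? [] : envText.replace(/\n$/, "").split("\n");

  // Read a non-empty value for `key`, or undefined if missing or blank.
  const read = (key: string): string | undefined => {
    const line = lines.find((l) => l.startsWith(key + "="));
    const value = line?.slice(key.length + 1).trim();
    return value ? value : undefined;
  };

  // Replace the existing line for `key`, or append one.
  const write = (key: string, value: string): void => {
    const index = lines.findIndex((l) => l.startsWith(key + "="));
    if (index === -1) lines.push(key + "=" + value);
    else lines[index] = key + "=" + value;
  };

  // Prefer an existing value so a configured token is never rotated silently.
  const token = read(backendKey) ?? read(frontendKey) ?? generateToken();
  write(backendKey, token);
  write(frontendKey, token);
  return lines.join("\n") + "\n";
}
```

Unrelated lines are preserved in place, and a token is generated only when neither variable already has a value.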
Changed
- Onboarding-aware shell — added the setup pill, welcome page, progress ribbons, handoff prompts, Settings tour controls, sidebar setup locking, and first-run completion prompt while keeping users on the Run page after a successful benchmark.
- Starter benchmark creation — the Run page can create a valid reusable `test_template` starter benchmark through the existing benchmark document API.
v0.7.0
Added
- Settings side-shell model selection — Settings now uses a categorized side-shell with a dedicated local-only `Model Selection` picker backed by active `/models` records, plus foldable environment sections scoped to Runtime, Providers & Auth, Connectivity, Frontend, and Advanced instead of a duplicated catch-all environment tab.
- README settings alignment — the root README now calls out active development status and Settings-managed environment values.
- Paired benchmark stage runner checkpoint — `paired_request_loop` templates now validate and run with pair-member preservation, pair metric paths such as `pair.cold.elapsed_ms`, and simple `difference` derived metrics while keeping paired-stage authoring in the Templates Raw JSON drawer.
- Complete benchmark-template stage authoring — the Templates editor now exposes paired-stage fields including pair delays, pair members, simple difference derived metrics, stage observability JSON, and custom metric IDs while retaining Raw JSON as an escape hatch.
- Templates benchmark-template authoring checkpoint — the existing Templates page now authors benchmark `test_template` document CRUD through `/benchmark/documents`, while keeping `benchmark_plan` creation out of the UI for the later Run-page flow.
- BenchmarkPlan ref-document checkpoint — benchmark-native documents can now be persisted through `/benchmark/documents`, stored `benchmark_plan` documents can be created/read through `/benchmark/plans`, and `/benchmark/plans/:id/run` resolves template/dataset/runtime/model refs into the existing multi-model plan runner while keeping the inline `/benchmark/plans/run` route transitional.
- Model load time metric — `load_duration_ms` extracted from server-native response metadata (Ollama reports exact load time in nanoseconds on every `/api/chat` and `/api/generate` response). Exposed as a first-class metric in `computeItemMetrics` and aggregated as `max` (load only fires on the cold request). Run page metrics panel shows a "model load" row when the value is non-null and > 0; hidden for servers that don't report it (llama.cpp, vLLM, TGI).
- Ollama protocol timing metrics — `total_duration` (ns) feeds `server_total_time_ms` (server-measured total including load+prefill+decode); `prompt_eval_duration` (ns) → `server_prompt_eval_ms`; `eval_duration` (ns) → `server_eval_ms`. Applies to all Ollama-compatible servers (Ollama, Inferencer, etc.). Run page shows "server prefill" and "server decode" rows when non-null; server-reported, no red.
- oMLX native metrics — `usage.model_load_duration` (seconds) now feeds `load_duration_ms` alongside Ollama's `load_duration` (nanoseconds); `usage.total_time` (seconds) surfaces as new `server_total_time_ms` metric representing server-measured processing time (excludes network, comparable to `elapsed_ms`). Run page shows "server time" row when non-null.
- Request-triggered load estimator — `estimateRequestTriggeredLoad()` computes a heuristic `load_estimate` from ordered `metric_results` when ≥ 3 samples exist: compares first-request latency against the median of warm requests; detects a load event when the cold spike exceeds `max(50% of warm baseline, 3× warm stddev)`. Prefers `first_token_ms` over `elapsed_ms` when streaming data is present. Stored as `load_estimate` on the result document. Run page shows "model load (est.)" in bold red when detected and no native `load_duration_ms` is available, signaling a heuristic rather than a server-reported value.
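The load-estimator entry above gives enough detail to sketch the heuristic. The version below is an illustration under stated assumptions, not the project's `estimateRequestTriggeredLoad()`: the `Sample` and `LoadEstimate` shapes, the function name `estimateLoad`, the use of population standard deviation, and the rule that `first_token_ms` is used only when every sample has it are all hypothetical.

```typescript
// Hypothetical input and output shapes for illustration.
interface Sample {
  elapsed_ms: number;
  first_token_ms?: number | null;
}

interface LoadEstimate {
  detected: boolean;
  estimated_load_ms: number; // cold latency minus warm baseline, floored at 0
  warm_baseline_ms: number;
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Population standard deviation; the project's choice is not stated.
function stddev(values: number[]): number {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

function estimateLoad(samples: Sample[]): LoadEstimate | null {
  if (samples.length < 3) return null; // too few samples for a warm baseline
  const streaming = samples.every((s) => typeof s.first_token_ms === "number");
  const latencies = samples.map((s) =>
    streaming ? (s.first_token_ms as number) : s.elapsed_ms,
  );
  const [cold, ...warm] = latencies; // samples are assumed ordered, cold first
  const baseline = median(warm);
  const spike = cold - baseline;
  const threshold = Math.max(0.5 * baseline, 3 * stddev(warm));
  return {
    detected: spike > threshold,
    estimated_load_ms: Math.max(0, spike),
    warm_baseline_ms: baseline,
  };
}
```

Taking the larger of the two thresholds keeps noisy warm runs from triggering a false load event, while the 50% floor covers the case where warm runs are nearly identical and the standard deviation is close to zero.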
Fixed
- Stream dropdown in Run page Step 4 options grid now matches the height of number inputs (`font-size: 12px` and explicit `height: 35px` applied uniformly via `.run-options-grid` selector).
- Derived/estimated metrics in the Run page metrics panel (`tok / s (decode)`, `tok / s (overall)`, `prefill tok / s`, `model load (est.)`) now render in bold red via `.is-estimated` class, consistently distinguishing computed values from directly measured or server-reported ones.
v0.6.0
Added
- Benchmark metrics & aggregation — new `benchmark-metrics` service computing the full schema-advertised metric set per item (`tokens_per_second`, `output_input_token_ratio`, `exact_match`, `contains_required_terms`, `json_valid`, `schema_valid`, `regex_match`, and tool-call metrics) plus run-level aggregations (`mean/median/min/max/sum/count/p50/p90/p95/p99/stddev/variance`), with boolean metrics surfaced as `success_rate` and partial-execution sample accounting.
- Run page right-side metrics panel now shows tokens-per-second, duration `p95` and item count for multi-item runs, and a correctness section (per-metric success rate) when the template requests correctness metrics.
- Generation parameters (temperature, top_p, max_tokens, stream) editable inline in the Run page Step 4 options grid; previously hardcoded to defaults.
- Decode-aware throughput metrics `decode_tokens_per_second` (`output_tokens / (elapsed_ms − first_token_ms)`) and `prefill_tokens_per_second` (`input_tokens / first_token_ms`), isolating generation speed from prompt prefill on streaming runs; both null on non-streaming runs. Metrics panel shows decode / overall / prefill tok/s separately.
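The two throughput formulas above can be sketched as follows. Since the inputs are in milliseconds, this sketch multiplies by 1000 to express the result per second. The `Timing` and `Throughput` shapes, the function names, and the definition of the overall rate as output tokens over total elapsed time are assumptions for illustration.

```typescript
// Hypothetical shapes; field names follow the metric names in the notes.
interface Timing {
  input_tokens: number;
  output_tokens: number;
  elapsed_ms: number;
  first_token_ms: number | null; // null on non-streaming runs
}

interface Throughput {
  tokens_per_second: number | null;
  decode_tokens_per_second: number | null;
  prefill_tokens_per_second: number | null;
}

// Tokens over milliseconds, scaled to tokens per second; null when the
// duration is not positive and the rate would be undefined.
function rate(tokens: number, ms: number): number | null {
  return ms > 0 ? (tokens * 1000) / ms : null;
}

function computeThroughput(t: Timing): Throughput {
  const overall = rate(t.output_tokens, t.elapsed_ms);
  if (t.first_token_ms === null) {
    // Without a first-token timestamp, prefill and decode cannot be separated.
    return {
      tokens_per_second: overall,
      decode_tokens_per_second: null,
      prefill_tokens_per_second: null,
    };
  }
  return {
    tokens_per_second: overall,
    decode_tokens_per_second: rate(t.output_tokens, t.elapsed_ms - t.first_token_ms),
    prefill_tokens_per_second: rate(t.input_tokens, t.first_token_ms),
  };
}
```

For a run with 200 input tokens, 100 output tokens, 2500 ms elapsed, and a first token at 500 ms, this yields 40 tok/s overall, 50 tok/s decode, and 400 tok/s prefill.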
Changed
- Benchmark runner replaces the stub aggregator (`count/elapsed_ms_mean/output_tokens_sum`) with template-driven metric computation and aggregation; `metric_version` bumped from `basic-v1` to `metrics-v1`.
- Response normalizer now surfaces `tool_calls` so tool-call metrics can be computed.
- Run page smoke template requests `tokens_per_second`, `decode_tokens_per_second`, `prefill_tokens_per_second`, and `p95/count` aggregations.
- Run page metrics panel labels clarified: `latency` → `duration` (total request time, distinct from `ttft`).
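The run-level aggregations named in this release can be illustrated with a short sketch. The percentile method is not stated in the notes, so this example assumes the nearest-rank definition and a population standard deviation; the `Aggregates` shape and the function names are hypothetical.

```typescript
// Hypothetical result shape covering a subset of the listed aggregations.
interface Aggregates {
  count: number;
  sum: number;
  mean: number;
  min: number;
  max: number;
  p50: number;
  p95: number;
  stddev: number;
}

// Nearest-rank percentile over an ascending, non-empty array (assumed method).
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

function aggregate(values: number[]): Aggregates | null {
  if (values.length === 0) return null; // nothing to aggregate
  const sorted = [...values].sort((a, b) => a - b);
  const count = sorted.length;
  const sum = sorted.reduce((acc, v) => acc + v, 0);
  const mean = sum / count;
  const variance = sorted.reduce((acc, v) => acc + (v - mean) ** 2, 0) / count;
  return {
    count,
    sum,
    mean,
    min: sorted[0],
    max: sorted[count - 1],
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    stddev: Math.sqrt(variance),
  };
}

// Boolean metrics are summarized as the fraction of passing samples.
function successRate(values: boolean[]): number | null {
  if (values.length === 0) return null;
  return values.filter(Boolean).length / values.length;
}
```

With few samples, nearest-rank `p95` collapses to the maximum, which is worth keeping in mind when reading `p95` on short smoke runs.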
Security
- Upgraded `shell-quote` to `^1.8.4` via a root override to remediate a known advisory.
v0.5.0
Added
- Benchmark test pipeline (phase 1) — new `POST /benchmark` route accepts structured benchmark plans and dispatches dataset-backed test runs against registered inference servers.
- Eight JSON schemas for benchmark documents: `model_profile`, `model_snapshot`, `runtime_profile`, `dataset_manifest`, `test_template`, `test_instantiation`, `test_run_result`, and `benchmark_plan`, with schema-version-based kind inference.
- `benchmark-schemas` service exposing `validateBenchmarkDocument`, `benchmarkKindFromDocument`, and `benchmarkSchemaPath` for typed document validation.
- `benchmark-datasets` service for loading, validating, and caching dataset manifests, with support for embedded, compressed-blob, and manifest-only dataset formats.
- `benchmark-foundation` service for creating, storing, and reloading structured benchmark results against the SQLite schema.
- `benchmark-runner` service orchestrating full benchmark plan execution: instantiation, dataset injection, per-model inference dispatch, and result persistence.
- `INFERHARNESS_BENCHMARK_DATASET_ROOT` environment variable for server-side benchmark dataset file resolution.
- `INFERHARNESS_INFERENCE_TLS_INSECURE` environment variable (default `false`) to disable TLS certificate verification for outbound inference requests, equivalent to `curl --insecure`.
- `POST /inference-servers/probe` endpoint tests connection and lists models without writing to DB, used by the server creation drawer before saving.
- Per-server refresh icon button on server cards triggers `refreshInferenceServerDiscovery` for that server on demand.
- Refresh-all icon button in the servers section header re-probes all active servers in parallel.
- `probeServer()` now accepts `parseModels: false` for lightweight health checks that confirm reachability without parsing the model list.
- Capabilities filter (thinking / coding / instruct / MoE) on the Catalog model rail, with URL-backed `capabilities` query parameter.
- Parameter count upper-bound slider on the Catalog model rail, with URL-backed `maxParams` query parameter and inline label.
- Parameter count label pill displayed on model cards.
- GPU cores field added to the inference server create/edit drawer, collected through the extended server schema.
Changed
- Server creation drawer now uses a test-first workflow: "Test connection" probes the endpoint and shows discovered models before any DB write; "Save to Catalog" then creates the record and runs discovery.
- Health checks (`GET /inference-servers/health`) pass `parseModels: false` to avoid redundant model parsing during periodic polling.
- Automatic TTL-based discovery refresh removed from Catalog; model lists are refreshed only on explicit user action (per-card icon, refresh-all, or server save).
- `CONNECTIVITY_POLL_INTERVAL_MS` renamed to `INFERHARNESS_HEALTH_POLL_INTERVAL` and now accepts seconds instead of milliseconds (default: 30).
- `probeServer()` extracted into a dedicated `inference-server-probe.ts` service, eliminating duplicated HTTP probe logic across `refreshDiscovery` and `checkInferenceServerHealth`.
- "Last probe" timestamp removed from server cards and the server detail rail.
- Capabilities and `maxParams` filters cleared on server deselect and rail clear.
- Server create/edit drawer now uses dropdown fields and a two-column layout.
- Mistral `/v1/models` discovery now keeps only canonical entries where `id == name`, dropping alias rows before DB persistence.
- Run-groups endpoints and data model removed; benchmark pipeline replaces the former grouped-run concept.
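Because the poll-interval variable changed both its name and its unit, a short sketch of reading it may help when migrating. The function name `healthPollIntervalMs`, the `Env` type, and the fallback to the default on missing or invalid input are assumptions; only the variable name, the unit, and the default of 30 come from the notes.

```typescript
// Environment passed in explicitly so the sketch has no platform dependency.
type Env = Record<string, string | undefined>;

const DEFAULT_POLL_SECONDS = 30;

// Returns the polling interval in milliseconds, ready for a timer.
// Falling back to the default on invalid input is an assumption.
function healthPollIntervalMs(env: Env): number {
  const raw = env["INFERHARNESS_HEALTH_POLL_INTERVAL"];
  const seconds = raw === undefined ? NaN : Number(raw);
  const valid = Number.isFinite(seconds) && seconds > 0;
  return (valid ? seconds : DEFAULT_POLL_SECONDS) * 1000;
}
```

Under this sketch, a leftover `CONNECTIVITY_POLL_INTERVAL_MS=5000` is simply ignored, so an old 5-second setting would need to be rewritten as `INFERHARNESS_HEALTH_POLL_INTERVAL=5`.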
Fixed
- Deleting an inference server no longer throws a FOREIGN KEY constraint error; child records (metric samples, test results, runs, evaluations, models) are now deleted in dependency order within a transaction.
- Contract and integration tests for benchmark schemas now reference committed fixture files instead of the gitignored `specs/` directory, fixing all 26 CI failures.
- Root-level `vitest` run no longer fails due to missing or misrouted test configuration.
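The foreign-key fix above relies on deleting child rows before the parent inside one transaction. The sketch below shows that pattern against a deliberately tiny database interface. The `Db` interface, the table names, the `server_id` column, and the exact statement order are hypothetical; the project's schema is not shown in these notes.

```typescript
// Minimal database surface for the sketch; a real driver would differ.
interface Db {
  run(sql: string, params?: unknown[]): void;
}

// Child tables first, parent last. Table and column names are hypothetical.
const DELETE_STATEMENTS: string[] = [
  "DELETE FROM metric_samples WHERE server_id = ?",
  "DELETE FROM test_results WHERE server_id = ?",
  "DELETE FROM runs WHERE server_id = ?",
  "DELETE FROM evaluations WHERE server_id = ?",
  "DELETE FROM models WHERE server_id = ?",
  "DELETE FROM inference_servers WHERE id = ?",
];

function deleteServerCascade(db: Db, serverId: string): void {
  db.run("BEGIN");
  try {
    for (const sql of DELETE_STATEMENTS) {
      db.run(sql, [serverId]);
    }
    db.run("COMMIT");
  } catch (error) {
    db.run("ROLLBACK"); // leave the database untouched on any failure
    throw error;
  }
}
```

Wrapping the deletes in a transaction means a failure partway through does not leave a server with some of its child records missing.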
v0.4.1
Added
- Results dashboard now compares raw cold-start performance across servers and models with sample-backed summary rows and box plots for cold penalty, cold total, and hot total metrics.
- Results run detail drawers now support guarded hard deletion of completed runs, removing result documents, metric samples, queue skips, and run-group item links while preserving linked evaluations.
- Server discovery now upserts discovered models with persisted parser-derived metadata, including clean base names, quantized providers, parameter labels, active MoE labels, formats, quantization bits, and use-case tags.
Changed
- Catalog and Models metadata filters/details now use persisted `/models` records as their source of truth instead of inferring provider, format, quantized provider, or use cases from raw model IDs.
- Catalog Servers now keeps `Filter`, `Archived`, and `+ Add server` in the section header, opens the filter rail only on demand, defaults to active servers, and starts server cards unselected with click-to-toggle detail rails.
- Catalog model inspection now uses the routed `/catalog/models/:id` handoff layout while preserving the Catalog header, Servers/Models sub-tabs, and inference context bar.
Fixed
- Catalog server archive actions now keep the selected server available in the archived view so the detail rail immediately exposes the matching `Unarchive` action.
v0.4.0
Added
- Backend run groups now persist grouped Run executions, instantiate selected templates per target, launch child runs concurrently, expose `/run-groups` create/read/cancel endpoints, and isolate per-target failures.
- Results now has a run-backed `/results-view/query` API and `/results-view/runs/:runId` detail API for the merged Dashboard/History experience, including filter metadata, scorecards, chart series, recent runs, dense history rows, and drawer data.
- Evaluation detail is now available at `GET /evaluations/:evaluationId` so leaderboard rows can open a detail drawer for the representative evaluation.
- Inference parameter presets are now persisted through `/inference-param-presets` CRUD endpoints and exposed in the shared frontend context bar.
- Evaluate now has a queue API backed by completed `test_results`, with source-linked scoring and skip persistence while preserving the existing five 1-5 leaderboard score fields.
Changed
- CI, release, and local Node version guidance now target Node.js 25 while declaring the supported runtime range as `>=22.19 <26`, matching Undici 8 requirements without claiming Node 26 support before native SQLite dependencies allow it.
- `better-sqlite3` is now pinned to the latest verified 12.9 release line for the current Node runtime window.
- Frontend styling now loads the new design-system foundation tokens, vendored IBM Plex fonts, and shared component primitives for cards, buttons, inputs, health pills, metrics, and architecture-tree surfaces.
- The frontend shell now uses React Router with a 220px always-expanded five-item sidebar, URL-backed Catalog/Results sub-tabs, legacy route redirects, and sidebar health/count status instead of the former global metric-card header.
- Catalog now replaces the legacy Inference Servers and Models bodies with a merged Servers/Models funnel, URL-backed server/model filters, server health view, slide-over add/edit drawer, card grids, and a full-width model inspector layout.
- Run now uses a unified 1-8 model workflow with query-backed model chips, shared template/options controls, single-target detail rendering, multi-target comparison columns, and summary aggregation.
- Results now uses a single merged Dashboard/Leaderboard/History page with a shared 240px filter rail, URL-owned tab/filter/sort/pagination/detail state, export/share/reset actions, run detail drawers for Dashboard and History, and evaluation detail drawers for Leaderboard.
- Package 06 polish adds shared reg-lights, a persistent inference context bar on Run/Templates/Results/Evaluate, a two-pane Templates layout, and a manual Evaluate scoring queue.
- Run, Templates, Results, and Evaluate now share merged page headers with the inference context bar aligned directly below the page header.
- Results now uses a full-width staged funnel with relationship-aware Servers -> Models -> Tests/range filtering, a full-width empty dashboard state, and downstream pruning when upstream selections change.
- Results and Catalog Models funnels now share numbered stages, aligned Clear/Collapse controls, Catalog-style collapsible rail treatment, and persisted collapse state.
- Results Tests/range and Catalog Models filter rails now use scoped Clear actions that preserve upstream selections while clearing only the filters owned by that rail.
- Leaderboard remains backed by `evaluations` while accepting server, model, score range, sort, and group query parameters, including grouping by server and `inference_config.quantization_level`.
- Inference server authentication can now use stored raw bearer/custom-header tokens for backend probes and runs while preserving the existing `token_env` fallback.
Fixed
- Backend Vitest runs now ignore production SQLite database defaults, use a dedicated `backend-test.sqlite` by default, and fail fast if a backend test tries to open the production DB.
- Backend proxy support now sends plain HTTP outbound requests to the configured proxy in absolute-form while retaining CONNECT tunneling for HTTPS targets, routes backend outbound fetches through the configured Undici dispatcher directly, and no longer lets process-level `NO_PROXY` bypass backend proxy routing unless `AITESTBENCH_INFERENCE_NO_PROXY` is set.
- Inference server API responses now mask stored raw auth tokens and expose only token presence metadata.
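The token-masking fix above can be pictured as a mapping from the stored record to a public response shape. The field names `auth_token` and `has_auth_token`, the record shapes, and the function name are hypothetical; the notes only say that raw tokens are masked and that presence metadata is exposed.

```typescript
// Hypothetical stored and public shapes for an inference server record.
interface StoredServer {
  id: string;
  base_url: string;
  auth_token: string | null;
}

interface PublicServer {
  id: string;
  base_url: string;
  has_auth_token: boolean;
}

// Drop the raw token and report only whether one is stored.
function toPublicServer(server: StoredServer): PublicServer {
  const { auth_token, ...rest } = server;
  return { ...rest, has_auth_token: auth_token !== null && auth_token !== "" };
}
```

Building the response from an explicit public shape, rather than deleting a field from the stored object, makes it harder for a newly added secret column to leak by accident.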
v0.3.2
Added
- Backend inference-server calls can now be routed through an optional Undici proxy configured with `AITESTBENCH_INFERENCE_PROXY` and `AITESTBENCH_INFERENCE_NO_PROXY`, without exposing proxy settings to the frontend.
Changed
- CI and release workflows now run on Node.js 22 to match current backend dependency requirements.
Fixed
- Results dashboard performance graphs now link repeated runs from the same template/model into one series even when generated active test IDs differ.
- Results dashboard merged metric graphs now keep different models as separate lines instead of collapsing same-test metrics together.
- Results dashboard default date ranges now include the newest result even when its timestamp has seconds or milliseconds, preventing single-run dashboards from appearing empty.
- Settings Empty database now clears all application SQLite tables, including evaluation prompts and evaluations that feed the leaderboard.
- Leaderboard view now clears stale displayed rows immediately after the database is emptied from settings.
- Architecture inspection errors now show visible, non-empty diagnostics in the model detail page instead of leaving only a red button state.
- MLX architecture inspection now uses config-backed estimation directly, avoiding PyTorch-dependent `AutoModel` construction and allowing models such as `/inferencerlabs/Qwen3-Coder-30B-A3B-Instruct-MLX-6.5bit` to inspect successfully from `config.json`.
- Architecture inspector subprocess failures now include captured output or an explicit timeout diagnostic when the Python process exits or is killed without a structured error.
- Models page filters now infer provider, quantized provider, format, quantization bit-depth, and use-case metadata from discovered model IDs, and collapse provider-prefixed aliases so the model filter shows clean base model names only.
v0.3.1
Changed
- Model format handling now accepts `GCUF` as a compatibility alias for canonical `GGUF`.
- Architecture inspection now supports local GGUF files, MLX models with local `config.json` directories, and local-server MLX IDs that point back to HF-style repos, including leading-slash IDs such as `/lmstudio-community/...-MLX-6bit`.
- Architecture inspection now uses a layered pipeline: exact Transformers construction first, then format-aware config/header fallback with explicit provenance and accuracy metadata.
- Config fallback now normalizes nested decoder configs, estimates dense decoder, multimodal projector, and MoE structures, respects tied embeddings, and returns a clear unsupported error when required dimensions are missing.
- GPTQ, AWQ, SafeTensors, MLX, and GGUF inspection targets now route through the appropriate exact, config-backed, or header-only strategy without downloading weight tensors.
- Architecture cache entries now include inspector metadata and invalidate stale zero-parameter root-only results.
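The layered inspection pipeline described above, with exact construction first and fallbacks after, can be sketched as an ordered list of strategies where the first usable result wins and records its provenance. The actual inspector runs in Python and is not shown here; every name in this sketch (`Strategy`, `InspectionResult`, `inspectLayered`) is hypothetical, and treating a zero parameter count as a failed attempt is an assumption based on the cache-invalidation entry.

```typescript
type Accuracy = "exact" | "estimated";

// Hypothetical strategy: returns a parameter count, or null if it cannot inspect.
interface Strategy {
  name: string;
  accuracy: Accuracy;
  inspect: (modelId: string) => number | null;
}

interface InspectionResult {
  parameter_count: number;
  provenance: string; // which strategy produced the result
  accuracy: Accuracy;
}

// Try each strategy in order; the first positive count wins.
function inspectLayered(modelId: string, strategies: Strategy[]): InspectionResult {
  for (const strategy of strategies) {
    const count = strategy.inspect(modelId);
    if (count !== null && count > 0) {
      return {
        parameter_count: count,
        provenance: strategy.name,
        accuracy: strategy.accuracy,
      };
    }
  }
  throw new Error("unsupported: no strategy could inspect " + modelId);
}
```

Carrying the strategy name and accuracy on the result lets a caller show whether a parameter count is exact or an estimate.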