Added
Backend run groups now persist grouped Run executions, instantiate selected templates per target, launch child runs concurrently, expose /run-groups create/read/cancel endpoints, and isolate per-target failures.
Results now has a run-backed /results-view/query API and /results-view/runs/:runId detail API for the merged Dashboard/History experience, including filter metadata, scorecards, chart series, recent runs, dense history rows, and drawer data.
Evaluation detail is now available at GET /evaluations/:evaluationId so leaderboard rows can open a detail drawer for the representative evaluation.
Inference parameter presets are now persisted through /inference-param-presets CRUD endpoints and exposed in the shared frontend context bar.
Evaluate now has a queue API backed by completed test_results, with source-linked scoring and skip persistence while preserving the existing five 1-5 leaderboard score fields.
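The concurrent launch with per-target failure isolation described for run groups can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `ChildOutcome`, `launchRunGroup`, and the `launchChild` callback are hypothetical names, and the real run-group types are not shown in these notes.

```typescript
// Hypothetical outcome shape for one child run in a group.
type ChildOutcome =
  | { target: string; status: "completed"; runId: string }
  | { target: string; status: "failed"; error: string };

// launchChild is an assumed callback that starts one child run for a target
// and resolves with its run id.
async function launchRunGroup(
  targets: string[],
  launchChild: (target: string) => Promise<string>,
): Promise<ChildOutcome[]> {
  // Every child starts immediately; each one catches its own error, so a
  // failing target is recorded as "failed" without aborting its siblings.
  return Promise.all(
    targets.map(async (target): Promise<ChildOutcome> => {
      try {
        const runId = await launchChild(target);
        return { target, status: "completed", runId };
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        return { target, status: "failed", error };
      }
    }),
  );
}
```

Catching inside each mapped task keeps the outer `Promise.all` from rejecting, which is one common way to get isolation while still returning results in target order.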
Changed
CI, release, and local Node version guidance now target Node.js 25 while declaring the supported runtime range as >=22.19 <26, matching Undici 8 requirements without claiming Node 26 support before native SQLite dependencies allow it.
better-sqlite3 is now pinned to the latest verified 12.9 release line for the current Node runtime window.
Frontend styling now loads the new design-system foundation tokens, vendored IBM Plex fonts, and shared component primitives for cards, buttons, inputs, health pills, metrics, and architecture-tree surfaces.
The frontend shell now uses React Router with a 220px always-expanded five-item sidebar, URL-backed Catalog/Results sub-tabs, legacy route redirects, and sidebar health/count status instead of the former global metric-card header.
Catalog now replaces the legacy Inference Servers and Models bodies with a merged Servers/Models funnel, URL-backed server/model filters, server health view, slide-over add/edit drawer, card grids, and a full-width model inspector layout.
Run now uses a unified 1-8 model workflow with query-backed model chips, shared template/options controls, single-target detail rendering, multi-target comparison columns, and summary aggregation.
Results now uses a single merged Dashboard/Leaderboard/History page with a shared 240px filter rail, URL-owned tab/filter/sort/pagination/detail state, export/share/reset actions, run detail drawers for Dashboard and History, and evaluation detail drawers for Leaderboard.
Package 06 polish adds shared reg-lights, a persistent inference context bar on Run/Templates/Results/Evaluate, a two-pane Templates layout, and a manual Evaluate scoring queue.
Run, Templates, Results, and Evaluate now share merged page headers with the inference context bar aligned directly below the page header.
Results now uses a full-width staged funnel with relationship-aware Servers -> Models -> Tests/range filtering, a full-width empty dashboard state, and downstream pruning when upstream selections change.
Results and Catalog Models funnels now share numbered stages, aligned Clear/Collapse controls, Catalog-style collapsible rail treatment, and persisted collapse state.
Results Tests/range and Catalog Models filter rails now use scoped Clear actions that preserve upstream selections while clearing only the filters owned by that rail.
Leaderboard remains backed by evaluations while accepting server, model, score range, sort, and group query parameters, including grouping by server and inference_config.quantization_level.
Inference server authentication can now use stored raw bearer/custom-header tokens for backend probes and runs while preserving the existing token_env fallback.
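For readers checking a local runtime against the supported range declared above (>=22.19 <26), a small guard might look like the sketch below. The function name `isSupportedNode` is an assumption made for illustration; the project may enforce the range differently, for example through the `engines` field alone.

```typescript
// Returns true when a Node.js version string (with or without a leading "v")
// falls inside the declared range >=22.19 <26.
function isSupportedNode(version: string): boolean {
  const match = /^v?(\d+)\.(\d+)/.exec(version);
  if (!match) return false; // not a recognizable major.minor version
  const major = Number(match[1]);
  const minor = Number(match[2]);
  if (major >= 26) return false; // upper bound is exclusive
  if (major > 22) return true; // 23.x, 24.x, and 25.x are all inside the range
  return major === 22 && minor >= 19; // 22.x needs at least minor 19
}
```

In a Node.js script this could be called as `isSupportedNode(process.version)` before opening the database or starting the server.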
Fixed
Backend Vitest runs now ignore production SQLite database defaults, use a dedicated backend-test.sqlite by default, and fail fast if a backend test tries to open the production DB.
Backend proxy support now sends plain HTTP outbound requests to the configured proxy in absolute-form while retaining CONNECT tunneling for HTTPS targets. Backend outbound fetches now route directly through the configured Undici dispatcher, and process-level NO_PROXY no longer bypasses backend proxy routing unless AITESTBENCH_INFERENCE_NO_PROXY is set.
Inference server API responses now mask stored raw auth tokens and expose only token presence metadata.
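The token masking fix can be pictured with a sketch like the one below. Apart from `token_env`, every field and function name here (`auth_token`, `has_auth_token`, `toPublicServer`) is hypothetical; the actual response schema is not given in these notes.

```typescript
// Assumed stored record: auth_token stands in for the raw stored token.
interface StoredServer {
  id: string;
  name: string;
  token_env?: string;
  auth_token?: string;
}

// Assumed API shape: the raw token is replaced by a presence flag.
interface PublicServer {
  id: string;
  name: string;
  token_env?: string;
  has_auth_token: boolean;
}

function toPublicServer(server: StoredServer): PublicServer {
  // Pull the secret out of the object so it cannot be spread into the response.
  const { auth_token, ...rest } = server;
  return {
    ...rest,
    has_auth_token: typeof auth_token === "string" && auth_token.length > 0,
  };
}
```

Removing the secret by destructuring, rather than overwriting it with a placeholder, means the serialized response never carries a field that clients could mistake for a usable token.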