Settings side-shell model selection — Settings now uses a categorized side-shell with a dedicated local-only Model Selection picker backed by active /models records, plus foldable environment sections scoped to Runtime, Providers & Auth, Connectivity, Frontend, and Advanced instead of a duplicated catch-all environment tab.
README settings alignment — the root README now calls out active development status and Settings-managed environment values.
Paired benchmark stage runner checkpoint — paired_request_loop templates now validate and run with pair-member preservation, pair metric paths such as pair.cold.elapsed_ms, and simple difference derived metrics while keeping paired-stage authoring in the Templates Raw JSON drawer.
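As an illustration of how a pair metric path such as pair.cold.elapsed_ms and a simple difference derived metric could be evaluated, here is a minimal sketch. The MetricTree shape, the DifferenceMetric fields, the helper names, and the "warm" pair member are all assumptions made for this example; only the pair.cold.elapsed_ms path comes from the entry above.

```typescript
// Hypothetical nested metric container: leaves are numbers, branches are objects.
type MetricTree = { [key: string]: number | MetricTree };

// Hypothetical description of a "simple difference" derived metric.
interface DifferenceMetric {
  id: string;         // custom metric ID
  minuend: string;    // dotted path, e.g. "pair.cold.elapsed_ms"
  subtrahend: string; // dotted path, e.g. "pair.warm.elapsed_ms" (assumed member name)
}

// Walk a dotted path through the tree; return null if any segment is missing
// or the path does not end on a number.
function resolveMetricPath(root: MetricTree, path: string): number | null {
  let node: number | MetricTree | undefined = root;
  for (const key of path.split(".")) {
    if (node === undefined || typeof node === "number") return null;
    node = node[key];
  }
  return typeof node === "number" ? node : null;
}

// Evaluate minuend - subtrahend, or null when either side cannot be resolved.
function evaluateDifference(root: MetricTree, metric: DifferenceMetric): number | null {
  const a = resolveMetricPath(root, metric.minuend);
  const b = resolveMetricPath(root, metric.subtrahend);
  return a !== null && b !== null ? a - b : null;
}

const sample: MetricTree = {
  pair: { cold: { elapsed_ms: 3000 }, warm: { elapsed_ms: 500 } },
};
const coldPenalty = evaluateDifference(sample, {
  id: "cold_penalty_ms",
  minuend: "pair.cold.elapsed_ms",
  subtrahend: "pair.warm.elapsed_ms",
});
console.log(coldPenalty);
```

Returning null for an unresolved path, rather than throwing, keeps a missing pair member from failing the whole stage in this sketch; the real runner may validate paths up front instead.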
Complete benchmark-template stage authoring — the Templates editor now exposes paired-stage fields including pair delays, pair members, simple difference derived metrics, stage observability JSON, and custom metric IDs while retaining Raw JSON as an escape hatch.
Templates benchmark-template authoring checkpoint — the existing Templates page now creates, reads, updates, and deletes benchmark test_template documents through /benchmark/documents, while keeping benchmark_plan creation out of the UI for the later Run-page flow.
BenchmarkPlan ref-document checkpoint — benchmark-native documents can now be persisted through /benchmark/documents, stored benchmark_plan documents can be created/read through /benchmark/plans, and /benchmark/plans/:id/run resolves template/dataset/runtime/model refs into the existing multi-model plan runner while keeping the inline /benchmark/plans/run route transitional.
Model load time metric — load_duration_ms is now extracted from server-native response metadata (Ollama reports exact load time in nanoseconds on every /api/chat and /api/generate response). It is exposed as a first-class metric in computeItemMetrics and aggregated as max, because the load only happens on the cold request. The Run page metrics panel shows a "model load" row when the value is non-null and > 0; the row is hidden for servers that don't report it (llama.cpp, vLLM, TGI).
Ollama protocol timing metrics — total_duration (ns) feeds server_total_time_ms (the server-measured total, including load, prefill, and decode); prompt_eval_duration (ns) feeds server_prompt_eval_ms; eval_duration (ns) feeds server_eval_ms. This applies to all Ollama-compatible servers (Ollama, Inferencer, etc.). The Run page shows "server prefill" and "server decode" rows when the values are non-null; because they are server-reported, they are not rendered in red.
oMLX native metrics — usage.model_load_duration (seconds) now feeds load_duration_ms alongside Ollama's load_duration (nanoseconds); usage.total_time (seconds) surfaces as the server_total_time_ms metric, representing server-measured processing time (excludes network, comparable to elapsed_ms). The Run page shows a "server time" row when the value is non-null.
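The unit handling in the three timing entries above can be sketched as follows. This is a minimal illustration, not the project's computeItemMetrics: the interfaces, the helper names (fromOllama, fromOmlx, aggregateLoadMs), and the choice to leave prefill and decode timings null for oMLX are assumptions. The source field names, target metric names, and units are the ones stated in the entries.

```typescript
// Ollama-style response timings, all in nanoseconds.
interface OllamaTimings {
  load_duration?: number;
  total_duration?: number;
  prompt_eval_duration?: number;
  eval_duration?: number;
}

// oMLX-style usage timings, in seconds.
interface OmlxUsage {
  model_load_duration?: number;
  total_time?: number;
}

// Normalized metrics, all in milliseconds; null means "not reported".
interface ServerTimingMetrics {
  load_duration_ms: number | null;
  server_total_time_ms: number | null;
  server_prompt_eval_ms: number | null;
  server_eval_ms: number | null;
}

const nsToMs = (ns?: number): number | null =>
  typeof ns === "number" ? ns / 1e6 : null;

const secToMs = (s?: number): number | null =>
  typeof s === "number" ? s * 1000 : null;

function fromOllama(r: OllamaTimings): ServerTimingMetrics {
  return {
    load_duration_ms: nsToMs(r.load_duration),
    server_total_time_ms: nsToMs(r.total_duration),
    server_prompt_eval_ms: nsToMs(r.prompt_eval_duration),
    server_eval_ms: nsToMs(r.eval_duration),
  };
}

function fromOmlx(u: OmlxUsage): ServerTimingMetrics {
  return {
    load_duration_ms: secToMs(u.model_load_duration),
    server_total_time_ms: secToMs(u.total_time),
    // Assumption: this sketch maps only the two oMLX fields named in the changelog.
    server_prompt_eval_ms: null,
    server_eval_ms: null,
  };
}

// Aggregate load time across items as max, ignoring items that did not report it.
function aggregateLoadMs(values: Array<number | null>): number | null {
  const present = values.filter((v): v is number => v !== null);
  return present.length > 0 ? Math.max(...present) : null;
}

const ollama = fromOllama({
  load_duration: 2_500_000_000,
  total_duration: 4_000_000_000,
  prompt_eval_duration: 500_000_000,
  eval_duration: 1_000_000_000,
});
const omlx = fromOmlx({ model_load_duration: 1.5, total_time: 2 });
console.log(ollama, omlx, aggregateLoadMs([ollama.load_duration_ms, 0, null]));
```

Using max for the load aggregate means a single cold request determines the reported value, and warm requests that report a load time of 0 do not dilute it.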
Request-triggered load estimator — estimateRequestTriggeredLoad() computes a heuristic load_estimate from ordered metric_results when ≥ 3 samples exist. It compares the first-request latency against the median of the warm requests and detects a load event when the cold spike exceeds max(50% of warm baseline, 3× warm stddev). It prefers first_token_ms over elapsed_ms when streaming data is present. The result is stored as load_estimate on the result document. The Run page shows "model load (est.)" in bold red when a load event is detected and no native load_duration_ms is available; the red styling signals a heuristic rather than a server-reported value.
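The detection rule described above can be sketched as follows. This is an illustrative reading of the entry, not the actual estimateRequestTriggeredLoad(): the function name estimateLoadSketch, the input and output field layout, the use of population standard deviation, and the "every sample has first_token_ms" test for streaming data are assumptions.

```typescript
// Hypothetical per-request sample, in request order (first entry is the cold request).
interface MetricSample {
  elapsed_ms: number;
  first_token_ms?: number | null;
}

// Hypothetical output shape for the estimate.
interface LoadEstimateSketch {
  detected: boolean;
  estimated_load_ms: number | null; // cold spike over the warm baseline, when detected
  warm_baseline_ms: number;
  basis: "first_token_ms" | "elapsed_ms";
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Population standard deviation (assumption; the changelog does not say which variant).
function stddev(xs: number[]): number {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, b) => a + (b - mean) ** 2, 0) / xs.length;
  return Math.sqrt(variance);
}

function estimateLoadSketch(samples: MetricSample[]): LoadEstimateSketch | null {
  if (samples.length < 3) return null; // needs at least 3 ordered samples

  // Prefer first_token_ms when every sample carries it (streaming runs).
  const streaming = samples.every((s) => typeof s.first_token_ms === "number");
  const basis = streaming ? "first_token_ms" : "elapsed_ms";
  const values = samples.map((s) =>
    streaming ? (s.first_token_ms as number) : s.elapsed_ms,
  );

  const cold = values[0];
  const warm = values.slice(1);
  const baseline = median(warm);
  const spike = cold - baseline;
  const threshold = Math.max(0.5 * baseline, 3 * stddev(warm));
  const detected = spike > threshold;

  return {
    detected,
    estimated_load_ms: detected ? spike : null,
    warm_baseline_ms: baseline,
    basis,
  };
}

const coldRun = estimateLoadSketch([
  { elapsed_ms: 3000 },
  { elapsed_ms: 500 },
  { elapsed_ms: 520 },
  { elapsed_ms: 480 },
]);
console.log(coldRun);
```

In this example the warm baseline is 500 ms, so the threshold is max(250, 3 × warm stddev) = 250 ms, and the 2500 ms spike on the first request is flagged as a load event. Taking the larger of the two terms keeps noisy warm runs from triggering a detection on ordinary jitter.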
Fixed
Stream dropdown in Run page Step 4 options grid now matches the height of number inputs (font-size: 12px and explicit height: 35px applied uniformly via .run-options-grid selector).
Derived/estimated metrics in the Run page metrics panel (tok / s (decode), tok / s (overall), prefill tok / s, model load (est.)) now render in bold red via .is-estimated class, consistently distinguishing computed values from directly measured or server-reported ones.