v0.7.0

@github-actions github-actions released this 14 Jun 13:38
· 88 commits to main since this release

Added

  • Settings side-shell model selection — Settings now uses a categorized side-shell with a dedicated local-only Model Selection picker backed by active /models records, plus foldable environment sections scoped to Runtime, Providers & Auth, Connectivity, Frontend, and Advanced instead of a duplicated catch-all environment tab.
  • README settings alignment — the root README now calls out active development status and Settings-managed environment values.
  • Paired benchmark stage runner checkpoint: paired_request_loop templates now validate and run with pair-member preservation, pair metric paths such as pair.cold.elapsed_ms, and simple difference derived metrics, while keeping paired-stage authoring in the Templates Raw JSON drawer.
  • Complete benchmark-template stage authoring — the Templates editor now exposes paired-stage fields including pair delays, pair members, simple difference derived metrics, stage observability JSON, and custom metric IDs while retaining Raw JSON as an escape hatch.
  • Templates benchmark-template authoring checkpoint — the existing Templates page now authors benchmark test_template document CRUD through /benchmark/documents, while keeping benchmark_plan creation out of the UI for the later Run-page flow.
  • BenchmarkPlan ref-document checkpoint — benchmark-native documents can now be persisted through /benchmark/documents, stored benchmark_plan documents can be created/read through /benchmark/plans, and /benchmark/plans/:id/run resolves template/dataset/runtime/model refs into the existing multi-model plan runner while keeping the inline /benchmark/plans/run route transitional.
  • Model load time metric: load_duration_ms is extracted from server-native response metadata (Ollama reports exact load time in nanoseconds on every /api/chat and /api/generate response). It is exposed as a first-class metric in computeItemMetrics and aggregated as max, since load only fires on the cold request. The Run page metrics panel shows a "model load" row when the value is non-null and > 0; the row is hidden for servers that don't report it (llama.cpp, vLLM, TGI).
  • Ollama protocol timing metrics: total_duration (ns) feeds server_total_time_ms (server-measured total including load, prefill, and decode); prompt_eval_duration (ns) feeds server_prompt_eval_ms; eval_duration (ns) feeds server_eval_ms. Applies to all Ollama-compatible servers (Ollama, Inferencer, etc.). The Run page shows "server prefill" and "server decode" rows when the values are non-null; because they are server-reported, they are not rendered in red.
  • oMLX native metrics: usage.model_load_duration (seconds) now feeds load_duration_ms alongside Ollama's load_duration (nanoseconds); usage.total_time (seconds) surfaces as the new server_total_time_ms metric, representing server-measured processing time (excludes network, comparable to elapsed_ms). The Run page shows a "server time" row when the value is non-null.
  • Request-triggered load estimator: estimateRequestTriggeredLoad() computes a heuristic load_estimate from ordered metric_results when ≥ 3 samples exist. It compares first-request latency against the median of the warm requests and detects a load event when the cold spike exceeds max(50% of warm baseline, 3× warm stddev). It prefers first_token_ms over elapsed_ms when streaming data is present. The result is stored as load_estimate on the result document. The Run page shows "model load (est.)" in bold red when a load event is detected and no native load_duration_ms is available; the red styling signals a heuristic rather than a server-reported value.
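
The unit handling described in the timing-metric entries above can be sketched roughly as follows. This is an illustration only, not the project's code: the function names (extractServerTimings, aggregateLoadMs) and the ServerTimingMetrics shape are hypothetical. The only things taken from the notes are the field names, their units (Ollama durations in nanoseconds, oMLX usage fields in seconds), and the max aggregation for load time.

```typescript
// Hypothetical output shape; the real computeItemMetrics result is not shown in these notes.
interface ServerTimingMetrics {
  load_duration_ms: number | null;
  server_total_time_ms: number | null;
  server_prompt_eval_ms: number | null;
  server_eval_ms: number | null;
}

const NS_PER_MS = 1e6;
const MS_PER_S = 1e3;

// Convert only finite numbers; anything else means "server did not report it".
function nsToMs(value: unknown): number | null {
  return typeof value === "number" && Number.isFinite(value) ? value / NS_PER_MS : null;
}

function sToMs(value: unknown): number | null {
  return typeof value === "number" && Number.isFinite(value) ? value * MS_PER_S : null;
}

// Accepts a parsed response body. Ollama-style fields are top-level nanoseconds;
// oMLX-style fields live under `usage` and are in seconds.
function extractServerTimings(body: Record<string, unknown>): ServerTimingMetrics {
  const rawUsage = body["usage"];
  const usage = (typeof rawUsage === "object" && rawUsage !== null ? rawUsage : {}) as Record<string, unknown>;
  return {
    load_duration_ms: nsToMs(body["load_duration"]) ?? sToMs(usage["model_load_duration"]),
    server_total_time_ms: nsToMs(body["total_duration"]) ?? sToMs(usage["total_time"]),
    server_prompt_eval_ms: nsToMs(body["prompt_eval_duration"]),
    server_eval_ms: nsToMs(body["eval_duration"]),
  };
}

// Load only fires on the cold request, so the per-run aggregate is the maximum.
function aggregateLoadMs(values: Array<number | null>): number | null {
  const present = values.filter((v): v is number => v !== null);
  return present.length > 0 ? Math.max(...present) : null;
}
```

A server that reports none of these fields (for example llama.cpp, vLLM, or TGI, per the notes) yields all-null metrics under this sketch, which matches the "row hidden" behavior described above.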
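
The request-triggered load heuristic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual estimateRequestTriggeredLoad() implementation: the MetricResult and LoadEstimate shapes are hypothetical, "streaming data is present" is assumed to mean every sample carries a numeric first_token_ms, the standard deviation is assumed to be the population form, and the estimate is assumed to be the cold latency minus the warm median.

```typescript
// Hypothetical sample shape; the real metric_results schema is not shown in these notes.
interface MetricResult {
  elapsed_ms: number;
  first_token_ms?: number | null;
}

// Hypothetical result shape for illustration.
interface LoadEstimate {
  detected: boolean;
  source: "first_token_ms" | "elapsed_ms";
  cold_ms: number;
  warm_median_ms: number;
  estimated_load_ms: number; // 0 when no load event is detected
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Population standard deviation (an assumption; the notes do not say which form is used).
function stddev(values: number[]): number {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

function estimateLoadSketch(results: MetricResult[]): LoadEstimate | null {
  if (results.length < 3) return null; // the notes require at least 3 samples

  // Prefer time-to-first-token when every sample has it (assumed meaning of "streaming data").
  const streaming = results.every((r) => typeof r.first_token_ms === "number");
  const source = streaming ? "first_token_ms" : "elapsed_ms";
  const samples = results.map((r) => (streaming ? (r.first_token_ms as number) : r.elapsed_ms));

  const cold = samples[0];
  const warm = samples.slice(1);
  const warmMedian = median(warm);
  const spike = cold - warmMedian;

  // Detection rule from the notes: spike > max(50% of warm baseline, 3x warm stddev).
  const threshold = Math.max(0.5 * warmMedian, 3 * stddev(warm));
  const detected = spike > threshold;

  return {
    detected,
    source,
    cold_ms: cold,
    warm_median_ms: warmMedian,
    estimated_load_ms: detected ? spike : 0,
  };
}
```

For instance, with elapsed_ms samples of 5000, 1000, 1100, and 900, the warm median is 1000 ms and the threshold is 500 ms, so the 4000 ms spike is flagged as a load event. Taking the larger of the two thresholds keeps noisy warm runs from triggering false positives while still requiring a meaningful spike when the warm runs are very stable.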

Fixed

  • Stream dropdown in Run page Step 4 options grid now matches the height of number inputs (font-size: 12px and explicit height: 35px applied uniformly via .run-options-grid selector).
  • Derived/estimated metrics in the Run page metrics panel (tok / s (decode), tok / s (overall), prefill tok / s, model load (est.)) now render in bold red via .is-estimated class, consistently distinguishing computed values from directly measured or server-reported ones.
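
Taken together, the release notes imply a simple rule for the "model load" row: show the native value when the server reports a positive one, otherwise fall back to the heuristic estimate and mark it with the .is-estimated class. A rough sketch of that selection logic is below. The function name, parameter shapes, and MetricRow type are hypothetical and only illustrate the rule; the row labels and class name come from the notes.

```typescript
// Hypothetical row model for illustration; the actual Run page component is not shown here.
interface MetricRow {
  label: string;
  valueMs: number;
  className: string; // "is-estimated" renders in bold red per the notes
}

// Hypothetical minimal view of the stored load_estimate.
interface LoadEstimateView {
  detected: boolean;
  estimated_load_ms: number;
}

function modelLoadRow(nativeLoadMs: number | null, estimate: LoadEstimateView | null): MetricRow | null {
  // A server-reported value wins whenever it is non-null and > 0.
  if (nativeLoadMs !== null && nativeLoadMs > 0) {
    return { label: "model load", valueMs: nativeLoadMs, className: "" };
  }
  // Otherwise fall back to the heuristic, flagged as estimated.
  if (estimate !== null && estimate.detected) {
    return { label: "model load (est.)", valueMs: estimate.estimated_load_ms, className: "is-estimated" };
  }
  // Nothing reported and nothing detected: the row is hidden.
  return null;
}
```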