
v0.11.0



github-actions released this 25 Jun 13:15
· 2 commits to main since this release
7155c1f

Added

  • Long-context Python function retrieval benchmark — added seven built-in context-window templates linked to file-backed Python function-retrieval datasets, with front/middle/late function placement, two-function retrieval, and a negative control.
  • Long-context Python needle benchmark — added seven built-in context-window templates linked to file-backed Python positional-recall datasets, with front/middle/late needle placement, 4k-256k context sizes, two-fact retrieval, and a negative control.
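
To make the placement terms above concrete, here is a minimal sketch of how a needle could be positioned inside a Python haystack and how recall could be checked. It is illustrative only and not the project's implementation: `PLACEMENTS`, `build_haystack`, and `recalled` are hypothetical names, the percentages are assumed (the notes name the positions but not their offsets), and treating the negative control as a haystack with no needle is also an assumption.

```python
from typing import List, Optional

# Hypothetical offsets, as a percentage of the haystack length.
PLACEMENTS = {"front": 0, "middle": 50, "late": 90}


def build_haystack(filler: List[str], needle: str, placement: Optional[str]) -> List[str]:
    """Return a copy of the filler lines with the needle inserted.

    placement=None models an assumed negative control: no needle at all.
    """
    if placement is None:
        return list(filler)
    index = len(filler) * PLACEMENTS[placement] // 100
    return filler[:index] + [needle] + filler[index:]


def recalled(answer: str, required_terms: List[str]) -> bool:
    """Case-insensitive check that every required term appears in the answer."""
    lowered = answer.lower()
    return all(term.lower() in lowered for term in required_terms)


# Ten filler lines of Python source, plus one fact to retrieve.
filler = [f"def helper_{i}(): return {i}" for i in range(10)]
needle = "SECRET_TOKEN = 'amber-42'"
haystack = build_haystack(filler, needle, "middle")
```

A two-fact variant would insert two needles at different offsets and pass both expected terms to `recalled`.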

Changed

  • Project license — switched the project license from MIT to Apache License 2.0 and updated the README badge.
  • Changelog category workflow — AGENTS.md now requires changelog updates to preserve Keep a Changelog category headings and place entries under the appropriate Added, Changed, Fixed, Removed, or Security section instead of flattening release notes.
  • Release workflow guidance — AGENTS.md now records the release-prep workflow: read RELEASING.md, keep workspace versions aligned, use annotated tags, and, when release checks hit ELOOP, check for local recursive specs symlink artifacts before changing tracked files.
  • Run fatal upstream errors — Run-created benchmark profiles now cancel on the first fatal upstream error, context-window retrieval stops on the first failed item, and HTTP diagnostics preserve upstream provider codes such as prefill_memory_exceeded.
  • Run template capability filtering — Run now disables benchmark templates that exceed a selected model's declared context window or require tool calling when the selected model/server is not tool-capable.
  • Run audit and functional checks split — Run now separates pipeline execution health from functional benchmark checks, and treats missing required terms as a visible functional failure when exact matching is disabled.
  • Run functional failure clue — Run now surfaces a benchmark assertion failure line when quality metrics fail despite a technically completed run, with categories such as invalid tool arguments or missing tool calls.
  • Templates catalog shell — /templates now opens as a browse-first catalog with category buckets, grouped template cards, and dedicated preview state, instead of auto-selecting the first template row. The existing AI-first authoring flow is embedded in the new shell.
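
The capability filtering described under "Run template capability filtering" can be pictured roughly as follows. This is a sketch under stated assumptions, not the project's code: the `Template` and `Model` types, their fields, and `disabled_reason` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Template:
    name: str
    context_tokens: int        # context size the template needs (assumed field)
    needs_tools: bool = False  # whether the template relies on tool calling


@dataclass
class Model:
    context_window: int  # declared context window of the selected model
    tool_capable: bool   # whether the model/server supports tool calling


def disabled_reason(template: Template, model: Model) -> Optional[str]:
    """Return why a template should be disabled, or None if it can run."""
    if template.context_tokens > model.context_window:
        return "exceeds context window"
    if template.needs_tools and not model.tool_capable:
        return "requires tool calling"
    return None


model = Model(context_window=32_000, tool_capable=False)
templates = [
    Template("needle-4k", 4_000),
    Template("needle-256k", 256_000),
    Template("tool-call-basic", 2_000, needs_tools=True),
]
enabled = [t.name for t in templates if disabled_reason(t, model) is None]
```

Returning a reason string rather than a boolean lets a UI explain why a template is greyed out.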