You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This commit was created on GitHub.com and signed with GitHub’s verified signature.
Added
Long-context Python function retrieval benchmark — added seven built-in context-window templates linked to file-backed Python function-retrieval datasets, with front/middle/late function placement, two-function retrieval, and a negative control.
Long-context Python needle benchmark — added seven built-in context-window templates linked to file-backed Python positional-recall datasets, with front/middle/late needle placement, 4k-256k context sizes, two-fact retrieval, and a negative control.
Changed
Project license — switched the project license from MIT to Apache License 2.0 and updated the README badge.
Changelog category workflow — AGENTS.md now requires changelog updates to preserve Keep a Changelog category headings and place entries under the appropriate Added, Changed, Fixed, Removed, or Security section instead of flattening release notes.
Release workflow guidance — AGENTS.md now records the release-prep workflow for reading RELEASING.md, keeping workspace versions aligned, using annotated tags, and checking local recursive specs symlink artifacts before changing tracked files when release checks hit ELOOP.
Run fatal upstream errors — Run-created benchmark profiles now cancel on the first fatal upstream error, context-window retrieval stops on the first failed item, and HTTP diagnostics preserve upstream provider codes such as prefill_memory_exceeded.
Run template capability filtering — Run now disables benchmark templates that exceed a selected model's declared context window or require tool calling when the selected model/server is not tool-capable.
Run audit and functional checks split — Run now separates pipeline execution health from functional benchmark checks, and treats missing required terms as a visible functional failure when exact matching is disabled.
Run functional failure clue — Run now surfaces a benchmark assertion failure line when quality metrics fail despite a technically completed run, with categories such as invalid tool arguments or missing tool calls.
Templates catalog shell — /templates now opens as a browse-first catalog with category buckets, grouped template cards, dedicated preview state, and the existing AI-first authoring flow embedded in the new shell instead of auto-selecting the first template row.