vf-eval: Add settings panel in summary and --abbreviated-summary flag#987

Merged
willccbb merged 2 commits into main from
sebastian/vf-eval-settings-and-compact-view-2026-03-04
Mar 4, 2026

Conversation

@snimu (Contributor) commented Mar 4, 2026

Description

  • New settings panel in final summary showing model, endpoint, examples/rollouts/concurrency, sampling args, and env args
  • --abbreviated-summary (-A) flag skips example prompts/completions in the summary, showing only settings and stats for quick ablation comparison

Here is the summary of a single environment with the -A flag set:

*(screenshot of the abbreviated summary omitted)*

The motivation for the --abbreviated-summary flag is that in multi-env evals it is often difficult to get a quick overview of the results: the examples take up so much space that it is hard to move between environments. The settings panel exists because ablations over different settings of the same environment are common, and it was previously impossible to tell which summary of an environment was produced with which settings.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running `uv run pytest` locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Low Risk
Display-only and CLI-flag plumbing changes; no changes to evaluation execution, scoring, or persistence logic beyond how results are summarized/rendered.

Overview
Adds an always-on settings panel to each environment’s final evaluation summary, surfacing model/endpoint, example+rollout counts, effective concurrency, sampling args, and env args.
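As a rough stdlib-only illustration of what such a settings panel aggregates (field names follow the PR description; the actual implementation uses Rich and lives in verifiers/utils/eval_display.py, and the example values below are hypothetical):

```python
def render_settings_panel(model, endpoint, num_examples, rollouts, concurrency,
                          sampling_args, env_args):
    """Render a plain-text settings summary; only non-None sampling args are shown."""
    lines = [
        f"model: {model}",
        f"endpoint: {endpoint}",
        f"examples: {num_examples}  rollouts: {rollouts}  concurrency: {concurrency}",
    ]
    # Mirror the PR's non-None check: temperature=0 should still display.
    shown = {k: v for k, v in (sampling_args or {}).items() if v is not None}
    if shown:
        lines.append("sampling: " + ", ".join(f"{k}={v}" for k, v in shown.items()))
    if env_args:
        lines.append("env: " + ", ".join(f"{k}={v}" for k, v in env_args.items()))
    return "\n".join(lines)

print(render_settings_panel("example-model", "https://api.example.com/v1",
                            5, 3, 8, {"temperature": 0, "top_p": None}, {}))
```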

Introduces --abbreviated-summary (-A) to run the Rich/TUI evaluation display with a compact final summary that omits the example 0 prompt/completion panels (settings + stats only), and wires this flag through prime eval → run_evaluations_tui → EvalDisplay.

Also tightens sampling-args display logic to only show “custom sampling” when at least one sampling arg is non-None, and updates docs/tests to include the new flag.

Written by Cursor Bugbot for commit 8e372d6.

- New settings panel in final summary showing model, endpoint,
  examples/rollouts/concurrency, sampling args, and env args
- --abbreviated-summary (-A) flag skips example prompts/completions
  in the summary, showing only settings and stats for quick
  ablation comparison
- Document flag in docs/evaluation.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@snimu snimu requested review from mikasenghaas and willccbb March 4, 2026 23:09

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Sampling args hidden when values are zero
    • Changed the condition from `any(config.sampling_args.values())` to `any(v is not None for v in config.sampling_args.values())` so that sampling args with intentional zero values, such as `temperature=0`, are displayed correctly.
  • ✅ Fixed: Evaluate skill not updated for new flag
    • Added documentation for the `--abbreviated-summary` flag to skills/evaluate-environments/SKILL.md under the Common Evaluation Patterns section.
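The first fix addresses a classic truthiness pitfall. A minimal standalone sketch of the two conditions (the dict and its keys are illustrative, not the actual config object):

```python
# Illustrative sampling args: temperature=0 is an intentional, falsy value.
sampling_args = {"temperature": 0, "top_p": None, "max_tokens": None}

# Old condition: any() on the values treats 0 like "unset", hiding the line.
old_condition = bool(sampling_args) and any(sampling_args.values())

# Fixed condition: only None counts as "unset"; temperature=0 still displays.
new_condition = bool(sampling_args) and any(
    v is not None for v in sampling_args.values()
)

print(old_condition, new_condition)  # False True
```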

Preview (f5e4637f70)
````diff
diff --git a/skills/evaluate-environments/SKILL.md b/skills/evaluate-environments/SKILL.md
--- a/skills/evaluate-environments/SKILL.md
+++ b/skills/evaluate-environments/SKILL.md
@@ -89,6 +89,10 @@
 ```bash
 prime eval run configs/eval/my-benchmark.toml
 ```
+
+6. Show abbreviated summary (settings and stats only, skip example prompts/completions):
+
+```bash
+prime eval run my-env -A
+```

 Push Results to Platform

 1. After proper eval runs complete, nudge users to push results for detailed platform viewing.
````

```diff
diff --git a/verifiers/utils/eval_display.py b/verifiers/utils/eval_display.py
--- a/verifiers/utils/eval_display.py
+++ b/verifiers/utils/eval_display.py
@@ -1013,7 +1013,7 @@
     display_max = self._display_max_concurrent(config, env_state.total)
     text.append(fmt_concurrency(display_max), style="bold")
-    if config.sampling_args and any(config.sampling_args.values()):
+    if config.sampling_args and any(v is not None for v in config.sampling_args.values()):
         text.append("\n")
         text.append("sampling: ", style="dim")
         parts = [
```


The review comment is anchored on the new flag definition in verifiers/scripts/eval.py:

```python
    default=False,
    action="store_true",
    help="Abbreviated summary: show settings and stats only, skip example prompts/completions",
)
```
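For context, a self-contained sketch of how such a flag is typically declared with argparse (the parser variable and program name are assumptions; only the option names, default, action, and help text come from the fragment above):

```python
import argparse

parser = argparse.ArgumentParser(prog="vf-eval")  # program name assumed
parser.add_argument(
    "-A",
    "--abbreviated-summary",
    default=False,
    action="store_true",
    help="Abbreviated summary: show settings and stats only, skip example prompts/completions",
)

# argparse derives the attribute name abbreviated_summary from the long option.
args = parser.parse_args(["-A"])
print(args.abbreviated_summary)  # True
```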

Evaluate skill not updated for new flag

Low Severity

The new --abbreviated-summary (-A) flag is a user-facing command-contract change in verifiers/scripts/eval.py, but skills/evaluate-environments/SKILL.md has not been updated to mention it. Per project rules, changes to command contracts in verifiers/scripts/*.py require corresponding updates to the affected skill files.


Triggered by project rule: BugBot Instructions

Use 'v is not None' instead of truthiness check so that falsy but
intentional values like temperature=0 are not silently hidden.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@willccbb willccbb merged commit 5334898 into main Mar 4, 2026
6 checks passed
