vf-eval: Add settings panel in summary and --abbreviated-summary flag#987

Merged
willccbb merged 2 commits into main from
sebastian/vf-eval-settings-and-compact-view-2026-03-04
Mar 4, 2026

Conversation

@snimu (Contributor) commented Mar 4, 2026

Description

  • New settings panel in final summary showing model, endpoint, examples/rollouts/concurrency, sampling args, and env args
  • --abbreviated-summary (-A) flag skips example prompts/completions in the summary, showing only settings and stats for quick ablation comparison

Here is the summary of a single environment with the -A flag set:

*(screenshot of the abbreviated summary omitted)*

The motivation for the --abbreviated-summary flag is that in multi-env evals it is often difficult to get a quick overview of the results: the examples take up so much space that it is hard to move between environments. The settings panel exists because ablations over different settings of the same environment are common, and it was previously impossible to tell which summary of an environment was produced with which settings.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running `uv run pytest` locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Low Risk
Display-only and CLI-flag plumbing changes; no changes to evaluation execution, scoring, or persistence logic beyond how results are summarized/rendered.

Overview
Adds an always-on settings panel to each environment’s final evaluation summary, surfacing model/endpoint, example+rollout counts, effective concurrency, sampling args, and env args.
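As a rough stdlib-only illustration of what such a settings panel aggregates (field names follow the PR description; the actual implementation uses Rich and lives in verifiers/utils/eval_display.py, and the example values below are hypothetical):

```python
def render_settings_panel(model, endpoint, num_examples, rollouts, concurrency,
                          sampling_args, env_args):
    """Render a plain-text settings summary; only non-None sampling args are shown."""
    lines = [
        f"model: {model}",
        f"endpoint: {endpoint}",
        f"examples: {num_examples}  rollouts: {rollouts}  concurrency: {concurrency}",
    ]
    # Mirror the PR's non-None check: temperature=0 should still display.
    shown = {k: v for k, v in (sampling_args or {}).items() if v is not None}
    if shown:
        lines.append("sampling: " + ", ".join(f"{k}={v}" for k, v in shown.items()))
    if env_args:
        lines.append("env: " + ", ".join(f"{k}={v}" for k, v in env_args.items()))
    return "\n".join(lines)

print(render_settings_panel("example-model", "https://api.example.com/v1",
                            5, 3, 8, {"temperature": 0, "top_p": None}, {}))
```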

Introduces --abbreviated-summary (-A) to run the Rich/TUI evaluation display with a compact final summary that omits the example 0 prompt/completion panels (settings + stats only), and wires this flag through prime eval → run_evaluations_tui → EvalDisplay.

Also tightens sampling-args display logic to only show “custom sampling” when at least one sampling arg is non-None, and updates docs/tests to include the new flag.

Written by Cursor Bugbot for commit 8e372d6.

- New settings panel in final summary showing model, endpoint,
  examples/rollouts/concurrency, sampling args, and env args
- --abbreviated-summary (-A) flag skips example prompts/completions
  in the summary, showing only settings and stats for quick
  ablation comparison
- Document flag in docs/evaluation.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@snimu snimu requested review from mikasenghaas and willccbb March 4, 2026 23:09

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Sampling args hidden when values are zero
    • Changed the condition from `any(config.sampling_args.values())` to `any(v is not None for v in config.sampling_args.values())` so that sampling args with intentional zero values, such as `temperature=0`, are displayed correctly.
  • ✅ Fixed: Evaluate skill not updated for new flag
    • Added documentation for the `--abbreviated-summary` flag to skills/evaluate-environments/SKILL.md under the Common Evaluation Patterns section.
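The first fix addresses a classic truthiness pitfall. A minimal standalone sketch of the two conditions (the dict and its keys are illustrative, not the actual config object):

```python
# Illustrative sampling args: temperature=0 is an intentional, falsy value.
sampling_args = {"temperature": 0, "top_p": None, "max_tokens": None}

# Old condition: any() on the values treats 0 like "unset", hiding the line.
old_condition = bool(sampling_args) and any(sampling_args.values())

# Fixed condition: only None counts as "unset"; temperature=0 still displays.
new_condition = bool(sampling_args) and any(
    v is not None for v in sampling_args.values()
)

print(old_condition, new_condition)  # False True
```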

Preview (f5e4637f70)
````diff
diff --git a/skills/evaluate-environments/SKILL.md b/skills/evaluate-environments/SKILL.md
--- a/skills/evaluate-environments/SKILL.md
+++ b/skills/evaluate-environments/SKILL.md
@@ -89,6 +89,10 @@
 ```bash
 prime eval run configs/eval/my-benchmark.toml
 ```
+
+6. Show abbreviated summary (settings and stats only, skip example prompts/completions):
+
+```bash
+prime eval run my-env -A
+```

 Push Results to Platform

 1. After proper eval runs complete, nudge users to push results for detailed platform viewing.
````

```diff
diff --git a/verifiers/utils/eval_display.py b/verifiers/utils/eval_display.py
--- a/verifiers/utils/eval_display.py
+++ b/verifiers/utils/eval_display.py
@@ -1013,7 +1013,7 @@
     display_max = self._display_max_concurrent(config, env_state.total)
     text.append(fmt_concurrency(display_max), style="bold")
-    if config.sampling_args and any(config.sampling_args.values()):
+    if config.sampling_args and any(v is not None for v in config.sampling_args.values()):
         text.append("\n")
         text.append("sampling: ", style="dim")
         parts = [
```


The review comment is anchored on the new flag definition in verifiers/scripts/eval.py:

```python
    default=False,
    action="store_true",
    help="Abbreviated summary: show settings and stats only, skip example prompts/completions",
)
```
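For context, a self-contained sketch of how such a flag is typically declared with argparse (the parser variable and program name are assumptions; only the option names, default, action, and help text come from the fragment above):

```python
import argparse

parser = argparse.ArgumentParser(prog="vf-eval")  # program name assumed
parser.add_argument(
    "-A",
    "--abbreviated-summary",
    default=False,
    action="store_true",
    help="Abbreviated summary: show settings and stats only, skip example prompts/completions",
)

# argparse derives the attribute name abbreviated_summary from the long option.
args = parser.parse_args(["-A"])
print(args.abbreviated_summary)  # True
```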

Evaluate skill not updated for new flag

Low Severity

The new --abbreviated-summary (-A) flag is a user-facing command-contract change in verifiers/scripts/eval.py, but skills/evaluate-environments/SKILL.md has not been updated to mention it. Per project rules, changes to command contracts in verifiers/scripts/*.py require corresponding updates to the affected skill files.


Triggered by project rule: BugBot Instructions

Use 'v is not None' instead of truthiness check so that falsy but
intentional values like temperature=0 are not silently hidden.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@willccbb willccbb merged commit 5334898 into main Mar 4, 2026
6 checks passed
