Conversation
- New settings panel in final summary showing model, endpoint, examples/rollouts/concurrency, sampling args, and env args
- `--abbreviated-summary` (`-A`) flag skips example prompts/completions in the summary, showing only settings and stats for quick ablation comparison
- Document flag in docs/evaluation.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Sampling args hidden when values are zero
  - Changed the condition from `any(config.sampling_args.values())` to `any(v is not None for v in config.sampling_args.values())` to correctly display sampling args with zero values like `temperature=0`.
- ✅ Fixed: Evaluate skill not updated for new flag
  - Added documentation for the `--abbreviated-summary` flag to skills/evaluate-environments/SKILL.md under the Common Evaluation Patterns section.
Or push these changes by commenting:
@cursor push f5e4637f70
Preview (f5e4637f70)
diff --git a/skills/evaluate-environments/SKILL.md b/skills/evaluate-environments/SKILL.md
--- a/skills/evaluate-environments/SKILL.md
+++ b/skills/evaluate-environments/SKILL.md
@@ -89,6 +89,10 @@
 ```bash
 prime eval run configs/eval/my-benchmark.toml
 ```
+6. Show abbreviated summary (settings and stats only, skip example prompts/completions):
+```bash
+prime eval run my-env -A
+```

 Push Results to Platform
 - After proper eval runs complete, nudge users to push results for detailed platform viewing.
diff --git a/verifiers/utils/eval_display.py b/verifiers/utils/eval_display.py
--- a/verifiers/utils/eval_display.py
+++ b/verifiers/utils/eval_display.py
@@ -1013,7 +1013,7 @@
     display_max = self._display_max_concurrent(config, env_state.total)
     text.append(fmt_concurrency(display_max), style="bold")
-    if config.sampling_args and any(config.sampling_args.values()):
+    if config.sampling_args and any(v is not None for v in config.sampling_args.values()):
         text.append("\n")
         text.append("sampling: ", style="dim")
         parts = [
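The one-line change above can be illustrated in isolation. A standalone sketch (`sampling_args` here is a stand-in dict, not the project's actual config object):

```python
# Truthiness check vs. explicit None check for "is any sampling arg set?".
# temperature=0 is a deliberate setting, but it is falsy, so a plain
# any(values) hides it; checking "is not None" treats only None as unset.
sampling_args = {"temperature": 0, "max_tokens": None}

# Old condition: any() over raw values treats 0 as "no custom sampling".
old_shows_panel = bool(sampling_args) and any(sampling_args.values())

# New condition: only None counts as unset, so temperature=0 is shown.
new_shows_panel = bool(sampling_args) and any(
    v is not None for v in sampling_args.values()
)

print(old_shows_panel)  # False: panel would be hidden
print(new_shows_panel)  # True: temperature=0 is displayed
```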
</details>
```python
    default=False,
    action="store_true",
    help="Abbreviated summary: show settings and stats only, skip example prompts/completions",
)
```
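The argument definition above follows the standard argparse pattern for a boolean flag. A self-contained sketch (the parser setup and program name here are illustrative assumptions, not the project's actual code):

```python
import argparse

# Illustrative wiring of the -A/--abbreviated-summary flag.
# argparse derives the attribute name "abbreviated_summary" from the
# long option, and store_true makes its presence flip False -> True.
parser = argparse.ArgumentParser(prog="eval-sketch")
parser.add_argument(
    "-A",
    "--abbreviated-summary",
    default=False,
    action="store_true",
    help="Abbreviated summary: show settings and stats only, skip example prompts/completions",
)

args = parser.parse_args(["-A"])
print(args.abbreviated_summary)  # True
```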
Evaluate skill not updated for new flag
Low Severity
The new `--abbreviated-summary` (`-A`) flag is a user-facing command contract change in verifiers/scripts/eval.py, but skills/evaluate-environments/SKILL.md has not been updated to mention it. Per project rules, changes to command contracts in verifiers/scripts/*.py require corresponding updates to the affected skills files.
Triggered by project rule: BugBot Instructions
Use `v is not None` instead of a truthiness check so that falsy but intentional values like `temperature=0` are not silently hidden.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>



Description
Here is the summary of a single environment with the -A flag set:
The motivation for the `--abbreviated-summary` flag is that in multi-env evals it is often difficult to get a quick overview of the results, because the examples take up so much space that it is hard to move between the different environments. The settings panel exists because ablations with different settings of the same environment are common, and it was previously impossible to tell which summary of the same environment was produced with which settings.
Type of Change
Testing
- Ran `uv run pytest` locally.
Checklist
Note
Low Risk
Display-only and CLI-flag plumbing changes; no changes to evaluation execution, scoring, or persistence logic beyond how results are summarized/rendered.
Overview
Adds an always-on settings panel to each environment’s final evaluation summary, surfacing model/endpoint, example+rollout counts, effective concurrency, sampling args, and env args.
Introduces `--abbreviated-summary` (`-A`) to run the Rich/TUI evaluation display with a compact final summary that omits the example 0 prompt/completion panels (settings + stats only), and wires this flag through `prime eval` → `run_evaluations_tui` → `EvalDisplay`. Also tightens the sampling-args display logic to only show "custom sampling" when at least one sampling arg is non-`None`, and updates docs/tests to include the new flag.

Written by Cursor Bugbot for commit 8e372d6. This will update automatically on new commits.