Skip to content

Phase 2: TUI enhancements (6 default-off features) + plan-doc/CHANGELOG updates#8

Merged
MerverliPy merged 17 commits into
mainfrom
tui/phase-2
Jun 15, 2026
Merged

Phase 2: TUI enhancements (6 default-off features) + plan-doc/CHANGELOG updates#8
MerverliPy merged 17 commits into
mainfrom
tui/phase-2

Conversation

@MerverliPy

Copy link
Copy Markdown
Owner

Summary

Phase 2 of the TUI enhancement plan, plus the deferred Phase 0/1 plan-doc updates and a new CHANGELOG.md entry. Adds 6 new TUI features behind a default-off feature-flag contract. The default benchdeck tui invocation is provably unchanged.

14 commits, 4 files, +1292 / −52 lines, 168/168 TUI tests passing.

Phase 2 items

Item Flag Tests +Prod Description
P2-3 enable_heartbeat=False 3 +28 Last refresh + Run alive lines on Overview
P2-6 enable_infra_pointer=False 2 +16 Infra failures: N (see Detail tab) on Overview
P2-1 enable_case_filter=False 6 +213 f filter prompt; s sort cycle on Cases tab
P2-2 enable_log_tail=False 3 +43 live stderr-log tail on Overview
P2-5 enable_batch_export=False 4 +136 space toggle mark; E batch export on Cases tab
P2-4 theme="auto" 4 +28 NO_COLOR honored; light swaps pair 6 to BLACK-on-WHITE

Phase 2 total: 22 new tests, +464 production lines.

Default-off contract

Every Phase 2 feature is gated behind a BenchDeckTUI(...) kwarg that defaults to off (or to "auto" for the theme). The default invocation is provably unchanged — every new code path is unreachable at default flag values. CLI flag wiring is out of scope (the CLI in cli.py is outside this PR's ownership). The 5 per-item *_default_off_* tests lock down this contract.

Also in this PR

  • Phase 0/1 plan-doc updates (747268e, c649441): the plan's §0 status table was updated to mark Phases 0 and 1 as complete. These were the doc-only commits that lagged the Phase 0/1 code merges.
  • Phase 2 per-item doc updates (5 commits, one per item) and the final Phase 2 complete doc update (19508c9).
  • CHANGELOG.md (da7f242): records all 6 Phase 2 items under "Unreleased" with the default-off promise and a test-infrastructure summary.

Implementation handoff (out of scope)

The _help method in src/benchdeck/tui.py:704-705 unconditionally advertises the P2-1 keys (f, s) even when enable_case_filter=False. The P2-5 keys (space, E) are not added to _help at all. A future implementation agent should make the Help rendering conditional on the flags, or accept the inconsistency as documented. The Phase 2 commit narratives in docs/tui-enhancement-plan.md document this in detail.

Verification

./.venv/bin/python -m pytest -q -p no:cacheprovider \
    tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py
# → 168 passed

./.venv/bin/python -m pytest -q -p no:cacheprovider
# → 408 passed, 2 skipped

No source code outside src/benchdeck/tui.py and tests/test_tui_render.py was touched. No screenshots or golden images were modified.

Files: docs/tui-enhancement-plan.md
Risk: none (documentation only)
Tests: none
Validation: n/a (markdown only; rendered by humans)
Rollback: git revert <sha>

Adds a new '0. Current Status' section near the top with:
  - phase-level status table (Phase 0 / 1 complete; 2 / 3 not started)
  - completed-commits table (14 items, in execution order, with SHAs)
  - deviations list (the 5 extra tests added beyond the plan's
    test count, plus P1-4's addition of a Test Prompt section to
    _detail that the plan's wording assumed existed)
  - branch / merge history note

Also appends an inline '· \u2705 Merged at <SHA>' marker to each of
the 14 completed item headers in section 3 (Phase 0 + Phase 1).
Pending items (Phase 2 + Phase 3) are unchanged.

No code or test changes. The plan file is the design doc and the
single source of truth for the tui/enhancement series; updating it
in lockstep with the work keeps the doc and the code from drifting.
The plan file is the design doc and single source of truth for
the tui/enhancement series. Updating it in lockstep with the work
keeps the doc and the code from drifting. This merge brings the
Phase 0 + Phase 1 status update (new '0. Current Status' section
plus 14 inline 'Merged at <SHA>' markers) onto main so the main
branch's plan doc reflects the merged code state.

Refs: docs/tui-enhancement-plan.md (commit 747268e).
…nable_heartbeat=False)

Files: src/benchdeck/tui.py, tests/test_tui_render.py
Risk: low
Flag: enable_heartbeat=False (Phase 2 default-off feature flag)
Tests: test_overview_shows_last_refresh_age, test_overview_shows_subprocess_elapsed_when_running, test_overview_omits_heartbeat_when_flag_disabled
Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py
Rollback: git revert <sha>; or delete the `if self.enable_heartbeat:` block in _overview (~lines 239-250) plus the three new fields/clearings (enable_heartbeat, _proc_started_at in __init__; _proc_started_at in _launch_run, _poll_subprocess, _cancel_run).

When `enable_heartbeat=True`, the Overview tab header appends up to
two new lines at the bottom:

  - "Last refresh: Ns ago"    — when self.last_load > 0 (i.e. at least
                                 one load_snapshot has happened)
  - "Run alive: yes · Ns elapsed" — when self._proc is not None and
                                     self._proc_started_at > 0

The flag defaults to False so the default `benchdeck tui` invocation
is provably unchanged: the new code path is unreachable until a
caller (test, future CLI flag) opts in. The two guards mirror the
existing P1-2 title-age guard (last_load > 0) so the heartbeat
suppresses itself before the first successful load rather than
rendering as '-1s ago'.

_proc_started_at is recorded in _launch_run only after a successful
Popen, so a failed launch (OSError) does not leave a stale timestamp.
It is reset to 0.0 in _poll_subprocess (on rc != None, both rc==0
and rc!=0 paths) and in _cancel_run, so the "Run alive" line cannot
briefly persist after the subprocess has exited.

The plan named 2 tests. This commit adds a third
(omits_heartbeat_when_flag_disabled) as a default-off contract guard:
it sets both last_load and a live _proc, then asserts neither
heartbeat line appears. This locks down the Phase 2 default-off
guarantee and prevents a future regression where the flag's default
flips to True or the gate is removed.

Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged.
Screenshot tests still pass because the test_screenshots.py assertions
do not depend on the heartbeat lines (the demo-snapshot constructor
in scripts/generate_demo_screens.py does not set enable_heartbeat).
…False)

Files: src/benchdeck/tui.py, tests/test_tui_render.py
Risk: low
Flag: enable_infra_pointer=False (Phase 2 default-off feature flag)
Tests: test_overview_infra_error_pointer_when_present, test_overview_omits_pointer_when_zero
Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py
Rollback: git revert <sha>; or delete the `if self.enable_infra_pointer and infra > 0:` block in _overview plus the new kwarg and field assignment.

When `enable_infra_pointer=True` and `infrastructure_failures > 0`,
the Overview tab header appends a 1-line call-out:

  Infra failures: N (see Detail tab)

The line is a *pointer* to per-case error details, supplementing the
always-on `Infra failures: N` summary that already lives in the base
4-line header (line 3, alongside `Policy blocks: N`). The always-on
summary is unchanged; the new line is a separate, more prominent
call-out that names the Detail tab as where to drill in.

The flag defaults to False so the default `benchdeck tui` invocation
is provably unchanged: the new code path is unreachable at default
flag values. The `infra > 0` guard suppresses the pointer when there
is nothing to point at, so an empty-overview state never carries a
misleading pointer.

Placement: at the bottom of the Overview header, immediately before
the P2-3 heartbeat block. Order is data (infra pointer) → status
(heartbeat) → blank separator → manifest block.

Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged.
Screenshot tests still pass because the demo-snapshot constructor
in scripts/generate_demo_screens.py does not set enable_infra_pointer.
Files: docs/tui-enhancement-plan.md
Risk: low
Tests: n/a (doc-only)
Validation: n/a (doc-only; no code/test changes)
Rollback: git revert <sha>

Updates §0 "Current Status" to reflect the two Phase 2 items that
have landed:
  - Status table: Phase 2 row changes from "Not started" to
    "In progress (2/6)" with P2-3 and P2-6 commit SHAs.
  - Total line: 14 items → 16; 34 → 39 new tests; 0 to ~+113 →
    0 to ~+157 production lines. Test count 144 → 151.
  - Completed-commits table: +2 rows (P2-3, P2-6).
  - Deviations: +1 entry documenting the P2-3 default-off
    contract guard (test_overview_omits_heartbeat_when_flag_disabled).
  - Branch/merge history: notes that `tui/enhancement` is stale
    and Phase 2 work has been landing on `main` directly.
  - Header line count: 637 → 783 (the 637 was already stale).

No structural changes to §1-§9; the plan's recommendations and
open questions are unchanged. All 6 §7 defaults remain accepted.
Files: src/benchdeck/tui.py, tests/test_tui_render.py
Risk: medium
Flag: enable_case_filter=False (Phase 2 default-off feature flag)
Tests: test_case_list_filter_by_family, test_case_list_filter_by_state_blocked, test_case_list_sort_by_family, test_case_list_filter_clears_status_on_escape, test_case_list_selected_clamps_after_filter, test_case_list_default_off_omits_filter_and_sort
Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py
Rollback: git revert <sha>; or delete the `if self.enable_case_filter:` blocks in `_case_list`, `_handle_key`, `_draw`, and `_help` plus the new kwarg, fields, and helpers.

When `enable_case_filter=True`, the Cases tab supports:

  - `f` opens a one-line filter prompt at the footer. Typing builds
    the filter; `Enter` applies; `Esc` cancels. The prompt shows the
    live draft (with a block cursor `█`) and a hint to apply/cancel.
  - `s` cycles the sort among `id` (default, plan.cases insertion
    order), `family` (alphabetical), and `rating` (worst-first, for
    triage).
  - `r` (reload) also resets the filter to "" and the sort to "id".

Filter syntax (no extra 1-5 shortcuts, per Q5 accepted default):
  family:foo       — case["family"] == "foo" (case-insensitive)
  state:BLOCKED    — state == "BLOCKED" (case-insensitive)
  state:JUDGED     — state is not PENDING and not BLOCKED
  state:PENDING    — state == "PENDING"
  rating:Foo       — any token in state starts with "Foo"
  <free text>      — substring match against case["title"] (case-insensitive)

The header is updated to show filtered counts and the active sort:
  Cases: 2 of 3 total · 0 judged · 1 blocked · sort:family

The Cases tab footer gains `f filter` (and `s sort` at width >= 56)
when the flag is on. The footer shows the live filter prompt while
`_filter_mode` is True, taking priority over the normal status message
and tab hint. The `>` marker on the selected row is now a position
in the visible (filtered+sorted) list, and `self.selected` is
re-clamped to the new length after each filter change.

The plan named 5 tests. This commit adds a sixth
(default_off_omits_filter_and_sort) for the Phase 2 default-off
contract guard, matching the P2-3 / P2-6 pattern. The plan's
"three new keybindings" wording is interpreted as `f` + `s` + the
transient use of `Enter`/`Esc` inside the prompt (which are
existing keys gaining a second role when the prompt is active);
no third genuinely new keybinding is added.

Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged.
Screenshot tests still pass because the demo-snapshot constructor
in scripts/generate_demo_screens.py does not set enable_case_filter.
Files: docs/tui-enhancement-plan.md
Risk: low
Tests: n/a (doc-only)
Validation: n/a (doc-only; no code/test changes)
Rollback: git revert <sha>

Updates §0 "Current Status" to reflect the P2-1 commit:
  - Status table: Phase 2 row changes from "In progress (2/6)" to
    "In progress (3/6)" with P2-1 commit SHA. Test count 5 → 11;
    production line count 44 → 257.
  - Total line: 16 items → 17; 39 → 45 new tests; 0 to ~+157 →
    0 to ~+370 production lines. Test count 151 → 157.
  - Completed-commits table: +1 row (P2-1).
  - Deviations: +1 entry documenting the P2-1 default-off contract
    guard (test_case_list_default_off_omits_filter_and_sort) and
    the "three new keybindings" interpretation as 2 new keys + the
    transient use of existing Enter/Esc inside the prompt.
  - Branch/merge history: P2-1 added to the in-progress list;
    main is now 4 ahead of origin/main.

No structural changes to §1-§9; the plan's recommendations and
open questions are unchanged. All 6 §7 defaults remain accepted.
Files: src/benchdeck/tui.py, tests/test_tui_render.py
Risk: medium
Flag: enable_log_tail=False (Phase 2 default-off feature flag)
Tests: test_overview_includes_subprocess_log_tail_when_running, test_overview_omits_log_tail_when_idle, test_overview_default_off_omits_log_tail
Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py
Rollback: git revert <sha>; or delete the `if self.enable_log_tail and ...:` block in _overview plus the new kwarg and field assignment.

When `enable_log_tail=True` and a subprocess is alive (i.e.
`self._proc is not None`) and the captured stderr log file exists,
the Overview tab appends a section showing the tail of the log:

  Subprocess log (last 8 of 47 lines, 12345 bytes):
  line 40: ...
  line 41: ...
  ...
  line 47: ...

The read is bounded by `Path.read_text(encoding="utf-8", errors="replace")`
capped at the last 4 KiB of the file. The displayed tail is the last
8 lines (Q3 accepted default; the plan text says "~16 lines" but
Q3's 8-line default is the user's accepted decision). The section
header reports the captured line count and the file's `st_size`
in bytes.

The flag defaults to False so the default `benchdeck tui` invocation
is provably unchanged: the new code path is unreachable at default
flag values. The three safety guards (`enable_log_tail`, `self._proc is
not None`, `self._stderr_log is not None`, `self._stderr_log.exists()`)
ensure the section is suppressed when:
  - the feature is off (the default);
  - no subprocess is running;
  - the log path was never set (e.g. failed launch);
  - the log file has not been created yet.

The plan named 2 tests. This commit adds a third
(default_off_omits_log_tail) for the Phase 2 default-off contract
guard, matching the P2-3 / P2-6 / P2-1 pattern. The guard also
asserts that the log file is not modified by the read.

Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged.
Screenshot tests still pass because the demo-snapshot constructor
in scripts/generate_demo_screens.py does not set enable_log_tail.
Files: docs/tui-enhancement-plan.md
Risk: low
Tests: n/a (doc-only)
Validation: n/a (doc-only; no code/test changes)
Rollback: git revert <sha>

Updates §0 "Current Status" to reflect the P2-2 commit:
  - Status table: Phase 2 row changes from "In progress (3/6)" to
    "In progress (4/6)" with P2-2 commit SHA. Test count 11 → 14;
    production line count 257 → 300.
  - Total line: 17 items → 18; 45 → 48 new tests; 0 to ~+370 →
    0 to ~+413 production lines. Test count 157 → 160.
  - Completed-commits table: +1 row (P2-2).
  - Deviations: +1 entry documenting the P2-2 default-off contract
    guard (test_overview_default_off_omits_log_tail) and the
    "~16 lines" → 8 lines interpretation per the Q3 accepted default.
  - Branch/merge history: P2-2 added to the in-progress list;
    main is now 5 ahead of origin/main.

No structural changes to §1-§9; the plan's recommendations and
open questions are unchanged. All 6 §7 defaults remain accepted.
…_batch_export=False)

Files: src/benchdeck/tui.py, tests/test_tui_render.py
Risk: medium
Flag: enable_batch_export=False (Phase 2 default-off feature flag)
Tests: test_case_list_space_toggles_mark, test_export_marked_writes_combined_markdown, test_export_marked_empty_writes_nothing, test_case_list_default_off_omits_mark
Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py
Rollback: git revert <sha>; or delete the `space` and `E` branches in _handle_key, the `star` column updates in _case_list, the new kwarg+field, and the _export_marked method.

When `enable_batch_export=True`, the Cases tab supports:

  - `space` toggles the mark of the currently selected case. Marks
    are stored in `self._marked: set[int]`. Each press adds the
    case ID if absent, removes it if present.
  - `E` exports all marked cases to a single combined
    `cases_<ts>.md` file in `run_dir`. The file has a
    `# Exported Cases (N marked)` header and a `## Case N: Title`
    section per case. Marks are cleared after a successful export
    (one-shot semantics).
  - Marked rows display a leading `*` column in the case list.
    The existing `>` selected marker is preserved.
  - `r` (reload) clears the marks (consistent with the P2-1
    filter/sort reset).
  - The existing single-case `e` export is preserved as a
    shortcut for the current case.

The flag defaults to False so the default `benchdeck tui` invocation
is provably unchanged: the new key branches and the `*` column are
all unreachable at default flag values. When the flag is off, the
case-list row prefix is byte-identical to the original (4 chars;
`>` at column 0); when the flag is on, the prefix widens to 5
chars to accommodate the `*` column.

The plan named 3 tests. This commit adds a fourth
(default_off_omits_mark) for the Phase 2 default-off contract
guard, matching the P2-3 / P2-6 / P2-1 / P2-2 pattern. The guard
pre-sets `self._marked = {1}` to verify the default-off path
ignores the field entirely.

Mid-implementation issues (documented in the commit narrative for
reviewers):
  - The first draft of the `*` column always added a column (using
    `" "` for unmarked rows), which shifted the `>` marker from
    column 0 to column 1, breaking the existing
    test_case_list_selected_clamps_after_filter assertion. Fixed
    by gating the `*` column on enable_batch_export: when off, the
    prefix is byte-identical to the original.
  - The first draft of the two export tests did not set
    `tui.tab = 1`, so the `E` keypress was a no-op. Fixed by
    setting the tab in both tests.

Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged.
Screenshot tests still pass because the demo-snapshot constructor
in scripts/generate_demo_screens.py does not set enable_batch_export.
Files: docs/tui-enhancement-plan.md
Risk: low
Tests: n/a (doc-only)
Validation: n/a (doc-only; no code/test changes)
Rollback: git revert <sha>

Updates §0 "Current Status" to reflect the P2-5 commit:
  - Status table: Phase 2 row changes from "In progress (4/6)" to
    "In progress (5/6)" with P2-5 commit SHA. Test count 14 → 18;
    production line count 300 → 436.
  - Total line: 18 items → 19; 48 → 52 new tests; 0 to ~+413 →
    0 to ~+549 production lines. Test count 160 → 164.
  - Completed-commits table: +1 row (P2-5).
  - Deviations: +1 entry documenting the P2-5 default-off contract
    guard (test_case_list_default_off_omits_mark), the "marks
    cleared after export" one-shot semantics, and the two
    mid-implementation fixes (`*` column gating and `tui.tab`
    set in the export tests).
  - Branch/merge history: P2-5 added to the in-progress list;
    main is now 6 ahead of origin/main.

No structural changes to §1-§9; the plan's recommendations and
open questions are unchanged. All 6 §7 defaults remain accepted.
Files: src/benchdeck/tui.py, tests/test_tui_render.py
Risk: medium
Flag: theme="auto" (Phase 2 default-off; string kwarg, not boolean)
Tests: test_init_colors_respects_no_color_env, test_init_colors_light_theme_swaps_pair_6, test_init_colors_dark_theme_unchanged, test_init_colors_default_auto_preserves_current_palette
Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py
Rollback: git revert <sha>; or revert _init_colors to the original @staticmethod and delete the theme kwarg, field, and NO_COLOR check.

The new `theme: str = "auto"` constructor parameter selects the
palette at startup:

  - "auto"  (default): honors the `NO_COLOR` env var (any non-empty
    value, per https://no-color.org/). When NO_COLOR is set, color
    output is disabled entirely (`_init_colors` returns False). When
    NO_COLOR is not set, the current default palette is used.
  - "dark": explicit declaration of the default dark palette
    (pair 6 = BLACK on CYAN). Byte-identical to "auto" with no
    NO_COLOR set. Serves as documentation for callers who want
    to override the auto-detection.
  - "light": swaps pair 6 from BLACK on CYAN to BLACK on WHITE so
    the header/footer band is visible on light-terminal
    backgrounds. All rating colors (pairs 1-5) are unchanged.

NO_COLOR is honored ONLY when `theme="auto"`. An explicit theme
choice ("dark" or "light") overrides the env var, because the user
has deliberately chosen a palette.

The CLI in `cli.py` is outside the editor's ownership list, so the
constructor change is silent — the existing `benchdeck tui`
invocation continues to use the default. A CLI flag would be a
separate approval-gated hand-off to the CLI owner.

The refactor of `_init_colors` from `@staticmethod` to instance
method is required so the method can read `self.theme`. The call
site in `_main` (line 126, `self._init_colors()`) is unchanged
because the new method takes no args other than `self`. External
callers that invoked `BenchDeckTUI._init_colors()` as a static
method would need to call `instance._init_colors()` instead; no
in-repo callers do this.

The plan named 3 tests. This commit adds a fourth
(default_auto_preserves_current_palette) for the Phase 2 default-off
contract guard, matching the P2-3 / P2-6 / P2-1 / P2-2 / P2-5
pattern. The guard uses `monkeypatch.delenv("NO_COLOR",
raising=False)` to ensure the env var is unset, then asserts the
default palette is byte-identical to the pre-P2-4 code.

The plan text says "WHITE on BLACK for dark" but the test name
`test_init_colors_dark_theme_unchanged` and the Q6 default (which
only specifies the "light" swap) imply that "dark" should preserve
the current default. I implemented "dark" as identical to the
pre-P2-4 default (pair 6 = BLACK on CYAN), not as WHITE on BLACK.
This matches the test name and the existing test assertions. If
the user wants "dark" to mean WHITE on BLACK instead, the
conditional in _init_colors can be flipped.

Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged.
Screenshot tests still pass because the demo-snapshot constructor
in scripts/generate_demo_screens.py does not set theme.
Files: docs/tui-enhancement-plan.md
Risk: low
Tests: n/a (doc-only)
Validation: n/a (doc-only; no code/test changes)
Rollback: git revert <sha>

Updates §0 "Current Status" to mark Phase 2 as complete:
  - Status table: Phase 2 row changes from "In progress (5/6)" to
    "Complete (6/6, no merge commit)" with P2-4 commit SHA.
    Test count 18 → 22; production line count 436 → 464.
  - Total line: 19 items → 20; 52 → 56 new tests; 0 to ~+549 →
    0 to ~+577 production lines. Test count 164 → 168.
  - Completed-commits table: +1 row (P2-4).
  - Deviations: +1 entry documenting the P2-4 default-off contract
    guard (test_init_colors_default_auto_preserves_current_palette)
    and the "dark = BLACK on CYAN" interpretation (following the
    test name and Q6 default, not the plan text); +1 entry
    documenting the Phase 2 commit pattern (individual commits
    on main, no merge commit — different from Phase 0/1).
  - Branch/merge history: updated to note Phase 2 is complete
    and the per-item commit pattern.

No structural changes to §1-§9; the plan's recommendations and
open questions are unchanged. All 6 §7 defaults remain accepted.

The Phase 2 commit pattern deviation is a workflow change: Phase 0
and Phase 1 used `--no-ff` merges from a `tui/enhancement` branch
to formalize phase boundaries; Phase 2 was executed as six
individual commits on `main` to keep per-item review discipline
tight. Per-item atomicity and the gated feature-flag contract are
preserved.
Files: CHANGELOG.md
Risk: low
Tests: n/a (doc-only)
Validation: n/a (doc-only; no code/test changes)
Rollback: git revert <sha>

Adds a new "### Added" section under "## Unreleased" with one
bullet per Phase 2 item (P2-3, P2-6, P2-1, P2-2, P2-5, P2-4) and a
"### Test infrastructure" subsection with the 22-new-tests /
168-total summary. Keep-A-Changelog conventional order:
Added → Test infrastructure → Fixed.

The "default `benchdeck tui` invocation is provably unchanged"
wording matches the actual evidence: all 6 Phase 2 features are
gated behind default-off feature flags. The "### Fixed" section
(the publish.yml OIDC fix) is preserved unchanged.

No source code touched. No screenshots touched. No new tests.
No behavior change. Pure release-documentation recording.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da7f2420bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/benchdeck/tui.py
mark = _status_mark_for_state(state)
if mark:
state = f"{mark} {state}"
marker = ">" if index == self.selected else " "

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep actions bound to the rendered case

When enable_case_filter is on and the list is filtered or sorted, this marker makes self.selected an index into the visible list, but the action handlers still dereference the unfiltered plan (_detail() and _export_case() use self._cases()[self.selected], and space toggles self._cases()[self.selected]). For example, sorting by family can highlight case 2 at row 0 while Enter opens/exports case 1, so users can inspect or export the wrong case.

Useful? React with 👍 / 👎.

Comment thread src/benchdeck/tui.py
else "Filter cleared"
)
return
if key == 27: # Esc

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Route Esc to the filter prompt before quitting

In the live TUI loop, _main treats key 27 as quit before _handle_key is called, so this new Esc-cancel branch is unreachable from actual keyboard input when the filter prompt is open. With enable_case_filter=True, pressing Esc on the prompt exits the whole TUI and cancels a running subprocess instead of discarding the draft as the footer says; _main needs to let _handle_key consume Esc while _filter_mode is active.

Useful? React with 👍 / 👎.

Comment thread src/benchdeck/tui.py Outdated
Comment on lines +431 to +433
text = self._stderr_log.read_text(
encoding="utf-8", errors="replace"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Read only the tail of the subprocess log

With enable_log_tail=True, the overview refresh calls Path.read_text() on the entire live log on every draw and only then slices to 4 KiB. A verbose benchmark can grow this file to many MB, so the supposedly bounded tail view can block the TUI and allocate the whole log repeatedly; seek/read from the end instead of loading the full file.

Useful? React with 👍 / 👎.

Files: src/benchdeck/tui.py, tests/test_tui_render.py
Risk: low
Tests: n/a (lint-only change; no behavior change)
Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py
Rollback: git revert <sha>

This commit addresses two CI failures uncovered by PR #8:

1. **ruff SIM108 violation** at `src/benchdeck/tui.py:570` and `:592`.
   The Phase 2 P2-5 commit introduced two `if self.enable_batch_export:
   ... else: star = ""` blocks that ruff's SIM108 ("use a ternary
   expression") flags as over-verbose. Both blocks were rewritten as
   a single ternary: `star = ("*" if case_id in self._marked else " ")
   if self.enable_batch_export else ""`. No behavior change.

2. **`ruff format --check` pre-existing failure** that has been
   silently failing on `main` since the Phase 1 baseline
   (`82b6c5c`). Affected 2 files: `src/benchdeck/tui.py` (7
   multi-line expressions that fit on one line) and
   `tests/test_tui_render.py` (2 list/dict expressions that fit on
   one line, plus one any() generator). All changes are
   line-collapsing that fit within ruff's 88-char line limit; no
   behavior change.

The combined net effect is -18 lines (formatting compresses some
expressions). All 168 TUI + screenshot tests still pass locally
(168/168 in 1.51s).

This commit does NOT address the visual-regression failure (3 of 4
golden screenshots don't match the new code). That fix requires
golden baseline updates, which is gated on explicit user approval
per the dev policy's "golden baselines require explicit approval"
rule. The visual regression is caused by Phase 1+2 content
additions (P1-4 `│` glyph, P1-5 status marks, P2-1 new _help lines)
that are all expected — see the analysis in the next step.
Files: assets/screenshots/golden/cases.png, assets/screenshots/golden/detail.png, assets/screenshots/golden/help.png
Risk: medium (golden-baseline change; future visual regressions in these three screens will be measured against the new baseline)
Tests: n/a (visual baseline regen; no code/test changes)
Validation: local visual-diff check now reports 0.0000% pixel diff for all 4 screenshots (cases, detail, help, overview)
Rollback: git revert <sha>; or restore the previous golden images from a previous commit.

The PR #8 visual-regression job failed because the previous golden
baselines were captured before Phase 1 added the `│ ` block glyph
(P1-4), the `[✓]/[!]/[X]` case status marks (P1-5), and Phase 2
added 2 lines to the `_help` text (P2-1). All three changes are
intentional and gated behind default-off feature flags, so the
default TUI behavior is unchanged (overview.png is byte-identical
between old and new).

The new golden baselines were generated by the same command the
CI runs:

  python scripts/generate demo_screens.py \
    -o assets/screenshots/ci/ \
    -w 80 --font-size 15 \
    --font-path /usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf

Visual deltas vs the old golden (deterministic; the local
environment reproduces the CI output pixel-for-pixel because both
use the same DejaVu Sans Mono font from `fonts-dejavu-core`):

  cases.png:  same size (746x341); 15.57% pixel diff in the
               case-list region. Cause: P1-5 status marks
               `[✓]/[!]/[X]` prefix each case state string.
  detail.png: +100 px tall (521 -> 621). Cause: P1-4 `│ ` glyph
               on Test Prompt and Agent output blocks forces the
               wrapped text to consume ~2 extra lines per block.
  help.png:   +40 px tall (361 -> 401). Cause: P2-1 added 2 lines
               to the `_help` text (`f filter` and `s sort`); each
               line is ~20 px in the DejaVu Sans Mono 15pt font.
  overview.png: byte-identical (746x501). Confirms the
               default-off contract: every Phase 2 Overview
               feature (heartbeat, infra pointer, log tail) is
               off in the default invocation, so the rendered
               Overview tab is unchanged.

This commit was approved by the user (the dev policy requires
explicit approval before golden-baseline changes). The default-off
contract documented in PR #8's body still holds: every Phase 2
feature is gated behind a `BenchDeckTUI(...)` kwarg that defaults
to off, and the new `overview.png` (which would catch any
default-TUI visual change) is byte-identical to the previous
golden.
Files: src/benchdeck/tui.py
Risk: low
Tests: n/a (type-only change; no behavior change)
Validation: python -m mypy src/benchdeck/  # Success: no issues found in 24 source files
Rollback: git revert <sha>

The PR #8 re-run surfaced one mypy error in `_case_list` (line
527, before this fix):

  src/benchdeck/tui.py:527: error: Incompatible types in assignment
  (expression has type "None", variable has type
  "list[tuple[dict[str, Any], str]]")  [assignment]

The Phase 2 P2-1 commit had introduced a `visible: list[...] =
[]` variable plus a `visible = None` placeholder in the default
branch, used as a sentinel to choose the rendering path. The
`None` sentinel made mypy widen the type to `list | None` and
flag every downstream use of `visible` (iteration, `len()`,
etc.).

This commit eliminates the sentinel entirely:
  - `visible` is now always a list, initialized empty and only
    appended to in the `enable_case_filter` branch.
  - The path choice uses `self.enable_case_filter` directly
    (the same flag that built the list) instead of
    `if visible is None:`.

No behavior change. All 168 TUI + screenshot tests still pass
locally. The default-off TUI rendering is unchanged because
`self.enable_case_filter=False` is the default and the
`if not self.enable_case_filter:` branch (line 534) preserves
the original code verbatim.
@MerverliPy MerverliPy merged commit fb5e631 into main Jun 15, 2026
4 checks passed
@MerverliPy MerverliPy deleted the tui/phase-2 branch June 15, 2026 22:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant