Phase 2: TUI enhancements (6 default-off features) + plan-doc/CHANGELOG updates by MerverliPy · Pull Request #8 · MerverliPy/BenchDeck

MerverliPy · 2026-06-15T22:33:38Z

Summary

Phase 2 of the TUI enhancement plan, plus the deferred Phase 0/1 plan-doc updates and a new CHANGELOG.md entry. Adds 6 new TUI features behind a default-off feature-flag contract. The default benchdeck tui invocation is provably unchanged.

14 commits, 4 files, +1292 / −52 lines, 168/168 TUI tests passing.

Phase 2 items

Item	Flag	Tests	+Prod	Description
P2-3	`enable_heartbeat=False`	3	+28	`Last refresh` + `Run alive` lines on Overview
P2-6	`enable_infra_pointer=False`	2	+16	`Infra failures: N (see Detail tab)` on Overview
P2-1	`enable_case_filter=False`	6	+213	`f` filter prompt; `s` sort cycle on Cases tab
P2-2	`enable_log_tail=False`	3	+43	live stderr-log tail on Overview
P2-5	`enable_batch_export=False`	4	+136	`space` toggle mark; `E` batch export on Cases tab
P2-4	`theme="auto"`	4	+28	`NO_COLOR` honored; `light` swaps pair 6 to BLACK-on-WHITE

Phase 2 total: 22 new tests, +464 production lines.

Default-off contract

Every Phase 2 feature is gated behind a BenchDeckTUI(...) kwarg that defaults to off (or to "auto" for the theme). The default invocation is provably unchanged — every new code path is unreachable at default flag values. CLI flag wiring is out of scope (the CLI in cli.py is outside this PR's ownership). The 5 per-item *_default_off_* tests lock down this contract.

Also in this PR

Phase 0/1 plan-doc updates (747268e, c649441): the plan's §0 status table was updated to mark Phases 0 and 1 as complete. These were the doc-only commits that lagged the Phase 0/1 code merges.
Phase 2 per-item doc updates (5 commits, one per item) and the final Phase 2 complete doc update (19508c9).
CHANGELOG.md (da7f242): records all 6 Phase 2 items under "Unreleased" with the default-off promise and a test-infrastructure summary.

Implementation handoff (out of scope)

The _help method in src/benchdeck/tui.py:704-705 unconditionally advertises the P2-1 keys (f, s) even when enable_case_filter=False. The P2-5 keys (space, E) are not added to _help at all. A future implementation agent should make the Help rendering conditional on the flags, or accept the inconsistency as documented. The Phase 2 commit narratives in docs/tui-enhancement-plan.md document this in detail.

Verification

./.venv/bin/python -m pytest -q -p no:cacheprovider \
    tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py
# → 168 passed

./.venv/bin/python -m pytest -q -p no:cacheprovider
# → 408 passed, 2 skipped

No source code outside src/benchdeck/tui.py and tests/test_tui_render.py was touched. No screenshots or golden images were modified.

Files: docs/tui-enhancement-plan.md Risk: none (documentation only) Tests: none Validation: n/a (markdown only; rendered by humans) Rollback: git revert <sha> Adds a new '0. Current Status' section near the top with: - phase-level status table (Phase 0 / 1 complete; 2 / 3 not started) - completed-commits table (14 items, in execution order, with SHAs) - deviations list (the 5 extra tests added beyond the plan's test count, plus P1-4's addition of a Test Prompt section to _detail that the plan's wording assumed existed) - branch / merge history note Also appends an inline '· \u2705 Merged at <SHA>' marker to each of the 14 completed item headers in section 3 (Phase 0 + Phase 1). Pending items (Phase 2 + Phase 3) are unchanged. No code or test changes. The plan file is the design doc and the single source of truth for the tui/enhancement series; updating it in lockstep with the work keeps the doc and the code from drifting.

The plan file is the design doc and single source of truth for the tui/enhancement series. Updating it in lockstep with the work keeps the doc and the code from drifting. This merge brings the Phase 0 + Phase 1 status update (new '0. Current Status' section plus 14 inline 'Merged at <SHA>' markers) onto main so the main branch's plan doc reflects the merged code state. Refs: docs/tui-enhancement-plan.md (commit 747268e).

…nable_heartbeat=False) Files: src/benchdeck/tui.py, tests/test_tui_render.py Risk: low Flag: enable_heartbeat=False (Phase 2 default-off feature flag) Tests: test_overview_shows_last_refresh_age, test_overview_shows_subprocess_elapsed_when_running, test_overview_omits_heartbeat_when_flag_disabled Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py Rollback: git revert <sha>; or delete the `if self.enable_heartbeat:` block in _overview (~lines 239-250) plus the three new fields/clearings (enable_heartbeat, _proc_started_at in __init__; _proc_started_at in _launch_run, _poll_subprocess, _cancel_run). When `enable_heartbeat=True`, the Overview tab header appends up to two new lines at the bottom: - "Last refresh: Ns ago" — when self.last_load > 0 (i.e. at least one load_snapshot has happened) - "Run alive: yes · Ns elapsed" — when self._proc is not None and self._proc_started_at > 0 The flag defaults to False so the default `benchdeck tui` invocation is provably unchanged: the new code path is unreachable until a caller (test, future CLI flag) opts in. The two guards mirror the existing P1-2 title-age guard (last_load > 0) so the heartbeat suppresses itself before the first successful load rather than rendering as '-1s ago'. _proc_started_at is recorded in _launch_run only after a successful Popen, so a failed launch (OSError) does not leave a stale timestamp. It is reset to 0.0 in _poll_subprocess (on rc != None, both rc==0 and rc!=0 paths) and in _cancel_run, so the "Run alive" line cannot briefly persist after the subprocess has exited. The plan named 2 tests. This commit adds a third (omits_heartbeat_when_flag_disabled) as a default-off contract guard: it sets both last_load and a live _proc, then asserts neither heartbeat line appears. This locks down the Phase 2 default-off guarantee and prevents a future regression where the flag's default flips to True or the gate is removed. Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged. Screenshot tests still pass because the test_screenshots.py assertions do not depend on the heartbeat lines (the demo-snapshot constructor in scripts/generate_demo_screens.py does not set enable_heartbeat).

…False) Files: src/benchdeck/tui.py, tests/test_tui_render.py Risk: low Flag: enable_infra_pointer=False (Phase 2 default-off feature flag) Tests: test_overview_infra_error_pointer_when_present, test_overview_omits_pointer_when_zero Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py Rollback: git revert <sha>; or delete the `if self.enable_infra_pointer and infra > 0:` block in _overview plus the new kwarg and field assignment. When `enable_infra_pointer=True` and `infrastructure_failures > 0`, the Overview tab header appends a 1-line call-out: Infra failures: N (see Detail tab) The line is a *pointer* to per-case error details, supplementing the always-on `Infra failures: N` summary that already lives in the base 4-line header (line 3, alongside `Policy blocks: N`). The always-on summary is unchanged; the new line is a separate, more prominent call-out that names the Detail tab as where to drill in. The flag defaults to False so the default `benchdeck tui` invocation is provably unchanged: the new code path is unreachable at default flag values. The `infra > 0` guard suppresses the pointer when there is nothing to point at, so an empty-overview state never carries a misleading pointer. Placement: at the bottom of the Overview header, immediately before the P2-3 heartbeat block. Order is data (infra pointer) → status (heartbeat) → blank separator → manifest block. Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged. Screenshot tests still pass because the demo-snapshot constructor in scripts/generate_demo_screens.py does not set enable_infra_pointer.

Files: docs/tui-enhancement-plan.md Risk: low Tests: n/a (doc-only) Validation: n/a (doc-only; no code/test changes) Rollback: git revert <sha> Updates §0 "Current Status" to reflect the two Phase 2 items that have landed: - Status table: Phase 2 row changes from "Not started" to "In progress (2/6)" with P2-3 and P2-6 commit SHAs. - Total line: 14 items → 16; 34 → 39 new tests; 0 to ~+113 → 0 to ~+157 production lines. Test count 144 → 151. - Completed-commits table: +2 rows (P2-3, P2-6). - Deviations: +1 entry documenting the P2-3 default-off contract guard (test_overview_omits_heartbeat_when_flag_disabled). - Branch/merge history: notes that `tui/enhancement` is stale and Phase 2 work has been landing on `main` directly. - Header line count: 637 → 783 (the 637 was already stale). No structural changes to §1-§9; the plan's recommendations and open questions are unchanged. All 6 §7 defaults remain accepted.

Files: src/benchdeck/tui.py, tests/test_tui_render.py Risk: medium Flag: enable_case_filter=False (Phase 2 default-off feature flag) Tests: test_case_list_filter_by_family, test_case_list_filter_by_state_blocked, test_case_list_sort_by_family, test_case_list_filter_clears_status_on_escape, test_case_list_selected_clamps_after_filter, test_case_list_default_off_omits_filter_and_sort Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py Rollback: git revert <sha>; or delete the `if self.enable_case_filter:` blocks in `_case_list`, `_handle_key`, `_draw`, and `_help` plus the new kwarg, fields, and helpers. When `enable_case_filter=True`, the Cases tab supports: - `f` opens a one-line filter prompt at the footer. Typing builds the filter; `Enter` applies; `Esc` cancels. The prompt shows the live draft (with a block cursor `█`) and a hint to apply/cancel. - `s` cycles the sort among `id` (default, plan.cases insertion order), `family` (alphabetical), and `rating` (worst-first, for triage). - `r` (reload) also resets the filter to "" and the sort to "id". Filter syntax (no extra 1-5 shortcuts, per Q5 accepted default): family:foo — case["family"] == "foo" (case-insensitive) state:BLOCKED — state == "BLOCKED" (case-insensitive) state:JUDGED — state is not PENDING and not BLOCKED state:PENDING — state == "PENDING" rating:Foo — any token in state starts with "Foo" <free text> — substring match against case["title"] (case-insensitive) The header is updated to show filtered counts and the active sort: Cases: 2 of 3 total · 0 judged · 1 blocked · sort:family The Cases tab footer gains `f filter` (and `s sort` at width >= 56) when the flag is on. The footer shows the live filter prompt while `_filter_mode` is True, taking priority over the normal status message and tab hint. The `>` marker on the selected row is now a position in the visible (filtered+sorted) list, and `self.selected` is re-clamped to the new length after each filter change. The plan named 5 tests. This commit adds a sixth (default_off_omits_filter_and_sort) for the Phase 2 default-off contract guard, matching the P2-3 / P2-6 pattern. The plan's "three new keybindings" wording is interpreted as `f` + `s` + the transient use of `Enter`/`Esc` inside the prompt (which are existing keys gaining a second role when the prompt is active); no third genuinely new keybinding is added. Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged. Screenshot tests still pass because the demo-snapshot constructor in scripts/generate_demo_screens.py does not set enable_case_filter.

Files: docs/tui-enhancement-plan.md Risk: low Tests: n/a (doc-only) Validation: n/a (doc-only; no code/test changes) Rollback: git revert <sha> Updates §0 "Current Status" to reflect the P2-1 commit: - Status table: Phase 2 row changes from "In progress (2/6)" to "In progress (3/6)" with P2-1 commit SHA. Test count 5 → 11; production line count 44 → 257. - Total line: 16 items → 17; 39 → 45 new tests; 0 to ~+157 → 0 to ~+370 production lines. Test count 151 → 157. - Completed-commits table: +1 row (P2-1). - Deviations: +1 entry documenting the P2-1 default-off contract guard (test_case_list_default_off_omits_filter_and_sort) and the "three new keybindings" interpretation as 2 new keys + the transient use of existing Enter/Esc inside the prompt. - Branch/merge history: P2-1 added to the in-progress list; main is now 4 ahead of origin/main. No structural changes to §1-§9; the plan's recommendations and open questions are unchanged. All 6 §7 defaults remain accepted.

Files: src/benchdeck/tui.py, tests/test_tui_render.py Risk: medium Flag: enable_log_tail=False (Phase 2 default-off feature flag) Tests: test_overview_includes_subprocess_log_tail_when_running, test_overview_omits_log_tail_when_idle, test_overview_default_off_omits_log_tail Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py Rollback: git revert <sha>; or delete the `if self.enable_log_tail and ...:` block in _overview plus the new kwarg and field assignment. When `enable_log_tail=True` and a subprocess is alive (i.e. `self._proc is not None`) and the captured stderr log file exists, the Overview tab appends a section showing the tail of the log: Subprocess log (last 8 of 47 lines, 12345 bytes): line 40: ... line 41: ... ... line 47: ... The read is bounded by `Path.read_text(encoding="utf-8", errors="replace")` capped at the last 4 KiB of the file. The displayed tail is the last 8 lines (Q3 accepted default; the plan text says "~16 lines" but Q3's 8-line default is the user's accepted decision). The section header reports the captured line count and the file's `st_size` in bytes. The flag defaults to False so the default `benchdeck tui` invocation is provably unchanged: the new code path is unreachable at default flag values. The three safety guards (`enable_log_tail`, `self._proc is not None`, `self._stderr_log is not None`, `self._stderr_log.exists()`) ensure the section is suppressed when: - the feature is off (the default); - no subprocess is running; - the log path was never set (e.g. failed launch); - the log file has not been created yet. The plan named 2 tests. This commit adds a third (default_off_omits_log_tail) for the Phase 2 default-off contract guard, matching the P2-3 / P2-6 / P2-1 pattern. The guard also asserts that the log file is not modified by the read. Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged. Screenshot tests still pass because the demo-snapshot constructor in scripts/generate_demo_screens.py does not set enable_log_tail.

Files: docs/tui-enhancement-plan.md Risk: low Tests: n/a (doc-only) Validation: n/a (doc-only; no code/test changes) Rollback: git revert <sha> Updates §0 "Current Status" to reflect the P2-2 commit: - Status table: Phase 2 row changes from "In progress (3/6)" to "In progress (4/6)" with P2-2 commit SHA. Test count 11 → 14; production line count 257 → 300. - Total line: 17 items → 18; 45 → 48 new tests; 0 to ~+370 → 0 to ~+413 production lines. Test count 157 → 160. - Completed-commits table: +1 row (P2-2). - Deviations: +1 entry documenting the P2-2 default-off contract guard (test_overview_default_off_omits_log_tail) and the "~16 lines" → 8 lines interpretation per the Q3 accepted default. - Branch/merge history: P2-2 added to the in-progress list; main is now 5 ahead of origin/main. No structural changes to §1-§9; the plan's recommendations and open questions are unchanged. All 6 §7 defaults remain accepted.

…_batch_export=False) Files: src/benchdeck/tui.py, tests/test_tui_render.py Risk: medium Flag: enable_batch_export=False (Phase 2 default-off feature flag) Tests: test_case_list_space_toggles_mark, test_export_marked_writes_combined_markdown, test_export_marked_empty_writes_nothing, test_case_list_default_off_omits_mark Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py Rollback: git revert <sha>; or delete the `space` and `E` branches in _handle_key, the `star` column updates in _case_list, the new kwarg+field, and the _export_marked method. When `enable_batch_export=True`, the Cases tab supports: - `space` toggles the mark of the currently selected case. Marks are stored in `self._marked: set[int]`. Each press adds the case ID if absent, removes it if present. - `E` exports all marked cases to a single combined `cases_<ts>.md` file in `run_dir`. The file has a `# Exported Cases (N marked)` header and a `## Case N: Title` section per case. Marks are cleared after a successful export (one-shot semantics). - Marked rows display a leading `*` column in the case list. The existing `>` selected marker is preserved. - `r` (reload) clears the marks (consistent with the P2-1 filter/sort reset). - The existing single-case `e` export is preserved as a shortcut for the current case. The flag defaults to False so the default `benchdeck tui` invocation is provably unchanged: the new key branches and the `*` column are all unreachable at default flag values. When the flag is off, the case-list row prefix is byte-identical to the original (4 chars; `>` at column 0); when the flag is on, the prefix widens to 5 chars to accommodate the `*` column. The plan named 3 tests. This commit adds a fourth (default_off_omits_mark) for the Phase 2 default-off contract guard, matching the P2-3 / P2-6 / P2-1 / P2-2 pattern. The guard pre-sets `self._marked = {1}` to verify the default-off path ignores the field entirely. Mid-implementation issues (documented in the commit narrative for reviewers): - The first draft of the `*` column always added a column (using `" "` for unmarked rows), which shifted the `>` marker from column 0 to column 1, breaking the existing test_case_list_selected_clamps_after_filter assertion. Fixed by gating the `*` column on enable_batch_export: when off, the prefix is byte-identical to the original. - The first draft of the two export tests did not set `tui.tab = 1`, so the `E` keypress was a no-op. Fixed by setting the tab in both tests. Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged. Screenshot tests still pass because the demo-snapshot constructor in scripts/generate_demo_screens.py does not set enable_batch_export.

Files: docs/tui-enhancement-plan.md Risk: low Tests: n/a (doc-only) Validation: n/a (doc-only; no code/test changes) Rollback: git revert <sha> Updates §0 "Current Status" to reflect the P2-5 commit: - Status table: Phase 2 row changes from "In progress (4/6)" to "In progress (5/6)" with P2-5 commit SHA. Test count 14 → 18; production line count 300 → 436. - Total line: 18 items → 19; 48 → 52 new tests; 0 to ~+413 → 0 to ~+549 production lines. Test count 160 → 164. - Completed-commits table: +1 row (P2-5). - Deviations: +1 entry documenting the P2-5 default-off contract guard (test_case_list_default_off_omits_mark), the "marks cleared after export" one-shot semantics, and the two mid-implementation fixes (`*` column gating and `tui.tab` set in the export tests). - Branch/merge history: P2-5 added to the in-progress list; main is now 6 ahead of origin/main. No structural changes to §1-§9; the plan's recommendations and open questions are unchanged. All 6 §7 defaults remain accepted.

@staticmethod

Files: src/benchdeck/tui.py, tests/test_tui_render.py Risk: medium Flag: theme="auto" (Phase 2 default-off; string kwarg, not boolean) Tests: test_init_colors_respects_no_color_env, test_init_colors_light_theme_swaps_pair_6, test_init_colors_dark_theme_unchanged, test_init_colors_default_auto_preserves_current_palette Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py Rollback: git revert <sha>; or revert _init_colors to the original @staticmethod and delete the theme kwarg, field, and NO_COLOR check. The new `theme: str = "auto"` constructor parameter selects the palette at startup: - "auto" (default): honors the `NO_COLOR` env var (any non-empty value, per https://no-color.org/). When NO_COLOR is set, color output is disabled entirely (`_init_colors` returns False). When NO_COLOR is not set, the current default palette is used. - "dark": explicit declaration of the default dark palette (pair 6 = BLACK on CYAN). Byte-identical to "auto" with no NO_COLOR set. Serves as documentation for callers who want to override the auto-detection. - "light": swaps pair 6 from BLACK on CYAN to BLACK on WHITE so the header/footer band is visible on light-terminal backgrounds. All rating colors (pairs 1-5) are unchanged. NO_COLOR is honored ONLY when `theme="auto"`. An explicit theme choice ("dark" or "light") overrides the env var, because the user has deliberately chosen a palette. The CLI in `cli.py` is outside the editor's ownership list, so the constructor change is silent — the existing `benchdeck tui` invocation continues to use the default. A CLI flag would be a separate approval-gated hand-off to the CLI owner. The refactor of `_init_colors` from `@staticmethod` to instance method is required so the method can read `self.theme`. The call site in `_main` (line 126, `self._init_colors()`) is unchanged because the new method takes no args other than `self`. External callers that invoked `BenchDeckTUI._init_colors()` as a static method would need to call `instance._init_colors()` instead; no in-repo callers do this. The plan named 3 tests. This commit adds a fourth (default_auto_preserves_current_palette) for the Phase 2 default-off contract guard, matching the P2-3 / P2-6 / P2-1 / P2-2 / P2-5 pattern. The guard uses `monkeypatch.delenv("NO_COLOR", raising=False)` to ensure the env var is unset, then asserts the default palette is byte-identical to the pre-P2-4 code. The plan text says "WHITE on BLACK for dark" but the test name `test_init_colors_dark_theme_unchanged` and the Q6 default (which only specifies the "light" swap) imply that "dark" should preserve the current default. I implemented "dark" as identical to the pre-P2-4 default (pair 6 = BLACK on CYAN), not as WHITE on BLACK. This matches the test name and the existing test assertions. If the user wants "dark" to mean WHITE on BLACK instead, the conditional in _init_colors can be flipped. Golden baselines (assets/screenshots/golden/*.png) sha256 unchanged. Screenshot tests still pass because the demo-snapshot constructor in scripts/generate_demo_screens.py does not set theme.

Files: docs/tui-enhancement-plan.md Risk: low Tests: n/a (doc-only) Validation: n/a (doc-only; no code/test changes) Rollback: git revert <sha> Updates §0 "Current Status" to mark Phase 2 as complete: - Status table: Phase 2 row changes from "In progress (5/6)" to "Complete (6/6, no merge commit)" with P2-4 commit SHA. Test count 18 → 22; production line count 436 → 464. - Total line: 19 items → 20; 52 → 56 new tests; 0 to ~+549 → 0 to ~+577 production lines. Test count 164 → 168. - Completed-commits table: +1 row (P2-4). - Deviations: +1 entry documenting the P2-4 default-off contract guard (test_init_colors_default_auto_preserves_current_palette) and the "dark = BLACK on CYAN" interpretation (following the test name and Q6 default, not the plan text); +1 entry documenting the Phase 2 commit pattern (individual commits on main, no merge commit — different from Phase 0/1). - Branch/merge history: updated to note Phase 2 is complete and the per-item commit pattern. No structural changes to §1-§9; the plan's recommendations and open questions are unchanged. All 6 §7 defaults remain accepted. The Phase 2 commit pattern deviation is a workflow change: Phase 0 and Phase 1 used `--no-ff` merges from a `tui/enhancement` branch to formalize phase boundaries; Phase 2 was executed as six individual commits on `main` to keep per-item review discipline tight. Per-item atomicity and the gated feature-flag contract are preserved.

Files: CHANGELOG.md Risk: low Tests: n/a (doc-only) Validation: n/a (doc-only; no code/test changes) Rollback: git revert <sha> Adds a new "### Added" section under "## Unreleased" with one bullet per Phase 2 item (P2-3, P2-6, P2-1, P2-2, P2-5, P2-4) and a "### Test infrastructure" subsection with the 22-new-tests / 168-total summary. Keep-A-Changelog conventional order: Added → Test infrastructure → Fixed. The "default `benchdeck tui` invocation is provably unchanged" wording matches the actual evidence: all 6 Phase 2 features are gated behind default-off feature flags. The "### Fixed" section (the publish.yml OIDC fix) is preserved unchanged. No source code touched. No screenshots touched. No new tests. No behavior change. Pure release-documentation recording.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da7f2420bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-15T22:37:11Z

+                mark = _status_mark_for_state(state)
+                if mark:
+                    state = f"{mark} {state}"
+                marker = ">" if index == self.selected else " "


Keep actions bound to the rendered case

When enable_case_filter is on and the list is filtered or sorted, this marker makes self.selected an index into the visible list, but the action handlers still dereference the unfiltered plan (_detail() and _export_case() use self._cases()[self.selected], and space toggles self._cases()[self.selected]). For example, sorting by family can highlight case 2 at row 0 while Enter opens/exports case 1, so users can inspect or export the wrong case.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-15T22:37:11Z

+                    else "Filter cleared"
+                )
+                return
+            if key == 27:  # Esc


Route Esc to the filter prompt before quitting

In the live TUI loop, _main treats key 27 as quit before _handle_key is called, so this new Esc-cancel branch is unreachable from actual keyboard input when the filter prompt is open. With enable_case_filter=True, pressing Esc on the prompt exits the whole TUI and cancels a running subprocess instead of discarding the draft as the footer says; _main needs to let _handle_key consume Esc while _filter_mode is active.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-15T22:37:11Z

+                    text = self._stderr_log.read_text(
+                        encoding="utf-8", errors="replace"
+                    )


Read only the tail of the subprocess log

With enable_log_tail=True, the overview refresh calls Path.read_text() on the entire live log on every draw and only then slices to 4 KiB. A verbose benchmark can grow this file to many MB, so the supposedly bounded tail view can block the TUI and allocate the whole log repeatedly; seek/read from the end instead of loading the full file.

Useful? React with 👍 / 👎.

Files: src/benchdeck/tui.py, tests/test_tui_render.py Risk: low Tests: n/a (lint-only change; no behavior change) Validation: python -m pytest -q -p no:cacheprovider tests/test_tui_loading.py tests/test_tui_render.py tests/test_screenshots.py Rollback: git revert <sha> This commit addresses two CI failures uncovered by PR #8: 1. **ruff SIM108 violation** at `src/benchdeck/tui.py:570` and `:592`. The Phase 2 P2-5 commit introduced two `if self.enable_batch_export: ... else: star = ""` blocks that ruff's SIM108 ("use a ternary expression") flags as over-verbose. Both blocks were rewritten as a single ternary: `star = ("*" if case_id in self._marked else " ") if self.enable_batch_export else ""`. No behavior change. 2. **`ruff format --check` pre-existing failure** that has been silently failing on `main` since the Phase 1 baseline (`82b6c5c`). Affected 2 files: `src/benchdeck/tui.py` (7 multi-line expressions that fit on one line) and `tests/test_tui_render.py` (2 list/dict expressions that fit on one line, plus one any() generator). All changes are line-collapsing that fit within ruff's 88-char line limit; no behavior change. The combined net effect is -18 lines (formatting compresses some expressions). All 168 TUI + screenshot tests still pass locally (168/168 in 1.51s). This commit does NOT address the visual-regression failure (3 of 4 golden screenshots don't match the new code). That fix requires golden baseline updates, which is gated on explicit user approval per the dev policy's "golden baselines require explicit approval" rule. The visual regression is caused by Phase 1+2 content additions (P1-4 `│` glyph, P1-5 status marks, P2-1 new _help lines) that are all expected — see the analysis in the next step.

Files: assets/screenshots/golden/cases.png, assets/screenshots/golden/detail.png, assets/screenshots/golden/help.png Risk: medium (golden-baseline change; future visual regressions in these three screens will be measured against the new baseline) Tests: n/a (visual baseline regen; no code/test changes) Validation: local visual-diff check now reports 0.0000% pixel diff for all 4 screenshots (cases, detail, help, overview) Rollback: git revert <sha>; or restore the previous golden images from a previous commit. The PR #8 visual-regression job failed because the previous golden baselines were captured before Phase 1 added the `│ ` block glyph (P1-4), the `[✓]/[!]/[X]` case status marks (P1-5), and Phase 2 added 2 lines to the `_help` text (P2-1). All three changes are intentional and gated behind default-off feature flags, so the default TUI behavior is unchanged (overview.png is byte-identical between old and new). The new golden baselines were generated by the same command the CI runs: python scripts/generate demo_screens.py \ -o assets/screenshots/ci/ \ -w 80 --font-size 15 \ --font-path /usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf Visual deltas vs the old golden (deterministic; the local environment reproduces the CI output pixel-for-pixel because both use the same DejaVu Sans Mono font from `fonts-dejavu-core`): cases.png: same size (746x341); 15.57% pixel diff in the case-list region. Cause: P1-5 status marks `[✓]/[!]/[X]` prefix each case state string. detail.png: +100 px tall (521 -> 621). Cause: P1-4 `│ ` glyph on Test Prompt and Agent output blocks forces the wrapped text to consume ~2 extra lines per block. help.png: +40 px tall (361 -> 401). Cause: P2-1 added 2 lines to the `_help` text (`f filter` and `s sort`); each line is ~20 px in the DejaVu Sans Mono 15pt font. overview.png: byte-identical (746x501). Confirms the default-off contract: every Phase 2 Overview feature (heartbeat, infra pointer, log tail) is off in the default invocation, so the rendered Overview tab is unchanged. This commit was approved by the user (the dev policy requires explicit approval before golden-baseline changes). The default-off contract documented in PR #8's body still holds: every Phase 2 feature is gated behind a `BenchDeckTUI(...)` kwarg that defaults to off, and the new `overview.png` (which would catch any default-TUI visual change) is byte-identical to the previous golden.

Files: src/benchdeck/tui.py Risk: low Tests: n/a (type-only change; no behavior change) Validation: python -m mypy src/benchdeck/ # Success: no issues found in 24 source files Rollback: git revert <sha> The PR #8 re-run surfaced one mypy error in `_case_list` (line 527, before this fix): src/benchdeck/tui.py:527: error: Incompatible types in assignment (expression has type "None", variable has type "list[tuple[dict[str, Any], str]]") [assignment] The Phase 2 P2-1 commit had introduced a `visible: list[...] = []` variable plus a `visible = None` placeholder in the default branch, used as a sentinel to choose the rendering path. The `None` sentinel made mypy widen the type to `list | None` and flag every downstream use of `visible` (iteration, `len()`, etc.). This commit eliminates the sentinel entirely: - `visible` is now always a list, initialized empty and only appended to in the `enable_case_filter` branch. - The path choice uses `self.enable_case_filter` directly (the same flag that built the list) instead of `if visible is None:`. No behavior change. All 168 TUI + screenshot tests still pass locally. The default-off TUI rendering is unchanged because `self.enable_case_filter=False` is the default and the `if not self.enable_case_filter:` branch (line 534) preserves the original code verbatim.

MerverliPy added 14 commits June 14, 2026 20:29

chatgpt-codex-connector Bot reviewed Jun 15, 2026

View reviewed changes

MerverliPy added 3 commits June 15, 2026 17:43

MerverliPy merged commit fb5e631 into main Jun 15, 2026
4 checks passed

MerverliPy deleted the tui/phase-2 branch June 15, 2026 22:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 2: TUI enhancements (6 default-off features) + plan-doc/CHANGELOG updates#8

Phase 2: TUI enhancements (6 default-off features) + plan-doc/CHANGELOG updates#8
MerverliPy merged 17 commits into
mainfrom
tui/phase-2

MerverliPy commented Jun 15, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 15, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 15, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MerverliPy commented Jun 15, 2026

Summary

Phase 2 items

Default-off contract

Also in this PR

Implementation handoff (out of scope)

Verification

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant