Visual regression: scale from 5-screen pilot to every `*_page.dart` #547

## Context

PR #541 (`feat/visual-regression-pilot`, head `60f0e87`, base `develop`, open and ready-for-review) lands a visual-regression pilot using `alchemist` 0.14.0 + Open Sans baseline font on the `dfx01` self-hosted Mac Studio runner. Pilot covers 5 pages with 8 baseline PNGs. The original `golden-bootstrap.yaml` workflow was removed in commit `60f0e87` ("revert drift probe + remove bootstrap workflow") because the initial baselines were committed and the workflow's purpose was fulfilled.

**PR #552 (`feat/visual-regression-scale`, stacked on #541) already ships the full 57-page scale-out** with 59 baselines committed. All CI checks green. This issue tracks acceptance + merge of #552, not greenfield work.

## Why the scale-out is the right move (verified against actual data)

A prior version of this issue recommended cutting to ~25 hot-path pages over concerns about maintenance tax and flake risk. That recommendation was overruled by the actual pilot data:

- **Pilot→full scale wall-clock**: 57 minutes from the first pilot commit (`2de7eb1`) to the completed 57-page set (`ed19559`). That window includes bootstrap setup, a deliberate drift probe and revert (`b279168` → `60f0e87`, which verified the framework's catch mechanism end-to-end), the scale-out to 57 pages, fixes for 8 deterministic test-wiring issues, and generation of 51 more baselines on dfx01.
- **Marginal cost per page**: ~5 minutes once the framework is hardened. Cutting to 25 leaves 32 pages ungated; coverage holes are worse than maintenance cost when the maintenance is small.
- **Flake reality**: zero flake reports across pilot + scale-out CI runs. The 8 fix-once test failures were `MissingPluginException` for `no_screenshot` (channel stub, permanent fix) and `pumpAndSettle` hangs on `CircularProgressIndicator` pages (switched to `pumpOnce`, permanent fix). Neither is a flake.
- **Re-bake cadence in practice**: based on UI-touching commits in `lib/screens/` since 2026-04, ~5-8 PRs/year would need baseline updates. Each re-bake on dfx01 takes 3-5 minutes.
- **Visual-class bug history**: only 2 commits in repo history (`#103` ellipsis-on-overflow, `#59` scroll-for-small-screens). But golden tests also implicitly catch render-crashes (page must build to produce a PNG) — the pilot fix commit `9236617` itself caught real wiring bugs in `kyc-loading` and `settings_edit_loading`.
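
The maintenance figures in the list above reduce to a little arithmetic. A quick sketch in Python, using only the numbers quoted in this issue (the variable names are illustrative, not taken from the repo):

```python
# Figures quoted in this issue; names are illustrative only.
TOTAL_PAGES = 57
HOT_PATH_PAGES = 25          # the earlier, overruled proposal
REBAKES_PER_YEAR = (5, 8)    # PRs/year expected to need baseline updates
MINUTES_PER_REBAKE = (3, 5)  # re-bake duration on dfx01

# Pages that would have no golden gate under the 25-page cut.
ungated_pages = TOTAL_PAGES - HOT_PATH_PAGES

# Best case and worst case for yearly runner time spent on re-bakes.
low = REBAKES_PER_YEAR[0] * MINUTES_PER_REBAKE[0]
high = REBAKES_PER_YEAR[1] * MINUTES_PER_REBAKE[1]

print(f"Pages left ungated by a 25-page cut: {ungated_pages}")
print(f"Runner time spent re-baking per year: {low}-{high} minutes")
```

This counts runner time only. It does not include the time a reviewer spends inspecting the changed PNGs.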

## Scope

PR #552 already contains the full set. This issue tracks:

- [ ] **Merge #541** (pilot)
- [ ] **Merge #552** (scale to 57 pages, 59 baselines)
- [ ] **Tighten `alchemist` pin**: `^0.14.0` (caret) → `0.14.0` (exact) in `pubspec.yaml`. Pixel determinism is the goal; only one 0.14.x release exists today, but under the caret constraint a future 0.14.1 could be picked up by the next `pub upgrade` and change rendered output, so baselines would drift without any UI change.
- [ ] **Document baseline-update workflow** in `docs/visual-regression-tests.md` (depends on #555 — golden-update bootstrap mechanism after `golden-bootstrap.yaml` removal)
- [ ] **Bottom sheets**: `lib/screens/pin/widgets/enable_biometric_bottom_sheet.dart`, `lib/screens/pin/widgets/forgot_pin_bottom_sheet.dart`, `lib/screens/hardware_connect_bitbox/show_bitbox_reconnect_sheet.dart` — verify these are either in #552's scope or explicitly listed in `.test-coverage-allowlist` (#551 guard)
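
The pin item in the checklist above could be enforced mechanically. For a 0.x package, pub's caret syntax `^0.14.0` means `>=0.14.0 <0.15.0`, so any 0.14.x release satisfies it. Below is a minimal sketch of a check that the constraint is an exact version. It assumes `alchemist` is declared as a simple one-line `name: version` entry, and the function names are hypothetical, not part of the repo:

```python
import re
from typing import Optional

# Matches an indented one-line dependency entry, e.g. "  alchemist: ^0.14.0".
# Entries that use a nested map (git:, path:) are deliberately not matched.
_DEP_LINE = re.compile(
    r"""^[ \t]+alchemist:[ \t]*["']?([^\s"'#]+)""", re.MULTILINE
)

# An exact pin is a bare semantic version with no ^, >=, < or "any".
_EXACT = re.compile(r"^\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.+-]+)?$")


def alchemist_constraint(pubspec_text: str) -> Optional[str]:
    """Return the raw version constraint for alchemist, or None if absent."""
    match = _DEP_LINE.search(pubspec_text)
    return match.group(1) if match else None


def is_exact_pin(constraint: str) -> bool:
    """True only for a bare version such as '0.14.0'."""
    return bool(_EXACT.match(constraint))
```

A CI step could read `pubspec.yaml`, pass its text to `alchemist_constraint`, and fail the job when `is_exact_pin` returns `False`. A regex keeps the sketch dependency-free; a real YAML parser would be more robust.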

## Acceptance criteria

- All 57 `*_page.dart` files have at least one matching `*_golden_test.dart` after #552 merges
- Pages with materially different visual states get one baseline per state (#552 already implements this for KYC subpages, settings_edit_* states, sell-bitbox states)
- All baselines are generated on the dfx01 runner, selected by labels `[self-hosted, macOS, ARM64, m3-ultra, realunit-app]` (verified configured)
- `alchemist: 0.14.0` exact pin in `pubspec.yaml`
- Bottom-sheet trio explicitly accounted for (in scope or allow-listed)
- `docs/visual-regression-tests.md` documents the re-bake workflow once #555 lands
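
The first criterion above can be spot-checked with a short script. This is a sketch under stated assumptions, not the #551 guard: it assumes golden tests live somewhere under `test/` and that the test for `foo_page.dart` is named `foo_page_golden_test.dart`. Both the naming rule and the function name are hypothetical, so adjust them to the convention #552 actually uses.

```python
from pathlib import Path
from typing import List


def pages_missing_goldens(repo: Path) -> List[str]:
    """List *_page.dart files under lib/screens with no matching golden test.

    Assumed convention: lib/screens/**/foo_page.dart is covered by a file
    named foo_page_golden_test.dart anywhere under test/.
    """
    # Collect the file names of every golden test, ignoring directories.
    golden_names = {p.name for p in (repo / "test").rglob("*_golden_test.dart")}

    missing = []
    for page in sorted((repo / "lib" / "screens").rglob("*_page.dart")):
        expected = f"{page.stem}_golden_test.dart"
        if expected not in golden_names:
            missing.append(page.relative_to(repo).as_posix())
    return missing
```

Running `pages_missing_goldens(Path("."))` from the repo root should return an empty list once #552 is merged, if the assumed naming convention holds. Because matching is by file name only, two pages with the same file name in different directories would be treated as covered by a single test.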

## Open decisions

1. Dark-mode baselines? App doesn't ship a dark theme today. Defer.
2. iPad / Android tablet variants? Pilot is iPhone-only (390×844); recommend keeping that.
3. Bootstrap-update mechanism after #555 — separate issue.

## Estimated effort

| Sub-task | Days |
|---|---:|
| Review + merge #541 and #552 | 0.5 |
| Tighten alchemist pin | 0.1 |
| Bottom-sheet coverage check + allow-list entries if needed | 0.25 |
| `docs/visual-regression-tests.md` re-bake workflow section (after #555) | 0.25 |
| **Total** | **~1 engineer-day** of finishing work; ~10 days already invested in #552 |

The original 10-12 engineer-day estimate stands for the work done in #552. This issue's remaining work is the wrap-up.

## ROI reassessment

| | Original V1 estimate | Verified |
|---|---|---|
| Maintenance tax | very high | low (~5-8 re-bakes/year, 3-5 min each on dfx01) |
| Flake rate | high (pixel drift even on dfx01) | zero observed in pilot + scale-out |
| Bug-catching | low (visual-only) | medium-to-high (also catches render-crashes implicitly) |
| Recommendation | cut to ~25 hot-path pages | **stay at full 57** |

## Related

- #541 — pilot PR (5 pages, 8 baselines)
- #552 — scale PR (57 pages, 59 baselines, stacked on #541)
- `docs/visual-regression-tests.md` — to be added by PR #541
- `lib/screens/**/*_page.dart` — source of truth for the page list
- #544 — sister issue for `testWidgets` page tests
- #551 — file-pair guard (needs to tolerate bottom-sheet glob + combined specs)
- #555 — golden-update bootstrap mechanism (blocking the `docs/` workflow section)

