Comprehensive BitBox testing infrastructure (Tier 0–4) #314

@TaprootFreak

Status (2026-05-19)

| Phase | Status | Landed via |
| --- | --- | --- |
| 0 — Cubit unit tests | ✅ Done | #319 |
| 1 — Fake-BitBox + cross-layer tests | ✅ Done (deviated) | #320 |
| 2 — Firmware simulator | ✅ Reframed via DFXswiss/bitbox-testkit (firmware side); Dart-side TCP transport not built (likely no longer needed) | .github/workflows/bitbox-simulator.yml, .github/workflows/bitbox-simulator-slash.yml |
| 3 — Maestro hardware flows | 🟡 Open — next major piece of work | |
| 4 — VCR / replay | 🟡 Stretch | |
| Coverage gating (cross-cutting) | 🟡 In progress — measurement baseline shipped, threshold gate pending | #322 (rule), #323 (CI artifact) |

The four-tier model is documented as the living source of truth in docs/testing.md — this issue tracks the rollout; the doc owns the tier definitions, conventions, and onboarding examples.

Motivation

PR #312 ("gate sensitive KYC steps behind BitBox EIP-712 sign") shipped with zero automated tests for the new Cubit logic. The seven manual scenarios in that PR description were the only validation, and most were unchecked at merge time. Every change to BitBox-gated code carried the same regression risk.

Concrete incidents that standardised infrastructure has caught (or would have caught):

  • bitbox_flutter PR #6 ("Bitbox integration") removed sign dedup → broke long EIP-712 sign flows → discovered only when running the 13-page sign on real hardware
  • bitbox_flutter PR #11 ("feat: Create Welcome Page") fixed a BLE frame-desync regression, exactly the class of bug FakeBitboxBehavior.malformed now reproduces deterministically
  • PR #312 ("fix: gate sensitive KYC steps behind BitBox EIP-712 sign") had two reviewer-flagged "must-fix" issues that turned out to be non-bugs only after a deep flow trace; automated tests make that check trivial
  • A Future.timeout race in KycCubit (commit 5c12676) was only discovered post-merge

The pattern: BitBox-related bugs surface late, manual tests don't get re-run consistently, review depth varies. Standardised, tiered test infrastructure breaks the cycle.

Goals

  1. Cubit-level regression coverage for every state transition in KycCubit, KycRegistrationSubmitCubit, Eip712Signer, DFXAuthService
  2. Full sign-flow automation without physical hardware via an SDK-boundary fake
  3. Firmware-level protocol validation in CI using the official BitBox02 simulator
  4. Reproducible hardware tests as executable YAML for the pre-release gate
  5. Reusable infrastructure that future DFX BitBox-integrated apps (e.g. dfx-wallet) can inherit

Non-goals

  • Replacing manual security-critical validation entirely — Tier 3 stays the gold standard for production release readiness
  • Testing BitBox firmware itself — upstream concern; we treat firmware as a black box
  • A hosted, always-on hardware CI farm

The four-tier model

Tier 0  Pure Dart logic (cubits, services, signers)        no device   CI
Tier 1  Cubit / widget + SDK-boundary fake                  no device   CI
Tier 2  Real firmware simulator (USB-style framing)         no device   CI (firmware side, via testkit)
Tier 3  Real BitBox02 hardware                              device      manual / on-demand
Tier 4  BLE capture / replay                                hybrid       stretch

Canonical definitions and "when to use which tier" guidance live in docs/testing.md. Do not re-derive them here.


Phase 0 — Cubit unit tests (DONE)

Landed in #319 (merged 2026-05-15).

Test files shipped:

  • test/screens/kyc/cubits/kyc/kyc_cubit_test.dart
  • test/screens/kyc/steps/registration/cubits/registration_submit/kyc_registration_submit_cubit_test.dart
  • test/packages/wallet/eip712_signer_test.dart
  • test/packages/wallet/eip712_signer_bitbox_test.dart
  • test/packages/service/dfx/dfx_auth_service_test.dart

Stack is flutter_test + bloc_test + mocktail (already in dev_dependencies).

Long-term enforcement is the coverage gate (see Cross-cutting: coverage gating). Once the threshold check lands, the gate will automatically require every BitBox-touching PR to add Tier 0 cases for new logic branches.


Phase 1 — SDK-boundary fake + cross-layer tests (DONE, with deviations)

Landed in #320 (merged 2026-05-15).

Deviation 1 — location. FakeBitboxCredentials lives at lib/packages/hardware_wallet/fake_bitbox_credentials.dart in this repo, not in bitbox_flutter as the original plan proposed. It sits under lib/ (with test/packages/hardware_wallet/fake_bitbox_credentials_test.dart exercising it directly) so the type can be imported cleanly from outside the test/ tree — notably from test/integration/kyc_sign_flow_test.dart. The rationale: BitboxCredentials in bitbox_flutter is a concrete class, so the fake extends it (rather than implementing an interface), and there is no consumer outside realunit-app today that would benefit from having it in the SDK package. If dfx-wallet or another consumer ever adopts the same pattern, promoting the fake to bitbox_flutter is a mechanical refactor.

Deviation 2 — CI shape. The original plan called for a separate integration-test: job in pull-request.yaml driving the iOS Simulator. In practice the cross-layer tests under test/integration/ run headless and are picked up by the regular flutter test --coverage step, so no dedicated job was needed. If/when a future Tier 1 test requires a real integration_test/ binding (full app boot, platform channels), the dedicated job becomes worthwhile.

What shipped:

  • FakeBitboxCredentials with the FakeBitboxBehavior enum: success / cancel / disconnect / timeout / malformed. Each mode mirrors a real-world ceremony outcome — see the table in docs/testing.md.
  • Deterministic test private key, derived address 0x9F5713DEacB8e9CAB6c2D3FaE1AFc2715F8D2D71, shared with code paths that need a non-BitBox credential.
  • Cross-layer test suite at test/integration/kyc_sign_flow_test.dart exercising FakeBitboxCredentials → Eip712Signer.signRegistration → SigningCancelledException.
  • Convention documented in docs/testing.md: test/integration/ for ≥ 2-layer scenarios using the fake; test/ for single-layer with mocked deps.
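
A minimal sketch of how a behaviour-switched fake can produce all five outcomes from one test double. Only the FakeBitboxBehavior mode names come from this issue. SketchFakeSigner, DeviceDisconnectedException, and the outcome chosen for each mode are assumptions made for illustration; the shipped FakeBitboxCredentials extends BitboxCredentials, and its actual per-mode behaviour is the one tabulated in docs/testing.md.

```dart
import 'dart:async';

/// Mode names copied from the issue text; everything else is illustrative.
enum FakeBitboxBehavior { success, cancel, disconnect, timeout, malformed }

/// Hypothetical stand-in for a transport-level failure.
class DeviceDisconnectedException implements Exception {
  @override
  String toString() => 'DeviceDisconnectedException';
}

/// Hypothetical signer; the real fake extends BitboxCredentials instead.
class SketchFakeSigner {
  SketchFakeSigner(this.behavior);

  /// Mutable so a test can switch modes between calls (e.g. for a retry).
  FakeBitboxBehavior behavior;

  /// Returns a hex signature string or fails, depending on [behavior].
  Future<String> sign(String payload) async {
    switch (behavior) {
      case FakeBitboxBehavior.success:
        // 65 bytes (r, s, v) rendered as 130 hex characters.
        return '0x${'ab' * 65}';
      case FakeBitboxBehavior.cancel:
        // Assumption: a user cancel surfaces as an empty signature.
        return '';
      case FakeBitboxBehavior.disconnect:
        throw DeviceDisconnectedException();
      case FakeBitboxBehavior.timeout:
        throw TimeoutException('device did not answer');
      case FakeBitboxBehavior.malformed:
        // Truncated payload, as a desynchronised frame might produce.
        return '0xabcd';
    }
  }
}
```

A test would construct SketchFakeSigner(FakeBitboxBehavior.cancel) and assert that the layer above it turns the empty signature into its cancellation error. Because behavior is mutable, a disconnect-then-success retry can be modelled by switching the mode between two calls.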

Phase 2 — Firmware simulator (REFRAMED)

The original plan called for hosting a Docker BitBox02 simulator on a DFX server and adding a TcpBitboxTransport to bitbox_flutter so the app could speak to it. That plan has been superseded by an upstream-aligned approach centred on DFXswiss/bitbox-testkit (an Apache-2.0 mirror with hardening).

What bitbox-testkit delivers

  • 30 documented BitBox02 firmware quirks (the engineering knowledge base the original plan would have had to rediscover)
  • A bitbox-audit CLI for protocol-level validation
  • Scriptable fakes for firmware.Communication (Go) and PairedBitBox (TS) — i.e. the same idea as FakeBitboxCredentials, one layer down the stack
  • A reusable bitbox-simulator GitHub Action

Current pin: v0.5.0 (commit 45a1253d23b545d801cf5a1f42c040b85e389c7d).

How realunit-app wires it in

Two workflows, both pinned to the same testkit SHA:

| Workflow | Trigger | Purpose |
| --- | --- | --- |
| .github/workflows/bitbox-simulator.yml | pull_request with path filter on lib/packages/hardware_wallet/**, lib/packages/wallet/**, lib/screens/hardware_connect_bitbox/**, mirrored test dirs, and pubspec.yaml | Automatic firmware-side validation on BitBox-touching PRs |
| .github/workflows/bitbox-simulator-slash.yml | /bitbox-simulator PR comment, optional ref=... arg, member-or-above only | On-demand maintainer validation, optionally against a non-default testkit ref |

What this validates today

  • bitbox-api ↔ BitBox02 firmware Noise handshake round-trip
  • ETH-address derivation on chainId=1 AND chainId=137 (the multi-byte-v boundary that historically breaks EIP-155 consumers)
  • ETH personal-message signing at the firmware-doc 1024-byte upper boundary
  • EIP-1559 sign happy path
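
The chainId=137 case is a boundary because of plain arithmetic: under EIP-155 the v value of a legacy signature is chainId * 2 + 35 + recoveryId, which stops fitting in a single byte once the chain ID is large enough. The sketch below works that out; eip155V and byteLength are illustrative helper names, not functions from bitbox_flutter or the testkit.

```dart
/// EIP-155 v value for a legacy transaction signature.
/// [recoveryId] is the 0/1 parity bit returned by the signer.
int eip155V(int chainId, int recoveryId) {
  if (recoveryId != 0 && recoveryId != 1) {
    throw ArgumentError.value(recoveryId, 'recoveryId', 'must be 0 or 1');
  }
  return chainId * 2 + 35 + recoveryId;
}

/// Number of bytes needed to hold [value] as an unsigned big-endian integer.
int byteLength(int value) => value == 0 ? 1 : (value.bitLength + 7) ~/ 8;
```

For chainId=1 this gives v = 37 or 38, which fits in one byte. For chainId=137 it gives v = 309 or 310, which needs two bytes, so a consumer that assumes a single-byte v truncates the value.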

What it explicitly does NOT validate

  • realunit-app's Dart code talking to the BitBox via bitbox_flutter against the simulator. The testkit covers the FIRMWARE side of the protocol; the consumer side still needs Tier 3 (real hardware) for the BLE / USB transport.
  • iOS BLE-specific framing — same gap as the original Phase 2 plan (the simulator speaks U2F-HID over a TCP/USB-style channel, not fragmented BLE).

Sub-tier: in-process SimulatedBitboxPlatform

bitbox_flutter v0.0.7 exposes an in-process SimulatedBitboxPlatform (lib/testing/bitbox_testkit.dart) that stubs the USB platform-interface. It sits between Tier 1 (FakeBitboxCredentials at the credentials boundary) and Tier 2 (firmware simulator over the wire) — useful when a test needs to exercise bitbox_flutter internals without a real transport. Call it Tier 1.5 in informal discussion; not yet a separate row in docs/testing.md.

Status of the original sub-plan (preserved for reference)

The following items are marked superseded unless a concrete future need re-opens them:

  • Docker BitBox02 firmware simulator hosted on a DFX server — not needed: GitHub Actions runs the simulator ephemerally per PR, no persistent infra to maintain.
  • TcpBitboxTransport in bitbox_flutter + --dart-define=BITBOX_HOST=... build flavour — not built. Whether this is still worth doing (to drive the Dart code path end-to-end against the simulator instead of only the upstream firmware side) is an open sub-question; a tracking issue is the right place if/when someone wants to pick that up.

See the follow-up comment on ci:full patterning for how a future e2e-simulator.yaml would label-gate its heavy run.


Phase 3 — Maestro hardware-checkpoint flows (OPEN — next major work)

This phase covers scenarios that need validation on real BitBox hardware before each production release. Maestro YAML turns the test plan into executable documentation.

Existing context: .maestro/handbook/

.maestro/handbook/ already contains 19 YAML flows (01-welcome.yaml through 19-settings-seed-revealed.yaml) used by the handbook screenshot pipeline; they landed in #441 and are deployed via handbook-dev.yaml / handbook-prd.yaml. These flows drive the app to specific UI states and capture screenshots; they are NOT hardware tests.

To avoid filename collision and reader confusion, the hardware-checkpoint flows below should be namespaced under .maestro/kyc/ — not at the .maestro/ root.

Proposed inventory (one YAML per scenario, under .maestro/kyc/)

  1. .maestro/kyc/01_fresh_wallet_full_flow.yaml — blank BitBox, fresh email, full 13-page sign, lands on ident
  2. .maestro/kyc/02_existing_user_low_level.yaml — wallet already linked, level < 30
  3. .maestro/kyc/03_different_dfx_user_merge.yaml — expects KycAccountMergePage
  4. .maestro/kyc/04_returning_level_30_plus.yaml — must STILL sign (security gate)
  5. .maestro/kyc/05_cancel_mid_sign.yaml — expects "Signature was empty" SnackBar
  6. .maestro/kyc/06_bitbox_disconnected.yaml — modal sheet → reconnect → retry
  7. .maestro/kyc/07_tfa_required.yaml — backend returns TFA_REQUIRED

Required tooling

  • .maestro/kyc/README.md documenting BitBox state prerequisites (blank, attached-to-fresh-user, attached-to-other-user-level-25, attached-to-other-user-level-35) and the DFX DEV backend test-user reset procedure.
  • A reset entry point for DFX DEV KYC state (see Q4 in Open questions). Until that exists, scenario 3 (account merge) requires manual pre-staging.
  • tools/generate_test_report.dart to assemble a release dry-run report from the screenshot output.

CI shape

Manual / on-demand only. The natural mechanism is a label-gated workflow following the ci:full pattern documented in the follow-up comment — apply the label → registered runner on the tester's machine picks it up → seven Maestro flows run. Keeps the hardware-dependent path out of every speculative push.

Acceptance

  • All 7 scenarios documented as YAML under .maestro/kyc/
  • README with setup, BitBox state matrix, reset procedure
  • One full release dry-run completed with all scenarios green
  • Tester onboarding ≤ 30 min from zero

Phase 4 — VCR / replay (stretch)

A "tape recorder" for BLE traffic — macOS proxy app sitting between iPhone and BitBox, capturing U2F-HID frames in both directions, replayable through a RecordedBitboxTransport. Use case: capture once on real hardware after a firmware upgrade, replay deterministically in CI thereafter to detect transport-layer regressions.

Stretch only. Tier 2 (firmware logic) and Tier 3 (real device, on demand) together cover most of this need. Document for later; do not start until everything else has landed.
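
To make the replay half of the idea concrete, here is a rough sketch of a tape-driven transport. It is not a design for RecordedBitboxTransport: Frame, Direction, ReplayTransport, and the strict in-order matching rule are all assumptions, and real U2F-HID framing and timing are ignored entirely.

```dart
/// Direction of a captured frame relative to the BitBox.
enum Direction { toDevice, fromDevice }

/// One captured frame (hypothetical tape entry).
class Frame {
  const Frame(this.direction, this.bytes);
  final Direction direction;
  final List<int> bytes;
}

/// Replays a recorded tape: each request must match the next recorded
/// outbound frame, and the recorded device replies are returned in order.
class ReplayTransport {
  ReplayTransport(List<Frame> tape) : _tape = List.of(tape);

  final List<Frame> _tape;
  int _cursor = 0;

  List<List<int>> exchange(List<int> request) {
    if (_cursor >= _tape.length ||
        _tape[_cursor].direction != Direction.toDevice ||
        !_sameBytes(_tape[_cursor].bytes, request)) {
      throw StateError('request diverges from tape at frame $_cursor');
    }
    _cursor++;
    final replies = <List<int>>[];
    // Collect every device frame recorded before the next outbound frame.
    while (_cursor < _tape.length &&
        _tape[_cursor].direction == Direction.fromDevice) {
      replies.add(_tape[_cursor].bytes);
      _cursor++;
    }
    return replies;
  }

  /// True once every recorded frame has been consumed.
  bool get isFinished => _cursor == _tape.length;
}

bool _sameBytes(List<int> a, List<int> b) {
  if (a.length != b.length) return false;
  for (var i = 0; i < a.length; i++) {
    if (a[i] != b[i]) return false;
  }
  return true;
}
```

The strict matching is the point of the exercise: if a code change alters the bytes the app sends, the replay fails at the first diverging frame instead of silently returning stale replies.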


Cross-cutting: coverage gating

The original issue's acceptance criterion ("every new BitBox-touching PR adds tests at the appropriate tier") needs an enforcement mechanism. That mechanism is the coverage gate.

What has landed:

  • #322 ("docs: add features matrix + 100% test-coverage rule"): README rule "new PRs may only merge into develop if test coverage is 100% on the activated surface", with a Coverage scope definition (lib/packages/**, lib/screens/<feature>/cubits|bloc/**) and an inline // coverage:ignore-* escape hatch for genuinely unreachable code.
  • #323 ("ci: measure test coverage and upload as artifact"): flutter test --coverage runs in pull-request.yaml, lcov strips generated files and main.dart, and the filtered coverage/lcov.info is uploaded as a workflow artifact. This is the measurement baseline.

What is still pending (tracked in the README "Coverage infrastructure roadmap" section):

  • An lcov threshold check that fails the build below the configured percentage
  • GitHub branch protection on develop requiring the coverage check
  • Build-time feature-flag mechanism so non-MVP surface can be excluded from the gate

Until those land, the 100% rule is aspirational. Once they do, Tier 0 + Tier 1 enforcement becomes automatic.
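
As a rough illustration of what the pending threshold check could compute, the sketch below sums the LF (lines found) and LH (lines hit) records of an lcov tracefile for in-scope source files and compares the result with a minimum. lineCoverage, meetsThreshold, and the prefix-based scope filter are hypothetical; in particular, a plain prefix match cannot express the lib/screens/<feature>/cubits|bloc/** part of the README scope, which would need glob matching.

```dart
/// Line coverage in percent for an lcov tracefile, limited to source
/// files whose path starts with one of [scopePrefixes].
double lineCoverage(String lcov, List<String> scopePrefixes) {
  var found = 0;
  var hit = 0;
  var inScope = false;
  for (final raw in lcov.split('\n')) {
    final line = raw.trim();
    if (line.startsWith('SF:')) {
      // SF opens a new per-file record; decide once whether it counts.
      final path = line.substring(3);
      inScope = scopePrefixes.any((prefix) => path.startsWith(prefix));
    } else if (inScope && line.startsWith('LF:')) {
      found += int.parse(line.substring(3));
    } else if (inScope && line.startsWith('LH:')) {
      hit += int.parse(line.substring(3));
    }
  }
  // An empty scope counts as fully covered rather than dividing by zero.
  return found == 0 ? 100.0 : hit * 100 / found;
}

/// True when in-scope line coverage reaches [minPercent].
bool meetsThreshold(
  String lcov,
  List<String> scopePrefixes,
  double minPercent,
) =>
    lineCoverage(lcov, scopePrefixes) >= minPercent;
```

A CI step could read coverage/lcov.info, call meetsThreshold with the agreed scope and percentage, and exit non-zero when it returns false. Lines excluded through // coverage:ignore-* are assumed to be already absent from the tracefile, so they need no special handling here.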


The canonical seven scenarios

These map to PR #312's manual test plan and remain the test backbone across tiers.

| # | Scenario | Tier 0 (Cubit) | Tier 1 (FakeBitbox) | Tier 2 (firmware side) | Tier 3 (real HW) |
| --- | --- | --- | --- | --- | --- |
| 1 | Fresh wallet → email → form → 13-page sign → ident | #319 | #320 | indirect (ETH sign primitives) | open (Phase 3) |
| 2 | Existing user, level < 30 | #319 | #320 | indirect | open (Phase 3) |
| 3 | Wallet attached to different DFX user → merge page | #319 | #320 | n/a (backend behaviour) | open (Phase 3) |
| 4 | Returning user level >= 30, MUST still sign | #319 | #320 | indirect | open (Phase 3) |
| 5 | Cancel sign mid-ceremony → empty-sig SnackBar | #319 | #320 (cancel) | n/a (firmware can't model UI cancel) | open (Phase 3) |
| 6 | BitBox not connected → modal, reconnect, retry | n/a (transport) | #320 (disconnect → success) | n/a (firmware-side only) | open (Phase 3) |
| 7 | Backend TFA_REQUIRED → 2FA page | #319 | #320 | n/a (backend behaviour) | open (Phase 3) |

"indirect" in the Tier 2 column means the testkit validates the underlying ETH-sign primitives the scenario depends on, not the scenario as a whole — the scenario itself is a UI / backend interaction that the firmware does not observe.


Open questions

  1. Simulator firmware-version pinning — ✅ ANSWERED. bitbox-testkit is pinned to v0.5.0 at SHA 45a1253d in both workflows; the slash-command variant accepts ref=... for ad-hoc overrides. Quarterly bump cadence is the working convention.
  2. iOS-specific BLE quirks — 🟡 STILL OPEN. The simulator emulates U2F-HID over a USB-style channel, not BLE-fragmented. docs/testing.md documents this gap explicitly. Tier 3 stays the only validation for the BLE transport — that's the entire point of Phase 3.
  3. DFX DEV backend coupling — 🟡 STILL OPEN. Phase 3 hardware flows write real KYC state to the DEV backend; a stable test-user pool and a reset endpoint need coordination with the API team. Today, scenarios 2 / 3 / 4 require manual pre-staging of a test user at the correct KYC level.
  4. Test data hygiene — 🟡 STILL OPEN. Neither tools/reset_test_users.sh nor a DFX DEV admin endpoint for KYC reset exists yet. This blocks Phase 3 acceptance; until a reset mechanism lands, manual cleanup is the only option.
  5. Hosting cost — ✅ MOOT. The testkit runs ephemerally per GitHub Actions run; no persistent simulator host to budget for. The Phase 3 runner runs on the tester's existing machine.

References

Related repos:

  • DFXswiss/bitbox-testkit — current pin v0.5.0 (Apache-2.0 mirror with hardening; firmware-side Phase 2 engine)
  • DFXswiss/bitbox_flutter — current pin in pubspec.yaml is v0.0.5, latest tag is v0.0.7 (2026-05-18). v0.0.7 introduces the in-process SimulatedBitboxPlatform.

Living doc:

  • docs/testing.md — canonical four-tier definitions, examples, conventions

Effort summary

| Phase | Already invested | Remaining estimate |
| --- | --- | --- |
| 0 — Cubit unit tests | ~3 days (#319) | |
| 1 — Fake-BitBox + cross-layer | ~3 days (#320) | promotion to bitbox_flutter is optional, ~1 day if a second consumer materialises |
| 2 — Firmware simulator (testkit) | ~1 week of testkit + workflow wiring | bumping testkit pins quarterly; revisit Dart-side TcpBitboxTransport only if a concrete need surfaces |
| 3 — Maestro hardware flows | | ~1–2 weeks: tooling + 7 scenarios + DEV backend reset coordination |
| 4 — VCR / replay | | stretch, ~2 weeks if pursued |
| Coverage gating | ~1 day (#322 + #323) | ~1–2 days for threshold gate + branch protection once the activated-surface scope is finalised |

Phases 0–2 cover the regression-risk surface that originally motivated the issue. Phase 3 is the next pre-release-readiness step. Phase 4 stays speculative.
