Skip to content

Stale state leak in KycCubit on retry after timeout #315

@TaprootFreak

Description

@TaprootFreak

Stale state leak in KycCubit on retry after timeout

TL;DR

The _timedOut flag introduced in commit 5c12676 (#312) is reset at the start of every checkKyc() call, which means a late HTTP response from an earlier call can override a fresh response from a subsequent call. This is not just a UI flicker — the final user-visible state can be permanently stale until the cubit is recreated or another checkKyc() is triggered.

Empirically reproduced; fix proposal below.

Problem

After PR #312 merged, KycCubit.checkKyc() wraps the underlying work in a 30s timeout (commit 5c12676):

Future<void> checkKyc() async {
  _timedOut = false;                                  // ← reset every call
  try {
    await _runCheckKyc().timeout(_checkKycTimeout);
  } on TimeoutException {
    _timedOut = true;
    if (!isClosed) emit(const KycFailure('...'));
  }
}

Future<void> _runCheckKyc() async {
  emit(const KycLoading());
  final results = await Future.wait([...]);
  if (isClosed || _timedOut) return;                  // ← guard
  // ... emit state
}

Future.timeout does not cancel the underlying work. The original future continues running in the background even after the timeout fires. The _timedOut flag was intended to stop these late responses from emitting state — and it does, for the same call. But the flag is shared across all checkKyc() invocations and gets reset at the start of every new call.

Timeline of the leak

  1. t=0: User triggers checkKyc()_timedOut = false, Call 1 starts, Future.wait awaited
  2. t=30s: Call 1's outer .timeout() fires TimeoutException_timedOut = true, KycFailure emitted
  3. t=35s: User taps Retry on the failure page → checkKyc() called again → _timedOut = false (reset!), Call 2 starts
  4. t=36s: Call 2's Future.wait completes (backend recovered), guard _timedOut == false passes, fresh state emitted (e.g. KycSuccess(ident))
  5. t=45s: Call 1's late HTTP response arrives, Call 1's body resumes, guard _timedOut == false (was reset by Call 2), Call 1 emits its stale state (e.g. KycSuccess(registration))
  6. User sees the stale state from t=0 instead of the fresh state from t=35. State stays stale until the next checkKyc().

Empirical proof

Reproducer (pure Dart, no Flutter deps):

import 'dart:async';

class FakeCubit {
  String _state = 'initial';
  bool _timedOut = false;
  bool get isClosed => false;

  void emit(String s, {required String from}) {
    _state = s;
    print('  [emit from $from] $s');
  }

  Future<void> checkKyc(String id, Duration delay, String result) async {
    _timedOut = false;
    try {
      await _run(id, delay, result).timeout(const Duration(milliseconds: 100));
    } on TimeoutException {
      _timedOut = true;
      emit('Failure: timeout', from: id);
    }
  }

  Future<void> _run(String id, Duration delay, String result) async {
    emit('Loading', from: id);
    await Future.delayed(delay);
    if (_timedOut) return;
    emit(result, from: id);
  }
}

Future<void> main() async {
  final c = FakeCubit();
  c.checkKyc('CALL1', const Duration(milliseconds: 300), 'Call1-STALE');
  await Future.delayed(const Duration(milliseconds: 150));   // Call 1 timeout fires
  c.checkKyc('CALL2', const Duration(milliseconds: 50), 'Call2-FRESH');
  await Future.delayed(const Duration(milliseconds: 400));   // Wait for everything
  print('Final state: ${c._state}');
}

Output:

  [emit from CALL1] Loading
  [emit from CALL1] Failure: timeout
  [emit from CALL2] Loading
  [emit from CALL2] Call2-FRESH
  [emit from CALL1] Call1-STALE       ← stale leak overrides fresh state
Final state: Call1-STALE               ← BUG: final state is from the timed-out call

Impact

  • Severity: medium. The bug only triggers when the backend takes longer than 30s AND the user retries within the late-response window. In a healthy backend this never fires.
  • Correctness, not just UX. Earlier review notes called this a "UI flicker" — that's wrong. The stale emit is the last state change, so it persists. The user can land permanently on the wrong KYC step until they navigate away (cubit recreated) or trigger another checkKyc().
  • Concrete user-visible failure modes:
    • Call 2 says "go to ident", Call 1 (stale) overrides with "go to registration" → user signs again unnecessarily
    • Call 2 says "you need 2FA", Call 1 (stale) overrides with "completed" → user thinks they're done, isn't
    • Call 2 says "backend error", Call 1 (stale) overrides with success → user proceeds with broken backend state

The KYC + hardware-sign flow is exactly the place where this kind of subtle state corruption hurts most — it's a multi-minute ceremony with high friction to repeat.

Proposed fix: generation counter

Replace the boolean _timedOut flag with a monotonically-incrementing generation counter. Each checkKyc() call captures its own generation, and the _runCheckKyc body bails if the current generation no longer matches:

int _runGeneration = 0;

Future<void> checkKyc() async {
  final generation = ++_runGeneration;
  try {
    await _runCheckKyc(generation).timeout(_checkKycTimeout);
  } on TimeoutException {
    if (generation == _runGeneration && !isClosed) {
      emit(const KycFailure('KYC backend did not respond in time'));
    }
  } catch (e) {
    if (generation == _runGeneration && !isClosed) {
      emit(KycFailure(e.toString()));
    }
  }
}

Future<void> _runCheckKyc(int generation) async {
  if (isClosed || generation != _runGeneration) return;
  emit(const KycLoading());

  final results = await Future.wait([
    _kycService.getKycStatus(),
    _kycService.getUser(),
  ]);

  if (isClosed || generation != _runGeneration) return;
  // ... rest of body, threading `generation` through to recursive calls and `_continueKyc`
}

Future<void> _continueKyc(int generation) async {
  final kycStatus = await _kycService.continueKyc();
  if (isClosed || generation != _runGeneration) return;
  // ... emit
}

Verified with the same reproducer harness, now passing:

[CALL1] checkKyc starting, generation=1
  [emit from CALL1] Loading
  [emit from CALL1] Failure: timeout
[CALL2] checkKyc starting, generation=2
  [emit from CALL2] Loading
  [emit from CALL2] Call2-FRESH
[CALL1] bailed: gen 1 != current 2     ← stale response correctly rejected

Final state: Call2-FRESH                ← FIXED

This is the canonical "cancellation token" pattern for non-cancellable async work — generation numbers are the smallest possible identifier that survives the lack of a real cancellation mechanism in Dart Future.

Alternative considered: CancelableOperation from package:async

More idiomatic Dart but requires the underlying HTTP layer to support cancellation. http.Client does not propagate CancelableOperation's cancel signal. Either we accept the underlying HTTP work keeps running (same as today, just with cleaner cancellation API) or rewrite the HTTP layer (out of scope). Generation counter is the pragmatic choice.

Alternative considered: per-call local state

Pass a _Run object into _runCheckKyc and stash all per-call state there. Cleaner separation but heavier refactor for a single boolean's worth of state. Revisit if more per-call state accumulates.

Acceptance criteria

  • Generation counter replaces _timedOut in lib/screens/kyc/cubits/kyc/kyc_cubit.dart
  • generation threaded through _runCheckKyc (including the recursive call in the email-registration branch) and _continueKyc
  • Comment block updated to explain the cancellation-token semantics
  • Unit test in test/screens/kyc/cubits/kyc/kyc_cubit_test.dart (will be created as part of Comprehensive BitBox testing infrastructure (Tier 0–4) #314 Phase 0) covers the retry-race scenario explicitly:
    • Two checkKyc() calls, first one times out, second one is fast; assert final state matches the second call's result
  • flutter analyze clean, flutter test passes

Files affected

  • lib/screens/kyc/cubits/kyc/kyc_cubit.dart (the only file)

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingfixA fix for a bug

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions