Improve receipt durability and observability when upload fails #92139

@MelvinBot

Problem

In rare cases, receipts captured in NewDot can fail to upload and end up effectively lost, leaving the user with no clear way to recover the image. A recent investigation suggested this can happen when something blocks the write queue between capture and a sign-out (or another lifecycle event), but we currently lack the observability to confirm the exact sequence in production, so we're left guessing at what happened in individual cases.

Two related concerns:

  1. Receipt durability — once a user has taken a photo of a receipt inside the app, we should make it very hard for that image to be lost, even if the upload pipeline fails.
  2. Observability — when this does happen, we don't have enough logs to diagnose the root cause after the fact.

Proposals discussed

Option 1 — Save the receipt to the device's photo library on capture.

  • Pros: Simple, gives the user a clear recovery path (re-submit from gallery).
  • Cons: Requires photo library permission. Doesn't work if the user has denied that permission.

Option 2 — Persist the captured receipt in a dedicated local queue/store, and on next sign-in by the same user, prompt them to resubmit if anything is still pending.

  • Pros: Better UX. No extra OS permission needed. Doesn't rely on the user having access to the photo library.
  • Cons: If we persist images as base64 in Onyx, they'd remain on disk after sign-out in an unencrypted form. We don't currently encrypt anything at rest in Onyx, so this is arguably consistent with our current model, but it increases how much data a technical user could access after sign-out. We'd want to think carefully about scope-down/cleanup behavior here.
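
To make Option 2 more concrete, here is a minimal sketch of a dedicated pending-receipt queue. Every name in it (`PendingReceipt`, `KeyValueStore`, `createPendingReceiptQueue`, the `pendingReceipts` key) is hypothetical and not taken from the App codebase. The in-memory store stands in for whatever persistence layer is chosen, and storing a file URI rather than base64 is an assumption, not a decision made in this issue.

```typescript
// Hypothetical shape of a locally persisted receipt (not an existing App type).
type PendingReceipt = {
    receiptID: string;
    ownerAccountID: number; // the user who captured the receipt
    localURI: string; // assumption: a file on disk rather than base64 in Onyx
    capturedAt: number; // epoch milliseconds
};

// Minimal storage interface so the sketch does not depend on a real storage API.
interface KeyValueStore {
    get(key: string): string | undefined;
    set(key: string, value: string): void;
}

// In-memory stand-in for the real persistence layer.
function createInMemoryStore(): KeyValueStore {
    const data = new Map<string, string>();
    return {
        get: (key) => data.get(key),
        set: (key, value) => {
            data.set(key, value);
        },
    };
}

const PENDING_KEY = 'pendingReceipts'; // hypothetical storage key

function createPendingReceiptQueue(store: KeyValueStore) {
    const readAll = (): PendingReceipt[] => {
        const raw = store.get(PENDING_KEY);
        return raw ? (JSON.parse(raw) as PendingReceipt[]) : [];
    };
    const writeAll = (receipts: PendingReceipt[]): void => {
        store.set(PENDING_KEY, JSON.stringify(receipts));
    };

    return {
        // Call right after capture, before the upload request is enqueued.
        add(receipt: PendingReceipt): void {
            const others = readAll().filter((r) => r.receiptID !== receipt.receiptID);
            writeAll([...others, receipt]);
        },
        // Call once the server has confirmed the upload.
        markUploaded(receiptID: string): void {
            writeAll(readAll().filter((r) => r.receiptID !== receiptID));
        },
        // Call on sign-in: only receipts captured by the same user are returned.
        pendingFor(accountID: number): PendingReceipt[] {
            return readAll().filter((r) => r.ownerAccountID === accountID);
        },
    };
}
```

On sign-in, a non-empty `pendingFor(accountID)` result would be the trigger for the resubmit prompt. Receipts owned by other accounts stay in the store untouched, which is exactly the case the cleanup rules would need to cover.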

Observability work

Independently of which recovery approach we pick, we need to substantially improve logging in the receipt capture → upload → SmartScan flow so that when a customer reports a lost receipt, we can:

  • Confirm whether the image was captured.
  • Confirm whether the upload request was enqueued.
  • Confirm whether/when it was sent, and what the server response was.
  • Confirm queue state at sign-out / app close / app foreground events.

The goal: when a similar report comes in, we should be able to answer "what happened" from logs rather than guessing.
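
As an illustration of what "answering from logs" could look like, the sketch below models the four questions above as structured events and folds them into a per-receipt summary. The event names, fields, and the `summarizeReceipt` helper are all hypothetical. They are not existing log lines in the App, and the real schema would need to be agreed on separately.

```typescript
// Hypothetical structured events for the capture -> upload -> SmartScan flow.
type ReceiptEvent =
    | {type: 'captured'; receiptID: string; at: number}
    | {type: 'uploadEnqueued'; receiptID: string; at: number}
    | {type: 'uploadSent'; receiptID: string; at: number}
    | {type: 'uploadResponse'; receiptID: string; at: number; status: number}
    | {type: 'lifecycle'; event: 'signOut' | 'background' | 'foreground'; at: number; queueLength: number};

type ReceiptSummary = {
    captured: boolean;
    enqueued: boolean;
    sentAt?: number;
    responseStatus?: number;
    // Lifecycle events seen after capture, with the queue length logged at that moment.
    lifecycleAfterCapture: Array<{event: string; queueLength: number}>;
};

function summarizeReceipt(events: ReceiptEvent[], receiptID: string): ReceiptSummary {
    const summary: ReceiptSummary = {captured: false, enqueued: false, lifecycleAfterCapture: []};
    let capturedAt: number | undefined;

    // Sort a copy by timestamp so out-of-order log delivery does not change the result.
    const ordered = [...events].sort((a, b) => a.at - b.at);

    for (const e of ordered) {
        if (e.type === 'lifecycle') {
            // Only lifecycle events after the capture are relevant to this receipt.
            if (capturedAt !== undefined) {
                summary.lifecycleAfterCapture.push({event: e.event, queueLength: e.queueLength});
            }
            continue;
        }
        if (e.receiptID !== receiptID) {
            continue;
        }
        switch (e.type) {
            case 'captured':
                summary.captured = true;
                capturedAt = e.at;
                break;
            case 'uploadEnqueued':
                summary.enqueued = true;
                break;
            case 'uploadSent':
                summary.sentAt = e.at;
                break;
            case 'uploadResponse':
                summary.responseStatus = e.status;
                break;
        }
    }
    return summary;
}
```

A summary with `captured: true`, `enqueued: true`, no `sentAt`, and a `signOut` entry in `lifecycleAfterCapture` would point at the suspected sequence: the request was still queued when the user signed out.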

Related

There was also a side discussion about doing a SmartScan queue (SQ) audit, similar to the prior offline-tracking audit, and documenting the findings in a doc analogous to NETWORK_STATE_DETECTION.md. That work is likely tracked separately, but is referenced here for context.

Suggested next steps

  • Decide between Option 1, Option 2, or a hybrid (e.g., photo library when permission is granted, fall back to local queue otherwise).
  • Define the cleanup/scope-down rules for any locally persisted receipt images.
  • Add structured logs at each step of the capture → upload → scan pipeline, plus lifecycle events (sign-out, background, foreground).
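
The hybrid option and the cleanup rules from the first two steps could be sketched roughly as follows. All names are hypothetical. Treating an undetermined permission the same as a denied one, and using an age-based retention window, are assumptions made for illustration only; neither has been decided in this issue.

```typescript
// Hypothetical permission states and recovery targets.
type PhotoPermission = 'granted' | 'denied' | 'undetermined';
type RecoveryTarget = 'photoLibrary' | 'localQueue';

// Hybrid rule: use the photo library only when permission is already granted,
// otherwise fall back to the local queue (assumption: no extra permission prompt).
function chooseRecoveryTarget(permission: PhotoPermission): RecoveryTarget {
    return permission === 'granted' ? 'photoLibrary' : 'localQueue';
}

type StoredReceipt = {receiptID: string; capturedAt: number};

// One possible cleanup rule: split stored receipts into those still inside a
// retention window and those old enough to delete. The window length is left to the caller.
function partitionByAge(
    receipts: StoredReceipt[],
    now: number,
    maxAgeMs: number,
): {keep: StoredReceipt[]; expired: StoredReceipt[]} {
    const keep: StoredReceipt[] = [];
    const expired: StoredReceipt[] = [];
    for (const receipt of receipts) {
        if (now - receipt.capturedAt > maxAgeMs) {
            expired.push(receipt);
        } else {
            keep.push(receipt);
        }
    }
    return {keep, expired};
}
```

Running `partitionByAge` on app start or sign-in would bound how long unencrypted images stay on disk, which is the main concern raised against Option 2.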

Slack thread: https://expensify.slack.com/archives/C05LX9D6E07/p1779949397375439

Current Issue Owner: @adhorodyski

Project status: CRITICAL
