HOLD fix network state detection#84760
Conversation
Introduce a hard stop model for offline detection: OS radio (NetInfo), sustained API failure tracking, and recovery probes with exponential backoff. This replaces the old RecheckConnection middleware with a more reliable system that distinguishes between no-radio and server-unreachable scenarios, and actively probes for recovery before going back online. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
shouldForceOffline is a debug tool that should keep the app offline unconditionally. Previously, toggling it ON would start the recovery probe, which would immediately succeed (server is reachable), fail to clear shouldForceOffline, and restart — causing an infinite loop. Now updateState() only starts the probe for real connectivity triggers (noRadio, sustainedFailures) and stops it when shouldForceOffline is the cause. Also skips the foreground probe in that case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of subscribing to NetInfo from a useLayoutEffect in Expensify.tsx (which required prop-drilling accountID), use a module-level Onyx connection to SESSION in NetworkConnection.ts. This ensures the NetInfo subscription lifecycle is tied to accountID changes directly and avoids unnecessary re-renders. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When NO_RADIO cleared (radio came back), noRadioActive was set to false but updateState() was never called, so Onyx isOffline stayed true. Now updateState() is always called when NO_RADIO clears. If no other hard stops are active the app goes online immediately. If sustained failures are still active, it stays offline and fires an immediate probe. Also removes temporary [NETWORK_DETECTION] log prefixes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Store radio state as a direct positive boolean (`hasRadio = true`) instead of inverted (`noRadioActive = false`), eliminating the double-negation in setNoRadio. Also fixes incorrect boolean values in the architecture docs and removes [NETWORK_DETECTION] log prefixes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the custom RecoveryProbe (~115 lines) with NetInfo's built-in JS fetch polling by setting useNativeReachability: false. This aligns behavior across web and mobile: poll api/Ping every 60s when reachable, every 5s when unreachable. The NetInfo listener now tracks isInternetReachable transitions and calls onReachabilityRestored() when connectivity returns, replacing the probe-based recovery flow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Hey! I see that you made changes to our Form component. Make sure to update the docs in FORMS.md accordingly. Cheers! |
FailureTracker now uses a listener Set instead of importing NetworkState directly, breaking the circular dependency. NetworkState registers as a consumer via onSustainedFailureChange(). When onReachabilityRestored() fires, FailureTracker counters are reset to zero. This prevents stale failure history from immediately re-triggering a hard stop after recovery on flapping networks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After successful re-authentication on READ requests, call reconnect() to re-sync app data — restoring behavior that was lost when the old triggerReconnectionCallbacks was removed. Add FailureTracker unit tests covering threshold logic, counter reset, and listener lifecycle. Bring back the re-auth reconnect test adapted for the new reconnect() function. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug analysis: App stuck in offline/skeleton state despite working internetReported in: Slack thread Symptoms: App permanently stuck showing skeletons and offline indicator. Ping endpoint returns 200 in DevTools, but the app stays offline. Resolved only by clearing cache and restart. Logs showed "a couple of network state changes which led to being stuck in the offline state." How the old system (current The old architecture has a catch-22:
Additionally, How this PR prevents it:
Verdict: This PR breaks the catch-22 that caused this bug. The old system relied on request-driven rechecks (dead when the queue is paused). Ours uses NetInfo's independent polling that runs regardless of app state. |
…tion # Conflicts: # src/libs/Navigation/AppNavigator/AuthScreensInitHandler.tsx
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests now use NetworkState.setHasRadio() instead of directly manipulating ONYXKEYS.NETWORK via Onyx, which ensures they go through the real production code path (pause/unpause + Onyx merge). Source fixes: - Reconnect.ts: named imports instead of namespace import - FailureTracking.ts: Number() cast for jsonCode comparison - useNetwork.ts: wrap ref mutation in useEffect - NetworkTest.tsx: eslint-disable for required namespace import - AuthScreensInitHandlerTest.tsx: remove obsolete onReconnect tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
I will provide test steps on tomorrow to unblock this going into full review. |
|
Provided repro steps, resolving failing tests now. |
|
I'm still working on the failing tests around SQ. cc @mjasikowski |
Two issues caused 7 test failures after replacing onReconnection(flush) with NetworkState pause/unpause: 1. unpause() called flush(false) synchronously, before pending Onyx.set callbacks from save()/update() during offline push() settled. Stale callbacks overwrote in-memory persistedRequests during process(), causing duplicate API calls (exposes known Issues 3c/4 in Expensify#80759). Fix: defer flush to next macrotask via setTimeout(0). 2. .finally() used isQueuePaused instead of isOffline() to resolve isReadyPromise. isQueuePaused is true for both offline AND shouldPauseQueue pauses — the latter should NOT resolve (WRITEs still pending). Fix: restore isOffline() check. Test adjustments: - ReportTest: await queue processing before going offline (sync pause() blocks immediately unlike old async Onyx.set) - APITest: account for deferred flush timing in reauthentication test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
setTimeout(0) races with waitForBatchedUpdates() (also setTimeout(0)), causing MiddlewareTest failures where flush hasn't fired when assertions run. Switch to Promise.resolve().then() which is a microtask — drained by waitForBatchedUpdates() before its macrotask resolves. Same FIFO ordering guarantee for pending Onyx callbacks, no test timing issues. Also add "macrotask" to cspell.json and revert the extra waitForBatchedUpdates() in APITest that was only needed for setTimeout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@abzokhattab addressed everything |
The test relied on a fire-and-forget Onyx.merge (triggered by NetworkState.setHasRadio) propagating before the next .then() in the chain. Add waitForBatchedUpdates to make the timing deterministic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0598e72329
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
rechecking |
|
Working on conflicts & AI comments now. |
Switch from a denylist (skip cancelled/duplicate, record everything else) to an allowlist: only FAILED_TO_FETCH and EXPENSIFY_SERVICE_INTERRUPTED count as connectivity failures. This prevents 429 throttle responses and other non-connectivity HTTP errors from incorrectly tripping the sustained failure hard stop. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@mjasikowski still working on a sync version of isOffline that is not tied to Onyx, will push some updates soon |
Replace the async Onyx-based isOffline with a synchronous in-memory state machine in NetworkState.ts. All consumers now read isOffline() directly instead of subscribing to ONYXKEYS.NETWORK. Key changes: - NetworkState.ts is the single source of truth for offline status - useNetwork uses useSyncExternalStore for React integration - SequentialQueue subscribes to NetworkState for offline→online flush - Background location task uses NetInfo.fetch() for headless JS safety - Delete NetworkConnection.ts (consolidated into NetworkState.ts) - Reconnect.ts owns app foreground + reachability recovery listeners - Fix lastOfflineAt to update on every offline transition - Fix poor connection simulation timer leak Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion # Conflicts: # src/components/Form/FormProvider.tsx
|
@codex review |
Restore the original two separate if-guards so each condition has its own early return and distinct log message. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same treatment as process() — restore two separate if-guards with distinct log messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 036abe49b4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Previously only jsonCode 200 was counted as success. Non-200 server responses (407 auth required, 429 throttled, etc.) were ignored, leaving the failure counter dirty. This could cause false hard stops when auth errors preceded a transient network failure. Any resolved response proves connectivity — record success for all. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add section explaining the three layers of natural backoff that protect against thundering herd on server recovery. Fix references to deleted NetworkConnection.ts — all functionality is in NetworkState.ts now. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents unnecessary re-renders of useNetwork() consumers when multiple hard stop triggers fire while already offline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Early return when failureCount is already 0 — avoids logging, counter writes, and listener notifications on every successful request when there's nothing to reset. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each browser tab detects connectivity independently via module-level state rather than shared Onyx persistence. This is intentional. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Skip triggering flush when offline — the request is already persisted and will be sent when back online. Matches the original behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1fc28fb to
bb5e511
Compare
…tion # Conflicts: # src/libs/Navigation/AppNavigator/AuthScreensInitHandler.tsx
The NetInfo listener was early-returning when shouldForceOffline was true, which meant hasRadio and prevIsInternetReachable stopped tracking reality. When force-offline was turned off, updateState() used stale values. Now the listener always updates hasRadio and reachability tracking. Only the onReachabilityRestored() reconnect side effect is gated behind !shouldForceOffline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Module-level listeners (reachability, foreground) fire even after logout. Guard reconnect() with a session check so we don't issue openApp/reconnectApp requests with no active session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@MelvinBot review |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…int errors The module-level isOffline export shadowed function parameters of the same name in IOU/index.ts, PolicyUtils.ts, and ReportActionsUtils.ts, causing 10 @typescript-eslint/no-shadow errors in CI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mjasikowski @mountiny @TMisiukiewicz @OlimpiaZurek
Explanation of Change
The existing
RecheckConnectionsystem was ad-hoc: a single middleware that re-ranNetInfo.fetch()after every failed request with no coordination, a grab-bag ofisOffline/shouldForceOffline/shouldFailAllRequestsflags scattered across Onyx, and no unified model for what "offline" means. This PR replaces it with a structured network state detection architecture:Central state machine (
NetworkState.ts) — single source of truth for offline status with three hard-stop triggers: no radio, sustained API failures, and force-offline. All offline decisions flow throughupdateState()instead of being scattered across files.Sustained failure detection (
FailureTracker.ts+FailureTrackingmiddleware) — replaces the oldRecheckConnectionmiddleware. Counts consecutive API failures with a dual threshold (3 failures over 10s) to distinguish transient errors from genuine outages. One successful request clears everything.Simplified Onyx network state — the
Networkonyx type is cleaned up:shouldForceOfflinestays (debug tool),shouldFailAllRequestsandshouldSimulatePoorConnectionstay (test tools), but all the derived/redundant flags are removed.isOfflineis now set exclusively byNetworkState.NetInfo as the reachability layer — NetInfo is configured with
useNativeReachability: falseso it uses JS fetch polling (api/Ping) on all platforms instead of trusting native OS reachability. This aligns behavior across web and mobile: polls every 60s when reachable, every 5s when unreachable. TheisInternetReachabletransition listener detects recovery.Reconnect action (
Reconnect.ts) — extracted from the oldNetworkConnectioninto a focused module. Called on recovery and app foreground to sync data viaopenApp/reconnectApp.Synchronous offline reads (
useNetwork+useSyncExternalStore) —useNetworknow readsisOfflinesynchronously fromNetworkStateviauseSyncExternalStoreinstead of subscribing to Onyx. This eliminates the render-lag from async Onyx reads and ensures components see the correct offline state immediately.SequentialQueue offline→online flush —
SequentialQueuesubscribes toNetworkStatedirectly and callsflush()on offline→online transitions. This replaces the oldpause()/unpause()coupling with a cleaner dependency direction (SequentialQueue → NetworkState).NetworkConnection.tsdeleted — all its responsibilities (NetInfo configuration, Onyx subscriptions, poor connection simulation,getDBTimeWithSkew) are consolidated intoNetworkState.ts, eliminating the split between the two modules.Background location task safety — the background location tracking task uses
NetInfo.fetch()directly instead of the in-memoryNetworkState.isOffline(), so it works correctly in Android headless JS contexts where module-level state hasn't been populated.Architecture documentation added at
contributingGuides/NETWORK_STATE_DETECTION.md.Fixed Issues
$ #80759
PROPOSAL:
Tests
1. Basic offline/online detection (Web)
https://dev.new.expensify.com:8082/)[NetworkState] Hard stop: NO_RADIOlog[NetworkState] Internet reachability restoredlog2. Basic offline/online detection (Mobile — iOS & Android)
3. Sustained failure detection (Web)
[FailureTracker]logs[FailureTracker] Failure #1,#2,#3logs in the console[FailureTracker] Sustained failure threshold reachedappears[FailureTracker] Success after N failures — resetting trackerappears4. Force offline toggle (Test Tool Menu)
5. App foreground recovery
[NetworkState] App became active6. Recovery after radio loss + restore
7. Transient failures do NOT trigger offline
[FailureTracker] Failure #Nlogs appear but NOSustained failure threshold reachedOffline tests
8. Optimistic updates while offline
9. Sequential queue behavior
QA Steps
10. End-to-end connectivity on staging
11. Real network interruption on staging
12. High traffic account regression
PR Author Checklist
### Fixed Issuessection aboveTestssectionOffline stepssectionQA stepssectiontoggleReportand notonIconClick)src/languages/*files and using the translation methodSTYLE.md) were followedAvatar, I verified the components usingAvatarare working as expected)StyleUtils.getBackgroundAndBorderStyle(theme.componentBG))npm run compress-svg)Avataris modified, I verified thatAvataris working as expected in all cases)Designlabel and/or tagged@Expensify/designso the design team can review the changes.ScrollViewcomponent to make it scrollable when more elements are added to the page.mainbranch was merged into this PR after a review, I tested again and verified the outcome was still expected according to theTeststeps.Screenshots/Videos
Android: Native
Android: mWeb Chrome
iOS: Native
iOS: mWeb Safari
MacOS: Chrome / Safari