Fix stuck Concierge thinking indicator when client misses Onyx clear update#85620
marcochavezf wants to merge 38 commits into main from …
Conversation
…dates

When a client misses real-time Pusher events and catches up via GetMissingOnyxMessages, Onyx batches the SET and CLEAR merges into a single notification with the final (empty) value. The hook never sees the intermediate non-empty server label, so optimisticStartTime is never cleared, leaving the indicator stuck permanently.

Fix: track the NVP write count via a direct Onyx.connect subscription. When the indicator NVP is written (even if the final rendered value is empty), increment a version counter. On kickoff, snapshot the counter. The effect compares versions to detect server activity that React batching would otherwise hide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
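The version-counter idea can be sketched as a standalone model. This is illustrative only, not the actual hook code; `IndicatorVersionTracker` and `onIndicatorWrite` are hypothetical names:

```typescript
// Illustrative model of NVP write tracking with a version counter.
type Listener = (version: number) => void;

class IndicatorVersionTracker {
    private version = 0;

    private listeners = new Set<Listener>();

    // Called on every indicator NVP write, even when the written value is
    // empty, so a batched SET+CLEAR still registers as server activity.
    onIndicatorWrite(): void {
        this.version += 1;
        this.listeners.forEach((listener) => listener(this.version));
    }

    getVersion(): number {
        return this.version;
    }

    subscribe(listener: Listener): () => void {
        this.listeners.add(listener);
        return () => {
            this.listeners.delete(listener);
        };
    }
}

const tracker = new IndicatorVersionTracker();

// On kickoff, snapshot the counter; any later bump proves the server wrote
// the NVP, even if React only ever renders the final (empty) value.
const snapshot = tracker.getVersion();
tracker.onIndicatorWrite(); // batched SET
tracker.onIndicatorWrite(); // batched CLEAR (final rendered value is empty)
const serverActivitySeen = tracker.getVersion() > snapshot;
```

Comparing versions rather than rendered values is what lets the hook see through React's batching.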
@abdulrahuman5196 Please copy/paste the Reviewer Checklist from here into a new comment on this PR and complete it. If you have the K2 extension, you can simply click: [this button]
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1dffb249ec
PR doesn’t have any new product considerations as a code clean-up PR. Unassigning and unsubscribing myself.
Move NVP version tracking from the hook into a dedicated NVPIndicatorVersionTracker lib to satisfy the ESLint rule that restricts Onyx.connect() to /src/libs/**. This also fixes the TypeScript callback type mismatch. Replace "backgrounded" with "in the background" for spellcheck. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…early return

- Change connectionMap type from Map<string, number> to Map<string, Connection>
- Import Connection type from react-native-onyx
- Use early return pattern in NVP version listener callback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report

❌ Looks like you've decreased code coverage for some files. Please write tests to increase, or at least maintain, the existing level of code coverage. See our documentation here for how to interpret this table.
…sends The version-snapshot approach failed when two messages were sent rapidly: msg1's server response bumped the version past msg2's snapshot, clearing the thinking indicator prematurely. Replaced with a pending-request counter that increments on each kickoff and decrements when a full server roundtrip (SET+CLEAR = 2 version bumps) completes, so the indicator persists until all outstanding requests are processed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
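A minimal model of the pending-request counter described above. All names and the roundtrip bookkeeping are illustrative assumptions, not the PR's actual code:

```typescript
// Illustrative model of the pending-request counter. Each kickoff
// increments; a full server roundtrip (SET + CLEAR = 2 version bumps)
// decrements, so a rapid second send cannot be cleared by the first reply.
const BUMPS_PER_ROUNDTRIP = 2;

let pendingRequests = 0;
let bumpsSinceLastResolve = 0;

function kickoff(): void {
    pendingRequests += 1;
}

function onVersionBump(): void {
    bumpsSinceLastResolve += 1;
    if (bumpsSinceLastResolve < BUMPS_PER_ROUNDTRIP) {
        return;
    }
    bumpsSinceLastResolve = 0;
    pendingRequests = Math.max(0, pendingRequests - 1);
}

function shouldShowIndicator(): boolean {
    return pendingRequests > 0;
}

// Two rapid sends: msg1's roundtrip completes, msg2 is still outstanding,
// so the indicator must stay visible.
kickoff(); // msg1
kickoff(); // msg2
onVersionBump(); // msg1 SET
onVersionBump(); // msg1 CLEAR -> pendingRequests drops from 2 to 1
```

Under a version snapshot, msg1's two bumps would have pushed the counter past msg2's snapshot and cleared the indicator early; the per-request counter avoids that.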
…stuck indicator fix

The thinking indicator is ephemeral data stored in a durable system (Onyx NVPs). When the client misses the server CLEAR update (Onyx batching coalesces SET+CLEAR, Pusher reconnect delivers stale state, or the CLEAR is dropped), the indicator gets stuck permanently.

This implements the "lease pattern" from distributed systems: every state assertion is time-bounded. Without renewal or explicit clear, the indicator auto-expires.

Changes:
- Add 60s safety timeout that auto-clears the indicator when no CLEAR arrives
- Reset timer on each new server label (lease renewal)
- Clear indicator on network reconnect (like typing indicators)
- Remove NVPIndicatorVersionTracker entirely (TTL handles all its failure modes)
- Add clearAgentZeroProcessingIndicator action to Report actions
- Add 6 new test cases covering TTL, reconnect, and version tracker removal

The 60s timeout is appropriate because AI processing takes 30-45s on dev, and XMPP uses 30s for composing→paused transitions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
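The lease/TTL mechanism can be sketched like this. Function names are illustrative, and the 60s value is taken from the commit message; the real hook lives inside React effects rather than module state:

```typescript
// Illustrative lease/TTL model for an ephemeral indicator: every server
// label grants a time-bounded lease; without renewal or an explicit
// CLEAR, the indicator auto-expires.
const INDICATOR_TTL_MS = 60_000; // assumed 60s safety timeout

let expiryTimer: ReturnType<typeof setTimeout> | undefined;
let indicatorVisible = false;

function onServerLabel(): void {
    indicatorVisible = true;
    // Lease renewal: each new label restarts the expiry clock.
    if (expiryTimer !== undefined) {
        clearTimeout(expiryTimer);
    }
    expiryTimer = setTimeout(() => {
        // Lease expired with no CLEAR: auto-clear the stuck indicator.
        indicatorVisible = false;
        expiryTimer = undefined;
    }, INDICATOR_TTL_MS);
}

function onServerClear(): void {
    indicatorVisible = false;
    if (expiryTimer !== undefined) {
        clearTimeout(expiryTimer);
        expiryTimer = undefined;
    }
}

// Normal path: label shows the indicator, CLEAR hides it and cancels the lease.
onServerLabel();
onServerClear();
```

If the CLEAR never arrives, the timeout fires instead, which is exactly the failure mode the durable NVP could not recover from on its own.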
The spellcheck CI job flags XMPP (a messaging protocol referenced in code comments) as an unknown word. Add it to the allowed list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nects When the 60s TTL safety timer expires or Pusher reconnects, the indicator is cleared but the Concierge response that triggered the CLEAR may also have been dropped. Call getNewerActions to pull missed report actions via HTTP so the user sees the response without a manual refresh. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ator Resolve merge conflicts after main refactored useAgentZeroStatusIndicator hook into AgentZeroStatusContext. Keep both: the context (from main) for standard status display, and the hook with safety timeout/TTL (from this branch) for the stuck indicator fix. Restore ConciergeReasoningStore.ts and add mocks for the combined test file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Long Concierge responses can take up to 2 minutes. A hard 60s TTL incorrectly clears a legitimate in-progress response. Instead, use a progressive retry schedule:

- 60s: Call getNewerActions, keep indicator showing
- 90s: Retry getNewerActions, keep indicator showing
- 120s: Final retry; if still no new actions AND online, clear indicator

If the Concierge reply arrives at any point (via Pusher or getNewerActions response), the normal Onyx update clears the indicator automatically. If offline at final retry, skip the clear; the response may arrive when reconnected (onReconnect handler covers that case).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The hook now uses progressive retry intervals (60s → 90s → 120s) instead of a single 60s hard-clear timeout. Update the two failing tests to verify that intermediate retries (60s, 90s) only poll via getNewerActions while the indicator stays visible, and only the final retry (120s) clears the indicator. Also add getNewerActions to the Report mock. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of three separate setTimeout calls at 60s/90s/120s, poll getNewerActions every 30s via setInterval while the thinking indicator is visible. This is simpler and more robust against WebSocket drops. The 120s safety timeout is kept as a hard clear fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
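A rough standalone sketch of the interval-polling recovery loop; `startRecovery`, `fetchNewerActions`, and `clearIndicator` are hypothetical stand-ins for the hook's real getNewerActions/clear logic:

```typescript
// Illustrative model: poll over HTTP every 30s while the indicator is
// visible, with a 120s hard-clear as the last-resort safety net.
const POLL_INTERVAL_MS = 30_000;
const HARD_CLEAR_MS = 120_000;

function startRecovery(
    fetchNewerActions: () => void,
    clearIndicator: () => void,
): () => void {
    // Active recovery: a dropped WebSocket can no longer strand the
    // "thinking" state, because we keep fetching over HTTP.
    const pollTimer = setInterval(fetchNewerActions, POLL_INTERVAL_MS);

    // Safety net: if nothing resolves the indicator within 120s,
    // stop polling and hard-clear it.
    const hardClearTimer = setTimeout(() => {
        clearInterval(pollTimer);
        clearIndicator();
    }, HARD_CLEAR_MS);

    // The returned cleanup runs when the indicator resolves normally
    // (the server CLEAR arrived via Onyx).
    return () => {
        clearInterval(pollTimer);
        clearTimeout(hardClearTimer);
    };
}

let polls = 0;
let hardCleared = false;
const stop = startRecovery(
    () => {
        polls += 1;
    },
    () => {
        hardCleared = true;
    },
);
stop(); // normal resolution: neither timer survives, no hard clear fires
```

A single interval plus one fallback timeout is easier to reason about and to cancel than three staggered setTimeout calls.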
The onReconnect handler was clearing the indicator immediately, which caused the thinking state to disappear while offline. Now it fetches newer actions and restarts polling without clearing — the indicator stays visible until the actual server CLEAR arrives via Onyx. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ear on reconnect The original design intentionally hides 'Concierge is thinking...' while offline (!isOffline in isProcessing). This is correct UX — showing a thinking indicator when the user can't receive a response is misleading. On reconnect, the indicator reappears naturally if the server NVP still has processing state. getNewerActions fetches any missed responses, and polling restarts as safety net. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR Review

Overall this is a solid architectural improvement: extracting the indicator logic into a dedicated hook with polling recovery is the right approach for handling missed Pusher events. A few issues to address:

Issues

1. subscribeToReasoningStore and getReasoningSnapshot are recreated on every render, so useSyncExternalStore resubscribes each time. Since the subscription only depends on reportID, memoize both:

```ts
// Module-level subscribe that passes reportID to listeners
const subscribeToReasoningStore = useCallback((onStoreChange: () => void) => {
    return ConciergeReasoningStore.subscribe((updatedReportID) => {
        if (updatedReportID !== reportID) {
            return;
        }
        onStoreChange();
    });
}, [reportID]);
const getReasoningSnapshot = useCallback(
    () => ConciergeReasoningStore.getReasoningHistory(reportID),
    [reportID]
);
```

Or better: make …

2. The main …

3. Unnecessary …

4. These functions are defined inline and capture refs + state. The eslint-disable comment says "stable via React Compiler," but this is fragile: if React Compiler isn't applied or the function identity changes, the effect will re-run and restart polling unexpectedly. At minimum, document which React Compiler version/config guarantees this, or wrap in useCallback.

Minor / Non-blocking

Next Steps: Reply with …
- Resolve merge conflict in AgentZeroStatusContext.tsx: adopt main chatType selector instead of prop, keep hook-based architecture - Remove unnecessary XMPP entry from cspell.json (situchan review) - Remove useMemo wrapper in AgentZeroStatusGate (React Compiler handles it) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reviewer Checklist
Screenshots/Videos

- Android: HybridApp: android.mov
- Android: mWeb Chrome: mchrome.mov
- iOS: HybridApp: ios.mov
- iOS: mWeb Safari: msafari.mov
- MacOS: Chrome / Safari: web.mov
TypeScript failing
PR #87077 refactored AddNewCardPage to useConfirmModal but left a dangling isModalVisible reference (removed the useState but not the usage). This breaks typecheck and ESLint on any branch that merges main after that commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated!
Concierge thinking indicator is still stuck for this scenario:
Screen.Recording.2026-04-09.at.3.32.54.AM.mov
```diff
     break;
 default:
-    CurrentStep = <SelectCountryStep disableAutoFocus={isModalVisible} />;
+    CurrentStep = <SelectCountryStep />;
```
Undo this change and merge main. They fixed another way in #87431
I compared the original implementation with this branch.

1. Missing: Clear optimistic state on network reconnect

Original (main):

```ts
if (!isOffline && optimisticStartTime) {
    setOptimisticStartTime(null);
}
```

New (…): this clearing is gone. This may be intentional (keep showing "thinking" until we confirm the response arrived rather than briefly hiding it), but it's a behavioral change from the original.

2. …
Concierge thinking indicator not showing at all on mobile on this branch:
Screen.Recording.2026-04-09.at.12.57.14.PM.mov
Screen.Recording.2026-04-09.at.12.54.46.PM.mov
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ix-stuck-thinking-indicator
When going offline→online, the optimistic pendingOptimisticRequests state was persisting from before the disconnect, causing the indicator to reappear even if Concierge had already responded. This matches the original behavior in AgentZeroStatusContext which cleared optimisticStartTime on reconnect. Also fixes misleading comment about counter "decrementing per reply" — the counter resets to 0 when any Concierge reply is detected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key changes to match the original AgentZeroStatusGate behavior:

- Use agentZeroProcessingIndicatorSelector for NVP subscription (fewer re-renders)
- Use displayedLabelRef to avoid displayedLabel in the effect dependency array (fewer effect cycles)
- Remove isAgentZeroChat guard from kickoffWaitingIndicator (provider already handles this)
- Remove isAgentZeroChat from isProcessing (always true in context, simplifies logic)
- Move MIN_DISPLAY_TIME/DEBOUNCE_DELAY to module scope (matching main's pattern)
- Update hardClearIndicator to also reset displayedLabelRef
- Restructure label-sync effect to match main's early-return-via-ref pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comments addressed, I triggered an ad-hoc build to test the issue on mobile
🚧 @marcochavezf has triggered a test Expensify/App build. You can view the workflow run here. |
🧪🧪 Use the links below to test this adhoc build on Android, iOS, and Web. Happy testing! 🧪🧪
PR Review

All CI checks pass.

Issues

1. subscribeToReasoningStore and getReasoningSnapshot are still recreated every render, resubscribing useSyncExternalStore. Fix: wrap both in useCallback keyed on [reportID].

2. Stuck indicator after offline→online reconnect

When the user goes offline while the indicator is showing, the server may process the request and clear the NVP while Pusher is disconnected. On reconnect, Onyx sync can deliver the stale NVP (set but not cleared; the clear was a Pusher event that was missed).

The original code (main):

```ts
if (!isOffline && optimisticStartTime) {
    setOptimisticStartTime(null);
}
```

Suggested fix: in onReconnect:

```ts
onReconnect: () => {
    const wasOptimistic = pendingOptimisticRequests > 0;
    if (wasOptimistic) {
        setPendingOptimisticRequests(0);
        clearAgentZeroProcessingIndicator(reportID);
    }
    getNewerActions(reportID, newestReportActionRef.current?.reportActionID);
    if (serverLabel) {
        startPolling();
    }
}
```

3. The comment at lines 89-91 says "each Concierge reply decrements it," but every code path resets the counter to 0 rather than decrementing.

Next Steps: Reply with …
Is this correct? Indicator disappears when offline:
Screen.Recording.2026-04-16.at.1.09.00.AM.mov
Also, as seen in this video, the indicator is not showing in the Concierge chat, while it is showing in #admins
- situchan regression: in a Concierge DM the newest existing action is usually Concierge's previous reply, so the naive Concierge-reply detection effect cleared the indicator the moment optimistic state activated. Capture a baseline reportActionID when the indicator becomes active and only clear on a different Concierge actionID. The baseline tracks raw active state (not UI-visible isProcessing) so it survives offline to online cycles.
- useSyncExternalStore stability: memoize subscribeToReasoningStore and getReasoningSnapshot on [reportID] so the reasoning store isn't resubscribed every render.
- Reconnect stuck indicator: when onReconnect fires with optimistic state, clear the NVP locally so a stale (server-cleared but not synced) indicator doesn't stick until the 120s safety timeout. Only restart polling if a server label is still active.
- Clarify comment on pendingOptimisticRequests: it resets to 0 on any resolve signal, not per-request decrement.

Tests: 27 pass (24 existing + 3 new covering the DM regression and the reconnect NVP clear).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
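The baseline capture described in the first bullet can be modeled roughly like this; function names are illustrative, not the hook's actual API:

```typescript
// Illustrative model of baseline-based reply detection: capture the
// newest Concierge reportActionID when optimistic state activates, and
// only treat a *different* Concierge actionID as the new reply.
let baselineActionID: string | undefined;

function onIndicatorActivated(newestConciergeActionID: string | undefined): void {
    // Snapshot whatever Concierge action already exists at kickoff time.
    baselineActionID = newestConciergeActionID;
}

function isNewConciergeReply(actionID: string): boolean {
    // Only an actionID different from the baseline counts as the response.
    return actionID !== baselineActionID;
}

// In a Concierge DM, the newest existing action is usually Concierge's
// previous reply; without the baseline it would clear the indicator
// the instant optimistic state activated.
onIndicatorActivated('previous-reply-123');
const clearedByOldReply = isNewConciergeReply('previous-reply-123');
const clearedByNewReply = isNewConciergeReply('new-reply-456');
```

Keying the baseline to raw active state (rather than UI-visible processing state) is what lets it survive offline/online cycles, as the commit notes.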
Hmm yeah, I'm going to update the test steps, because on main the indicator is expected to be removed when the app is offline, and there's no easy way to test this by dropping only the websocket connection.
Thanks for catching that! I think this regressed when I updated the PR. Now it's fixed:
Thanks for the review @situchan, now it's ready again
Hmm some checks are failing, fixing them |
- React Compiler: wrap clearPolling, startPolling, hardClearIndicator,
and kickoffWaitingIndicator in useCallback with proper deps; drop both
react-hooks/exhaustive-deps disables. Only exhaustive-deps and
rules-of-hooks suppressions bail the compiler (per react-compiler's
DEFAULT_ESLINT_SUPPRESSIONS) -- set-state-in-effect disables are kept
with narrow, reasoned comments.
- Typecheck: switch test's Onyx.set(NETWORK, {...}) to Onyx.merge
(matches pattern used across the codebase; partial merge avoids the
NullishObjectDeep<Network> strict-literal check).
- Spellcheck: rename "situchan's" -> "the reporter's" in a test comment.
- ESLint: add exhaustive-deps entry for clearPolling on the Concierge
reply-detection effect now that clearPolling is a stable useCallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ix-stuck-thinking-indicator
Main's NetworkState refactor (7f67d35, "Make isOffline synchronous via NetworkState module-level state") removed isOffline from the ONYXKEYS.NETWORK Onyx value, so Onyx.merge(NETWORK, {isOffline}) now fails typecheck. Switch test to setForceOffline() from @libs/NetworkState (same helper AssignCardFeed tests use). Also add afterEach(() => setForceOffline(false)) to keep the module-level state from leaking across tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ok GH checks passing now |


On hold for:
- https://github.com/Expensify/Web-Expensify/pull/51695
- https://github.com/Expensify/Auth/pull/20741

Explanation of Change
When a client misses real-time Pusher events (e.g., tab backgrounded, brief disconnect), the agentZeroProcessingRequestIndicator NVP can get permanently stuck in the "thinking" state. This is a fundamental architectural mismatch: the thinking indicator is ephemeral/transient data stored in a durable system (Onyx NVPs). Every major real-time platform (WhatsApp, Discord, Slack, XMPP, PubNub) treats processing/typing indicators as ephemeral signals with client-side TTL timeouts; none persist them durably. Expensify's own typing indicator (Report/index.ts:510-523) already solves this exact problem with a client-side timeout (10s for Concierge).

Fix: Apply a polling + safety timeout pattern to recover from missed WebSocket events:

- When isProcessing becomes true, start a setInterval that calls getNewerActions(reportID, newestReportActionID) every 30s. This actively fetches any Concierge response that was sent while the WebSocket was disconnected. Polling only fires when online (checks isOfflineRef).
- Use useNetwork() to detect reconnection, clear the indicator, and restart polling.

Why polling over a hard TTL: Concierge responses can take up to 2 minutes for complex queries. A hard 60s TTL would incorrectly clear a legitimate in-progress response. The 30s poll keeps the indicator visible while actively checking for the response; if it arrives, the normal Onyx update clears the indicator. Only after 120s (4 failed polls) do we hard-clear.
New files:

- src/hooks/useAgentZeroStatusIndicator.ts: Core hook with polling, TTL, reconnect logic, and Concierge response detection
- src/libs/ConciergeReasoningStore.ts: Ephemeral in-memory store for reasoning summaries (not persisted to Onyx)

Modified files:

- src/pages/inbox/AgentZeroStatusContext.tsx: Simplified to delegate to the new hook
$ https://github.com/Expensify/Expensify/issues/612534
Tests
Offline tests
N/A
QA Steps
PR Author Checklist
Screenshots/Videos
Android: Native
N/A — This change is hook/logic only (no UI component changes). The thinking indicator renders identically across all platforms via existing components.
Android: mWeb Chrome
N/A — Hook/logic change only, tested via web screenshots below.
iOS: Native
N/A — Hook/logic change only, tested via web screenshots below.
iOS: mWeb Safari
N/A — Hook/logic change only, tested via web screenshots below.
MacOS: Chrome / Safari
Pusher disconnect recovery test (exact production failure mode):
Uses window.getPusherInstance().disconnect() to drop only the WebSocket while keeping HTTP alive; this is exactly what happens in production when Pusher has a brief hiccup.

Video (full Pusher-drop cycle, 60s disconnected):
https://github.com/Expensify/Expensify/releases/download/untagged-612534-polling-1774719056/pusher-drop-recovery.webm
Key result: The thinking indicator persisted for the full 60s while Pusher was disconnected. After Pusher reconnected:

- The missed response was fetched via getNewerActions polling (5 calls detected)
- The reply was identified as Concierge's (actorAccountID === CONCIERGE)
- AuthenticatePusher + 5 GetNewerActions calls confirmed the recovery mechanism

Note on offline behavior: When fully offline (setOffline(true)), the indicator is intentionally hidden (!isOffline in isProcessing); this is the original design. The bug only manifests during WebSocket drops where the user appears online but misses Pusher events.
https://github.com/Expensify/Expensify/releases/download/untagged-612534-polling-1774719056/pusher-drop-final.webm