Skip to content

[NoQA] Extract Platform interface from agent-device smoke driver (refactor)#90638

Draft
rustam-callstack wants to merge 13 commits into
Expensify:mainfrom
rustam-callstack:feat/agent-device-smoke-platform-refactor
Draft

[NoQA] Extract Platform interface from agent-device smoke driver (refactor)#90638
rustam-callstack wants to merge 13 commits into
Expensify:mainfrom
rustam-callstack:feat/agent-device-smoke-platform-refactor

Conversation

@rustam-callstack
Copy link
Copy Markdown

Explanation of Change

Pulls Android-specific boot dance, blocking-dialog recovery, keyevent dispatch, and log dumping out of the Phase 1 LLM-driven smoke driver into a new `agent-device-platform.ts` module behind a small `Platform` interface.

Why now: Phase 2 (iOS Simulator smoke, separate PR stacked on top of this one) needs an `IOSPlatform` sibling to `AndroidPlatform`. Doing the refactor in isolation keeps the iOS PR's diff focused on the iOS-specific code rather than mixing it with the abstraction extraction.

Zero behavior change for Android. `AndroidPlatform` is a verbatim move of today's inlined logic. The fork-test smoke produces the same artifacts before and after — see validation below.

Changes

File Change
`.github/scripts/agent-device-platform.ts` New (~290 LOC). `Platform` interface + `AndroidPlatform` impl. Owns `adb install`, `adb reverse`, `hide_error_dialogs=1`, `autofill_service=null`, ANR dialog recovery, `adb logcat` dump, `adb shell input keyevent` dispatch. Exports `detectPlatform()`, `startMetro()`, `locateBundle()`, `backgroundPids`.
`.github/scripts/agent-device-llm-driver.ts` -185 LOC. `bootApp()` becomes ~20 lines that delegate to `platform.install/setupNetworking/preBootHardening/launch`. Blocking-dialog recovery delegates to `platform.tryDismissBlockingDialog()`. `back()` and `dismiss_keyboard()` tools call `platform.back()` / `platform.dismissKeyboard()`. Cleanup trap calls `platform.dumpLogsToFile()`.
`.github/scripts/agent-device-cli.ts` `adbKey()` marked `@deprecated` (retained for skill-bundled `replay-only.ts` compat on a separate branch).

Platform auto-detects from `PLATFORM` env (defaults to `'android'` for backwards compatibility). `PLATFORM=ios` is intentionally not implemented in this PR — throws a clear "lands in the next PR" error. The iOS implementation arrives in the follow-up PR.

Cost guards

No change. The refactor preserves every existing guard:

  • `concurrency.cancel-in-progress` keyed by ref
  • `timeout-minutes: 35`
  • `paths-ignore` for docs/help/markdown
  • `LLM_TOKEN_BUDGET` env kill-switch
  • Cache-hit path stays ~$0; refactor doesn't add any LLM calls

Required secrets

None new. Same set as PR #90181.

Fixed Issues

$
PROPOSAL:

Refactor PR; opened as draft on the same trajectory as #90181.

Tests

This PR is a refactor only — no runtime app code changes. Verification is exercising the existing smoke workflow and showing byte-identical metrics.

  1. Run Actions → Android Smoke LLM (fork-test) → Run workflow against this branch on the fork.
  2. Compare the runner's final `::notice::smoke OK` line to the baseline from PR [NoQA] Add LLM-driven Android emulator smoke (agent-device · Phase 1) #90181's last green cache-hit run.

Pre-merge validation:

  • Baseline (Phase 1 PR head, run 25664772390): `cache_hits=3 llm_runs=1 bash_runs=0 tokens=5649`
  • This branch (run 25853396770): `cache_hits=3 llm_runs=1 bash_runs=0 tokens=4315`

Identical cache_hits/llm_runs/bash_runs split — confirms the refactor is functionally equivalent on Android. Token-count delta (4315 vs 5649) reflects routine variance in the LLM call for step 4's `wait_for(magic_code)`.

  • Verify that no errors appear in the JS console

Offline tests

N/A — refactor only, no app runtime change.

QA Steps

N/A — `.github/scripts/` files are not shipped to staging or production.

  • Verify that no errors appear in the JS console

PR Author Checklist

  • I linked the correct issue in the `### Fixed Issues` section above

    No tracking issue; refactor opened as a draft alongside [NoQA] Add LLM-driven Android emulator smoke (agent-device · Phase 1) #90181.

  • I wrote clear testing steps that cover the changes made in this PR

    • I added steps for local testing in the `Tests` section
    • I added steps for the expected offline behavior in the `Offline steps` section
    • I added steps for Staging and/or Production testing in the `QA steps` section
    • I added steps to cover failure scenarios
  • I included screenshots or videos for tests on all platforms — N/A, refactor only

  • I ran the tests on all platforms — N/A, refactor only:

    • Android: Native — N/A
    • Android: mWeb Chrome — N/A
    • iOS: Native — N/A
    • iOS: mWeb Safari — N/A
    • MacOS: Chrome / Safari — N/A
  • I verified there are no console errors

  • I followed proper code patterns

  • I followed the guidelines as stated in the Review Guidelines

  • I tested other components that can be impacted by my changes — N/A, additive interface

  • I verified all code is DRY

  • I verified any variables that can be defined as constants are

  • I verified that if a function's arguments changed that all usages have also been updated correctly

  • If any new file was added I verified that:

    • The file has a description of what it does and/or why is needed at the top of the file

Screenshots/Videos

Android: Native

N/A — refactor only. Verification is the byte-identical fork-test run above.

Android: mWeb Chrome

N/A

iOS: Native

N/A — `PLATFORM=ios` lands in the follow-up PR.

iOS: mWeb Safari

N/A

MacOS: Chrome / Safari

N/A

rustam-callstack and others added 13 commits May 8, 2026 11:22
Replaces the brittle bash assertion logic of Phase 0 with an LLM
runner that takes plain-text test cases (numbered English steps with
optional `expect:` postconditions) and uses Claude Sonnet to figure
out the right agent-device CLI calls. A committed replay cache at
tests/smoke/cache/<test>.json keeps the happy path deterministic
and ~\$0 in API spend; cache misses fall back to the LLM, and
final-tier failures fall back to a Phase-0-style bash recipe so an
Anthropic outage doesn't fail the build.

Phase 0 stays untouched. Phase 1 ships as `smokeAndroidLLM.yml` with
`continue-on-error: true` for the first 2 weeks; flip to required
once flake rate is at parity.

Files added:
- .github/scripts/agent-device-cli.ts          (typed wrapper around the CLI)
- .github/scripts/agent-device-snapshot-signature.ts  (structural cache key)
- .github/scripts/agent-device-expect.ts       (postcondition DSL)
- .github/scripts/agent-device-replay-cache.ts (cache load/lookup/diff)
- .github/scripts/agent-device-llm-client.ts   (Anthropic /v1/messages with prompt cache + backoff)
- .github/scripts/agent-device-llm-driver.ts   (orchestrator)
- .github/workflows/smokeAndroidLLM.yml        (PR + dispatch trigger)
- tests/smoke/android-signin.testcase.txt      (4 numbered steps for SignIn flow)
- package.json: smoke:android:llm script

See plan: $(printf '~/.claude/plans/buzzing-mixing-dusk.md')

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1's first fork-test run timed out at the 360s SignIn-wait budget
without uploading any diagnostics — the runner exited via fail() before
writing snapshots/screenshots, so post-mortem only had logcat.

- 360s -> 600s. Phase 0 saw 294s on a warm AVD; the first run of a new
  workflow can't reuse that cache (key includes the workflow filename),
  so it pays the cold-prime cost and needs more headroom.
- Every 30s during the wait, dump probe snapshot text to artifacts so
  we can see the timeline of UI states the app traversed.
- On final timeout, capture snapshot + appstate + PNG screenshot before
  failing so the failure is debuggable from a single artifact upload.
- Don't let a transient snapshot exception kill the whole wait — log
  and retry. The agent-device CLI occasionally times out under emulator
  load and the next poll usually succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous fork-test runs showed every probe stuck on a system "Pixel
Launcher isn't responding" dialog with `Close app` / `Wait` buttons,
sitting on top of our (correctly-foregrounded) Expensify activity.
The 2-core ubuntu-latest runner can't keep up with Metro + APK launch
+ launcher init simultaneously, so the launcher ANRs and the
accessibility tree gets captured by the dialog overlay.

Two fixes:
1. Pre-emptively `settings put global hide_error_dialogs 1` so the
   OS suppresses ANR dialogs system-wide (the underlying ANR still
   happens but the foreground app stays uncovered).
2. In-loop recovery: if the snapshot looks like an ANR dialog
   (exactly two buttons labelled "Close app" + "Wait"), press Wait
   to dismiss, then `am start` our activity to force-foreground,
   and continue polling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captured from run 25553622590 (51m, 4/4 LLM steps green, magic-code
reached). All 4 step entries plus structural pre/post signatures are
committed so future PR runs can replay the happy path without a
Claude API call.

Known caveat: step 2's recorded actions only contain `press` though
the field gets typed end-to-end. The runner's recording path drops
the fill action somewhere; the committed cache will not perfectly
replay step 2, so cache-hit will fail expect-verification on that
step and fall through to LLM. Tracking a fix; the smoke remains
correct because expect runs against the live UI, not the cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…drift

The cache-hit run 25556053751 failed at step 3 because verifyPostState
required *both* signature match AND expect-pass. The replay had pressed
Continue successfully — the app advanced to magic-code, but the post-
signature differed from what was recorded (cosmetic re-render, slightly
different node count). Runner treated it as drift, fell through to LLM,
LLM exhausted budget, bash fallback ran "press Continue" against the
magic-code screen (no Continue button), step failed.

The signature is a structural hash; the expect predicate is an
intentional deterministic check over the live UI. When expect passes
the step has succeeded by the test author's own definition, even if
the structural hash drifted. Re-prioritize: expect first, signature
becomes advisory (warning, not failure).

Steps with no `expect:` clause still fail on signature drift — that's
the only post-state check available there, and it stays useful as a
"did anything visibly change?" tripwire.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Diagnostic mode for tracking down the step-2 cache-recording bug
(cache stores `press(text-field)` even though the email gets typed
into the field). With DEBUG_LLM=1:

- llm-client trace adds a `request` entry per call with the last
  user text + every prior tool_use in the thread. Each `response`
  entry now includes the LLM's tool_use blocks (id, name, full
  input args) and any text preview.
- driver dispatchTool fill/press log entry args, refToLocator
  result, executed-array length after the push, and surface throws
  separately so a silent CLI failure becomes visible.

Off by default (env-gated) so production runs stay slim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for cache-hit reliability traced from run 25553622590's
logcat:

1. Android autofill silently filled the email field after the LLM
   pressed it (FillRequestEventLogger entry at the exact moment of
   step 2's press, BeginSignIn API fired with the email a second
   later — the LLM never called fill). Cache then recorded only the
   press; replay on a different AVD snapshot where autofill state
   had rotated broke deterministically. Disabling autofill via
   `settings put secure autofill_service null` at boot forces the
   LLM to call fill explicitly so both record and replay are
   self-contained.

2. ANR recovery via `am start` brought a half-loaded MainActivity
   to the foreground (run 25560886459 stuck on splash for 600s after
   recovery). force-stop + agent-device open --relaunch guarantees
   a clean process spawn so the next launch re-runs JS init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eout

Run 25568731827's LLM trace revealed the core failure pattern: every
step-2 user message carried `snapshot.node_count=10` with the same
pre-step text-field — `snap` is never refreshed after fill/press, so
the LLM sees its own actions had "no effect", retries the same fill,
gets caught by seen-hash dedup, then burns the wall-clock budget.

Three fixes:

1. After every batch of tool calls in runLLMStep that contains
   fill/press/wait, refresh `snap` + `app` so the next round sees
   the live state. snapshot/wait_for/back/dismiss_keyboard already
   refreshed via dispatchTool's onSnap callback; fill/press didn't.

2. agent-device fill gets its own 90s CLI timeout (was 30s). The
   30-char email took >30s to type on the 2-core ubuntu-latest;
   adCli.fill threw, the action wasn't pushed to executed[], and
   the device did get partially-typed text but the runner thought
   the call failed. Read-only commands keep the 30s tripwire.

3. 500ms settle gap after bash fallback before verifyPostState so
   the typed text propagates through React Native's onChange before
   the predicate snapshot reads back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captured from run 25659967543 (green end-to-end, all 4 steps reached
magic-code with proper LLM tool sequence). This cache supersedes the
prior seed from 25553622590, which was recorded with Android Autofill
active — its step 2 stored a stale `press(text-field)` action that
worked only because the framework was silently filling the field on
focus, breaking cache-hit replay on AVD snapshots where autofill
state had rotated.

This cache contains the correct `fill(text-field, "rustam.zeinalov@…")`
recorded against an autofill-disabled emulator. Signatures rotate
relative to the old cache (autofill-related accessibility nodes are
gone), but the role-based locators stay portable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Compared step-1-pre snapshots of runs 25659967543 and 25662443061
on the same SignIn screen: one had 3 extra dev-warning nodes
("!, The result of getSnapshot should be cached...") the other
didn't. Structural signature included those nodes, so the cache
key rotated between runs and replay never matched even though
the user-visible UI was identical.

Drop those transient dev-mode bubbles from the signature: any
group whose text starts with "!, ", any "!" indicator, and the
specific warning text strings that pair with them. Dev-only by
construction — they never reach release builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-computed pre/post signatures locally from run 25662443061's
step-N-{pre,post}.txt artifacts with the new filter (transient
RN dev-warning nodes excluded). Verified the same signatures
compute from run 25659967543's artifacts on the same UI despite
that run having different dev-warning node counts — filter is
doing its job.

Action sequences unchanged (filter affects only signature, not
locator resolution). Next dispatch should land cache_hits>=3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidates consecutive `//` lines into single `/* */` blocks across
all six engine files. Single-line and end-of-line comments become
inline `/* */`. ESLint/TS/Prettier directives are preserved as `//`
because they only work in that form. Strict tsc still clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls Android-specific boot dance, blocking-dialog recovery, keyevent
dispatch, and log dumping out of the driver into a new
agent-device-platform.ts module behind a small Platform interface.

AndroidPlatform is a verbatim move of today's inlined logic. The
fork-test smoke must produce byte-identical artifacts before and
after this change — that's the success criterion for landing this
PR. The cache and recorded actions reference the same locator
shapes; only the call chain changed.

Why now: Phase 2 (iOS Simulator smoke) needs an IOSPlatform sibling
to AndroidPlatform. Doing the refactor in isolation keeps the iOS
PR diff focused on the iOS-specific code.

Files:
- .github/scripts/agent-device-platform.ts  (new, ~290 LOC)
- .github/scripts/agent-device-llm-driver.ts  (-185 LOC, delegates
  to platform.foo())
- .github/scripts/agent-device-cli.ts  (adbKey marked @deprecated;
  retained for skill-bundled replay-only.ts compat)

The driver auto-detects platform from PLATFORM env (defaults to
'android'). PLATFORM=ios is intentionally not implemented in this
PR — throws with a clear "lands in PR B" message.

Strict tsc clean across all modules. No new dependencies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

⚠️ This PR is possibly changing native code and/or updating libraries, it may cause problems with HybridApp. Please check if any patch updates are required in the HybridApp repo and run an AdHoc build to verify that HybridApp will not break. Ask Contributor Plus for help if you are not sure how to handle this. ⚠️

@github-actions
Copy link
Copy Markdown
Contributor


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


0 out of 2 committers have signed the CLA.
@rustam-callstack
@rustam Zeinalov
Rustam Zeinalov seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant