[No QA] Add agent-device skill and flow metadata framework by kacper-mikolajczak · Pull Request #88403 · Expensify/App

kacper-mikolajczak · 2026-04-21T11:39:19Z

Note

Depends on #87662 - that PR introduces the agent-device skill, its flows/ directory, and the initial sign-in.ad recording. This PR layers a metadata framework on top. Review and merge #87662 first; once it lands, the diff here collapses to just the framework-specific work (headers, matcher loop, split peer flows, complete-onboarding.ad).

Details

Layers a snapshot-driven, composable flow framework on top of the agent-device skill added in #87662. Flows now self-describe their preconditions, postconditions, baked-in parameters, and tags via comment headers, enabling an agent to pre-filter which flow applies to the current screen before executing anything - making in-app automation token-efficient and resilient to UI drift.

Explanation of Change

This PR extends agent-device flows with a metadata layer. It does not reintroduce the base skill (that ships in #87662); it adds:

- A comment-header convention on .ad files (# @desc / # @pre / # @post / # @param / # @tag) that the replay parser already treats as no-ops, so headers cost nothing at runtime.
- An agent decision loop documented in SKILL.md that uses the headers plus existing CLI primitives (agent-device snapshot -i, agent-device is exists, agent-device replay) to match, pick, execute, and verify a flow against the current state.
- A peer-flow split: the flat sign-in.ad from #87662 becomes sign-in-new.ad / sign-in-returning.ad so agents can pick by @param account_state and fall back on @post mismatch. A new complete-onboarding.ad lands the user on Home.

Why flows need metadata

A flow without metadata is an opaque script. The agent cannot tell, from current state, whether the flow will do the right thing. That forces the agent to either read English prose and guess, or replay optimistically and hope. Both waste tokens and frequently land the session in a bad state.

The framework addresses this by asking every flow to declare, in # @-prefixed comment headers, the conditions under which it applies (@pre), where it will leave the app (@post), the constants it bakes in (@param), and a free-form category (@tag). The replay parser already treats # lines as no-ops, so headers are free at runtime.

Flow file anatomy

flows/sign-in-returning.ad
┌────────────────────────────────────────────────────────────────┐
│ # @desc    Sign in with the shared test account (returning).   │  ← Metadata
│ # @pre     role="textfield" label="Phone or email"             │    headers:
│ # @pre     role="button" label="Continue"                      │    parser
│ # @post    text="Home"                                         │    treats as
│ # @post    role="button" label="Search"                        │    comments;
│ # @param   email=agent-device-testing@gmail.com                │    agent reads
│ # @param   account_state=returning                             │    via grep.
│ # @tag     auth                                                │
├────────────────────────────────────────────────────────────────┤
│ fill "id=\"username\" || role=\"textfield\"..." "..."          │  ← Body:
│ press "role=\"button\" label=\"Continue\" || ..."              │    executed
└────────────────────────────────────────────────────────────────┘    verbatim
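
To make the header convention concrete, here is a minimal Python sketch of how an agent-side helper might collect the headers from a flow file. `parse_headers` and `HEADER_RE` are hypothetical names, not part of agent-device; the PR itself only relies on grep. The sample text reuses the headers from the box above, with the body kept elided as in the original.

```python
import re
from collections import defaultdict

# Matches lines such as `# @pre     role="button" label="Continue"`.
HEADER_RE = re.compile(r"^#\s*@(\w+)\s+(.*\S)\s*$")


def parse_headers(flow_text: str) -> dict[str, list[str]]:
    """Collect `# @key value` headers; every other line is ignored."""
    headers: dict[str, list[str]] = defaultdict(list)
    for line in flow_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            key, value = match.groups()
            headers[key].append(value)
    return dict(headers)


# Headers copied from flows/sign-in-returning.ad as shown in this PR.
SAMPLE_FLOW = r'''# @desc    Sign in with the shared test account (returning).
# @pre     role="textfield" label="Phone or email"
# @pre     role="button" label="Continue"
# @post    text="Home"
# @post    role="button" label="Search"
# @param   email=agent-device-testing@gmail.com
# @param   account_state=returning
# @tag     auth
press "role=\"button\" label=\"Continue\" || ..."
'''

headers = parse_headers(SAMPLE_FLOW)
# @param values are `name=value` pairs; split on the first `=` only.
params = dict(item.split("=", 1) for item in headers["param"])
print(headers["pre"])
print(params)
```

Repeated keys such as @pre accumulate into a list, so each selector stays separately checkable with `agent-device is exists`.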

Agent decision loop

With the metadata in place, an agent follows a single loop before touching the UI manually:

           ┌───────────────────┐
           │ snapshot current  │
           │ state             │
           └─────────┬─────────┘
                     ▼
           ┌───────────────────┐
           │ grep '^# @' over  │
           │ flows/*.ad        │
           └─────────┬─────────┘
                     ▼
           ┌───────────────────┐   none pass
           │ filter by @pre    │─────────────┐
           │ (is exists ...)   │             │
           └─────────┬─────────┘             │
                     ▼ some pass             │
           ┌───────────────────┐   mismatch  │
           │ filter by @param  │─────────────┤
           │ vs user intent    │             │
           └─────────┬─────────┘             │
                     ▼ match                 │
           ┌───────────────────┐             │
           │ pick by @post     │             │
           │ goal proximity    │             │
           └─────────┬─────────┘             │
                     ▼                       │
           ┌───────────────────┐             │
           │ agent-device      │             │
           │ replay <path>     │             │
           └─────────┬─────────┘             │
                     ▼                       │
           ┌───────────────────┐    fail     │
           │ verify @post      │─────────┐   │
           │ (is exists ...)   │         │   │
           └─────────┬─────────┘         ▼   ▼
                     ▼ pass          ┌──────────────┐
           ┌───────────────────┐     │ try peer,    │
           │ goal reached?     │     │ else go      │
           │ yes → done        │     │ manual       │
           │ no  → loop        │     └──────────────┘
           └───────────────────┘
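
One pass of the loop above can be sketched in Python. This is an illustration under stated assumptions, not the logic shipped in SKILL.md: `exists` and `replay` are injected callables standing in for `agent-device is exists` and `agent-device replay`, the "pick by @post goal proximity" step is left out, and the `text="Welcome"` selector is a placeholder because the real @post selectors of sign-in-new.ad are not quoted in this description.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Flow:
    path: str
    pre: list[str]
    post: list[str]
    params: dict[str, str] = field(default_factory=dict)


def run_matching_flow(
    flows: list[Flow],
    intent: dict[str, str],
    exists: Callable[[str], bool],
    replay: Callable[[str], None],
) -> Optional[Flow]:
    """Filter by @pre, filter by @param, replay, then verify @post."""
    for flow in flows:
        if not all(exists(sel) for sel in flow.pre):
            continue  # @pre does not hold on the current screen
        if any(flow.params.get(key, want) != want for key, want in intent.items()):
            continue  # a baked-in constant contradicts the user's intent
        replay(flow.path)
        if all(exists(sel) for sel in flow.post):
            return flow  # @post verified
        return None  # @post mismatch: caller tries a peer or goes manual
    return None  # no flow applies: go manual


# Fake device for the demo: the screen is a set of visible selectors.
AUTH_WALL = ['role="textfield" label="Phone or email"', 'role="button" label="Continue"']
HOME = ['text="Home"', 'role="button" label="Search"']
screen = set(AUTH_WALL)


def fake_exists(selector: str) -> bool:
    return selector in screen


def fake_replay(path: str) -> None:
    screen.clear()
    screen.update(HOME)  # pretend the replay signed in a returning account


flows = [
    # Placeholder @post selector; see the assumptions above.
    Flow("flows/sign-in-new.ad", AUTH_WALL, ['text="Welcome"'], {"account_state": "new"}),
    Flow("flows/sign-in-returning.ad", AUTH_WALL, HOME, {"account_state": "returning"}),
]
chosen = run_matching_flow(flows, {"account_state": "returning"}, fake_exists, fake_replay)
print(chosen.path if chosen else "go manual")
```

Injecting `exists` and `replay` keeps the matching logic testable without a device; a real caller would wrap the CLI commands instead.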

Composition

Flows are narrow snippets, not self-contained scripts. They have no open / close / context and no fixed wait calls - the caller owns the session. That keeps them chainable:

          [auth wall]
              │
              ▼
  ┌─────────────────────────┐   @pre: textfield + Continue
  │ sign-in-new.ad          │   @param: account_state=new
  │                         │   @post: Welcome + Join
  └───────────┬─────────────┘
              ▼
       [Welcome / Join]
              │  (one manual tap - documented gap)
              ▼
    [onboarding step 1]
              │
              ▼
  ┌─────────────────────────┐   @pre: "What's your work email?"
  │ complete-onboarding.ad  │   @param: purpose, first_name, last_name
  │                         │   @post: Home + Search
  └───────────┬─────────────┘
              ▼
          [Home]  ✓ goal

Peer flows (e.g. sign-in-new / sign-in-returning) share the same @pre but differ on @param and @post. The agent tries the param-matching peer first and falls back to the other when the post-check fails - the decision loop catches the miss before the session is corrupted.
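
The "param-matching peer first" rule can be sketched as a sort. This is a hypothetical helper (`order_peers` is not part of the skill), assuming each peer's @param headers have already been parsed into a dict; a parameter the flow does not declare is treated as neutral rather than as a mismatch.

```python
def order_peers(peers: list[dict], intent: dict[str, str]) -> list[dict]:
    """Sort peer flows so those that contradict the intent least come first."""

    def contradictions(flow: dict) -> int:
        params = flow["params"]
        return sum(
            1
            for key, want in intent.items()
            if key in params and params[key] != want
        )

    # sorted() is stable, so peers with equal scores keep their catalog order.
    return sorted(peers, key=contradictions)


peers = [
    {"path": "flows/sign-in-returning.ad", "params": {"account_state": "returning"}},
    {"path": "flows/sign-in-new.ad", "params": {"account_state": "new"}},
]
try_order = [flow["path"] for flow in order_peers(peers, {"account_state": "new"})]
print(try_order)  # ['flows/sign-in-new.ad', 'flows/sign-in-returning.ad']
```

The non-matching peer stays in the list as the fallback for a failed post-check instead of being discarded.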

Matching primitives

Nothing new was added to the agent-device CLI. The framework uses commands already shipped with the tool:

| Primitive | Purpose |
| --- | --- |
| `grep '^# @' flows/*.ad` | Discover the whole catalog in one read. |
| `agent-device snapshot -i` | See current UI state. |
| `agent-device is exists <sel>` | Check a single `@pre` or `@post` selector. |
| `agent-device replay <path>` | Execute the flow body. |
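
For tooling that prefers structured data over raw grep output, the catalog read can be approximated in Python. This is a sketch only: `discover_catalog` is a hypothetical helper, and the demo writes a throwaway flow file with a placeholder body rather than reading the real flows/ directory.

```python
import tempfile
from pathlib import Path


def discover_catalog(flows_dir: Path) -> dict[str, list[str]]:
    """Return {file name: header lines}, mirroring `grep '^# @' flows/*.ad`."""
    catalog: dict[str, list[str]] = {}
    for flow_path in sorted(flows_dir.glob("*.ad")):
        lines = flow_path.read_text(encoding="utf-8").splitlines()
        catalog[flow_path.name] = [line for line in lines if line.startswith("# @")]
    return catalog


with tempfile.TemporaryDirectory() as tmp:
    flows_dir = Path(tmp)
    # Throwaway flow with a placeholder body, for demonstration only.
    (flows_dir / "sign-in-new.ad").write_text(
        '# @param   account_state=new\n# @tag     auth\npress "..."\n',
        encoding="utf-8",
    )
    # Non-.ad files are skipped by the glob, as with flows/*.ad in the shell.
    (flows_dir / "README.md").write_text("# @not-a-flow\n", encoding="utf-8")
    catalog = discover_catalog(flows_dir)

print(catalog)
```

Like the grep call, this reads only the header lines, so the size of the flow bodies does not affect what the agent has to look at.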

Fixed Issues

$ #88388
PROPOSAL:

Tests

Offline tests

N/A - changes are agent-tooling under .claude/ and do not affect app runtime behavior or network state.

QA Steps

N/A - no shipped-app changes. Title includes [No QA].

Verify that no errors appear in the JS console

PR Author Checklist

Screenshots/Videos

Android: Native

Android: mWeb Chrome

iOS: Native

iOS: mWeb Safari

MacOS: Chrome / Safari

- Install callstackincubator/agent-device skill (+ bundled dogfood skill)
- Add agent-device-app-testing wrapper skill with Expensify-specific context: package name, sign-in flow, usage guidance, and proactive triggers

- Flatten .agents/skills/ into .claude/skills/ (remove symlink indirection and skills-lock.json created by `npx skills add`)
- Add CLI prerequisites section to wrapper skill
- Replace .rock/cache/ CI paths with local build as primary flow
- Add agent-device-output/ to .gitignore
- Fix email pattern and dev/release package names
- Tighten trigger scope to explicit user requests only
- Reduce verbosity per reviewer feedback

Per Jules's comment: local testing is directed by user, not prescribed by the skill. Remove step-by-step workflow - the base agent-device skill handles interaction. Keep only the App-specific facts that avoid repetitive lookups (package names, build commands, sign-in creds, RN gotchas).

Reduces context overhead for the PoC. dogfood (autonomous QA) is better suited for Phase 2/Melvin. macOS desktop and remote tenancy references are not relevant for local mobile testing.

…flow

- Replace removed scrollintoview command with scroll + re-snapshot pattern
- Add shell loop example for off-screen element discovery
- Add diff screenshot section to verification reference
- Rework app-testing skill with gated startup flow (device, metro, dev app)
- Remove release build references, enforce dev-only app policy

Remove all inlined agent-device skill files and references - the CLI's bundled skills are the canonical source. The repo skill is now a thin glue layer: pre-flight check, usage principles, and a pointer to read the bundled skills from the installed package.

- Widen skill trigger to cover testing, debugging, perf, bug repro, feature verification
- Add usage principles (fail fast, deviations are signal)
- Add early-development footnote with Expensify Slack contact
- Add agent-device.json with iOS mobile defaults

Both sections were not guiding the agent and duplicated what the agent-device CLI's bundled skills already cover. SKILL.md is now strictly a pre-flight install gate.

Replace the inlined sign-in walkthrough in SKILL.md with a pointer to a flows/ directory of .ad replay recordings. Each flow is invoked on explicit developer intent (not via snapshot matching) to keep the deterministic path free of LLM reasoning. Adds flows/README.md as the index; actual .ad recordings will be added once captured against a running app.

Replace the manual "ask the agent to run agent-device --version and npm root -g" instructions with dynamic context injection using the !`cmd` syntax. Commands run at skill load time (preprocessing), so the resolved version and canonical skill path land in the skill content directly - no tool call required from the agent. Pre-approves Bash(agent-device *) via allowed-tools in the skill frontmatter and also via .claude/settings.json so fresh checkouts do not get a permission prompt during the preprocessing step. Addresses Expensify#87662 (comment)

Add a Mobile Device Testing subsection parallel to Browser Testing in CLAUDE.md, and an optional AI-assisted testing callout in README after Platform-Specific Setup. Makes the agent-device skill discoverable for Claude Code users without claiming it's required setup.

Introduce `# @desc` / `# @pre` / `# @post` / `# @param` / `# @tag` comment headers in `.ad` flows. The replay parser already treats `#` lines as no-ops, so headers cost nothing at replay time while giving agents a machine-matchable catalog.

- `@pre` / `@post` are selectors (same syntax as the flow body) that agents verify with `agent-device is exists`. This enables catalog filtering by current snapshot state and post-replay success checks.
- `@param` advertises baked-in constants (email, account_state, names) so agents can match flows to user intent and skip when mismatched.
- `@tag` supports free-form coarse categorization.

Document the matcher loop in `SKILL.md`: snapshot -> grep catalog -> filter by `@pre` -> filter by `@param` -> pick by `@post` goal -> replay -> verify. Flesh out `flows/README.md` with the header spec, authoring rules, and updated recording workflow.

Split the prior `sign-in.ad` into peer flows `sign-in-new.ad` and `sign-in-returning.ad` that share `@pre` (auth wall) but differ on `@param account_state` and `@post`.

Add `complete-onboarding.ad` that skips the work-email step, picks a generic purpose, fills placeholder name fields, and lands on Home.

kacper-mikolajczak · 2026-04-21T11:47:30Z

CC @BartekObudzinski

kacper-mikolajczak · 2026-04-21T19:53:21Z

Superseded by #88474 - relocated the head branch to callstack-internal/Expensify-App to allow co-operation on the PR with @BartekObudzinski.

kacper-mikolajczak added 15 commits April 10, 2026 19:25

Update build instructions to Rock workflow, HybridApp only

a63542b

Drop dogfood skill, macos-desktop and remote-tenancy references

7af0dc5

Reduces context overhead for the PoC. dogfood (autonomous QA) is better suited for Phase 2/Melvin. macOS desktop and remote tenancy references are not relevant for local mobile testing.

Trim app-testing skill: defer device bootstrap to base agent-device

9b21ec1

Remove agent-device.json from PR

d4abc1b

Drop footnote and how-this-works section from agent-device SKILL.md

4477236

Both sections were not guiding the agent and duplicated what the agent-device CLI's bundled skills already cover. SKILL.md is now strictly a pre-flight install gate.

melvin-bot Bot assigned kacper-mikolajczak Apr 21, 2026

kacper-mikolajczak changed the title ~~Add agent-device skill and flow metadata framework [No QA]~~ [No QA] Add agent-device skill and flow metadata framework Apr 21, 2026

kacper-mikolajczak closed this Apr 21, 2026

kacper-mikolajczak wants to merge 15 commits into Expensify:main from kacper-mikolajczak:agent-device-flow-metadata
