[No QA] Add agent-device skill and flow metadata framework #88403

Closed
kacper-mikolajczak wants to merge 15 commits into Expensify:main from kacper-mikolajczak:agent-device-flow-metadata

kacper-mikolajczak commented Apr 21, 2026

Note

Depends on #87662 - that PR introduces the agent-device skill, its flows/ directory, and the initial sign-in.ad recording. This PR layers a metadata framework on top. Review and merge #87662 first; once it lands, the diff here collapses to just the framework-specific work (headers, matcher loop, split peer flows, complete-onboarding.ad).

Details

Layers a snapshot-driven, composable flow framework on top of the agent-device skill added in #87662. Flows now self-describe their preconditions, postconditions, baked-in parameters, and tags via comment headers, enabling an agent to pre-filter which flow applies to the current screen before executing anything - making in-app automation token-efficient and resilient to UI drift.

Explanation of Change

This PR extends agent-device flows with a metadata layer. It does not reintroduce the base skill (that ships in #87662); it adds:

  1. A comment-header convention on .ad files (# @desc / # @pre / # @post / # @param / # @tag) that the replay parser already treats as no-ops, so headers cost nothing at runtime.
  2. An agent decision loop documented in SKILL.md that uses the headers plus existing CLI primitives (agent-device snapshot -i, agent-device is exists, agent-device replay) to match, pick, execute, and verify a flow against the current state.
  3. A peer-flow split: the flat sign-in.ad from #87662 becomes sign-in-new.ad / sign-in-returning.ad so agents can pick by @param account_state and fall back on @post mismatch. A new complete-onboarding.ad lands the user on Home.

Why flows need metadata

A flow without metadata is an opaque script: the agent cannot tell from the current screen whether replaying it will do the right thing. It must either read English prose and guess, or replay optimistically and hope. Both waste tokens and frequently leave the session in a bad state.

The framework addresses this by asking every flow to declare, in # @-prefixed comment headers, the conditions under which it applies (@pre), where it will leave the app (@post), the constants it bakes in (@param), and a free-form category (@tag). The replay parser already treats # lines as no-ops, so headers are free at runtime.
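As a rough illustration of how cheaply the headers can be read, here is a minimal Python sketch that collects `# @key value` lines into a dictionary. The function name `parse_headers`, the regular expression, and the placeholder body line are assumptions made for this example; the PR itself only prescribes the header convention and has the agent read it via grep.

```python
import re
from collections import defaultdict

# Matches header lines such as:  # @pre     role="button" label="Continue"
HEADER_RE = re.compile(r"^#\s*@(\w+)\s+(.*\S)\s*$")


def parse_headers(flow_text: str) -> dict[str, list[str]]:
    """Collect '# @key value' lines into {key: [value, ...]}.

    Hypothetical helper: non-header lines (the flow body) are ignored.
    """
    headers: dict[str, list[str]] = defaultdict(list)
    for line in flow_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            key, value = match.groups()
            headers[key].append(value)
    return dict(headers)


# Header lines copied from the sign-in-returning.ad example in this PR;
# the last entry is a placeholder standing in for the flow body.
SAMPLE = "\n".join([
    '# @desc    Sign in with the shared test account (returning).',
    '# @pre     role="textfield" label="Phone or email"',
    '# @pre     role="button" label="Continue"',
    '# @post    text="Home"',
    '# @post    role="button" label="Search"',
    '# @param   email=agent-device-testing@gmail.com',
    '# @param   account_state=returning',
    '# @tag     auth',
    'press "placeholder body line"',
])

if __name__ == "__main__":
    for key, values in parse_headers(SAMPLE).items():
        print(key, values)
```

Repeated keys (two @pre lines, two @post lines) are kept as lists, so every selector of a flow can be checked individually.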

Flow file anatomy

flows/sign-in-returning.ad
┌────────────────────────────────────────────────────────────────┐
│ # @desc    Sign in with the shared test account (returning).   │  ← Metadata
│ # @pre     role="textfield" label="Phone or email"             │    headers:
│ # @pre     role="button" label="Continue"                      │    parser
│ # @post    text="Home"                                         │    treats as
│ # @post    role="button" label="Search"                        │    comments;
│ # @param   email=agent-device-testing@gmail.com                │    agent reads
│ # @param   account_state=returning                             │    via grep.
│ # @tag     auth                                                │
├────────────────────────────────────────────────────────────────┤
│ fill "id=\"username\" || role=\"textfield\"..." "..."          │  ← Body:
│ press "role=\"button\" label=\"Continue\" || ..."              │    executed
└────────────────────────────────────────────────────────────────┘    verbatim

Agent decision loop

With the metadata in place, an agent follows a single loop before touching the UI manually:

           ┌───────────────────┐
           │ snapshot current  │
           │ state             │
           └─────────┬─────────┘
                     ▼
           ┌───────────────────┐
           │ grep '^# @' over  │
           │ flows/*.ad        │
           └─────────┬─────────┘
                     ▼
           ┌───────────────────┐   none pass
           │ filter by @pre    │─────────────┐
           │ (is exists ...)   │             │
           └─────────┬─────────┘             │
                     ▼ some pass             │
           ┌───────────────────┐   mismatch  │
           │ filter by @param  │─────────────┤
           │ vs user intent    │             │
           └─────────┬─────────┘             │
                     ▼ match                 │
           ┌───────────────────┐             │
           │ pick by @post     │             │
           │ goal proximity    │             │
           └─────────┬─────────┘             │
                     ▼                       │
           ┌───────────────────┐             │
           │ agent-device      │             │
           │ replay <path>     │             │
           └─────────┬─────────┘             │
                     ▼                       │
           ┌───────────────────┐    fail     │
           │ verify @post      │─────────┐   │
           │ (is exists ...)   │         │   │
           └─────────┬─────────┘         ▼   ▼
                     ▼ pass          ┌──────────────┐
           ┌───────────────────┐     │ try peer,    │
           │ goal reached?     │     │ else go      │
           │ yes → done        │     │ manual       │
           │ no  → loop        │     └──────────────┘
           └───────────────────┘
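The loop above can be sketched in Python under a few assumptions. `Flow`, `run_matching_flow`, and the injected `exists` / `replay` callables are hypothetical names, not part of the skill; in practice they would wrap `agent-device is exists` and `agent-device replay`. The sketch also simplifies the "pick by @post goal proximity" step by trying candidates in catalog order, and returning `None` stands for the "go manual" branch.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Flow:
    """Hypothetical in-memory view of one .ad file's headers."""
    path: str
    pre: list[str] = field(default_factory=list)
    post: list[str] = field(default_factory=list)
    params: dict[str, str] = field(default_factory=dict)


def run_matching_flow(
    flows: list[Flow],
    intent: dict[str, str],
    exists: Callable[[str], bool],
    replay: Callable[[str], None],
) -> Optional[Flow]:
    """Replay the first flow that matches the intent and the current screen.

    Returns the flow whose @post selectors all hold after replay,
    or None when nothing applies (the caller then goes manual).
    """
    for flow in flows:
        # Filter by @param: skip flows that bake in a value the user did not ask for.
        # A flow that does not declare a key is treated as compatible.
        if any(flow.params.get(key, wanted) != wanted for key, wanted in intent.items()):
            continue
        # Filter by @pre: every selector must be present on the current screen.
        if not all(exists(selector) for selector in flow.pre):
            continue
        replay(flow.path)
        # Verify @post: only report success when the flow landed where it promised.
        if all(exists(selector) for selector in flow.post):
            return flow
    return None
```

Checking @pre immediately before each replay, rather than once up front, keeps the sketch honest when an earlier failed replay has already changed the screen.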

Composition

Flows are narrow snippets, not self-contained scripts. They have no open / close / context and no fixed wait calls - the caller owns the session. That keeps them chainable:

          [auth wall]
              │
              ▼
  ┌─────────────────────────┐   @pre: textfield + Continue
  │ sign-in-new.ad          │   @param: account_state=new
  │                         │   @post: Welcome + Join
  └───────────┬─────────────┘
              ▼
       [Welcome / Join]
              │  (one manual tap - documented gap)
              ▼
    [onboarding step 1]
              │
              ▼
  ┌─────────────────────────┐   @pre: "What's your work email?"
  │ complete-onboarding.ad  │   @param: purpose, first_name, last_name
  │                         │   @post: Home + Search
  └───────────┬─────────────┘
              ▼
          [Home]  ✓ goal

Peer flows (e.g. sign-in-new / sign-in-returning) share the same @pre but differ on @param and @post. The agent tries the param-matching peer first and falls back to the other when the post-check fails - the decision loop catches the miss before the session is corrupted.
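A minimal sketch of that peer fallback, assuming peers are plain dictionaries with `path`, `params`, and `post` keys. The helper names `order_peers` and `replay_with_fallback` and the injected `replay` / `post_holds` callables are hypothetical, introduced only for illustration.

```python
from typing import Callable, Optional


def order_peers(peers: list[dict], intent: dict[str, str]) -> list[dict]:
    """Put the peers whose @param values best match the intent first.

    sorted() is stable, so peers with the same mismatch count keep catalog order.
    """
    def mismatches(peer: dict) -> int:
        return sum(1 for key, wanted in intent.items() if peer["params"].get(key) != wanted)

    return sorted(peers, key=mismatches)


def replay_with_fallback(
    peers: list[dict],
    intent: dict[str, str],
    replay: Callable[[str], None],
    post_holds: Callable[[str], bool],
) -> Optional[str]:
    """Try the param-matching peer first, then fall back to the others.

    Returns the path of the first peer whose @post selectors all hold,
    or None when every peer misses.
    """
    for peer in order_peers(peers, intent):
        replay(peer["path"])
        if all(post_holds(selector) for selector in peer["post"]):
            return peer["path"]
    return None
```

For example, with an intent of `account_state=returning` the returning peer is replayed first; if its @post check fails because the account turns out to be new, the new-account peer is tried next.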

Matching primitives

Nothing new was added to the agent-device CLI. The framework uses commands already shipped with the tool:

| Primitive | Purpose |
| --- | --- |
| `grep '^# @' flows/*.ad` | Discover the whole catalog in one read. |
| `agent-device snapshot -i` | See current UI state. |
| `agent-device is exists <sel>` | Check a single @pre or @post selector. |
| `agent-device replay <path>` | Execute the flow body. |
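If the primitives are driven from a script rather than typed by the agent, they might be wrapped as below. This is a sketch under stated assumptions: the helper names are hypothetical, and treating exit status 0 as "the check passed" is an assumption about the CLI that should be confirmed against its documentation. Only the three command shapes listed in the table are used.

```python
import subprocess
from typing import Callable, Sequence


def snapshot_cmd() -> list[str]:
    """argv for viewing the current UI state."""
    return ["agent-device", "snapshot", "-i"]


def is_exists_cmd(selector: str) -> list[str]:
    """argv for checking a single @pre or @post selector."""
    # The selector is passed as one argv element, so its inner quotes need no shell escaping.
    return ["agent-device", "is", "exists", selector]


def replay_cmd(path: str) -> list[str]:
    """argv for executing a flow body."""
    return ["agent-device", "replay", path]


def succeeded(argv: Sequence[str], run: Callable = subprocess.run) -> bool:
    """Run a command and report whether it exited with status 0.

    ASSUMPTION: the CLI signals a failed check with a non-zero exit status.
    The `run` parameter is injectable so the wrapper can be exercised without a device.
    """
    completed = run(list(argv), capture_output=True, text=True)
    return completed.returncode == 0
```

A call such as `succeeded(is_exists_cmd('text="Home"'))` would then stand in for one @post check.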

Fixed Issues

$ #88388
PROPOSAL:

Tests

Offline tests

N/A - changes are agent-tooling under .claude/ and do not affect app runtime behavior or network state.

QA Steps

N/A - no shipped-app changes. Title includes [No QA].

  • Verify that no errors appear in the JS console

PR Author Checklist

  • I linked the correct issue in the ### Fixed Issues section above
  • I wrote clear testing steps that cover the changes made in this PR
    • I added steps for local testing in the Tests section
    • I added steps for the expected offline behavior in the Offline steps section
    • I added steps for Staging and/or Production testing in the QA steps section
    • I added steps to cover failure scenarios (i.e. verify an input displays the correct error message if the entered data is not correct)
    • I turned off my network connection and tested it while offline to ensure it matches the expected behavior (i.e. verify the default avatar icon is displayed if app is offline)
    • I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability).
  • I included screenshots or videos for tests on all platforms
  • I ran the tests on all platforms & verified they passed on:
    • Android: Native
    • Android: mWeb Chrome
    • iOS: Native
    • iOS: mWeb Safari
    • MacOS: Chrome / Safari
  • I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
  • I followed proper code patterns (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified any copy / text shown in the product is localized by adding it to src/languages/* files and using the translation method
      • If any non-english text was added/modified, I used JaimeGPT to get English > Spanish translation. I then posted it in #expensify-open-source and it was approved by an internal Expensify engineer. Link to Slack message:
    • I verified all numbers, amounts, dates and phone numbers shown in the product are using the localization methods
    • I verified any copy / text that was added to the app is grammatically correct in English. It adheres to proper capitalization guidelines (note: only the first word of header/labels should be capitalized), and is either coming verbatim from figma or has been approved by marketing (in order to get marketing approval, ask the Bug Zero team member to add the Waiting for copy label to the issue)
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I followed the guidelines as stated in the Review Guidelines
  • I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.ts or at the top of the file that uses the constant) are defined as such
  • I verified that if a function's arguments changed that all usages have also been updated correctly
  • If any new file was added I verified that:
    • The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory
  • If a new CSS style is added I verified that:
    • A similar style doesn't already exist
    • The style can't be created with an existing StyleUtils function (i.e. StyleUtils.getBackgroundAndBorderStyle(theme.componentBG))
  • If new assets were added or existing ones were modified, I verified that:
    • The assets are optimized and compressed (for SVG files, run npm run compress-svg)
    • The assets load correctly across all supported platforms.
  • If the PR modifies code that runs when editing or sending messages, I tested and verified there is no unexpected behavior for all supported markdown - URLs, single line code, code blocks, quotes, headings, bold, strikethrough, and italic.
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
  • If the PR modifies a component related to any of the existing Storybook stories, I tested and verified all stories for that component are still working as expected.
  • If the PR modifies a component or page that can be accessed by a direct deeplink, I verified that the code functions as expected when the deeplink is used - from a logged in and logged out account.
  • If the PR modifies the UI (e.g. new buttons, new UI components, changing the padding/spacing/sizing, moving components, etc) or modifies the form input styles:
    • I verified that all the inputs inside a form are aligned with each other.
    • I added Design label and/or tagged @Expensify/design so the design team can review the changes.
  • If a new page is added, I verified it's using the ScrollView component to make it scrollable when more elements are added to the page.
  • I added unit tests for any new feature or bug fix in this PR to help automatically prevent regressions in this user flow.
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.

Screenshots/Videos

Android: Native
Android: mWeb Chrome
iOS: Native
iOS: mWeb Safari
MacOS: Chrome / Safari

- Install callstackincubator/agent-device skill (+ bundled dogfood skill)
- Add agent-device-app-testing wrapper skill with Expensify-specific context:
  package name, sign-in flow, usage guidance, and proactive triggers
- Flatten .agents/skills/ into .claude/skills/ (remove symlink indirection
  and skills-lock.json created by `npx skills add`)
- Add CLI prerequisites section to wrapper skill
- Replace .rock/cache/ CI paths with local build as primary flow
- Add agent-device-output/ to .gitignore
- Fix email pattern and dev/release package names
- Tighten trigger scope to explicit user requests only
- Reduce verbosity per reviewer feedback

Per Jules's comment: local testing is directed by user, not
prescribed by the skill. Remove step-by-step workflow - the base
agent-device skill handles interaction. Keep only the App-specific
facts that avoid repetitive lookups (package names, build commands,
sign-in creds, RN gotchas).

Reduces context overhead for the PoC. dogfood (autonomous QA) is
better suited for Phase 2/Melvin. macOS desktop and remote tenancy
references are not relevant for local mobile testing.

…flow

- Replace removed scrollintoview command with scroll + re-snapshot pattern
- Add shell loop example for off-screen element discovery
- Add diff screenshot section to verification reference
- Rework app-testing skill with gated startup flow (device, metro, dev app)
- Remove release build references, enforce dev-only app policy

Remove all inlined agent-device skill files and references - the CLI's
bundled skills are the canonical source. The repo skill is now a thin
glue layer: pre-flight check, usage principles, and a pointer to read
the bundled skills from the installed package.

- Widen skill trigger to cover testing, debugging, perf, bug repro, feature verification
- Add usage principles (fail fast, deviations are signal)
- Add early-development footnote with Expensify Slack contact
- Add agent-device.json with iOS mobile defaults

Both sections were not guiding the agent and duplicated what the
agent-device CLI's bundled skills already cover. SKILL.md is now
strictly a pre-flight install gate.

Replace the inlined sign-in walkthrough in SKILL.md with a pointer to a
flows/ directory of .ad replay recordings. Each flow is invoked on
explicit developer intent (not via snapshot matching) to keep the
deterministic path free of LLM reasoning.

Adds flows/README.md as the index; actual .ad recordings will be added
once captured against a running app.

Replace the manual "ask the agent to run agent-device --version and
npm root -g" instructions with dynamic context injection using the
!`cmd` syntax. Commands run at skill load time (preprocessing), so
the resolved version and canonical skill path land in the skill
content directly - no tool call required from the agent.

Pre-approves Bash(agent-device *) via allowed-tools in the skill
frontmatter and also via .claude/settings.json so fresh checkouts do
not get a permission prompt during the preprocessing step.

Addresses Expensify#87662 (comment)

Add a Mobile Device Testing subsection parallel to Browser Testing in
CLAUDE.md, and an optional AI-assisted testing callout in README after
Platform-Specific Setup. Makes the agent-device skill discoverable
for Claude Code users without claiming it's required setup.

Introduce `# @desc` / `# @pre` / `# @post` / `# @param` / `# @tag` comment
headers in `.ad` flows. The replay parser already treats `#` lines as
no-ops, so headers cost nothing at replay time while giving agents a
machine-matchable catalog.

- `@pre` / `@post` are selectors (same syntax as the flow body) that
  agents verify with `agent-device is exists`. This enables catalog
  filtering by current snapshot state and post-replay success checks.
- `@param` advertises baked-in constants (email, account_state, names)
  so agents can match flows to user intent and skip when mismatched.
- `@tag` supports free-form coarse categorization.

Document the matcher loop in `SKILL.md`: snapshot -> grep catalog ->
filter by `@pre` -> filter by `@param` -> pick by `@post` goal ->
replay -> verify. Flesh out `flows/README.md` with the header spec,
authoring rules, and updated recording workflow.

Split the prior `sign-in.ad` into peer flows `sign-in-new.ad` and
`sign-in-returning.ad` that share `@pre` (auth wall) but differ on
`@param account_state` and `@post`. Add `complete-onboarding.ad` that
skips the work-email step, picks a generic purpose, fills placeholder
name fields, and lands on Home.

kacper-mikolajczak changed the title from "Add agent-device skill and flow metadata framework [No QA]" to "[No QA] Add agent-device skill and flow metadata framework" on Apr 21, 2026
@kacper-mikolajczak

CC @BartekObudzinski

kacper-mikolajczak commented Apr 21, 2026

Superseded by #88474 - relocated the head branch to callstack-internal/Expensify-App to allow co-operation on the PR with @BartekObudzinski.
