[NoQA] Add agent-device glue-code skill for mobile testing #87662
rlinoz merged 15 commits into Expensify:main
Conversation
- Install callstackincubator/agent-device skill (+ bundled dogfood skill)
- Add agent-device-app-testing wrapper skill with Expensify-specific context: package name, sign-in flow, usage guidance, and proactive triggers
Julesssss
left a comment
This isn't a full review yet, but some initial thoughts:
...se after mobile/React Native code changes
- This description is likely to trigger more than I would like for the initial skill implementation. Could we restrict this so that the `use when` scope is reduced?
- The skills are a bit verbose in some cases. It would be good to reduce where possible; I can point out more details in a second review shortly.
|
Thanks for the feedback! I will work on adjusting the skill properly. Looking forward to further ideas from your side ❤️ |
- Flatten .agents/skills/ into .claude/skills/ (remove symlink indirection and skills-lock.json created by `npx skills add`)
- Add CLI prerequisites section to wrapper skill
- Replace .rock/cache/ CI paths with local build as primary flow
- Add agent-device-output/ to .gitignore
- Fix email pattern and dev/release package names
- Tighten trigger scope to explicit user requests only
- Reduce verbosity per reviewer feedback
Per Jules's comment: local testing is directed by user, not prescribed by the skill. Remove step-by-step workflow - the base agent-device skill handles interaction. Keep only the App-specific facts that avoid repetitive lookups (package names, build commands, sign-in creds, RN gotchas).
Reduces context overhead for the PoC. dogfood (autonomous QA) is better suited for Phase 2/Melvin. macOS desktop and remote tenancy references are not relevant for local mobile testing.
…flow
- Replace removed scrollintoview command with scroll + re-snapshot pattern
- Add shell loop example for off-screen element discovery
- Add diff screenshot section to verification reference
- Rework app-testing skill with gated startup flow (device, metro, dev app)
- Remove release build references, enforce dev-only app policy
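The scroll + re-snapshot pattern mentioned in this commit could look roughly like the loop below. This is a minimal sketch, not the skill's actual loop: `find_offscreen` is a hypothetical helper, and the snapshot and scroll commands are passed in by the caller because the exact `agent-device` arguments are not shown in this PR.

```bash
# Generic "scroll + re-snapshot" loop (sketch).
# ASSUMPTION: the caller supplies a command that prints the current UI
# snapshot as text and a command that scrolls the screen by one step.
find_offscreen() {
  local target="$1" max_scrolls="$2" snapshot_cmd="$3" scroll_cmd="$4"
  local attempt=0
  while :; do
    # Take a fresh snapshot every round; earlier output is stale after a scroll.
    if $snapshot_cmd | grep -qF -- "$target"; then
      echo "found '$target' after $attempt scroll(s)"
      return 0
    fi
    # Stop once the scroll budget is used up.
    if [ "$attempt" -ge "$max_scrolls" ]; then
      break
    fi
    $scroll_cmd >/dev/null
    attempt=$((attempt + 1))
  done
  echo "'$target' not found after $max_scrolls scroll(s)" >&2
  return 1
}
```

The two commands are whatever the installed CLI provides for taking a snapshot and scrolling; check `agent-device --help` for the real syntax before wiring them in.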
Remove all inlined agent-device skill files and references - the CLI's bundled skills are the canonical source. The repo skill is now a thin glue layer: pre-flight check, usage principles, and a pointer to read the bundled skills from the installed package.
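For illustration, a pre-flight check of the kind this commit describes could be sketched as follows. `require_cli` is a hypothetical helper name, not something shipped by this PR or by `agent-device`; only the install command `npm install -g agent-device` comes from the PR itself.

```bash
# Pre-flight gate sketch: fail fast when a required CLI is missing.
# ASSUMPTION: `require_cli` is an illustrative helper, not the skill's real check.
require_cli() {
  local cli="$1" install_hint="$2"
  if ! command -v "$cli" >/dev/null 2>&1; then
    echo "ERROR: '$cli' not found on PATH. Install it with: $install_hint" >&2
    return 1
  fi
  echo "OK: $cli found at $(command -v "$cli")"
}

# Typical use at the top of a script (commented out so sourcing has no side effects):
# require_cli agent-device "npm install -g agent-device" || exit 1
```

Returning a non-zero status instead of calling `exit` inside the helper leaves the hard-stop decision to the caller.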
- Widen skill trigger to cover testing, debugging, perf, bug repro, feature verification - Add usage principles (fail fast, deviations are signal) - Add early-development footnote with Expensify Slack contact - Add agent-device.json with iOS mobile defaults
|
Okay, we are directly referencing Let me know what you'd like to see next as a part of this integration! What I was thinking of:
Also, for the matter of performance testing we could consider agent-react-devtools integration, which is a really nice auxiliary package, but it covers a specific niche - not all developers are interested in the performance tooling. Keeping reference to it here as an inspiration for further discussions :) Here's how it works locally. As input, the agent was told to follow the testing steps of one of the PRs and it picked up the Screen.Recording.Apr.14.2026.from.Online.Video.Cutter.mp4 |
|
I do work with react worktrees a lot here and would personally appreciate |
that'd be lovely |
Yes one annoyance mentioned in the docs is onboarding modals, we have a few that. Though we also have the From the docs it sounds like Batching is how we would pre-define flows for repeating smoke tests, is that correct? Or would that be replayability? Anyway I like the idea of running exploratory tests and then recording these to expand the 'testing steps?' library for future reruns.
Oh interesting, as you know we are heavily focusing on performance so that is definitely worth thinking about later 👍
Great point.
Interesting ideas. To avoid having too many duplicated workflows it might be preferred to trigger our existing triage agents. Enabling the triage agent to take advantage of this tool would be great though. Melvin already has the ability to verify (simple) web bugs via playwright. |
|
First things first, thank you all for all of the interesting feedback! There are a couple of really nifty use-cases for We will benefit from defining which features are a necessity as part of the initial integration and getting that merged, so agent-device is already available for the developers. Then carry on with another set of PRs for functionality added on top. We'd get the developers' feedback about the current setup in parallel, and we can enhance our discussion of those upcoming features this way. Let me know what you think, and if you agree, let's define the short-list of what's still missing. Thanks! |
Good question @Julesssss! There are three related but distinct pieces here:
So, all of them have a use-case in our methodology. We could (just an idea for now):
It should help with those in cases where we don't already have other tooling, such as env flags, set up.
*Naming is really arbitrary - just wanted to convey the idea :D |
Agree! Let's keep it as an upcoming improvement after baseline is merged. |
@BartekObudzinski

    agent-device --session user-a --device "iPhone 16 Pro" open com.expensify.chat
    agent-device --session user-b --device "iPhone 16" open com.expensify.chat

Each session is independently addressable via However, coordinated cross-device orchestration (trigger on device A, assert on device B in one flow) is tracked as a backlog item (callstackincubator/agent-device#100) and not yet shipped. For now, the practical path for two-account flows is sequential on one simulator: complete account A's actions, log out, log in as account B, verify. Not as fast as true parallel, but would it be enough for approval/send-money scenarios? With that said, I agree we might benefit from either having pre-defined test flows for such use-cases or adding explanatory instructions in the form of an auxiliary skill. Let's track this as an upcoming improvement! |
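As a rough sketch of the per-session commands quoted in this comment, the loop below opens the app once per session/device pair. The `--session`, `--device`, and `open` arguments are copied from that comment; `open_sessions` and the `AGENT_DEVICE_BIN` override are hypothetical names added here so the loop can be dry-run without a simulator.

```bash
# Open the same app in one named session per account (sketch).
# ASSUMPTION: AGENT_DEVICE_BIN and open_sessions are illustrative names only.
AGENT_DEVICE_BIN="${AGENT_DEVICE_BIN:-agent-device}"
APP_ID="com.expensify.chat"

open_sessions() {
  # Arguments come in pairs: <session-name> <device-name>
  while [ "$#" -ge 2 ]; do
    $AGENT_DEVICE_BIN --session "$1" --device "$2" open "$APP_ID" || return 1
    shift 2
  done
}

# Example: open_sessions user-a "iPhone 16 Pro" user-b "iPhone 16"
```

Setting `AGENT_DEVICE_BIN=echo` prints the commands instead of running them, which is a cheap way to review a flow before touching a device.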
Sure thing @adhorodyski! Given it's a distinct tool with its own setup, I'd propose a separate follow-up PR - sounds good? |
The `agent-device` CLI ships with built-in skills under `skills/` in the installed package. These contain the canonical reference for device automation - bootstrap, exploration, verification, debugging, and more. Use `agent-device --help` to discover available commands and skill names. Read the skill files directly from the installed package path when you need detailed guidance:

```bash
# Find the package location
```
I wish we could run commands automatically instead of asking the agent to do something...
Locally this fails with a permission error though, so I'm not sure how we could do it. Can we update the settings to allow these commands and inject the context without a tool call?
Good call, thanks @rlinoz! I will look into the permissions issue.
Hi @rlinoz, the latest PR commit uses dynamic calls to perform pre-flight. Let me know if it works for you the same way it worked for me.
CC @BartekObudzinski if you'd like to test it, too. Thanks!
Tested end-to-end on iOS. Dynamic pre-flight works; `devices`, `open`, `snapshot`, and `press` all ran cleanly. LGTM
Add a Mobile Device Testing subsection parallel to Browser Testing in CLAUDE.md, and an optional AI-assisted testing callout in README after Platform-Specific Setup. Makes the agent-device skill discoverable for Claude Code users without claiming it's required setup.
|
Hi @Julesssss @rlinoz! The review remarks were addressed, let me know what you think of the current state of the agent-device setup:
|
**Optional AI-assisted mobile testing:** If you use Claude Code, the [`/agent-device` skill](.claude/skills/agent-device/SKILL.md) drives iOS and Android simulators or devices for interactive testing, debugging, and performance profiling. Requires `npm install -g agent-device`.
NAB: We should start documenting our skills better -- outside scope here of course
Julesssss
left a comment
Looks good to me as a first step
|
@abzokhattab @Julesssss One of you needs to copy/paste the Reviewer Checklist from here into a new comment on this PR and complete it. If you have the K2 extension, you can simply click: [this button] |
|
Hi @rlinoz! Jules told me he is going to be OOO for the rest of the week. Let me know if you need anything on my end to push this further. Thanks! Reassure tests are failing but the PR does not include any code-related changes, so it is most likely a CI issue. P.S. Created a separate issue for further exploration of automated flows as we have discussed as next steps #88388 |
|
@Julesssss since I'm assigned to the related issue, would you like me to review this PR? |
|
Hey @cretadn22 sorry for the ping, no need for a review. |
Reviewer Checklist
Screenshots/Videos
Android: HybridApp
Android: mWeb Chrome
iOS: HybridApp
iOS: mWeb Safari
MacOS: Chrome / Safari |
The skill bootstraps by reading files from the installed npm package, resolved via an echo bang-command. Without this allowlist entry, every skill load prompted for permission.
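To make the "reading files from the installed npm package" step concrete, here is a hedged sketch of locating the bundled skills directory. It assumes a plain global npm install, where `npm root -g` prints the global `node_modules` path; `skills_dir`, `list_skills`, and the `NPM_GLOBAL_ROOT` override are hypothetical names and are not the bang-command the skill actually uses.

```bash
# Resolve the skills/ directory of a globally installed npm package (sketch).
# ASSUMPTION: the package lives at "$(npm root -g)/<name>"; version managers
# or other package managers may place it elsewhere.
skills_dir() {
  local pkg="$1" root
  # NPM_GLOBAL_ROOT is a hypothetical override, useful for testing.
  root="${NPM_GLOBAL_ROOT:-$(npm root -g)}" || return 1
  echo "$root/$pkg/skills"
}

list_skills() {
  local dir
  dir="$(skills_dir "$1")" || return 1
  if [ ! -d "$dir" ]; then
    echo "no skills directory at $dir" >&2
    return 1
  fi
  ls "$dir"
}

# Example: list_skills agent-device
```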
|
@rlinoz PR up-to-date and ready for merge ✅ |
|
🚧 @rlinoz has triggered a test Expensify/App build. You can view the workflow run here. |
|
🧪🧪 Use the links below to test this adhoc build on Android, iOS, and Web. Happy testing! 🧪🧪
|
|
✋ This PR was not deployed to staging yet because QA is ongoing. It will be automatically deployed to staging after the next production release. |
|
🚀 Deployed to staging by https://github.com/rlinoz in version: 9.3.62-0 🚀
Bundle Size Analysis (Sentry): |
Explanation of Change
Adds an agent-device glue-code skill (`.claude/skills/agent-device/SKILL.md`) that enables Claude Code to drive iOS and Android devices for local mobile development - testing, debugging, performance profiling, bug reproduction, and feature verification.

What this PR ships:
A single lean skill file that:
- … `agent-device` CLI is installed (hard stop if missing, with install instructions)

What this PR does NOT ship:

Design decisions:
- … `agent-device` npm package and auto-update with `npm install -g agent-device`.

Fixed Issues
$ #87030
Tests
No runtime code is changed - these are Claude Code skill definitions only.
- `npm install -g agent-device`
- … `agent-device` skill is picked up
- `.claude/skills/agent-device/SKILL.md` exists and contains only the pre-flight gate + usage principles (no inlined references)

Offline tests
N/A - changes are Claude Code skill definitions only, no runtime app code affected.
QA Steps
N/A - no runtime code changes. These are developer tooling files (Claude Code skill definitions).
Screenshots/Videos
Android: Native
N/A - no UI changes
Android: mWeb Chrome
N/A - no UI changes
iOS: Native
N/A - no UI changes
iOS: mWeb Safari
N/A - no UI changes
MacOS: Chrome / Safari
N/A - no UI changes