fix: macOS Tart E2E infrastructure - lsappinfo, a11y via SSH, provisioning #90
Miyamura80 merged 16 commits into master
osascript calls to System Events require TCC Automation permission, which hangs indefinitely in Tart VMs without pre-granted TCC access. lsappinfo visibleProcessList provides the same information without any TCC permissions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The vm-agent runs as a LaunchAgent which gets a restricted Aqua session, returning an empty accessibility tree even with TCC permissions granted. SSH sessions inherit full TCC permissions from sshd-keygen-wrapper, so running a11y-helper via `ssh localhost` gives complete UI element data.

Changes:
- observation.rs: MACOS_A11Y_CMD now uses ssh localhost to invoke a11y-helper
- init_macos.rs: provisioning sets up passwordless SSH keys, installs execute-action.py, grants TCC permissions with proper csreq blobs, and configures Homebrew PATH
- vm-agent-install.sh: add EnvironmentVariables with Homebrew PATH to the LaunchAgent plist so subprocesses find the right python3
- a11y-helper main.swift: make AXIsProcessTrustedWithOptions check non-fatal (warning instead of exit) since AX API calls may succeed even when the check returns false

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
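As a rough illustration of the `ssh localhost` indirection described in this commit, the sketch below builds the command in Rust. The names `a11y_over_ssh`, `capture_stdout`, and `A11Y_HELPER` are hypothetical; the real `MACOS_A11Y_CMD` in observation.rs is not reproduced here and may well be a plain command string. The argument list follows the invocation quoted in the PR description.

```rust
use std::io;
use std::process::Command;

/// Install path of the helper, as quoted in the PR description.
pub const A11Y_HELPER: &str = "/usr/local/bin/a11y-helper";

/// Build (but do not run) an `ssh` command that invokes the helper on localhost.
/// `BatchMode=yes` makes ssh fail instead of prompting for a password, so a
/// missing key produces an error rather than a hang.
pub fn a11y_over_ssh() -> Command {
    let mut cmd = Command::new("ssh");
    cmd.args(["-o", "BatchMode=yes", "localhost", A11Y_HELPER]);
    cmd
}

/// Run a command and return its stdout as text, treating a non-zero exit
/// status as an error (hypothetical helper for this sketch).
pub fn capture_stdout(cmd: &mut Command) -> io::Result<String> {
    let out = cmd.output()?;
    if !out.status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("command exited with {}", out.status),
        ));
    }
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}
```

Calling `capture_stdout(&mut a11y_over_ssh())` inside a provisioned VM would return the accessibility tree text; outside such a VM it is expected to fail because the helper and the localhost key are absent.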
Greptile Summary
This PR completes the macOS Tart VM E2E testing infrastructure, resolving a set of interconnected issues discovered during a first real E2E run on Apple Silicon. The changes address seven distinct failure modes.
Confidence Score: 5/5
Safe to merge; all remaining findings are P2 style/observability issues with no functional impact on the happy path.
Previously flagged P1 issues (authorized_keys idempotency, SQL injection via unsanitized paths, PYTHON_BIN resolving the wrong binary) are all addressed in prior commits. The two remaining findings are P2: the tee -a duplication is neutralized by path_helper deduplication, and the silent csreq fallback only manifests if codesign/csreq tooling fails (unlikely on a Homebrew-equipped image). All 524 unit tests pass and the infrastructure is manually verified end-to-end.
src/init_macos.rs: two minor idempotency/observability issues in the generated provisioning script
Sequence Diagram
sequenceDiagram
participant Host as Host (desktest CLI)
participant TartVM as Tart VM (vm-agent LaunchAgent)
participant SSH as SSH Session (localhost)
participant A11y as a11y-helper
participant TCC as TCC Database
Note over Host,TCC: Golden image provisioning (init-macos)
Host->>TartVM: Copy execute-action.py, a11y-helper, provision.sh
TartVM->>TartVM: ssh-keygen (ed25519, guarded)
TartVM->>TartVM: authorized_keys append (idempotent grep check)
TartVM->>TCC: INSERT OR REPLACE with csreq blob (codesign + csreq tool)
TartVM->>TartVM: sudo shutdown -h now (flush filesystem)
Host->>Host: tart clone → desktest-macos:latest
Note over Host,A11y: Test run (desktest run)
Host->>TartVM: tart run + shared dir mount
TartVM->>TartVM: vm-agent polls shared/requests/ (Homebrew PATH)
Host->>TartVM: lsappinfo visibleProcessList (readiness check)
Host->>TartVM: screencapture -x /tmp/screenshot.png
TartVM->>SSH: ssh -o BatchMode=yes localhost /usr/local/bin/a11y-helper
SSH->>A11y: AXUIElement queries (full Aqua session via sshd-keygen-wrapper)
A11y-->>SSH: 976+ lines UI element tree
SSH-->>Host: a11y tree (via shared dir)
Host->>TartVM: execute-action (PyAutoGUI via Homebrew python3)
Reviews (4): Last reviewed commit: "docs: update ci.md for macOS support (no..."
- Deduplicate authorized_keys: check before appending SSH public key to prevent accumulation on re-provisioning runs
- Escape single quotes in SQL variables: prevent malformed SQL if paths from `command -v python3` ever contain single quotes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
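The two fixes in this commit can be sketched as pure functions. This is a minimal illustration under assumptions, not the code from init_macos.rs: the actual fixes live in a generated shell provisioning script, and the names `sql_quote` and `append_key_once` are hypothetical.

```rust
/// Wrap `value` in a single-quoted SQL string literal, doubling any embedded
/// single quotes so that a path such as /opt/it's/python3 cannot end the
/// literal early.
pub fn sql_quote(value: &str) -> String {
    format!("'{}'", value.replace('\'', "''"))
}

/// Return the authorized_keys content with `key` appended on its own line,
/// unless an identical line is already present. Repeated provisioning runs
/// therefore leave the content unchanged.
pub fn append_key_once(existing: &str, key: &str) -> String {
    let key = key.trim();
    if existing.lines().any(|line| line.trim() == key) {
        return existing.to_string();
    }
    let mut out = existing.to_string();
    // Make sure the new key starts on a fresh line.
    if !out.is_empty() && !out.ends_with('\n') {
        out.push('\n');
    }
    out.push_str(key);
    out.push('\n');
    out
}
```

Keeping both operations free of I/O makes the idempotency property easy to check: applying `append_key_once` twice with the same key yields the same string as applying it once.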
rustfmt produces different method chain formatting on x86_64-linux vs aarch64-darwin for the same Rust version (1.94.1). Add #[rustfmt::skip] to pin the format that CI (linux) expects. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ence
The long expect strings caused rustfmt to produce different chain formatting on x86_64-linux vs aarch64-darwin (same Rust 1.94.1). Shortened the messages to stay well under the threshold where both platforms agree on the chain style.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shorten expect message in the Hybrid evaluator block that came from master's new monitor code, matching the style used elsewhere. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The provisioning script now ends with `sudo shutdown -h now` so the guest OS flushes TCC DB writes, SSH keys, and other artifacts to disk before the host clones the image. The Rust side waits up to 60s for the `tart run` child to exit naturally (guest powered off) before falling back to force-kill. This prevents the race where `tart stop` + `child.kill()` could interrupt the VM mid-shutdown. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
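The wait-then-force-kill behaviour described in this commit can be sketched with the standard library alone. This is an assumption-laden illustration rather than the code in the PR: `wait_or_kill` and `Shutdown` are hypothetical names, and the polling interval is a parameter chosen for the sketch. The source states only that the Rust side waits up to 60s for the `tart run` child before falling back to a force-kill.

```rust
use std::io;
use std::process::{Child, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// How the child process ended.
#[derive(Debug, PartialEq)]
pub enum Shutdown {
    /// The child exited on its own before the deadline.
    Graceful(ExitStatus),
    /// The deadline passed and the child was killed.
    Forced,
}

/// Poll `child` until it exits or `timeout` elapses, then kill it as a fallback.
pub fn wait_or_kill(child: &mut Child, timeout: Duration, poll: Duration) -> io::Result<Shutdown> {
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait returns Ok(None) while the child is still running.
        if let Some(status) = child.try_wait()? {
            return Ok(Shutdown::Graceful(status));
        }
        if Instant::now() >= deadline {
            child.kill()?;
            child.wait()?; // reap the process so it does not linger as a zombie
            return Ok(Shutdown::Forced);
        }
        thread::sleep(poll);
    }
}
```

With the 60s budget from the commit message, a call might look like `wait_or_kill(&mut child, Duration::from_secs(60), Duration::from_millis(500))`. Returning an enum lets the caller log whether the guest powered off cleanly, which matters here because only a graceful shutdown guarantees the guest flushed its writes.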
command -v python3 resolves to /usr/bin/python3 (macOS stub) during provisioning because /etc/paths.d/homebrew hasn't taken effect yet. The vm-agent uses /opt/homebrew/bin/python3, so TCC grants must target that binary. Prefer the well-known Homebrew path with a fallback to command -v. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
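The "prefer the well-known path, fall back to lookup" rule from this commit is small enough to sketch directly. The function name `resolve_python` is hypothetical, and the PR applies the rule inside a shell provisioning script, not in Rust; the sketch only shows the selection logic.

```rust
use std::path::{Path, PathBuf};

/// Pick the interpreter that TCC grants should target: the preferred path if
/// it exists as a file, otherwise whatever the PATH lookup returned (if anything).
pub fn resolve_python(preferred: &Path, path_lookup: Option<PathBuf>) -> Option<PathBuf> {
    if preferred.is_file() {
        Some(preferred.to_path_buf())
    } else {
        path_lookup
    }
}
```

In the scenario from the commit message, `preferred` would be `/opt/homebrew/bin/python3` and `path_lookup` would carry the result of `command -v python3`, so the macOS stub at `/usr/bin/python3` is used only when Homebrew Python is absent.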
…infra
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TCC permissions section: note that init-macos handles grants automatically, list all four permission types granted
- TCC Database Setup: rewrite with csreq blob generation (old instructions used NULL csreq which modern macOS ignores), add grant_tcc helper function, document Homebrew Python path
- Add SSH localhost section explaining why a11y-helper needs it (LaunchAgent restricted Aqua session)
- Update golden image saving: document graceful shutdown requirement
- Update limitations table: a11y tree is no longer "limited"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- macOS requirements: remove "planned" label, add sshpass and init-macos details, mention claude-cli provider
- App types table: macos_tart no longer marked as planned
- Architecture diagram: show Linux and macOS paths side by side
- CLI commands: add init-macos

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Mark all 5 phases as complete
- Add "Post-Phase 5: E2E Infrastructure Fixes" section documenting issues discovered during first real E2E run on Apple Silicon
- Document the SSH localhost workaround for LaunchAgent Aqua sessions
- Update Phase 4 readiness: osascript replaced with lsappinfo
- Update risks table with newly discovered risks and mitigations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove "Planned" label from macOS section
- GitHub Actions example: add sshpass install, use desktest init-macos instead of manual tart pull, add caching tip
- Golden Image section: rewrite to document init-macos automated provisioning instead of manual setup steps

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Fixes the macOS Tart VM E2E testing infrastructure so that the full pipeline works end-to-end: VM boot → desktop readiness → app deploy → agent loop (with accessibility tree) → evaluation → artifact collection.
These changes were developed and tested on an Apple Silicon Mac mini running macOS 26.2 with Tart 2.32.0.
What was done
1. Replace `osascript` with `lsappinfo` for GUI process detection (`readiness.rs`)
- `get_gui_process_list()` used `osascript` → System Events, which requires TCC Automation permission
- Replaced with `lsappinfo visibleProcessList`, which provides the same information without any TCC permissions

2. Enable accessibility tree extraction via SSH localhost (`observation.rs`)
- The vm-agent LaunchAgent gets a restricted Aqua session and returns an empty accessibility tree; SSH sessions inherit full TCC permissions from `sshd-keygen-wrapper` (pre-granted in Tart base images)
- `MACOS_A11Y_CMD` now uses `ssh -o BatchMode=yes localhost /usr/local/bin/a11y-helper`; this yields 976+ lines of UI element data vs ~22 empty lines before

3. Fix golden image provisioning (`init_macos.rs`)
- `execute-action.py`: the PyAutoGUI executor script was never copied to the VM, so the agent loop couldn't execute any actions
- Passwordless SSH: `ed25519` keys for localhost, required by the SSH-based a11y extraction
- TCC grants with proper `csreq` blobs (generated via `codesign -d -r-` + `csreq` tool) for a11y-helper, Python, and screencapture
- Homebrew PATH: add `/opt/homebrew/bin` to `/etc/paths.d/homebrew` so all processes find the right `python3`

4. Fix vm-agent LaunchAgent PATH (`vm-agent-install.sh`)
- The LaunchAgent plist had no `EnvironmentVariables`, so the default PATH (`/usr/bin:/bin:/usr/sbin:/sbin`) was used
- Subprocesses (`python3 /usr/local/bin/execute-action`) resolved to system Python, which doesn't have PyAutoGUI
- Added `EnvironmentVariables` with Homebrew PATH to the plist

5. Make a11y-helper trust check non-fatal (`main.swift`)
- `AXIsProcessTrustedWithOptions` returns `false` in LaunchAgent context even when actual AX API calls succeed
- Changed from `exit(1)` to a stderr warning; the tree extraction continues and produces data

Issues encountered and resolved
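| Issue | Root cause | Fix |
| --- | --- | --- |
| `osascript` hangs in VM | System Events requires TCC Automation permission, which is not pre-granted in Tart VMs | Replaced with `lsappinfo` (no TCC needed) |
| Accessibility tree empty | vm-agent LaunchAgent gets a restricted Aqua session | Run a11y-helper via `ssh localhost`, which gets proper session |
| `execute-action` not found | `docker/execute-action.py` was never copied during provisioning | Added copy to `provision_vm()` and install step to provisioning script |
| Subprocesses resolve to system Python | LaunchAgent plist had no `EnvironmentVariables`, so the default PATH was used | Added `EnvironmentVariables` with Homebrew PATH to LaunchAgent plist |
| TCC grants ignored | Grants used a NULL `csreq` column; modern macOS requires code signing requirement blobs | Generate via `codesign -d -r-` + `csreq` tool and insert as hex blob |
| `tart stop` doesn't flush VM filesystem | TCC DB writes and SSH keys may not reach disk before the host clones the image | `sudo shutdown -h now` for graceful shutdown before `tart clone` |
| `AXIsProcessTrustedWithOptions` false positive | Returns `false` in LaunchAgent context even when AX API calls succeed | Changed `exit(1)` to a stderr warning |

What is NOT solved yet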
- `claude-cli` provider: with full a11y data (~976 lines), Claude CLI calls sometimes exceed 60s, causing retries and eventual timeout. Needs a configurable or provider-aware timeout.
- `/home/tester` artifact collection: `artifacts.rs` still tries to collect from Linux home dir path for Tart sessions. Should be skipped or use `/Users/admin`.
- `desktest-macos:latest` has no version tagging strategy, risking drift.
- `desktest-macos-electron:latest` doesn't exist yet (would need `--with-electron` flag during init).

Test plan
- `cargo test`: all 524 unit tests + 3 validation tests pass
- `doctor_shows_tart_status`: passes on Apple Silicon with Tart installed
- `desktest run examples/macos-textedit.json --provider claude-cli`: infrastructure works (VM boots, a11y tree populated with 976+ lines, agent loop runs, evaluation works). Test fails at agent level (can't complete TextEdit task), not infrastructure.
- `desktest init-macos` fresh provisioning: not re-run after code changes (would take ~10 min to pull + provision). Manual provisioning verified all steps work individually.

🤖 Generated with Claude Code