Skip to content

feat: add PlannerGrounderAgent for dual-model GUI automation#134

Merged
abrichr merged 4 commits intomainfrom
feat/planner-grounder-agent
Mar 18, 2026
Merged

feat: add PlannerGrounderAgent for dual-model GUI automation#134
abrichr merged 4 commits intomainfrom
feat/planner-grounder-agent

Conversation

@abrichr
Copy link
Copy Markdown
Member

@abrichr abrichr commented Mar 18, 2026

Summary

PlannerGrounderAgent composes a planner and a grounder for dual-model GUI automation:

  • Planner sees screenshot + accessibility tree → high-level instruction
  • Grounder sees screenshot + instruction → pixel coordinates
  • Supports BenchmarkAgent instances, VLM API calls, or HTTP endpoints for each role
  • Action history tracking for planner context
  • DONE/FAIL handling, grounder retry with simplified prompt

Usage

agent = PlannerGrounderAgent(
    planner="claude-sonnet-4-6",
    grounder="http",
    grounder_endpoint="http://gpu-server:8000/v1",
)

Test plan

  • 23 tests passing (pipeline, history, retry, HTTP, mixed types, a11y tree)

🤖 Generated with Claude Code

abrichr and others added 4 commits March 18, 2026 16:10
Implements the planner-grounder architecture from the GUI agent
literature (SeeAct ICML 2024, UFO2 2025, CODA 2025):

- Planner sees screenshot + a11y tree, outputs high-level instruction
- Grounder sees screenshot + instruction, outputs pixel coordinates
- Supports agent instances, VLM API calls, or HTTP endpoints
- Action history tracking, DONE/FAIL handling, grounder retry
- Registered in agent registry

23 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- HTTP grounder now uses OpenAI chat completions API format (compatible
  with vLLM, Ollama, any OpenAI-compatible server)
- Sends screenshots as base64-encoded images
- serve_grounder.sh: start UI-Venus-1.5-8B via vLLM
- run_planner_grounder.py: full experiment script (Claude planner +
  UI-Venus grounder against WAA VM)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nderAgent

Fixes from first live experiment (Claude planner + UI-Venus grounder
on WAA VM):

1. Parse planner instructions for type/key/scroll actions — these
   bypass the grounder (which only returns click coordinates)
2. Planner prompt now requires ONE ATOMIC action per step (no
   compound "click X and type Y")
3. Grounder bbox parser handles UI-Venus [x1,y1,x2,y2] format,
   JSON format, and coordinate pairs
4. Float conversion for coordinates in run script and base.py
5. Added UI-Venus RL training review doc

Experiment result: planner correctly navigated Start → Notepad →
text area. Grounder returned accurate bounding boxes. Typing failed
because compound instructions weren't decomposed — now fixed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four fixes from live experiment:

1. Key actions with modifiers (Ctrl+A) now use pyautogui.hotkey()
   instead of press(). Parser stores modifiers separately on
   BenchmarkAction. Adapter handles both modifier+key combos and
   legacy "ctrl+a" string format.

2. Planner prompt now includes anti-loop rule: "If your last 3
   actions were the same, try a completely different approach."

3. Logging shows planner instruction correctly (was showing "?").

4. --save-screenshots flag saves PNGs at each step for debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abrichr abrichr merged commit a6eac6e into main Mar 18, 2026
1 check passed
abrichr added a commit that referenced this pull request Mar 20, 2026
… training features

Covers ~20 PRs merged since March 17 (#134-#157): PlannerGrounderAgent dual-model
architecture, TaskConfig YAML custom tasks, 4-pass workflow extraction pipeline,
RL training infra (TRL GRPO rollout, AReaL workflow, OpenEnv), LocalAdapter +
ScrubMiddleware for governed desktop agent, correction flywheel, strict mode,
and task setup dispatch. Updated architecture tree and key files table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abrichr added a commit that referenced this pull request Mar 20, 2026
… training features (#158)

Covers ~20 PRs merged since March 17 (#134-#157): PlannerGrounderAgent dual-model
architecture, TaskConfig YAML custom tasks, 4-pass workflow extraction pipeline,
RL training infra (TRL GRPO rollout, AReaL workflow, OpenEnv), LocalAdapter +
ScrubMiddleware for governed desktop agent, correction flywheel, strict mode,
and task setup dispatch. Updated architecture tree and key files table.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abrichr added a commit that referenced this pull request Mar 20, 2026
…, and demo-guided execution

Add detailed usage documentation for all major features added in PRs #134-#165:

- Docker/WAA Container: --cap-add NET_ADMIN requirement, full docker run
  command, boot timeline, port reference
- Full Evaluation Runner (run_full_eval.py): all flags with defaults,
  example commands for smoke test, resume, parallel, HTTP grounder
- Distillation Pipeline: two-step workflow (collect_distillation_data.py
  + finetune_distilled.py), all flags, mock validation mode
- Demo-Guided Execution: DemoLibrary API, DemoGuidedAgent with self-
  verification, recording workflow
- Task Setup Config Types: all 15 supported types with example params
- Strict Mode: ScrubMiddleware, workflow pipeline, WAALiveAdapter
- Pool Execution: external agent_factory support via PoolManager.run()
- Updated Quick Start with copy-pasteable sequence
- Updated Architecture tree with new files (demo_library, demo_guided_agent,
  scripts/)
- Updated Key Files table with new entries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abrichr added a commit that referenced this pull request Mar 20, 2026
…, and demo-guided execution (#166)

Add detailed usage documentation for all major features added in PRs #134-#165:

- Docker/WAA Container: --cap-add NET_ADMIN requirement, full docker run
  command, boot timeline, port reference
- Full Evaluation Runner (run_full_eval.py): all flags with defaults,
  example commands for smoke test, resume, parallel, HTTP grounder
- Distillation Pipeline: two-step workflow (collect_distillation_data.py
  + finetune_distilled.py), all flags, mock validation mode
- Demo-Guided Execution: DemoLibrary API, DemoGuidedAgent with self-
  verification, recording workflow
- Task Setup Config Types: all 15 supported types with example params
- Strict Mode: ScrubMiddleware, workflow pipeline, WAALiveAdapter
- Pool Execution: external agent_factory support via PoolManager.run()
- Updated Quick Start with copy-pasteable sequence
- Updated Architecture tree with new files (demo_library, demo_guided_agent,
  scripts/)
- Updated Key Files table with new entries

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant