BillJr99/AutoGUI

AutoGUI Desktop Agent

AutoGUI provides desktop automation in two forms:

  1. A standalone Python CLI/TUI agent that connects any OpenWebUI instance to your desktop.
  2. A native TypeScript Pi Coding Agent extension in pi-extension/ that lets Pi own the agent workflow while AutoGUI supplies desktop tools.

The standalone agent drives a ReAct-style loop (Reason → Act → Observe → repeat) and can run shell commands, read/write files, take screenshots, click, type, launch programs, and inspect accessibility trees — all via function-calling with any model available in your OpenWebUI instance.

The standalone Python agent architecture follows UFO and open-interpreter but is vendor-neutral: any model that supports OpenAI-compatible tool calling and is registered in your OpenWebUI works out of the box.

The Pi extension is decoupled from OpenWebUI. It uses whatever model/provider Pi is configured to use and exposes desktop tools plus /autogui.

⚠ Experimental Software — Use in a Sandbox

AutoGUI is a research prototype. It is not intended for, nor evaluated or deemed suitable for, any particular production use or critical workload. No warranty is provided, express or implied.

The agent operates at OS level: it can run shell commands, click anything, type anywhere, read and write files, and take screenshots. Run AutoGUI only in a sandbox, VM, or container that you are willing to reset. Restrict the REST API to loopback (AUTOGUI_API_HOST=127.0.0.1) and consider disabling shell access ("allowed_shell": false) if you do not fully trust the task or the model driving it. See the Security Notes section for further guidance.
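A minimal locked-down setup might combine both guards: bind the API to loopback with AUTOGUI_API_HOST=127.0.0.1 in the environment, and disable shell access in config.json (placing allowed_shell under a tools section is an assumption here, mirroring the documented tools.allowed_browser flag):

```json
{
  "tools": {
    "allowed_shell": false
  }
}
```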


Features

Category What it does
Planner One LLM call up front produces a numbered plan; the executor follows it across the ReAct loop. Can be disabled; on by default
ReAct loop Reason → tool call → observe result → repeat, up to configurable iteration limit
Shell Run any shell command with timeout, destructive-pattern guard, and confirmation delay
Filesystem Read, write (or append), and list files/directories; optional pre-overwrite snapshots
Desktop (pixel) Screenshot, click, double-click, type text, hotkeys, scroll, launch apps, list windows
Desktop (a11y-first) desktop_click_element(name, …) clicks real UI controls via UIAutomation (Windows) and AT-SPI (Linux) — no pixel guessing; macOS AX supported in the Pi extension only
Set-of-Mark grounding Numbered overlay on detected elements; the model clicks by id (desktop_click_mark) instead of pixel coords
Click-by-text OCR-anchored click (desktop_click_text); install Tesseract via scripts/install-dependencies.*
Browser (Playwright) First-class Chromium driver: real DOM/ARIA selectors, browser_click, browser_fill, browser_eval — opt-in via allowed_browser
Native input (Windows) click/type_text/hotkey go through user32.SendInput directly (real INPUT events, correct DPI, full Unicode)
Best-of-N sampling On uncertain steps (recent failure or non-APPROVED validator), sample N candidates and pick via self-consistency or a verifier model
Skill library skill_save/skill_list/skill_run: persist successful tool sequences and retrieve them by keyword on the next task. skill_list / skill_run are always available, so existing libraries are readable; skill_save (creation) is gated by agent.skills_enabled (default false). Each side keeps its own library: standalone agent uses ./skills/, Pi extension uses pi-extension/runtime/skills/.
Trajectory replay Per-session JSONL trace + replay.py re-runs any saved skill or trace deterministically (no LLM)
Failure recording Rolling 5-second screen buffer dumps to an animated GIF on tool failure
State diff & modal flag Pre/post-action window-set diff with an [UNEXPECTED MODAL: …] banner when an error/permission/confirm dialog appears
Dry-run mode Stub all state-changing tools while keeping observation tools live
Action scoping allowed_apps + blocked_window_titles enforced before every GUI action
Platform-aware prompts Auto-injects OS-specific instructions (WSL .exe, where.exe, which, etc.)
Startup validation Checks API key and model against the live server; prompts to fix or save
Live model picker Ctrl+P → "Change Model" fetches the live model list; select and optionally persist to config
Safety countdown N-second delay before each tool call; Escape cancels during the window
Hallucination guard Detects when the model narrates actions without calling tools; re-prompts
Error retry Failed tool calls inject a mandatory-retry message into the history
Step verification System prompt instructs the model to verify each result and self-continue
TUI Textual-based interactive session with status bar, tool visibility toggle, history save
CLI Single-command non-interactive mode for scripting and automation
REST API FastAPI HTTP server for programmatic task submission and live event streaming

Architecture

main.py             Entry point — argparse, validation, component wiring, TUI/CLI dispatch
│
├── config.json     Runtime configuration (URL, model, safety, logging, TUI settings)
│
├── client.py       Async OpenWebUI API client (aiohttp, OpenAI-compatible)
│   ├─ chat()           POST /api/chat/completions (or custom api_path)
│   ├─ fetch_models()   GET /api/models — used for validation and model picker
│   └─ health_check()   Connectivity probe
│
├── platform_detect.py  Detect OS/display environment (WSL, Wayland, X11, macOS, Windows)
│
├── backends/           Platform-specific desktop automation backends
│   ├─ base.py          pyautogui baseline (screenshot, click, type, hotkey, scroll)
│   ├─ wsl.py           WSLg display + PowerShell for window list and launch
│   ├─ windows.py       PowerShell + optional uiautomation (accessibility tree)
│   ├─ macos.py         screencapture, osascript, open -a
│   ├─ linux_x11.py     xdotool type override, wmctrl for windows
│   └─ linux_wayland.py grim, ydotool, swaymsg
│
├── tools.py        Tool registry and shell/filesystem implementations
│   ├─ shell_run             Shell command with timeout and destructive guard
│   ├─ fs_read / fs_write / fs_list
│   ├─ desktop_screenshot / click / type / hotkey / scroll / launch / list_windows
│   ├─ desktop_find_element  (Windows UIAutomation, Linux AT-SPI, WSL)
│   ├─ desktop_click_element (a11y-first click — same backends as find_element)
│   ├─ desktop_click_text    (OCR / a11y text match)
│   ├─ desktop_screenshot_marked / desktop_click_mark  (Set-of-Mark)
│   ├─ skill_save / skill_list / skill_run  (persistent macros)
│   ├─ browser_navigate / click / fill / press / get_text / screenshot / eval
│   │   (Playwright; registered when allowed_browser=true)
│   ├─ desktop_get_window_tree (Windows)
│   └─ ToolRegistry          JSON Schema catalog + async dispatch
│
├── agent.py        Agentic loop
│   ├─ Agent.run(input)      Async generator → AgentEvent stream
│   ├─ Agent.reset()         Clear conversation history
│   └─ Guardrails: hallucination detection, error-retry injection, step-continue
│
├── api.py          REST API server (FastAPI)
│   ├─ POST /api/task        Submit task → task_id
│   ├─ GET  /api/task/{id}   Poll task state + steps
│   ├─ GET  /api/task/{id}/stream  SSE live event stream
│   ├─ POST /api/task/{id}/cancel  Cancel running task
│   └─ GET  /api/healthz     Liveness probe
│
├── dry_run.py      DryRunAgent — canned events, no desktop needed
│
├── tui.py          Textual TUI
│   ├─ AgentTUI              Main app (status bar, conversation log, input)
│   ├─ HelpScreen            F1 modal — key bindings + tool list
│   ├─ _ModelPickerCommand   Ctrl+P palette → "Change Model"
│   └─ ModelPickerScreen     Modal — live model list with optional config save
│
└── logs/
    agent.log       Rotating log file
    history.jsonl   Saved conversation history (Ctrl+S in TUI)

Agentic Loop

User input
    │
    ▼
Append to message history
    │
    ▼
POST history + tool schemas → OpenWebUI
    │
    ├─ finish_reason == "stop"
    │       └─ Check for narrated actions (hallucination guard)
    │          ├─ Narration detected → inject correction, continue loop
    │          └─ Genuine stop → emit "done"
    │
    ├─ finish_reason == "tool_calls"
    │       └─ For each tool call:
    │            ├─ Safety countdown (N seconds, Escape to cancel)
    │            ├─ dispatch(tool_name, args) → result_json
    │            ├─ If error → append [AGENT POLICY] retry message to history
    │            └─ Append role="tool" result
    │          Loop (up to max_iterations)
    │
    └─ finish_reason == "length" → emit warning + "done"
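The flow above condenses to a short dispatch skeleton (illustrative Python: chat and dispatch stand in for client.chat() and the ToolRegistry, and the call format is simplified from the OpenAI tool-call schema):

```python
import json

def run_agent(chat, dispatch, history, max_iterations=10):
    # Sketch of the loop above; the real agent.py also runs the
    # hallucination guard, safety countdown, and error-retry injection.
    for _ in range(max_iterations):
        reply = chat(history)                       # POST history + tool schemas
        history.append(reply["message"])
        if reply["finish_reason"] == "tool_calls":
            for call in reply["message"]["tool_calls"]:
                result = dispatch(call["name"], call["args"])
                history.append({"role": "tool", "content": json.dumps(result)})
        elif reply["finish_reason"] == "length":
            return "done (context length hit)"
        else:                                       # "stop": genuine completion
            return "done"
    return "max iterations reached"
```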

Setup

Prerequisites

Python 3.10+

python --version

System packages (Linux/WSL only)

# X11 desktop tools
sudo apt install python3-tk python3-dev wmctrl xdotool

# Wayland desktop tools
sudo apt install ydotool grim    # swaymsg ships with the sway compositor itself
sudo ydotoold &                  # start the ydotool daemon

macOS and Windows require no additional system packages.

Python packages

pip install -r requirements.txt

Optional platform-specific packages:

# Windows: accessibility tree (find elements by name, not pixel position)
pip install uiautomation pywin32

# macOS: richer window metadata
pip install pyobjc-framework-Quartz pyobjc-framework-AppKit

Optional dependencies — install scripts

Optional features (Tesseract for click-by-text, Playwright + Chromium for the browser tools, Linux AT-SPI for desktop_click_element, ImageMagick for Set-of-Mark overlays + failure GIFs, plus a few platform-specific pip packages) are installed by one script per OS under scripts/:

OS Script
Linux / macOS / WSL bash scripts/install-dependencies.sh
Windows scripts\install-dependencies.cmd (cmd shim)
Windows (PowerShell) powershell -ExecutionPolicy Bypass -File scripts\install-dependencies.ps1

Each script:

  • detects its OS, package manager (apt/dnf/pacman/zypper/brew/winget), and display server (X11 vs Wayland on Linux);
  • skips dependencies that are already installed (idempotent);
  • echoes every command before running it (loud by design);
  • installs the Python deps from requirements.txt, plus the optional ones (pyperclip, pytesseract, playwright, pyobjc-framework-Quartz on macOS, uiautomation + pywin32 on Windows);
  • runs python -m playwright install chromium;
  • if pi-extension/ exists, also runs npm install (which picks up playwright from optionalDependencies) and npx playwright install chromium inside it.

Either run the script manually before launch, or set this config flag and AutoGUI will invoke it once at startup before initialising the registry:

{ "install_dependencies": true }

The flag is at the top level of config.json (not under agent/tools). Default is false so unmodified setups don't install anything.

Manual single-package install if you only want one of the optional deps:

# Tesseract (for desktop_click_text / desktop_find_text)
sudo apt install tesseract-ocr   # Debian/Ubuntu/WSL
sudo dnf install tesseract       # Fedora
brew install tesseract           # macOS
winget install UB-Mannheim.TesseractOCR   # Windows
pip install pytesseract

# Playwright + Chromium (for browser_* tools)
pip install playwright
python -m playwright install chromium

# Linux a11y for desktop_click_element
sudo apt install python3-pyatspi gir1.2-atspi-2.0

# ImageMagick (for Set-of-Mark overlay + failure GIFs)
sudo apt install imagemagick     # Linux
brew install imagemagick         # macOS
winget install ImageMagick.ImageMagick   # Windows

Planner (vs. dry-run — they're different things)

Planner = one extra LLM call at the start of each task that produces a numbered, high-level plan (3–8 steps describing goals, not specific clicks). The plan is injected as a [PLAN] block into the executor's context so every subsequent decision has the full trajectory in mind. Configured under agent.planner.enabled, defaults on. Same OpenWebUI client as the rest of the agent — one extra round-trip per task; no separate model required.

Dry-run (safety.dry_run: true) is a safety stub, not a planner — it returns {dry_run: true, would_execute: …} for every state-changing tool while leaving the real screen unchanged. Useful for "rehearse a task without touching anything", but not as a plan-then-execute mechanism: the executor would think each step succeeded, observe the unchanged real screen, and tie itself in knots over the contradiction.

If you want plan-first-then-execute semantics, leave dry-run off and keep the planner on — that's exactly what the planner does.

You can turn the planner off entirely with agent.planner.enabled: false if you don't want the extra round-trip while debugging.
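Spelled out as config.json keys (a sketch; the dotted names above map onto nested objects):

```json
{
  "agent": { "planner": { "enabled": true } },
  "safety": { "dry_run": false }
}
```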

A11y-first clicking (most reliable)

desktop_click_element(name=…, control_type=…) talks to the real UI control by name/role via the OS accessibility API instead of clicking at a guessed pixel position. Prefer this over desktop_click whenever the target has a visible label — it survives DPI scaling, window moves, and async UI redraws.

Platform Backend used Install
Windows UIAutomation (uiautomation pkg) pip install uiautomation pywin32
macOS Not available (Pi extension only) Use Pi extension for macOS AX element clicking
Linux X11 AT-SPI 2 (pyatspi) sudo apt install python3-pyatspi gir1.2-atspi-2.0
Linux Wayland AT-SPI 2 (pyatspi) same as X11

When the a11y backend isn't available the fallback ladder is: desktop_click_text (OCR/a11y text match) → desktop_click_mark (Set-of-Mark) → desktop_click(x, y). The agent's system prompt encourages the model to walk this ladder.
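The ladder can be pictured as a simple loop; a hedged sketch where call_tool is a hypothetical dispatcher (desktop_click_mark is skipped here because it needs a prior marked screenshot):

```python
def click_target(call_tool, name, x=None, y=None):
    # Walk the fallback ladder from most to least reliable.
    ladder = [
        ("desktop_click_element", {"name": name}),   # a11y-first
        ("desktop_click_text",    {"text": name}),   # OCR / a11y text match
        ("desktop_click",         {"x": x, "y": y}), # raw pixel fallback
    ]
    for tool, args in ladder:
        if call_tool(tool, args).get("ok"):
            return tool                              # which rung succeeded
    return None
```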

Browser automation (Playwright)

Set tools.allowed_browser: true to enable the browser_* tool family — a Playwright-driven Chromium that the agent can navigate, inspect, and interact with via real DOM/ARIA selectors instead of pixel coordinates.

Playwright + Chromium are installed by scripts/install-dependencies.* (see "Optional dependencies — install scripts" above). Either run the script once manually, or set install_dependencies: true at the top of config.json to have AutoGUI run it at startup. Until they're present, the browser_* tools register but return a clear error pointing back at the install script.

Selectors follow Playwright syntax:

  • CSS: button.primary, #login-form input[name="email"]
  • Text: text=Sign in, text=/^Continue$/i
  • ARIA role: role=button[name="Sign in"]
  • XPath: xpath=//button[contains(.,"Sign in")]

Use browser.user_data_dir to point at a persistent profile if you want logins/cookies to survive restarts.
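In config.json that looks roughly like this (the user_data_dir value is a placeholder):

```json
{
  "tools": { "allowed_browser": true },
  "browser": { "user_data_dir": "browser-profile" }
}
```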

Native input on Windows

On Windows, desktop_click / desktop_type / desktop_hotkey go through user32.SendInput directly via ctypes when available. This gives you real INPUT events (indistinguishable from a physical keyboard/mouse), correct per-monitor DPI behaviour, and full Unicode text input via KEYEVENTF_UNICODE. Falls back to pyautogui if SendInput initialisation fails. No configuration required.

Typing reliability

desktop_type always tries clipboard paste first (one event, arbitrary length, perfect Unicode), with the platform-correct modifier — Cmd+V on macOS, Ctrl+V everywhere else. Only when the clipboard path fails or pyperclip isn't installed does it fall back to per-character keystrokes. Per-platform fallbacks:

  • Windows / WSL: SendInput KEYEVENTF_UNICODE with a 5 ms inter-event pause and 15 ms inter-character pause (slow targets used to drop keys at the previous 0 ms cadence).
  • Linux X11: xdotool type --clearmodifiers --delay 30 (the default 12 ms cadence sometimes loses keys on slow targets, producing artefacts like hello worldhello ddddd).
  • Linux Wayland: ydotool type --key-delay 20 with a fallback to plain ydotool type for older versions.

Every typing call is now logged at INFO level with the actual text (truncated to 60 chars) and the method that ran, so if a target app keeps losing keys you can see which path is being used.

The clipboard-paste path saves the user's clipboard before pasting and restores it afterward, so automation doesn't clobber whatever the user had copied.
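The save/paste/restore pattern is easy to picture with the platform specifics factored out; this is a sketch with injected copy/read/paste callables, not AutoGUI's actual functions:

```python
def type_via_clipboard(copy_fn, read_fn, send_paste_keys, text):
    """Paste `text` as one event, leaving the user's clipboard untouched."""
    saved = read_fn()              # save whatever the user had copied
    try:
        copy_fn(text)              # put the payload on the clipboard
        send_paste_keys()          # Ctrl+V (Cmd+V on macOS), one event
    finally:
        copy_fn(saved)             # always restore the original contents
```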

Best-of-N action sampling

Optional. Set agent.bon.enabled: true and on uncertain steps the agent will sample N candidate completions from the primary model in parallel, then choose between them via:

  1. Self-consistency — if a strong majority propose the same first tool + arg signature, that's the pick (no extra call).
  2. Verifier — otherwise the same OpenWebUI client is given a one-line summary of each candidate (no tools attached) and asked to return only the index of the best one.
  3. Fallback — any failure path picks the first viable candidate, so BoN can never make the agent worse than baseline.
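Steps 1 and 3 are deterministic and can be sketched directly (step 2's verifier call is elided; the names and the 60% majority threshold are illustrative, not AutoGUI's internals):

```python
import json
from collections import Counter

def pick_action(candidates, majority=0.6):
    # Signature = first proposed tool + canonicalised args.
    sigs = [(c["tool"], json.dumps(c["args"], sort_keys=True)) for c in candidates]
    (top_sig, votes), = Counter(sigs).most_common(1)
    if votes / len(candidates) >= majority:
        return candidates[sigs.index(top_sig)]  # self-consistency pick, no extra call
    # A verifier model would be consulted here; on any failure path,
    # fall back to the first viable candidate (never worse than baseline).
    return candidates[0]
```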

Triggers (also configurable):

  • trigger_on_recent_failure — last iteration had a failed tool.
  • trigger_on_validator_disagreement — last validator verdict was not APPROVED.

Cost: 3–5× tokens on triggered steps. Spend nothing on confident steps. Defaults are conservative; turn it on when you're chasing the last ~20% of accuracy.

Failure recording

A daemon thread maintains a rolling 5-second screen buffer at 5 fps. On any failed tool call the buffer is flushed to an animated GIF under screenshots/failures/, and a failure_recording event is emitted with the path. Defaults to on; tune via agent.screen_record.
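The buffer mechanics are simple: hold the last seconds × fps frames in a bounded deque and flush on failure (frame capture and GIF encoding are stubbed out; names are illustrative):

```python
from collections import deque

class RollingScreenBuffer:
    def __init__(self, seconds=5, fps=5):
        # Oldest frames fall off automatically once the buffer is full.
        self.frames = deque(maxlen=seconds * fps)

    def push(self, frame):
        self.frames.append(frame)          # called fps times per second

    def flush(self):
        # On tool failure: hand the last N frames to the GIF encoder.
        dumped = list(self.frames)
        self.frames.clear()
        return dumped
```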

Set-of-Mark grounding

When vision is enabled, the agent uses Set-of-Mark screenshots: numbered boxes are drawn over detected UI elements, and the model clicks by ID via desktop_click_mark(mark_id) instead of guessing pixel coordinates. The marks come from the OS accessibility tree where available (Windows UIAutomation, macOS) and from window rects elsewhere. No setup required — it's on by default.

Configuration

cp config.json.example config.json

Edit config.json:

{
  "openwebui": {
    "base_url": "http://localhost:3000",
    "api_key": "sk-your-key-from-openwebui-settings",
    "model": "llama3.1:70b"
  }
}

Your API key: OpenWebUI → Settings → Account → API Keys.

The model string must match exactly what appears in your OpenWebUI model list. If the configured model name is wrong, startup validation offers a numbered menu so you can pick the right one.

Bypassing OpenWebUI — connecting directly to Ollama

If you prefer to skip the OpenWebUI proxy (for example, if the model you want isn't configured for tool-calling in OpenWebUI, or you don't have admin access to change that setting), set api_path to /v1/chat/completions and point base_url at your Ollama instance:

{
  "openwebui": {
    "base_url": "http://localhost:11434",
    "api_path": "/v1/chat/completions",
    "api_key": "",
    "model": "qwen3:14b"
  }
}

Ollama exposes an OpenAI-compatible completions endpoint at /v1/chat/completions that AutoGUI targets directly — no OpenWebUI installation required, and no API key needed. The openwebui config section name is kept for backwards compatibility; it works for any OpenAI-compatible endpoint regardless of whether OpenWebUI is involved.

Verify connectivity

python main.py --check

Prints connection status, configured model, and registered tool list.


REST API

AutoGUI ships a FastAPI REST server that wraps the Agent class, making it accessible to web UIs, scripts, and CI pipelines without the TUI.

Install API dependencies

pip install -r requirements.txt   # fastapi and uvicorn are already included

Start the server

The REST API starts automatically in the background whenever you run python main.py (any mode — TUI or single-command). A [autogui] REST API listening on http://… banner is printed to stderr at startup. You can also start it standalone:

# With config.json present:
python api.py
# Listening on http://0.0.0.0:8002

# Without a config file — use environment variables:
OPENWEBUI_BASE_URL=http://localhost:3000 \
OPENWEBUI_API_KEY=sk-my-key \
OPENWEBUI_MODEL=llama3.1:70b \
python api.py

# Test without a real desktop or OpenWebUI instance:
AUTOGUI_DRY_RUN=true python api.py

Security and network bind address

Warning: the REST API has no authentication and binds to 0.0.0.0 (all interfaces) by default. This default suits sandbox / container environments where network isolation is provided by the runtime. Do not expose the API port to an untrusted network without additional access controls. For local development, restrict the server to loopback (127.0.0.1) using the mechanisms below.

Mechanism Effect
AUTOGUI_API_HOST=127.0.0.1 Restrict the API to loopback (recommended for local dev)
AUTOGUI_API_PORT=<port> Change the listen port (default 8002)
AUTOGUI_DISABLE_API=1 Disable the background API for all main.py invocations

If fastapi and uvicorn are not installed, the background thread is silently skipped and the main agent works normally.

Docker

# Real agent (needs X11 and OpenWebUI):
docker run -p 8002:8002 \
  -e DISPLAY=$DISPLAY \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  -e OPENWEBUI_BASE_URL=http://host.docker.internal:3000 \
  -e OPENWEBUI_API_KEY=sk-my-key \
  autogui python api.py

# Dry-run (no display or OpenWebUI needed):
docker run -p 8002:8002 -e AUTOGUI_DRY_RUN=true autogui python api.py

Quick curl example

# Submit a task
TASK_ID=$(curl -s -X POST http://localhost:8002/api/task \
  -H 'Content-Type: application/json' \
  -d '{"task": "Take a screenshot of the desktop"}' \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["task_id"])')

# Stream live events
curl -N http://localhost:8002/api/task/$TASK_ID/stream

# Or poll for the finished result
curl -s http://localhost:8002/api/task/$TASK_ID | python3 -m json.tool
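A stdlib-only Python equivalent of the poll loop above; get_json(url) is injected so the HTTP layer is your choice, and the "status" field name is an assumption (check docs/REST_API.md for the real schema):

```python
import time

def wait_for_task(get_json, base, task_id, poll_s=1.0, timeout_s=300):
    """Poll GET /api/task/{id} until the task leaves its running states."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_json(f"{base}/api/task/{task_id}")
        if state.get("status") not in (None, "pending", "running"):
            return state                    # finished (done / error / cancelled)
        time.sleep(poll_s)
    raise TimeoutError(f"task {task_id} still running after {timeout_s}s")
```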

Full endpoint reference, SSE event format, and all environment variables: docs/REST_API.md

Interactive docs (once the server is running):

  • Swagger UI: http://localhost:8002/docs
  • ReDoc: http://localhost:8002/redoc

Usage

Pi extension

The native Pi extension lives in pi-extension/ and is implemented entirely in TypeScript. It does not use OpenWebUI or the standalone Python agent loop.

cd pi-extension
npm install
npm run typecheck
pi -e ./src/index.ts

Inside Pi:

/autogui Open a harmless app and describe what you see

When an /autogui task completes naturally, the extension auto-spawns a read-only Pi validator in a fresh tmux session (only screenshot and window-listing tools active) to double-check the desktop state. Set validateAfterAutogui: false in the extension's config.json to skip the follow-up.

See pi-extension/README.md for the full extension details.

Skill library

Skills are named, replayable sequences of successful tool calls — saved with skill_save, listed with skill_list, replayed with skill_run (or replay.py outside the agent loop).

skills_enabled controls creation only. Reads are always allowed:

| skills_enabled | skill_list | skill_run | skill_save | Candidate suggestion at task start |
|---|---|---|---|---|
| false (default) | ✓ | ✓ | — (not registered) | ✓ (existing library) |
| true | ✓ | ✓ | ✓ | ✓ |

So a fresh checkout never writes a skills/ directory until you opt in, but if you copy in an existing skills.jsonl (or someone else on the same machine has already created one) it remains usable immediately.

{
  "agent": {
    "skills_enabled": false,            // default — no NEW skills are written
    "skills_path": "skills/skills.jsonl"
  }
}

The standalone agent stores skills at ./skills/skills.jsonl relative to the project root. This path is deliberately separate from the Pi extension's library at pi-extension/runtime/skills/skills.jsonl — each side manages its own library so they don't shadow each other. Both directories are git-ignored and created lazily the first time skill_save fires (no creation = no directory).

Side Default skill path Config key (creation gate)
Standalone Python agent skills/skills.jsonl agent.skills_enabled
Pi extension pi-extension/runtime/skills/skills.jsonl skillsEnabled (in pi-extension/config.json)

If you want a single shared library across both programs, point skills_path and skillsPath at the same absolute path — the default is to keep them separate so each program's library is private.
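The .jsonl extension suggests one JSON record per line; the exact record fields are AutoGUI-internal, so this hedged loader just filters whole records by keyword, mirroring the keyword retrieval described above:

```python
import json
from pathlib import Path

def load_skills(path="skills/skills.jsonl", keyword=None):
    p = Path(path)
    if not p.exists():                 # library is created lazily; may be absent
        return []
    records = []
    for line in p.read_text().splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        # Naive keyword match over the serialised record.
        if keyword is None or keyword.lower() in json.dumps(rec).lower():
            records.append(rec)
    return records
```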

App-memory library

Per-app quirk database — failure histograms, success counts, and free-form notes attached to an app via memory_note(app, text). Surfaced into the planner as "app memory hints" for any apps visible at task start so plans bias toward strategies that worked before.

agent.memory.enabled controls creation only. Same pattern as skills_enabled:

| memory.enabled | memory_get | planner hints | controller auto-records | memory_note | memory/ dir on disk |
|---|---|---|---|---|---|
| false (default) | ✓ (reads existing) | ✓ | — | — (not registered) | none until user opts in |
| true | ✓ | ✓ | ✓ | ✓ | created lazily on first write |

{
  "agent": {
    "memory": {
      "enabled": false,        // default — no NEW records are written
      "dir": "memory"          // standalone-agent quirk database
    }
  }
}

The standalone agent stores at ./memory/; the Pi extension stores at pi-extension/runtime/memory/. Both directories are git-ignored and created lazily the first time memory_note (or the controller's auto-recorder) actually fires. Point both at the same absolute path if you want a single shared quirk database across both programs.

Side Default memory path Config key (creation gate)
Standalone Python agent memory/ agent.memory.enabled
Pi extension pi-extension/runtime/memory/ memoryEnabled (in pi-extension/config.json)

Runtime directories

The standalone Python agent creates runtime directories as needed:

Path Contents
logs/ agent.log (rotating) + per-session session_<ts>.log files
logs/traces/ Per-task JSONL trajectory logs
logs/artifacts/ Artifact bodies + index.jsonl. Stable-id store: each capture gets a fresh artifact://<id> even when the body is identical to a prior capture.
logs/progress/ Per-task JSON progress records (auto-resume keyed by task hash)
memory/ Per-app quirk store: memory/<app>.json + memory/index.jsonl. Only created the first time memory_note runs or the controller auto-records, which requires agent.memory.enabled=true. Reads via memory_get work regardless.
screenshots/ Ad-hoc screenshots taken by the agent
screenshots/failures/ Animated GIF failure recordings
skills/ Skill library: skills/skills.jsonl (only created the first time skill_save runs, which requires skills_enabled=true)

The Pi extension writes runtime files under pi-extension/runtime/:

Path Contents
pi-extension/runtime/skills/ Skill library: skills.jsonl (only created when skillsEnabled=true and skill_save fires; reads are always allowed)
pi-extension/runtime/traces/ Per-session JSONL trajectory logs
pi-extension/runtime/artifacts/ Artifact bodies + index.jsonl (stable-id, not deduped).
pi-extension/runtime/progress/ Per-task JSON progress records
pi-extension/runtime/memory/ Per-app quirk store: <app>.json + index.jsonl. Created lazily when memoryEnabled=true and a write fires; reads via memory_get work regardless.
pi-extension/runtime/screenshots/ Ad-hoc screenshots
pi-extension/runtime/failures/ Animated GIF failure recordings
pi-extension/runtime/logs/ autogui.log

All pi-extension/runtime/ paths are git-ignored.

Interactive TUI

python main.py

The TUI shows a scrollable conversation pane with a status bar at the bottom. The status bar always shows the current model name, conversation length, and tool visibility state.

Key Bindings

Key Action
Enter Submit input
Ctrl+P Command palette — type "model" → select Change Model
Ctrl+R Reset conversation history
Ctrl+S Save history to logs/history.jsonl
Ctrl+T Toggle tool call/result visibility
Escape Cancel current task (best-effort)
F1 Help overlay (key bindings + tool list)
Ctrl+C Exit

Single-command mode

python main.py "List all Python files in ~/projects and show me their sizes"

python main.py "Open Notepad and type Hello World"

python main.py --no-desktop "Summarize ~/Documents/notes.txt"

python main.py --quiet "Run tests in ~/myproject"

python main.py --model mistral:7b "What files are in the current directory?"

CLI flags

Flag Description
--config PATH Use a custom config file (default: config.json)
--model MODEL Override model for this session only
--no-desktop Disable mouse/keyboard/screenshot tools
--no-shell Disable shell execution
--no-tools Disable all tools (pure chat mode)
--verbose DEBUG-level logging to stderr and log file
--quiet Suppress tool call/result output (single-command mode)
--check Connectivity health check, then exit

Startup Validation

Every time the agent starts it runs a validation sequence before opening the TUI:

  1. API key — if the key is unset or a placeholder, prompts for one (hidden input).
  2. Connection — calls /api/models to verify the key and server are reachable.
    • HTTP 401 → re-prompts for the key (up to 3 attempts).
    • Connection refused / timeout → prompts for a new base_url.
  3. Model check — if the configured model is not in the server's model list, shows a numbered menu so you can pick one.

After each successful check you are offered the option to save the new value to config.json (API key, base URL, or model), so you only need to do this once.


Model Picker

Open the model picker via Ctrl+P (the command palette) — type "model" and select Change Model:

┌─ Select Model ──────────────────────────────────┐
│ 12 models  ·  ↑↓ to navigate                    │
│ ┌─────────────────────────────────────────────┐ │
│ │ llama3.1:70b  ●                             │ │  ← current model (green dot)
│ │ llama3.2:latest                             │ │
│ │ mistral:7b                                  │ │
│ │ phi3:mini                                   │ │
│ └─────────────────────────────────────────────┘ │
│ [ ] Save selection to config.json               │
│                        [Select]  [Cancel]       │
└─────────────────────────────────────────────────┘
  • The currently active model is highlighted with a green dot.
  • Selecting a model takes effect immediately for the next message.
  • Checking Save selection to config.json persists the choice so it survives restarts.
  • Press Escape or Cancel to close without changing the model.

Robustness, planning & verification (controller-only)

agent.controller.enabled defaults to true, so all of this runs by default; set it to false to fall back to the legacy single-loop ReAct executor. When the controller is on it layers several extra safeguards on top of the standard ReAct loop. Each is individually toggleable so you can dial in the tradeoff between speed and reliability.

| Knob | Default | What it does |
| --- | --- | --- |
| agent.controller.critique_enabled | true | Adds one extra LLM call after the planner that critiques the plan and returns a revised version when issues are found. Catches plan-level mistakes (missing steps, vague post-conditions, wrong dependencies) before any UI is touched. |
| agent.controller.preflight_enabled | true | Before the first state-changing action, verifies that resources the plan needs are available: apps on PATH, files present, URLs TCP-reachable, named tools registered, probe commands exiting 0. Tasks abort with a structured preflight_failed event when something is missing. |
| agent.controller.predicate_check_enabled | true | When a plan step declares a typed predicate (window_title_contains, file_exists, url_contains, text_visible, process_running, shell_returns, …), the controller verifies it deterministically after STEP_DONE. A miss demotes the verdict to BLOCKED and triggers a replan via the standard failure-classification path. |
| agent.controller.visual_diff_enabled | true | When vision is on, hashes each pre/post screenshot pair with a 16×16 perceptual ("dHash") hash and tags the tool result with verifier.visual_diff when a state-changing action flipped fewer than ~12% of bits — i.e. the screen barely changed. Catches the silent-no-op failure mode that exit-code checks miss. |
| agent.controller.watchdog_stall_threshold | 3 | Hashes (window list, active window, first proposed tool, first args) per iteration. When the same signature recurs N times in a row, the step is flagged as stuck and routed through the standard BLOCKED path. 0 disables. |
| agent.budget.max_* | 0 | Hard ceilings for tool calls / chat calls / total tokens / seconds. When any ceiling is exceeded, a budget_exceeded event fires and the task ends before the next step runs. |
| agent.memory.enabled | false | Creation gate. When false (the default), memory_note is not registered, the controller does not auto-record successes/failures, and no memory/ directory is created. memory_get and the planner's app-memory hints continue to read whatever is already on disk, so an existing quirk database stays useful even when creation is off. Set to true to allow new records. Mirrors the agent.skills_enabled flag. |
| agent.memory.dir | memory/ | Per-app quirk store location (memory/<app>.json). Created lazily the first time something is written. The pi extension keeps its own quirk database under pi-extension/runtime/memory/ so the two libraries don't shadow each other (point both at the same absolute path if you want them merged). |
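The typed-predicate check can be sketched as a small dispatcher. This is a minimal illustration, not the agent's actual implementation, and it covers only two of the predicate kinds listed above:

```python
import os
import subprocess

def check_predicate(kind: str, arg: str) -> bool:
    # Deterministic post-condition checks; the real controller supports more
    # kinds (window_title_contains, url_contains, text_visible, ...).
    if kind == "file_exists":
        return os.path.exists(arg)
    if kind == "shell_returns":
        # Predicate holds when the probe command exits 0.
        return subprocess.run(arg, shell=True, capture_output=True).returncode == 0
    raise ValueError(f"unknown predicate kind: {kind}")
```

Because the checks are deterministic, a failed predicate can demote a step to BLOCKED without asking the model whether the step "looks done".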

The planner also receives few-shot exemplars from the skill library (top-3 matches by keyword) and app memory hints for any visible apps, so plans are biased by what previously succeeded against the same software.

replay.py --drift-check re-runs a saved skill while comparing the live post-state against the windows + perceptual screen hash recorded when the skill was first captured (step.drift_anchor). Drift between rounds is logged so you know when a recipe has gone stale without having to re-record it from scratch.
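The perceptual-hash comparison behind visual_diff_enabled and --drift-check can be sketched in pure Python. This sketch assumes the screenshots have already been downscaled to grayscale grids; the real implementation hashes actual screenshot images:

```python
def dhash_bits(gray, size=16):
    # gray: size rows x (size + 1) cols of 0-255 luminance values.
    # Each bit records whether a pixel is darker than its right neighbour.
    bits = []
    for row in range(size):
        for col in range(size):
            bits.append(1 if gray[row][col] < gray[row][col + 1] else 0)
    return bits  # size*size = 256 bits for the default 16x16 hash

def moved_fraction(bits_a, bits_b):
    # Fraction of hash bits that flipped between the pre/post screenshots;
    # below ~0.12 the screen "barely changed" and the action is flagged.
    return sum(a != b for a, b in zip(bits_a, bits_b)) / len(bits_a)
```

Comparing hash bits rather than raw pixels makes the check robust to small rendering differences (cursors, clocks, anti-aliasing) while still catching a screen that did not change at all.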

The pi extension exposes the same primitives as tools: check_predicate, preflight, memory_get / memory_note, budget_status, classify_failure, desktop_wait_for. Pi owns the LLM loop, so the controller protocol injected into the system prompt instructs Pi's agent to call them at the right beats (preflight up front, predicate check before STEP_DONE, etc.) rather than the extension running them implicitly.
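The budget semantics (a ceiling of 0 means "no ceiling") can be sketched as a small tracker. The class and attribute names here are illustrative, not the agent's actual API:

```python
import time

class Budget:
    # Hard ceilings; a value of 0 disables that ceiling, matching the config.
    def __init__(self, max_tool_calls=0, max_seconds=0):
        self.max_tool_calls = max_tool_calls
        self.max_seconds = max_seconds
        self.tool_calls = 0
        self.start = time.monotonic()

    def exceeded(self) -> bool:
        # Checked between steps, so the task ends before the NEXT step runs.
        if self.max_tool_calls and self.tool_calls >= self.max_tool_calls:
            return True
        return bool(self.max_seconds) and (time.monotonic() - self.start) >= self.max_seconds
```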

Test harness

A pytest suite under tests/ exercises the controller / artifacts / predicates / failures / app memory / budget / preflight / watchdog / visual diff modules with no live model and no desktop backend required:

pip install pytest pytest-asyncio
python -m pytest

The tests use mocked OpenWebUIClient and ToolRegistry stubs (see tests/conftest.py) to drive Agent._run_with_controller end-to-end, including a budget-exhaustion case that proves the ceiling stops the loop before the next step runs. Run this on every controller change to catch regressions in the orchestration logic without burning real model calls.
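The stub pattern used by the suite can be illustrated with a hypothetical stand-in class (the real stubs live in tests/conftest.py and are richer than this):

```python
import asyncio

class StubRegistry:
    # Returns canned results instead of touching a desktop backend, and
    # records every dispatch so tests can assert on call order and arguments.
    def __init__(self, results):
        self.results = results
        self.calls = []

    async def dispatch(self, name, args):
        self.calls.append((name, args))
        return self.results.get(name, {"error": f"no stub for {name}"})
```

Driving the controller against a registry like this exercises the orchestration logic (plans, predicates, budgets) without a live model or a real screen.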

Safety Guardrails

Destructive command guard

shell_run refuses any command matching patterns like rm -rf, format, dd if=, DROP TABLE, etc. The model is told to get user confirmation before running destructive commands.
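The guard is a pattern match, which a sketch makes concrete. The pattern list below is an illustrative subset; the agent's real list is longer:

```python
import re

# Illustrative subset of the destructive patterns; not the agent's full list.
DESTRUCTIVE_PATTERNS = [r"\brm\s+-rf\b", r"\bformat\b", r"\bdd\s+if=", r"\bDROP\s+TABLE\b"]

def is_destructive(cmd: str) -> bool:
    return any(re.search(p, cmd, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS)
```

Being a blocklist, this catches known-dangerous shapes but is not a sandbox — a determined or creative command can evade it, which is why the Security Notes recommend disabling shell access for untrusted tasks.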

Tool execution countdown

Before dispatching a tool call the agent waits N seconds (configured via safety.command_confirm_delay_seconds). During this window you can cancel:

TUI — status bar shows a progress bar; press Escape.

CLI — countdown printed inline; press Escape or Ctrl+C.

  ⏳ [████░] shell_run: executing in 1s  (Esc / Ctrl+C to cancel)

Set "command_confirm_delay_seconds": 0 to disable the countdown and execute immediately.
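The cancellable countdown boils down to a timed wait on a cancellation signal. A minimal asyncio sketch (names are illustrative, not the agent's actual functions):

```python
import asyncio

async def confirm_delay(seconds: float, cancel: asyncio.Event) -> bool:
    # True  -> countdown elapsed, proceed with the tool call.
    # False -> user cancelled during the window.
    if seconds <= 0:
        return True  # delay of 0 executes immediately
    try:
        await asyncio.wait_for(cancel.wait(), timeout=seconds)
        return False
    except asyncio.TimeoutError:
        return True
```

In the agent, the Escape / Ctrl+C handler would set the event, aborting the pending tool call.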

Hallucination guard

The agent monitors stop-responses for phrases like "I clicked", "I typed", "I ran" without corresponding tool calls. When detected, it injects a correction into the history and continues the loop, forcing the model to issue the actual tool calls.
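The detection can be sketched as a regex over the assistant message. The phrase list and message shape here are simplified assumptions, not the agent's exact heuristics:

```python
import re

# Simplified claim detector; the real guard checks more phrasings.
ACTION_CLAIM = re.compile(r"\bI (clicked|typed|ran)\b", re.IGNORECASE)

def looks_hallucinated(message: dict) -> bool:
    # A stop-response claiming an action but carrying no tool calls is suspect.
    content = message.get("content") or ""
    return bool(ACTION_CLAIM.search(content)) and not message.get("tool_calls")
```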

Error retry enforcement

When a tool returns an error, non-zero exit code, or timeout, the agent appends an [AGENT POLICY] The tool call above FAILED message to the history. This prevents the model from acknowledging the error and moving on — it must diagnose and retry the same step.
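A sketch of the enforcement step, assuming a simplified tool-result shape; the exact wording after the quoted "[AGENT POLICY] ... FAILED" prefix and the message role are assumptions:

```python
POLICY_MSG = ("[AGENT POLICY] The tool call above FAILED. "
              "Diagnose the failure and retry the same step.")

def enforce_retry(history: list, tool_result: dict) -> bool:
    # Append the policy message whenever the tool reported failure, so the
    # model cannot simply acknowledge the error and move on.
    failed = bool(tool_result.get("error")) or tool_result.get("exit_code", 0) != 0
    if failed:
        history.append({"role": "user", "content": POLICY_MSG})
    return failed
```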


Platform Support

| Platform | Screenshot | Click/Type | Hotkey | Windows | Launch | Find Element |
| --- | --- | --- | --- | --- | --- | --- |
| WSL (WSLg) | pyautogui | pyautogui | pyautogui | PowerShell | PowerShell | PowerShell UIAutomation |
| Windows | pyautogui | pyautogui | pyautogui | PowerShell | PowerShell | uiautomation (optional) |
| macOS | screencapture | pyautogui | pyautogui | osascript | open -a | osascript |
| Linux X11 | pyautogui | pyautogui/xdotool | pyautogui | wmctrl | subprocess | — |
| Linux Wayland | grim | ydotool | ydotool | swaymsg | subprocess | — |

The correct backend is selected automatically at startup via platform_detect.detect(). No configuration is needed.

WSL note: The agent automatically detects WSL and instructs the model to append .exe to Windows programs and search /mnt/c when a binary is not on the PATH.
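Automatic backend selection reduces to a handful of environment probes. A simplified sketch of what platform_detect.detect() might look like — the return strings and probe order here are illustrative, and the real backend also checks tool availability (xdotool, ydotool, grim):

```python
import os
import platform

def detect() -> str:
    # WSL kernels report "microsoft" in the release string.
    if platform.system() == "Linux" and "microsoft" in platform.release().lower():
        return "wsl"
    if platform.system() == "Windows":
        return "windows"
    if platform.system() == "Darwin":
        return "macos"
    # On Linux, a Wayland session exports WAYLAND_DISPLAY.
    return "linux-wayland" if os.environ.get("WAYLAND_DISPLAY") else "linux-x11"
```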


Configuration Reference

{
  "openwebui": {
    "base_url": "http://localhost:3000",          // OpenWebUI server URL (or Ollama: http://localhost:11434)
    "api_key": "sk-...",                          // API key (Settings → Account → API Keys); "" for Ollama
    "model": "llama3.1:70b",                     // Model ID — must match /api/models list
    "api_path": "/api/chat/completions",          // Completions path. Use "/v1/chat/completions" to bypass OpenWebUI and call Ollama directly
    "temperature": 0.2,                           // Sampling temperature (0–1)
    "max_tokens": 4096,                           // Max completion tokens per call
    "timeout_seconds": 120                        // Per-request timeout
  },
  "install_dependencies": false,          // True = run scripts/install-dependencies.* at startup
  "agent": {
    "max_iterations": 30,                 // Hard stop after N agentic loop iterations
    "confirm_destructive": true,         // Block shell commands matching destructive regex patterns
    "vision_screenshots": true,          // Send screenshots to vision-capable models
    "record_trace": true,                // Persist every event to logs/traces/<session>.jsonl
    "trace_dir": "logs/traces",          // Where the JSONL trajectory log lives
    "skills_enabled": false,             // CREATION gate. False (default) blocks skill_save; skill_list/skill_run/candidate-suggestion still work
    "suggest_skills": true,              // Offer top-K saved skills at task start
    "skills_path": "skills/skills.jsonl",  // Standalone-agent skill library; deliberately distinct from
                                            //   pi-extension/runtime/skills/skills.jsonl so each side has its own
    "planner": {                          // Pre-execution planning pass
      "enabled": true                     // One extra LLM call up front (uses the primary client)
    },
    "controller": {                       // Typed-plan + step-by-step executor (default ON)
      "enabled": true,
      "step_max_iterations": 8,           // Per-step iteration ceiling (separate from max_iterations)
      "step_max_retries": 2,
      "auto_resume": true,                // Resume completed step ids from logs/progress
      "replan_on_block": true,
      "critique_enabled": true,           // Extra LLM call to review the plan
      "preflight_enabled": true,          // Verify apps/files/URLs/tools/commands before acting
      "predicate_check_enabled": true,    // Verify typed post-conditions deterministically
      "visual_diff_enabled": true,        // Perceptual-hash diff to flag silent-no-op actions
      "watchdog_stall_threshold": 3       // 0 disables; flag step stuck after N identical signatures
    },
    "artifacts": {"dir": "logs/artifacts"},  // Stable-id observation store (append-only; not deduped)
    "progress":  {"dir": "logs/progress"},   // Per-task resume markers
    "memory": {                              // Per-app quirk database (separate from skills)
      "enabled": false,                      //   CREATION gate. False blocks memory_note + auto-recording;
                                              //   memory_get and planner hints still read whatever is on disk
      "dir": "memory"                        //   Distinct from pi-extension/runtime/memory/
    },
    "budget": {                              // Hard ceilings; 0 = no ceiling
      "max_tool_calls": 0,
      "max_chat_calls": 0,
      "max_total_tokens": 0,
      "max_seconds": 0
    },
    "bon": {                              // Best-of-N action sampling
      "enabled": true,                    // Samples n completions, picks best on uncertain steps
      "n": 3,                             // Number of candidates to sample
      "temperature": 0.7,                 // Sampling temperature for diverse candidates
      "trigger_on_recent_failure": true,
      "trigger_on_validator_disagreement": true
    },
    "screen_record": {                    // Rolling screen buffer
      "enabled": true,                    // Capture into a deque while running
      "fps": 5,                           // Frames per second
      "buffer_seconds": 5.0,              // Length of rolling window in seconds
      "max_width": 960,                   // Downscale before storing
      "out_dir": "screenshots/failures"   // GIFs are written here on tool failure
    }
  },
  "tools": {
    "shell_timeout_seconds": 30,          // Per-command shell timeout
    "screenshot_dir": "screenshots",      // Directory for saved screenshots
    "max_screenshot_width": 1280,         // Resize screenshots wider than this (px)
    "perception_cache_ttl_seconds": 0.5,  // Reuse the last screenshot for this long
    "allowed_shell": true,               // Enable shell_run tool
    "allowed_filesystem": true,          // Enable fs_read / fs_write / fs_list
    "allowed_desktop": true,             // Enable all desktop/* tools
    "allowed_browser": false             // Playwright browser_* tools
  },
  "browser": {                            // Settings for the Playwright backend
    "headless": false,                    // Run with a visible window
    "screenshot_dir": "screenshots/browser",
    "user_data_dir": "",                 // Non-empty path = persistent profile (keeps logins)
    "viewport": {"width": 1280, "height": 800}
  },
  "logging": {
    "level": "INFO",                     // Log level for file handler
    "file": "logs/agent.log",            // Log file path
    "max_bytes": 10485760,               // Rotate at 10 MB
    "backup_count": 3                    // Keep 3 rotated files
  },
  "tui": {
    "theme": "dark",                     // Textual theme
    "show_tool_calls": true,             // Show tool calls in conversation pane by default
    "show_token_counts": false,          // (reserved)
    "history_file": "logs/history.jsonl" // Ctrl+S saves here
  },
  "safety": {
    "command_confirm_delay_seconds": 5,  // Countdown before each tool call (0 = off)
    "dry_run": false,                    // True = state-changing tools return a stub
                                          //   {dry_run, would_execute} instead of running
    "allowed_apps": [],                  // Restrict GUI actions to these apps; empty = unrestricted
    "blocked_window_titles": [],         // Regex patterns; matching active window blocks GUI tools
    "fs_write_snapshot_dir": ""           // Non-empty path = back up files before fs_write overwrite
  }
}
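Note that the // annotations above are for reference only; a real config.json must be plain JSON without comments. Reading nested keys like agent.controller.enabled with a fallback can be sketched with a small hypothetical helper (not part of the agent's API):

```python
def cfg_get(cfg: dict, dotted: str, default=None):
    # Resolve "agent.controller.enabled"-style dotted keys with a fallback,
    # so missing sections fall back to the documented defaults.
    node = cfg
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```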

Extending the Tool Set

1. Implement an async function in tools.py:

async def my_tool(param: str) -> dict:
    try:
        result = do_something(param)
        return {"success": True, "result": result}
    except Exception as e:
        return {"error": str(e)}

2. Register it in ToolRegistry._build():

self._register(
    {"type": "function", "function": {
        "name": "my_tool",
        "description": "Does something useful — the LLM reads this.",
        "parameters": {"type": "object",
                       "properties": {"param": {"type": "string"}},
                       "required": ["param"]},
    }},
    my_tool,
)

3. Gate it on a config key if needed (check self._tools_cfg.get("allowed_my_tool", True)).


Security Notes

  • Shell access — the destructive guard is not a sandbox. For untrusted tasks set "allowed_shell": false or run the agent in a container.
  • API key — restrict config.json permissions: chmod 600 config.json. The file is excluded from git via .gitignore.
  • Desktop control — the agent operates at OS level: it can click anything and type anywhere. Only run on machines and accounts where you accept this capability.
  • REST API — no authentication is enforced; the server binds to 0.0.0.0 by default (all interfaces). Set AUTOGUI_API_HOST=127.0.0.1 for loopback-only use, or AUTOGUI_DISABLE_API=1 to disable the background API entirely. See docs/REST_API.md for details.

License

MIT — use freely; attribution appreciated.

About

Desktop automation agent connecting any OpenWebUI-compatible LLM to your computer via a ReAct tool loop: shell, filesystem, pixel and accessibility-tree clicking, Set-of-Mark grounding, Playwright browser control, and a native Pi coding agent extension. Platform-aware across Windows, macOS, Linux, and WSL.
