Scope: Windows 11 + WSL2 Ubuntu 22.04 / 24.04 only. See docs/WSL_SETUP.md.
Pixel-level browser automation MCP server. Gives any MCP-speaking agent (hermes-agent, Claude Code, Codex, …) 21 tools to drive a real Chrome browser in an Xvfb display — screenshots as vision input, OS-level mouse/keyboard as output. No CDP. No navigator.webdriver. No DOM shortcuts.
What the GIF shows: an agent opens Chrome, focuses the Google search bar, types `snp500`, presses Enter, and Google returns a full SERP with the live S&P 500 index card. The same flow routinely trips "unusual traffic" or a captcha for Playwright-driven automation. This stack doesn't get flagged because the browser is stock Chrome driven by stock X11 input, so there is no automation fingerprint to detect.
| | Playwright / CDP | hermes-computer-use |
|---|---|---|
| `navigator.webdriver` | `true` (detectable) | `undefined` |
| CDP endpoint open | yes | no |
| DOM access | direct (fast, brittle to markup changes) | screenshot only (slower, resilient to UI rewrites) |
| Anti-bot footprint | large, constantly patched | near-zero: stock Chrome + stock X input |
| Best for | flows on sites you own | agents operating unfamiliar sites like a human |
If your automation has to walk a signup funnel on a site guarded by Cloudflare, Kasada, reCAPTCHA, or DataDome, this stack usually passes where Playwright gets stopped.
Evidence: docs/assets/demo-sannysoft.png — bot.sannysoft.com fingerprint panel with WebDriver, Chrome runtime, Permissions, Plugins, Languages, and PHANTOM all passed.
    agent ── stdio MCP ──▶ hermes_computer_use.server ── subprocess ──▶ xdotool / scrot
                                                                                   │
                                                                                   ▼
                                                                               Xvfb :99
                                                                                   │
                                                                ┌──────────────────┼──────────────────┐
                                                                ▼                                     ▼
                                                          x11vnc :5900                    websockify + noVNC :6080
                                                      (native VNC clients)                    (browser viewer)
Longer version: docs/ARCHITECTURE.md.
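The subprocess hop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the server's actual code: the function names are hypothetical, and only basic `xdotool` and `scrot` invocations against the `:99` display are shown.

```python
import os
import subprocess

DISPLAY = ":99"  # matches the Xvfb display in the diagram


def click_argv(x: int, y: int) -> list[str]:
    """Build an xdotool command that moves the pointer, then left-clicks."""
    return ["xdotool", "mousemove", str(x), str(y), "click", "1"]


def type_argv(text: str, delay_ms: int = 25) -> list[str]:
    """Build an xdotool command that types text with a per-key delay.

    Text that starts with '-' may be read as an option and would need
    extra handling; this sketch ignores that case.
    """
    return ["xdotool", "type", "--delay", str(delay_ms), text]


def screenshot_argv(path: str) -> list[str]:
    """Build a scrot command that captures the whole display to a file."""
    return ["scrot", path]


def run(argv: list[str]) -> None:
    """Run one command against the virtual display (Xvfb must be running)."""
    env = {**os.environ, "DISPLAY": DISPLAY}
    subprocess.run(argv, env=env, check=True)
```

Keeping the command builders separate from `run` means the argument lists can be inspected or logged without a live display.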
Prerequisites: Windows 11, WSL2 with Ubuntu 22.04/24.04, systemd enabled. Full walkthrough: docs/WSL_SETUP.md.
Everything below runs inside the WSL shell.

    pip install "hermes-computer-use[novnc]"

You still need system packages (Xvfb, Chrome, xdotool…) and systemd units; see source install steps 1 & 4.

    git clone https://github.com/Noah3521/hermes-computer-use.git ~/hermes-computer-use
    cd ~/hermes-computer-use
    bash scripts/setup.sh                       # 1. apt + Chrome + uinput (sudo)
    python3 -m venv .venv && . .venv/bin/activate
    pip install -e ".[novnc]"                   # 2. Python package
    bash scripts/install-novnc.sh               # 3. (optional) web viewer
    mkdir -p ~/.config/systemd/user             # 4. persistent services
    cp systemd/*.example ~/.config/systemd/user/
    # edit the paths inside to match your clone, then:
    sudo loginctl enable-linger "$USER"
    systemctl --user daemon-reload
    systemctl --user enable --now computer-use.service novnc.service

Smoke test: `python examples/smoke_test.py`.
Copy the relevant snippet from config/hermes.yaml.example into your agent's MCP server config. Works with hermes-agent, Claude Code, Codex, mcp-inspector, or any stdio MCP client.
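For clients that read a JSON `mcpServers` map (Claude Code's `.mcp.json` is one example), an entry might look like the output of the sketch below. The server name `computer_use`, the `python -m hermes_computer_use.server` launch command, and the choice to pass `CU_DISPLAY` are assumptions inferred from this README; treat `config/hermes.yaml.example` as the authoritative source.

```python
import json


def mcp_server_entry(python_bin: str = "python", display: str = "99") -> dict:
    """Return a hypothetical stdio MCP server entry for this project."""
    return {
        "mcpServers": {
            "computer_use": {  # assumed server name
                "command": python_bin,
                # Assumed launch command, based on the module name in the diagram.
                "args": ["-m", "hermes_computer_use.server"],
                "env": {"CU_DISPLAY": display},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(mcp_server_entry(), indent=2))
```

If the package lives in a virtualenv, `python_bin` would need to point at that environment's interpreter, for example `~/hermes-computer-use/.venv/bin/python`.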
If your agent has shell + filesystem tools, you can skip the manual install entirely: paste the prompt in docs/LLM_SETUP_PROMPT.md and it will clone, install, wire up systemd, run the smoke test, and report back. Available in English, 日本語, 中文, 한국어.
| Category | Tools |
|---|---|
| Status | screen_info, cursor_position |
| Capture | screenshot |
| Pointer | move, left_click, right_click, double_click, middle_click, drag, scroll |
| Keyboard | type_text, press_key, hold_key, clear_field, select_all, copy, paste, cut, undo, redo, clipboard_set, clipboard_get |
| Timing | wait |
| Browser | open_url, new_tab, close_tab, back, forward, reload |
| Escape hatch | run_shell |
| Optional DOM fast-path (`CU_ENABLE_CDP=1`) | dom_click, dom_type, dom_query, dom_exists, dom_wait, dom_eval, network_capture, console_messages |
press_key accepts case-insensitive names and aliases — Backspace, backspace, BackSpace all work; cmd+a, command-a, ctrl+a all resolve; meta/win/windows/cmd map to Super.
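One way to implement this kind of alias handling is a small normalizer that lowercases each part of a combo and maps it through lookup tables. The sketch below is illustrative only: the tables and the `normalize_key` name are hypothetical, `command` is assumed to behave like `cmd`, and the output uses xdotool-style names.

```python
# Hypothetical alias tables; the real server's tables may differ.
MODIFIER_ALIASES = {
    "meta": "super", "win": "super", "windows": "super", "cmd": "super",
    "command": "super",  # assumption: treated the same as "cmd"
    "super": "super",
    "ctrl": "ctrl", "control": "ctrl",
    "alt": "alt", "shift": "shift",
}
KEY_ALIASES = {
    "backspace": "BackSpace",
    "enter": "Return", "return": "Return",
    "esc": "Escape", "escape": "Escape",
    "tab": "Tab",
}


def normalize_key(combo: str) -> str:
    """Normalize a key or combo such as 'command-a' to 'super+a'.

    Both '+' and '-' are accepted as separators, so a literal minus key
    cannot be expressed in this sketch.
    """
    parts = [p.strip() for p in combo.replace("-", "+").split("+") if p.strip()]
    normalized = []
    for part in parts:
        low = part.lower()
        if low in MODIFIER_ALIASES:
            normalized.append(MODIFIER_ALIASES[low])
        elif low in KEY_ALIASES:
            normalized.append(KEY_ALIASES[low])
        elif len(low) == 1:
            normalized.append(low)   # single characters are lowercased
        else:
            normalized.append(part)  # unknown names pass through unchanged
    return "+".join(normalized)
```

With these tables, `Backspace`, `backspace`, and `BackSpace` all normalize to `BackSpace`, and `cmd+a` and `command-a` both normalize to `super+a`.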
For DOM-heavy pages where vision grounding is slow or fragile (SPA dashboards, deep forms), you can opt into CSS-selector-based clicks / typing / queries. Trade-off: Chrome exposes a DevTools port and navigator.webdriver flips to true for the session, which defeats the anti-bot posture on sites that fingerprint Chrome. Off by default.

    CU_ENABLE_CDP=1 bash scripts/display.sh restart
    pip install "hermes-computer-use[dom]"   # adds websocket-client
    # Run the MCP with CU_ENABLE_CDP=1 in its env too (hermes config etc.)

See docs/ARCHITECTURE.md#dom-fast-path for when to use which.
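Because the DOM tools are opt-in, a server could gate their registration on the environment variable. The sketch below illustrates that idea using tool names from the table; `cdp_enabled`, `exposed_tools`, and the abbreviated `CORE_TOOLS` list are hypothetical, and the real server may wire this differently.

```python
import os

# Tool names taken from the README table; CORE_TOOLS is abbreviated.
CORE_TOOLS = ["screenshot", "left_click", "type_text", "press_key"]
DOM_TOOLS = [
    "dom_click", "dom_type", "dom_query", "dom_exists",
    "dom_wait", "dom_eval", "network_capture", "console_messages",
]


def cdp_enabled(env=None) -> bool:
    """Return True only when CU_ENABLE_CDP is set to exactly '1'."""
    env = os.environ if env is None else env
    return env.get("CU_ENABLE_CDP", "") == "1"


def exposed_tools(env=None) -> list[str]:
    """List the tools to expose: pixel-level always, DOM tools only on opt-in."""
    tools = list(CORE_TOOLS)
    if cdp_enabled(env):
        tools += DOM_TOOLS
    return tools
```

Treating anything other than `1` as off keeps the default posture intact when the variable is unset or mistyped.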
Try any of the prompts in examples/demo_prompts.md. The simplest and most illustrative:
"Use computer_use to open Google, search for `snp500`, and tell me the current S&P 500 index price from the page."
Open http://localhost:6080/vnc.html in a browser while the agent runs — watching the cursor arc through the search bar is surprisingly compelling.
| Var | Default | Meaning |
|---|---|---|
| `CU_DISPLAY` | `99` | X display number |
| `CU_WIDTH` / `CU_HEIGHT` | `1440` / `900` | Virtual screen size |
| `CU_VNC_PORT` | `5900` | x11vnc listen port |
| `CU_STATE_DIR` | `/tmp/hermes-computer-use` | Logs, PID files |
| `CU_PROFILE_DIR` | `$CU_STATE_DIR/chrome-profile` | Persistent Chrome profile |
| `CU_START_URL` | `about:blank` | First URL Chrome opens |
| `CU_INPUT` | `xdotool` | Set to `ydotool` for `/dev/uinput` input |
| `CU_KEY_DELAY_MS` | `25` | Inter-keystroke delay (ms) |
| `CU_MOVE_STEPS` | `18` | Cursor interpolation steps |
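As a reading aid, the defaults in this table can be expressed as a small settings loader. This is a sketch under stated assumptions: the `Settings` class and `load_settings` function are hypothetical names, not the package's API, and only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    display: int
    width: int
    height: int
    vnc_port: int
    state_dir: str
    profile_dir: str
    start_url: str
    input_backend: str
    key_delay_ms: int
    move_steps: int


def load_settings(env=None) -> Settings:
    """Read CU_* variables from env, falling back to the documented defaults."""
    env = os.environ if env is None else env
    state_dir = env.get("CU_STATE_DIR", "/tmp/hermes-computer-use")
    return Settings(
        display=int(env.get("CU_DISPLAY", "99")),
        width=int(env.get("CU_WIDTH", "1440")),
        height=int(env.get("CU_HEIGHT", "900")),
        vnc_port=int(env.get("CU_VNC_PORT", "5900")),
        state_dir=state_dir,
        # The profile directory default follows CU_STATE_DIR.
        profile_dir=env.get("CU_PROFILE_DIR", f"{state_dir}/chrome-profile"),
        start_url=env.get("CU_START_URL", "about:blank"),
        input_backend=env.get("CU_INPUT", "xdotool"),
        key_delay_ms=int(env.get("CU_KEY_DELAY_MS", "25")),
        move_steps=int(env.get("CU_MOVE_STEPS", "18")),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check how an override such as `CU_STATE_DIR` changes the derived profile path.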
- WSL_SETUP.md — Windows-side setup, systemd, linger
- ARCHITECTURE.md — internals + design rationale
- CAPTCHA.md — what passive / behavioural / visual challenges this approach can and cannot handle
- TROUBLESHOOTING.md — common failure modes with fixes
- FAQ.md — Playwright comparison, anti-bot honesty, parallel runs, profile safety
- SECURITY.md — threat model and hardening checklist
This is an LLM with hands. Read SECURITY.md. Baseline:
- Run in an isolated WSL distro, not your daily driver.
- Strip `run_shell` if the agent doesn't need shell access.
- Don't persist real credentials in `CU_PROFILE_DIR`.
See CONTRIBUTING.md. The guiding thesis is "emit no abnormal signals by default" > "emit clever evasions" — but additive hybrid paths (e.g. opt-in DOM / CDP fast-clicks that users turn on per-site) are welcome when they do not flip the default posture.
MIT. See LICENSE.
- anthropic-quickstarts/computer-use-demo — the reference loop.
- x11vnc + noVNC — observer pipeline.
- Model Context Protocol — the interface.
