
feat: Screenshot tool with DXCam backend reporting and UIAutomation hang fix #104

Merged
Jeomon merged 6 commits into CursorTouch:main from yasuhirofujii-medley:feat/fast-snapshot-no-tree
Mar 16, 2026

Conversation

@yasuhirofujii-medley

Summary

This PR adds a dedicated Screenshot tool for fast screenshot-only capture, reports the capture backend (DXCam/Pillow) in the response, and skips expensive UIAutomation window enumeration in the Screenshot fast path.

These changes build on top of the use_ui_tree=False fast path introduced in PR #98.


Why this is needed

1. Screenshot tool — dedicated fast capture endpoint (65a9ed3)

Problem: The existing Snapshot tool, even with use_ui_tree=False, still carries the overhead of being a general-purpose tool. Callers who only need a screenshot have to specify multiple flags (use_vision=True, use_annotation=False, use_ui_tree=False). More importantly, there was no way to invoke a screenshot-only path with a simple, discoverable tool name.

Solution: Added a new Screenshot tool that is purpose-built for fast screenshot capture:

  • Fixed to use_vision=True, use_annotation=False, use_ui_tree=False
  • Accepts display parameter (list of display indices) for multi-monitor selection
  • Single-purpose tool with a clear name that agents can discover easily
  • DXCam (DirectX) hardware capture is used when display is specified (requires capture_rect)

Also added:

  • Desktop.parse_display_selection() for robust display parameter handling
  • Desktop.get_display_union_rect() for computing the capture region from display indices
  • Shared _capture_desktop_state() helper to deduplicate Snapshot/Screenshot implementation
  • WINDOWS_MCP_PROFILE_SNAPSHOT env var for per-stage timing instrumentation
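
For readers who want a feel for the two display helpers listed above, here is a minimal sketch. It is not the code from this PR: the helpers are written as free functions rather than Desktop methods, and the `Rect` type, the `display_count` argument, and the exact validation rules are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical monitor rectangle in virtual-desktop coordinates."""
    left: int
    top: int
    right: int
    bottom: int


def parse_display_selection(display, display_count):
    """Normalize a display argument into a sorted list of unique indices.

    Returns None when no selection was given (assumed to mean "all displays").
    """
    if display is None:
        return None
    # Accept a bare integer as shorthand for a one-element list.
    if isinstance(display, int) and not isinstance(display, bool):
        display = [display]
    indices = []
    for item in display:
        if isinstance(item, bool) or not isinstance(item, int):
            raise ValueError(f"display index must be an integer, got {item!r}")
        if not 0 <= item < display_count:
            raise ValueError(f"display index {item} is out of range")
        indices.append(item)
    if not indices:
        raise ValueError("display selection must not be empty")
    return sorted(set(indices))


def get_display_union_rect(monitors, indices):
    """Return the smallest rectangle that covers every selected monitor."""
    selected = [monitors[i] for i in indices]
    return Rect(
        left=min(m.left for m in selected),
        top=min(m.top for m in selected),
        right=max(m.right for m in selected),
        bottom=max(m.bottom for m in selected),
    )
```

The union rectangle is what a caller would pass on as the capture region when more than one display is selected.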

2. Capture backend reporting (5484e46)

Problem: When debugging screenshot performance, there was no way to tell from the tool response whether DXCam (DirectX, ~10ms) or Pillow (GDI, ~100ms) was used for capture. This made it difficult to confirm that DXCam was actually being activated.

Solution: The get_screenshot() method now tracks the backend used (self._last_screenshot_backend), and the response includes a Screenshot Backend: dxcam or Screenshot Backend: pillow line. The DesktopState dataclass carries a screenshot_backend field.
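
The tracking described above can be sketched roughly as follows. Only `_last_screenshot_backend`, `screenshot_backend`, and the `Screenshot Backend:` line come from this PR; the `ScreenshotService` class, the injected grab callables, and `build_response_lines()` are hypothetical stand-ins so the example runs without DXCam or Pillow installed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DesktopState:
    # Only the field relevant here; the real dataclass carries more state.
    screenshot_backend: Optional[str] = None


class ScreenshotService:
    """Hypothetical holder for the capture logic, with injectable backends."""

    def __init__(self, grab_dxcam: Callable[[], object], grab_pillow: Callable[[], object]):
        self._grab_dxcam = grab_dxcam
        self._grab_pillow = grab_pillow
        self._last_screenshot_backend: Optional[str] = None

    def get_screenshot(self):
        # Try the fast path first; fall back to Pillow on error or empty frame.
        try:
            image = self._grab_dxcam()
            if image is not None:
                self._last_screenshot_backend = "dxcam"
                return image
        except Exception:
            pass
        image = self._grab_pillow()
        self._last_screenshot_backend = "pillow"
        return image


def build_response_lines(state: DesktopState) -> List[str]:
    """Emit the backend line only when a backend was recorded."""
    lines = []
    if state.screenshot_backend:
        lines.append(f"Screenshot Backend: {state.screenshot_backend}")
    return lines
```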

3. Skip UIAutomation window enumeration for Screenshot tool (5b22d1b, 3d751df)

Problem: Desktop.get_state() unconditionally called get_controls_handles(), get_windows(), and get_active_window() — even when use_ui_tree=False (Screenshot tool). These are UIAutomation API calls that enumerate windows via COM/WM messages. When an application is launching and not responding to window messages (e.g., showing a splash screen), these calls hang for tens of seconds (observed: 47 seconds for a single screenshot).

This is the same class of problem that PR #98 addressed for tree capture, but the window enumeration calls were left in place because the Snapshot response includes window metadata. For the Screenshot tool, however, this metadata is not needed — the purpose is strictly to capture the screen image as fast as possible.

Solution: When use_ui_tree=False, get_state() now skips all three UIAutomation window enumeration calls and returns empty window lists. This eliminates the hang entirely for the Screenshot path.
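
In outline, the early return looks something like the sketch below. It is a simplification under stated assumptions: the `uia` wrapper object and the dict return value are hypothetical (the real get_state() returns a DesktopState), and only the three skipped call names are taken from this PR.

```python
class Desktop:
    def __init__(self, uia):
        # `uia` is a hypothetical wrapper around the UIAutomation calls.
        self._uia = uia

    def get_state(self, use_ui_tree=True):
        if not use_ui_tree:
            # Fast path: never touch UIAutomation, so an unresponsive
            # application cannot block the screenshot.
            return {"controls_handles": [], "windows": [], "active_window": None}
        return {
            "controls_handles": self._uia.get_controls_handles(),
            "windows": self._uia.get_windows(),
            "active_window": self._uia.get_active_window(),
        }
```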

The comment explaining this was initially written in Japanese, which caused an encoding corruption issue when uv fetched the package from GitHub — multi-byte characters were mangled, newlines were swallowed, and an if statement was merged into a comment line, producing an IndentationError on startup. The comment was rewritten in English to avoid this.


Changes

src/windows_mcp/__main__.py

  • Added Screenshot tool with display parameter
  • Extracted _capture_desktop_state() shared helper (used by both Snapshot and Screenshot)
  • Added _snapshot_profile_enabled() and _as_bool() helpers
  • Added _build_snapshot_response() to deduplicate response construction
  • Response includes Screenshot Backend: line when available

src/windows_mcp/desktop/service.py

  • get_state(): Skip get_controls_handles/get_windows/get_active_window when use_ui_tree=False
  • get_screenshot(): Track _last_screenshot_backend (dxcam/pillow)
  • Added parse_display_selection() for display parameter validation
  • Added get_display_union_rect() for computing display capture region
  • Added per-stage profiling when WINDOWS_MCP_PROFILE_SNAPSHOT=1

src/windows_mcp/desktop/views.py

  • Added screenshot_backend: str | None field to DesktopState

src/windows_mcp/tree/service.py

  • Added screen_box property (used as fallback root box when UI tree is skipped)

tests/test_snapshot_display_filter.py

  • Added tests for parse_display_selection()
  • Added tests for display-filtered screenshot dimensions
  • Added tests for use_ui_tree=False tree skip + use_dom validation

Behavior

Default behavior (no breaking changes)

  • Snapshot tool continues to work exactly as before
  • All existing parameters and defaults are preserved

New Screenshot tool

{
  "tool": "Screenshot",
  "display": [0]
}

Returns a fast screenshot with DXCam backend (when available), no UI tree, no window enumeration.

Performance impact

| Scenario | Before | After |
|---|---|---|
| Screenshot during app launch (UIAutomation hang) | ~50s | <1s |
| Normal Screenshot with DXCam | ~200ms | ~200ms |
| Snapshot (use_ui_tree=True) | unchanged | unchanged |

Testing

python -m pytest -q tests/test_snapshot_display_filter.py
# 11 passed

yasuhirofujii-medley added 4 commits March 13, 2026 09:25
Track the backend used by get_screenshot() (dxcam/pillow) and
store it in DesktopState.screenshot_backend.
Add a 'Screenshot Backend: dxcam/pillow' line to the response text.

The Control Node parses this information and shows it in its logs,
so that whether DirectX capture is active can be confirmed remotely.

When use_ui_tree=False (Screenshot tool), skip get_controls_handles /
get_windows / get_active_window.
These UIAutomation APIs can hang while an application is starting up,
and there were cases where Screenshot was blocked for 47 seconds or more.

uv cache fetch corrupted multi-byte (Japanese) characters in comments,
causing newlines to be swallowed and merging the if-statement into
the comment line, resulting in IndentationError on startup.
@Jeomon
Member

Jeomon commented Mar 15, 2026

Awesome work, and sorry for the late reply.
Can you please organize __main__.py like this file, so that it is specifically meant for the MCP server?
Also, what are your thoughts on using mss?
https://github.com/BoboTiG/python-mss

@yasuhirofujii-medley
Author

Thanks for the review and the suggestion!

Regarding python-mss:
I've looked into it — mss optimizes the GDI/BitBlt capture path so it's noticeably faster than Pillow's ImageGrab, but DXCam (DXGI Desktop Duplication) still wins significantly since it captures directly from the GPU framebuffer.

That said, mss could be a solid middle-ground fallback for environments where DXCam isn't available (e.g., RDP sessions, VMs without GPU passthrough). The backend selection architecture in this PR already supports pluggable backends via WINDOWS_MCP_SCREENSHOT_BACKEND — adding mss as a fourth option alongside auto, pillow, and dxcam would be straightforward. Happy to add that if you'd like.
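
To make the env var selection concrete, here is a rough sketch of how the setting could be parsed once mss is added as a fourth value. The function name and the choice to fall back to auto on an unrecognized value are assumptions for illustration, not a description of the current code.

```python
import os

# The three existing values plus the proposed "mss" option.
VALID_BACKENDS = ("auto", "pillow", "dxcam", "mss")


def get_screenshot_backend(environ=None):
    """Read WINDOWS_MCP_SCREENSHOT_BACKEND, defaulting to "auto".

    Unknown values fall back to "auto" (an assumed policy; raising an
    error would be an equally reasonable choice).
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("WINDOWS_MCP_SCREENSHOT_BACKEND", "auto")
    value = raw.strip().lower()
    return value if value in VALID_BACKENDS else "auto"
```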

Regarding __main__.py organization:
Agreed — it's grown quite large with all tool definitions living in one flat file. I'd prefer to handle that refactor in a separate follow-up PR to keep this one focused on the Screenshot tool / DXCam / UIAutomation hang fix changes. The plan would be to move tool definitions into a tools/ subpackage and keep __main__.py as a thin server entrypoint + CLI. Would that work for you?

@Jeomon
Member

Jeomon commented Mar 15, 2026

okay cool

@yasuhirofujii-medley
Author

Thanks! If everything looks good, feel free to merge whenever you're ready. 🙏

I'll open a follow-up PR for the __main__.py refactor (tool definitions → tools/ subpackage) once this one lands.

@Jeomon
Member

Jeomon commented Mar 15, 2026

Could we consider moving the screenshot-related logic into a file called screenshot.py in the desktop folder? I'm not sure if creating a class named Screenshot at this stage makes sense, or if we should just place all screenshot-related functions in that file. What do you think? This way, the desktop service file would be much cleaner.

@Jeomon
Member

Jeomon commented Mar 15, 2026

To your knowledge, can we use mss or DXCam on macOS and Linux for capturing screenshots? Are there any alternatives?

@yasuhirofujii-medley
Author

Re: Moving screenshot logic into desktop/screenshot.py

Agreed — I'll extract the screenshot-related methods into desktop/screenshot.py as part of this PR. I think standalone functions (not a class) is the right call for now. The extracted pieces would be:

  • get_screenshot_backend() — env var parsing
  • capture_with_dxcam() / capture_with_pillow() — backend implementations
  • resolve_dxcam_region() — monitor mapping for DXCam
  • capture() — main entry point with auto-fallback logic

This keeps service.py focused on desktop state orchestration and makes the screenshot layer independently testable.
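
As one example of what "independently testable" could look like, here is a sketch of the monitor mapping step. It assumes the DXCam backend wants a region relative to the chosen monitor's top-left corner and that a capture rectangle must fit inside a single monitor; the signature, the tuple layout `(left, top, right, bottom)`, and the `None` return for "no match" are all hypothetical.

```python
def resolve_dxcam_region(rect, monitors):
    """Map a virtual-desktop rectangle onto one monitor.

    Returns (monitor_index, monitor_relative_region), or None when the
    rectangle does not fit entirely inside any single monitor, in which
    case the caller would fall back to another backend.
    """
    left, top, right, bottom = rect
    for index, (m_left, m_top, m_right, m_bottom) in enumerate(monitors):
        if m_left <= left and m_top <= top and right <= m_right and bottom <= m_bottom:
            # Shift the rectangle so the monitor's top-left corner is (0, 0).
            return index, (left - m_left, top - m_top, right - m_left, bottom - m_top)
    return None
```

Because the function takes plain tuples, it can be unit tested without a GPU or a display attached.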

Re: Cross-platform screenshot capture

Great question. Here's the landscape:

| Backend | Windows | macOS | Linux (X11) | Linux (Wayland) |
|---|---|---|---|---|
| DXCam | ✅ fastest (~10ms) | | | |
| mss | ✅ fast (~20ms) | ✅ (Quartz) | ✅ (X11) | |
| Pillow ImageGrab | ✅ (~50ms) | | | |
| PyAutoGUI | | | | |
| XDG Desktop Portal | | | | |

Since Windows-MCP is Windows-focused, I'd suggest keeping DXCam → Pillow as the default chain for now, but designing the screenshot module with a pluggable backend interface so that adding mss (or platform-specific backends) later becomes a drop-in change.

Something like:

# desktop/screenshot.py

from PIL.Image import Image

def capture(region=None, backend="auto") -> Image:
    """Platform-aware screenshot capture with backend fallback."""
    ...

This way, if the project expands to cross-platform in the future, we just register new backends without touching the capture API. I'll structure the extraction with this in mind.

@yasuhirofujii-medley
Author

One more thought on the cross-platform screenshot architecture — rather than choosing a single library, what about an auto-fallback chain per OS with an env var override?

WINDOWS_MCP_SCREENSHOT_BACKEND=auto (default)

auto → OS detection → fastest-first fallback:

  Windows:  DXCam → mss → Pillow
  macOS:    mss → Pillow
  Linux:    mss

Manual override:
  WINDOWS_MCP_SCREENSHOT_BACKEND=dxcam|mss|pillow

This way:

  • Users get the fastest available backend automatically with zero config
  • If a backend causes issues (e.g., DXCam on certain GPU drivers), they can pin it via one env var
  • Adding a new backend is just inserting it into the fallback chain — no API changes needed
  • mss becomes the cross-platform baseline, DXCam stays as the Windows fast path

I'll implement this as part of the desktop/screenshot.py extraction. The module would expose a single capture(region, backend="auto") entry point that handles the chain internally.
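
Roughly, the chain resolution could look like the sketch below. The per-OS lists mirror the proposal above; the function name, the `_CHAINS` table, and the Pillow-only fallback for an unrecognized OS are assumptions added for illustration.

```python
import platform

# Fastest-first fallback order per OS, as proposed above.
_CHAINS = {
    "Windows": ["dxcam", "mss", "pillow"],
    "Darwin": ["mss", "pillow"],  # platform.system() reports macOS as "Darwin"
    "Linux": ["mss"],
}


def resolve_backend_chain(backend="auto", system=None):
    """Return the ordered list of backends to try.

    A manual override pins a single backend; "auto" picks the chain for
    the current OS (`system` is injectable so this can be tested anywhere).
    """
    if backend != "auto":
        return [backend]
    system = platform.system() if system is None else system
    # Unknown OS: assume Pillow is the safest thing to try.
    return list(_CHAINS.get(system, ["pillow"]))
```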

Does this direction work for you?

@Jeomon
Member

Jeomon commented Mar 16, 2026

yes please

@Jeomon
Member

Jeomon commented Mar 16, 2026

Bro, I only asked about macOS and Linux for clarification; this MCP is specifically for Windows.

@yasuhirofujii-medley
Author

Good point, my bad. I over-interpreted the question and went ahead with macOS/Linux fallback chains you didn't ask for.

Just pushed a fix: removed the platform.system() branching and simplified _auto_backend_chain() to a flat Windows-only chain (dxcam -> mss -> pillow). The mss backend is still useful on Windows as a middle-ground fallback (e.g., RDP sessions where DXCam isn't available), so it stays in the chain; there is just no cross-platform logic anymore.

Changes in latest commit:

  • Removed platform import and OS detection from screenshot.py
  • _auto_backend_chain() now always returns ["dxcam", "mss", "pillow"]
  • Updated tests accordingly (18 passed)

@Jeomon Jeomon merged commit 6573185 into CursorTouch:main Mar 16, 2026
@Jeomon
Member

Jeomon commented Mar 16, 2026

Thanks, friend, I will merge the PR.
