feat: replace screenshot_size with screenshot_original_size to fix mouse coordinate mismatch #116

Merged
Jeomon merged 1 commit into CursorTouch:main from JezaChen:feat/track-screenshot-original-size
Mar 19, 2026
Conversation

@JezaChen
Collaborator

Problem

Some models — Claude Opus in particular — tend to rely on the Screenshot tool (rather than Snapshot) to decide where to click or move the mouse. The Screenshot tool returns a screenshot_size field in its response metadata representing the resolution of the image.

However, many LLM servers (e.g. Claude.ai) automatically compress or resize images before passing them to the model in order to reduce token usage. This means the image the model actually receives may be smaller than what screenshot_size indicates. When the model reads coordinates directly from the image and passes them to Click or Move without accounting for this discrepancy, the resulting positions are wrong — often off by a consistent scale factor — causing clicks to land in entirely unintended locations.

For example: the physical screen is 3840×2160, windows-mcp caps the screenshot to 1920×1080, and the LLM server further compresses it to 1024×576. The scale factor from the received image to the screen is 3840 / 1024 = 3.75, so a control appearing at (200, 200) in the received image is actually at (750, 750) on screen. The model, however, clicks (200, 200) unchanged.
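The arithmetic in this example can be sketched as a small helper. This is only an illustration of the conversion the model is expected to perform; `image_to_screen` is a hypothetical name and is not part of windows-mcp.

```python
def image_to_screen(point, received_size, original_size):
    """Map an (x, y) point in the received image to screen coordinates.

    received_size: (width, height) of the image the model actually sees.
    original_size: (width, height) reported as screenshot_original_size.
    """
    x, y = point
    recv_w, recv_h = received_size
    orig_w, orig_h = original_size
    # Scale each axis independently in case the resize was not uniform.
    scale_x = orig_w / recv_w
    scale_y = orig_h / recv_h
    return round(x * scale_x), round(y * scale_y)


# Numbers from the example: 3840x2160 screen, 1024x576 received image.
print(image_to_screen((200, 200), (1024, 576), (3840, 2160)))  # (750, 750)
```

Note that the intermediate 1920×1080 size never enters the calculation: only the received size and the original size matter.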

Root Cause

screenshot_size recorded the post-resize resolution on the server side (i.e. after windows-mcp's own downscaling cap). It said nothing about the size of the image the model actually received, and gave the model no guidance on how to reconcile the two. This made the field actively misleading.

Solution

  • Remove screenshot_size from DesktopState and the response metadata.
  • Introduce screenshot_original_size, captured immediately after the screenshot is taken, before any server-side downscaling. This represents the true screen coordinate space.
  • Attach an inline instruction to the screenshot_original_size metadata field, explaining to the model that:
    • The image it receives may have been further resized by the LLM server.
    • Before performing any mouse action (Click, Move, etc.), it must compare the actual received image dimensions against screenshot_original_size, compute the scale ratio, and apply it to convert image-space coordinates back to screen-space coordinates.
  • Update the Screenshot tool description (in both snapshot.py and manifest.json) to make this requirement explicit upfront.
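The ordering that matters here (record the size first, downscale second) can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in service.py: `Size`, `cap_size`, and `collect_sizes` are hypothetical stand-ins, and the 1920×1080 cap is taken from the example in the Problem section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    # Hypothetical stand-in for the project's Size type.
    width: int
    height: int


def cap_size(size, max_width=1920, max_height=1080):
    """Return the size after an aspect-preserving downscale cap (assumed behavior)."""
    scale = min(1.0, max_width / size.width, max_height / size.height)
    return Size(round(size.width * scale), round(size.height * scale))


def collect_sizes(captured):
    """Record the original size BEFORE resizing, then compute the capped size."""
    screenshot_original_size = captured  # true screen coordinate space
    resized = cap_size(captured)         # what the server actually sends
    return screenshot_original_size, resized


original, sent = collect_sizes(Size(3840, 2160))
print(original, sent)
```

Reporting `original` rather than `sent` gives the model a fixed reference for screen space, regardless of how many further resizes happen downstream.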

Changes

  • src/windows_mcp/desktop/views.py — replace screenshot_size: Size | None with screenshot_original_size: Size | None
  • src/windows_mcp/desktop/service.py — capture screenshot_original_size before the resize step instead of after
  • src/windows_mcp/tools/_snapshot_helpers.py — update metadata output to emit screenshot_original_size with coordinate-scaling guidance
  • src/windows_mcp/tools/snapshot.py — update Screenshot tool description
  • manifest.json — sync Screenshot tool description
  • tests/test_snapshot_display_filter.py — update tests to use the new field name

…use coordinate mismatch

The previous screenshot_size recorded the post-resize dimensions, which
misled the LLM into using downscaled coordinates for mouse actions.
Now capture the pre-resize original size instead, and include coordinate
scaling guidance in both the response metadata and the Screenshot tool
description (snapshot.py + manifest.json).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 19, 2026 15:17

Copilot AI left a comment

Pull request overview

This PR addresses mouse coordinate mismatches caused by LLM-side image resizing by replacing the misleading screenshot_size metadata with screenshot_original_size, captured before any server-side downscaling, and by updating Screenshot guidance to instruct coordinate rescaling.

Changes:

  • Renames DesktopState.screenshot_size to screenshot_original_size and captures it before resizing in desktop state collection.
  • Updates snapshot response metadata text and Screenshot tool descriptions to explain how to scale image coordinates back to screen coordinates.
  • Updates affected tests to use the new field name.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/windows_mcp/desktop/views.py | Renames the desktop state field to screenshot_original_size. |
| src/windows_mcp/desktop/service.py | Captures original screenshot dimensions prior to any resizing step. |
| src/windows_mcp/tools/_snapshot_helpers.py | Emits screenshot_original_size in the response text with coordinate scaling guidance. |
| src/windows_mcp/tools/snapshot.py | Updates the Screenshot tool description to mention coordinate scaling. |
| manifest.json | Syncs the Screenshot tool description update. |
| tests/test_snapshot_display_filter.py | Updates tests to assert against screenshot_original_size. |

Comment on lines +133 to +137
metadata_text += (
f"Screenshot Original Size: {desktop_state.screenshot_original_size.to_string()}"
" (the screenshot may be downscaled; multiply image coordinates by"
f" the ratio of original size to displayed size to get actual screen coordinates"
" for click, move and other mouse actions)\n"
Comment on lines 70 to 73
  @mcp.tool(
  name='Screenshot',
- description="Captures a fast screenshot-first desktop snapshot with cursor position, desktop/window summaries, and an image. This path skips UI tree extraction for speed. Use Snapshot when you need interactive element ids, scrollable regions, or browser DOM extraction.",
+ description="Captures a fast screenshot-first desktop snapshot with cursor position, desktop/window summaries, and an image. This path skips UI tree extraction for speed. Use Snapshot when you need interactive element ids, scrollable regions, or browser DOM extraction. Note: the returned image may be downscaled for efficiency; when it is, multiply image coordinates by the ratio of original size to displayed size to get the actual screen coordinates for mouse actions (Click, Move, etc.).",
  annotations=ToolAnnotations(
Comment on lines 79 to 82
  {
  "name": "Screenshot",
- "description": "Captures a fast screenshot-first desktop snapshot with cursor position, active/open windows, and an image. Skips UI tree extraction for speed and should be the default first call when you mainly need visual context. Supports display=[0] or display=[0,1] to limit capture to specific screens."
+ "description": "Captures a fast screenshot-first desktop snapshot with cursor position, active/open windows, and an image. Skips UI tree extraction for speed and should be the default first call when you mainly need visual context. Supports display=[0] or display=[0,1] to limit capture to specific screens. Note: the returned image may be downscaled for efficiency; when it is, multiply image coordinates by the ratio of original size to displayed size to get the actual screen coordinates for mouse actions (Click, Move, etc.)."
  },
@Jeomon Jeomon merged commit 23a0304 into CursorTouch:main Mar 19, 2026
3 of 4 checks passed