feat: replace screenshot_size with screenshot_original_size to fix mouse coordinate mismatch #116
Merged
Jeomon merged 1 commit into CursorTouch:main on Mar 19, 2026
…use coordinate mismatch

The previous screenshot_size recorded the post-resize dimensions, which misled the LLM into using downscaled coordinates for mouse actions. Now capture the pre-resize original size instead, and include coordinate scaling guidance in both the response metadata and the Screenshot tool description (snapshot.py + manifest.json).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR addresses mouse coordinate mismatches caused by LLM-side image resizing by replacing the misleading screenshot_size metadata with screenshot_original_size, captured before any server-side downscaling, and by updating Screenshot guidance to instruct coordinate rescaling.
Changes:
- Renames `DesktopState.screenshot_size` to `screenshot_original_size` and captures it before resizing in desktop state collection.
- Updates snapshot response metadata text and Screenshot tool descriptions to explain how to scale image coordinates back to screen coordinates.
- Updates affected tests to use the new field name.
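A minimal sketch of the "capture before resizing" ordering described above. The names `Size`, `cap_size`, and `collect_sizes` are hypothetical stand-ins rather than the project's actual code, and the 1920×1080 cap is taken from the example in the PR description, not from the implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    """Hypothetical stand-in for the project's Size type."""
    width: int
    height: int


def cap_size(original: Size, max_width: int = 1920, max_height: int = 1080) -> Size:
    """Return the size after an aspect-preserving downscale cap (never upscales)."""
    scale = min(max_width / original.width, max_height / original.height, 1.0)
    return Size(round(original.width * scale), round(original.height * scale))


def collect_sizes(captured: Size) -> tuple[Size, Size]:
    """Record the original size first, then derive the resized one from it."""
    screenshot_original_size = captured           # recorded before any resizing
    resized = cap_size(screenshot_original_size)  # size of the image actually sent
    return screenshot_original_size, resized
```

The point of the ordering is that the value reported to the model always describes the screen coordinate space, whatever the cap later does to the image.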
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/windows_mcp/desktop/views.py | Renames the desktop state field to screenshot_original_size. |
| src/windows_mcp/desktop/service.py | Captures original screenshot dimensions prior to any resizing step. |
| src/windows_mcp/tools/_snapshot_helpers.py | Emits screenshot_original_size in the response text with coordinate scaling guidance. |
| src/windows_mcp/tools/snapshot.py | Updates the Screenshot tool description to mention coordinate scaling. |
| manifest.json | Syncs the Screenshot tool description update. |
| tests/test_snapshot_display_filter.py | Updates tests to assert against screenshot_original_size. |
Comment on lines +133 to +137
    metadata_text += (
        f"Screenshot Original Size: {desktop_state.screenshot_original_size.to_string()}"
        " (the screenshot may be downscaled; multiply image coordinates by"
        f" the ratio of original size to displayed size to get actual screen coordinates"
        " for click, move and other mouse actions)\n"
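The rule stated in this metadata text can be sketched from the consumer's side. This is an illustration only: `image_to_screen` is a hypothetical helper, not part of windows-mcp, and it assumes the image was scaled uniformly from the full screen with both sizes given as `(width, height)` tuples.

```python
def image_to_screen(x, y, displayed_size, original_size):
    """Scale a point from received-image space to screen space.

    Hypothetical helper: multiplies each coordinate by the ratio of
    original size to displayed size, as the metadata text instructs.
    """
    disp_w, disp_h = displayed_size
    orig_w, orig_h = original_size
    return round(x * orig_w / disp_w), round(y * orig_h / disp_h)
```

When the displayed and original sizes are equal the function returns the point unchanged, so it is safe to apply unconditionally.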
Comment on lines 70 to 73
     @mcp.tool(
         name='Screenshot',
    -    description="Captures a fast screenshot-first desktop snapshot with cursor position, desktop/window summaries, and an image. This path skips UI tree extraction for speed. Use Snapshot when you need interactive element ids, scrollable regions, or browser DOM extraction.",
    +    description="Captures a fast screenshot-first desktop snapshot with cursor position, desktop/window summaries, and an image. This path skips UI tree extraction for speed. Use Snapshot when you need interactive element ids, scrollable regions, or browser DOM extraction. Note: the returned image may be downscaled for efficiency; when it is, multiply image coordinates by the ratio of original size to displayed size to get the actual screen coordinates for mouse actions (Click, Move, etc.).",
         annotations=ToolAnnotations(
Comment on lines 79 to 82
     {
       "name": "Screenshot",
    -  "description": "Captures a fast screenshot-first desktop snapshot with cursor position, active/open windows, and an image. Skips UI tree extraction for speed and should be the default first call when you mainly need visual context. Supports display=[0] or display=[0,1] to limit capture to specific screens."
    +  "description": "Captures a fast screenshot-first desktop snapshot with cursor position, active/open windows, and an image. Skips UI tree extraction for speed and should be the default first call when you mainly need visual context. Supports display=[0] or display=[0,1] to limit capture to specific screens. Note: the returned image may be downscaled for efficiency; when it is, multiply image coordinates by the ratio of original size to displayed size to get the actual screen coordinates for mouse actions (Click, Move, etc.)."
     },
Problem
Some models, Claude Opus in particular, tend to rely on the Screenshot tool (rather than Snapshot) to decide where to click or move the mouse. The Screenshot tool returns a `screenshot_size` field in its response metadata representing the resolution of the image.

However, many LLM servers (e.g. Claude.ai) automatically compress or resize images before passing them to the model in order to reduce token usage. This means the image the model actually receives may be smaller than what `screenshot_size` indicates. When the model reads coordinates directly from the image and passes them to `Click` or `Move` without accounting for this discrepancy, the resulting positions are wrong, often off by a consistent scale factor, causing clicks to land in entirely unintended locations.

For example: the physical screen is 3840×2160, windows-mcp caps the screenshot to 1920×1080, and the LLM server further compresses it to 1024×576. A control appearing at `(200, 200)` in the received image is actually at `(750, 750)` on screen (a scale factor of 3840 / 1024 = 3.75), but the model still clicks `(200, 200)`.

Root Cause

`screenshot_size` recorded the post-resize resolution on the server side (i.e. after windows-mcp's own downscaling cap). It said nothing about the size of the image the model actually received, and gave the model no guidance on how to reconcile the two. This made the field actively misleading.

Solution

- Remove `screenshot_size` from `DesktopState` and the response metadata.
- Add `screenshot_original_size`, captured immediately after the screenshot is taken, before any server-side downscaling. This represents the true screen coordinate space.
- Add guidance to the `screenshot_original_size` metadata field, explaining to the model that:
  - the image it receives may have been downscaled;
  - it should compare the received image size with `screenshot_original_size`, compute the scale ratio, and apply it to convert image-space coordinates back to screen-space coordinates.
- Update the Screenshot tool description (in `snapshot.py` and `manifest.json`) to make this requirement explicit upfront.

Changes

- `src/windows_mcp/desktop/views.py`: replace `screenshot_size: Size | None` with `screenshot_original_size: Size | None`
- `src/windows_mcp/desktop/service.py`: capture `screenshot_original_size` before the resize step instead of after
- `src/windows_mcp/tools/_snapshot_helpers.py`: update metadata output to emit `screenshot_original_size` with coordinate-scaling guidance
- `src/windows_mcp/tools/snapshot.py`: update Screenshot tool description
- `manifest.json`: sync Screenshot tool description
- `tests/test_snapshot_display_filter.py`: update tests to use the new field name
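
The arithmetic behind the example in the Problem section can be written out as a worked check. The symbols below are notation introduced here for illustration, not names from the codebase, and the second line is a hypothetical: it shows what a model would have computed had it rescaled against the old post-cap size rather than the original size.

```latex
\begin{aligned}
s_{\text{original}} &= \frac{3840}{1024} = \frac{2160}{576} = 3.75, &
(x, y)_{\text{screen}} &= (200 \cdot 3.75,\ 200 \cdot 3.75) = (750,\ 750) \\
s_{\text{post-cap}} &= \frac{1920}{1024} = \frac{1080}{576} = 1.875, &
(x, y)_{\text{wrong}} &= (200 \cdot 1.875,\ 200 \cdot 1.875) = (375,\ 375)
\end{aligned}
```

Only the ratio against the original 3840×2160 size recovers the on-screen position, which is consistent with reporting `screenshot_original_size` instead of the post-resize size.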