
Added -accurate, reflective mouse click mode #57

Merged: 22 commits into OthersideAI:main on Dec 3, 2023

Conversation

@klxu03 (Contributor) commented Dec 2, 2023

Implemented -accurate, reflective mouse click mode.

If you add the -accurate flag when running operate, accurate mode is enabled. Whenever the model tries to click on anything, it now sends one additional request to GPT, giving it a chance to adjust its initial percentage guess.

I first extract the model's initial guess of where it wants to click, then take a screenshot of a smaller 200 x 200 pixel rectangle around that guess. I upsample the image by doubling its dimensions so it appears bigger to GPT (all done in capture_mini_screenshot_with_cursor), and then ask the model to refine its guess, giving it the previous image/message as context along with its previous X/Y coordinate guess, so it can add or subtract minute percentages (accurate_mode_double_check).
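
For reference, here is a rough sketch of that crop-and-upsample step (a simplified stand-in for the PR's capture_mini_screenshot_with_cursor; the body below is illustrative, not the PR's exact code):

```python
import pyautogui  # pyautogui.screenshot() returns a PIL Image

ACCURATE_PIXEL_COUNT = 200  # side length of the mini screenshot

def capture_mini_screenshot(guess_x: int, guess_y: int):
    """Crop a small box around the model's initial click guess, then upsample."""
    half = ACCURATE_PIXEL_COUNT // 2
    screenshot = pyautogui.screenshot()
    left, top = max(guess_x - half, 0), max(guess_y - half, 0)
    mini = screenshot.crop((left, top,
                            left + ACCURATE_PIXEL_COUNT,
                            top + ACCURATE_PIXEL_COUNT))
    # Double both dimensions so the region reads larger to GPT-4V
    return mini.resize((ACCURATE_PIXEL_COUNT * 2, ACCURATE_PIXEL_COUNT * 2))
```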

Locally, this configuration has significantly improved clicking accuracy on my desktop setup (two monitors). Currently I've only implemented it for Linux, but if this approach is liked, it can easily be adapted to other OSes.

The idea could be improved further by changing add_grid_to_image so that, when adding the grid to the mini screenshot in accurate mode, the first intersection is labeled with the relative percentage change, e.g. (-3%, -5%), instead of the absolute (25%, 25%). That would probably make it easier for the model to add and subtract the proper amount in the refined click.
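
As a sketch of that relative-labeling idea (assuming Pillow and a grid drawn on the unscaled 200 x 200 crop; nothing here is the PR's actual add_grid_to_image):

```python
from PIL import Image, ImageDraw

def add_relative_grid(mini: Image.Image, screen_w: int, screen_h: int,
                      spacing: int = 50) -> Image.Image:
    """Label grid intersections with offsets from the crop's center, expressed
    as percentages of the full screen, so the model can answer with a small
    relative adjustment like (-3%, -5%) instead of an absolute (25%, 25%)."""
    draw = ImageDraw.Draw(mini)
    cx, cy = mini.width // 2, mini.height // 2
    for x in range(0, mini.width + 1, spacing):
        draw.line([(x, 0), (x, mini.height)], fill="red")
    for y in range(0, mini.height + 1, spacing):
        draw.line([(0, y), (mini.width, y)], fill="red")
    for x in range(0, mini.width + 1, spacing):
        for y in range(0, mini.height + 1, spacing):
            dx = (x - cx) / screen_w * 100  # % of full screen width
            dy = (y - cy) / screen_h * 100  # % of full screen height
            draw.text((x + 2, y + 2), f"({dx:+.1f}%, {dy:+.1f}%)", fill="red")
    return mini
```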

Also, I chose 200 x 200 as the rectangle size almost arbitrarily. I just noticed that on my desktop, the model would often be off by more than 100 pixels but less than 200, so I chose that as the size.

I would be happy to improve my code or explain anything as needed!

PS: I also added Poetry support, but I can delete it and just add all of those files to .gitignore.

@joshbickett (Contributor)
@klxu03 thank you for this PR. It looks promising. I'll let you know if I have any questions!

@michaelhhogue added the enhancement (New feature or request) label on Dec 2, 2023
@michaelhhogue (Collaborator)

@klxu03 Using mss for screenshots on Linux appears to be a much better solution, especially for just getting the active monitor.
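
For context, grabbing a single monitor with mss looks roughly like this (standard mss usage, not code from this PR):

```python
import mss
import mss.tools

with mss.mss() as sct:
    # monitors[0] is the full virtual screen; monitors[1] is the first monitor
    monitor = sct.monitors[1]
    shot = sct.grab(monitor)  # raw BGRA pixels for just that monitor
    mss.tools.to_png(shot.rgb, shot.size, output="monitor-1.png")
```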

@klxu03 (Contributor, Author) commented Dec 2, 2023

> @klxu03 Using mss for screenshots on Linux appears to be a much better solution, especially for just getting the active monitor.

Yeah, earlier I went with mss, but then I noticed the clicking percentage system was configured globally, so the two didn't align. But if you think this is worth exploring, it probably wouldn't be bad to do a conversion (like scaling GPT's output: divide by two and add 50% to the percentage).
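
A sketch of that conversion, assuming two equal-width monitors side by side and percentages reported relative to one monitor (the function name and signature are illustrative):

```python
def monitor_to_global_percent(x_pct: float, y_pct: float,
                              monitor_index: int, num_monitors: int = 2):
    """Map a percentage within one monitor to a percentage of the full
    virtual desktop. For the right monitor of two (index 1), x% becomes
    x%/2 + 50%, which is the scaling described above."""
    global_x = x_pct / num_monitors + monitor_index * (100 / num_monitors)
    return global_x, y_pct  # y is unchanged for a side-by-side layout

# e.g. 40% across the right monitor -> 40/2 + 50 = 70% of the full desktop
```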

@michaelhhogue (Collaborator)

@klxu03 Ah okay, I see the removal of mss now. mss is probably the best multi-platform screenshot solution and the easiest way to get the active monitor, so looking into the conversion in order to adopt mss could definitely help!

@klxu03 (Contributor, Author) commented Dec 2, 2023

> @klxu03 Ah okay, I see the removal of mss now. mss is probably the best multi-platform screenshot solution and the easiest way to get the active monitor, so looking into the conversion in order to adopt mss could definitely help!

Yeah, makes sense. I'll probably try it another time (in a different PR). BTW, I didn't do much prompt engineering for -accurate, so definitely feel free to change it.

Curious: is my code readable enough that you can follow the logic of what's happening? Are there any glaring architectural/design choices you didn't like in this PR?

@joshbickett (Contributor)

@klxu03 Just reviewing this PR now. The project grew more than expected and it has been busy.

The code looks good at a high level, but my pip install . was breaking, maybe something to do with the new pyproject.toml. I honestly don't know Poetry well. We probably should be compatible with it, but my main concern is that the same thing would break for other users.

Here's a ChatGPT thread about it: https://chat.openai.com/share/994b3d20-5bc4-4954-8abe-f53ddabc90ca

I deleted the pyproject.toml and it runs for me now, so I'll test the -accurate mode. One thought: maybe we could move the Poetry support to another PR and keep this one focused on accuracy. Anyway, I'll have additional input soon.

@joshbickett (Contributor) commented Dec 3, 2023

I like what I see so far.

The 200x200 mini screenshot may be a little small. It appears to me that GPT-4V can roughly guess which of the 4 quadrants of the screen to move into, so it may make sense to make mini_screenshot.jpg the one of the 4 quadrants that is correct.

We could imagine a system where we loop over this function, breaking the screenshot into ever smaller quadrants until we have just the right button to click... but maybe that's for another PR.

Anyway, I'm going to keep reviewing the PR and will share more thoughts.

@klxu03 (Contributor, Author) commented Dec 3, 2023

> I like what I see so far.
>
> The 200x200 mini screenshot may be a little small. It appears to me that GPT-4V can roughly guess which of the 4 quadrants of the screen to move into, so it may make sense to make mini_screenshot.jpg the one of the 4 quadrants that is correct.
>
> We could imagine a system where we loop over this function, breaking the screenshot into ever smaller quadrants until we have just the right button to click... but maybe that's for another PR.
>
> Anyway, I'm going to keep reviewing the PR and will share more thoughts.

I love that idea! Like a sniper slowly scoping in. So the idea would be scoping into, say, 400 x 400, adjusting, then 200 x 200, adjusting, and then 100 x 100? I feel like we could adjust -accurate to also take a number, the number of layers of scoping precision you'd like: -accurate = 3 means 3 layers of scoping.

I coded this mini screenshot system in a way such that the scoping amounts can all be variables; it only depends on ACCURATE_PIXEL_COUNT, which could easily be a param passed in. A sketch of that loop is below.
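
(A sketch of that parameterized loop; refine_guess and the box-size-taking capture_mini_screenshot are hypothetical stand-ins for the PR's helpers:)

```python
def accurate_click_guess(x: int, y: int, layers: int = 3,
                         start_box: int = 400) -> tuple[int, int]:
    """Iteratively 'scope in': crop an ever smaller box around the current
    guess and let GPT refine it once per layer. -accurate = 3 would run
    three refinement passes (400 -> 200 -> 100 pixel boxes)."""
    box = start_box
    for _ in range(layers):
        mini = capture_mini_screenshot(x, y, box)  # hypothetical: crop box x box
        x, y = refine_guess(mini, x, y)            # hypothetical: one GPT call
        box //= 2
    return x, y
```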

Also, yeah, I will just delete the Poetry files. 100% not important.

@joshbickett (Contributor) commented Dec 3, 2023

Ok great. If you can push up your Poetry changes, I think this is ready to merge into the main project.

I think this is a good architecture. It doesn't always perform well for me, but I think we can iterate on it to improve performance.

-accurate = 3 meaning 3 layers of scoping sounds like a good approach. We can do this in a later PR.

@klxu03 (Contributor, Author) commented Dec 3, 2023

> Ok great. If you can push up your Poetry changes, I think this is ready to merge into the main project.
>
> I think this is a good architecture. It doesn't always perform well for me, but I think we can iterate on it to improve performance.
>
> -accurate = 3 meaning 3 layers of scoping sounds like a good approach. We can do this in a later PR.

Awesome, sounds good! I just cleaned up the repo. I'm going to do a quick round of testing to make sure it all works.

Update: it works

@joshbickett (Contributor)

Ok, looks good! I'm going to merge it. Can you create a new PR for one thing I noticed?

Can you add more "app prints" so there is a log of the click logic? Something like the lines below. Does that make sense?

[Self-Operating Computer] [Act] CLICK
[Self-Operating Computer] [Act] CLICK REFLECTION
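
(A minimal sketch of such a print helper; the function name is hypothetical:)

```python
def log_act(action: str) -> None:
    # Prefix every action with the app banner so the click logic is traceable
    print(f"[Self-Operating Computer] [Act] {action}")

log_act("CLICK")
log_act("CLICK REFLECTION")
```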

@joshbickett merged commit 51d9993 into OthersideAI:main on Dec 3, 2023
@klxu03 (Contributor, Author) commented Dec 3, 2023

> Ok, looks good! I'm going to merge it. Can you create a new PR for one thing I noticed?
>
> Can you add more "app prints" so there is a log of the click logic? Something like the lines below. Does that make sense?
>
> [Self-Operating Computer] [Act] CLICK
> [Self-Operating Computer] [Act] CLICK REFLECTION

Yup! This makes sense

@joshbickett (Contributor)

Anyway, I think you get the vision. The ideas we've discussed are all great. If you want to iterate on what you've built so far, that'd be great!!

@klxu03 (Contributor, Author) commented Dec 3, 2023

> Anyway, I think you get the vision. The ideas we've discussed are all great. If you want to iterate on what you've built so far, that'd be great!!

Of course! Happy to contribute and improve :)

@joshbickett (Contributor)

Also, in the quick start section of README.md, could you create an "Additional features" section or something and add the -accurate flag details?

@klxu03 (Contributor, Author) commented Dec 3, 2023

> Also, in the quick start section of README.md, could you create an "Additional features" section or something and add the -accurate flag details?

For sure! I also just DMed you on Twitter with an additional thought I had after this thread: later converting the accuracy step into a full-blown, pure classification problem!

@michaelhhogue (Collaborator)

@klxu03 Just wanted to comment that I've tested accurate mode on Linux and it's working great; I'm noticing significant improvements already. However, it sometimes seems to prioritize looking at the mini screenshot over the whole screen, so it gets "stuck" in the 200 x 200 area of the screen where the previous guess was.

Tomorrow I'm going to look into refining the accurate mode vision prompt a bit to reduce how often it gets stuck in the 200 x 200 box.

Great work on this!
