Create automated evaluator #98
Conversation
@michaelhhogue thanks for the PR. I'll try to look at it by tomorrow morning to provide input.
@michaelhhogue we've got some other priorities that've escalated at othersideai, so it will be a little while before I can review these PRs. Thanks for your patience.
@joshbickett No problem!
@michaelhhogue testing on this now. I am seeing the following error when trying to run standard
It appears this error is related to redefining something in the above scope called
@joshbickett Ah whoops! Thanks for catching that oversight.
@michaelhhogue no problem! Fixed now, still trying out
@michaelhhogue worked great. This is a great value-add PR to the project!
@joshbickett Glad to hear! It's not perfect, but it should be a good starting point to build off of in evaluating soc.
Sorry, I was trying to test this and it gave me this error. It seems like operate did not recognize --prompt. I ran operate --help and it gave me this. Could you tell me where the problem is?
@Daisuke134 thanks for the input. To confirm, you just ran
Yes.
Oh interesting. Could you try running `pip install .`?
Sorry!! I was not doing `pip install .`. It works now 🙇‍♀️
Ok, great!
@joshbickett Just wanted to throw in real quick: the test cases I included are sort of just placeholders so I could test it faster. It might be best to push a commit changing these to whatever you had in mind.
@michaelhhogue if you have time to create a PR to add some instructions on testing to the
Ok sounds good. I'll update the test cases with slight changes, but they are a good start.
Yeah I'll put one together real quick. |
Question: I ran "python3 evaluate.py" but it gave me a "Couldn't open the summary screenshot" error; when I changed the desktop background from my dog picture to a simple blue background, it worked. Also, if my Google Chrome tabs are open, it struggles to open a new tab in the existing window. Are there any PRs that address these issues?
@Daisuke134 It prints that message when
Edit: It would probably be useful to open a PR giving a more descriptive message for when there isn't a summary screenshot.
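Something along these lines could work as the friendlier check (a sketch only; the screenshot path and wording below are assumptions, not code from this PR):

```python
# Sketch of a more descriptive failure message when the summary screenshot
# is missing. The path is an assumed location, not the one operate actually uses.
import os
import sys

SUMMARY_SCREENSHOT = "screenshots/summary_screenshot.png"

if not os.path.exists(SUMMARY_SCREENSHOT):
    print(
        f"Couldn't open the summary screenshot at '{SUMMARY_SCREENSHOT}'. "
        "The operate run may have finished before a summary screenshot was taken, "
        "so this test case cannot be evaluated.",
        file=sys.stderr,
    )
```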
@michaelhhogue thank you. I changed my desktop background and the whole thing worked! As for the error, it created both screenshot.png and screenshot_with_grid.png, but it probably could not figure out how to open GitHub. I was just wondering if anyone was working on the case where a Chrome window is already open. I am trying out different VISION_PROMPT values to fix this problem.
@Daisuke134 It's hit-or-miss when I have Chrome already open. Sometimes it just goes to the URL bar and types in the next site; sometimes it re-launches Chrome. On Linux, it'll open a new Chrome window when you search for it again. Not sure how this behaves on other OSes.
This PR introduces a new program called `evaluate.py`. The purpose of `evaluate.py` is to automate testing soc on a set of common test cases.

How to use:
With the venv sourced and the project properly installed, simply run `python3 evaluate.py`.
Once the evaluation begins, `operate` is automatically called for each defined test case. Upon completion of each test case, `evaluate` will use GPT-4v to confirm if the objective was successfully reached for that test case based on the summary screenshot. For each test case, the console will print out either `[PASSED]` or `[FAILED]` based on GPT-4v's evaluation. A justification is also given from GPT-4v on why it passed or failed the test case.
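For anyone skimming, the overall flow is roughly the sketch below. This is not the code in `evaluate.py`; the screenshot path, test case, and model name are placeholder assumptions.

```python
# Minimal sketch of the evaluation loop described above, assuming an
# OpenAI-style client. Paths, prompts, and model name are placeholders.
import base64
import subprocess
from openai import OpenAI

TEST_CASES = {
    "Go to github.com in the browser": "A browser window with github.com loaded is visible.",
}
SUMMARY_SCREENSHOT = "screenshots/summary_screenshot.png"  # assumed location

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def evaluate_screenshot(guideline: str, path: str) -> str:
    """Ask GPT-4v whether the summary screenshot meets the success guideline."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Guideline: {guideline}\n"
                         "Based on the screenshot, answer [PASSED] or [FAILED] "
                         "and briefly justify the verdict."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content


for prompt, guideline in TEST_CASES.items():
    # Run the objective non-interactively through the new --prompt flag.
    subprocess.run(["operate", "--prompt", prompt], capture_output=True)
    print(prompt)
    print(evaluate_screenshot(guideline, SUMMARY_SCREENSHOT))
```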
Demo
Here is a demo of `evaluate` testing two objectives:

simplescreenrecorder-2023-12-09_12_57_10_AdobeExpress.mp4
--prompt
To make this work properly, a new command line argument was added to `operate`: `--prompt <prompt>`, which will inject the objective directly and begin executing tasks right away.
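Wiring up a flag like that typically looks something like the following; this is an illustrative sketch, not the exact change made to `operate`'s entry point.

```python
# Illustrative argparse wiring for a --prompt flag; names here are examples.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="operate")
    parser.add_argument(
        "--prompt",
        type=str,
        default=None,
        help="Inject the objective directly and begin executing tasks right away, "
             "skipping the interactive objective question.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    if args.prompt:
        objective = args.prompt  # non-interactive: objective supplied up front
    else:
        objective = input("What would you like the computer to do? ")
    print(f"Objective: {objective}")
```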
Note
The test cases can be changed at the top of `evaluate.py` in `TEST_CASES`. Please change these to whatever you want. This is a dictionary where the key is the prompt itself and the value is the guideline of success for that prompt.
The test cases I have in there currently were just for me to test the program faster.
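For reference, a `TEST_CASES` dictionary in that shape might look like the following; the entries here are illustrative, not the placeholders that ship with the PR.

```python
# prompt -> guideline of success that GPT-4v uses to judge the summary screenshot
TEST_CASES = {
    "Open a text editor and type 'hello world'":
        "A text editor window is open and shows the text 'hello world'.",
    "Go to github.com in the browser":
        "A browser window is visible with github.com loaded.",
}
```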
Future improvements
- Show the output of `operate`. Currently the output is silenced so that it doesn't clear the evaluation results from the terminal.
- A `--manual` mode which skips GPT-4v's evaluation.

If you have any questions, let me know!