@michaelhhogue
Collaborator

@michaelhhogue michaelhhogue commented Dec 9, 2023

This PR introduces a new program called evaluate.py.

The purpose of evaluate.py is to automate testing soc on a set of common test cases.

How to use:

With the venv sourced and the project properly installed, simply run

python3 evaluate.py

Once the evaluation begins, operate is automatically called for each defined test case. Upon completion of each test case, evaluate will use GPT-4v to confirm if the objective was successfully reached for that test case based on the summary screenshot.

For each test case, the console will print out either [PASSED] or [FAILED] based on GPT-4v's evaluation.
A justification is also given from GPT-4v on why it passed or failed the test case.
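
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in evaluate.py: `evaluate_all`, `run_operate`, and `judge_screenshot` are hypothetical names, the guidelines are made up, and the GPT-4v call is replaced by a stub so the sketch runs offline.

```python
from typing import Callable, Dict, Tuple

# Hypothetical test cases: prompt -> guideline describing success.
TEST_CASES: Dict[str, str] = {
    "Go to Github.com": "The Github home page is visible.",
    "Go to Youtube.com and play a video": "A YouTube video is playing.",
}


def evaluate_all(
    test_cases: Dict[str, str],
    run_operate: Callable[[str], None],
    judge_screenshot: Callable[[str], Tuple[bool, str]],
) -> Dict[str, bool]:
    """Run every test case and print [PASSED] or [FAILED] with a justification."""
    results: Dict[str, bool] = {}
    for prompt_text, guideline in test_cases.items():
        # In the real tool this step would invoke operate with the objective.
        run_operate(prompt_text)
        # In the real tool this step would ask GPT-4v about the summary screenshot.
        passed, justification = judge_screenshot(guideline)
        status = "[PASSED]" if passed else "[FAILED]"
        print(f"{status} {prompt_text}: {justification}")
        results[prompt_text] = passed
    return results


# Stubs stand in for operate and for the GPT-4v evaluation.
evaluate_all(
    TEST_CASES,
    run_operate=lambda objective: None,
    judge_screenshot=lambda guideline: (True, "stubbed verdict"),
)
```

Passing the runner and the judge in as functions keeps the loop testable without a screen or an API key.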

Demo

Here is a demo of evaluate testing two objectives:

  • Go to Github.com
  • Go to Youtube.com and play a video
simplescreenrecorder-2023-12-09_12_57_10_AdobeExpress.mp4

--prompt

To make this work properly, a new command line argument was added to operate: --prompt <prompt> which will inject the objective directly and begin executing tasks right away.
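
A minimal sketch of how such an argument might be wired up with argparse. The parser layout and the `parse_args` helper are assumptions for illustration, not the actual code in operate; only the flag name --prompt comes from this PR.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse command line arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(prog="operate")
    parser.add_argument(
        "--prompt",
        type=str,
        default=None,
        help="Objective to run immediately, skipping the interactive input",
    )
    return parser.parse_args(argv)


# Example: the objective is injected directly instead of being typed in.
args = parse_args(["--prompt", "Go to Github.com"])
if args.prompt is not None:
    print(f"Starting objective: {args.prompt}")
else:
    print("No --prompt given; the user would be asked for an objective")
```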

Note

The test cases can be changed at the top of evaluate.py in TEST_CASES. Please change these to whatever you want.
This is a dictionary where the key is the prompt itself and the value is the guideline of success for that prompt.
The test cases I have in there currently were just for me to test the program faster.

Future improvements

  • Add option at the end of each evaluation or all evaluations to print the output of operate. Currently the output is silenced so that it doesn't clear the evaluation results from the terminal.
  • Load test cases/guidelines from json file?
  • Add --manual mode which skips GPT-4v's evaluation
  • Play sound or bring terminal into focus when evaluation is complete
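
As one possible shape for the JSON idea above, here is a sketch that assumes a file mapping each prompt to its guideline, mirroring the TEST_CASES dictionary. The `load_test_cases` helper and the file layout are assumptions, not part of this PR.

```python
import json
from pathlib import Path
from typing import Dict


def load_test_cases(path: str) -> Dict[str, str]:
    """Load a {prompt: guideline} mapping from a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object mapping prompts to guidelines")
    # JSON keys are always strings, so only the values need checking.
    for prompt_text, guideline in data.items():
        if not isinstance(guideline, str):
            raise ValueError(f"Guideline for {prompt_text!r} must be a string")
    return data
```

The result could then replace the hard-coded dictionary, for example `TEST_CASES = load_test_cases("test_cases.json")`, where the file name is hypothetical.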

If you have any questions, let me know!

@joshbickett
Contributor

@michaelhhogue thanks for the PR. I'll try to look at it by tomorrow morning to provide input.

@joshbickett
Contributor

@michaelhhogue we've got some other priorities that have escalated at othersideai, so it will be a little while before I can review these PRs. Thanks for your patience.

@michaelhhogue
Collaborator Author

@joshbickett No problem!

@joshbickett
Contributor

@michaelhhogue testing on this now. I am seeing the following error when trying to run standard operate mode.

[Self-Operating Computer]
Hello, I can help you with anything. What would you like done?
[User]
Traceback (most recent call last):
  ...
  operate/main.py", line 249, in main
    objective = prompt(style=style)
TypeError: 'NoneType' object is not callable

@joshbickett
Contributor

It appears this error is related to redefining something in the above scope called prompt:

TypeError: 'NoneType' object is not callable
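
For readers unfamiliar with this failure mode, here is a small reproduction of the general pattern. It is an illustration only: `main_buggy` and `main_fixed` are hypothetical names, the stand-in `prompt` function is made up, and the real code in operate/main.py may differ. A parameter named prompt that defaults to None hides an outer callable with the same name, so calling it raises the TypeError.

```python
def prompt(style=None):
    """Stand-in for an interactive input function (hypothetical)."""
    return "objective typed by the user"


def main_buggy(prompt=None):
    # The parameter shadows the module-level prompt() function.
    # With no argument passed, prompt is None here, so the call fails.
    return prompt(style=None)


def main_fixed(prompt_text=None):
    # Renaming the parameter removes the clash with prompt().
    if prompt_text is not None:
        return prompt_text
    return prompt(style=None)


try:
    main_buggy()
except TypeError as err:
    print(f"TypeError: {err}")

print(main_fixed("Go to Github.com"))
```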

@michaelhhogue
Collaborator Author

@joshbickett Ah whoops! Thanks for catching that oversight

@joshbickett
Contributor

@michaelhhogue no problem! Fixed now, still trying out evaluate.py.

@joshbickett
Contributor

@michaelhhogue worked great. This is a great value add PR to the project!

@joshbickett joshbickett merged commit 82829cf into OthersideAI:main Dec 13, 2023
@michaelhhogue
Collaborator Author

@joshbickett Glad to hear! It's not perfect, but it should be a good starting point to build off of in evaluating soc.

@Daisuke134
Contributor

Sorry, I was trying to test this and it gave me this error:
operate: error: unrecognized arguments: --prompt "Go to Github.com"

It seems operate did not recognize --prompt.

I ran operate --help and it gave me this:
options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Specify the model to use
  --voice               Use voice input mode
  -accurate             Activate Reflective Mouse Click Mode

Could you tell me where the problem is?

@joshbickett
Contributor

joshbickett commented Dec 13, 2023

@Daisuke134 thanks for the input. To confirm, you just ran python3 evaluate.py and that's what failed?

@Daisuke134
Contributor

yes.

@joshbickett
Contributor

joshbickett commented Dec 13, 2023

Oh interesting. Could you try running pip install . and then python3 evaluate.py and see if that fixes it?

@Daisuke134
Contributor

Sorry!! I had not run pip install . first. It works now 🙇‍♀️

@joshbickett
Contributor

Ok, great!

@michaelhhogue
Collaborator Author

@joshbickett Just wanted to throw in real quick, the test cases I included are just placeholders so I could test it faster. It might be best to push a commit changing these to whatever you had in mind.

@joshbickett
Contributor

@michaelhhogue if you have time to create a PR to add some instructions on testing to the contributing.md file, that'd be great!

@joshbickett
Contributor

> @joshbickett Just wanted to throw in real quick, the test cases I included are just placeholders so I could test it faster. It might be best to push a commit changing these to whatever you had in mind.

OK, sounds good. I'll update the test cases with slight changes, but they are a good start.

@michaelhhogue
Collaborator Author

> @michaelhhogue if you have time to create a PR to add some instruction on testing to the contributing.md file that'd be great!

Yeah I'll put one together real quick.

@Daisuke134
Contributor

Question: I ran "python3 evaluate.py" and it gave me a "Couldn't open the summary screenshot" error, but when I changed the desktop background from my dog picture to a simple blue background, it worked.

Also, if Google Chrome tabs are already open, it struggles to open a new tab in the existing window. Are there any PRs that address these issues?

@michaelhhogue
Collaborator Author

michaelhhogue commented Dec 13, 2023

@Daisuke134 It prints that message when operate never successfully finished any objective and saved a summary screenshot. Has operate reached 'DONE' on any objective yet? There should be a file named "summary_screenshot.png" in the screenshots folder if it has.

Edit: Probably would be useful to open a PR giving a more descriptive message for when there isn't a summary screenshot.
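
A rough sketch of what a more descriptive check could look like. The path comes from the comment above; `summary_screenshot_error` is a hypothetical helper and the message wording is only a suggestion, not code from evaluate.py.

```python
import os
from typing import Optional

# Location mentioned in this thread; adjust if the project layout differs.
SUMMARY_SCREENSHOT = os.path.join("screenshots", "summary_screenshot.png")


def summary_screenshot_error(path: str = SUMMARY_SCREENSHOT) -> Optional[str]:
    """Return a descriptive message if the summary screenshot is missing, else None."""
    if os.path.isfile(path):
        return None
    return (
        f"Couldn't open the summary screenshot at '{path}'. "
        "operate may not have reached 'DONE' for this objective, "
        "so no summary screenshot was saved."
    )


message = summary_screenshot_error()
if message is not None:
    print(message)
```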

@Daisuke134
Contributor

@michaelhhogue thank you. I changed my desktop background and the whole thing worked!

As for the error, it created both screenshot.png and screenshot_with_grid.png, but it probably could not figure out how to open GitHub.

I was just wondering whether anyone is working on cases where a Chrome window is already open. I am trying out different versions of VISION_PROMPT to fix this problem.

@michaelhhogue
Collaborator Author

@Daisuke134 It's hit-or-miss when I have Chrome already open. Sometimes it just goes to the URL bar and types in the next site, sometimes it re-launches Chrome. On Linux, it'll open a new Chrome window when you search for it again. Not sure how this behaves on other OSes.

@michaelhhogue michaelhhogue deleted the evaluator branch February 14, 2024 23:21