Create automated evaluator #98
Conversation
@michaelhhogue thanks for the PR. I'll try to look at it by tomorrow morning to provide input.
@michaelhhogue we've got some other priorities that've escalated at othersideai, so it will be a little while before I can review these PRs. Thanks for your patience.
@joshbickett No problem!
@michaelhhogue testing on this now. I am seeing the following error when trying to run standard
It appears this error is related to redefining something in the above scope called
@joshbickett Ah whoops! Thanks for catching that oversight.
@michaelhhogue no problem! Fixed now, still trying out
@michaelhhogue worked great. This is a great value-add PR to the project!
@joshbickett Glad to hear! It's not perfect, but it should be a good starting point to build off of in evaluating soc.
Sorry, I was trying to test this and it gave me this error. It seems like operate did not recognize --prompt. I ran operate --help and it gave me this. Could you tell me where the problem is?
@Daisuke134 thanks for the input. To confirm, you just ran
Yes.
Oh interesting. Could you try running `pip install .`?
Sorry!! I was not doing `pip install .`. It works now 🙇‍♀️
Ok, great!
@joshbickett Just wanted to throw in real quick: the test cases I included are sort of just placeholders so I could test it faster. It might be best to push a commit changing these to whatever you had in mind.
@michaelhhogue if you have time to create a PR to add some instructions on testing to the
Ok sounds good. I'll update the test cases with slight changes, but they are a good start.
Yeah I'll put one together real quick. |
Question: I ran "python3 evaluate.py" but it gave me a "Couldn't open the summary screenshot" error; when I changed the desktop background from my dog picture to a simple blue background, it worked. Also, if my Google Chrome tabs are open, it struggles to open a new tab in the existing window. Are there any PRs that address these issues?
@Daisuke134 It prints that message when
Edit: It would probably be useful to open a PR giving a more descriptive message for when there isn't a summary screenshot.
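Something along these lines could work as the friendlier check (a sketch only; the screenshot path and wording below are assumptions, not code from this PR):

```python
# Sketch of a more descriptive failure message when the summary screenshot
# is missing. The path is an assumed location, not the one operate actually uses.
import os
import sys

SUMMARY_SCREENSHOT = "screenshots/summary_screenshot.png"

if not os.path.exists(SUMMARY_SCREENSHOT):
    print(
        f"Couldn't open the summary screenshot at '{SUMMARY_SCREENSHOT}'. "
        "The operate run may have finished before a summary screenshot was taken, "
        "so this test case cannot be evaluated.",
        file=sys.stderr,
    )
```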
@michaelhhogue thank you. I changed my desktop background and the whole thing worked! As for the error, it created both screenshot.png and screenshot_with_grid.png, but it probably could not figure out how to open GitHub. I was just wondering if anyone was working on the case where a Chrome window is already open. I am trying out different VISION_PROMPT values to fix this problem.
@Daisuke134 It's hit-or-miss when I have Chrome already open. Sometimes it just goes to the URL bar and types in the next site; sometimes it re-launches Chrome. On Linux, it'll open a new Chrome window when you search for it again. Not sure how this behaves on other OSes.
This PR introduces a new program called `evaluate.py`. The purpose of `evaluate.py` is to automate testing soc on a set of common test cases.

How to use:
With the venv sourced and the project properly installed, simply run `python3 evaluate.py`.
Once the evaluation begins, `operate` is automatically called for each defined test case. Upon completion of each test case, `evaluate` will use GPT-4v to confirm if the objective was successfully reached for that test case based on the summary screenshot. For each test case, the console will print out either `[PASSED]` or `[FAILED]` based on GPT-4v's evaluation. A justification is also given from GPT-4v on why it passed or failed the test case.
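For anyone skimming, the overall flow is roughly the sketch below. This is not the code in `evaluate.py`; the screenshot path, test case, and model name are placeholder assumptions.

```python
# Minimal sketch of the evaluation loop described above, assuming an
# OpenAI-style client. Paths, prompts, and model name are placeholders.
import base64
import subprocess
from openai import OpenAI

TEST_CASES = {
    "Go to github.com in the browser": "A browser window with github.com loaded is visible.",
}
SUMMARY_SCREENSHOT = "screenshots/summary_screenshot.png"  # assumed location

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def evaluate_screenshot(guideline: str, path: str) -> str:
    """Ask GPT-4v whether the summary screenshot meets the success guideline."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Guideline: {guideline}\n"
                         "Based on the screenshot, answer [PASSED] or [FAILED] "
                         "and briefly justify the verdict."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content


for prompt, guideline in TEST_CASES.items():
    # Run the objective non-interactively through the new --prompt flag.
    subprocess.run(["operate", "--prompt", prompt], capture_output=True)
    print(prompt)
    print(evaluate_screenshot(guideline, SUMMARY_SCREENSHOT))
```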
Demo
Here is a demo of `evaluate` testing two objectives:

simplescreenrecorder-2023-12-09_12_57_10_AdobeExpress.mp4
--prompt
To make this work properly, a new command line argument was added to `operate`: `--prompt <prompt>`, which will inject the objective directly and begin executing tasks right away.
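Wiring up a flag like that typically looks something like the following; this is an illustrative sketch, not the exact change made to `operate`'s entry point.

```python
# Illustrative argparse wiring for a --prompt flag; names here are examples.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="operate")
    parser.add_argument(
        "--prompt",
        type=str,
        default=None,
        help="Inject the objective directly and begin executing tasks right away, "
             "skipping the interactive objective question.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    if args.prompt:
        objective = args.prompt  # non-interactive: objective supplied up front
    else:
        objective = input("What would you like the computer to do? ")
    print(f"Objective: {objective}")
```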
Note
The test cases can be changed at the top of `evaluate.py` in `TEST_CASES`. Please change these to whatever you want. This is a dictionary where the key is the prompt itself and the value is the guideline of success for that prompt.
The test cases I have in there currently were just for me to test the program faster.
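For reference, a `TEST_CASES` dictionary in that shape might look like the following; the entries here are illustrative, not the placeholders that ship with the PR.

```python
# prompt -> guideline of success that GPT-4v uses to judge the summary screenshot
TEST_CASES = {
    "Open a text editor and type 'hello world'":
        "A text editor window is open and shows the text 'hello world'.",
    "Go to github.com in the browser":
        "A browser window is visible with github.com loaded.",
}
```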
Future improvements
- Show the output of `operate`. Currently the output is silenced so that it doesn't clear the evaluation results from the terminal.
- A `--manual` mode which skips GPT-4v's evaluation.

If you have any questions, let me know!