
Help us build challenges! #3835

Closed
waynehamadi opened this issue May 5, 2023 · 29 comments

Comments

@waynehamadi
Contributor

waynehamadi commented May 5, 2023

Summary 💡

Challenges are tasks Auto-GPT is not able to achieve. Challenges will help us improve Auto-GPT.

We need your help to build these challenges and the ecosystem around them.

Here is a breakdown of the help we need.

A-Challenge Creation

1-Submit challenges within existing categories.

  • Memory
  • Information Retrieval
  • Research
  • Psychological
  • Debug Code
  • Adaptability
  • Website Navigation Challenge
  • Self Improvement (Solve challenges automatically)
  • Automated Challenge Creation
  • Basic Abilities

2-Design brand new challenge categories

  • Alignment
  • Obtain Knowledge
  • Focus (stay on task)
  • Planning (suggested by @Boostrix)
  • Others if you have ideas.
    DM me if interested (discord below)

Challenges Auto-GPT can already perform

We currently have what we call regression tests. You can imagine regression tests as challenges Auto-GPT is already able to perform.

UX around challenges

Improve logs/DEBUG folder to help people beat challenges more easily

The logs/DEBUG folder allows everyone to understand what Auto-GPT is doing at every cycle.
[Screenshot of the logs/DEBUG folder]
We need to:

"Fix Auto-GPT" challenges!

The vision of the "Fix Auto-GPT" challenges is to give the community the tools to report wrong behaviors and create challenges around them.
We need to:

Make it easy for people to change the prompt

Build the CI pipeline!

DM me on Discord if you have DevOps experience and want to help us build the pipeline that will allow people to submit their challenges and see how they perform!

Pytest Improvements

Challenges refactorization

CI pipeline improvements

Discord (merwanehamadi)
Join Auto-GPT's discord channel: https://discord.gg/autogpt

@Boostrix
Contributor

Boostrix commented May 5, 2023

The section titled "Others" should at least mention "planning", which is severely lacking at the moment. I bet all of us can come up with dozens of ai_settings files (objectives) where it fails to plan properly, where it fails to recognize dependencies between tasks, or where it fails to recognize that it previously succeeded at/completed a task and should proceed: #3593 (comment)

For starters, we probably need to have challenges that involve:

  • unconditional sequential tasks
  • conditional sequential tasks

(not getting into async/concurrency for now)

@waynehamadi
Contributor Author

Yeah, this is a great suggestion @Boostrix. Can I talk to you on Discord? My Discord is merwanehamadi.
https://discord.gg/autogpt

@anonhostpi

anonhostpi commented May 5, 2023

Boostrix prefers to stay off Discord, based on my prior interactions with them.

@waynehamadi
Contributor Author

@Boostrix can you think of a challenge we could build in order to test planning skills?

@waynehamadi
Contributor Author

@Androbin has suggested a very nice memory challenge involving files that are read in the wrong order. More details coming soon hopefully.

@Boostrix
Contributor

Boostrix commented May 6, 2023

can you think of a challenge we could build in order to test planning skills?

For starters, I would consider a plan to be a multi-objective task with multiple dimensions of leeway and constraints that the agent needs to explore.

So I suppose anything involving detangling dependencies (or lack thereof) should work to get this started. That would involve organizing steps but also coordinating in between steps. In #3593, @dschonholtz is juggling some nice ideas. And given the current state of things, it might actually be good to have working/non-working examples of plans for the agent, so that we can see which approach(es) look promising.

And in fact, GPT itself seems pretty good at coming up with potential candidates:

  • Grocery shopping: An AI agent could plan a grocery shopping trip by identifying the necessary items, determining the optimal route to take through the store to minimize time and energy, and considering any budget or dietary constraints.
  • Booking a flight: An AI agent could plan a flight booking by identifying the preferred travel dates, considering any budget or scheduling constraints, and selecting the best flight options based on price, duration, and other criteria.
  • Event planning: An AI agent could plan an event by identifying the necessary resources (such as venue, catering, and decorations), determining the optimal timing and scheduling, and considering any budget or other constraints.
  • Software development: An AI agent could plan a software development project by identifying the necessary features and functionality, breaking them down into smaller tasks and subtasks, and assigning them to team members based on their skills and availability.
  • Construction project management: An AI agent could plan a construction project by identifying the necessary resources (such as labor, materials, and equipment), coordinating with contractors and subcontractors, scheduling tasks and milestones, and monitoring progress to ensure that the project stays on track and within budget.

PS: I would suggest extending the logging feature so that we can optionally log resource utilization to CSV files for these challenges - that way, we can trivially plot the performance of different challenges over time (different versions of Auto-GPT) - and in conjunction with support for #3466 (constraint awareness), we could also log how many steps [thinking] were needed for each challenge/benchmark (plan/task).
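As a rough sketch of the kind of logging I have in mind (the file path and the column names such as tokens_used and api_cost_usd are made up for illustration - nothing here exists in Auto-GPT yet):

```python
# Hypothetical sketch of per-cycle resource logging to CSV; path and column names
# are assumptions, not existing Auto-GPT fields.
import csv
import time
from pathlib import Path

LOG_PATH = Path("logs/benchmark_metrics.csv")
FIELDS = ["timestamp", "challenge", "step", "tokens_used", "api_cost_usd"]


def log_cycle_metrics(challenge: str, step: int, tokens_used: int, api_cost_usd: float) -> None:
    """Append one row of resource usage for the current agent cycle."""
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    write_header = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "timestamp": time.time(),
            "challenge": challenge,
            "step": step,
            "tokens_used": tokens_used,
            "api_cost_usd": api_cost_usd,
        })
```

Plotting the per-challenge totals from that CSV over successive Auto-GPT versions would then be trivial.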

EDIT: To promote this effort a little more, I would suggest adding a few of these issues to the MOTD displayed when starting Auto-GPT, so that more users are encouraged to get involved - we could randomly alternate between a handful of relevant issues and invite folks to participate.

We currently have what we call regression tests. You can imagine regression tests as challenges Auto-GPT is already able to perform. We need your help to create more of these regression tests.

One of the most impressive examples posted around here is this one by @adam-paterson
#2775 (comment)

@waynehamadi
Contributor Author

@Boostrix great suggestions! How do you measure success in a very deterministic way for these items?
I like the logging resource utilization suggestion. I will create an issue.

The example of regression you gave (entire website) is great! I guess we could use Selenium to test the website. It looks like a great project.

@Boostrix
Contributor

Boostrix commented May 6, 2023

How do you measure success in a very deterministic way for these items?
Right, good question.

I was thinking of using a simple test case for starters, one where we ask the "planner" to coordinate a meeting between N different people (say 3-4) who are constrained by availability (date/time) - later on, N can be increased, with more constraints added, such as certain people never being available on the same date.
We can then use such a spec to generate a Python unit test using GPT that answers, for each participant, whether that participant is available at the requested day/time.

So, the input for the unit test will be date/time for now (for each participant).
The next step would be to ask Agent-GPT to plan the corresponding meeting by telling it about those constraints and specifying a potential time window (or list thereof).

To verify that the solution is valid, we merely need to execute the unit test for each participant, which will tell us if that participant is available - that way, we're reducing the problem to running tests.

Once we have that fleshed out/working for 3-4 participants, it would make sense to add more complexity to it by adding more options and constraints, including dependencies between these options and constraints.

We could then adapt this framework for other more complex examples (see above).
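To make the verification idea concrete, here is a minimal sketch of what such a generated unit test could look like - the availability table and the proposed slot are invented purely for illustration; in the real challenge the slot would be read from the agent's output:

```python
# Minimal sketch of the meeting-scheduling check described above; all data is made up.
import unittest
from datetime import datetime

AVAILABILITY = {
    "alice": {"2023-05-10 14:00", "2023-05-11 10:00"},
    "bob":   {"2023-05-10 14:00", "2023-05-12 09:00"},
    "carol": {"2023-05-10 14:00", "2023-05-11 10:00"},
}

PROPOSED_SLOT = "2023-05-10 14:00"  # would come from the agent's plan


class TestMeetingPlan(unittest.TestCase):
    def test_slot_is_well_formed(self):
        # Fails if the agent did not produce a parseable date/time.
        datetime.strptime(PROPOSED_SLOT, "%Y-%m-%d %H:%M")

    def test_every_participant_is_available(self):
        for name, slots in AVAILABILITY.items():
            self.assertIn(PROPOSED_SLOT, slots, f"{name} is not free at {PROPOSED_SLOT}")


if __name__ == "__main__":
    unittest.main()
```

Adding more participants or constraints then only means extending the availability table and, if needed, adding more test methods.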

I like the logging resource utilization suggestion. I will create an issue.

For API costs/token (budget) we have several open PRs, number of steps taken is part of at least 2 PRs that I am aware of.
The rest is probably in the realm of #3466

The example of regression you gave (entire website) is great! I guess we could use selenium to test the website. It looks like a great project.

I didn't even think about using Selenium, I was thinking of treating the final HTML like any XHTML/XML document and querying it from Python to see if it's got the relevant tags and attributes requested by the specs. Personally, I find the whole Selenium stuff mainly useful for highly dynamic websites; static HTML can probably be queried "as is" from Python - no need for Selenium?
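Something along these lines, using only the standard library - the file name and the required tag set are placeholders for whatever the spec actually asks for:

```python
# Sketch of checking a generated static page without Selenium; file name and
# required tags are placeholders.
from html.parser import HTMLParser
from pathlib import Path


class TagCollector(HTMLParser):
    """Collects every start tag seen while parsing the document."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: set[str] = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def page_has_required_tags(path: str, required: set[str]) -> bool:
    parser = TagCollector()
    parser.feed(Path(path).read_text(encoding="utf-8"))
    return required.issubset(parser.tags)


if __name__ == "__main__":
    # e.g. assert the page the agent produced contains the structure the spec asked for
    assert page_has_required_tags("index.html", {"html", "head", "title", "body"})
```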

That being said, we could also use @adam-paterson's example and add an outer agent to mutate his ai_settings / project.txt file to tinker with 3-5 variations of each step (file names, technology stack, functionality, etc.) - that way, you end up with a ton of regression tests at the mere cost of a few nested loops. The "deliverables" portion of his specs is so succinct that we could probably use it "as is" to create a corresponding Python unit test via Auto-GPT (I have used this for several toy projects, and it's working nicely).

@Boostrix
Contributor

Boostrix commented May 8, 2023

Information Retrieval Challenge

FWIW, I can get it to bail out rather quickly by letting an outer agent mutate the following directive:

list of X open source software packages in the area of {audio, video, office, games, flight simulation}, [released under the {GPL, BSD, ...} license], [working on {Windows, Mac OSX, Linux}]

Which is pretty interesting once you think about it, since this is the sort of stuff that LLMs like GPT are generally good at - and in fact, GPT can answer any of these easily, just not in combination with Auto-GPT. It seems like some sort of "intra-llm-regression" due to the interaction with the AI agent mechanism.
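For reference, the "outer agent" mutating that directive can be as dumb as a nested loop over the template - a rough sketch, where the lists simply mirror the placeholders above and the phrasing itself is arbitrary:

```python
# Rough sketch of an "outer agent" that enumerates variations of the directive above.
from itertools import product

AREAS = ["audio", "video", "office", "games", "flight simulation"]
LICENSES = ["GPL", "BSD"]
PLATFORMS = ["Windows", "Mac OSX", "Linux"]


def directives(count: int = 5):
    """Yield one information-retrieval directive per combination of placeholders."""
    for area, license_name, platform in product(AREAS, LICENSES, PLATFORMS):
        yield (
            f"List {count} open source software packages in the area of {area}, "
            f"released under the {license_name} license, working on {platform}."
        )


for directive in directives():
    print(directive)  # each line becomes one challenge variant / ai_settings objective
```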

@Androbin
Contributor

Androbin commented May 9, 2023

@Boostrix I would agree that our current prompting artificially limits GPT-4's abilities. The issue I see is that we actively discourage long-form chain-of-thought reasoning.

@Boostrix
Contributor

Boostrix commented May 9, 2023

The idea of using dynamic prompting sounds rather promising:

@Boostrix
Contributor

Boostrix commented May 9, 2023

Here's another good description by @hashratez of an intra-llm-regression that should be suitable to benchmark the system against GPT itself:

#1647

I then tried the simplest task: create an index.html file with the word "hello". In direct ChatGPT it would output this in 1 second. This took over 10 steps as it writes the code, checks it then does something else, etc.

@Boostrix
Contributor

Boostrix commented May 11, 2023

We should update the list of potential challenges to add a new category for "experiential tests" - i.e. tests where the agent should be able to apply its experience of previously completing a task, but fails to do so.

The most straightforward example I can think of is it editing/updating a file successfully and, 5 minutes later, wanting to use interactive editors like nano, vim, etc. - that's a recurring topic here.

So we should add a benchmark to see how good an agent is at applying experience to future tasks.

A simple test case would be telling it to update a file with some hand-holding, and afterwards leaving out the hand-holding and counting how many times it succeeds or not (e.g. 3/10 attempts).

Being able to apply past experience is an essential part of learning.
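A possible harness for measuring that success rate could look roughly like this - run_agent_on_task is a hypothetical stand-in for however a challenge actually drives Auto-GPT, it does not exist:

```python
# Hypothetical measurement harness for the "experiential" category.
def run_agent_on_task(task: str, with_hand_holding: bool) -> bool:
    """Return True if the agent updated the file without reaching for an
    interactive editor (nano, vim, ...). Placeholder only."""
    raise NotImplementedError


def unguided_success_rate(task: str, attempts: int = 10) -> float:
    """One guided warm-up run, then count how often the unguided runs succeed."""
    run_agent_on_task(task, with_hand_holding=True)
    successes = sum(run_agent_on_task(task, with_hand_holding=False) for _ in range(attempts))
    return successes / attempts
```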

Here's another suggestion for a new category: "Tool use".
The LLM has access to ~20 built-in functions (BIFs) - it should be able to use this access to make up for missing functionality.
For instance, with browse_website disabled, it should still be able to figure out internet access using execute_shell or execute_python. The number of steps it needs to do so is a measure of its capability to use tools.

Likewise, after disabling download_file, it should be able to figure out how to use Python or the shell to download stuff.

There are often several ways to "skin a cat" - when disabling git operations or the shell, the agent must be capable of figuring out alternatives; the number of steps it needs to do so tells us just how effective the agent is.

From a pytest standpoint, we would ideally be able to disable some BIFs and then run a test to see how many steps the agent needs to complete it - if, over time, that number increases, the agent is performing worse.

We can also ask the agent to perform tasks for which it does not have any BIFs at all, such as doing mathematical calculations #3412 and then count the number of steps it needs to come up with a solution using different options (python or shell).
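Sketching what such a pytest could look like - run_challenge and its result object are hypothetical helpers, not existing fixtures:

```python
# Pytest-style sketch of the "tool use" benchmark; helpers are hypothetical.
from dataclasses import dataclass

import pytest


@dataclass
class ChallengeResult:
    success: bool
    steps: int


def run_challenge(goal: str, disabled_commands: list[str]) -> ChallengeResult:
    """Stand-in for whatever harness actually drives Auto-GPT with some BIFs disabled."""
    raise NotImplementedError


MAX_STEPS = 25  # regression threshold: needing more steps over time means the agent got worse


@pytest.mark.parametrize("disabled", [["browse_website"], ["download_file"]])
def test_agent_finds_alternative_tool(disabled):
    result = run_challenge(
        goal="Fetch https://example.com/data.txt and save it to the workspace",
        disabled_commands=disabled,
    )
    assert result.success
    assert result.steps <= MAX_STEPS
```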

@waynehamadi
Contributor Author

CI pipeline is 4 times faster thanks to @AndresCdo and parallelized tests!

@Boostrix
Contributor

Suggestion for a new challenge type:
In light of the latest google/DDG dilemma (#4120), we need to harden our commands/agent accordingly (#4157):

  • work out a way to randomly make commands malfunction/disabled via pytest (using a timer or a rand function at the command mgr/registry level)
  • let the agent figure out that a command is malfunctioning, so that it comes up with an alternative

This may involve tinkering with different variable/argument substitutions: #3904 (comment)

Basically, we need to keep track of commands that previously worked/failed and types of arguments/params that are known to work - including an optional timeout to retry a command once in a while.

Also, commands really should get access to the error/exception string - because at that point, the LLM can actually help us!
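A rough sketch of the random-malfunction idea at the registry level - this FlakyRegistry is a simplified stand-in for illustration, not the real Auto-GPT command registry class:

```python
# Sketch of randomly failing commands so the agent has to notice and work around them.
import random


class FlakyRegistry:
    """Wraps a plain name->callable command table and makes each call fail with some
    probability, so the agent has to detect the failure and pick an alternative."""

    def __init__(self, commands: dict, failure_rate: float = 0.3, seed: int = 0):
        self.commands = commands
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so a challenge run is reproducible

    def call(self, name: str, *args, **kwargs) -> str:
        if self.rng.random() < self.failure_rate:
            # Surface the error text instead of hiding it - that way the LLM can react to it.
            return f"Error: command '{name}' is temporarily unavailable"
        return str(self.commands[name](*args, **kwargs))
```

A pytest could then wrap a challenge run with such a registry and assert that the agent still completes the goal within some step budget.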

@waynehamadi
Contributor Author

waynehamadi commented May 13, 2023

@Boostrix yeah, that sounds useful. We just have to be careful about inserting plausible mistakes.

@waynehamadi
Contributor Author

waynehamadi commented May 13, 2023

@Boostrix could you create an issue for the 2 challenges you mentioned above and label them "challenge"?

@Boostrix
Contributor

Which ones exactly?

Coming up with a challenge where we mutate a URL should be a pretty straightforward way to make the agent fail in a way it can still fix - using URL validation/patching. The most obvious example would be adding whitespace into the URL without escaping it.

@waynehamadi
Contributor Author

All the ones you suggested - we don't have a next step for them yet. Can I just have the links to the issues where the challenge ideas are written down, so I can put them in this epic?

@waynehamadi
Contributor Author

waynehamadi commented May 14, 2023

New CI pipeline ready: now you can test challenges by creating a Pull Request.
Currently working on this so you can SEE the results of the challenges:
#4190

@waynehamadi
Contributor Author

thanks @AndresCdo for #3868

@waynehamadi
Contributor Author

thank you @PortlandKyGuy for your work! #4261

@waynehamadi
Contributor Author

thank you @erik-megarad #4469

@waynehamadi
Contributor Author

thank you @dschonholtz @gravelBridge #4286

@waynehamadi
Contributor Author

thank you @javableu !! #4167

@github-actions
Contributor

github-actions bot commented Sep 6, 2023

This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.

@github-actions github-actions bot added the Stale label Sep 6, 2023
@github-actions
Contributor

This issue was closed automatically because it has been stale for 10 days with no activity.

@github-actions github-actions bot closed this as not planned Sep 17, 2023
@Boostrix
Contributor

Boostrix commented Oct 4, 2023

Unless I am mistaken, this should not be closed or "staled" at all. I believe this remains relevant - or has something changed over the course of the last couple of months that I am missing entirely?
