Help us build challenges! #3835

waynehamadi · 2023-05-05T12:59:32Z

Summary 💡

Challenges are tasks Auto-GPT is not able to achieve. Challenges will help us improve Auto-GPT

We need your help to build these challenges and the ecosystem around them

Here is a breakdown of the help we need.

A-Challenge Creation

1-Submit challenges within existing categories.

Memory

~~[Challenge Creation] Memory Challenge C #3838 => @dschonholtz~~
Memory challenge D => @Androbin

Information Retrieval

~~[Challenge Creation] Information Retrieval Challenge B #3837 => @PortlandKyGuy~~
Information Retrieval Challenge C => @Androbin

Research

Research Challenge A => @Androbin

Psychological

~~Psychological challenge: Sally and Anne's Test. #3871 => unassigned~~

Debug Code

2-Design brand new challenge categories

Alignment
Obtain Knowledge
Focus (stay on task)
Planning (suggested by @Boostrix)
Others if you have ideas.
DM me if interested (discord below)

Challenges Auto-GPT can already perform

We currently have what we call regression tests. You can imagine regression tests as challenges Auto-GPT is already able to perform.

Regression Test: Auto-GPT should always be able to build this simple contact form #3901 => unassigned

UX experience around challenges

[Challenge Experience] Better wiki #3839 => unassigned

Improve logs/DEBUG folder to help people beat challenges more easily

The logs/DEBUG folder allows everyone to understand what Auto-GPT is doing at every cycle

We need to:

~~Log Self Feedback in logs/Debug folder #3842 => @AndresCdo done thank you !~~
~~Log User input in logs/Debug folder #3843 => @AndresCdo done thank you !~~
Log ai_settings.yaml in logs/DEBUG/[TIMESTAMP_AGENT_NAME]/001/ 0_ai_settings.yaml #3960 => log ai_settings.yaml
Log memory stored within Auto-GPT #3844 => BLOCKED =>waiting for @Pwuts memory rework to assign
Log llm interaction in text.py #3957 => BLOCKED => waiting for rearch

"Fix Auto-GPT" challenges !

The vision of the fix auto-gpt challenges is to give the tools for the community to report wrong behaviors and create challenges around them.
We need to:

Make it easy for people to change the prompt

Centralize all of our prompts in one place so people can change it easily #3858 => @Eggrolling-hu
Prompt Profiles #3954 => idea brought up by @Boostrix

Build the CI pipeline !

DM me on discord if you have devops experience and want to help us build the pipeline that will allow people to submit their challenges and see how they perform!

Our CI pipeline should trigger the benchmark pipeline #3966 => unassigned

Pytest Imrovements

~~Run Pytest Parallel mode to increase speed #3863 => @AndresCdo~~
Use Trio in pytest to have async tests #3900 => unassigned
Our tests should run in a certain order to save resources #3906 => unassigned
When running autogpt in pytest, we shouldn't wait for the logs to be written #4199 => unassigned
Browse website should use VCR #4496 => unassigned

Challenges refactorization

Challenges should be using the automatic AI Config generation #3883 => unassigned
Our Challenges should use the default list of commands #3907 => unassigned
~~Challenges should not fail if X number of seconds, but only if X number of cycles is done #4161 => @merwanehamadi~~
~~Generating cassettes locally is not necessary #4189 => unassigned~~
How do we know AUTOMATICALLY if a prompt improved Auto-GPT ? #4190 => unassigned

CI pipeline improvements

~~Find a solution to make isort and black NOT slow down challenge contribution #4163 => canceled~~
~~We shouldn't have to merge 2 PRs when a cassettes changes #4196 => @merwanehamadi~~
-~~allow ci pipeline to make calls to api providers => @merwanehamadi~~
~~VCR shouldn't record 500, 408, 429, when running challenges #4461 => @erik-megarad~~
~~CI pipeline cache dependencies #4258 => @merwanehamadi~~
-Run the test suite every 4 hours on master, without using VCR #4590
#####################
(issue to create:
create a test that makes sure the challenges have a corresponding doc
create a test maker, i.e a command that creates a test automatically
)

Discord (merwanehamadi)
Join Auto-GPT's discord channel: https://discord.gg/autogpt

Boostrix · 2023-05-05T19:02:36Z

The section titled "others": should at least mention "planning", which is severely lacking at the moment. I bet all of us can come up with dozens of ai_settings files (objectives) where it fails to plan properly, or where it fails to recognize dependencies in between tasks, but also recognizing that it previously succeeded/completed a task and should proceed: #3593 (comment)

For starters, we probably need to have challenges that involve:

unconditional sequential tasks
conditional sequential tasks

(not getting into async/concurrency for now)

waynehamadi · 2023-05-05T21:02:26Z

Yeah this is a great suggestion @Boostrix can I talk to you in discord ? my discord is merwanehamadi.
https://discord.gg/autogpt

anonhostpi · 2023-05-05T21:33:16Z

Boostrix has preference for staying off of Discord based on my prior interactions with them

waynehamadi · 2023-05-06T03:40:23Z

@Boostrix can you think of a challenge we could build in order to test planning skills ?

waynehamadi · 2023-05-06T04:44:42Z

@Androbin has suggested a very nice memory challenge involving files that are read in the wrong order. More details coming soon hopefully.

Boostrix · 2023-05-06T05:48:40Z

can you think of a challenge we could build in order to test planning skills ?

For starters, I would consider a plan to be multi-objective task with multiple dimensions of leeway and constraints that the agent needs to explore.

So I suppose anything involving detangling dependencies (or lack thereof) should work to get this started. That would involve organizing steps but also coordinating in between steps. In #3593, @dschonholtz is juggling some nice ideas. And given the current state of things, it might actually be good to have working/non-working examples of plans for the agent, so that we can see which approach(es) look promising.

And in fact, GPT itself seems pretty good at coming up with potential candidates:

Grocery shopping: An AI agent could plan a grocery shopping trip by identifying the necessary items, determining the optimal route to take through the store to minimize time and energy, and considering any budget or dietary constraints.
Booking a flight: An AI agent could plan a flight booking by identifying the preferred travel dates, considering any budget or scheduling constraints, and selecting the best flight options based on price, duration, and other criteria.
Event planning: An AI agent could plan an event by identifying the necessary resources (such as venue, catering, and decorations), determining the optimal timing and scheduling, and considering any budget or other constraints.
Software development: An AI agent could plan a software development project by identifying the necessary features and functionality, breaking them down into smaller tasks and subtasks, and assigning them to team members based on their skills and availability.
Construction project management: An AI agent could plan a construction project by identifying the necessary resources (such as labor, materials, and equipment), coordinating with contractors and subcontractors, scheduling tasks and milestones, and monitoring progress to ensure that the project stays on track and within budget.

PS: I would suggest to extend the logging feature so that we can optionally log resource utilization to CSV files for these challenges - that way, we can trivially plot the performance of different challenges over time (different versions of Auto-GPT) - and in conjunction with suppport for #3466 (constraint awareness), we could also log how many steps [thinking] were needed for each challenge/benchmark (plan/task)

EDIT: To promote this effort a little more, I would suggest to add a few of these issues to the MOTD displayed when starting Auto-GPT, so that more users are encouraged to get involved in participating - we could randomly alternate between a handful of relevant issues and invite folks to particpate.

We currently have what we call regression tests. You can imagine regression tests as challenges Auto-GPT is already able to perform. We need your help to create more of these regression tests.

One of the most impressive examples posted around here is this one by @adam-paterson
#2775 (comment)

waynehamadi · 2023-05-06T12:54:33Z

@Boostrix great suggestions! How do you measure success in a very deterministic way for these items ?
I like the logging resource utilization suggestion. I will create an issue.

The example of regression you gave (entire website) is great! I guess we could use selenium to test the website. It looks like a great project.

Boostrix · 2023-05-06T13:26:23Z

How do you measure success in a very deterministic way for these items ?
right, good question.

I was thinking to use a simple test case for starters, one where we ask the "planner" to coordinate a meeting between N different people (say 3-4) who are constrained by availability (date/time) - later on, N can be increased, with more constraints added, such as certain people never being available on the same date.
We can then use such a spec to generate a Pyton unit test using GPT to answer for each participant if that participant is available at the requested day/time.

So, the input for the unit test will be date/time for now (for each participant).
The next step would be to ask Agent-GPT to plan the corresponding meeting by telling it about those constraints and specifying a potential time window (or list thereof).

To verify if the the solution is valid, we merely need to execute the unit test for each participant which will tell us if that participant is available - that way, we're reducing the problem to running tests.

Once we have that fleshed out/working for 3-4 participants, it would make sense to add more complexity to it by adding more options and constraints, including dependencies between these options and constraints.

We could then adapt this framework for other more complex examples (see above).

I like the logging resource utilization suggestion. I will create an issue.

For API costs/token (budget) we have several open PRs, number of steps taken is part of at least 2 PRs that I am aware of.
The rest is probably in the realm of #3466

The example of regression you gave (entire website) is great! I guess we could use selenium to test the website. It looks like a great project.

I didn't even think about using Selenium, I was thinking of treating the final HTML like any XHTML/XML document and query it from Python to see if it's got the relevant tags and attributes as requested by the specs. Personally, I find the whole Selenium stuff mainly useful for highly dynamic websites, static HTML can probably be queried "as is" from Python - no need for Selenium ?

That being said, we could also use @adam-patersons example and add an outer agent to mutate his ai_settings / project.txt file to tinker with 3-5 variations of each step (file names, technology stack, functionality etc) - that way, you end up with a ton of regression tests at the mere cost of a few nested loops.The "deliverables" portion of his specs is such succinct that we could probably use it "as is" to create a corresponding python unit test via Auto-GPT (I have used this for several toy projects, and it's working nicely).

Boostrix · 2023-05-08T13:02:40Z

Information Retrieval Challenge

FWIW, I can get it to bail out rather quickly by letting an outer agent mutate the following directive:

list of X open source software packages in the area of {audio, video, office, games, flight simulation}, [released under the {GPL, BSD ...], [working on {Windows, Mac OSX, Linux}]

Which is pretty interesting once you think about it, since this is the sort of stuff that LLM's like GPT are generally good at - and in fact, GPT can answer any of these easily, just not in combination with Auto-GPT, it seems like some sort of "intra-llm-regression" due to the interaction with the AI agent mechanism,

Androbin · 2023-05-09T01:09:14Z

@Boostrix I would agree that our current prompting artificially limits GPT-4 abilities. The issue I see is that we actively discourage long-form chain-of-thought reasoning.

Boostrix · 2023-05-09T07:03:50Z

The idea to use dynamic prompting sounds rather promising:

Adaptability Challenge : Dynamic prompting #3937

Boostrix · 2023-05-09T21:22:32Z

Here's another good description by @hashratez of an intra-llm-regression that should be suitable to benchmark the system against GPT itself:

#1647

I then tried the simplest task: create an index.html file with the word "hello". In direct ChatGPT it would output this in 1 second. This took over 10 steps as it writes the code, checks it then does something else, etc.

Boostrix · 2023-05-11T08:06:16Z

we should update the list of potential challenges to add a new category for "experiential tests" - i.e. tests where the agent should be able to apply its experience of previously completing a task, but fails to do so.

The most straightforward example I can think of is it editing/updating a file successfully and 5 minutes later it wants to use interactive editors like nano, vim etc - that's a recurring topic here.

So we should add a benchmark to see how good an agent is at applying experience to future tasks.

A simple test case would be telling it to update a file with some hand holding, and afterwards leaving out the hand holding and counting how many times it succeeds or not (3/10 attempts).

Being able to apply past experience is an essential part of learning.

Here's another suggestion for a new category: "Tool use".
The LLM has access to ~20 built-in functions (BIFs) - it should be able to use this access to make up for missing functionality.
For instance: disabling browse_website, it should still be able to figure out internet access using execute_shell or execute_python. The number of steps it needs to do so, is a measure of its capability to use tools.

Likewise, after disabling download_file it should be able figure out using python or the shell to download stuff.

There are often several ways to "skin a cat" - when disabling git operations or the shell, the agent must be capable of figuring out alternatives, the number of steps it needs to do so tells us just how effective the agent is.

From a pytest standpoint, we would ideally be able to disable some BIFs and then run a test to see how many steps the test needs to complete - if, over time, that number increases, the agent is performing worse.

We can also ask the agent to perform tasks for which it does not have any BIFs at all, such as doing mathematical calculations #3412 and then count the number of steps it needs to come up with a solution using different options (python or shell).

waynehamadi · 2023-05-12T01:16:16Z

CI pipeline 4 times faster thanks to @AndresCdo and parallelized tests !

Boostrix · 2023-05-13T08:11:52Z

Suggestion for new challenge type:
In the light of the latest google/DDG dilemma (#4120), we need to harden our commands/agent accordingly (#4157):

work out a way to randomly make commands malfunction/disabled via pytest (using a timer or a rand function at the command mgr/registry level)
let the agent figure out that a command is malfunctioning, so that it comes up with an alternative

This may involve tinkering with different variable/argument substitutions: #3904 (comment)

Basically, we need to keep track of commands that previously worked/failed and types of arguments/params that are known to work - including some optional timeout option to retry a command once in a while.

Also, commands really should get access to the error/exception string - because at that point, the LLM can actually help us!

waynehamadi · 2023-05-13T14:23:05Z

a new issue was added

Challenges should not fail if X number of seconds, but only if X number of cycles is done #4161

waynehamadi · 2023-05-13T14:33:12Z

@Boostrix yeah that sounds useful. We just have to be careful about inserting plausible mistakes

waynehamadi · 2023-05-13T14:42:51Z

@Boostrix could you create an issue for the 2 challenges you mentioned above and label it "challenge" ?

Boostrix · 2023-05-14T07:31:02Z

Which ones exactly ?

Coming up with a challenge where we mutate a URL should be a pretty straightforward way for the agent tofail, which it still can fix - using URL validation/patching. Most obvious example would be adding whitespaces into the URL without escaping.

waynehamadi · 2023-05-14T12:52:52Z

all the ones you suggest, we don't have next step for them, can I just have the link of the issues where the challenge idea is written so I put them in this epic ?

waynehamadi · 2023-05-14T14:40:24Z

New CI pipeline ready: now you can test challenges by creating a Pull Request.
currently working on this so you can SEE the results of the challenges:
#4190

waynehamadi · 2023-05-17T11:39:00Z

thanks @AndresCdo for #3868

waynehamadi · 2023-05-31T17:58:17Z

thank you @PortlandKyGuy for your work! #4261

waynehamadi · 2023-05-31T18:04:00Z

thank you @erik-megarad #4469

waynehamadi · 2023-05-31T18:22:18Z

thank you @dschonholtz @gravelBridge #4286

waynehamadi · 2023-06-09T22:03:28Z

thank you @javableu !! #4167

github-actions · 2023-09-06T20:59:15Z

This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.

github-actions · 2023-09-17T01:49:52Z

This issue was closed automatically because it has been stale for 10 days with no activity.

Boostrix · 2023-10-04T17:54:31Z

unless I am mistaken, this should not be closed or "staled" at all, I believe this remains relevant or has something changed over the course of the last couple of months that I missing entirely ?

Androbin mentioned this issue May 6, 2023

Add Challenge: PTE Reorder Paragraphs #3887

Open

5 tasks

This was referenced May 6, 2023

Adding support for other programming languages #2775

Closed

[DRAFT] Wizard mode (placeholder for #3820 ) #3911

Draft

Boostrix mentioned this issue May 11, 2023

Introducing Looping Heuristics / Detection #3668

Open

1 task

waynehamadi added challenge and removed challenge labels May 12, 2023

Boostrix mentioned this issue May 16, 2023

This model's maximum context length is 8191 tokens, however you requested 21485 tokens (21485 in your prompt; 0 for the completion). Please reduce your prompt; or completion length. #4233

Closed

github-actions bot added the Stale label Sep 6, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Help us build challenges! #3835

Help us build challenges! #3835

waynehamadi commented May 5, 2023 •

edited

Boostrix commented May 5, 2023 •

edited

waynehamadi commented May 5, 2023

anonhostpi commented May 5, 2023 •

edited

waynehamadi commented May 6, 2023

waynehamadi commented May 6, 2023

Boostrix commented May 6, 2023 •

edited

waynehamadi commented May 6, 2023

Boostrix commented May 6, 2023 •

edited

Boostrix commented May 8, 2023 •

edited

Androbin commented May 9, 2023

Boostrix commented May 9, 2023

Boostrix commented May 9, 2023

Boostrix commented May 11, 2023 •

edited

waynehamadi commented May 12, 2023

Boostrix commented May 13, 2023

waynehamadi commented May 13, 2023 •

edited

waynehamadi commented May 13, 2023 •

edited

waynehamadi commented May 13, 2023 •

edited

Boostrix commented May 14, 2023

waynehamadi commented May 14, 2023

waynehamadi commented May 14, 2023 •

edited

waynehamadi commented May 17, 2023

waynehamadi commented May 31, 2023

waynehamadi commented May 31, 2023

waynehamadi commented May 31, 2023

waynehamadi commented Jun 9, 2023

github-actions bot commented Sep 6, 2023

github-actions bot commented Sep 17, 2023

Boostrix commented Oct 4, 2023

Help us build challenges! #3835

Help us build challenges! #3835

Comments

waynehamadi commented May 5, 2023 • edited

Summary 💡

A-Challenge Creation

1-Submit challenges within existing categories.

Memory

Information Retrieval

Research

Psychological

Debug Code

Adaptability

Website Navigation Challenge

Self Improvement (Solve challenges automatically)

Automated Challenge Creation

Basic Abilities

2-Design brand new challenge categories

Challenges Auto-GPT can already perform

UX experience around challenges

Improve logs/DEBUG folder to help people beat challenges more easily

"Fix Auto-GPT" challenges !

Make it easy for people to change the prompt

Build the CI pipeline !

Pytest Imrovements

Challenges refactorization

CI pipeline improvements

Boostrix commented May 5, 2023 • edited

waynehamadi commented May 5, 2023

anonhostpi commented May 5, 2023 • edited

waynehamadi commented May 6, 2023

waynehamadi commented May 6, 2023

Boostrix commented May 6, 2023 • edited

waynehamadi commented May 6, 2023

Boostrix commented May 6, 2023 • edited

Boostrix commented May 8, 2023 • edited

Androbin commented May 9, 2023

Boostrix commented May 9, 2023

Boostrix commented May 9, 2023

Boostrix commented May 11, 2023 • edited

waynehamadi commented May 12, 2023

Boostrix commented May 13, 2023

waynehamadi commented May 13, 2023 • edited

waynehamadi commented May 13, 2023 • edited

waynehamadi commented May 13, 2023 • edited

Boostrix commented May 14, 2023

waynehamadi commented May 14, 2023

waynehamadi commented May 14, 2023 • edited

waynehamadi commented May 17, 2023

waynehamadi commented May 31, 2023

waynehamadi commented May 31, 2023

waynehamadi commented May 31, 2023

waynehamadi commented Jun 9, 2023

github-actions bot commented Sep 6, 2023

github-actions bot commented Sep 17, 2023

Boostrix commented Oct 4, 2023

waynehamadi commented May 5, 2023 •

edited

Boostrix commented May 5, 2023 •

edited

anonhostpi commented May 5, 2023 •

edited

Boostrix commented May 6, 2023 •

edited

Boostrix commented May 6, 2023 •

edited

Boostrix commented May 8, 2023 •

edited

Boostrix commented May 11, 2023 •

edited

waynehamadi commented May 13, 2023 •

edited

waynehamadi commented May 13, 2023 •

edited

waynehamadi commented May 13, 2023 •

edited

waynehamadi commented May 14, 2023 •

edited