False believes challenge based on sally anne test. #4167

Merged

waynehamadi merged 10 commits into Significant-Gravitas:master from javableu:master

Jun 9, 2023

Contributor

javableu commented May 13, 2023

Background

Fixes issue #3871

Changes

Added a new community challenge.

Documentation

Added documentation in docs/challenges/memory/challenge_d.md.

Test Plan

PR Quality Checklist

My pull request is atomic and focuses on a single change.
I have thoroughly tested my changes with multiple different prompts.
I have considered potential risks and mitigations for my changes.
I have documented my changes clearly and comprehensively.
I have not snuck in any "extra" small tweaks or changes.

vercel bot commented May 13, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jun 9, 2023 9:59pm

github-actions bot commented May 13, 2023

This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size

github-actions bot added the size/xl label

vercel bot temporarily deployed to Preview

May 13, 2023 22:31

Inactive

ntindle assigned waynehamadi

vercel bot commented May 14, 2023

Deployment failed with the following error:

Resource is limited - try again in 28 minutes (more than 100, code: "api-deployments-free-per-day").

k-boikov added the challenge label

Contributor

waynehamadi commented Jun 3, 2023

@javableu nice, thanks! Don't hesitate to ping me next time. I will test it when I have time.

Contributor

waynehamadi commented Jun 9, 2023

@javableu on it

javableu added 5 commits

June 9, 2023 12:17


          False believes challenge based on sally anne test.

1a079d8


          Update test_memory_challenge_d.py

706711f


          Update challenge_d.md

30562b2

Some text appearing in bold


          Update test_memory_challenge_d.py

831b87f


          Update test_memory_challenge_d.py

fc288bf

javableu added 4 commits

June 9, 2023 12:17


          Update test_memory_challenge_d.py

e3a5cda


          Update test_memory_challenge_d.py

9d6d748

black  test_memory_challenge_d.py


          Update test_memory_challenge_d.py

e69bf1a

Replaced the level-dependent dynamic time with a fixed time


          Update test_memory_challenge_d.py

f1ee81e

isort command for the libraries

Contributor

waynehamadi commented Jun 9, 2023

@javableu this is a challenge that requires incredible attention to detail and minutiae. It tests the LLM a lot more than AutoGPT, but I do believe that at some point autonomous agents will need to develop the skills required to beat this challenge.

Thank you very much.

github-actions bot added size/l and removed size/xl labels

vercel bot temporarily deployed to Preview

June 9, 2023 21:58

Inactive


          Refactored memory challenge a

d4bb563

vercel bot temporarily deployed to Preview

June 9, 2023 21:59

Inactive

Auto-GPT-Bot added the behaviour change label

Contributor

Auto-GPT-Bot commented Jun 9, 2023

You changed AutoGPT's behaviour. The cassettes have been updated and will be merged to the submodule when this Pull Request gets merged.

waynehamadi approved these changes

waynehamadi merged commit 474a9c4 into Significant-Gravitas:master

8 checks passed

waynehamadi mentioned this pull request

Help us build challenges! #3835

Closed

Contributor Author

javableu commented Jun 12, 2023

Thanks @merwanehamadi for accepting this challenge. Even if the challenge as it stands might be a bit too complicated, I think we could reuse it as a sort of "Swiss Army knife" challenge: one story, many challenges that can be solved. For example, we could track the placement of the marbles, the chronology of events, the discussions the characters have had, what each individual has seen, their beliefs about others' beliefs, and so on.

I asked the chat to come up with some example questions/challenges. I believe most of them are interesting to us. Can you tell me which 2 or 3 challenges from the list below I should focus on?

Fact-checking and tracking: Track the true location of each marble at the end of the story. The challenge involves disregarding the misinformation given by the characters.

Timeline Reconstruction: Create a timeline of events, including when and where each marble was moved, and by whom.

Discrepancy Identification: Identify each instance where a character gives false information about the location of a marble.

Behavior Pattern Analysis: Analyze the behavior patterns of the characters. For example, does a certain character consistently lie about the location of the marbles?

Prediction: Predict the next series of events or actions based on the patterns established in the story.

Causality Analysis: Determine the cause and effect relationships in the story. How does each action lead to the next?

Lie Detection: Determine which character is most truthful and which is least based on their statements and actions.

Narrative Generation: Generate a continuation of the story based on the established patterns of character behavior.
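To make the first few ideas concrete (fact tracking, timeline reconstruction, and spotting false beliefs), here is a minimal Python sketch. It is only an illustration under stated assumptions: the `World` class, its method names, and the tiny scenario are hypothetical and use the classic Sally-Anne setup rather than the actual story or test code from this PR. The assumed rule is that a character updates their belief about a marble only if they are in the room when it is moved.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """True marble locations plus each character's private beliefs (hypothetical model)."""

    present: set[str]  # characters currently in the room
    truth: dict[str, str] = field(default_factory=dict)
    beliefs: dict[str, dict[str, str]] = field(default_factory=dict)
    timeline: list[tuple[int, str, str, str]] = field(default_factory=list)

    def move(self, actor: str, marble: str, place: str) -> None:
        """Record a move; only characters in the room update their belief."""
        self.truth[marble] = place
        self.timeline.append((len(self.timeline) + 1, actor, marble, place))
        for witness in self.present:
            self.beliefs.setdefault(witness, {})[marble] = place

    def leave(self, who: str) -> None:
        self.present.discard(who)

    def enter(self, who: str) -> None:
        self.present.add(who)

    def false_beliefs(self) -> dict[str, dict[str, str]]:
        """Per character, the marbles whose believed location differs from the truth."""
        return {
            who: {m: loc for m, loc in seen.items() if self.truth[m] != loc}
            for who, seen in self.beliefs.items()
        }


# Classic Sally-Anne scenario, used as an assumed stand-in for the PR's story.
world = World(present={"Sally", "Anne"})
world.move("Sally", "marble", "basket")
world.leave("Sally")
world.move("Anne", "marble", "box")  # Sally does not witness this move
world.enter("Sally")

print(world.truth)
print(world.timeline)
print(world.false_beliefs())
```

In this sketch, `truth` answers the fact-tracking question, `timeline` gives the ordered list of moves, and `false_beliefs()` shows who holds an outdated location. Spoken misinformation between characters is not modelled here and would need an extra event type.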

Who would get most of the marbles right?
From the information provided, it's hard to determine who would get the most marbles correctly because it depends on how each character reacts to the false information they're given and how they decide to act upon it. However, Anne and Charlie seem to have the most interaction with the marbles, so they might have a better chance of tracking them correctly.

Who is the greediest and tries to keep things for themselves?
From the story, it seems that Charlie might be the most inclined to keep things for himself, as he is seen putting marble D into his own basket.

Where is everyone at the end of the story?
The last locations mentioned for each character are: Sally is outside, Anne leaves the room (presumably going outside), Charlie is in the room after re-entering, and Bob's last location is not specified after he places marble A in the blue box.

How many things did people see?
This is a bit ambiguous. If you're referring to the number of actions involving the marbles each character has witnessed, it would require a detailed breakdown of each paragraph. Each character has seen a different number of actions based on when they were in the room.

What do people assume?
It's hard to know for sure what each character assumes without more information, but each one likely assumes the information they're given about the marble locations is truthful, unless they know for sure it's false.

What do people know for sure without a doubt?
Each character knows for sure the actions they themselves took with the marbles. However, due to the misinformation, they may not know the true current locations of all the marbles.

How many times did each person touch each marble?
This would require a detailed count of each instance a character interacted with a marble, which can be done by parsing the text.

Who do they know is lying?
Without additional context or dialogue, it's not clear if any character is aware that others are lying.

Which question should they ask to find every marble in the fewest questions?
The most efficient question might be: "Where did you last see or place each marble?" Asked to each person, this would ideally give the current location of each marble, assuming they all tell the truth.
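
The "how many times did each person touch each marble" question is described above as something that can be done by parsing the text. A rough Python sketch of that idea follows. The sentence pattern, the `count_touches` helper, and the sample story are all assumptions made up for illustration; the real challenge text may be phrased differently and would need its own pattern.

```python
import re
from collections import Counter

# Hypothetical sentence pattern: "<Name> puts/moves/places marble <Letter> ...".
EVENT = re.compile(
    r"(?P<who>[A-Z][a-z]+) (?:puts|moves|places) marble (?P<marble>[A-Z])\b"
)


def count_touches(story: str) -> Counter:
    """Count (character, marble) interactions matched by EVENT."""
    return Counter((m["who"], m["marble"]) for m in EVENT.finditer(story))


# Made-up sample text, not the story from the challenge.
story = (
    "Sally puts marble A in the red box. Sally leaves the room. "
    "Anne moves marble A to the blue box. Charlie puts marble D in his basket. "
    "Anne moves marble A to the green box."
)

touches = count_touches(story)
for (who, marble), n in sorted(touches.items()):
    print(f"{who} touched marble {marble} {n} time(s)")
```

A regular expression is only reliable when the story uses a fixed phrasing; free-form narration would need a more robust extraction step before counting.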
