False believes challenge based on sally anne test. #4167

Merged
merged 10 commits into from Jun 9, 2023

Conversation

javableu
Contributor

@javableu javableu commented May 13, 2023

Background

Fixes issue #3871

Changes

Added a new community challenge.

Documentation

Added the documentation in docs/challenges/memory/challenge_d.md.

Test Plan

PR Quality Checklist

  • My pull request is atomic and focuses on a single change.
  • I have thoroughly tested my changes with multiple different prompts.
  • I have considered potential risks and mitigations for my changes.
  • I have documented my changes clearly and comprehensively.
  • I have not snuck in any "extra" small tweaks or changes.

@vercel

vercel bot commented May 13, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jun 9, 2023 9:59pm

@github-actions

This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size

@vercel vercel bot temporarily deployed to Preview May 13, 2023 22:31 Inactive

@vercel

vercel bot commented May 14, 2023

Deployment failed with the following error:

Resource is limited - try again in 28 minutes (more than 100, code: "api-deployments-free-per-day").

@waynehamadi
Contributor

@javableu nice, thanks! Don't hesitate to ping me next time. I will test it when I have time.

@waynehamadi
Contributor

@javableu on it

black  test_memory_challenge_d.py
replaced the level-dependent dynamic time with a fixed time
isort command for the libraries
@waynehamadi
Contributor

@javableu this is a challenge that requires incredible attention to detail. It tests the LLM a lot more than AutoGPT, but I do believe that at some point autonomous agents will need to develop the skills required to beat this challenge.

Thank you very much.

@github-actions github-actions bot added size/l and removed size/xl labels Jun 9, 2023
@vercel vercel bot temporarily deployed to Preview June 9, 2023 21:58 Inactive
@vercel vercel bot temporarily deployed to Preview June 9, 2023 21:59 Inactive
@Auto-GPT-Bot
Contributor

You changed AutoGPT's behaviour. The cassettes have been updated and will be merged to the submodule when this Pull Request gets merged.

@waynehamadi waynehamadi merged commit 474a9c4 into Significant-Gravitas:master Jun 9, 2023
8 checks passed
@javableu
Contributor Author

javableu commented Jun 12, 2023

Thanks @merwanehamadi for accepting this challenge. Even if the challenge as it stands might be a bit too complicated, I think we could reuse it as a sort of "Swiss Army knife" challenge: one story ==> many challenges that can be solved. For example, we could track the placements of the marbles, the chronology of events, the discussions the characters have had, what each individual has seen, and their beliefs about other characters' beliefs, ...

I asked the chat to come up with some examples of questions/challenges. I believe most of them are interesting to us. Can you tell me which 2 or 3 challenges from the list below I should focus on?

Fact-checking and tracking: Track the true location of each marble at the end of the story. The challenge involves disregarding the misinformation given by the characters.

Timeline Reconstruction: Create a timeline of events, including when and where each marble was moved, and by whom.

Discrepancy Identification: Identify each instance where a character gives false information about the location of a marble.

Behavior Pattern Analysis: Analyze the behavior patterns of the characters. For example, does a certain character consistently lie about the location of the marbles?

Prediction: Predict the next series of events or actions based on the patterns established in the story.

Causality Analysis: Determine the cause and effect relationships in the story. How does each action lead to the next?

Lie Detection: Determine which character is most truthful and which is least based on their statements and actions.

Narrative Generation: Generate a continuation of the story based on the established patterns of character behavior.
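To make the first three ideas concrete (fact tracking, timeline reconstruction, discrepancy identification), here is a minimal Python sketch of how a reference answer could be computed for a Sally-Anne style story. Everything in it is an assumption for illustration: the `Move` record, the `replay` and `false_beliefs` helpers, and the two-step story are invented here and are not the challenge text or AutoGPT code. It also only models what characters see, not what they are told.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    """One action: `actor` puts `marble` into `container` (hypothetical record)."""
    actor: str
    marble: str
    container: str
    witnesses: frozenset  # characters in the room, including the actor


def replay(moves):
    """Replay the moves and return (true_locations, beliefs, timeline)."""
    true_locations = {}  # marble -> container it is really in
    beliefs = {}         # character -> {marble -> container they think it is in}
    timeline = []        # (step, actor, marble, container)
    for step, move in enumerate(moves, start=1):
        true_locations[move.marble] = move.container
        timeline.append((step, move.actor, move.marble, move.container))
        # Only characters who are present update what they believe.
        for person in move.witnesses:
            beliefs.setdefault(person, {})[move.marble] = move.container
    return true_locations, beliefs, timeline


def false_beliefs(true_locations, beliefs):
    """Return (character, marble, believed, actual) wherever belief and truth differ."""
    return sorted(
        (person, marble, believed, true_locations[marble])
        for person, known in beliefs.items()
        for marble, believed in known.items()
        if believed != true_locations[marble]
    )


# Invented mini-story, NOT the story used in the challenge.
story = [
    Move("Sally", "A", "basket", frozenset({"Sally", "Anne"})),
    Move("Anne", "A", "blue box", frozenset({"Anne"})),  # Sally is outside
]
truth, beliefs, timeline = replay(story)
print(truth)                          # where each marble really is
print(false_beliefs(truth, beliefs))  # who is wrong about which marble
```

Keeping the true state and the per-character beliefs in separate dictionaries is the point of the sketch: the same replay could feed several of the questions above without rewriting the story.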

Who would get most of the marbles correct?
From the information provided, it's hard to determine who would get the most marbles correctly because it depends on how each character reacts to the false information they're given and how they decide to act upon it. However, Anne and Charlie seem to have the most interaction with the marbles, so they might have a better chance of tracking them correctly.

Who is the greediest and tries to keep things for themselves?
From the story, it seems that Charlie might be the most inclined to keep things for himself, as he is seen putting marble D into his own basket.

Where is everyone at the end of the story?
The last locations mentioned for each character are: Sally is outside, Anne leaves the room (presumably going outside), Charlie is in the room after re-entering, and Bob's last location is not specified after he places marble A in the blue box.

How many things did people see?
This is a bit ambiguous. If you're referring to the number of actions involving the marbles each character has witnessed, it would require a detailed breakdown of each paragraph. Each character has seen a different number of actions based on when they were in the room.

What do people assume?
It's hard to know for sure what each character assumes without more information, but each one likely assumes the information they're given about the marble locations is truthful, unless they know for sure it's false.

What do people know for sure without a doubt?
Each character knows for sure the actions they themselves took with the marbles. However, due to the misinformation, they may not know the true current locations of all the marbles.

How many times did each person touch each marble?
This would require a detailed count of each instance a character interacted with a marble, which can be done by parsing the text.
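
As a rough illustration of "parsing the text", the sketch below counts touches with a regular expression. It assumes every interaction is written as "<Name> <verb> marble <Letter>" with one of four verbs; the pattern, the `count_touches` helper, and the sample sentences are invented for this example, and the real story may be phrased differently.

```python
import re
from collections import Counter

# Assumed sentence shape: "<Name> puts|moves|takes|places marble <Letter> ...".
TOUCH = re.compile(r"\b([A-Z][a-z]+) (?:puts|moves|takes|places) marble ([A-Z])\b")


def count_touches(text):
    """Count how often each (character, marble) pair appears in a touch sentence."""
    return Counter(TOUCH.findall(text))


# Invented sample text, NOT the story used in the challenge.
sample = (
    "Sally puts marble A in the basket. "
    "Anne moves marble A to the blue box. "
    "Charlie puts marble D in his own basket. "
    "Anne takes marble A out again."
)
counts = count_touches(sample)
for (person, marble), n in sorted(counts.items()):
    print(f"{person} touched marble {marble} {n} time(s)")
```

A fixed pattern like this misses pronouns and passive sentences, so it is only a starting point for building a reference answer.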

Who do they know is lying?
Without additional context or dialogue, it's not clear if any character is aware that others are lying.

Which question should they ask to find every marble in the fewest questions?
The most efficient question might be: "Where did you last see or place each marble?" Asked to each person, this would ideally give the current location of each marble, assuming they all tell the truth.


4 participants