False beliefs challenge based on the Sally-Anne test. #4167
This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.
Deployment failed with the following error:
@javableu nice, thanks! Don't hesitate to ping me next time; I will test it when I have time.
@javableu on it
black test_memory_challenge_d.py
replaced the level-dependent dynamic time with a fixed time
isort command for the libraries
@javableu this is a challenge that requires incredible attention to minutiae and detail. It tests the LLM a lot more than AutoGPT, but I do believe that at some point we will need autonomous agents to develop the skills required to beat this challenge. Thank you very much.
You changed AutoGPT's behaviour. The cassettes have been updated and will be merged to the submodule when this Pull Request gets merged.
Thanks @merwanehamadi for accepting this challenge. Even if the challenge as it stands might be a bit too complicated, I think we could reuse it as a sort of "Swiss Army knife" challenge: one story, many challenges that can be solved. For example, we could track the placement of the balls, the chronology of events, the discussions the characters have had, what individuals have seen, their beliefs about beliefs, ... I asked the chat to come up with some examples of questions/challenges. I believe most of them are interesting to us. Can you tell me 2 or 3 challenges I should focus on from the list below?
Fact-checking and tracking: Track the true location of each marble at the end of the story. The challenge involves disregarding the misinformation given by the characters.
Timeline Reconstruction: Create a timeline of events, including when and where each marble was moved, and by whom.
Discrepancy Identification: Identify each instance where a character gives false information about the location of a marble.
Behavior Pattern Analysis: Analyze the behavior patterns of the characters. For example, does a certain character consistently lie about the location of the marbles?
Prediction: Predict the next series of events or actions based on the patterns established in the story.
Causality Analysis: Determine the cause-and-effect relationships in the story. How does each action lead to the next?
Lie Detection: Determine which character is the most truthful and which is the least, based on their statements and actions.
Narrative Generation: Generate a continuation of the story based on the established patterns of character behavior.
Who would get most of the marbles right?
Who is the greediest and tries to keep things for themselves?
Where is everyone at the end of the story?
How many things did people see?
What do people assume?
What do people know for sure, without a doubt?
How many times did each person touch each marble?
Who do they know is lying?
Which questions should they ask to find every marble with the fewest questions?
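To make the "fact-checking and tracking" idea above concrete, here is a minimal sketch of how true marble locations and per-character beliefs could be tracked separately. It is not the code from this PR: the `World` class, its methods, and the two-step story are hypothetical and only illustrate the Sally-Anne setup, assuming a character updates a belief only when they witness a move.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Hypothetical tracker: real marble locations vs. what each character believes."""

    location: dict = field(default_factory=dict)  # marble -> container
    beliefs: dict = field(default_factory=dict)   # character -> {marble -> container}

    def move(self, marble, container, witnesses):
        """Move a marble; only the witnesses update their belief about it."""
        self.location[marble] = container
        for person in witnesses:
            self.beliefs.setdefault(person, {})[marble] = container

    def false_beliefs(self):
        """Return (person, marble, believed, actual) for every outdated belief."""
        return [
            (person, marble, believed, self.location[marble])
            for person, known in sorted(self.beliefs.items())
            for marble, believed in sorted(known.items())
            if believed != self.location[marble]
        ]


world = World()
# Sally puts the marble in the basket while Anne watches.
world.move("red marble", "basket", witnesses=["Sally", "Anne"])
# Sally has left the room; Anne moves the marble to the box unobserved.
world.move("red marble", "box", witnesses=["Anne"])

print(world.location["red marble"])          # box
print(world.beliefs["Sally"]["red marble"])  # basket
print(world.false_beliefs())                 # [('Sally', 'red marble', 'basket', 'box')]
```

Keeping ground truth and beliefs in separate dictionaries means the same event log could answer several of the questions listed above, such as where each marble ends up and who holds an outdated belief.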
Background
Fixes issue #3871
Changes
Added a new community challenge.
Documentation
Added the documentation in docs/challenges/memory/challenge_d.md
Test Plan
PR Quality Checklist