Support Logic Reasoning Benchmark #1973
Conversation
Do you mean adding this into cmd?
Just pushed a quick-and-dirty version, still a bit buggy :( . The biggest obstacle is how to let the agent call a custom-defined Python function to help solve the task. I copy the Python file to the workspace_mount_path and tell the agent to use the code in this file. Does this logic make sense? Thanks!
This is a good question; maybe @xingyaoww can give some feedback.
@Ren-Ma Yes! I think that should work for now (if we are assuming the number of processes = 1) - Before the task starts, you clean up the
Let us know when the script is runnable (e.g., it takes an instruction as input and outputs a result) -- we can help make this more streamlined!
Thank god, it finally works!! I just tested on the first example of the ProntoQA dataset; see the action track in the README.
@@ -0,0 +1,20 @@
You are a helpful assistant assigned a logic reasoning task. You need to determine the correctness of a query given some facts and rules.
You can interact with an interactive Python (Jupyter Notebook) environment and receive the corresponding output when needed. You first need to install a Python package via ```pip install scitools-pyke```. The code should be enclosed using the "<execute_ipython>" tag.
One tip: we have a sandbox parameter in the main function, and you can execute the installation there; that may save some cost when you call GPT, since it removes at least one action. But it is also fine to tell GPT in the instructions.
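For what it's worth, the tip above might look something like the following. This is purely a hypothetical sketch: the real OpenDevin sandbox interface, and whether `execute` returns an `(exit_code, output)` pair, are assumptions.

```python
def prepare_sandbox(sandbox) -> None:
    """Run setup commands in the sandbox before the task starts,
    so the agent does not spend an LLM action on `pip install`.

    Assumption: sandbox.execute(cmd) returns (exit_code, output).
    """
    exit_code, output = sandbox.execute('pip install scitools-pyke')
    if exit_code != 0:
        raise RuntimeError(f'sandbox setup failed: {output}')
```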
Thanks for the tip, any example for this? Btw, I currently deprecated the sandbox in the main function.
Thanks! Already integrated! The sandbox is awesome, saving one action step for each instance!
I ran it locally via gpt-3.5-turbo. It can run successfully, but maybe 3.5 is not powerful enough: it outputs the answer, but not in the format you specify, and gets stuck. Then I tried gpt-4-1106-preview; it does not get stuck, but the output format also does not seem to be what you want.
I see you have a function get_test_result to get the answer, but I worry it cannot correctly deal with a different output like:
The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!
In this case, your get_test_result function cannot extract the correct result B.
But I find that GPT will run the program you provide and get the final choice:
# Execute the logic programs
answer, flag, error_message = engine.safe_execute_program(logic_programs)
answer
02:33:18 - OBSERVATION
**IPythonRunCellObservation**
'B'
I think maybe we can check this observation result to better parse the final choice?
If my understanding is wrong, please correct me.
The following is the log from gpt-4-1106-preview:
==============
STEP 1
02:33:17 - opendevin:INFO: llm.py:225 - Cost: 0.06 USD | Accumulated Cost: 0.10 USD
02:33:17 - ACTION
**IPythonRunCellAction**
THOUGHT: Great, now that the package is installed, I will proceed with the logic reasoning task. I will add the path to the code and instantiate the `LogicInferenceEngine` class, then use the `safe_execute_program` method to prove the logic programs.
CODE:
import sys
sys.path.append('/Users/yufansong/code/OpenDevin/workspace/_eval_workspace/13501')
from logic_inference import LogicInferenceEngine
# Define the dataset name and logic programs
dataset_name = "ProntoQA"
logic_programs = """
xxxxxxx
.... some log
xxxxxxx
# Instantiate the LogicInferenceEngine
workspace_mount_path = '/Users/yufansong/code/OpenDevin/workspace/_eval_workspace/13501'
engine = LogicInferenceEngine(dataset_name, workspace_mount_path)
# Execute the logic programs
answer, flag, error_message = engine.safe_execute_program(logic_programs)
answer
02:33:18 - OBSERVATION
**IPythonRunCellObservation**
'B'
==============
STEP 2
02:33:22 - opendevin:INFO: llm.py:225 - Cost: 0.05 USD | Accumulated Cost: 0.15 USD
02:33:22 - ACTION
**MessageAction** (source=None)
CONTENT: The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!
02:33:22 - ACTION
**MessageAction** (source=EventSource.AGENT)
CONTENT: The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!
02:33:22 - opendevin:INFO: agent_controller.py:160 - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
02:33:22 - OBSERVATION
**MessageAction** (source=EventSource.USER)
CONTENT: Please continue working on the task on whatever approach you think is suitable.
If you think you have solved the task, please run the following command: <execute_bash> exit </execute_bash>.
IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.
02:33:22 - opendevin:INFO: agent_controller.py:160 - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
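The fallback suggested above (checking the IPythonRunCellObservation content when the final message is not in the expected format) could be sketched as follows. The helper name and the A-E answer space are assumptions, not code from this PR.

```python
import re

# Matches a bare choice such as B or 'B' in an observation's content.
ANSWER_RE = re.compile(r"^'?([A-E])'?$")

def parse_final_choice(observation_contents: list[str]) -> "str | None":
    """Return the most recent observation that is a bare answer choice, if any."""
    for content in reversed(observation_contents):
        match = ANSWER_RE.match(content.strip())
        if match:
            return match.group(1)
    return None
```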
Pushed a few small edits! Now the answer should be correctly parsed from the message in state.history.
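One possible shape for that message parsing (a hedged guess, not the PR's actual get_test_result code): scan the agent's free-form message for the quoted choice.

```python
import re

def extract_answer(message: str) -> "str | None":
    """Pull the answer choice out of a free-form agent message,
    e.g. "The answer to the logic query is 'B'."

    Hypothetical helper; the real parsing in the PR may differ.
    """
    match = re.search(r"answer[^A-E]*'?([A-E])'?", message)
    return match.group(1) if match else None
```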
I'm not sure if I'm understanding correctly, but this implementation seems to be a bit different from the original
Co-authored-by: Ryan H. Tran <descience.thh10@gmail.com>
I have not read the original paper. Could you tell me the difference between
Do you have any other ideas?
From my understanding, the original implementation gives the model the rules and facts via natural-language context, and from the query the model has to derive the series of proof steps to come up with the answer. Regarding this implementation, I think the author is trying to convert the context and query into code for the agent to execute and obtain the final result.
I'm not sure, but I'm thinking about not feeding the raw program to the agent and instead letting it formulate the program by itself.
You are definitely right. The raw ProntoQA dataset does not provide any symbolic-language expressions or corresponding programs.
Co-authored-by: Shimada666 <649940882@qq.com>
Yeah, it seems reasonable to me, thanks for the explanation.
@Ren-Ma btw, can you let the
…Devin into eval_logic_reasoning
Done! Now we can quickly get the accuracy from metadata.json. See README.md.
LGTM. Hope someone else can also take a look before we merge it.
This PR provides a draft evaluation for two common logic reasoning benchmarks (ProntoQA, ProofWriter), which test deductive reasoning ability, i.e., judging the correctness of a query given a set of facts and rules. Solving this task requires excellent abilities in parsing natural language into prover-specific symbolic language and calling an external prover to solve the problem.
To ease the evaluation, the symbolic language is provided together with the dataset, so the only task for the agent is to correctly call the prover (pyke, a Python package).
The current draft is preliminary; the integration process has not been completed yet. I am still working on the instruction part (how do I let the agent know that it should use a Python package? Do I just tell it directly, or should I write the Python code for it? Thanks if anyone can show me an example).
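To make the deduction setting concrete, here is a toy forward-chaining check in plain Python. It is only an illustration of the facts-and-rules task, not the pyke-based prover the PR actually calls.

```python
def entails(facts: set, rules: list, query: str) -> bool:
    """Forward-chain rules of the form (premise, conclusion) to a fixpoint,
    then check whether the query was derived.

    Toy illustration only; real benchmarks use a full first-order prover.
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return query in derived
```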