
Support Logic Reasoning Benchmark #1973

Merged: 24 commits into OpenDevin:main, May 30, 2024

Conversation

Ren-Ma
Contributor

@Ren-Ma Ren-Ma commented May 22, 2024

This PR provides a draft evaluation for two common logic reasoning benchmarks (ProntoQA, ProofWriter), which test deductive reasoning, i.e., judging the correctness of a query given a set of facts and rules. Solving this task requires parsing natural language into prover-specific symbolic language and calling an external prover to solve the problem.

To ease the evaluation, the symbolic-language programs are provided together with the dataset, so the only task for the agent is to correctly call the prover (pyke, a Python package).

The current draft is preliminary; the integration is not complete yet. I am still working on the instruction part (how do I let the agent know that it should use a Python package? Do I just tell it directly, or should I write the Python code for it? Thanks if anyone can show me an example).
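
To give a sense of what "calling the prover" involves, here is a minimal Pyke sketch. The rule-base name, goal, and answer mapping are placeholders for illustration; in the benchmark a helper module (logic_inference.py, discussed below) wraps the prover call.

```python
from pyke import knowledge_engine

# Pyke compiles .krb (rules) and .kfb (facts) files found next to this script.
# 'facts', 'rules', and the goal name are placeholders, not the benchmark's files.
engine = knowledge_engine.engine(__file__)
engine.activate('rules')  # activate the rule base defined in rules.krb

try:
    # Attempt to prove the query; prove_1_goal raises if no proof exists.
    vars, plan = engine.prove_1_goal('facts.query_is_true()')
    answer = 'A'  # proof found, e.g. the "True" option
except knowledge_engine.CanNotProve:
    answer = 'B'  # no proof found, e.g. the "False" option
print(answer)
```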

@yufansong
Collaborator

I am still working on the instruction part (how do I let the agent know that it should use a Python package? Do I just tell it directly, or should I write the Python code for it? Thanks if anyone can show me an example).

Do you mean adding this into the cmd?

@Ren-Ma Ren-Ma marked this pull request as ready for review May 23, 2024 15:07
@Ren-Ma
Contributor Author

Ren-Ma commented May 23, 2024

Just pushed a quick-and-dirty version, still a bit buggy :( . The biggest obstacle is how to let the agent call a custom-defined Python function to help solve the task. I copy the Python file to the workspace_mount_path and tell the agent to use the code in this file. Does this logic make sense? Thanks!
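
Concretely, the copy step per instance is roughly the following (paths are illustrative, not the exact values from my config):

```python
import shutil
from pathlib import Path

# Illustrative paths; the real script reads workspace_mount_path from the eval config.
workspace_mount_path = Path('/tmp/opendevin_eval_workspace')
helper_source = Path(__file__).parent / 'logic_inference.py'

# Start from a clean workspace so files from a previous instance don't leak in.
workspace_mount_path.mkdir(parents=True, exist_ok=True)
for stale in workspace_mount_path.glob('*'):
    if stale.is_file():
        stale.unlink()

# Copy the helper the agent is told to use.
shutil.copy(helper_source, workspace_mount_path / 'logic_inference.py')
```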

@neubig
Contributor

neubig commented May 23, 2024

This is a good question, maybe @xingyaoww can give some feedback.

@xingyaoww
Contributor

@Ren-Ma Yes! I think that should work for now (assuming the number of processes = 1): before the task starts, you clean up the workspace, put the relevant code into the workspace, then ask the agent to look at /workspace and begin working!

@xingyaoww
Contributor

Let us know when the script is runnable (e.g., it takes an instruction as input and outputs a result) -- we can help make this more streamlined!

@neubig neubig marked this pull request as draft May 24, 2024 09:57
@neubig neubig assigned Ren-Ma and unassigned xingyaoww May 24, 2024
@Ren-Ma Ren-Ma marked this pull request as ready for review May 26, 2024 14:31
@Ren-Ma
Contributor Author

Ren-Ma commented May 26, 2024

Let us know when the script is runnable (e.g., it takes an instruction as input and outputs a result) -- we can help make this more streamlined!

Thank god, it finally works!! I just tested it on the first example of the ProntoQA dataset; see the action track in the README.

@@ -0,0 +1,20 @@
You are a helpful assistant assigned a logic reasoning task. You need to determine the correctness of a query given some facts and rules.
You can interact with an interactive Python (Jupyter Notebook) environment and receive the corresponding output when needed. You first need to install a Python package through ```pip install scitools-pyke```. The code should be enclosed in the "<execute_ipython>" tag.
Collaborator

@yufansong yufansong May 26, 2024

One tip: we have a sandbox parameter in the main function, and you can execute the installation there; it may save some cost when you call GPT, since it removes at least one action. But it is also fine to tell GPT in the instructions.

Contributor Author

Thanks for the tip, any example for this? BTW, I currently dropped the sandbox from the main function.

Collaborator

@li-boxuan li-boxuan May 29, 2024

Contributor Author

Thanks, already integrated! The sandbox is awesome, saving one action step for each instance!
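
For reference, the integration boils down to running the install through the sandbox before the agent starts. A rough sketch, assuming the sandbox object exposes an execute() method that runs a shell command and returns an exit code plus output (the real signature may differ):

```python
# Sketch only: the execute(cmd) -> (exit_code, output) signature is an assumption.
def prepare_sandbox(sandbox):
    exit_code, output = sandbox.execute('pip install scitools-pyke')
    if exit_code != 0:
        raise RuntimeError(f'pip install scitools-pyke failed in sandbox: {output}')
```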

Collaborator

@yufansong yufansong left a comment

I ran it locally via gpt-3.5-turbo. It can run successfully, but maybe 3.5 is not powerful enough: it outputs the answer, though not in the format you specify, and then gets stuck. Then I tried gpt-4-1106-preview; it does not get stuck, but the output format also doesn't seem to be what you want.

I see you have a function get_test_result to get the answer, but I worry it cannot correctly handle different outputs like:

The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!

In this case, your get_test_result function cannot extract the correct result B.

But I find that GPT will run the program you provide and get the final choice:

# Execute the logic programs
answer, flag, error_message = engine.safe_execute_program(logic_programs)
answer
02:33:18 - OBSERVATION
**IPythonRunCellObservation**
'B'

I think maybe we can check this observation result to better parse the final choice?
If my understanding is wrong, please correct me.

The following is the log from gpt-4-1106-preview:


==============
STEP 1

02:33:17 - opendevin:INFO: llm.py:225 - Cost: 0.06 USD | Accumulated Cost: 0.10 USD
02:33:17 - ACTION
**IPythonRunCellAction**
THOUGHT: Great, now that the package is installed, I will proceed with the logic reasoning task. I will add the path to the code and instantiate the `LogicInferenceEngine` class, then use the `safe_execute_program` method to prove the logic programs.
CODE:
import sys
sys.path.append('/Users/yufansong/code/OpenDevin/workspace/_eval_workspace/13501')

from logic_inference import LogicInferenceEngine

# Define the dataset name and logic programs
dataset_name = "ProntoQA"
logic_programs = """

xxxxxxx
.... some log
xxxxxxx

# Instantiate the LogicInferenceEngine
workspace_mount_path = '/Users/yufansong/code/OpenDevin/workspace/_eval_workspace/13501'
engine = LogicInferenceEngine(dataset_name, workspace_mount_path)

# Execute the logic programs
answer, flag, error_message = engine.safe_execute_program(logic_programs)
answer
02:33:18 - OBSERVATION
**IPythonRunCellObservation**
'B'


==============
STEP 2

02:33:22 - opendevin:INFO: llm.py:225 - Cost: 0.05 USD | Accumulated Cost: 0.15 USD
02:33:22 - ACTION
**MessageAction** (source=None)
CONTENT: The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!
02:33:22 - ACTION
**MessageAction** (source=EventSource.AGENT)
CONTENT: The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!
02:33:22 - opendevin:INFO: agent_controller.py:160 - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
02:33:22 - OBSERVATION
**MessageAction** (source=EventSource.USER)
CONTENT: Please continue working on the task on whatever approach you think is suitable.
If you think you have solved the task, please run the following command: <execute_bash> exit </execute_bash>.
IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.

02:33:22 - opendevin:INFO: agent_controller.py:160 - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING

@Ren-Ma
Contributor Author

Ren-Ma commented May 27, 2024

I think maybe we can check this observation result to better parse the final choice?

Pushed a few edits! Now the answer should be correctly parsed from the messages in state.history.
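
The parsing is roughly along these lines. This is a simplified sketch rather than the exact get_test_result in the PR; history_texts stands for the text content of the messages and observations in state.history:

```python
import re

def parse_final_answer(history_texts):
    """Simplified sketch: walk the agent's messages/observations from newest
    to oldest and pull out a single-letter choice such as 'A' or 'B'."""
    for text in reversed(history_texts):
        stripped = text.strip().strip("'\"")
        # Case 1: bare cell output, e.g. the IPythonRunCellObservation 'B'
        if stripped in {'A', 'B', 'C', 'D'}:
            return stripped
        # Case 2: sentence form, e.g. "The answer to the logic query is 'B'."
        match = re.search(r"\bis\s*'?([ABCD])'?(\W|$)", text)
        if match:
            return match.group(1)
    return None
```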

@ryanhoangt
Contributor

I'm not sure if I'm understanding correctly, but this implementation seems to be a bit different from the original ProntoQA. Here we feed the program to the agent in advance and just make it execute the program and obtain the result. Is that correct? Because I'm afraid it may be too easy for many models 🤔

@yufansong
Collaborator

I'm not sure if I'm understanding correctly, but this implementation seems to be a bit different from the original ProntoQA.

I have not read the original paper. Could you tell me the difference between the original ProntoQA and this implementation?

Here we feed the program to the agent in advance and just make it execute the program and obtain the result. Is that correct? Because I'm afraid it may be too easy for many models 🤔

Do you have any other ideas?

@ryanhoangt
Contributor

ryanhoangt commented May 29, 2024

I have not read the original paper. Could you tell me the difference between the original ProntoQA and this implementation?

From my understanding, the original implementation gives the model the rules and facts via natural-language context, and from the query the model has to derive a series of proof steps to come up with the answer. In this implementation, I think the author converts the context and query to code for the agent to execute and obtain the final result.

Do you have any other ideas?

I'm not sure, but I'm thinking about not feeding the raw program to the agent and instead letting it formulate the program by itself.

@Ren-Ma
Contributor Author

Ren-Ma commented May 29, 2024

From my understanding, the original implementation gives the model the rules and facts via natural-language context, and from the query the model has to derive a series of proof steps to come up with the answer.

I'm not sure, but I'm thinking about not feeding the raw program to the agent and instead letting it formulate the program by itself.

You are definitely right. The raw ProntoQA dataset does not provide any symbolic-language expressions or corresponding programs.
The logic of a neuro-symbolic method is to 1) parse the logic reasoning problem in natural language into corresponding programs, and 2) feed the programs into an inference engine (pyke in this case) to get the answer.
Here I skipped the first step and just let the agent do the job in the second step, because I assumed that in OpenDevin's benchmark the core ability we want to test is how to interact with the local environment to run the inference engine (please correct me if I am wrong). The first step is actually a semantic parsing task. Based on my personal experience, the correctness of this semantic parsing task is terrible even for SOTA models (GPT-4/Claude/Llama 3...). If we also included the first step, the final performance of OpenDevin on this benchmark would be heavily influenced by the semantic parsing correctness.
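
Per instance, step 2 is essentially what shows up in the log above: instantiate the helper engine shipped with the benchmark and feed it the dataset-provided program. A sketch (the program string and workspace path are placeholders):

```python
from logic_inference import LogicInferenceEngine

dataset_name = 'ProntoQA'
logic_programs = '...'  # symbolic program shipped with the dataset instance (placeholder)
workspace_mount_path = '/workspace'  # wherever logic_inference.py and its data live

# Mirrors the call sequence visible in the gpt-4-1106-preview log above.
engine = LogicInferenceEngine(dataset_name, workspace_mount_path)
answer, flag, error_message = engine.safe_execute_program(logic_programs)
print(answer)  # e.g. 'B'
```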

@ryanhoangt
Contributor

Here I skipped the first step and just let the agent do the job in the second step, because I assumed that in OpenDevin's benchmark the core ability we want to test is how to interact with the local environment to run the inference engine.

Yeah, it seems reasonable to me, thanks for the explanation.

@yufansong
Collaborator

@Ren-Ma BTW, can you make run_infer.py or run_infer.sh output some final result like the accuracy rate? It would be convenient for us when running your benchmark.

@Ren-Ma
Contributor Author

Ren-Ma commented May 29, 2024

@Ren-Ma BTW, can you make run_infer.py or run_infer.sh output some final result like the accuracy rate? It would be convenient for us when running your benchmark.

Done! Now we can quickly get the accuracy from the metadata.json. See README.md.
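
The computation itself is just a pass over the per-instance outputs, roughly like this sketch (the output.jsonl path and the 'test_result'/'result' field names are assumptions, not necessarily the exact schema the script writes):

```python
import json

def compute_accuracy(output_jsonl_path):
    # Field names below are assumptions; adjust to the schema run_infer.py writes.
    correct = total = 0
    with open(output_jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            correct += int(bool(record['test_result']['result']))
    return correct / total if total else 0.0
```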

@Ren-Ma Ren-Ma requested a review from li-boxuan May 29, 2024 10:00
@yufansong
Collaborator

LGTM. Hope someone else can also take a look before we merge it.

@yufansong yufansong merged commit a982349 into OpenDevin:main May 30, 2024
17 of 18 checks passed