
Support Logic Reasoning Benchmark #1973

Merged: 24 commits into OpenDevin:main, May 30, 2024

Conversation

Ren-Ma
Contributor

@Ren-Ma Ren-Ma commented May 22, 2024

This PR provides a draft evaluation for two common logic reasoning benchmarks (ProntoQA, ProofWriter), which test deductive reasoning, i.e., judging the correctness of a query given a set of facts and rules. Solving this task requires parsing natural language into prover-specific symbolic language and calling an external prover to solve the problem.

To ease the evaluation, the symbolic-language programs are provided together with the dataset, so the only task for the agent is to correctly call the prover (pyke, a Python package).

The current draft is preliminary; the integration is not complete yet. I am still working on the instruction part (how do I let the agent know that it should use a Python package? Do I just tell it directly, or should I write the Python code for it? Thanks if anyone can show me an example).
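
To give a sense of what "calling the prover" involves, here is a minimal Pyke sketch. The rule-base name, goal, and answer mapping are placeholders for illustration; in the benchmark a helper module (logic_inference.py, discussed below) wraps the prover call.

```python
from pyke import knowledge_engine

# Pyke compiles .krb (rules) and .kfb (facts) files found next to this script.
# 'facts', 'rules', and the goal name are placeholders, not the benchmark's files.
engine = knowledge_engine.engine(__file__)
engine.activate('rules')  # activate the rule base defined in rules.krb

try:
    # Attempt to prove the query; prove_1_goal raises if no proof exists.
    vars, plan = engine.prove_1_goal('facts.query_is_true()')
    answer = 'A'  # proof found, e.g. the "True" option
except knowledge_engine.CanNotProve:
    answer = 'B'  # no proof found, e.g. the "False" option
print(answer)
```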

@yufansong
Collaborator

I am still working on the instruction part (how do I let the agent know that it should use a Python package? Do I just tell it directly, or should I write the Python code for it? Thanks if anyone can show me an example).

Do you mean adding this into the cmd?

@Ren-Ma Ren-Ma marked this pull request as ready for review May 23, 2024 15:07
@Ren-Ma
Contributor Author

Ren-Ma commented May 23, 2024

Just pushed a quick-and-dirty version, still a bit buggy :( . The biggest obstacle is how to let the agent call a custom-defined Python function to help solve the task. I copy the Python file to the workspace_mount_path and tell the agent to use the code in this file. Does this logic make sense? Thanks!
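
Concretely, the copy step per instance is roughly the following (paths are illustrative, not the exact values from my config):

```python
import shutil
from pathlib import Path

# Illustrative paths; the real script reads workspace_mount_path from the eval config.
workspace_mount_path = Path('/tmp/opendevin_eval_workspace')
helper_source = Path(__file__).parent / 'logic_inference.py'

# Start from a clean workspace so files from a previous instance don't leak in.
workspace_mount_path.mkdir(parents=True, exist_ok=True)
for stale in workspace_mount_path.glob('*'):
    if stale.is_file():
        stale.unlink()

# Copy the helper the agent is told to use.
shutil.copy(helper_source, workspace_mount_path / 'logic_inference.py')
```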

@neubig
Contributor

neubig commented May 23, 2024

This is a good question, maybe @xingyaoww can give some feedback.

@xingyaoww
Contributor

@Ren-Ma Yes! I think that should work for now (assuming the number of processes = 1): before the task starts, you clean up the workspace, put the relevant code into the workspace, then ask the agent to look at /workspace and begin working!

@xingyaoww
Contributor

Let us know when the script is runnable (e.g., it takes an instruction as input and outputs a result) -- we can help make this more streamlined!

@neubig neubig marked this pull request as draft May 24, 2024 09:57
@neubig neubig assigned Ren-Ma and unassigned xingyaoww May 24, 2024
@Ren-Ma Ren-Ma marked this pull request as ready for review May 26, 2024 14:31
@Ren-Ma
Contributor Author

Ren-Ma commented May 26, 2024

Let us know when the script is runnable (e.g., it takes an instruction as input and outputs a result) -- we can help make this more streamlined!

Thank god, it finally works!! I just tested it on the first example of the ProntoQA dataset; see the action track in the README.

@@ -0,0 +1,20 @@
You are a helpful assistant assigned a logic reasoning task. You need to determine the correctness of a query given some facts and rules.
You can interact with an interactive Python (Jupyter Notebook) environment and receive the corresponding output when needed. You first need to install a Python package through ```pip install scitools-pyke```. The code should be enclosed in the "<execute_ipython>" tag.
Collaborator

@yufansong yufansong May 26, 2024

One tip: we have a sandbox parameter in the main function, and you can execute the installation there; it may save some cost when you call GPT, since it removes at least one action. But it is also fine to tell GPT in the instructions.

Contributor Author

Thanks for the tip, any example for this? BTW, I currently dropped the sandbox from the main function.

Collaborator

@li-boxuan li-boxuan May 29, 2024

Contributor Author

Thanks, already integrated! The sandbox is awesome, saving one action step for each instance!
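
For reference, the integration boils down to running the install through the sandbox before the agent starts. A rough sketch, assuming the sandbox object exposes an execute() method that runs a shell command and returns an exit code plus output (the real signature may differ):

```python
# Sketch only: the execute(cmd) -> (exit_code, output) signature is an assumption.
def prepare_sandbox(sandbox):
    exit_code, output = sandbox.execute('pip install scitools-pyke')
    if exit_code != 0:
        raise RuntimeError(f'pip install scitools-pyke failed in sandbox: {output}')
```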

Collaborator

@yufansong yufansong left a comment

I ran it locally via gpt-3.5-turbo. It can run successfully, but maybe 3.5 is not powerful enough: it outputs the answer, though not in the format you specify, and then gets stuck. Then I tried gpt-4-1106-preview; it does not get stuck, but the output format also doesn't seem to be what you want.

I see you have a function get_test_result to get the answer, but I worry it cannot correctly handle different outputs like:

The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!

In this case, your get_test_result function cannot extract the correct result B.

But I find that GPT will run the program you provide and get the final choice:

# Execute the logic programs
answer, flag, error_message = engine.safe_execute_program(logic_programs)
answer
02:33:18 - OBSERVATION
**IPythonRunCellObservation**
'B'

I think maybe we can check this observation result to better parse the final choice?
If my understanding is wrong, please correct me.

The following is the log from gpt-4-1106-preview:


==============
STEP 1

02:33:17 - opendevin:INFO: llm.py:225 - Cost: 0.06 USD | Accumulated Cost: 0.10 USD
02:33:17 - ACTION
**IPythonRunCellAction**
THOUGHT: Great, now that the package is installed, I will proceed with the logic reasoning task. I will add the path to the code and instantiate the `LogicInferenceEngine` class, then use the `safe_execute_program` method to prove the logic programs.
CODE:
import sys
sys.path.append('/Users/yufansong/code/OpenDevin/workspace/_eval_workspace/13501')

from logic_inference import LogicInferenceEngine

# Define the dataset name and logic programs
dataset_name = "ProntoQA"
logic_programs = """

xxxxxxx
.... some log
xxxxxxx

# Instantiate the LogicInferenceEngine
workspace_mount_path = '/Users/yufansong/code/OpenDevin/workspace/_eval_workspace/13501'
engine = LogicInferenceEngine(dataset_name, workspace_mount_path)

# Execute the logic programs
answer, flag, error_message = engine.safe_execute_program(logic_programs)
answer
02:33:18 - OBSERVATION
**IPythonRunCellObservation**
'B'


==============
STEP 2

02:33:22 - opendevin:INFO: llm.py:225 - Cost: 0.05 USD | Accumulated Cost: 0.15 USD
02:33:22 - ACTION
**MessageAction** (source=None)
CONTENT: The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!
02:33:22 - ACTION
**MessageAction** (source=EventSource.AGENT)
CONTENT: The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!
02:33:22 - opendevin:INFO: agent_controller.py:160 - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
02:33:22 - OBSERVATION
**MessageAction** (source=EventSource.USER)
CONTENT: Please continue working on the task on whatever approach you think is suitable.
If you think you have solved the task, please run the following command: <execute_bash> exit </execute_bash>.
IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.

02:33:22 - opendevin:INFO: agent_controller.py:160 - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING

@Ren-Ma
Contributor Author

Ren-Ma commented May 27, 2024

I think maybe we can check this observation result to better parse the final choice?

Pushed a few edits! Now the answer should be correctly parsed from the messages in state.history.
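
The parsing is roughly along these lines. This is a simplified sketch rather than the exact get_test_result in the PR; history_texts stands for the text content of the messages and observations in state.history:

```python
import re

def parse_final_answer(history_texts):
    """Simplified sketch: walk the agent's messages/observations from newest
    to oldest and pull out a single-letter choice such as 'A' or 'B'."""
    for text in reversed(history_texts):
        stripped = text.strip().strip("'\"")
        # Case 1: bare cell output, e.g. the IPythonRunCellObservation 'B'
        if stripped in {'A', 'B', 'C', 'D'}:
            return stripped
        # Case 2: sentence form, e.g. "The answer to the logic query is 'B'."
        match = re.search(r"\bis\s*'?([ABCD])'?(\W|$)", text)
        if match:
            return match.group(1)
    return None
```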

@ryanhoangt
Contributor

I'm not sure if I'm understanding correctly, but this implementation seems to be a bit different from the original ProntoQA. Here we feed the program to the agent in advance and just make it execute the program and obtain the result. Is that correct? Because I'm afraid it may be too easy for many models 🤔

@yufansong
Collaborator

I'm not sure if I'm understanding correctly, but this implementation seems to be a bit different from the original ProntoQA.

I have not read the original paper. Could you tell me the difference between the original ProntoQA and this implementation?

Here we feed the program to the agent in advance and just make it execute the program and obtain the result. Is that correct? Because I'm afraid it may be too easy for many models 🤔

Do you have any other ideas?

@ryanhoangt
Contributor

ryanhoangt commented May 29, 2024

I have not read the original paper. Could you tell me the difference between the original ProntoQA and this implementation?

From my understanding, the original implementation gives the model the rules and facts via natural-language context, and from the query the model has to derive a series of proof steps to come up with the answer. In this implementation, I think the author converts the context and query to code for the agent to execute and obtain the final result.

Do you have any other ideas?

I'm not sure, but I'm thinking about not feeding the raw program to the agent and instead letting it formulate the program by itself.

@Ren-Ma
Contributor Author

Ren-Ma commented May 29, 2024

From my understanding, the original implementation gives the model the rules and facts via natural-language context, and from the query the model has to derive a series of proof steps to come up with the answer.

I'm not sure, but I'm thinking about not feeding the raw program to the agent and instead letting it formulate the program by itself.

You are definitely right. The raw ProntoQA dataset does not provide any symbolic-language expressions or corresponding programs.
The logic of a neuro-symbolic method is to 1) parse the logic reasoning problem in natural language into corresponding programs, and 2) feed the programs into an inference engine (pyke in this case) to get the answer.
Here I skipped the first step and just let the agent do the job in the second step, because I assumed that in OpenDevin's benchmark the core ability we want to test is how to interact with the local environment to run the inference engine (please correct me if I am wrong). The first step is actually a semantic parsing task. Based on my personal experience, the correctness of this semantic parsing task is terrible even for SOTA models (GPT-4/Claude/Llama 3...). If we also included the first step, the final performance of OpenDevin on this benchmark would be heavily influenced by the semantic parsing correctness.
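
Per instance, step 2 is essentially what shows up in the log above: instantiate the helper engine shipped with the benchmark and feed it the dataset-provided program. A sketch (the program string and workspace path are placeholders):

```python
from logic_inference import LogicInferenceEngine

dataset_name = 'ProntoQA'
logic_programs = '...'  # symbolic program shipped with the dataset instance (placeholder)
workspace_mount_path = '/workspace'  # wherever logic_inference.py and its data live

# Mirrors the call sequence visible in the gpt-4-1106-preview log above.
engine = LogicInferenceEngine(dataset_name, workspace_mount_path)
answer, flag, error_message = engine.safe_execute_program(logic_programs)
print(answer)  # e.g. 'B'
```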

@ryanhoangt
Contributor

Here I skipped the first step and just let the agent do the job in the second step, because I assumed that in OpenDevin's benchmark the core ability we want to test is how to interact with the local environment to run the inference engine.

Yeah, it seems reasonable to me, thanks for the explanation.

@yufansong
Collaborator

@Ren-Ma BTW, can you make run_infer.py or run_infer.sh output some final result like the accuracy rate? It would be convenient for us when running your benchmark.

@Ren-Ma
Contributor Author

Ren-Ma commented May 29, 2024

@Ren-Ma BTW, can you make run_infer.py or run_infer.sh output some final result like the accuracy rate? It would be convenient for us when running your benchmark.

Done! Now we can quickly get the accuracy from the metadata.json. See README.md.
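
The computation itself is just a pass over the per-instance outputs, roughly like this sketch (the output.jsonl path and the 'test_result'/'result' field names are assumptions, not necessarily the exact schema the script writes):

```python
import json

def compute_accuracy(output_jsonl_path):
    # Field names below are assumptions; adjust to the schema run_infer.py writes.
    correct = total = 0
    with open(output_jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            correct += int(bool(record['test_result']['result']))
    return correct / total if total else 0.0
```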

@Ren-Ma Ren-Ma requested a review from li-boxuan May 29, 2024 10:00
@yufansong
Collaborator

LGTM. Hope someone else can also take a look before we merge it.

@yufansong yufansong merged commit a982349 into OpenDevin:main May 30, 2024
17 of 18 checks passed