Support MINT benchmark (MATH, GSM8K subset) #1955
The evaluation for the MATH subset can now be run using:
The result looks like this:
Pending improvements:
@xingyaoww can you help me review it?
Great progress! I'll try to run it tomorrow and verify it, hopefully we can merge it soon
I ran it locally and printed the GPT response and the fake user reply. The agent response contains <execute> .... </execute>, but the user response always sends back "I don't understand your input."
Is this expected?
The following is the log:
2024-05-25 01:44:56,537 - INFO - MessageAction(content='<thought>\nTo find the area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$, we need to consider the floor function.\nThe floor function $\\lfloor x \\rfloor$ gives the largest integer less than or equal to $x$.\nSo, we need to find all pairs of integers $x$ and $y$ such that their product is equal to $16$.\n</thought>\n<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:44:56,537 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:44:56,537 - INFO - Gold reference: 10.0
2024-05-25 01:44:56,537 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:44:56,538 - INFO - gpt msg:<thought>
To find the area of the region in the $xy-$plane that satisfies $\lfloor x \rfloor \lfloor y \rfloor = 16$, we need to consider the floor function.
The floor function $\lfloor x \rfloor$ gives the largest integer less than or equal to $x$.
So, we need to find all pairs of integers $x$ and $y$ such that their product is equal to $16$.
</thought>
<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:44:56,538 - INFO - User response:Observation:
I don't understand your input.
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>.
You have 4 steps left and 1 chances to propose solution left.
2024-05-25 01:44:56,538 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:44:56,639 - INFO - STEP 1
2024-05-25 01:44:59,822 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.00 USD
2024-05-25 01:44:59,822 - INFO - MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:44:59,823 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:44:59,823 - INFO - Gold reference: 10.0
2024-05-25 01:44:59,823 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:44:59,823 - INFO - gpt msg:<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:44:59,823 - INFO - User response:Observation:
I don't understand your input.
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>.
You have 3 steps left and 1 chances to propose solution left.
2024-05-25 01:44:59,823 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:44:59,925 - INFO - STEP 2
2024-05-25 01:45:02,313 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.01 USD
2024-05-25 01:45:02,313 - INFO - MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:45:02,313 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:45:02,313 - INFO - Gold reference: 10.0
2024-05-25 01:45:02,313 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:45:02,314 - INFO - gpt msg:<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:45:02,314 - INFO - User response:Observation:
I don't understand your input.
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>.
You have 2 steps left and 1 chances to propose solution left.
2024-05-25 01:45:02,314 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:45:02,415 - INFO - STEP 3
2024-05-25 01:45:04,731 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.01 USD
2024-05-25 01:45:04,731 - INFO - MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:45:04,732 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:45:04,732 - INFO - Gold reference: 10.0
2024-05-25 01:45:04,732 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:45:04,732 - INFO - gpt msg:<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:45:04,732 - INFO - User response:Observation:
I don't understand your input.
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>.
You have 1 steps left and 1 chances to propose solution left.
You should take the last step to propose a solution.
2024-05-25 01:45:04,732 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:45:04,834 - INFO - STEP 4
2024-05-25 01:45:07,095 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.01 USD
2024-05-25 01:45:07,095 - INFO - MessageAction(content='<solution> The area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$ is 9 square units. </solution>', wait_for_response=True, action='message')
2024-05-25 01:45:07,095 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:45:07,095 - INFO - Gold reference: 10.0
2024-05-25 01:45:07,095 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:45:07,096 - INFO - gpt msg:<solution> The area of the region in the $xy-$plane that satisfies $\lfloor x \rfloor \lfloor y \rfloor = 16$ is 9 square units. </solution>
2024-05-25 01:45:07,096 - INFO - User response:Observation:
Your answer is wrong.
You have 0 steps left and 1 chances to propose solution left.
You should take the last step to propose a solution.
2024-05-25 01:45:07,096 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:45:07,197 - INFO - STEP 5
2024-05-25 01:45:07,197 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.ERROR
2024-05-25 01:45:08,098 - INFO - Setting agent(CodeActAgent) state from AgentState.ERROR to AgentState.STOPPED
2024-05-25 01:45:09,047 - INFO - AgentController task was cancelled
2024-05-25 01:45:09,049 - INFO - Msgs: [(MessageAction(content='You are a helpful assistant assigned with the task of problem-solving.\nTo solve the task, you can only interact with the interactive Python (Jupyter Notebook) environment using <execute_ipython> tag. Other tools cannot be used.\nAt each turn, you should first provide your step-by-step thinking for solving the task. Your thought process should be enclosed using "<thought>" tag, for example: <thought> I need to print "Hello World!" </thought>.\n\nAfter that, you have two options:\n1) Interact with a Python programming environment and receive the corresponding output.\n2) Directly provide a solution by sending your answer to user through message that adheres to the required format for the given task. Your solution should be enclosed using "<solution>" tag, for example: The answer is <solution> A </solution>.\nEither you choose to interact with the Python environment or provide a solution, you need to send a message to the user to evaluate your response and provide feedback.\n\nYou have 5 chances to interact with the environment or propose a solution. You can only propose a solution 2 times.\n\n---\n\nTask:\nEvaluate $\\int\\int_C (x+y)^2 dx,dy$, where C is the circle defined by $x^2 + y^2 = M$, and M is the birth year of Li Bai.\nOutput format requirement: Round to three decimal places, using half up rounding rule.\n\nAssistant:\n<thought> To solve this problem, let\'s start by determining Li Bai\'s birth year. 
</thought>\n<execute> print(wikipedia_search(\'Li Bai\')) </execute>\n\nObservation:\nPage: Li Bai\nSummary: Li Bai (Chinese: 李白; pinyin: Lǐ Bái, 701–762), also pronounced as Li Bo, courtesy name Taibai (Chinese: 太白), was a Chinese poet, acclaimed from his own time to the present as a brilliant and romantic figure who took traditional poetic forms to new heights.\n\nAssistant:\n<thought>\nFrom the wiki_search result, we can see that Li Bai was born in 701, thus M = 701.\nTo evaluate the double integral of $(x+y)^2$ over the region $C$, where $C$ is the circle defined by $x^2 + y^2 = 701$, we should use polar coordinates.\nIn polar coordinates, $x = r\\cos(\\theta)$ and $y = r\\sin(\\theta)$, and the differential area element $dx,dy$ becomes $r,dr,d\\theta$.\n$\\int\\int_C (x+y)^2 dx,dy = \\int_0^{2\\pi}\\int_0^{\\sqrt{701}} (r\\cos(\\theta) + r\\sin(\\theta))^2r,dr,d\\theta$\nLet\'s use the SymPy package to calculate this.\n</thought>\n<execute>\nfrom sympy import symbols, integrate, cos, sin, pi\n\nr, theta = symbols(\'r theta\')\n\n# the integrand\nintegrand = (r*cos(theta) + r*sin(theta))**2 * r\n\n# perform the double integration\ndouble_integral = integrate(integrate(integrand, (r, 0, 701 ** 0.5)), (theta, 0, 2*pi))\nprint(double_integral)\n</execute>\n\nObservation:\nOut[2]: 245700.5*pi\n\nAssistant:\n<thought> Let\'s round this result to three decimal places using half up rounding rule. 
</thought>\n<execute>\nimport math\nprint(round(245700.5*math.pi, 3))\n</execute>\n\nObservation:\n771890.886\n\nAssistant:\nThe answer is <solution> 771890.886 </solution>.\n\n\n---\n\n# Problem statement:\nTask:\nWhat is the area of the region in the $xy-$plane that satisfies \\[\\lfloor x \\rfloor \\lfloor y \\rfloor = 16\\, ?\\]\nIMPORTANT: You should ONLY interact with the environment provided to you or provide the solution inside <solution> tag AND NEVER ASK FOR HUMAN HELP.\n', wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<thought>\nTo find the area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$, we need to consider the floor function.\nThe floor function $\\lfloor x \\rfloor$ gives the largest integer less than or equal to $x$.\nSo, we need to find all pairs of integers $x$ and $y$ such that their product is equal to $16$.\n</thought>\n<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. 
\n\nYou have 4 steps left and 1 chances to propose solution left.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. \n\nYou have 3 steps left and 1 chances to propose solution left.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. 
\n\nYou have 2 steps left and 1 chances to propose solution left.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. \n\nYou have 1 steps left and 1 chances to propose solution left.\nYou should take the last step to propose a solution.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<solution> The area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$ is 9 square units. </solution>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='Observation:\nYour answer is wrong.\nYou have 0 steps left and 1 chances to propose solution left.\nYou should take the last step to propose a solution.\n', wait_for_response=False, action='message'), NullObservation(content='', observation='null'))]
2024-05-25 01:45:09,049 - INFO - Task state: {'finished': False, 'success': False, 'agent_action_count': {'propose_solution': 1, 'use_tool': 0, 'invalid_action': 0}, 'terminate_reason': None, 'latest_output': {'observation': None, 'success': False, 'content': 'Observation:\nYour answer is wrong.\nYou have 0 steps left and 1 chances to propose solution left.\nYou should take the last step to propose a solution.\n'}}
2024-05-25 01:45:09,049 - INFO - Final message: | Ground truth: 10.0
2024-05-25 01:45:09,327 - INFO - BrowserEnv already closed, no need to close again
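As a side note on the task in this log: the gold reference of 10.0 can be checked with a few lines of Python. This is only a sketch of the arithmetic, not part of the PR. It assumes each integer pair (a, b) with a * b = 16 contributes one unit square [a, a+1) x [b, b+1), and the function name is made up for illustration. The agent's code above only enumerates positive divisors, which gives 5 pairs; the negative pairs supply the other 5.

```python
def count_unit_squares(n: int) -> int:
    """Count integer pairs (a, b) with a * b == n, negatives included.

    Each pair corresponds to one unit square where floor(x) == a and
    floor(y) == b, so the count equals the area of the region.
    """
    count = 0
    for a in range(-abs(n), abs(n) + 1):
        # a == 0 can never give a nonzero product; skip it to avoid n % 0.
        if a != 0 and n % a == 0:
            count += 1  # b = n // a is determined by a
    return count


print(count_unit_squares(16))  # 10
```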
last_action, _ = state.history[-1]
result_state: TaskState = env.step(last_action.message)
Not sure whether we need to check isinstance(act, MessageAction) and act.source == 'agent' here.
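A minimal sketch of what such a check could look like, assuming the history is a list of (action, observation) pairs as in the snippet above. The two dataclasses are stand-ins written for this example, not the project's real classes, and last_agent_message is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class MessageAction:
    """Stand-in for the real MessageAction (assumed fields: content, source)."""
    content: str
    source: str = 'agent'


@dataclass
class OtherAction:
    """Stand-in for any non-message action that may appear in the history."""
    name: str
    source: str = 'agent'


def last_agent_message(history: List[Tuple[Any, Any]]) -> Optional[str]:
    """Return the content of the most recent agent-authored message, if any."""
    for act, _obs in reversed(history):
        if isinstance(act, MessageAction) and act.source == 'agent':
            return act.content
    return None


history = [
    (MessageAction('Task: ...', source='user'), None),
    (MessageAction('<solution> 10 </solution>', source='agent'), None),
    (OtherAction('noop'), None),
]
print(last_agent_message(history))
```

Walking the history backwards means a trailing non-message action or a user-sourced message does not get passed to the environment by mistake.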
evaluation/mint/env.py
class SimplifiedEnv:
    INVALID_INPUT_MESSAGE = (
        "I don't understand your input. \n"
        'If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\n'
I think you can replace these <execute> tags with the <execute_ipython> tag used by CodeActAgent.
Then you can add an assertion on the agent: the agent needs to be CodeActAgent to run MINT. We can add support for other agents when needed.
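Roughly, the suggestion could look like the sketch below. The message text is copied from the log above with only the tag name swapped. REQUIRED_AGENT and check_agent are hypothetical names for illustration, and comparing the agent by its name string is an assumption about how the check would be wired in.

```python
REQUIRED_AGENT = 'CodeActAgent'  # hypothetical constant for this sketch

INVALID_INPUT_MESSAGE = (
    "I don't understand your input. \n"
    'If you want to execute code, please use '
    '<execute_ipython> YOUR_CODE_HERE </execute_ipython>.\n'
    'If you want to give me an answer, please use '
    '<solution> YOUR_SOLUTION_HERE </solution>.\n'
    'For example: The answer to the question is <solution> 42 </solution>. \n'
)


def check_agent(agent_name: str) -> None:
    """Fail early if MINT is run with an agent that uses different tags."""
    assert agent_name == REQUIRED_AGENT, (
        f'MINT currently only supports {REQUIRED_AGENT}, got {agent_name}'
    )


check_agent('CodeActAgent')  # passes silently
```

This keeps the fallback message consistent with the system prompt in the log, which already tells the agent to use the <execute_ipython> tag.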
Nice idea. I've changed it! Also, I hardcoded CodeActAgent for the AGENT option inside run_infer.sh.
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
I tested locally with the first 2 examples and both pass now. Can you try again? @yufansong
OK, this time it works locally for me. It outputs some solutions.
LGTM
Some minor things need to be fixed before merging.
evaluation/mint/run_infer.sh
#!/bin/bash

SUBSET=$1
EVAL_LIMIT=$2
Can you add default values for SUBSET and EVAL_LIMIT? You can refer to the run_infer.sh for SWE-bench.
I've fixed it. Can you take a look and see if it is what you're expecting?
e12f8f8
to
dce15b5
Compare
This PR provides a draft evaluation integration for the MINT benchmark, which tests the agent's ability to solve tasks through multi-turn interactions. The benchmark covers code generation, decision-making, and reasoning. I'm working on the MATH and GSM8K subsets. The original repo is here.
The current draft is preliminary, and the integration is not finished yet.