Support MINT benchmark (MATH, GSM8K subset) #1955

ryanhoangt · 2024-05-21T18:07:52Z

This PR adds a draft evaluation integration for the MINT benchmark, which tests an agent's ability to solve tasks through multi-turn interaction, covering code generation, decision-making, and reasoning. I'm currently working on the MATH and GSM8K subsets.

The original repo is here.

The current draft is preliminary, and the integration process is not done yet.

evaluation/mint/run_infer.py

ryanhoangt · 2024-05-24T16:02:45Z

The evaluation for the MATH subset can now be run using bash ./evaluation/mint/run_infer.sh.

The result looks like this:

{
  "id": 0,
  "instance": {
    "task_name": "reasoning",
    "task_id": 0,
    "prompt": "What is the area of the region in the $xy-$plane that satisfies \\[\\lfloor x \\rfloor \\lfloor y \\rfloor = 16\\, ?\\]",
    "reference": "10.0",
    "metadata": {}
  },
  "instruction": "You are a helpful assistant assigned with the task of problem-solving.\nTo solve the task, you can only interact with the interactive Python (Jupyter Notebook) environment using <execute_ipython> tag. Other tools cannot be used.\nAt each turn, you should first provide your step-by-step thinking for solving the task. Your thought process should be enclosed using \"<thought>\" tag, for example: <thought> I need to print \"Hello World!\" </thought>.\n\nAfter that, you have two options:\n1) Interact with a Python programming environment and receive the corresponding output.\n2) Directly provide a solution by sending your answer to user through message that adheres to the required format for the given task. Your solution should be enclosed using \"<solution>\" tag, for example: The answer is <solution> A </solution>.\nEither you choose to interact with the Python environment or provide a solution, you need to send a message to the user to evaluate your response and provide feedback.\n\nYou have 5 chances to interact with the environment or propose a solution. You can only propose a solution 2 times.\n\n---\n\nTask:\nEvaluate $\\int\\int_C (x+y)^2 dx,dy$, where C is the circle defined by $x^2 + y^2 = M$, and M is the birth year of Li Bai.\nOutput format requirement: Round to three decimal places, using half up rounding rule.\n\nAssistant:\n<thought> To solve this problem, let's start by determining Li Bai's birth year. 
</thought>\n<execute> print(wikipedia_search('Li Bai')) </execute>\n\nObservation:\nPage: Li Bai\nSummary: Li Bai (Chinese: \u674e\u767d; pinyin: L\u01d0 B\u00e1i, 701\u2013762), also pronounced as Li Bo, courtesy name Taibai (Chinese: \u592a\u767d), was a Chinese poet, acclaimed from his own time to the present as a brilliant and romantic figure who took traditional poetic forms to new heights.\n\nAssistant:\n<thought>\nFrom the wiki_search result, we can see that Li Bai was born in 701, thus M = 701.\nTo evaluate the double integral of $(x+y)^2$ over the region $C$, where $C$ is the circle defined by $x^2 + y^2 = 701$, we should use polar coordinates.\nIn polar coordinates, $x = r\\cos(\\theta)$ and $y = r\\sin(\\theta)$, and the differential area element $dx,dy$ becomes $r,dr,d\\theta$.\n$\\int\\int_C (x+y)^2 dx,dy = \\int_0^{2\\pi}\\int_0^{\\sqrt{701}} (r\\cos(\\theta) + r\\sin(\\theta))^2r,dr,d\\theta$\nLet's use the SymPy package to calculate this.\n</thought>\n<execute>\nfrom sympy import symbols, integrate, cos, sin, pi\n\nr, theta = symbols('r theta')\n\n# the integrand\nintegrand = (r*cos(theta) + r*sin(theta))**2 * r\n\n# perform the double integration\ndouble_integral = integrate(integrate(integrand, (r, 0, 701 ** 0.5)), (theta, 0, 2*pi))\nprint(double_integral)\n</execute>\n\nObservation:\nOut[2]: 245700.5*pi\n\nAssistant:\n<thought> Let's round this result to three decimal places using half up rounding rule. </thought>\n<execute>\nimport math\nprint(round(245700.5*math.pi, 3))\n</execute>\n\nObservation:\n771890.886\n\nAssistant:\nThe answer is <solution> 771890.886 </solution>.\n\n\n---\n\n# Problem statement:\nTask:\nWhat is the area of the region in the $xy-$plane that satisfies \\[\\lfloor x \\rfloor \\lfloor y \\rfloor = 16\\, ?\\]\nIMPORTANT: You should ONLY interact with the environment provided to you or provide the solution inside <solution> tag AND NEVER ASK FOR HUMAN HELP.\n",
  "metadata": {
    "agent_class": "CodeActAgent",
    "model_name": "gpt-4-1106-preview",
    "max_iterations": 5,
    "max_propose_solution": 2,
    "eval_output_dir": "evaluation/evaluation_outputs/outputs/mint/CodeActAgent/gpt-4-1106-preview_maxiter_5",
    "start_time": "2024-05-24 22:45:09",
    "git_commit": "6aaae4ce1797bee7f1e76aa399e390ffa1442050"
  },
  "history": [...],
  "error": "Agent reached maximum number of iterations",
  "test_result": false
}
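Records like the one above can be aggregated to get a quick pass rate for a run. The following is a minimal sketch, assuming the records are available as one JSON string per instance; the helper name `summarize` and the storage layout are assumptions for illustration, and only the `test_result` and `error` fields are taken from the sample output.

```python
import json
from collections import Counter


def summarize(lines):
    """Aggregate MINT-style result records given as JSON strings, one per record."""
    total = 0
    passed = 0
    errors = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        record = json.loads(line)
        total += 1
        if record.get("test_result"):
            passed += 1
        if record.get("error"):
            errors[record["error"]] += 1
    rate = passed / total if total else 0.0
    return {
        "total": total,
        "passed": passed,
        "success_rate": rate,
        "errors": dict(errors),
    }


if __name__ == "__main__":
    # Hypothetical records shaped like the sample above.
    sample = [
        json.dumps({"id": 0, "test_result": False,
                    "error": "Agent reached maximum number of iterations"}),
        json.dumps({"id": 1, "test_result": True, "error": None}),
    ]
    print(summarize(sample))
```

Counting the `error` strings separately makes it easier to tell wrong answers apart from runs that simply exhausted the iteration budget.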

Pending improvements:

Integrate other subsets similarly.
Prompt tuning to maximize performance.
Robust error handling.

@xingyaoww can you help me review it?

xingyaoww

Great progress! I'll try to run it tomorrow and verify it; hopefully we can merge it soon.

evaluation/mint/task.py

evaluation/mint/run_infer.sh

opendevin/core/main.py

evaluation/mint/README.md

yufansong

I ran it locally and printed the GPT response and the fake user reply. The agent response contains <execute> ... </execute>, but the user response always sends back "I don't understand your input." Is this expected?
Here is the log:

2024-05-25 01:44:56,537 - INFO - MessageAction(content='<thought>\nTo find the area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$, we need to consider the floor function.\nThe floor function $\\lfloor x \\rfloor$ gives the largest integer less than or equal to $x$.\nSo, we need to find all pairs of integers $x$ and $y$ such that their product is equal to $16$.\n</thought>\n<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:44:56,537 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:44:56,537 - INFO - Gold reference: 10.0
2024-05-25 01:44:56,537 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:44:56,538 - INFO - gpt msg:<thought>
To find the area of the region in the $xy-$plane that satisfies $\lfloor x \rfloor \lfloor y \rfloor = 16$, we need to consider the floor function.
The floor function $\lfloor x \rfloor$ gives the largest integer less than or equal to $x$.
So, we need to find all pairs of integers $x$ and $y$ such that their product is equal to $16$.
</thought>
<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:44:56,538 - INFO - User response:Observation:
I don't understand your input. 
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>. 

You have 4 steps left and 1 chances to propose solution left.

2024-05-25 01:44:56,538 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:44:56,639 - INFO - STEP 1
2024-05-25 01:44:59,822 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.00 USD
2024-05-25 01:44:59,822 - INFO - MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:44:59,823 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:44:59,823 - INFO - Gold reference: 10.0
2024-05-25 01:44:59,823 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:44:59,823 - INFO - gpt msg:<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:44:59,823 - INFO - User response:Observation:
I don't understand your input. 
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>. 

You have 3 steps left and 1 chances to propose solution left.

2024-05-25 01:44:59,823 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:44:59,925 - INFO - STEP 2
2024-05-25 01:45:02,313 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.01 USD
2024-05-25 01:45:02,313 - INFO - MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:45:02,313 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:45:02,313 - INFO - Gold reference: 10.0
2024-05-25 01:45:02,313 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:45:02,314 - INFO - gpt msg:<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:45:02,314 - INFO - User response:Observation:
I don't understand your input. 
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>. 

You have 2 steps left and 1 chances to propose solution left.

2024-05-25 01:45:02,314 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:45:02,415 - INFO - STEP 3
2024-05-25 01:45:04,731 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.01 USD
2024-05-25 01:45:04,731 - INFO - MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:45:04,732 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:45:04,732 - INFO - Gold reference: 10.0
2024-05-25 01:45:04,732 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:45:04,732 - INFO - gpt msg:<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:45:04,732 - INFO - User response:Observation:
I don't understand your input. 
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>. 

You have 1 steps left and 1 chances to propose solution left.
You should take the last step to propose a solution.

2024-05-25 01:45:04,732 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:45:04,834 - INFO - STEP 4
2024-05-25 01:45:07,095 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.01 USD
2024-05-25 01:45:07,095 - INFO - MessageAction(content='<solution> The area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$ is 9 square units. </solution>', wait_for_response=True, action='message')
2024-05-25 01:45:07,095 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:45:07,095 - INFO - Gold reference: 10.0
2024-05-25 01:45:07,095 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:45:07,096 - INFO - gpt msg:<solution> The area of the region in the $xy-$plane that satisfies $\lfloor x \rfloor \lfloor y \rfloor = 16$ is 9 square units. </solution>
2024-05-25 01:45:07,096 - INFO - User response:Observation:
Your answer is wrong.
You have 0 steps left and 1 chances to propose solution left.
You should take the last step to propose a solution.

2024-05-25 01:45:07,096 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:45:07,197 - INFO - STEP 5
2024-05-25 01:45:07,197 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.ERROR
2024-05-25 01:45:08,098 - INFO - Setting agent(CodeActAgent) state from AgentState.ERROR to AgentState.STOPPED
2024-05-25 01:45:09,047 - INFO - AgentController task was cancelled
2024-05-25 01:45:09,049 - INFO - Msgs: [(MessageAction(content='You are a helpful assistant assigned with the task of problem-solving.\nTo solve the task, you can only interact with the interactive Python (Jupyter Notebook) environment using <execute_ipython> tag. Other tools cannot be used.\nAt each turn, you should first provide your step-by-step thinking for solving the task. Your thought process should be enclosed using "<thought>" tag, for example: <thought> I need to print "Hello World!" </thought>.\n\nAfter that, you have two options:\n1) Interact with a Python programming environment and receive the corresponding output.\n2) Directly provide a solution by sending your answer to user through message that adheres to the required format for the given task. Your solution should be enclosed using "<solution>" tag, for example: The answer is <solution> A </solution>.\nEither you choose to interact with the Python environment or provide a solution, you need to send a message to the user to evaluate your response and provide feedback.\n\nYou have 5 chances to interact with the environment or propose a solution. You can only propose a solution 2 times.\n\n---\n\nTask:\nEvaluate $\\int\\int_C (x+y)^2 dx,dy$, where C is the circle defined by $x^2 + y^2 = M$, and M is the birth year of Li Bai.\nOutput format requirement: Round to three decimal places, using half up rounding rule.\n\nAssistant:\n<thought> To solve this problem, let\'s start by determining Li Bai\'s birth year. 
</thought>\n<execute> print(wikipedia_search(\'Li Bai\')) </execute>\n\nObservation:\nPage: Li Bai\nSummary: Li Bai (Chinese: 李白; pinyin: Lǐ Bái, 701–762), also pronounced as Li Bo, courtesy name Taibai (Chinese: 太白), was a Chinese poet, acclaimed from his own time to the present as a brilliant and romantic figure who took traditional poetic forms to new heights.\n\nAssistant:\n<thought>\nFrom the wiki_search result, we can see that Li Bai was born in 701, thus M = 701.\nTo evaluate the double integral of $(x+y)^2$ over the region $C$, where $C$ is the circle defined by $x^2 + y^2 = 701$, we should use polar coordinates.\nIn polar coordinates, $x = r\\cos(\\theta)$ and $y = r\\sin(\\theta)$, and the differential area element $dx,dy$ becomes $r,dr,d\\theta$.\n$\\int\\int_C (x+y)^2 dx,dy = \\int_0^{2\\pi}\\int_0^{\\sqrt{701}} (r\\cos(\\theta) + r\\sin(\\theta))^2r,dr,d\\theta$\nLet\'s use the SymPy package to calculate this.\n</thought>\n<execute>\nfrom sympy import symbols, integrate, cos, sin, pi\n\nr, theta = symbols(\'r theta\')\n\n# the integrand\nintegrand = (r*cos(theta) + r*sin(theta))**2 * r\n\n# perform the double integration\ndouble_integral = integrate(integrate(integrand, (r, 0, 701 ** 0.5)), (theta, 0, 2*pi))\nprint(double_integral)\n</execute>\n\nObservation:\nOut[2]: 245700.5*pi\n\nAssistant:\n<thought> Let\'s round this result to three decimal places using half up rounding rule. 
</thought>\n<execute>\nimport math\nprint(round(245700.5*math.pi, 3))\n</execute>\n\nObservation:\n771890.886\n\nAssistant:\nThe answer is <solution> 771890.886 </solution>.\n\n\n---\n\n# Problem statement:\nTask:\nWhat is the area of the region in the $xy-$plane that satisfies \\[\\lfloor x \\rfloor \\lfloor y \\rfloor = 16\\, ?\\]\nIMPORTANT: You should ONLY interact with the environment provided to you or provide the solution inside <solution> tag AND NEVER ASK FOR HUMAN HELP.\n', wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<thought>\nTo find the area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$, we need to consider the floor function.\nThe floor function $\\lfloor x \\rfloor$ gives the largest integer less than or equal to $x$.\nSo, we need to find all pairs of integers $x$ and $y$ such that their product is equal to $16$.\n</thought>\n<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. 
\n\nYou have 4 steps left and 1 chances to propose solution left.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. \n\nYou have 3 steps left and 1 chances to propose solution left.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. 
\n\nYou have 2 steps left and 1 chances to propose solution left.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. \n\nYou have 1 steps left and 1 chances to propose solution left.\nYou should take the last step to propose a solution.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<solution> The area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$ is 9 square units. </solution>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='Observation:\nYour answer is wrong.\nYou have 0 steps left and 1 chances to propose solution left.\nYou should take the last step to propose a solution.\n', wait_for_response=False, action='message'), NullObservation(content='', observation='null'))]
2024-05-25 01:45:09,049 - INFO - Task state: {'finished': False, 'success': False, 'agent_action_count': {'propose_solution': 1, 'use_tool': 0, 'invalid_action': 0}, 'terminate_reason': None, 'latest_output': {'observation': None, 'success': False, 'content': 'Observation:\nYour answer is wrong.\nYou have 0 steps left and 1 chances to propose solution left.\nYou should take the last step to propose a solution.\n'}}
2024-05-25 01:45:09,049 - INFO - Final message:  | Ground truth: 10.0
2024-05-25 01:45:09,327 - INFO - BrowserEnv already closed, no need to close again

opendevin/core/main.py

evaluation/mint/run_infer.sh

evaluation/mint/task.py

evaluation/mint/run_infer.py

yufansong · 2024-05-24T17:34:48Z

evaluation/mint/run_infer.py

+    last_action, _ = state.history[-1]
+    result_state: TaskState = env.step(last_action.message)


Not sure whether we need to check

isinstance(act, MessageAction) and act.source == 'agent'
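The check suggested here could be applied by walking the history backwards until an agent-authored message is found, instead of assuming the last entry is one. This is only a sketch: `MessageAction` and `NullObservation` below are reduced stand-ins for the real OpenDevin classes, the `content` and `source` attribute names are assumptions based on the log output and the snippet above, and `last_agent_message` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MessageAction:
    """Stand-in for OpenDevin's MessageAction (hypothetical, reduced to two fields)."""
    content: str
    source: str = "agent"  # assumed values: "agent" or "user"


@dataclass
class NullObservation:
    """Stand-in for the observation half of a history entry."""
    content: str = ""


def last_agent_message(history: List[Tuple[object, object]]) -> Optional[str]:
    """Return the content of the most recent agent-authored message, or None."""
    for action, _observation in reversed(history):
        if isinstance(action, MessageAction) and action.source == "agent":
            return action.content
    return None


if __name__ == "__main__":
    history = [
        (MessageAction("Task: ...", source="user"), NullObservation()),
        (MessageAction("<solution> 10 </solution>"), NullObservation()),
    ]
    print(last_agent_message(history))
```

Returning `None` when no agent message exists lets the caller decide whether to skip the step or report an error.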

xingyaoww · 2024-05-25T12:15:19Z

evaluation/mint/env.py

+class SimplifiedEnv:
+    INVALID_INPUT_MESSAGE = (
+        "I don't understand your input. \n"
+        'If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\n'


I think you can replace these "<execute>" tags with the "<execute_ipython>" tag used by CodeActAgent.

Then you can add an assertion for the agent: the agent needs to be CodeActAgent to run MINT. We can add support for other agents when needed.

Nice idea, I've changed it! I also hardcoded CodeActAgent as the AGENT option inside run_infer.sh.
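To illustrate the change discussed above, here is a rough sketch of how an environment might classify an agent message by its tags and guard against unsupported agents. It is not the actual SimplifiedEnv code: the regular expressions and the helper names `parse_agent_message` and `check_agent` are assumptions, while the tag names and the action labels (`use_tool`, `propose_solution`, `invalid_action`) are taken from the thread and the logged task state.

```python
import re
from typing import Tuple

# Tag names come from the discussion above; the patterns themselves are assumptions.
EXECUTE_RE = re.compile(r"<execute_ipython>(.*?)</execute_ipython>", re.DOTALL)
SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def parse_agent_message(message: str) -> Tuple[str, str]:
    """Classify a message and return (action_kind, payload)."""
    solution = SOLUTION_RE.search(message)
    if solution:
        return "propose_solution", solution.group(1).strip()
    code = EXECUTE_RE.search(message)
    if code:
        return "use_tool", code.group(1).strip()
    # Neither tag was found, so the environment would reply with its
    # "I don't understand your input" message.
    return "invalid_action", ""


def check_agent(agent_class: str) -> None:
    """Fail fast when MINT is run with an agent other than CodeActAgent."""
    assert agent_class == "CodeActAgent", "MINT currently requires CodeActAgent"


if __name__ == "__main__":
    check_agent("CodeActAgent")
    print(parse_agent_message("<execute_ipython>\nprint(1)\n</execute_ipython>"))
    print(parse_agent_message("The answer is <solution> 10 </solution>."))
```

With this shape, a message that still uses the old "<execute>" tag falls through to `invalid_action`, which is why the prompt, the in-context examples, and the environment need to agree on one tag name.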

evaluation/mint/in_context_examples/reasoning/with_tool.txt

ryanhoangt · 2024-05-25T13:07:51Z

I tested locally with the first 2 examples and both pass now. Can you try again? @yufansong

yufansong · 2024-05-25T14:36:38Z

> I tested locally with the first 2 examples and both pass now. Can you try again? @yufansong

OK, this time it works locally. It outputs some solutions.

evaluation/mint/requirements.txt

yufansong

LGTM

evaluation/mint/run_infer.py

xingyaoww

Some minor things need to be fixed before merging.

evaluation/mint/run_infer.sh

evaluation/mint/README.md

xingyaoww · 2024-05-27T05:54:24Z

evaluation/mint/run_infer.sh

+#!/bin/bash
+
+SUBSET=$1
+EVAL_LIMIT=$2


Can you add default values for SUBSET and EVAL_LIMIT? You can refer to the run_infer.sh for SWE-bench.

I've fixed it. Can you take a look to see if it's what you were expecting?

setup boilerplate and README

14a84e9

ryanhoangt marked this pull request as draft May 21, 2024 18:08

setup test script and load dataset

a76c36e

xingyaoww mentioned this pull request May 22, 2024

Add: a mechanism for tracking contributions to the paper #1917

Closed

xingyaoww added the evaluation label May 22, 2024

ryanhoangt added 3 commits May 22, 2024 17:09

add temp intg that works

34ab232

refactor code

9a904c7

add solution evaluation through 'fake_user_response_fn'

6aaae4c

ryanhoangt commented May 23, 2024

evaluation/mint/run_infer.py (resolved)

neubig assigned ryanhoangt May 23, 2024

finish integrating MATH subset

891321f

ryanhoangt marked this pull request as ready for review May 24, 2024 16:06

xingyaoww reviewed May 24, 2024

evaluation/mint/task.py (outdated, resolved)

evaluation/mint/run_infer.sh (outdated, resolved)

opendevin/core/main.py (outdated, resolved)

evaluation/mint/README.md (outdated, resolved)

yufansong reviewed May 24, 2024

yufansong added 3 commits May 25, 2024 01:52

Update evaluation/mint/run_infer.py

4a9ddf6

Update evaluation/mint/run_infer.sh

b11785d

Update opendevin/core/main.py

cc125dd

xingyaoww reviewed May 25, 2024

ryanhoangt and others added 7 commits May 25, 2024 19:18

remove redudant templates, add eval_note, update README

b7376f3

use <execute_ipython> tag instead of <execute>

b5942bc

hardcode AGENT option for run_infer.sh

7c08ebd

Update evaluation/mint/task.py

7d1fa1b

Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>

fix: bug no message returned when task's success

7a9f05f

change message to make the agent exit

b0920f3

Merge branch 'eval-MINT-math' of github.com:ryanhoangt/OpenDevin into…

fe9cdcf

… eval-MINT-math

import bash abstractmethod

9e12ccf

install all required packages inside sandbox before the agent runs, a…

10c7e04

…djust prompt

ryanhoangt added 2 commits May 26, 2024 11:59

add subset eval folder separation and test for gsm8k

afe1c70

fix bug in Reasoning task result check, add requirements.txt

cc80cf6

li-boxuan reviewed May 26, 2024

evaluation/mint/requirements.txt (resolved)

yufansong approved these changes May 26, 2024

li-boxuan reviewed May 26, 2024

evaluation/mint/run_infer.py (outdated, resolved)

li-boxuan added 2 commits May 26, 2024 09:57

Fix syntax error in evaluation/mint/run_infer.py

7616ede

Merge branch 'main' into eval-MINT-math

e1bdbcc

xingyaoww reviewed May 27, 2024

update README, add default values for SUBSET and EVAL_LIMIT

dce15b5

ryanhoangt force-pushed the eval-MINT-math branch from e12f8f8 to dce15b5 May 27, 2024 10:20

yufansong enabled auto-merge (squash) May 28, 2024 07:37

yufansong merged commit 9434bcc into OpenDevin:main May 28, 2024
18 checks passed
