Support MINT benchmark (MATH, GSM8K subset) #1955

Merged: 23 commits merged on May 28, 2024

ryanhoangt
Contributor

This PR provides a draft evaluation integration for the MINT benchmark, which tests an agent's ability to solve tasks through multi-turn interaction, covering code generation, decision-making, and reasoning. I'm working on the MATH and GSM8K subsets.

The original repo is here.

The current draft is preliminary, and the integration process is not done yet.

@ryanhoangt ryanhoangt marked this pull request as draft May 21, 2024 18:08
@ryanhoangt
Contributor Author

The evaluation for the MATH subset can now be run using bash ./evaluation/mint/run_infer.sh.

The result looks like this:

{
  "id": 0,
  "instance": {
    "task_name": "reasoning",
    "task_id": 0,
    "prompt": "What is the area of the region in the $xy-$plane that satisfies \\[\\lfloor x \\rfloor \\lfloor y \\rfloor = 16\\, ?\\]",
    "reference": "10.0",
    "metadata": {}
  },
  "instruction": "You are a helpful assistant assigned with the task of problem-solving.\nTo solve the task, you can only interact with the interactive Python (Jupyter Notebook) environment using <execute_ipython> tag. Other tools cannot be used.\nAt each turn, you should first provide your step-by-step thinking for solving the task. Your thought process should be enclosed using \"<thought>\" tag, for example: <thought> I need to print \"Hello World!\" </thought>.\n\nAfter that, you have two options:\n1) Interact with a Python programming environment and receive the corresponding output.\n2) Directly provide a solution by sending your answer to user through message that adheres to the required format for the given task. Your solution should be enclosed using \"<solution>\" tag, for example: The answer is <solution> A </solution>.\nEither you choose to interact with the Python environment or provide a solution, you need to send a message to the user to evaluate your response and provide feedback.\n\nYou have 5 chances to interact with the environment or propose a solution. You can only propose a solution 2 times.\n\n---\n\nTask:\nEvaluate $\\int\\int_C (x+y)^2 dx,dy$, where C is the circle defined by $x^2 + y^2 = M$, and M is the birth year of Li Bai.\nOutput format requirement: Round to three decimal places, using half up rounding rule.\n\nAssistant:\n<thought> To solve this problem, let's start by determining Li Bai's birth year. 
</thought>\n<execute> print(wikipedia_search('Li Bai')) </execute>\n\nObservation:\nPage: Li Bai\nSummary: Li Bai (Chinese: \u674e\u767d; pinyin: L\u01d0 B\u00e1i, 701\u2013762), also pronounced as Li Bo, courtesy name Taibai (Chinese: \u592a\u767d), was a Chinese poet, acclaimed from his own time to the present as a brilliant and romantic figure who took traditional poetic forms to new heights.\n\nAssistant:\n<thought>\nFrom the wiki_search result, we can see that Li Bai was born in 701, thus M = 701.\nTo evaluate the double integral of $(x+y)^2$ over the region $C$, where $C$ is the circle defined by $x^2 + y^2 = 701$, we should use polar coordinates.\nIn polar coordinates, $x = r\\cos(\\theta)$ and $y = r\\sin(\\theta)$, and the differential area element $dx,dy$ becomes $r,dr,d\\theta$.\n$\\int\\int_C (x+y)^2 dx,dy = \\int_0^{2\\pi}\\int_0^{\\sqrt{701}} (r\\cos(\\theta) + r\\sin(\\theta))^2r,dr,d\\theta$\nLet's use the SymPy package to calculate this.\n</thought>\n<execute>\nfrom sympy import symbols, integrate, cos, sin, pi\n\nr, theta = symbols('r theta')\n\n# the integrand\nintegrand = (r*cos(theta) + r*sin(theta))**2 * r\n\n# perform the double integration\ndouble_integral = integrate(integrate(integrand, (r, 0, 701 ** 0.5)), (theta, 0, 2*pi))\nprint(double_integral)\n</execute>\n\nObservation:\nOut[2]: 245700.5*pi\n\nAssistant:\n<thought> Let's round this result to three decimal places using half up rounding rule. </thought>\n<execute>\nimport math\nprint(round(245700.5*math.pi, 3))\n</execute>\n\nObservation:\n771890.886\n\nAssistant:\nThe answer is <solution> 771890.886 </solution>.\n\n\n---\n\n# Problem statement:\nTask:\nWhat is the area of the region in the $xy-$plane that satisfies \\[\\lfloor x \\rfloor \\lfloor y \\rfloor = 16\\, ?\\]\nIMPORTANT: You should ONLY interact with the environment provided to you or provide the solution inside <solution> tag AND NEVER ASK FOR HUMAN HELP.\n",
  "metadata": {
    "agent_class": "CodeActAgent",
    "model_name": "gpt-4-1106-preview",
    "max_iterations": 5,
    "max_propose_solution": 2,
    "eval_output_dir": "evaluation/evaluation_outputs/outputs/mint/CodeActAgent/gpt-4-1106-preview_maxiter_5",
    "start_time": "2024-05-24 22:45:09",
    "git_commit": "6aaae4ce1797bee7f1e76aa399e390ffa1442050"
  },
  "history": [...],
  "error": "Agent reached maximum number of iterations",
  "test_result": false
}
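
As a sanity check on the reference value of 10.0 in the record above, here is a minimal sketch that is not part of the PR. The helper name lattice_pairs is hypothetical. It relies on the observation that floor(x) * floor(y) = 16 holds exactly on the unit squares [m, m+1) x [n, n+1) where m and n are integers with m * n = 16, so the area equals the number of such pairs, negative ones included.

```python
def lattice_pairs(product: int) -> list[tuple[int, int]]:
    """Return all integer pairs (m, n) with m * n == product, including negatives."""
    pairs = []
    for m in range(-abs(product), abs(product) + 1):
        # Skip zero (no partner exists) and non-divisors.
        if m != 0 and product % m == 0:
            pairs.append((m, product // m))
    return pairs


pairs = lattice_pairs(16)
# Each pair (m, n) corresponds to one unit square [m, m+1) x [n, n+1) of area 1.
area = float(len(pairs))
print(pairs)
print(area)
```

Restricting m to positive values would find only five of these pairs, which is why the negative divisors matter for matching the reference.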

Pending improvements:

  1. Integrate other subsets similarly.
  2. Prompt tuning to maximize performance.
  3. Robust error handling.

@xingyaoww can you help me review it?

@ryanhoangt ryanhoangt marked this pull request as ready for review May 24, 2024 16:06
Contributor

@xingyaoww xingyaoww left a comment

Great progress! I'll try to run and verify it tomorrow; hopefully we can merge it soon.

evaluation/mint/task.py (outdated, resolved)
evaluation/mint/run_infer.sh (outdated, resolved)
opendevin/core/main.py (outdated, resolved)
evaluation/mint/README.md (outdated, resolved)
Collaborator

@yufansong yufansong left a comment

I ran it locally and printed the GPT response and the fake user reply. The agent response contains <execute> ... </execute>, but the user response always sends back "I don't understand your input." Is this expected?
The log follows:

2024-05-25 01:44:56,537 - INFO - MessageAction(content='<thought>\nTo find the area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$, we need to consider the floor function.\nThe floor function $\\lfloor x \\rfloor$ gives the largest integer less than or equal to $x$.\nSo, we need to find all pairs of integers $x$ and $y$ such that their product is equal to $16$.\n</thought>\n<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:44:56,537 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:44:56,537 - INFO - Gold reference: 10.0
2024-05-25 01:44:56,537 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:44:56,538 - INFO - gpt msg:<thought>
To find the area of the region in the $xy-$plane that satisfies $\lfloor x \rfloor \lfloor y \rfloor = 16$, we need to consider the floor function.
The floor function $\lfloor x \rfloor$ gives the largest integer less than or equal to $x$.
So, we need to find all pairs of integers $x$ and $y$ such that their product is equal to $16$.
</thought>
<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:44:56,538 - INFO - User response:Observation:
I don't understand your input. 
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>. 

You have 4 steps left and 1 chances to propose solution left.

2024-05-25 01:44:56,538 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:44:56,639 - INFO - STEP 1
2024-05-25 01:44:59,822 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.00 USD
2024-05-25 01:44:59,822 - INFO - MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:44:59,823 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:44:59,823 - INFO - Gold reference: 10.0
2024-05-25 01:44:59,823 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:44:59,823 - INFO - gpt msg:<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:44:59,823 - INFO - User response:Observation:
I don't understand your input. 
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>. 

You have 3 steps left and 1 chances to propose solution left.

2024-05-25 01:44:59,823 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:44:59,925 - INFO - STEP 2
2024-05-25 01:45:02,313 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.01 USD
2024-05-25 01:45:02,313 - INFO - MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:45:02,313 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:45:02,313 - INFO - Gold reference: 10.0
2024-05-25 01:45:02,313 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:45:02,314 - INFO - gpt msg:<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:45:02,314 - INFO - User response:Observation:
I don't understand your input. 
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>. 

You have 2 steps left and 1 chances to propose solution left.

2024-05-25 01:45:02,314 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:45:02,415 - INFO - STEP 3
2024-05-25 01:45:04,731 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.01 USD
2024-05-25 01:45:04,731 - INFO - MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message')
2024-05-25 01:45:04,732 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:45:04,732 - INFO - Gold reference: 10.0
2024-05-25 01:45:04,732 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:45:04,732 - INFO - gpt msg:<execute>
# Finding all pairs of integers x and y such that their product is equal to 16
pairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]
pairs
</execute>
2024-05-25 01:45:04,732 - INFO - User response:Observation:
I don't understand your input. 
If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.
If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.
For example: The answer to the question is <solution> 42 </solution>. 

You have 1 steps left and 1 chances to propose solution left.
You should take the last step to propose a solution.

2024-05-25 01:45:04,732 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:45:04,834 - INFO - STEP 4
2024-05-25 01:45:07,095 - INFO - Cost: 0.00 USD | Accumulated Cost: 0.01 USD
2024-05-25 01:45:07,095 - INFO - MessageAction(content='<solution> The area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$ is 9 square units. </solution>', wait_for_response=True, action='message')
2024-05-25 01:45:07,095 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
2024-05-25 01:45:07,095 - INFO - Gold reference: 10.0
2024-05-25 01:45:07,095 - INFO - Task config: {'max_iterations': 5, 'max_propose_solution': 2}
2024-05-25 01:45:07,096 - INFO - gpt msg:<solution> The area of the region in the $xy-$plane that satisfies $\lfloor x \rfloor \lfloor y \rfloor = 16$ is 9 square units. </solution>
2024-05-25 01:45:07,096 - INFO - User response:Observation:
Your answer is wrong.
You have 0 steps left and 1 chances to propose solution left.
You should take the last step to propose a solution.

2024-05-25 01:45:07,096 - INFO - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
2024-05-25 01:45:07,197 - INFO - STEP 5
2024-05-25 01:45:07,197 - INFO - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.ERROR
2024-05-25 01:45:08,098 - INFO - Setting agent(CodeActAgent) state from AgentState.ERROR to AgentState.STOPPED
2024-05-25 01:45:09,047 - INFO - AgentController task was cancelled
2024-05-25 01:45:09,049 - INFO - Msgs: [(MessageAction(content='You are a helpful assistant assigned with the task of problem-solving.\nTo solve the task, you can only interact with the interactive Python (Jupyter Notebook) environment using <execute_ipython> tag. Other tools cannot be used.\nAt each turn, you should first provide your step-by-step thinking for solving the task. Your thought process should be enclosed using "<thought>" tag, for example: <thought> I need to print "Hello World!" </thought>.\n\nAfter that, you have two options:\n1) Interact with a Python programming environment and receive the corresponding output.\n2) Directly provide a solution by sending your answer to user through message that adheres to the required format for the given task. Your solution should be enclosed using "<solution>" tag, for example: The answer is <solution> A </solution>.\nEither you choose to interact with the Python environment or provide a solution, you need to send a message to the user to evaluate your response and provide feedback.\n\nYou have 5 chances to interact with the environment or propose a solution. You can only propose a solution 2 times.\n\n---\n\nTask:\nEvaluate $\\int\\int_C (x+y)^2 dx,dy$, where C is the circle defined by $x^2 + y^2 = M$, and M is the birth year of Li Bai.\nOutput format requirement: Round to three decimal places, using half up rounding rule.\n\nAssistant:\n<thought> To solve this problem, let\'s start by determining Li Bai\'s birth year. 
</thought>\n<execute> print(wikipedia_search(\'Li Bai\')) </execute>\n\nObservation:\nPage: Li Bai\nSummary: Li Bai (Chinese: 李白; pinyin: Lǐ Bái, 701–762), also pronounced as Li Bo, courtesy name Taibai (Chinese: 太白), was a Chinese poet, acclaimed from his own time to the present as a brilliant and romantic figure who took traditional poetic forms to new heights.\n\nAssistant:\n<thought>\nFrom the wiki_search result, we can see that Li Bai was born in 701, thus M = 701.\nTo evaluate the double integral of $(x+y)^2$ over the region $C$, where $C$ is the circle defined by $x^2 + y^2 = 701$, we should use polar coordinates.\nIn polar coordinates, $x = r\\cos(\\theta)$ and $y = r\\sin(\\theta)$, and the differential area element $dx,dy$ becomes $r,dr,d\\theta$.\n$\\int\\int_C (x+y)^2 dx,dy = \\int_0^{2\\pi}\\int_0^{\\sqrt{701}} (r\\cos(\\theta) + r\\sin(\\theta))^2r,dr,d\\theta$\nLet\'s use the SymPy package to calculate this.\n</thought>\n<execute>\nfrom sympy import symbols, integrate, cos, sin, pi\n\nr, theta = symbols(\'r theta\')\n\n# the integrand\nintegrand = (r*cos(theta) + r*sin(theta))**2 * r\n\n# perform the double integration\ndouble_integral = integrate(integrate(integrand, (r, 0, 701 ** 0.5)), (theta, 0, 2*pi))\nprint(double_integral)\n</execute>\n\nObservation:\nOut[2]: 245700.5*pi\n\nAssistant:\n<thought> Let\'s round this result to three decimal places using half up rounding rule. 
</thought>\n<execute>\nimport math\nprint(round(245700.5*math.pi, 3))\n</execute>\n\nObservation:\n771890.886\n\nAssistant:\nThe answer is <solution> 771890.886 </solution>.\n\n\n---\n\n# Problem statement:\nTask:\nWhat is the area of the region in the $xy-$plane that satisfies \\[\\lfloor x \\rfloor \\lfloor y \\rfloor = 16\\, ?\\]\nIMPORTANT: You should ONLY interact with the environment provided to you or provide the solution inside <solution> tag AND NEVER ASK FOR HUMAN HELP.\n', wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<thought>\nTo find the area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$, we need to consider the floor function.\nThe floor function $\\lfloor x \\rfloor$ gives the largest integer less than or equal to $x$.\nSo, we need to find all pairs of integers $x$ and $y$ such that their product is equal to $16$.\n</thought>\n<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. 
\n\nYou have 4 steps left and 1 chances to propose solution left.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. \n\nYou have 3 steps left and 1 chances to propose solution left.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. 
\n\nYou have 2 steps left and 1 chances to propose solution left.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<execute>\n# Finding all pairs of integers x and y such that their product is equal to 16\npairs = [(x, 16//x) for x in range(1, 17) if 16 % x == 0]\npairs\n</execute>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content="Observation:\nI don't understand your input. \nIf you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\nIf you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\nFor example: The answer to the question is <solution> 42 </solution>. \n\nYou have 1 steps left and 1 chances to propose solution left.\nYou should take the last step to propose a solution.\n", wait_for_response=False, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='<solution> The area of the region in the $xy-$plane that satisfies $\\lfloor x \\rfloor \\lfloor y \\rfloor = 16$ is 9 square units. </solution>', wait_for_response=True, action='message'), NullObservation(content='', observation='null')), (MessageAction(content='Observation:\nYour answer is wrong.\nYou have 0 steps left and 1 chances to propose solution left.\nYou should take the last step to propose a solution.\n', wait_for_response=False, action='message'), NullObservation(content='', observation='null'))]
2024-05-25 01:45:09,049 - INFO - Task state: {'finished': False, 'success': False, 'agent_action_count': {'propose_solution': 1, 'use_tool': 0, 'invalid_action': 0}, 'terminate_reason': None, 'latest_output': {'observation': None, 'success': False, 'content': 'Observation:\nYour answer is wrong.\nYou have 0 steps left and 1 chances to propose solution left.\nYou should take the last step to propose a solution.\n'}}
2024-05-25 01:45:09,049 - INFO - Final message:  | Ground truth: 10.0
2024-05-25 01:45:09,327 - INFO - BrowserEnv already closed, no need to close again
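
To make the behavior in the log easier to reason about, here is a minimal sketch of how a simulated user might classify an agent message by its tags. It is an illustration under assumptions, not the code in this PR: the function name classify, the regular expressions, and the returned labels are hypothetical, and the sketch does not claim to identify the actual cause of the replies above.

```python
import re

# Hypothetical patterns, loosely modeled on the tags named in the log above.
EXECUTE_RE = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)
SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)
INVALID = "I don't understand your input."


def classify(message: str) -> tuple[str, str]:
    """Return (kind, payload), where kind is 'solution', 'execute', or 'invalid'."""
    solution = SOLUTION_RE.search(message)
    if solution:
        return "solution", solution.group(1).strip()
    code = EXECUTE_RE.search(message)
    if code:
        return "execute", code.group(1).strip()
    # Neither tag was found, so the environment cannot act on the message.
    return "invalid", INVALID


print(classify("<execute>\npairs = [1, 2]\npairs\n</execute>"))
print(classify("<execute_ipython>print(1)</execute_ipython>"))
```

In this sketch a message wrapped in a tag name the parser does not look for falls through to the invalid branch, so any mismatch between the tags the agent emits and the tags the environment parses would consume steps without executing code.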

opendevin/core/main.py (outdated, resolved)
evaluation/mint/run_infer.sh (outdated, resolved)
evaluation/mint/task.py (outdated, resolved)
evaluation/mint/run_infer.py (outdated, resolved)
Comment on lines +47 to +48
last_action, _ = state.history[-1]
result_state: TaskState = env.step(last_action.message)
Collaborator

Not sure whether we need to check

isinstance(act, MessageAction) and act.source == 'agent'
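
One way the suggested check could look is sketched below. The MessageAction and NullObservation classes here are simplified stand-ins defined only for this example, and last_agent_message is a hypothetical helper; the real OpenDevin classes have more fields and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MessageAction:
    """Simplified stand-in for the real action class (assumption)."""
    content: str
    source: str = "agent"


@dataclass
class NullObservation:
    """Simplified stand-in for the real observation class (assumption)."""
    content: str = ""


def last_agent_message(history) -> Optional[MessageAction]:
    """Walk the (action, observation) history backwards to the latest agent message."""
    for act, _obs in reversed(history):
        if isinstance(act, MessageAction) and act.source == "agent":
            return act
    return None


history = [
    (MessageAction("task prompt", source="user"), NullObservation()),
    (MessageAction("<solution> 10 </solution>"), NullObservation()),
    (MessageAction("Your answer is wrong.", source="user"), NullObservation()),
]
print(last_agent_message(history).content)
```

Compared with indexing state.history[-1] directly, this avoids feeding a user message or a non-message action back into the environment when the last history entry is not from the agent.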

class SimplifiedEnv:
    INVALID_INPUT_MESSAGE = (
        "I don't understand your input. \n"
        'If you want to execute code, please use <execute> YOUR_CODE_HERE </execute>.\n'
Contributor

I think you can replace these "<execute>" tags with the "<execute_ipython>" tag used by CodeActAgent.

Then you can add an assertion on the agent: it needs to be CodeActAgent to run MINT. We can add support for other agents when needed.
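
A guard of this kind might look like the sketch below. The names SUPPORTED_AGENTS and check_agent_supported are hypothetical and chosen only for illustration; where the check actually lives in the evaluation script is up to the author.

```python
# Hypothetical allow-list; extend it when other agents gain MINT support.
SUPPORTED_AGENTS = ("CodeActAgent",)


def check_agent_supported(agent_class: str) -> None:
    """Fail fast if the configured agent cannot run the MINT evaluation."""
    assert agent_class in SUPPORTED_AGENTS, (
        f"MINT evaluation currently supports only {SUPPORTED_AGENTS}, "
        f"got {agent_class!r}"
    )


check_agent_supported("CodeActAgent")  # passes silently
```

Failing at startup keeps a run from spending its iteration budget on an agent whose tag format the environment does not parse.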

Contributor Author

Nice idea, I've changed it! I also hardcoded CodeActAgent as the AGENT option inside run_infer.sh.

@ryanhoangt
Contributor Author

I tested locally with the first 2 examples and both pass now. Can you try again? @yufansong

@yufansong
Collaborator

I tested locally with the first 2 examples and both pass now. Can you try again? @yufansong

OK, this time it works locally. It outputs some solutions.

Collaborator

@yufansong yufansong left a comment

LGTM

Contributor

@xingyaoww xingyaoww left a comment

Some minor things need to be fixed before merging.

evaluation/mint/run_infer.sh (outdated, resolved)
evaluation/mint/README.md (resolved)
#!/bin/bash

SUBSET=$1
EVAL_LIMIT=$2
Contributor

Can you add default values for SUBSET and EVAL_LIMIT? You can refer to the run_infer.sh for SWE-bench.

Contributor Author

I've fixed it. Can you take a look to see if it's what you expected?

@yufansong yufansong enabled auto-merge (squash) May 28, 2024 07:37
@yufansong yufansong merged commit 9434bcc into OpenDevin:main May 28, 2024
18 checks passed
4 participants