Use LLM to analyze ML-Bench failure cases (#2399)

* add ml-bench w/o exec env * fix typos (#1956) no functional change * Refactored Logs (#1939) * [Feat] A competitive Web Browsing agent (#1856) * initial attempt at a browsing only agent * add browsing agent * update * implement agent * update * fix comments * remove unnecessary things from memory extras * update image processing --------- Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * Update README.md SWE-bench score (#1959) * Update README.md SWE-bench score Our most recent results on swe-bench lite are 25%, so this updates the README accordingly. * Update * fix: llm is_local function logic error (#1961) Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * doc: update documentation about poetry update (#1962) * add doc * Update Development.md --------- Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * feat: add metrics related to cost for better observability (#1944) * add metrics for total_cost * make lint * refact codeact * change metrics into llm * add costs list, add into state * refactor log completion * refactor and test others * make lint * Update opendevin/core/metrics.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/llm/llm.py Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * refactor * add code --------- Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * doc: add more cmd in unit test documentation (#1963) * --- (#1975) updated-dependencies: - dependency-name: boto3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1976) updated-dependencies: - dependency-name: litellm dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Logging security (#1943) * update .gitignore * Rename the confusing 'INFO' style to 'DETAIL' * override str and repr * feat: api_key desensitize * feat: add SensitiveDataFilter in file handler * tweak regex, add tests * more tweaks, include other attrs * add env vars, those with equivalent config * fix tests * tests are invaluable --------- Co-authored-by: Shimada666 <649940882@qq.com> * --- (#1967) updated-dependencies: - dependency-name: react-dom dependency-type: direct:production update-type: version-update:semver-minor - dependency-name: "@types/react-dom" dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1968) updated-dependencies: - dependency-name: "@reduxjs/toolkit" dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1969) updated-dependencies: - dependency-name: husky dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1970) updated-dependencies: - dependency-name: tailwind-merge dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1971) updated-dependencies: - dependency-name: i18next dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * Refactor session management (#1810) * refactor session mgmt * defer file handling to runtime * add todo * refactor sessions a bit more * remove messages logic from FE * fix up socket handshake * refactor frontend auth a bit * first pass at redoing file explorer * implement directory suffix * fix up file tree * close agent on websocket close * remove session saving * move file refresh * remove getWorkspace * plumb path/code differently * fix build issues * fix the tests * fix npm build * add session rehydration * fix event serialization * logspam * fix user message rehydration * add get_event fn * agent state restoration * change history tracking for codeact * fix responsiveness of init * fix lint * lint * delint * fix prop * update tests * logspam * lint * fix test * revert codeact * change fileService to use API * fix up session loading * delint * delint * fix integration tests * revert test * fix up access to options endpoints * fix initial files load * delint * fix file initialization * fix mock server * fixl int * fix auth for html * Update frontend/src/i18n/translation.json Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * refactor sessions and sockets * avoid reinitializing the same session * fix reconnect issue * change up intro message * more guards on reinit * rename agent_session * delint * fix a bunch of tests * delint * fix last test * remove code editor context * fix build * fix any * fix dot notation * Update frontend/src/services/api.ts Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * fix up error handling * Update opendevin/server/session/agent.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/server/session/agent.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update frontend/src/services/session.ts Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * fix build errs * fix else * add closed state * delint * Update opendevin/server/session/session.py Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> --------- Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> * fix #1960 (#1964) * Add ruff for shared mutable defaults (B) (#1938) * Add ruff for shared mutable defaults (B) * Apply B006, B008 on current files, except fast API * Update agenthub/SWE_agent/prompts.py Co-authored-by: Graham Neubig <neubig@gmail.com> * fix unintended behavior change * this is correct, tell Ruff to leave it alone --------- Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Refactor integration testing CI, add optional Mac tests, and mark a few agents as deprecated (#1888) * Add MacOS to integration tests * Switch back to python 3.11 * Install Docker for macos pipeline * regenerate.sh: Use environmental variable for sandbox type * Pack different agents' tests into a single check * Fix CodeAct tests * Reduce file match and extensive debug logs * Add TEST_IN_CI mode that reports codecov * Small fix: don't quit if reusing old responses failed * Merge codecov results * Fix typos * Remove coverage merge step - codecov automatically does that * Make mac integration tests as optional - too slow * Fix codecov args * Add comments in yaml * Include sandbox type in codecov report name * Fix codecov report merge * Revert renaming of test_matrix_success * Remove SWEAgent and PlannerAgent from tests * Mark planner agent and SWE agent as deprecated * CodeCov: Ignore planner and sweagent * Revert "Remove SWEAgent and PlannerAgent from tests" This reverts commit 040cb3b. * Remove all tests for SWE Agent * Only keep basic tests for MonologueAgent and PlannerAgent * Mark SWE Agent as deprecated, and ignore code coverage for it --------- Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> * Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (#1987) Co-authored-by: jianghongwei <jianghongwei@58.com> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * Save CI cycles for backend tests (#1985) * Fix typo in prompt (#1992) * Refactor monologue and SWE agent to use the messages in state history (#1863) * Refactor monologue to use the messages in state history * add messages, clean up * fix monologue * update integration tests * move private method * update SWE agent to use the history from State * integration tests for SWE agent * rename monologue to initial_thoughts, since that is what it is * fix: catch session file not existed exception when init EventStream(maybe creating a new session with no session files stored). (#1994) * add ml-bench in readme * Bump boto3 from 1.34.110 to 1.34.111 (#2001) Bumps [boto3](https://github.com/boto/boto3) from 1.34.110 to 1.34.111. - [Release notes](https://github.com/boto/boto3/releases) - [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst) - [Commits](boto/boto3@1.34.110...1.34.111) --- updated-dependencies: - dependency-name: boto3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump docker from 7.0.0 to 7.1.0 (#2002) Bumps [docker](https://github.com/docker/docker-py) from 7.0.0 to 7.1.0. - [Release notes](https://github.com/docker/docker-py/releases) - [Commits](docker/docker-py@7.0.0...7.1.0) --- updated-dependencies: - dependency-name: docker dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump litellm from 1.37.20 to 1.38.0 (#2005) Bumps [litellm](https://github.com/BerriAI/litellm) from 1.37.20 to 1.38.0. - [Release notes](https://github.com/BerriAI/litellm/releases) - [Commits](BerriAI/litellm@v1.37.20...v1.38.0) --- updated-dependencies: - dependency-name: litellm dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix SWE-Bench evaluation due to setuptools version (#1995) * correctly setup plugins for swebench eval * bump swe-bench version and add logging * Revert "correctly setup plugins for swebench eval" This reverts commit 2bd1055. * bump version * fix session state after resuming (#1999) * fix state resuming * fix session reconnection * fix lint * Implement `agentskills` for OpenDevin to helpfully improve edit AND including more useful tools/skills (#1941) * add draft for skills * Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file * Remove new_sample.txt file * add some work from opendevin w/ fixes * Add unit tests for agentskills module * fix some issues and updated tests * add more tests for open * tweak and handle goto_line * add tests for some edge cases * add tests for scrolling * add tests for edit * add tests for search_dir * update tests to use pytest * use pytest --forked to avoid file op unit tests to interfere with each other via global var * update doc based on swe agent tool * update and add tests for find_file and search_file * move agent_skills to plugins * add agentskills as plugin and docs * add agentskill to ssh box and fix sandbox integration * remove extra returns in doc * add agentskills to initial tool for jupyter * support re-init jupyter kernel (for agentskills) after restart * fix print window's issue with indentation and add testcases * add prompt for codeact with the newest edit primitives * modify the way line number is presented (remove leading space) * change prompt to the newest display format * support tracking of costs via metrics * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * implement and add tests for py linting * remove extra text arg for incompatible subprocess ver * remove sample.txt * update test_edits integration tests * fix all integration * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/runtime/plugins/agent_skills/agentskills.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * correctly setup plugins for swebench eval * bump swe-bench version and add logging * correctly setup plugins for swebench eval * bump swe-bench version and add logging * Revert "correctly setup plugins for swebench eval" This reverts commit 2bd1055. * bump version * remove _AGENT_SKILLS_DOCS * move flake8 to test dep * update poetry.lock * remove extra arg * reduce max iter for eval * update poetry * fix integration tests --------- Co-authored-by: OpenDevin <opendevin@opendevin.ai> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * build: Add poetry command to use Python 3.11 for environment setup (#1972) * Bump @react-types/shared from 3.23.0 to 3.23.1 in /frontend (#2006) Bumps [@react-types/shared](https://github.com/adobe/react-spectrum) from 3.23.0 to 3.23.1. - [Release notes](https://github.com/adobe/react-spectrum/releases) - [Commits](https://github.com/adobe/react-spectrum/compare/@react-types/shared@3.23.0...@react-types/shared@3.23.1) --- updated-dependencies: - dependency-name: "@react-types/shared" dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump @types/react-syntax-highlighter in /frontend (#2007) Bumps [@types/react-syntax-highlighter](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-syntax-highlighter) from 15.5.11 to 15.5.13. - [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases) - [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-syntax-highlighter) --- updated-dependencies: - dependency-name: "@types/react-syntax-highlighter" dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump @typescript-eslint/parser from 7.9.0 to 7.10.0 in /frontend (#2008) Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser) from 7.9.0 to 7.10.0. - [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases) - [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md) - [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.10.0/packages/parser) --- updated-dependencies: - dependency-name: "@typescript-eslint/parser" dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump lint-staged from 15.2.2 to 15.2.4 in /frontend (#2009) Bumps [lint-staged](https://github.com/okonet/lint-staged) from 15.2.2 to 15.2.4. - [Release notes](https://github.com/okonet/lint-staged/releases) - [Changelog](https://github.com/lint-staged/lint-staged/blob/master/CHANGELOG.md) - [Commits](lint-staged/lint-staged@v15.2.2...v15.2.4) --- updated-dependencies: - dependency-name: lint-staged dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update README.md * Update README.md * add run_infer.sh * fix input output * fix docker sandbox * fix run * update and clean run_infer.py * add script to clean up dockers * update repo uid * add description * new * Update README.md * use root for sandbox * update readme * update ml-bench conda env * update readme * update readme * use try except * modify raise exception * add int * update README * longer time * fix existing issues * fix existing issue * new docker image * add metrics of cost * add result parsing cost * fix * fix * update summarize * fix * add analyze * update readme * use 4o * add eval output --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-31-157.ec2.internal> Co-authored-by: RainRat <rainrat78@yahoo.ca> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> Co-authored-by: Frank Xu <frankxu2004@gmail.com> Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Shimada666 <649940882@qq.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Robert Brennan <accounts@rbren.io> Co-authored-by: Rahul Anand <62982824+zeul22@users.noreply.github.com> Co-authored-by: jiangleo <jiangleo@users.noreply.github.com> Co-authored-by: jianghongwei <jianghongwei@58.com> Co-authored-by: Jeremi Joslin <jeremi@newlogic.com> Co-authored-by: Aaron Xia <zhhuaxia@gmail.com> Co-authored-by: OpenDevin <opendevin@opendevin.ai> Co-authored-by: DaxServer <7479937+DaxServer@users.noreply.github.com> Co-authored-by: Robert <871607149@qq.com>
OpenDevin · Jun 13, 2024 · 563bc41 · 563bc41
1 parent 0bd5ab7
commit 563bc41
Show file tree

Hide file tree

Showing 3 changed files with 207 additions and 0 deletions.
diff --git a/evaluation/ml_bench/README.md b/evaluation/ml_bench/README.md
@@ -53,6 +53,26 @@ To run the evaluation on the ML-Bench dataset, use the following command:
 
 You can replace `eval_gpt4_1106_preview` with any model you set up in `config.toml`.
 
+## Score Evaluation Output
+
+To score the evaluation output, use the following command:
+
+```bash
+./evaluation/ml_bench/scripts/summarise_results.py [eval_output_dir]
+# e.g., ./evaluation/ml_bench/scripts/summarise_results.py evaluation/evaluation_outputs/outputs/ml_bench/CodeActAgent/gpt-4-1106-preview_maxiter_10_N_v1.5
+```
+
+## Run Error Analysis on ML-Bench
+
+To run error analysis on the ML-Bench dataset, use the following command:
+
+```bash
+./evaluation/ml_bench/scripts/run_analysis.sh [eval_output_dir] [model_config]
+# e.g., ./evaluation/ml_bench/scripts/run_analysis.sh evaluation/evaluation_outputs/outputs/ml_bench/CodeActAgent/gpt-4-1106-preview_maxiter_10_N_v1.5/output.jsonl eval_gpt4_1106_preview
+```
+
+This command generates a report on the evaluation output and provides insights into the agent's performance.
+
 ## Examples
 
 For each task in the ML-Bench dataset, OpenDevin provides the agent with a set number of iterations to complete the task. The `history` field in the evaluation output shows each iteration's response and actions taken by the agent to complete the task.

diff --git a/evaluation/ml_bench/run_analysis.py b/evaluation/ml_bench/run_analysis.py
@@ -0,0 +1,162 @@
+import json
+import os
+import pprint
+
+import tqdm
+
+from opendevin.core.config import config, get_llm_config_arg, get_parser
+from opendevin.core.logger import opendevin_logger as logger
+from opendevin.llm.llm import LLM
+
+
+def extract_test_results(res_file_path: str) -> tuple[list[str], list[str]]:
+    passed = []
+    failed = []
+    costs = []
+    instance_ids = set()
+    instances = []
+    with open(res_file_path, 'r') as file:
+        for line in file:
+            data = json.loads(line.strip())
+            success = data['metrics']['success']
+            if data['instance_id'] in instance_ids:
+                print(f'WARNING: Duplicate instance_id found: {data["instance_id"]}')
+                continue
+            instance_ids.add(data['instance_id'])
+            instances.append(data)
+            if success:
+                passed.append(
+                    {
+                        'instance_id': data['instance_id'],
+                        'repo': data['repo'],
+                        'instruction': data['instruction'],
+                        'eval_script': data['eval_script'],
+                        'eval_exit_code': data['eval_exit_code'],
+                        'eval_output': data['eval_output'],
+                        'accumulated_cost': data['metrics']['accumulated_cost'],
+                    }
+                )
+            else:
+                failed.append(
+                    {
+                        'instance_id': data['instance_id'],
+                        'repo': data['repo'],
+                        'instruction': data['instruction'],
+                        'metadata': data['metadata'],
+                        'history': data['history'],
+                        'eval_script': data['eval_script'],
+                        'eval_exit_code': data['eval_exit_code'],
+                        'eval_output': data['eval_output'],
+                        'accumulated_cost': data['metrics']['accumulated_cost'],
+                    }
+                )
+            costs.append(data['metrics']['accumulated_cost'])
+
+        # sort by instance_id
+        instances.sort(key=lambda x: x['instance_id'])
+        with open(res_file_path, 'w') as file:
+            for instance in instances:
+                file.write(json.dumps(instance) + '\n')
+        return passed, failed, costs
+
+
+def classify_error(llm: LLM, failed_case: dict) -> str:
+    prompt = f"""
+    Please classify the error for the following failed case based on the history and eval_output:
+
+    Instruction:
+    {failed_case['instruction']}
+
+    Eval Script:
+    {failed_case['eval_script']}s
+
+    History:
+    {failed_case['history']}
+
+    Eval Output:
+    {failed_case['eval_output']}
+
+    The error categories are:
+    E1: Hallucination Errors - The model misinterpreted the user's intention, misplaced Python code and bash script, or generated random or irrelevant code.
+    E2: Lack of Knowledge or Information - The model lacks sufficient information or domain-specific knowledge to satisfy the user's requirements.
+    E3: Knowledge Manipulation - The model failed to integrate or manipulate information properly.
+    E4: Syntax Errors - The model generated code with syntax errors.
+    E5: Operational Error - The model gave up easily or exited without finishing the tasks.
+
+    Please provide only the error category (E1, E2, E3, E4, or E5) without any explanation.
+    """
+
+    try:
+        response = llm.completion(messages=[{'content': prompt, 'role': 'user'}])
+        error_category = response.choices[0].message['content']
+    except Exception as e:
+        logger.error(
+            f"Failed to classify the error for the failed case: {failed_case['instance_id']}"
+        )
+        logger.error(e)
+        error_category = input(
+            failed_case['instruction']
+            + ': '
+            + failed_case['eval_script']
+            + ' - '
+            + failed_case['eval_output']
+        )
+
+    if error_category not in ['E1', 'E2', 'E3', 'E4', 'E5']:
+        raise ValueError(f'Invalid error category: {error_category}')
+
+    return error_category
+
+
+if __name__ == '__main__':
+    parser = get_parser()
+    parser.add_argument(
+        '--json_file_path',
+        type=str,
+        required=True,
+        help='Path to the jsonl file containing the evaluation results',
+    )
+    args, _ = parser.parse_known_args()
+
+    # Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
+    # for details of how to set `llm_config`
+    if args.llm_config:
+        specified_llm_config = get_llm_config_arg(args.llm_config)
+        if specified_llm_config:
+            config.llm = specified_llm_config
+    logger.info(f'Config for evaluation: {config}')
+    llm = LLM(llm_config=specified_llm_config)
+
+    passed, new_failed, costs = extract_test_results(args.json_file_path)
+
+    failed = []
+    if os.path.exists(args.json_file_path.replace('.jsonl', '_failed.jsonl')):
+        with open(args.json_file_path.replace('.jsonl', '_failed.jsonl'), 'r') as file:
+            for line in file:
+                failed.append(json.loads(line.strip()))
+        print(
+            f'Loaded {len(failed)} failed cases from {args.json_file_path.replace(".jsonl", "_failed.jsonl")}'
+        )
+
+    for failed_case in tqdm.tqdm(new_failed):
+        if failed_case['instance_id'] in [case['instance_id'] for case in failed]:
+            continue
+        error_category = classify_error(llm, failed_case)
+        failed_case['error_category'] = error_category
+        failed.append(failed_case)
+        with open(args.json_file_path.replace('.jsonl', '_failed.jsonl'), 'a') as file:
+            file.write(json.dumps(failed_case) + '\n')
+
+    # Print the summary
+    print('Summary:')
+    print(f'Passed: {len(passed)}')
+    print(f'Failed: {len(failed)}')
+    print(f'Costs: {costs}')
+    print('Failed cases:')
+    error_categories = {}
+    for case in failed:
+        error_category = case['error_category']
+        if error_category not in error_categories:
+            error_categories[error_category] = 0
+        error_categories[error_category] += 1
+    pprint.pprint(error_categories)
diff --git a/evaluation/ml_bench/scripts/run_analysis.sh b/evaluation/ml_bench/scripts/run_analysis.sh
@@ -0,0 +1,25 @@
+#!/bin/bash
+
+RESULT_FILE=$1
+MODEL_CONFIG=$2
+
+if [ -z "$RESULT_FILE" ]; then
+  echo "RESULT_FILE not specified"
+  exit 1
+fi
+
+
+if [ -z "$MODEL_CONFIG" ]; then
+  echo "Model config not specified, use default"
+  MODEL_CONFIG="eval_gpto"
+fi
+
+echo "MODEL_CONFIG: $MODEL_CONFIG"
+echo "RESULT_FILE: $RESULT_FILE"
+
+COMMAND="poetry run python evaluation/ml_bench/run_analysis.py \
+  --llm-config $MODEL_CONFIG \
+  --json_file_path $RESULT_FILE"
+
+# Run the command
+eval $COMMAND