
【Feature】add SWE-bench eval task, dataset loader, and summarizer integration #240

Merged
SJTUyh merged 3 commits into AISBench:master from GaoHuaZhang:swebench_part2
Apr 10, 2026

Conversation

Collaborator

@GaoHuaZhang GaoHuaZhang commented Apr 10, 2026

PR Type

  • Feature
  • Bugfix
  • Docs
  • CI/CD
  • Refactor
  • Perf
  • Dependency
  • Test-Cases
  • Other

Related Issue
Relates to #N/A

🔍 Motivation

The current benchmark pipeline lacks complete eval integration for the SWE-bench scenario, so the existing CLI workflow cannot be reused directly to close the loop from prediction output to evaluation summary.

This PR aligns SWE-bench eval capability with the existing task framework, covering data loading, task execution, and result summarization, lowering the barrier to entry and enabling large-scale evaluation later.

📝 Modification

  • Add SWEBenchEvalTask, which runs evaluation through the official SWE-bench harness and supports:
    • reading prediction results from a predictions directory;
    • filtering invalid/empty patches and skipping already-finished instances;
    • maintaining case-level progress state (finish_count, resolved_count, accuracy);
    • aggregating and writing a unified result JSON (including accuracy, submitted_accuracy, harness_exit_code, etc.).
  • Add SWEBenchDataset, which loads data either online from Hugging Face or from local parquet (a single file or a sharded directory), and supports filter_spec and reproducible shuffle.
  • Add SWEBenchSummarizer, which extracts accuracy from the aggregated evaluation results and plugs into the unified visual summary pipeline.
  • Register the SWE-bench modules at the entry points:
    • datasets/__init__.py: register the swebench dataset;
    • summarizers/__init__.py: register the swebench summarizer;
    • cli/workers.py: switch the Eval worker's default task type to SWEBenchEvalTask to fit the SWE-bench eval flow.
  • Improve error handling and logging, adding error codes and messages for local path resolution failure, missing parquet files, and harness dependency import failure.
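The case-level accounting listed above can be sketched as follows. Only the output field names (finish_count, resolved_count, accuracy) come from the PR description; the helper function and its inputs are hypothetical, not the actual SWEBenchEvalTask code.

```python
# Hypothetical sketch of the case-level progress accounting.
# summarize_cases and its input shape are illustrative assumptions.
def summarize_cases(case_results):
    """Aggregate per-instance harness outcomes into summary fields."""
    finish_count = len(case_results)
    resolved_count = sum(1 for r in case_results if r.get("resolved"))
    accuracy = resolved_count / finish_count if finish_count else 0.0
    return {
        "finish_count": finish_count,
        "resolved_count": resolved_count,
        "accuracy": accuracy,
    }

summary = summarize_cases([
    {"instance_id": "astropy__astropy-12907", "resolved": True},
    {"instance_id": "django__django-11019", "resolved": False},
])
print(summary)  # {'finish_count': 2, 'resolved_count': 1, 'accuracy': 0.5}
```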

📐 Associated Test Results

  • Local static code checks: not run separately (pending CI verification).
  • Functional verification: self-checked against the code paths and exception branches; adding an end-to-end SWE-bench eval smoke test in CI is recommended.

⚠️ BC-breaking (Optional)

No explicit backward-incompatible changes.
This PR adds SWE-bench capability; existing non-SWE-bench datasets and pipelines are unaffected.

⚠️ Performance degradation (Optional)

No additional framework-level performance overhead is introduced.
SWE-bench eval itself relies on the harness and container execution, so overall runtime is dominated by the number of cases and Docker execution time.

🌟 Use cases (Optional)

  • Run standardized evaluation of SWE-bench prediction results with ais_bench and produce a unified result file;
  • Reuse the existing summarizer for accuracy aggregation and display in multi-model/multi-dataset pipelines;
  • Run SWE-bench eval in offline environments from local parquet data.

✅ Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that caused the bug is added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

👥 Collaboration Info

  • Suggested Reviewers: @xxx
  • Relevant Module Owners: @xxx
  • Other Collaboration Notes: please focus review on the harness integration flow and the compatibility of the summarized result fields.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for the SWE-bench dataset and evaluation task, including a new summarizer and specific error codes. It refactors configuration management to recursively convert types to strings before dumping and updates the Infer and Eval workers to better handle existing configurations. Additionally, the extract_non_reasoning_content post-processor was refactored to handle list inputs natively. Feedback highlights a redundant assignment in the evaluation worker, potential KeyError exceptions when parsing predictions, and the need to optimize dataset filtering to avoid high memory usage. It is also recommended to escape regex tokens in post-processors to ensure robustness against special characters.

Comment thread: ais_bench/benchmark/cli/workers.py (Outdated)
runner_cfg = new_cfg['eval']['runner']
runner_cfg['max_num_workers'] = self.args.max_num_workers
runner_cfg['max_workers_per_gpu'] = self.args.max_workers_per_gpu
runner_cfg['debug'] = runner_cfg['debug'] = self.args.debug or cfg.cli_args.debug
Contributor


medium

There is a redundant double assignment to runner_cfg['debug'].

Suggested change
runner_cfg['debug'] = runner_cfg['debug'] = self.args.debug or cfg.cli_args.debug
runner_cfg['debug'] = self.args.debug or cfg.cli_args.debug

with open(pred_path) as f:
raw_preds = json.load(f)
if isinstance(raw_preds, dict):
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict)}
Contributor


medium

Accessing p[KEY_INSTANCE_ID] directly without checking if the key exists could raise a KeyError if the prediction data is malformed. It is safer to verify the key's presence before access.

Suggested change
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict)}
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict) and KEY_INSTANCE_ID in p}
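A quick illustration of why the key-presence check matters; the raw_preds payload here is hypothetical sample data, not taken from the PR:

```python
# Mirrors the constant referenced in the review thread.
KEY_INSTANCE_ID = "instance_id"

raw_preds = {
    "a": {"instance_id": "repo__pkg-1", "model_patch": "diff ..."},
    "b": {"model_patch": "diff ..."},   # malformed: missing instance_id
    "c": "not-a-dict",                  # malformed: wrong type
}

# Without the `KEY_INSTANCE_ID in p` guard, entry "b" would raise KeyError.
predictions = {
    p[KEY_INSTANCE_ID]: p
    for p in raw_preds.values()
    if isinstance(p, dict) and KEY_INSTANCE_ID in p
}
print(sorted(predictions))  # ['repo__pkg-1'] — only the well-formed entry survives
```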

if isinstance(raw_preds, dict):
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict)}
else:
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds if isinstance(p, dict)}
Contributor


medium

Accessing p[KEY_INSTANCE_ID] directly without checking if the key exists could raise a KeyError. It is safer to verify the key's presence before access.

Suggested change
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds if isinstance(p, dict)}
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds if isinstance(p, dict) and KEY_INSTANCE_ID in p}

SWEB_CODES.LOCAL_PARQUET_LOAD_FAILED,
f"Failed to load local swebench parquet from {root}: {e}",
)
dataset = self.filter_instances(list(dataset), filter_spec=filter_spec, shuffle=shuffle)
Contributor


medium

Converting the entire dataset to a list using list(dataset) can be very memory-intensive and slow for larger datasets. Consider using the dataset.filter() and dataset.shuffle() methods provided by the datasets library, which are optimized for performance and memory efficiency by operating on the underlying data lazily or in chunks.

Comment on lines +61 to +62
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
re.DOTALL)
Contributor


medium

The think_start_token and think_end_token are used directly in a regular expression. If these tokens contain special regex characters (e.g., (, ), [), the regex will fail or behave unexpectedly. Use re.escape() to ensure they are treated as literal strings.

Suggested change
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
re.DOTALL)
reasoning_regex = re.compile(rf'{re.escape(think_start_token)}(.*?){re.escape(think_end_token)}',
re.DOTALL)
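A small demonstration of the failure mode, using hypothetical tokens containing `[`/`]`; without escaping, the brackets are parsed as character classes and the pattern silently matches the wrong span:

```python
import re

think_start_token, think_end_token = "[THINK]", "[/THINK]"
text = "[THINK]internal reasoning[/THINK]final answer"

# Unescaped: "[THINK]" becomes the character class [THINK], so the
# pattern matches a bogus two-character span instead of the tokens.
bad = re.compile(rf'{think_start_token}(.*?){think_end_token}', re.DOTALL)
print(bad.search(text).group(0))  # 'TH'

# Escaped: the tokens are matched literally, as intended.
good = re.compile(
    rf'{re.escape(think_start_token)}(.*?){re.escape(think_end_token)}',
    re.DOTALL)
cleaned = good.sub("", text)
print(cleaned)  # 'final answer'
```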

Comment on lines +58 to +59
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
re.DOTALL)
Contributor


medium

The think_start_token and think_end_token are used directly in a regular expression. If these tokens contain special regex characters, the regex will fail or behave unexpectedly. Use re.escape() to ensure they are treated as literal strings.

Suggested change
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
re.DOTALL)
reasoning_regex = re.compile(rf'{re.escape(think_start_token)}(.*?){re.escape(think_end_token)}',
re.DOTALL)

@GaoHuaZhang GaoHuaZhang temporarily deployed to smoke-test-approval April 10, 2026 12:08 — with GitHub Actions Inactive
@GaoHuaZhang GaoHuaZhang changed the title Swebench part2 【Feature】add SWE-bench eval task, dataset loader, and summarizer integration Apr 10, 2026
SJTUyh previously approved these changes Apr 10, 2026
@Keithwwa Keithwwa dismissed SJTUyh’s stale review April 10, 2026 13:17

The merge-base changed after approval.

@Keithwwa Keithwwa temporarily deployed to smoke-test-approval April 10, 2026 13:17 — with GitHub Actions Inactive
@SJTUyh SJTUyh merged commit ebf0d0d into AISBench:master Apr 10, 2026
6 checks passed

3 participants