
【Feature】add SWE-bench eval task, dataset loader, and summarizer integration #240

Merged
SJTUyh merged 3 commits into AISBench:master from GaoHuaZhang:swebench_part2
Apr 10, 2026

Conversation

Collaborator

@GaoHuaZhang GaoHuaZhang commented Apr 10, 2026

PR Type

  • Feature
  • Bugfix
  • Docs
  • CI/CD
  • Refactor
  • Perf
  • Dependency
  • Test-Cases
  • Other

Related Issue
Relates to #N/A

🔍 Motivation

The current benchmark pipeline lacks complete eval integration for the SWE-bench scenario, so the existing CLI workflow cannot be reused directly to close the loop from prediction output to evaluation summary.

This PR aligns SWE-bench eval capability with the existing task framework, covering data loading, task execution, and result summarization, lowering the barrier to entry and enabling large-scale evaluation later.

📝 Modification

  • Add SWEBenchEvalTask, which runs evaluation through the official SWE-bench harness and supports:
    • reading prediction results from a predictions directory;
    • filtering invalid/empty patches and skipping already-finished instances;
    • maintaining case-level progress state (finish_count, resolved_count, accuracy);
    • aggregating and writing a unified result JSON (including accuracy, submitted_accuracy, harness_exit_code, etc.).
  • Add SWEBenchDataset, which loads data either online from Hugging Face or from local parquet (a single file or a sharded directory), and supports filter_spec and reproducible shuffle.
  • Add SWEBenchSummarizer, which extracts accuracy from the aggregated evaluation results and plugs into the unified visual summary pipeline.
  • Register the SWE-bench modules at the entry points:
    • datasets/__init__.py: register the swebench dataset;
    • summarizers/__init__.py: register the swebench summarizer;
    • cli/workers.py: switch the Eval worker's default task type to SWEBenchEvalTask to fit the SWE-bench eval flow.
  • Improve error handling and logging, adding error codes and messages for local path resolution failure, missing parquet files, and harness dependency import failure.
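The case-level accounting listed above can be sketched as follows. Only the output field names (finish_count, resolved_count, accuracy) come from the PR description; the helper function and its inputs are hypothetical, not the actual SWEBenchEvalTask code.

```python
# Hypothetical sketch of the case-level progress accounting.
# summarize_cases and its input shape are illustrative assumptions.
def summarize_cases(case_results):
    """Aggregate per-instance harness outcomes into summary fields."""
    finish_count = len(case_results)
    resolved_count = sum(1 for r in case_results if r.get("resolved"))
    accuracy = resolved_count / finish_count if finish_count else 0.0
    return {
        "finish_count": finish_count,
        "resolved_count": resolved_count,
        "accuracy": accuracy,
    }

summary = summarize_cases([
    {"instance_id": "astropy__astropy-12907", "resolved": True},
    {"instance_id": "django__django-11019", "resolved": False},
])
print(summary)  # {'finish_count': 2, 'resolved_count': 1, 'accuracy': 0.5}
```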

📐 Associated Test Results

  • Local static code checks: not run separately (pending CI verification).
  • Functional verification: self-checked against the code paths and exception branches; adding an end-to-end SWE-bench eval smoke test in CI is recommended.

⚠️ BC-breaking (Optional)

No explicit backward-incompatible changes.
This PR adds SWE-bench capability; existing non-SWE-bench datasets and pipelines are unaffected.

⚠️ Performance degradation (Optional)

No additional framework-level performance overhead is introduced.
SWE-bench eval itself relies on the harness and container execution, so overall runtime is dominated by the number of cases and Docker execution time.

🌟 Use cases (Optional)

  • Run standardized evaluation of SWE-bench prediction results with ais_bench and produce a unified result file;
  • Reuse the existing summarizer for accuracy aggregation and display in multi-model/multi-dataset pipelines;
  • Run SWE-bench eval in offline environments from local parquet data.

✅ Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that caused the bug is added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

👥 Collaboration Info

  • Suggested Reviewers: @xxx
  • Relevant Module Owners: @xxx
  • Other Collaboration Notes: please focus review on the harness integration flow and the compatibility of the summarized result fields.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for the SWE-bench dataset and evaluation task, including a new summarizer and specific error codes. It refactors configuration management to recursively convert types to strings before dumping and updates the Infer and Eval workers to better handle existing configurations. Additionally, the extract_non_reasoning_content post-processor was refactored to handle list inputs natively. Feedback highlights a redundant assignment in the evaluation worker, potential KeyError exceptions when parsing predictions, and the need to optimize dataset filtering to avoid high memory usage. It is also recommended to escape regex tokens in post-processors to ensure robustness against special characters.

Comment thread: ais_bench/benchmark/cli/workers.py (Outdated)
runner_cfg = new_cfg['eval']['runner']
runner_cfg['max_num_workers'] = self.args.max_num_workers
runner_cfg['max_workers_per_gpu'] = self.args.max_workers_per_gpu
runner_cfg['debug'] = runner_cfg['debug'] = self.args.debug or cfg.cli_args.debug
Contributor


medium

There is a redundant double assignment to runner_cfg['debug'].

Suggested change
runner_cfg['debug'] = runner_cfg['debug'] = self.args.debug or cfg.cli_args.debug
runner_cfg['debug'] = self.args.debug or cfg.cli_args.debug

with open(pred_path) as f:
raw_preds = json.load(f)
if isinstance(raw_preds, dict):
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict)}
Contributor


medium

Accessing p[KEY_INSTANCE_ID] directly without checking if the key exists could raise a KeyError if the prediction data is malformed. It is safer to verify the key's presence before access.

Suggested change
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict)}
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict) and KEY_INSTANCE_ID in p}
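A quick illustration of why the key-presence check matters; the raw_preds payload here is hypothetical sample data, not taken from the PR:

```python
# Mirrors the constant referenced in the review thread.
KEY_INSTANCE_ID = "instance_id"

raw_preds = {
    "a": {"instance_id": "repo__pkg-1", "model_patch": "diff ..."},
    "b": {"model_patch": "diff ..."},   # malformed: missing instance_id
    "c": "not-a-dict",                  # malformed: wrong type
}

# Without the `KEY_INSTANCE_ID in p` guard, entry "b" would raise KeyError.
predictions = {
    p[KEY_INSTANCE_ID]: p
    for p in raw_preds.values()
    if isinstance(p, dict) and KEY_INSTANCE_ID in p
}
print(sorted(predictions))  # ['repo__pkg-1'] — only the well-formed entry survives
```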

if isinstance(raw_preds, dict):
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict)}
else:
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds if isinstance(p, dict)}
Contributor


medium

Accessing p[KEY_INSTANCE_ID] directly without checking if the key exists could raise a KeyError. It is safer to verify the key's presence before access.

Suggested change
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds if isinstance(p, dict)}
predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds if isinstance(p, dict) and KEY_INSTANCE_ID in p}

SWEB_CODES.LOCAL_PARQUET_LOAD_FAILED,
f"Failed to load local swebench parquet from {root}: {e}",
)
dataset = self.filter_instances(list(dataset), filter_spec=filter_spec, shuffle=shuffle)
Contributor


medium

Converting the entire dataset to a list using list(dataset) can be very memory-intensive and slow for larger datasets. Consider using the dataset.filter() and dataset.shuffle() methods provided by the datasets library, which are optimized for performance and memory efficiency by operating on the underlying data lazily or in chunks.

Comment on lines +61 to +62
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
re.DOTALL)
Contributor


medium

The think_start_token and think_end_token are used directly in a regular expression. If these tokens contain special regex characters (e.g., (, ), [), the regex will fail or behave unexpectedly. Use re.escape() to ensure they are treated as literal strings.

Suggested change
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
re.DOTALL)
reasoning_regex = re.compile(rf'{re.escape(think_start_token)}(.*?){re.escape(think_end_token)}',
re.DOTALL)
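A small demonstration of the failure mode, using hypothetical tokens containing `[`/`]`; without escaping, the brackets are parsed as character classes and the pattern silently matches the wrong span:

```python
import re

think_start_token, think_end_token = "[THINK]", "[/THINK]"
text = "[THINK]internal reasoning[/THINK]final answer"

# Unescaped: "[THINK]" becomes the character class [THINK], so the
# pattern matches a bogus two-character span instead of the tokens.
bad = re.compile(rf'{think_start_token}(.*?){think_end_token}', re.DOTALL)
print(bad.search(text).group(0))  # 'TH'

# Escaped: the tokens are matched literally, as intended.
good = re.compile(
    rf'{re.escape(think_start_token)}(.*?){re.escape(think_end_token)}',
    re.DOTALL)
cleaned = good.sub("", text)
print(cleaned)  # 'final answer'
```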

Comment on lines +58 to +59
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
re.DOTALL)
Contributor


medium

The think_start_token and think_end_token are used directly in a regular expression. If these tokens contain special regex characters, the regex will fail or behave unexpectedly. Use re.escape() to ensure they are treated as literal strings.

Suggested change
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
re.DOTALL)
reasoning_regex = re.compile(rf'{re.escape(think_start_token)}(.*?){re.escape(think_end_token)}',
re.DOTALL)

@GaoHuaZhang GaoHuaZhang temporarily deployed to smoke-test-approval April 10, 2026 12:08 — with GitHub Actions Inactive
@GaoHuaZhang GaoHuaZhang changed the title Swebench part2 【Feature】add SWE-bench eval task, dataset loader, and summarizer integration Apr 10, 2026
SJTUyh previously approved these changes Apr 10, 2026
@Keithwwa Keithwwa dismissed SJTUyh’s stale review April 10, 2026 13:17

The merge-base changed after approval.

@Keithwwa Keithwwa temporarily deployed to smoke-test-approval April 10, 2026 13:17 — with GitHub Actions Inactive
@SJTUyh SJTUyh merged commit ebf0d0d into AISBench:master Apr 10, 2026
6 checks passed

3 participants