【Feature】add SWE-bench eval task, dataset loader, and summarizer integration #240
SJTUyh merged 3 commits into AISBench:master
Conversation
Code Review
This pull request introduces support for the SWE-bench dataset and evaluation task, including a new summarizer and specific error codes. It refactors configuration management to recursively convert types to strings before dumping and updates the Infer and Eval workers to better handle existing configurations. Additionally, the extract_non_reasoning_content post-processor was refactored to handle list inputs natively. Feedback highlights a redundant assignment in the evaluation worker, potential KeyError exceptions when parsing predictions, and the need to optimize dataset filtering to avoid high memory usage. It is also recommended to escape regex tokens in post-processors to ensure robustness against special characters.
```python
runner_cfg = new_cfg['eval']['runner']
runner_cfg['max_num_workers'] = self.args.max_num_workers
runner_cfg['max_workers_per_gpu'] = self.args.max_workers_per_gpu
runner_cfg['debug'] = runner_cfg['debug'] = self.args.debug or cfg.cli_args.debug
```
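The review summary flags the double assignment on the last line above as redundant. A minimal sketch of the deduplicated form (the `Args` stand-ins below are hypothetical placeholders for `self.args` and `cfg.cli_args` in the diff):

```python
# Sketch: `runner_cfg['debug'] = runner_cfg['debug'] = ...` collapses to a
# single assignment with identical behavior.
runner_cfg = {}

class Args:  # hypothetical stand-in for self.args / cfg.cli_args
    debug = False

args, cli_args = Args(), Args()
cli_args.debug = True

# one assignment is enough; the `or` already merges both debug flags
runner_cfg['debug'] = args.debug or cli_args.debug
```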
```python
with open(pred_path) as f:
    raw_preds = json.load(f)
if isinstance(raw_preds, dict):
    predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict)}
```
Accessing p[KEY_INSTANCE_ID] directly without checking if the key exists could raise a KeyError if the prediction data is malformed. It is safer to verify the key's presence before access.
Suggested change:
```diff
-predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict)}
+predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict) and KEY_INSTANCE_ID in p}
```
```python
if isinstance(raw_preds, dict):
    predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds.values() if isinstance(p, dict)}
else:
    predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds if isinstance(p, dict)}
```
Accessing p[KEY_INSTANCE_ID] directly without checking if the key exists could raise a KeyError. It is safer to verify the key's presence before access.
Suggested change:
```diff
-predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds if isinstance(p, dict)}
+predictions = {p[KEY_INSTANCE_ID]: p for p in raw_preds if isinstance(p, dict) and KEY_INSTANCE_ID in p}
```
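Both review comments above boil down to the same guard. A minimal sketch of a helper that normalizes either input shape (dict or list) while skipping malformed entries; the helper name is hypothetical, and `KEY_INSTANCE_ID` mirrors the constant used in the PR:

```python
KEY_INSTANCE_ID = "instance_id"  # assumed value of the constant from the PR

def load_predictions(raw_preds):
    """Normalize raw predictions into {instance_id: prediction}.

    Entries that are not dicts, or dicts missing the instance-id key,
    are silently skipped instead of raising KeyError.
    """
    items = raw_preds.values() if isinstance(raw_preds, dict) else raw_preds
    return {
        p[KEY_INSTANCE_ID]: p
        for p in items
        if isinstance(p, dict) and KEY_INSTANCE_ID in p
    }
```

This keeps the two branches in the original code identical except for how the iterable is obtained, which is why a single helper can replace both comprehensions.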
```python
    SWEB_CODES.LOCAL_PARQUET_LOAD_FAILED,
    f"Failed to load local swebench parquet from {root}: {e}",
)
dataset = self.filter_instances(list(dataset), filter_spec=filter_spec, shuffle=shuffle)
```
Converting the entire dataset to a list using list(dataset) can be very memory-intensive and slow for larger datasets. Consider using the dataset.filter() and dataset.shuffle() methods provided by the datasets library, which are optimized for performance and memory efficiency by operating on the underlying data lazily or in chunks.
```python
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                             re.DOTALL)
```
The think_start_token and think_end_token are used directly in a regular expression. If these tokens contain special regex characters (e.g., (, ), [), the regex will fail or behave unexpectedly. Use re.escape() to ensure they are treated as literal strings.
Suggested change:
```diff
-reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
-                             re.DOTALL)
+reasoning_regex = re.compile(rf'{re.escape(think_start_token)}(.*?){re.escape(think_end_token)}',
+                             re.DOTALL)
```
```python
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                             re.DOTALL)
```
The think_start_token and think_end_token are used directly in a regular expression. If these tokens contain special regex characters, the regex will fail or behave unexpectedly. Use re.escape() to ensure they are treated as literal strings.
Suggested change:
```diff
-reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
-                             re.DOTALL)
+reasoning_regex = re.compile(rf'{re.escape(think_start_token)}(.*?){re.escape(think_end_token)}',
+                             re.DOTALL)
```
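A self-contained sketch of the escaped pattern in action; the `<think>`/`</think>` token values are assumptions for illustration:

```python
import re

# Tokens may in principle contain regex metacharacters such as ( ) [ ],
# so they are escaped to match literally.
think_start_token = "<think>"
think_end_token = "</think>"

reasoning_regex = re.compile(
    rf"{re.escape(think_start_token)}(.*?){re.escape(think_end_token)}",
    re.DOTALL,
)

text = "<think>chain of thought</think>final answer"
stripped = reasoning_regex.sub("", text)  # drops the reasoning span
```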
f3f0413 → 1c33d30: the merge-base changed after approval.
PR Type
Related Issue
Relates to #N/A
🔍 Motivation
The current benchmark pipeline lacks full eval integration for the SWE-bench scenario, so the existing CLI workflow cannot be reused end-to-end from prediction results to evaluation summaries.
This PR aligns SWE-bench eval capability with the existing task framework, covering data loading, task execution, and result summarization, lowering the barrier to use and enabling large-scale evaluation later.
📝 Modification
- Add `SWEBenchEvalTask`, integrating the official SWE-bench harness for evaluation, supporting: reading prediction results from the `predictions` directory; aggregate metrics (`finish_count`, `resolved_count`, `accuracy`); and detailed result fields (`accuracy`, `submitted_accuracy`, `harness_exit_code`, etc.).
- Add `SWEBenchDataset`, supporting loading from Hugging Face online or from local parquet (a single file or a sharded directory), with `filter_spec` and reproducible `shuffle`.
- Add `SWEBenchSummarizer`, extracting `accuracy` from aggregated evaluation results and feeding it into the unified visualization and summarization flow.
- Register the swebench dataset in `datasets/__init__.py` and the swebench summarizer in `summarizers/__init__.py`.
- Switch the default task type of the `Eval` worker in `cli/workers.py` to `SWEBenchEvalTask` to fit the SWE-bench eval flow.
📐 Associated Test Results
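As a rough sketch of the summarizer's role described above (class and field names here are hypothetical, not the PR's actual implementation):

```python
# Hypothetical sketch: a summarizer that pulls `accuracy` out of the
# aggregated SWE-bench eval results for the unified summary flow.
class SWEBenchSummarizerSketch:
    def summarize(self, results: dict) -> dict:
        # results is assumed to look like:
        # {"finish_count": 10, "resolved_count": 4, "accuracy": 0.4, ...}
        return {"accuracy": results.get("accuracy", 0.0)}

summary = SWEBenchSummarizerSketch().summarize(
    {"finish_count": 10, "resolved_count": 4, "accuracy": 0.4})
```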
No explicit backward-incompatible changes.
This PR adds SWE-bench capability; existing non-SWE-bench datasets and flows are unaffected.
No additional framework-level performance overhead is introduced.
SWE-bench eval itself depends on the harness and container execution, so overall runtime is dominated by the number of test cases and Docker execution time.
🌟 Use cases (Optional)
- Run standardized evaluation on SWE-bench prediction results via `ais_bench` and produce unified result files.
✅ Checklist
Before PR:
After PR:
👥 Collaboration Info