【Feature】Support SWE-Bench benchmark pipeline and Mini SWE Agent integration #241

SJTUyh merged 3 commits into AISBench:master from
Conversation
Code Review
This pull request introduces support for the SWE-bench benchmark, including new dataset loaders, inference and evaluation tasks, and a specialized summarizer. It also refactors configuration management to handle recursive type conversion and provides default configurations for datasets. Additionally, text post-processing for reasoning content was optimized to handle lists more efficiently. Feedback includes correcting inaccurate type hints, fixing dictionary indentation in the evaluation worker, avoiding global random seed side effects, and improving regex compilation efficiency in post-processors.
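The recursive type conversion mentioned in the summary is not shown in this thread. As a rough illustration only (the function name `convert_types` and the coercion rules are hypothetical, not the actual AISBench implementation), such a refactor typically walks nested config structures and coerces string values:

```python
def convert_types(value):
    """Recursively coerce string config values to bool/int/float.

    Hypothetical sketch of the kind of recursive conversion the PR
    description refers to; not the actual AISBench implementation.
    """
    if isinstance(value, dict):
        return {k: convert_types(v) for k, v in value.items()}
    if isinstance(value, list):
        return [convert_types(v) for v in value]
    if isinstance(value, str):
        lowered = value.lower()
        if lowered in ("true", "false"):
            return lowered == "true"
        try:
            return int(value)
        except ValueError:
            pass
        try:
            return float(value)
        except ValueError:
            return value  # not numeric: keep the original string
    return value
```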
```diff
@@ -42,26 +42,30 @@ class Infer(BaseWorker):
     def update_cfg(self, cfg: ConfigDict) -> None:
     def get_task_type() -> str:
```
```python
new_cfg = dict(
    eval=dict(
        partitioner=dict(type=NaivePartitioner),
        runner=dict(
            type=LocalRunner,
        task=dict(type=OpenICLEvalTask),
        ),
    ))
```
The indentation of the task key within the runner dictionary is incorrect, making it appear to be at the same level as runner itself. This reduces code readability and could lead to maintenance errors.
Suggested change:

```python
new_cfg = dict(
    eval=dict(
        partitioner=dict(type=NaivePartitioner),
        runner=dict(
            type=LocalRunner,
            task=dict(type=OpenICLEvalTask),
        ),
    )
)
```
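Note that the mis-indentation is purely cosmetic: inside a `dict(...)` call Python treats all the lines as one continuation, so both layouts build the same nested dict and this is a readability fix rather than a behavioral one. A quick check (using plain strings in place of the AISBench classes):

```python
# Both layouts are continuation lines inside dict(...), so Python
# ignores the indentation; only readability differs.
badly_indented = dict(
    eval=dict(
        partitioner=dict(type="NaivePartitioner"),
        runner=dict(
            type="LocalRunner",
        task=dict(type="OpenICLEvalTask"),
        ),
    ))

well_indented = dict(
    eval=dict(
        partitioner=dict(type="NaivePartitioner"),
        runner=dict(
            type="LocalRunner",
            task=dict(type="OpenICLEvalTask"),
        ),
    )
)

assert badly_indented == well_indented
```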
```python
random.seed(42)
random.shuffle(instances)
```
Calling random.seed(42) here reseeds the module-level random number generator, a global side effect that silently affects any other code in the application that relies on the random module. It is better to use a local random.Random instance for the shuffle.
Suggested change:

```python
rng = random.Random(42)
rng.shuffle(instances)
```
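A minimal sketch of the difference (illustrative only, not AISBench code): seeding the module-level generator changes what every other caller of the `random` module sees, while a local `random.Random` instance gives the same reproducible shuffle without touching global state.

```python
import random

def shuffle_global(instances):
    # Side effect: reseeds the module-level generator for everyone.
    random.seed(42)
    random.shuffle(instances)

def shuffle_local(instances):
    # Self-contained: reproducible shuffle, global state untouched.
    rng = random.Random(42)
    rng.shuffle(instances)

a = list(range(10))
b = list(range(10))
shuffle_global(a)
shuffle_local(b)
# Both use the same Mersenne Twister and the same shuffle algorithm,
# so the same seed yields the same permutation.
assert a == b
```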
```diff
+def _process_single(item: str):
+    # Keep historical behavior for non-string input.
+    if not isinstance(item, str):
+        return item
+    if think_start_token not in item and think_end_token in item:
+        result = item.split(think_end_token)[-1].strip()
+        logger.debug(
+            f"extract_non_reasoning_content: only end token present -> length={len(result)}"
+        )
+        return result
+
+    reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
+                                 re.DOTALL)
+    non_reasoning_content = reasoning_regex.sub('', item).strip()
+    logger.debug(
+        f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
+    )
+    return non_reasoning_content
+
+if isinstance(text, list):
+    logger.debug(
+        f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
+    )
+    return [_process_single(item) for item in text]
+
-# Original behavior for complete tag pairs
-reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
-                             re.DOTALL)
-non_reasoning_content = reasoning_regex.sub('', text).strip()
-logger.debug(f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}")
-return non_reasoning_content
+logger.debug("extract_non_reasoning_content: processing single item")
+return _process_single(text)
```
Compiling the reasoning_regex inside the _process_single function is inefficient when processing a list of texts, as the regex is re-compiled for every item. Move the compilation outside of _process_single to improve performance.
Suggested change:

```python
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                             re.DOTALL)

def _process_single(item: str):
    # Keep historical behavior for non-string input.
    if not isinstance(item, str):
        return item
    if think_start_token not in item and think_end_token in item:
        result = item.split(think_end_token)[-1].strip()
        logger.debug(
            f"extract_non_reasoning_content: only end token present -> length={len(result)}"
        )
        return result
    non_reasoning_content = reasoning_regex.sub('', item).strip()
    logger.debug(
        f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
    )
    return non_reasoning_content

if isinstance(text, list):
    logger.debug(
        f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
    )
    return [_process_single(item) for item in text]

logger.debug("extract_non_reasoning_content: processing single item")
return _process_single(text)
```
```diff
 def _process_single(item: str):
     # Keep historical behavior for non-string input.
     if not isinstance(item, str):
         return item
     if think_start_token not in item and think_end_token in item:
         result = item.split(think_end_token)[-1].strip()
         logger.debug(
             f"extract_non_reasoning_content: only end token present -> length={len(result)}"
         )
         return result

     reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                                  re.DOTALL)
     non_reasoning_content = reasoning_regex.sub('', item).strip()
     logger.debug(
-        f"extract_non_reasoning_content: only end token present -> length={len(result)}"
+        f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
     )
-    return result
+    return non_reasoning_content

-# Original behavior for complete tag pairs
-reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
-                             re.DOTALL)
-non_reasoning_content = reasoning_regex.sub('', text).strip()
-logger.debug(
-    f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
-)
-return non_reasoning_content
+if isinstance(text, list):
+    logger.debug(
+        f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
+    )
+    return [_process_single(item) for item in text]

 logger.debug("extract_non_reasoning_content: processing single item")
 return _process_single(text)
```
The reasoning_regex is being compiled repeatedly inside the _process_single function. For better efficiency, especially when handling lists of strings, compile the regex once outside the helper function.
Suggested change:

```python
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                             re.DOTALL)

def _process_single(item: str):
    # Keep historical behavior for non-string input.
    if not isinstance(item, str):
        return item
    if think_start_token not in item and think_end_token in item:
        result = item.split(think_end_token)[-1].strip()
        logger.debug(
            f"extract_non_reasoning_content: only end token present -> length={len(result)}"
        )
        return result
    non_reasoning_content = reasoning_regex.sub('', item).strip()
    logger.debug(
        f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
    )
    return non_reasoning_content

if isinstance(text, list):
    logger.debug(
        f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
    )
    return [_process_single(item) for item in text]

logger.debug("extract_non_reasoning_content: processing single item")
return _process_single(text)
```
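The reviewers' point generalizes: `re.compile` inside a per-item helper pays the compile cost on every call (Python's internal pattern cache softens this, but each call still hashes the pattern and does a cache lookup). A standalone sketch of the hoisted-compile shape, with hypothetical token values and a simplified `strip_reasoning` in place of the PR's helper:

```python
import re

THINK_START, THINK_END = "<think>", "</think>"

# Compiled once at import time, reused for every item.
REASONING_RE = re.compile(rf"{THINK_START}(.*?){THINK_END}", re.DOTALL)

def strip_reasoning(item):
    """Remove <think>...</think> sections; pass non-strings through."""
    if not isinstance(item, str):
        return item
    # Only an end token present: keep everything after it.
    if THINK_START not in item and THINK_END in item:
        return item.split(THINK_END)[-1].strip()
    return REASONING_RE.sub("", item).strip()

outputs = ["<think>plan</think>answer", "leaked</think>final", 42]
print([strip_reasoning(o) for o in outputs])  # → ['answer', 'final', 42]
```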
PR Type / PR类型
Related Issue | 关联 Issue
Relates to SWE-Bench benchmark adaptation requirements.
🔍 Motivation / 变更动机
Provide AIS Bench with a complete SWE-Bench evaluation capability, covering data loading, task execution, result summarization, and the agent inference chain, reducing manual pipeline stitching and improving reproducibility.
Also complete the related configuration, post-processing, and error-code documentation so that users can run SWE-Bench scenarios more reliably.
📝 Modification / 修改内容
- Added `ais_bench/benchmark/datasets/swebench.py` and exported it in `datasets/__init__.py`.
- Added `ais_bench/benchmark/tasks/swebench/swebench_eval.py` and wired it into the task routing.
- Added `ais_bench/benchmark/tasks/swebench/swebench_infer.py` and `tasks/swebench/utils.py`.
- Added `ais_bench/benchmark/summarizers/swebench.py` and completed the module registration.
- Updated `tests/UT/cli/*` and `tests/UT/utils/test_model_postprocessors.py`, and adjusted `tests/pytest.ini`.

📐 Associated Test Results / 关联测试结果
No explicitly backward-incompatible API changes are introduced; this change mainly adds SWE-Bench capability and compatibility enhancements.
No known performance-regression risk; the main overhead comes from the new SWE-Bench task execution flow, which is expected behavior.
🌟 Use cases (Optional) / 使用案例(可选)
✅ Checklist / 检查列表
Before PR:
After PR:
👥 Collaboration Info / 协作信息
Process integration between `tasks/swebench/*` and `cli/workers.py`.

🌟 Useful CI Command / 实用的CI命令
- `/gemini review`
- `/gemini summary`
- `/gemini help`
- `/readthedocs build`