【Feature】Support SWE-Bench benchmark pipeline and Mini SWE Agent integration #241

SJTUyh merged 3 commits into AISBench:master from
Conversation
Code Review
This pull request introduces support for the SWE-bench benchmark, including new dataset loaders, inference and evaluation tasks, and a specialized summarizer. It also refactors configuration management to handle recursive type conversion and provides default configurations for datasets. Additionally, text post-processing for reasoning content was optimized to handle lists more efficiently. Feedback includes correcting inaccurate type hints, fixing dictionary indentation in the evaluation worker, avoiding global random seed side effects, and improving regex compilation efficiency in post-processors.
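The recursive type conversion mentioned in the summary is not shown in this thread. As a rough illustration only (the function name `convert_types` and the coercion rules are hypothetical, not the actual AISBench implementation), such a refactor typically walks nested config structures and coerces string values:

```python
def convert_types(value):
    """Recursively coerce string config values to bool/int/float.

    Hypothetical sketch of the kind of recursive conversion the PR
    description refers to; not the actual AISBench implementation.
    """
    if isinstance(value, dict):
        return {k: convert_types(v) for k, v in value.items()}
    if isinstance(value, list):
        return [convert_types(v) for v in value]
    if isinstance(value, str):
        lowered = value.lower()
        if lowered in ("true", "false"):
            return lowered == "true"
        try:
            return int(value)
        except ValueError:
            pass
        try:
            return float(value)
        except ValueError:
            return value  # not numeric: keep the original string
    return value
```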
```diff
@@ -42,26 +42,30 @@ class Infer(BaseWorker):
     def update_cfg(self, cfg: ConfigDict) -> None:
     def get_task_type() -> str:
```
```python
new_cfg = dict(
    eval=dict(
        partitioner=dict(type=NaivePartitioner),
        runner=dict(
            type=LocalRunner,
        task=dict(type=OpenICLEvalTask),
        ),
    ))
```
The indentation of the task key within the runner dictionary is incorrect, making it appear to be at the same level as runner itself. This reduces code readability and could lead to maintenance errors.
Suggested change:

```python
new_cfg = dict(
    eval=dict(
        partitioner=dict(type=NaivePartitioner),
        runner=dict(
            type=LocalRunner,
            task=dict(type=OpenICLEvalTask),
        ),
    )
)
```
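Note that the mis-indentation is purely cosmetic: inside a `dict(...)` call Python treats all the lines as one continuation, so both layouts build the same nested dict and this is a readability fix rather than a behavioral one. A quick check (using plain strings in place of the AISBench classes):

```python
# Both layouts are continuation lines inside dict(...), so Python
# ignores the indentation; only readability differs.
badly_indented = dict(
    eval=dict(
        partitioner=dict(type="NaivePartitioner"),
        runner=dict(
            type="LocalRunner",
        task=dict(type="OpenICLEvalTask"),
        ),
    ))

well_indented = dict(
    eval=dict(
        partitioner=dict(type="NaivePartitioner"),
        runner=dict(
            type="LocalRunner",
            task=dict(type="OpenICLEvalTask"),
        ),
    )
)

assert badly_indented == well_indented
```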
```python
random.seed(42)
random.shuffle(instances)
```
Calling random.seed(42) here reseeds the module-level random number generator, a global side effect that silently affects any other code in the application that relies on the random module. It is better to use a local random.Random instance for the shuffle.
Suggested change:

```python
rng = random.Random(42)
rng.shuffle(instances)
```
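A minimal sketch of the difference (illustrative only, not AISBench code): seeding the module-level generator changes what every other caller of the `random` module sees, while a local `random.Random` instance gives the same reproducible shuffle without touching global state.

```python
import random

def shuffle_global(instances):
    # Side effect: reseeds the module-level generator for everyone.
    random.seed(42)
    random.shuffle(instances)

def shuffle_local(instances):
    # Self-contained: reproducible shuffle, global state untouched.
    rng = random.Random(42)
    rng.shuffle(instances)

a = list(range(10))
b = list(range(10))
shuffle_global(a)
shuffle_local(b)
# Both use the same Mersenne Twister and the same shuffle algorithm,
# so the same seed yields the same permutation.
assert a == b
```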
```diff
+def _process_single(item: str):
+    # Keep historical behavior for non-string input.
+    if not isinstance(item, str):
+        return item
+    if think_start_token not in item and think_end_token in item:
+        result = item.split(think_end_token)[-1].strip()
+        logger.debug(
+            f"extract_non_reasoning_content: only end token present -> length={len(result)}"
+        )
+        return result
+
+    reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
+                                 re.DOTALL)
+    non_reasoning_content = reasoning_regex.sub('', item).strip()
+    logger.debug(
+        f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
+    )
+    return non_reasoning_content
+
+if isinstance(text, list):
+    logger.debug(
+        f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
+    )
+    return [_process_single(item) for item in text]
+
-# Original behavior for complete tag pairs
-reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
-                             re.DOTALL)
-non_reasoning_content = reasoning_regex.sub('', text).strip()
-logger.debug(f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}")
-return non_reasoning_content
+logger.debug("extract_non_reasoning_content: processing single item")
+return _process_single(text)
```
Compiling the reasoning_regex inside the _process_single function is inefficient when processing a list of texts, as the regex is re-compiled for every item. Move the compilation outside of _process_single to improve performance.
Suggested change:

```python
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                             re.DOTALL)

def _process_single(item: str):
    # Keep historical behavior for non-string input.
    if not isinstance(item, str):
        return item
    if think_start_token not in item and think_end_token in item:
        result = item.split(think_end_token)[-1].strip()
        logger.debug(
            f"extract_non_reasoning_content: only end token present -> length={len(result)}"
        )
        return result
    non_reasoning_content = reasoning_regex.sub('', item).strip()
    logger.debug(
        f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
    )
    return non_reasoning_content

if isinstance(text, list):
    logger.debug(
        f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
    )
    return [_process_single(item) for item in text]

logger.debug("extract_non_reasoning_content: processing single item")
return _process_single(text)
```
```diff
 def _process_single(item: str):
     # Keep historical behavior for non-string input.
     if not isinstance(item, str):
         return item
     if think_start_token not in item and think_end_token in item:
         result = item.split(think_end_token)[-1].strip()
         logger.debug(
             f"extract_non_reasoning_content: only end token present -> length={len(result)}"
         )
         return result

     reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                                  re.DOTALL)
     non_reasoning_content = reasoning_regex.sub('', item).strip()
     logger.debug(
-        f"extract_non_reasoning_content: only end token present -> length={len(result)}"
+        f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
     )
-    return result
+    return non_reasoning_content

-# Original behavior for complete tag pairs
-reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
-                             re.DOTALL)
-non_reasoning_content = reasoning_regex.sub('', text).strip()
-logger.debug(
-    f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
-)
-return non_reasoning_content
+if isinstance(text, list):
+    logger.debug(
+        f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
+    )
+    return [_process_single(item) for item in text]

 logger.debug("extract_non_reasoning_content: processing single item")
 return _process_single(text)
```
The reasoning_regex is being compiled repeatedly inside the _process_single function. For better efficiency, especially when handling lists of strings, compile the regex once outside the helper function.
Suggested change:

```python
reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                             re.DOTALL)

def _process_single(item: str):
    # Keep historical behavior for non-string input.
    if not isinstance(item, str):
        return item
    if think_start_token not in item and think_end_token in item:
        result = item.split(think_end_token)[-1].strip()
        logger.debug(
            f"extract_non_reasoning_content: only end token present -> length={len(result)}"
        )
        return result
    non_reasoning_content = reasoning_regex.sub('', item).strip()
    logger.debug(
        f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
    )
    return non_reasoning_content

if isinstance(text, list):
    logger.debug(
        f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
    )
    return [_process_single(item) for item in text]

logger.debug("extract_non_reasoning_content: processing single item")
return _process_single(text)
```
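The reviewers' point generalizes: `re.compile` inside a per-item helper pays the compile cost on every call (Python's internal pattern cache softens this, but each call still hashes the pattern and does a cache lookup). A standalone sketch of the hoisted-compile shape, with hypothetical token values and a simplified `strip_reasoning` in place of the PR's helper:

```python
import re

THINK_START, THINK_END = "<think>", "</think>"

# Compiled once at import time, reused for every item.
REASONING_RE = re.compile(rf"{THINK_START}(.*?){THINK_END}", re.DOTALL)

def strip_reasoning(item):
    """Remove <think>...</think> sections; pass non-strings through."""
    if not isinstance(item, str):
        return item
    # Only an end token present: keep everything after it.
    if THINK_START not in item and THINK_END in item:
        return item.split(THINK_END)[-1].strip()
    return REASONING_RE.sub("", item).strip()

outputs = ["<think>plan</think>answer", "leaked</think>final", 42]
print([strip_reasoning(o) for o in outputs])  # → ['answer', 'final', 42]
```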
PR Type / PR类型
Related Issue | 关联 Issue
Relates to SWE-Bench benchmark adaptation requirements.
🔍 Motivation / 变更动机
Provide AIS Bench with a complete SWE-Bench evaluation capability, covering data loading, task execution, result summarization, and the agent inference chain, reducing manual pipeline stitching and improving reproducibility.
Also complete the related configuration, post-processing, and error-code documentation so that users can run SWE-Bench scenarios more reliably.
📝 Modification / 修改内容
- Added `ais_bench/benchmark/datasets/swebench.py` and exported it in `datasets/__init__.py`.
- Added `ais_bench/benchmark/tasks/swebench/swebench_eval.py` and wired it into the task routing.
- Added `ais_bench/benchmark/tasks/swebench/swebench_infer.py` and `tasks/swebench/utils.py`.
- Added `ais_bench/benchmark/summarizers/swebench.py` and completed the module registration.
- Updated `tests/UT/cli/*` and `tests/UT/utils/test_model_postprocessors.py`, and adjusted `tests/pytest.ini`.

📐 Associated Test Results / 关联测试结果
No explicitly backward-incompatible API changes are introduced; this change mainly adds SWE-Bench capability and compatibility enhancements.
No known performance-regression risk; the main overhead comes from the new SWE-Bench task execution flow, which is expected behavior.
🌟 Use cases (Optional) / 使用案例(可选)
✅ Checklist / 检查列表
Before PR:
After PR:
👥 Collaboration Info / 协作信息
Process integration between `tasks/swebench/*` and `cli/workers.py`.

🌟 Useful CI Command / 实用的CI命令
- `/gemini review`
- `/gemini summary`
- `/gemini help`
- `/readthedocs build`