
【Feature】Support SWE-Bench benchmark pipeline and Mini SWE Agent integration#241

Merged
SJTUyh merged 3 commits into AISBench:master from GaoHuaZhang:swebench_part3
Apr 10, 2026

Conversation

@GaoHuaZhang (Collaborator) commented Apr 10, 2026

PR Type

  • Feature
  • Bugfix
  • Docs
  • CI/CD
  • Refactor
  • Perf
  • Dependency
  • Test-Cases
  • Other

Related Issue
Relates to SWE-Bench benchmark adaptation requirements.

🔍 Motivation

Provide AIS Bench with end-to-end SWE-Bench evaluation covering data loading, task execution, result summarization, and the agent inference pipeline, reducing manual stitching of steps and improving reproducibility.
This PR also fills in the related configuration, post-processing, and error-code documentation so that users can run SWE-Bench scenarios more reliably.

📝 Modification

  • Add SWE-Bench dataset loading and registration: ais_bench/benchmark/datasets/swebench.py, exported in datasets/__init__.py.
  • Add the SWE-Bench evaluation task: ais_bench/benchmark/tasks/swebench/swebench_eval.py, wired into task routing.
  • Add the Mini SWE Agent inference task: ais_bench/benchmark/tasks/swebench/swebench_infer.py and tasks/swebench/utils.py.
  • Extend CLI/config and worker logic to support SWE-Bench run parameters and workflow.
  • Add the SWE-Bench summarizer: ais_bench/benchmark/summarizers/swebench.py, with module registration.
  • Update postprocessor and error-code handling, and add Chinese/English FAQ error-code documentation.
  • Add/update unit tests: tests/UT/cli/* and tests/UT/utils/test_model_postprocessors.py, and adjust tests/pytest.ini.

📐 Associated Test Results

  • Added and updated the related unit tests (CLI and postprocessor).
  • CI/Test Report: linked automatically once the PR CI runs.

⚠️ BC-breaking (Optional)

No known backward-incompatible API changes; this PR mainly adds SWE-Bench capability and compatibility enhancements.

⚠️ Performance degradation (Optional)

No known performance regression risk; the main overhead comes from the new SWE-Bench task execution flow, which is expected behavior.

🌟 Use cases (Optional)

  • Run SWE-Bench evaluation tasks through the unified AIS Bench entry point.
  • Generate SWE-Bench inference results with Mini SWE Agent and feed them into the evaluation and summarization flow.
  • Use the new error-code documentation to quickly triage common failure scenarios during debugging.

✅ Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that caused the bug should be added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

👥 Collaboration Info

  • Suggested Reviewers: @benchmark-maintainers
  • Relevant Module Owners: @benchmark-core
  • Other Collaboration Notes: please pay particular attention to how tasks/swebench/* and cli/workers.py fit together in the pipeline.

🌟 Useful CI Commands

  • /gemini review: Performs a Gemini code review of the current pull request in its current state.
  • /gemini summary: Provides a Gemini summary of the current pull request in its current state.
  • /gemini help: Displays the list of available Gemini commands.
  • /readthedocs build: Triggers a Read the Docs documentation build for the current pull request in its current state.

@GaoHuaZhang temporarily deployed to smoke-test-approval on April 10, 2026 at 12:18 with GitHub Actions (now inactive)
@gemini-code-assist (Bot) left a comment:

Code Review

This pull request introduces support for the SWE-bench benchmark, including new dataset loaders, inference and evaluation tasks, and a specialized summarizer. It also refactors configuration management to handle recursive type conversion and provides default configurations for datasets. Additionally, text post-processing for reasoning content was optimized to handle lists more efficiently. Feedback includes correcting inaccurate type hints, fixing dictionary indentation in the evaluation worker, avoiding global random seed side effects, and improving regex compilation efficiency in post-processors.

@@ -42,26 +42,30 @@ class Infer(BaseWorker):
def update_cfg(self, cfg: ConfigDict) -> None:
def get_task_type() -> str:

medium

The return type hint str is no longer accurate as the function now returns class references (OpenICLApiInferTask or OpenICLInferTask) instead of strings. Please update the type hint to type or remove it to avoid confusion.

Suggested change
- def get_task_type() -> str:
+ def get_task_type():
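
The reviewer's point can be sketched as follows. This is a hypothetical stand-in: the two task classes below only mirror the names OpenICLInferTask / OpenICLApiInferTask from the PR and are not the real implementations; the `use_api` parameter is likewise illustrative.

```python
# Stand-ins for the PR's task classes, used only to illustrate the annotation.
class OpenICLInferTask:
    """Stand-in for the local inference task class."""

class OpenICLApiInferTask:
    """Stand-in for the API inference task class."""

def get_task_type(use_api: bool = False) -> type:
    # Returning the class object itself (not its name as a string) lets
    # callers instantiate it or compare it with `is`, so the annotation
    # should be `type` (or `type[BaseTask]` with a common base), not `str`.
    return OpenICLApiInferTask if use_api else OpenICLInferTask

assert get_task_type() is OpenICLInferTask
assert get_task_type(use_api=True) is OpenICLApiInferTask
```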

Comment on lines +295 to +302

        new_cfg = dict(
            eval=dict(
                partitioner=dict(type=NaivePartitioner),
                runner=dict(
                    type=LocalRunner,
                task=dict(type=OpenICLEvalTask),
                ),
            ),
        ))

medium

The indentation of the task key within the runner dictionary is incorrect, making it appear to be at the same level as runner itself. This reduces code readability and could lead to maintenance errors.

Suggested change
        new_cfg = dict(
            eval=dict(
                partitioner=dict(type=NaivePartitioner),
                runner=dict(
                    type=LocalRunner,
                    task=dict(type=OpenICLEvalTask),
                ),
            )
        )

Comment on lines +60 to +61
random.seed(42)
random.shuffle(instances)

medium

Using random.seed(42) resets the global random number generator state, which can lead to non-deterministic behavior in other parts of the application that rely on randomness. It is better to use a local random.Random instance for shuffling.

Suggested change
-        random.seed(42)
-        random.shuffle(instances)
+        rng = random.Random(42)
+        rng.shuffle(instances)
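
The side effect the reviewer describes can be demonstrated in a few lines. This is a minimal sketch, independent of the PR's actual code: a local `random.Random` instance gives a reproducible shuffle while leaving the module-level generator's state untouched.

```python
import random

instances = list(range(10))

# Snapshot the global generator's state, then shuffle with a private,
# locally seeded generator instead of calling random.seed(42).
state_before = random.getstate()
rng = random.Random(42)
rng.shuffle(instances)

# The global RNG was never touched, so unrelated code is unaffected.
assert random.getstate() == state_before

# The shuffle itself is still deterministic: the same seed reproduces it.
again = list(range(10))
random.Random(42).shuffle(again)
assert again == instances
```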

Comment on lines +50 to +76

    def _process_single(item: str):
        # Keep historical behavior for non-string input.
        if not isinstance(item, str):
            return item
        if think_start_token not in item and think_end_token in item:
            result = item.split(think_end_token)[-1].strip()
            logger.debug(
                f"extract_non_reasoning_content: only end token present -> length={len(result)}"
            )
            return result

        reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                                     re.DOTALL)
        non_reasoning_content = reasoning_regex.sub('', item).strip()
        logger.debug(
            f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
        )
        return non_reasoning_content

    if isinstance(text, list):
        logger.debug(
            f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
        )
        return [_process_single(item) for item in text]

    logger.debug("extract_non_reasoning_content: processing single item")
    return _process_single(text)

medium

Compiling the reasoning_regex inside the _process_single function is inefficient when processing a list of texts, as the regex is re-compiled for every item. Move the compilation outside of _process_single to improve performance.

    reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                                 re.DOTALL)

    def _process_single(item: str):
        # Keep historical behavior for non-string input.
        if not isinstance(item, str):
            return item
        if think_start_token not in item and think_end_token in item:
            result = item.split(think_end_token)[-1].strip()
            logger.debug(
                f"extract_non_reasoning_content: only end token present -> length={len(result)}"
            )
            return result

        non_reasoning_content = reasoning_regex.sub('', item).strip()
        logger.debug(
            f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
        )
        return non_reasoning_content

    if isinstance(text, list):
        logger.debug(
            f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
        )
        return [_process_single(item) for item in text]

    logger.debug("extract_non_reasoning_content: processing single item")
    return _process_single(text)
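
The compile-once pattern from the suggestion can be distilled into a self-contained sketch. The token values and the `strip_reasoning` helper below are illustrative, not the project's actual API; `re.escape` is added here as a precaution in case the tokens ever contain regex metacharacters.

```python
import re

# Illustrative token values; the real project may configure different ones.
think_start_token, think_end_token = "<think>", "</think>"

# Compiled a single time at module/function level, then reused for every
# item, instead of rebuilding the pattern inside the per-item helper.
reasoning_regex = re.compile(
    rf"{re.escape(think_start_token)}(.*?){re.escape(think_end_token)}",
    re.DOTALL,
)

def strip_reasoning(items):
    # Non-string items pass through unchanged, mirroring the PR's behavior.
    return [
        reasoning_regex.sub("", s).strip() if isinstance(s, str) else s
        for s in items
    ]

out = strip_reasoning(["<think>plan the fix</think>final answer", 42])
assert out == ["final answer", 42]
```

Note that CPython caches compiled patterns internally, but the f-string pattern is rebuilt on every call before the cache lookup, so hoisting the `re.compile` still avoids per-item work on long lists.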

Comment on lines +47 to +73

    def _process_single(item: str):
        # Keep historical behavior for non-string input.
        if not isinstance(item, str):
            return item
        if think_start_token not in item and think_end_token in item:
            result = item.split(think_end_token)[-1].strip()
            logger.debug(
                f"extract_non_reasoning_content: only end token present -> length={len(result)}"
            )
            return result

        reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                                     re.DOTALL)
        non_reasoning_content = reasoning_regex.sub('', item).strip()
        logger.debug(
            f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
        )
        return non_reasoning_content

    if isinstance(text, list):
        logger.debug(
            f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
        )
        return [_process_single(item) for item in text]

    logger.debug("extract_non_reasoning_content: processing single item")
    return _process_single(text)

medium

The reasoning_regex is being compiled repeatedly inside the _process_single function. For better efficiency, especially when handling lists of strings, compile the regex once outside the helper function.

    reasoning_regex = re.compile(rf'{think_start_token}(.*?){think_end_token}',
                                 re.DOTALL)

    def _process_single(item: str):
        # Keep historical behavior for non-string input.
        if not isinstance(item, str):
            return item
        if think_start_token not in item and think_end_token in item:
            result = item.split(think_end_token)[-1].strip()
            logger.debug(
                f"extract_non_reasoning_content: only end token present -> length={len(result)}"
            )
            return result

        non_reasoning_content = reasoning_regex.sub('', item).strip()
        logger.debug(
            f"extract_non_reasoning_content: removed reasoning sections -> length={len(non_reasoning_content)}"
        )
        return non_reasoning_content

    if isinstance(text, list):
        logger.debug(
            f"extract_non_reasoning_content: processing list of {len(text)} item(s)"
        )
        return [_process_single(item) for item in text]

    logger.debug("extract_non_reasoning_content: processing single item")
    return _process_single(text)

@GaoHuaZhang changed the title from "Swebench part3" to "【Feature】Support SWE-Bench benchmark pipeline and Mini SWE Agent integration" on Apr 10, 2026
@SJTUyh merged commit d675ff1 into AISBench:master on Apr 10, 2026
9 checks passed