【Feature】add SWE-bench example configs and bilingual user guide #191
SJTUyh merged 4 commits into AISBench:master
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the benchmarking framework by integrating full support for SWEBench datasets. It introduces new components for loading, inferring, and evaluating models against SWEBench, alongside improvements to the configuration system for greater flexibility and automation. The changes streamline the process of setting up and running complex benchmarks, particularly for code generation and repair tasks, by abstracting away common configuration patterns and providing robust task execution mechanisms.

Highlights
Activity
Code Review
This pull request adds support for the SWE-Bench benchmark, including new dataset loaders, inference tasks, and evaluation tasks. The changes are extensive and introduce new dependencies like mini-swe-agent. I've found a few critical issues that will prevent the new workflow from running correctly, related to file handling and incorrect assumptions about library functions. There are also some correctness and performance issues that should be addressed. Please see my detailed comments below.
self.model_cfg,
dataset_cfg,
osp.join(self.work_dir, self.output_subdir),
file_extension="json",
SWE-bench output is in JSON format.
preds_path = out_dir / "preds.json"
if preds_path.exists():
    shutil.move(preds_path, out_path)
The process_instance function from mini-swe-agent only creates per-instance trajectory files and does not generate the final aggregated prediction file. You need to add a step to collect the predictions from these trajectory files into a single preds.jsonl file after all instances have been processed. The mini-swe-agent library provides utilities for this. Also, the filename should be preds.jsonl, not preds.json.
Suggested change:
- preds_path = out_dir / "preds.json"
- if preds_path.exists():
-     shutil.move(preds_path, out_path)
+ from minisweagent.run.benchmarks.utils.run_utils import get_predictions_from_trajectories
+ self.logger.info(f"Collecting predictions from trajectories in {out_dir}...")
+ get_predictions_from_trajectories(str(out_dir))
+ preds_path = out_dir / "preds.jsonl"
+ if preds_path.exists():
+     shutil.move(str(preds_path), out_path)
    f"Invalid swebench dataset name, expected one of {list(DATASET_MAPPING.keys())} but got {name}",
)
try:
    dataset = load_dataset("parquet", data_files={split: path})
The function load_dataset("parquet", data_files=...) returns a DatasetDict object, not a Dataset. The subsequent code on line 68, list(dataset), will then operate on the keys of this dictionary (e.g., ['test']) instead of the dataset records, which will cause a TypeError inside filter_instances. You need to select the appropriate split from the DatasetDict before processing it.
Suggested change:
- dataset = load_dataset("parquet", data_files={split: path})
+ dataset = load_dataset("parquet", data_files={split: path})[split]
for dataset_cfg in config["datasets"]:
    if "infer_cfg" not in dataset_cfg:
        logger.debug(f"Filling in infer config for dataset {dataset_cfg['abbr']}")
        dataset_cfg["infer_cfg"] = dict(
            prompt_template=dict(type=get_config_type(PromptTemplate), template="{dummy}"),
            retriever=dict(type=get_config_type(ZeroRetriever)),
            inferencer=dict(type=get_config_type(GenInferencer)),
        )
    if "reader_cfg" not in dataset_cfg:
        logger.debug(f"Filling in reader config for dataset {dataset_cfg['abbr']}")
        dataset_cfg["reader_cfg"] = dict(input_columns=["dummy"], output_column="dummy")
    if "eval_cfg" not in dataset_cfg:
        logger.debug(f"Filling in eval config for dataset {dataset_cfg['abbr']}")
        dataset_cfg["eval_cfg"] = dict(
            evaluator=dict(type=get_config_type(AccEvaluator)),
        )
for i, item in enumerate(cfg):
    cfg[i] = recur_convert_config_type(item) if isinstance(item, (dict, ConfigDict, Config, list)) else item
The recursive call for list items can be simplified. Instead of checking the type of the item before the recursive call, you can just call recur_convert_config_type on every item. The function already handles non-container types by returning them as is.
Suggested change:
- for i, item in enumerate(cfg):
-     cfg[i] = recur_convert_config_type(item) if isinstance(item, (dict, ConfigDict, Config, list)) else item
+ for i, item in enumerate(cfg):
+     cfg[i] = recur_convert_config_type(item)
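The pattern the reviewer points out can be shown in a minimal stdlib sketch (the real function converts registered type names and mutates containers in place; the `convert` leaf callback here is a stand-in for that logic):

```python
def recur_convert_config_type(cfg, convert=lambda x: x):
    """Walk dicts and lists recursively, applying `convert` to every leaf.
    Because the base case returns non-container items unchanged, the list
    branch can recurse unconditionally, with no type check per item."""
    if isinstance(cfg, dict):
        return {k: recur_convert_config_type(v, convert) for k, v in cfg.items()}
    if isinstance(cfg, list):
        return [recur_convert_config_type(item, convert) for item in cfg]
    return convert(cfg)  # base case: scalars pass straight through

cfg = {"models": [{"type": "LiteLLMChat", "batch": 1}]}
result = recur_convert_config_type(
    cfg, convert=lambda v: v.lower() if isinstance(v, str) else v
)
# result == {"models": [{"type": "litellmchat", "batch": 1}]}
```

Letting the base case absorb scalars is what makes the per-item `isinstance` check in the loop redundant.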
    DSET_CODES.DATA_PREPROCESSING_ERROR,
    f"Failed to load swebench dataset {name} from Hugging Face with error: {e}.",
)
dataset = self.filter_instances(list(dataset), filter_spec=filter_spec, shuffle=shuffle)
Converting the entire dataset to a list using list(dataset) can be very memory-intensive, especially for large datasets, and may lead to out-of-memory errors. It is more efficient to use the .filter() method provided by the datasets library, which processes the data in a streaming fashion without loading everything into memory at once. Consider refactoring filter_instances to work directly with Dataset objects.
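The streaming idea can be illustrated with a stdlib-only generator (the `filter_spec`-as-regex semantics and record shape are assumptions for illustration; the `datasets.Dataset.filter` method the reviewer recommends is the better fit for real `Dataset` objects):

```python
import re

def iter_filtered(instances, filter_spec=""):
    """Lazily yield instances whose instance_id matches filter_spec (a regex);
    an empty spec passes everything through. Only one record is held in
    memory at a time, unlike list(dataset)."""
    pattern = re.compile(filter_spec) if filter_spec else None
    for inst in instances:
        if pattern is None or pattern.search(inst["instance_id"]):
            yield inst

sample = [
    {"instance_id": "django__django-11099"},
    {"instance_id": "sympy__sympy-13480"},
]
matched = [i["instance_id"] for i in iter_filtered(sample, "django")]
# matched == ["django__django-11099"]
```

Because the generator never materializes the full dataset, memory use stays constant even on the `full` SWE-bench split.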
Force-pushed from 028a8f5 to 2588e7c
class CustomConfigChecker:
    MODEL_REQUIRED_FIELDS = ['type', 'abbr', 'attr']
    DATASET_REQUIRED_FIELDS = ['type', 'abbr', 'reader_cfg', 'infer_cfg', 'eval_cfg']
[review] 'type' is also optional; this restriction can be lifted and need not be validated.
This does not affect the current execution logic; it can be removed later if needed.
Force-pushed from a37257c to 5e2a11e
class CustomConfigChecker:
    MODEL_REQUIRED_FIELDS = ['type', 'abbr', 'attr']
-   DATASET_REQUIRED_FIELDS = ['type', 'abbr', 'reader_cfg', 'infer_cfg', 'eval_cfg']
+   DATASET_REQUIRED_FIELDS = ['type', 'abbr']
[review] 'type' is also optional; this restriction can be lifted and need not be validated.
    error_msg += f" {error_message_suffix}"
-   raise AISBenchConfigError(UTILS_CODES.INVALID_INTEGER_TYPE, error_msg)
+   raise AISBenchConfigError(
+       UTILS_CODES.INVALID_INTEGER_TYPE, error_msg)
[review] This formatting change is unnecessary; the line is not over the length limit. Just use raise AISBenchConfigError(UTILS_CODES.INVALID_INTEGER_TYPE, error_msg) directly.
if error_message_suffix:
    error_msg += f" {error_message_suffix}"
-   raise AISBenchConfigError(UTILS_CODES.ARGUMENT_TOO_SMALL, error_msg)
+   raise AISBenchConfigError(
[review] This formatting change is unnecessary; the line is not over the length limit. Same below.
if "infer_cfg" not in dataset_cfg:
    logger.debug(f"Filling in infer config for dataset {dataset_cfg['abbr']}")
    dataset_cfg["infer_cfg"] = dict(
        prompt_template=dict(type=get_config_type(PromptTemplate), template="{dummy}"),
[review] Scenarios without custom infer_cfg, reader_cfg, or eval_cfg configurations rarely need these fields anyway, so there is no need to supply default values.
        f"list_decorator({func.__name__}): processing single item"
    )
    return func(text_or_list, *args, **kwargs)
return wrapper
The decorator approach is more elegant and need not be changed; just add the functools.wraps decorator to preserve the metadata.
To preserve the decorated function's metadata (function name, docstring, parameter signature, etc.), the key is to bind the original function's metadata to the decorator's wrapper function; Python's standard library functools.wraps exists for exactly this purpose.
Complete modified code:
from functools import wraps  # the core tool
import logging

# Configure logging (optional, for testing)
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

def list_decorator(func):
    """Decorator: make the function able to handle list input"""
    @wraps(func)  # key step: preserve the original function's metadata
    def wrapper(text_or_list, *args, **kwargs):
        if isinstance(text_or_list, list):
            logger.debug(
                f"list_decorator({func.__name__}): processing list of {len(text_or_list)} item(s)"
            )
            return [func(text, *args, **kwargs) for text in text_or_list]
        logger.debug(
            f"list_decorator({func.__name__}): processing single item"
        )
        return func(text_or_list, *args, **kwargs)
    return wrapper  # fixes the typo in the original code: wrappe → wrapper

Key changes:
1. Import functools.wraps: this built-in standard tool is designed for preserving the original function's metadata in decorators.
2. Add the @wraps(func) decorator to wrapper: this line copies all of the decorated function func's metadata (__name__, __doc__, __module__, parameter signature, etc.) onto wrapper.
type="LiteLLMChat",
model="",
api_key="EMPTY",
url="http://127.0.0.1:8000/v1", #API base, e.g. http://127.0.0.1:8000/v1
[review] Add a space after the '#' comment marker. Also, the meaning of url here differs from that in regular API model configs, so the comment should explain in more detail: http://127.0.0.1:8000/v1 actually accesses http://127.0.0.1:8000/v1/chat/completions.
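A possible way to address this comment would be a config fragment along these lines (the comment wording is a suggestion, not the merged text):

```python
models = [
    dict(
        type="LiteLLMChat",
        model="",
        api_key="EMPTY",
        # API base for the agent's LiteLLM client; unlike the url of regular
        # API models, requests go to {url}/chat/completions, so
        # "http://127.0.0.1:8000/v1" ends up calling
        # http://127.0.0.1:8000/v1/chat/completions
        url="http://127.0.0.1:8000/v1",
    )
]
```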
dict(
    attr="local",
    abbr="swebench",
    type="LiteLLMChat",
[review] If type is present, could this follow convention and use the imported concrete class rather than the registered signature? Mainly so one can jump directly to the class definition; besides, the type in the dataset config already uses the class directly.
from ais_bench.benchmark.datasets import SWEBenchDataset
from ais_bench.benchmark.partitioners import NaivePartitioner
from ais_bench.benchmark.runners import LocalRunner
from ais_bench.benchmark.tasks import SWEBenchInferTask, SWEBenchEvalTask
from ais_bench.benchmark.summarizers import SWEBenchSummarizer

STEP_LIMIT = 200

models = [
    dict(
        attr="local",
        abbr="swebench",
        type="LiteLLMChat",
        model="",
        api_key="EMPTY",
        url="http://127.0.0.1:8000/v1",  # API base, e.g. http://127.0.0.1:8000/v1
        batch_size=1,
        generation_kwargs=dict(),
    )
]

datasets = [
    dict(
        type=SWEBenchDataset,
        abbr="swebench_lite",
        # Relative to AIS_BENCH_DATASETS_CACHE (default: project root); missing -> HF download
        path="",
        name="lite",
        split="test",
        filter_spec="",
        shuffle=False,
        step_limit=STEP_LIMIT,
    ),
]

summarizer = dict(
    attr="accuracy",
    type=SWEBenchSummarizer,
)

infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        task=dict(type=SWEBenchInferTask),
    ),
)

eval = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        task=dict(type=SWEBenchEvalTask),
    ),
)
[review] All four SWE-Bench subsets use the same agent and dataset class, and apart from the dataset abbr and name the config files are complete duplicates; abbr is itself derived from name. We suggest merging the four config files into one and letting users choose the subset. With four separate config files, the model also has to be configured four times, which is more cumbersome.
dataset_type = "lite" # choose from ["verified", "lite", "full", "multilingual"]
datasets = [
dict(
type=SWEBenchDataset,
abbr=f"swebench_{dataset_type}",
# Relative to AIS_BENCH_DATASETS_CACHE (default: project root); missing -> HF download
path="",
name=dataset_type,
split="test",
filter_spec="",
shuffle=False,
step_limit=STEP_LIMIT,
),
]
PR Type

Related Issue
N/A (if applicable, change to: Fixes # / Relates to #)

🔍 Motivation
Provide out-of-the-box example configs for running SWE-bench in ais_bench (mini-swe-agent inference + SWE-bench harness evaluation), together with Chinese and English documentation, to lower the cost of first-time integration and troubleshooting.

📝 Modification
- Added example configs under ais_bench/configs/swe_bench_examples/ (they differ only in name/abbr):
  - mini_swe_agent_swe_bench_lite.py (lite)
  - mini_swe_agent_swe_bench_verified.py (verified)
  - mini_swe_agent_swe_bench_full.py (full)
  - mini_swe_agent_swe_bench_multilingual.py (multilingual)
- The configs use SWEBenchDataset, SWEBenchInferTask/SWEBenchEvalTask, SWEBenchSummarizer, NaivePartitioner, and LocalRunner; defaults are step_limit=200 and path="" for convenient online loading from Hugging Face; users must fill in model-side fields such as model/url/api_key.
- Added README_en.md and README_zh_cn.md, covering the capability overview, dependencies (mini-swe-agent, SWE-bench harness, Docker), minimal configuration, run commands (all/infer/eval, --reuse), output directories and metrics, and common SWEB-* error codes with FAQ references.

📐 Associated Test Results
CI links or local smoke-test results to be added (e.g. run ais_bench ais_bench/configs/swe_bench_examples/mini_swe_agent_swe_bench_lite.py from the repository root).
Breaking changes: No. This PR only adds files; there are no breaking changes.
Other notes: None.

🌟 Use cases (Optional)
- Use the lite config to quickly validate the inference and evaluation pipeline, then switch to verified/full/multilingual.
- Inference only: ais_bench .../mini_swe_agent_swe_bench_lite.py -m infer; evaluate existing predictions with -m eval; add --reuse to resume after an interruption.
- For details see README_zh_cn.md / README_en.md.

✅ Checklist

Before PR:

After PR:

👥 Collaboration Info

🌟 Useful CI Command
/gemini review, /gemini summary, /gemini help, /readthedocs build