Release Note: v3.1‐20260630‐master
Hanye edited this page Jun 30, 2026
Diff range: `v3.1-20260330-master..HEAD`
- Agent Benchmark Ecosystem: Built end-to-end support for multiple mainstream agent benchmarks, including SWE-Bench, SWE-Bench Pro, τ²-Bench, and Terminal-Bench 2.0 (Harbor), providing dataset loaders, inferencers, summarizers, example configs and bilingual user guides.
- Video & Image Generation Evaluation: Added VBench 1.0 Video Quality Evaluation Pipeline and OneIG-Benchmark (EN/ZH) for text-to-image multi-dimensional evaluation (alignment, text rendering, reasoning, style, diversity), covering both generation pipelines and judging pipelines.
- Multimodal & Reasoning Benchmarks: Newly integrated HLE, RealWorldQA, MathVision, AIME 2026, and the RefCOCO / RefCOCO+ / RefCOCOG grounding benchmark family.
- Mini Subset Support: Provided mini subsets for SWE-Bench, τ²-Bench and VBench to significantly reduce evaluation cost for quick validation.
- Multi-architecture Docker & PyPI Release: AISBench Docker images now support both x86_64 and aarch64 (Ubuntu 22.04/24.04, openEuler 22.03/24.03 × Python 3.10/3.11/3.12), and the package has been published to PyPI; install via `pip install ais_bench_benchmark` or `pip install ais_bench_benchmark[full]`.
- New Models: Added Vita generate-chat model backend.
- Dataset: Added OneIG benchmark (EN/ZH) for text-to-image multi-dimensional evaluation — covering alignment, text rendering, reasoning, style and diversity dimensions, with LLM-as-Judge + dedicated small-model hybrid judging. (#361)(#368)(#364)
- Dataset: Added SWE-Bench Pro dataset with `full` / `mini` subsets, supporting long-horizon software engineering agent evaluation. (#333)
- Dataset: Added VBench 1.0 Video Quality Evaluation Pipeline (Part 1 / Part 2 / third-party & license), covering subject consistency, motion smoothness, temporal flickering, dynamic degree, aesthetic quality, imaging quality, object class, color, spatial relationship, scene, overall consistency, human action and multiple objects. (#273)(#270)(#152)
- Dataset: Added τ²-Bench dataset and mini subset, supporting multi-turn dialogue agent evaluation in dual-control environments. (#249)
- Dataset: Added Terminal-Bench 2.0 (Harbor) dataset and mini subset, supporting terminal-based agent evaluation. (#318)(#319)(#320)(#321)
- Dataset: Added SWE-Bench dataset with mini subset, supporting software engineering agent evaluation. (#240)(#241)
- Dataset: Added HLE dataset, supporting high-difficulty reasoning and knowledge benchmark evaluation. (#301)
- Dataset: Added RealWorldQA dataset, supporting real-world image QA evaluation. (#268)
- Dataset: Added MathVision dataset, supporting mathematical visual reasoning evaluation. (#264)
- Dataset: Added AIME 2026 dataset, supporting latest competition math evaluation. (#274)
- Dataset: Added RefCOCO / RefCOCO+ / RefCOCOG referring expression grounding benchmark family. (#201)
- Dataset: Added `verified_mini` dataset mapping and example configs for several benchmarks, reducing evaluation cost. (#271)
- Model: Added Vita generate-chat model backend. (#237)
- Feature: Built end-to-end SWE-Bench benchmark pipeline, integrating dataset loader, infer task, eval task and summarizer; integrated Mini SWE Agent as the inferencer. (#241)(#240)
- Feature: Provided SWE-Bench example configs and bilingual (EN/ZH) user guide for quick onboarding. (#191)
- Feature: Provided τ²-Bench example scripts, dependency declarations and docs to support one-shot evaluation launch. (#249)
- Feature: Provided Terminal-Bench 2.0 (Harbor) example configs and scripts with mini dataset support. (#318)(#319)(#320)(#321)
- Feature: Provided OneIG example configs, evaluation examples and bilingual docs. (#361)(#379)
- Feature: Provided VBench 1.0 example configs, evaluation examples, dependency caching guidance and bilingual docs. (#270)(#273)
- Feature: Provided RefCOCO / RefCOCO+ / RefCOCOG example configs (vLLM API / local) and bilingual docs. (#201)(#277)
- Feature: SWE-Bench LiteLLM inference default timeout set to 200s to avoid premature termination on long-horizon tasks. (#383)
- Feature: SWE-Bench example configs now expose `temperature`, `top_k`, `top_p` generation kwargs for fine-grained control. (#377)
- Feature: τ²-Bench example configs support `llm_call_kwargs` for custom LLM call parameters. (#366)
- Feature: Provided Agentic Coding evaluation scheme design documentation, guiding users to build agent benchmarks on AISBench. (#292)
- Feature: Added judge-model-based evaluation guide (`judge_model_evaluate`) and bilingual docs. (#225)
- Feature: Added error-code documentation (EN/ZH), allowing users to quickly locate solutions via error code URLs. (#225)
- Feature: Support multi-architecture Docker images (x86_64 / aarch64) with multiple base OS and Python versions. (#332)
- Feature: Added Docker `OVERVIEW` docs (EN/ZH), Dockerfile build scripts and label / exception logging for better diagnostics. (#339)(#340)(#351)
- Fix: Fixed MMMU inference failing to extract the selected option correctly; added a postprocessor that extracts the last option letter. (#238)
- Fix: Fixed multi-modal scenario where `input_tokens` was 0, which affected performance metric accuracy. (#299)
- Fix: Fixed incorrect path `agent_examples` → `agent_example` in τ²-Bench documentation. (#312)
- Fix: Fixed τ²-Bench metrics display under pass^k evaluation scenarios. (#272)
- Fix: Fixed multilingual mini dataset config issues. (#305)
- Fix: Fixed VBench burstiness division-by-zero issue. (#357)
- Fix: Fixed VBench evaluation issues. (#363)
- Fix: Fixed aiohttp sessions not respecting HTTP proxy environment variables by adding `trust_env=True`. (#367)
- Fix: Fixed `urljoin` dropping the last path segment when a custom URL lacks a trailing slash. (#358)
- Fix: Fixed HumanEvalX file path join error (issue #139). (#370)
- Fix: Fixed HuggingFace `trust_remote_code` not taking effect in some model backends. (#226)
- Fix: Fixed transformers version compatibility issues. (#261)
- Fix: Fixed postprocess bug affecting result accuracy. (#263)
- Fix: Fixed MMMU dataset result inconsistency. (#265)
- Fix: Fixed HLE README task name from `hle` to `hle_llmjudge`. (#354)
- Fix: Removed duplicate content in SWE-Bench Pro README. (#355)
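
The trailing-slash fix listed above concerns a documented behaviour of Python's `urllib.parse.urljoin`: when the base URL does not end in `/`, its last path segment is treated like a file name and is replaced by the relative path. The sketch below reproduces the effect and shows one common guard. The host name, the `v1` prefix and the `join_endpoint` helper are illustrative assumptions, not code taken from AISBench.

```python
from urllib.parse import urljoin


def join_endpoint(base_url: str, path: str) -> str:
    """Join a relative endpoint onto base_url, keeping the base's last path segment.

    Hypothetical helper for illustration; not the AISBench implementation.
    """
    # Without a trailing slash, urljoin replaces the last segment of the base path.
    if not base_url.endswith("/"):
        base_url += "/"
    # A leading slash on `path` would reset to the host root, so strip it.
    return urljoin(base_url, path.lstrip("/"))


# Plain urljoin drops "v1" because the base has no trailing slash.
naive = urljoin("http://localhost:8000/v1", "chat/completions")
# The guarded helper keeps it.
fixed = join_endpoint("http://localhost:8000/v1", "chat/completions")
print(naive)  # http://localhost:8000/chat/completions
print(fixed)  # http://localhost:8000/v1/chat/completions
```

Normalising the base URL once, at the point where user configuration is read, keeps every later join consistent regardless of how the URL was typed.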
- Refactor: Improved multi-modal benchmark evaluation flow for Vita model, including datagen and postprocess. (#237)
- Refactor: Added util adapter module to unify model / task utility interfaces. (#239)
- Refactor: Support `PreTrainedTokenizerFast` for DS-V3-2 model. (#330)
- Refactor: Updated textvqa.py to align with new dataset interface. (#230)
- Refactor: Unified `verified_mini` dataset mapping and example configuration approach. (#271)
- Infrastructure: Refactored Dockerfile build pipeline, added `Dockerfile.py310/3.11/3.12` × `Ubuntu 22.04/24.04, openEuler 22.03/24.03` build matrix. (#332)(#339)
- Infrastructure: Added Docker `OVERVIEW` documentation (EN/ZH) and standardized image build scripts. (#340)
- Infrastructure: Added Docker container logging for exceptions and standard labels for traceability. (#351)
- CI/CD: Published AISBench to PyPI as `ais_bench_benchmark` (with optional `[full]` extra), added `upload_pypi.sh` and updated `setup.py` for PyPI build/distribution. (#344)
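
After installing from PyPI, the Python standard library can confirm which version of the package is present. This is a minimal sketch that assumes only the distribution name `ais_bench_benchmark` given above; the `installed_version` helper is illustrative and not part of AISBench.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "ais_bench_benchmark") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("ais_bench_benchmark is not installed; try: pip install ais_bench_benchmark")
else:
    print(f"ais_bench_benchmark {found} is installed")
```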
- Docs: Added SWE-Bench bilingual user guide and example configs. (#191)(#308)
- Docs: Added τ²-Bench user guide (EN/ZH) and test-case docs. (#250)(#251)
- Docs: Added Terminal-Bench 2.0 (Harbor) user guide (EN/ZH). (#318)
- Docs: Added SWE-Bench Pro dataset documentation (EN/ZH). (#334)
- Docs: Added VBench 1.0 user guide (EN/ZH) and dependency caching guide. (#270)
- Docs: Added OneIG-Benchmark documentation (EN/ZH). (#379)
- Docs: Added RefCOCO / RefCOCO+ / RefCOCOG user guide (EN/ZH). (#277)
- Docs: Added AIME 2026 user guide (EN/ZH). (#289)
- Docs: Added RealWorldQA official materials and dataset docs (EN/ZH). (#268)(#290)
- Docs: Added MathVision user guide (EN/ZH). (#288)
- Docs: Added mini dataset guide (EN/ZH) covering SWE-Bench / τ²-Bench / VBench mini subsets. (#302)
- Docs: Added Agentic Coding evaluation scheme design document. (#292)
- Docs: Updated readthedocs base URL and pre-release docs (20260415). (#225)(#252)
- Docs: Fixed `Error in user YAML: (<unknown>)` issue in documentation. (#293)
- Docs: Assigned omnidocbench dataset to v1.5 in docs. (#371)
- Docs: Updated datasets.md and install.md (EN/ZH) to reflect the new datasets, Docker images and PyPI installation. (#225)
- Docs: Updated main README status and fixed remaining documentation issues. (#381)
- Test: Added UT coverage for SWE-Bench Pro (infer / utils / eval / summarizer). (#335)(#336)(#337)(#338)
- Test: Added UT coverage for OneIG evaluation (alignment / text / reasoning / style / diversity / eval utils). (#372)(#373)(#374)
- Test: Added UT coverage for core datasets (AIME, AIME 2026, RealWorldQA, Math, GSM8K, GPQA, DAPO-Math). (#315)
- Test: Added UT coverage for HLE dataset. (#301)
- Test: Added UT coverage for MathVision dataset. (#288)
- Test: Added UT coverage for MMMU dataset. (#325)
- Test: Added UT coverage for MMMU-Pro and MMStar datasets. (#326)
- Test: Added UT coverage for RefCOCO / RefCOCO+ / RefCOCOG datasets. (#276)
- Test: Added UT coverage for RealWorldQA dataset. (#287)
- Test: Added UT coverage for SWE-Bench summarizer. (#316)
- Test: Added UT coverage for VBench dataset and summarizer. (#273)
- Test: Added UT coverage for Harbor task and summarizer, supporting Terminal-Bench 2.0 testing. (#318)
- Test: Added UT coverage for τ²-Bench custom task. (#249)
- Test: Added `bbox_iou_evaluator` UT for visual grounding evaluation. (#315)
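
Visual grounding benchmarks such as the RefCOCO family are commonly scored by the intersection-over-union (IoU) between a predicted box and a reference box, with a prediction often counted as correct at IoU ≥ 0.5. The sketch below shows that computation for `(x1, y1, x2, y2)` boxes. The function name, box format and threshold are assumptions for illustration and may differ from the `bbox_iou_evaluator` in the repository.

```python
from typing import Sequence


def bbox_iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter
    # Degenerate boxes have zero union; report 0.0 rather than dividing by zero.
    return inter / union if union > 0 else 0.0


pred = (10, 10, 50, 50)   # hypothetical predicted box
gold = (30, 30, 70, 70)   # hypothetical reference box
score = bbox_iou(pred, gold)
print(round(score, 4), score >= 0.5)  # 0.1429 False
```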