Release Note: v3.1‐20260630‐master

Hanye edited this page Jun 30, 2026 · 1 revision

Diff range: v3.1-20260330-master..HEAD

English Version

🌟 Highlights

  1. Agent Benchmark Ecosystem: Built end-to-end support for multiple mainstream agent benchmarks, including SWE-Bench, SWE-Bench Pro, τ²-Bench, and Terminal-Bench 2.0 (Harbor), providing dataset loaders, inferencers, summarizers, example configs and bilingual user guides.
  2. Video & Image Generation Evaluation: Added VBench 1.0 Video Quality Evaluation Pipeline and OneIG-Benchmark (EN/ZH) for text-to-image multi-dimensional evaluation (alignment, text rendering, reasoning, style, diversity), covering both generation pipelines and judging pipelines.
  3. Multimodal & Reasoning Benchmarks: Newly integrated HLE, RealWorldQA, MathVision, AIME 2026, and the RefCOCO / RefCOCO+ / RefCOCOG grounding benchmark family.
  4. Mini Subset Support: Provided mini subsets for SWE-Bench, τ²-Bench and VBench to significantly reduce evaluation cost for quick validation.
  5. Multi-architecture Docker & PyPI Release: AISBench Docker images now support both x86_64 and aarch64 (Ubuntu 22.04/24.04, openEuler 22.03/24.03 × Python 3.10/3.11/3.12), and the package has been published to PyPI. Install it with pip install ais_bench_benchmark, or with pip install ais_bench_benchmark[full] for the optional full extra.
  6. New Models: Added Vita generate-chat model backend.

🚀 New Features

Datasets

  • Dataset: Added OneIG benchmark (EN/ZH) for text-to-image multi-dimensional evaluation — covering alignment, text rendering, reasoning, style and diversity dimensions, with LLM-as-Judge + dedicated small-model hybrid judging. (#361)(#368)(#364)
  • Dataset: Added SWE-Bench Pro dataset with full / mini subsets, supporting long-horizon software engineering agent evaluation. (#333)
  • Dataset: Added VBench 1.0 Video Quality Evaluation Pipeline (Part 1 / Part 2 / third-party & license), covering subject consistency, motion smoothness, temporal flickering, dynamic degree, aesthetic quality, imaging quality, object class, color, spatial relationship, scene, overall consistency, human action and multiple objects. (#273)(#270)(#152)
  • Dataset: Added τ²-Bench dataset and mini subset, supporting multi-turn dialogue agent evaluation in dual-control environments. (#249)
  • Dataset: Added Terminal-Bench 2.0 (Harbor) dataset and mini subset, supporting terminal-based agent evaluation. (#318)(#319)(#320)(#321)
  • Dataset: Added SWE-Bench dataset with mini subset, supporting software engineering agent evaluation. (#240)(#241)
  • Dataset: Added HLE dataset, supporting high-difficulty reasoning and knowledge benchmark evaluation. (#301)
  • Dataset: Added RealWorldQA dataset, supporting real-world image QA evaluation. (#268)
  • Dataset: Added MathVision dataset, supporting mathematical visual reasoning evaluation. (#264)
  • Dataset: Added AIME 2026 dataset, supporting latest competition math evaluation. (#274)
  • Dataset: Added RefCOCO / RefCOCO+ / RefCOCOG referring expression grounding benchmark family. (#201)
  • Dataset: Added verified_mini dataset mapping and example configs for several benchmarks, reducing evaluation cost. (#271)
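
Referring expression grounding benchmarks such as the RefCOCO family are commonly scored by comparing a predicted bounding box with the reference box using intersection over union (IoU). The sketch below illustrates that idea only; it is not AISBench's bbox_iou_evaluator, and the function names, the (x1, y1, x2, y2) box layout and the 0.5 threshold are assumptions chosen for illustration.

```python
from typing import Sequence


def bbox_iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter

    # Degenerate boxes have zero union; report 0.0 instead of dividing by zero.
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds, golds, threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the reference meets the threshold."""
    if not golds:
        return 0.0
    hits = sum(bbox_iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds)
```

Returning 0.0 for a zero-area union keeps a malformed prediction from raising an exception mid-evaluation; it simply counts as a miss.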

Models

  • Model: Added Vita generate-chat model backend. (#237)

Features

  • Feature: Built end-to-end SWE-Bench benchmark pipeline, integrating dataset loader, infer task, eval task and summarizer; integrated Mini SWE Agent as the inferencer. (#241)(#240)
  • Feature: Provided SWE-Bench example configs and bilingual (EN/ZH) user guide for quick onboarding. (#191)
  • Feature: Provided τ²-Bench example scripts, dependency declarations and docs to support one-shot evaluation launch. (#249)
  • Feature: Provided Terminal-Bench 2.0 (Harbor) example configs and scripts with mini dataset support. (#318)(#319)(#320)(#321)
  • Feature: Provided OneIG example configs, evaluation examples and bilingual docs. (#361)(#379)
  • Feature: Provided VBench 1.0 example configs, evaluation examples, dependency caching guidance and bilingual docs. (#270)(#273)
  • Feature: Provided RefCOCO / RefCOCO+ / RefCOCOG example configs (vLLM API / local) and bilingual docs. (#201)(#277)
  • Feature: SWE-Bench LiteLLM inference default timeout set to 200s to avoid premature termination on long-horizon tasks. (#383)
  • Feature: SWE-Bench example configs now expose temperature, top_k, top_p generation kwargs for fine-grained control. (#377)
  • Feature: τ²-Bench example configs support llm_call_kwargs for custom LLM call parameters. (#366)
  • Feature: Provided Agentic Coding evaluation scheme design documentation, guiding users to build agent benchmarks on AISBench. (#292)
  • Feature: Added judge-model-based evaluation guide (judge_model_evaluate) and bilingual docs. (#225)
  • Feature: Added error-code documentation (EN/ZH), allowing users to quickly locate solutions via error code URLs. (#225)
  • Feature: Added support for multi-architecture Docker images (x86_64 / aarch64) with multiple base OS and Python versions. (#332)
  • Feature: Added Docker OVERVIEW docs (EN/ZH), Dockerfile build scripts and label / exception logging for better diagnostics. (#339)(#340)(#351)

🐛 Bug Fixes

  • Fix: Fixed MMMU inference where option content could not be correctly extracted — added a postprocessor to extract the last option letter. (#238)
  • Fix: Fixed input_tokens being reported as 0 in multi-modal scenarios, which affected performance metric accuracy. (#299)
  • Fix: Fixed an incorrect path in τ²-Bench documentation (agent_examples corrected to agent_example). (#312)
  • Fix: Fixed τ²-Bench metrics display under pass^k evaluation scenarios. (#272)
  • Fix: Fixed multilingual mini dataset config issues. (#305)
  • Fix: Fixed VBench burstiness division-by-zero issue. (#357)
  • Fix: Fixed VBench evaluation issues. (#363)
  • Fix: Fixed aiohttp sessions not respecting HTTP proxy environment variables — added trust_env=True. (#367)
  • Fix: Fixed an issue where a custom URL without a trailing slash caused urljoin to drop the last path segment. (#358)
  • Fix: Fixed HumanEvalX file path join error (issue #139). (#370)
  • Fix: Fixed HuggingFace trust_remote_code not taking effect in some model backends. (#226)
  • Fix: Fixed transformers version compatibility issues. (#261)
  • Fix: Fixed postprocess bug affecting result accuracy. (#263)
  • Fix: Fixed MMMU dataset result inconsistency. (#265)
  • Fix: Fixed HLE README task name from hle to hle_llmjudge. (#354)
  • Fix: Removed duplicate content in SWE-Bench Pro README. (#355)
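
The trailing-slash fix above concerns standard urljoin behavior: when the base URL does not end with a slash, its last path segment is treated as a file name and replaced by the relative path. The following is a minimal sketch of the pitfall and one possible guard; join_endpoint and the example URLs are hypothetical and are not taken from the AISBench code.

```python
from urllib.parse import urljoin


def join_endpoint(base_url: str, endpoint: str) -> str:
    """Join endpoint onto base_url while keeping every path segment of base_url."""
    # Ensure the base ends with "/" so urljoin treats its last segment as a directory.
    base = base_url if base_url.endswith("/") else base_url + "/"
    # Strip a leading "/" so the endpoint stays relative instead of replacing the base path.
    return urljoin(base, endpoint.lstrip("/"))


# Without a trailing slash, "v1" is dropped:
dropped = urljoin("http://localhost:8080/v1", "chat/completions")
# dropped == "http://localhost:8080/chat/completions"

# With the guard, the full path is preserved:
kept = join_endpoint("http://localhost:8080/v1", "chat/completions")
# kept == "http://localhost:8080/v1/chat/completions"
```

Normalizing the base once at the point where the URL is configured is usually simpler than asking users to remember the trailing slash.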

⚙️ Optimizations and Refactoring

  • Refactor: Improved multi-modal benchmark evaluation flow for Vita model, including datagen and postprocess. (#237)
  • Refactor: Added util adapter module to unify model / task utility interfaces. (#239)
  • Refactor: Added PreTrainedTokenizerFast support for the DS-V3-2 model. (#330)
  • Refactor: Updated textvqa.py to align with new dataset interface. (#230)
  • Refactor: Unified verified_mini dataset mapping and example configuration approach. (#271)

🏗️ Infrastructure Refactoring

  • Infrastructure: Refactored the Dockerfile build pipeline and added a build matrix of Dockerfile.py310/3.11/3.12 × Ubuntu 22.04/24.04 and openEuler 22.03/24.03. (#332)(#339)
  • Infrastructure: Added Docker OVERVIEW documentation (EN/ZH) and standardized image build scripts. (#340)
  • Infrastructure: Added Docker container logging for exceptions and standard labels for traceability. (#351)

🔄 CI/CD Optimizations

  • CI/CD: Published AISBench to PyPI as ais_bench_benchmark (with optional [full] extra), added upload_pypi.sh and updated setup.py for PyPI build/distribution. (#344)
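
After installing from PyPI, one way to confirm which release is present in the current environment is to query the installed distribution metadata. This is a minimal sketch using only the standard library; the distribution name comes from the note above, while the helper name installed_version is an assumption for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("ais_bench_benchmark")
    print(version or "ais_bench_benchmark is not installed")
```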

📚 Documentation

  • Docs: Added SWE-Bench bilingual user guide and example configs. (#191)(#308)
  • Docs: Added τ²-Bench user guide (EN/ZH) and test-case docs. (#250)(#251)
  • Docs: Added Terminal-Bench 2.0 (Harbor) user guide (EN/ZH). (#318)
  • Docs: Added SWE-Bench Pro dataset documentation (EN/ZH). (#334)
  • Docs: Added VBench 1.0 user guide (EN/ZH) and dependency caching guide. (#270)
  • Docs: Added OneIG-Benchmark documentation (EN/ZH). (#379)
  • Docs: Added RefCOCO / RefCOCO+ / RefCOCOG user guide (EN/ZH). (#277)
  • Docs: Added AIME 2026 user guide (EN/ZH). (#289)
  • Docs: Added RealWorldQA official materials and dataset docs (EN/ZH). (#268)(#290)
  • Docs: Added MathVision user guide (EN/ZH). (#288)
  • Docs: Added mini dataset guide (EN/ZH) covering SWE-Bench / τ²-Bench / VBench mini subsets. (#302)
  • Docs: Added Agentic Coding evaluation scheme design document. (#292)
  • Docs: Updated readthedocs base URL and pre-release docs (20260415). (#225)(#252)
  • Docs: Fixed the "Error in user YAML: (<unknown>)" issue in documentation. (#293)
  • Docs: Assigned omnidocbench dataset to v1.5 in docs. (#371)
  • Docs: Updated datasets.md and install.md (EN/ZH) to reflect the new datasets, Docker images and PyPI installation. (#225)
  • Docs: Updated main README status and fixed remaining documentation issues. (#381)

✅ Tests

  • Test: Added UT coverage for SWE-Bench Pro (infer / utils / eval / summarizer). (#335)(#336)(#337)(#338)
  • Test: Added UT coverage for OneIG evaluation (alignment / text / reasoning / style / diversity / eval utils). (#372)(#373)(#374)
  • Test: Added UT coverage for core datasets (AIME, AIME 2026, RealWorldQA, Math, GSM8K, GPQA, DAPO-Math). (#315)
  • Test: Added UT coverage for HLE dataset. (#301)
  • Test: Added UT coverage for MathVision dataset. (#288)
  • Test: Added UT coverage for MMMU dataset. (#325)
  • Test: Added UT coverage for MMMU-Pro and MMStar datasets. (#326)
  • Test: Added UT coverage for RefCOCO / RefCOCO+ / RefCOCOG datasets. (#276)
  • Test: Added UT coverage for RealWorldQA dataset. (#287)
  • Test: Added UT coverage for SWE-Bench summarizer. (#316)
  • Test: Added UT coverage for VBench dataset and summarizer. (#273)
  • Test: Added UT coverage for Harbor task and summarizer, supporting Terminal-Bench 2.0 testing. (#318)
  • Test: Added UT coverage for τ²-Bench custom task. (#249)
  • Test: Added bbox_iou_evaluator UT for visual grounding evaluation. (#315)

简体中文版

🌟 亮点

  1. 智能体评测基准生态:端到端接入 SWE-Bench、SWE-Bench Pro、τ²-Bench、Terminal-Bench 2.0 (Harbor) 等主流智能体评测基准,提供数据集加载、推理、汇总、示例配置与中英文使用文档。
  2. 视频/图像生成评测:新增 VBench 1.0 视频生成质量评测流水线与 OneIG-Benchmark(EN/ZH)文生图多维评测基准(对齐、文本渲染、推理、风格、多样性),覆盖生成与判官两端。
  3. 多模态与推理基准:新增 HLE、RealWorldQA、MathVision、AIME 2026 以及 RefCOCO / RefCOCO+ / RefCOCOG 指代表达定位基准族。
  4. Mini 子集支持:为 SWE-Bench、τ²-Bench、VBench 提供 mini 子集,大幅降低快速验证的评测成本。
  5. 多架构 Docker 与 PyPI 发布:AISBench Docker 镜像同步支持 x86_64 与 aarch64(Ubuntu 22.04/24.04、openEuler 22.03/24.03 × Python 3.10/3.11/3.12);同时 AISBench 已发布到 PyPI,可通过 pip install ais_bench_benchmark / pip install ais_bench_benchmark[full] 一键安装。
  6. 新增模型:新增 Vita generate-chat 模型后端。

🚀 新特性

数据集

  • 数据集:新增 OneIG-Benchmark(EN/ZH)文生图多维评测基准,覆盖对齐、文本渲染、推理、风格、多样性五个维度,采用 LLM-as-Judge + 专用小模型混合评测方式。(#361)(#368)(#364)
  • 数据集:新增 SWE-Bench Pro 数据集,提供 full / mini 子集,支持长时域软件工程智能体评测。(#333)
  • 数据集:新增 VBench 1.0 视频生成质量评测流水线(Part 1 / Part 2 / 第三方与许可证),覆盖主体一致性、运动平滑度、时间闪烁、动态程度、美学质量、成像质量、物体类别、颜色、空间关系、场景、整体一致性、人物动作、多物体等维度。(#273)(#270)(#152)
  • 数据集:新增 τ²-Bench 数据集及 mini 子集,支持双控环境下的多轮对话智能体评测。(#249)
  • 数据集:新增 Terminal-Bench 2.0(Harbor)数据集及 mini 子集,支持终端类智能体评测。(#318)(#319)(#320)(#321)
  • 数据集:新增 SWE-Bench 数据集及 mini 子集,支持软件工程智能体评测。(#240)(#241)
  • 数据集:新增 HLE 数据集,支持高难度推理与知识型基准评测。(#301)
  • 数据集:新增 RealWorldQA 数据集,支持真实世界图像问答评测。(#268)
  • 数据集:新增 MathVision 数据集,支持数学视觉推理评测。(#264)
  • 数据集:新增 AIME 2026 数据集,支持最新数学竞赛题评测。(#274)
  • 数据集:新增 RefCOCO / RefCOCO+ / RefCOCOG 指代表达定位基准族。(#201)
  • 数据集:新增 verified_mini 数据集映射与示例配置,降低多个基准的评测成本。(#271)

模型

  • 模型:新增 Vita generate-chat 模型后端。(#237)

功能

  • 功能:构建端到端 SWE-Bench 评测流水线,集成数据集加载、推理任务、评测任务与汇总器,并集成 Mini SWE Agent 作为推理器。(#241)(#240)
  • 功能:提供 SWE-Bench 示例配置与中英文使用文档,便于快速上手。(#191)
  • 功能:提供 τ²-Bench 示例脚本、依赖声明与文档,支持一键启动评测。(#249)
  • 功能:提供 Terminal-Bench 2.0(Harbor)示例配置与脚本,支持 mini 数据集。(#318)(#319)(#320)(#321)
  • 功能:提供 OneIG 示例配置、评测样例与中英文文档。(#361)(#379)
  • 功能:提供 VBench 1.0 示例配置、评测样例、依赖缓存说明与中英文文档。(#270)(#273)
  • 功能:提供 RefCOCO / RefCOCO+ / RefCOCOG 示例配置(vLLM API / 本地)与中英文文档。(#201)(#277)
  • 功能:SWE-Bench LiteLLM 推理默认超时设置为 200s,避免长时域任务提前终止。(#383)
  • 功能:SWE-Bench 示例配置新增 temperature、top_k、top_p 生成参数,支持细粒度控制。(#377)
  • 功能:τ²-Bench 示例配置支持 llm_call_kwargs 自定义 LLM 调用参数。(#366)
  • 功能:提供 Agentic Coding 评测方案设计文档,指导用户在 AISBench 上构建智能体评测。(#292)
  • 功能:新增基于裁判模型的评测指南 judge_model_evaluate 与中英文文档。(#225)
  • 功能:新增错误码文档(EN/ZH),通过错误码 URL 快速定位解决方案。(#225)
  • 功能:Docker 镜像支持多架构(x86_64 / aarch64)与多 OS / Python 版本组合。(#332)
  • 功能:新增 Docker OVERVIEW 文档(EN/ZH)、Dockerfile 构建脚本与异常/标签日志,便于问题定位。(#339)(#340)(#351)

🐛 问题修复

  • 修复:MMMU 推理无法正确抽取选项内容的问题,新增后处理器抽取最后一个选项字母。(#238)
  • 修复:多模态场景下 input_tokens=0 导致性能指标不准的问题。(#299)
  • 修复:τ²-Bench 文档中路径笔误(agent_examples 应为 agent_example)。(#312)
  • 修复:τ²-Bench 在 pass^k 评测场景下指标展示异常的问题。(#272)
  • 修复:多语言 mini 数据集配置问题。(#305)
  • 修复:VBench burstiness 计算被零除问题。(#357)
  • 修复:VBench 评测相关问题。(#363)
  • 修复:aiohttp 会话未遵循 HTTP 代理环境变量的问题,新增 trust_env=True。(#367)
  • 修复:自定义 URL 缺少末尾斜杠导致 urljoin 丢失最后一段路径的问题。(#358)
  • 修复:HumanEvalX 文件路径拼接错误(issue #139)。(#370)
  • 修复:部分模型后端 HuggingFace trust_remote_code 未生效的问题。(#226)
  • 修复:transformers 版本兼容性问题。(#261)
  • 修复:影响评测结果准确性的后处理 bug。(#263)
  • 修复:MMMU 数据集结果不一致问题。(#265)
  • 修复:HLE README 中任务名称 hle 应为 hle_llmjudge。(#354)
  • 修复:移除 SWE-Bench Pro README 中的重复内容。(#355)

⚙️ 优化与重构

  • 重构:完善 Vita 模型的多模态基准评测流程,包括 datagen 与后处理。(#237)
  • 重构:新增 util adapter 模块,统一模型/任务工具接口。(#239)
  • 重构:DS-V3-2 模型支持 PreTrainedTokenizerFast。(#330)
  • 重构:更新 textvqa.py 以对齐新的数据集接口。(#230)
  • 重构:统一 verified_mini 数据集映射与示例配置方式。(#271)

🏗️ 基础设施重构

  • 基础设施:重构 Dockerfile 构建流水线,新增 Dockerfile.py310/3.11/3.12 × Ubuntu 22.04/24.04、openEuler 22.03/24.03 构建矩阵。(#332)(#339)
  • 基础设施:新增 Docker OVERVIEW 文档(EN/ZH)并标准化镜像构建脚本。(#340)
  • 基础设施:新增 Docker 容器异常日志和标准标签,便于追踪。(#351)

🔄 CI/CD 优化

  • CI/CD:AISBench 发布至 PyPI,包名 ais_bench_benchmark(可选 [full] extra),新增 upload_pypi.sh 脚本并更新 setup.py 以支持 PyPI 构建/分发。(#344)

📚 文档

  • 文档:新增 SWE-Bench 中英文使用文档与示例配置。(#191)(#308)
  • 文档:新增 τ²-Bench 中英文使用文档与测试用例文档。(#250)(#251)
  • 文档:新增 Terminal-Bench 2.0(Harbor)中英文使用文档。(#318)
  • 文档:新增 SWE-Bench Pro 数据集文档(EN/ZH)。(#334)
  • 文档:新增 VBench 1.0 中英文使用文档与依赖缓存指南。(#270)
  • 文档:新增 OneIG-Benchmark 中英文文档。(#379)
  • 文档:新增 RefCOCO / RefCOCO+ / RefCOCOG 中英文使用文档。(#277)
  • 文档:新增 AIME 2026 中英文使用文档。(#289)
  • 文档:新增 RealWorldQA 官方资料与数据集文档(EN/ZH)。(#268)(#290)
  • 文档:新增 MathVision 中英文使用文档。(#288)
  • 文档:新增 mini 数据集文档(EN/ZH),覆盖 SWE-Bench / τ²-Bench / VBench mini 子集。(#302)
  • 文档:新增 Agentic Coding 评测方案设计文档。(#292)
  • 文档:更新 readthedocs 基础 URL 与 20260415 预发布文档。(#225)(#252)
  • 文档:修复文档中 Error in user YAML: (<unknown>) 的说明问题。(#293)
  • 文档:omnidocbench 数据集文档指定为 v1.5 版本。(#371)
  • 文档:更新 datasets.md 与 install.md(EN/ZH),同步新数据集、Docker 镜像与 PyPI 安装方式。(#225)
  • 文档:更新主 README 状态并修复剩余文档问题。(#381)

✅ 测试

  • 测试:新增 SWE-Bench Pro 全链路 UT 覆盖(infer / utils / eval / summarizer)。(#335)(#336)(#337)(#338)
  • 测试:新增 OneIG 评测全维度 UT 覆盖(alignment / text / reasoning / style / diversity / eval utils)。(#372)(#373)(#374)
  • 测试:新增核心数据集 UT 覆盖(AIME、AIME 2026、RealWorldQA、Math、GSM8K、GPQA、DAPO-Math)。(#315)
  • 测试:新增 HLE 数据集 UT 覆盖。(#301)
  • 测试:新增 MathVision 数据集 UT 覆盖。(#288)
  • 测试:新增 MMMU 数据集 UT 覆盖。(#325)
  • 测试:新增 MMMU-Pro 与 MMStar 数据集 UT 覆盖。(#326)
  • 测试:新增 RefCOCO / RefCOCO+ / RefCOCOG 数据集 UT 覆盖。(#276)
  • 测试:新增 RealWorldQA 数据集 UT 覆盖。(#287)
  • 测试:新增 SWE-Bench summarizer UT 覆盖。(#316)
  • 测试:新增 VBench 数据集与 summarizer UT 覆盖。(#273)
  • 测试:新增 Harbor task 与 summarizer UT 覆盖,支持 Terminal-Bench 2.0 测试。(#318)
  • 测试:新增 τ²-Bench custom task UT 覆盖。(#249)
  • 测试:新增 bbox_iou_evaluator UT,用于视觉定位评测。(#315)