[Bug] AISBench cannot limit the number of input samples for the humaneval dataset; only full-dataset runs work #136

### Operating system and version

Ubuntu 22.04.5 LTS

### Python environment used to install the tool

A Python virtual environment created with anaconda/miniconda

### Python version

3.11

### AISBench tool version

3.1.20260119

### AISBench command

ais_bench --models vllm_api_general_chat --datasets humaneval_gen_0_shot --merge-ds --num-prompts 5 --debug

### Model config file or custom config file contents

from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="",
        model="deepseek",
        stream=False,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="100.100.1**.***",
        host_port=8077,
        url="",
        max_out_len=8000,
        batch_size=16,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=1,
            top_p=0.95,
            ignore_eos=False,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]


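### Expected behavior

`--num-prompts` should limit the run to the given number of samples, so that the configuration and environment can be checked quickly.

### Actual behavior

Inference finishes for the 5 selected prompts, but the evaluation step fails and no score is reported: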
root@accuracy:/home/benchmark/ais_bench/datasets# ais_bench --models vllm_api_general_chat --datasets humaneval_gen_0_shot --merge-ds --num-prompts 5 --debug
[2026-02-05 09:40:04,972] [ais_bench] [INFO] Loading vllm_api_general_chat: /home/benchmark/ais_bench/benchmark/configs/./models/vllm_api/vllm_api_general_chat.py
[2026-02-05 09:40:04,978] [ais_bench] [INFO] Loading humaneval_gen_0_shot: /home/benchmark/ais_bench/benchmark/configs/./datasets/humaneval/humaneval_gen_0_shot.py
[2026-02-05 09:40:04,981] [ais_bench] [INFO] Loading example: /home/benchmark/ais_bench/benchmark/configs/./summarizers/example.py
[2026-02-05 09:40:05,019] [ais_bench] [INFO] Current exp folder: outputs/default/20260205_093956
[2026-02-05 09:40:05,019] [ais_bench] [INFO] Keeping the first 5 prompts for dataset [openai_humaneval]
[2026-02-05 09:40:05,091] [ais_bench] [INFO] Starting inference tasks...
[2026-02-05 09:40:05,095] [ais_bench] [INFO] Partitioned into 1 tasks.
[2026-02-05 09:40:05,095] [ais_bench] [INFO] Merging datasets with the same model and inferencer...
[2026-02-05 09:40:05,136] [ais_bench] [INFO] Launch TasksMonitor, PID: 57321, Refresh interval: 0.5, Run in background: True
[2026-02-05 09:40:12,965] [ais_bench] [INFO] Debug mode, print progress directly
[2026-02-05 09:40:12,967] [ais_bench] [INFO] Task [vllm-api-general-chat/openai_humaneval]
[2026-02-05 09:40:13,024] [ais_bench] [INFO] Zero Retriever initialized, returning empty shot case for all queries
[2026-02-05 09:40:13,025] [ais_bench] [INFO] Apply ice template finished
[2026-02-05 09:40:13,027] [ais_bench] [INFO] Start warmup, run with concurrency: 16
Warmup: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.92s/case]
[2026-02-05 09:40:19,951] [ais_bench] [INFO] Warmup finished Total Count: 1 Success Count: 1 Failed Count: 0
[2026-02-05 09:40:19,952] [ais_bench] [INFO] Dataset needed memory size: 0.00319767 MB
[2026-02-05 09:40:19,952] [ais_bench] [INFO] Memory usage check passed: 1.44% < 80% (Available: 985.75 GB)
[2026-02-05 09:40:19,954] [ais_bench] [INFO] Traffic request rate: 0 RPS with burstiness 1.0.
[2026-02-05 09:40:19,954] [ais_bench] [INFO] Request rate (0.0) or ramp end rps (None) < 0.1, sending all requests simultaneously
[2026-02-05 09:40:19,955] [ais_bench] [INFO] Debug mode, run with concurrency: 16
[2026-02-05 09:40:20,056] [ais_bench] [INFO] All subprocesses have finished deserializing the first batch of data
[2026-02-05 09:40:20,155] [ais_bench] [INFO] Starting progress bar Total data num: 5 Finished data num: 0 Left data num: 5
Progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:13<00:00,  2.60s/case]
POST=5 (0.0/s)  RECV=5 (1.0/s)  FAIL=0 (0.0/s)  FINISH=5 (1.0/s)
[2026-02-05 09:40:33,176] [ais_bench] [INFO] Api infer task time elapsed: 20.21s
[2026-02-05 09:40:34,227] [ais_bench] [INFO] Inference tasks completed.
[2026-02-05 09:40:34,228] [ais_bench] [INFO] Starting evaluation tasks...
[2026-02-05 09:40:34,232] [ais_bench] [INFO] Partitioned into 1 tasks.
[2026-02-05 09:40:34,248] [ais_bench] [INFO] Launch TasksMonitor, PID: 57589, Refresh interval: 0.5, Run in background: True
[2026-02-05 09:40:41,998] [ais_bench] [INFO] Debug mode, print progress directly
[2026-02-05 09:40:42,052] [ais_bench] [INFO] Running 1-th replica of evaluation
Reading samples...
5it [00:00, 1029.98it/s]
Traceback (most recent call last):
  File "/home/benchmark/ais_bench/benchmark/tasks/openicl_eval.py", line 521, in <module>
    raise e
  File "/home/benchmark/ais_bench/benchmark/tasks/openicl_eval.py", line 518, in <module>
    evaluator.run()
  File "/home/benchmark/ais_bench/benchmark/tasks/openicl_eval.py", line 98, in run
    self._score()
  File "/home/benchmark/ais_bench/benchmark/tasks/openicl_eval.py", line 283, in _score
    result = icl_evaluator.evaluate(k, n, copy.deepcopy(test_set), **preds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benchmark/ais_bench/benchmark/openicl/icl_evaluator/icl_base_evaluator.py", line 284, in evaluate
    results = self.score(**current_params)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benchmark/ais_bench/benchmark/datasets/humaneval.py", line 106, in score
    score = evaluate_functional_correctness(out_dir, self.k, n_workers=4, timeout=3.0, problem_file=HUMAN_EVAL)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/human_eval/evaluation.py", line 73, in evaluate_functional_correctness
    assert len(completion_id) == len(problems), "Some problems are not attempted."
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Some problems are not attempted.
[2026-02-05 09:40:43,378] [ais_bench] [INFO] Evaluation tasks completed.
[2026-02-05 09:40:43,378] [ais_bench] [INFO] Summarizing evaluation results...
dataset           version    metric    mode    vllm-api-general-chat
----------------  ---------  --------  ------  -----------------------
openai_humaneval  -      
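
The traceback suggests the cause: `evaluate_functional_correctness` in `human_eval/evaluation.py` asserts that every problem in `problem_file` has at least one completion, while `--num-prompts 5` produced samples for only 5 problems and the full HumanEval problem file contains 164. Below is a minimal sketch of one possible workaround, not the maintainers' fix: restrict the problem set to the task IDs that were actually attempted and write it to a subset problem file that could be passed as `problem_file`. The helper names `attempted_task_ids`, `restrict_problems`, and `write_jsonl` are hypothetical, the data is toy data, and where this would hook into `ais_bench/benchmark/datasets/humaneval.py` is an assumption.

```python
import json
from pathlib import Path


def attempted_task_ids(samples):
    """Collect the distinct task_id values present in the generated samples."""
    return {sample["task_id"] for sample in samples}


def restrict_problems(problems, samples):
    """Keep only the problems that have at least one generated sample."""
    attempted = attempted_task_ids(samples)
    return {tid: prob for tid, prob in problems.items() if tid in attempted}


def write_jsonl(path, records):
    """Write records as JSON Lines (hypothetical helper for a subset problem file)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Toy stand-ins: 164 problems (the size of HumanEval) and 5 samples,
# mirroring the "--num-prompts 5" run in the log above.
problems = {
    f"HumanEval/{i}": {"task_id": f"HumanEval/{i}", "prompt": "..."}
    for i in range(164)
}
samples = [
    {"task_id": f"HumanEval/{i}", "completion": "    pass\n"} for i in range(5)
]

# This mismatch is what trips the "Some problems are not attempted." assertion.
print(len(attempted_task_ids(samples)), len(problems))

# After restricting, the two counts agree.
subset = restrict_problems(problems, samples)
print(len(attempted_task_ids(samples)), len(subset))
```

With a subset built this way, `write_jsonl("subset_problems.jsonl", subset.values())` would produce a file in the same one-JSON-object-per-line layout as the samples file. Whether AISBench should do this inside the evaluator or skip the check in some other way is a design decision for the maintainers.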

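### Pre-submission checks

- [x] I have read the quick start in the main documentation and could not resolve the problem
- [x] I have searched the FAQ and found no duplicate
- [x] I have searched existing issues and found no duplicate
- [x] I have updated to the latest version and the problem persists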
