Add new benchmark suites by Qubitium · Pull Request #108 · ModelCloud/Evalution

Qubitium · 2026-04-16T15:05:48Z

What changed

Added runnable benchmark implementations for hle, supergpqa, hmmt_feb25, hmmt_nov25, hmmt_feb26, imoanswerbench, and livecodebench_v6.
Registered capability-gated placeholders for swe_bench_verified, swe_bench_multilingual, swe_bench_pro, terminal_bench_2, claw_eval_avg, claw_eval_pass3, skillsbench_avg5, qwenclawbench,
nl2repo, qwenwebbench, tau3_bench, vita_bench, deepplanning, tool_decathlon, mcpmark, mcp_atlas, and widesearch, with clear runtime-capability errors instead of misleading partial implementations.
Exported the new suites through evalution.benchmarks and added integration metadata/baselines in tests/models_support.py.
Added unit coverage plus standalone Llama 3.2 1B Instruct regression tests for the new runnable suites.
Hardened math answer extraction to handle boxed answers, explicit final-answer lines, and inline math spans more reliably, using compiled pcre patterns.
Added an optional apply_chat_template mode for HLE while keeping the default benchmark-faithful prompt path unchanged.

PYTHON_GIL=0 pytest tests/test_scorers.py tests/test_hle.py tests/test_hmmt.py tests/test_imoanswerbench.py tests/test_livecodebench.py
PYTHON_GIL=0 pytest tests/test_regex_backend.py
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=4 PYTHON_GIL=0 pytest tests/models/test_hle_llama3_2_transformers.py tests/models/test_imoanswerbench_llama3_2_transformers.py tests/models/ test_livecodebench_llama3_2_transformers.py tests/models/test_supergpqa_llama3_2_transformers.py

The low Llama 3.2 1B baseline scores on several newly added suites were checked directly; the main issue was task difficulty, not a broad scoring failure.
HLE showed mild chat-template sensitivity, which is why the optional chat-wrapping mode was added for controlled A/B use.

Qubitium added 2 commits April 16, 2026 14:54

Add new benchmark suite integrations

ebb1431

Harden math extraction and add optional HLE chat mode

32ad295

Qubitium merged commit e301368 into main Apr 16, 2026
2 checks passed

Qubitium deleted the add-tests-0416 branch April 16, 2026 15:43