Skip to content

Add new benchmark suites#108

Merged
Qubitium merged 2 commits into
mainfrom
add-tests-0416
Apr 16, 2026
Merged

Add new benchmark suites#108
Qubitium merged 2 commits into
mainfrom
add-tests-0416

Conversation

@Qubitium

@Qubitium Qubitium commented Apr 16, 2026

Copy link
Copy Markdown
Contributor

What changed

  • Added runnable benchmark implementations for hle, supergpqa, hmmt_feb25, hmmt_nov25, hmmt_feb26, imoanswerbench, and livecodebench_v6.
  • Registered capability-gated placeholders for swe_bench_verified, swe_bench_multilingual, swe_bench_pro, terminal_bench_2, claw_eval_avg, claw_eval_pass3, skillsbench_avg5, qwenclawbench,
    nl2repo, qwenwebbench, tau3_bench, vita_bench, deepplanning, tool_decathlon, mcpmark, mcp_atlas, and widesearch, with clear runtime-capability errors instead of misleading partial implementations.
  • Exported the new suites through evalution.benchmarks and added integration metadata/baselines in tests/models_support.py.
  • Added unit coverage plus standalone Llama 3.2 1B Instruct regression tests for the new runnable suites.
  • Hardened math answer extraction to handle boxed answers, explicit final-answer lines, and inline math spans more reliably, using compiled pcre patterns.
  • Added an optional apply_chat_template mode for HLE while keeping the default benchmark-faithful prompt path unchanged.

Validation

  • PYTHON_GIL=0 pytest tests/test_scorers.py tests/test_hle.py tests/test_hmmt.py tests/test_imoanswerbench.py tests/test_livecodebench.py
  • PYTHON_GIL=0 pytest tests/test_regex_backend.py
  • CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=4 PYTHON_GIL=0 pytest tests/models/test_hle_llama3_2_transformers.py tests/models/test_imoanswerbench_llama3_2_transformers.py tests/models/ test_livecodebench_llama3_2_transformers.py tests/models/test_supergpqa_llama3_2_transformers.py

Notes

  • The low Llama 3.2 1B baseline scores on several newly added suites were checked directly; the main issue was task difficulty, not a broad scoring failure.
  • HLE showed mild chat-template sensitivity, which is why the optional chat-wrapping mode was added for controlled A/B use.

@Qubitium Qubitium merged commit e301368 into main Apr 16, 2026
2 checks passed
@Qubitium Qubitium deleted the add-tests-0416 branch April 16, 2026 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant