The Next Generation Framework for Dynamic and Rigorous Code LLM Evaluation.
- [2026/01] Code2Bench has been accepted as a conference paper at ICLR 2026!
The evaluation of code-generating LLMs is currently limited by static, contaminated problem sources and low-rigor testing. CODE2BENCH introduces the Dual Scaling philosophy:
- Scaling the Source (Dynamic & Contamination-Resistant):
  - Temporal Filtering: Automatically ingests code from GitHub commits created after the knowledge cutoff of the evaluated models.
  - Principled Classification: Uses language-agnostic Scope Graph analysis to classify tasks into Self-Contained (SC) and Weakly Self-Contained (WSC).
- Scaling the Rigor (Deep & Diagnostic):
  - Property-Based Testing (PBT): Generates hundreds of nuanced test cases automatically per task.
  - The "Great Filter": A stringent 100% branch coverage quality gate ensuring every task is logically verifiable and non-trivial.
  - Diagnostic Fingerprints: Beyond Pass@1, we provide granular insights into failure modes (Syntax vs. Runtime vs. Logic).

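
To make the property-based testing idea concrete, here is a minimal sketch that uses only the Python standard library. It is not CODE2BENCH's PBT engine: the function names, the input generator, and the example task are all hypothetical, and the sketch simply assumes that a ground-truth function is available to act as the oracle for randomly generated inputs.

```python
import random


def reference_dedupe(items):
    """Hypothetical ground-truth function: drop duplicates, keep first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def candidate_dedupe(items):
    """Hypothetical model output with a logic bug: sorting loses the original order."""
    return sorted(set(items))


def random_int_list(rng, max_len=8, low=-5, high=5):
    """Hypothetical input generator: a short list of small integers."""
    return [rng.randint(low, high) for _ in range(rng.randint(0, max_len))]


def run_property_test(reference, candidate, generator, num_cases=300, seed=0):
    """Check the property candidate(x) == reference(x) on randomly generated inputs."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for case_index in range(num_cases):
        args = generator(rng)
        expected = reference(list(args))
        try:
            actual = candidate(list(args))
        except Exception as exc:  # a crash also counts as a failing case
            return {"passed": False, "case": case_index, "input": args, "error": repr(exc)}
        if actual != expected:
            return {
                "passed": False,
                "case": case_index,
                "input": args,
                "expected": expected,
                "actual": actual,
            }
    return {"passed": True, "cases": num_cases}


report = run_property_test(reference_dedupe, candidate_dedupe, random_int_list)
print(report["passed"])
```

The returned dictionary carries the first counterexample, which is usually the most useful piece of information when a candidate fails.
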
Evaluate your LLM on the CODE2BENCH-2509 suite in minutes.

    conda create -n code2bench python=3.10 -y
    conda activate code2bench
    sudo apt-get update && sudo apt-get install graphviz graphviz-dev -y
    pip install -r requirements.txt
    export PYTHONPATH=`pwd`:$PYTHONPATH

To evaluate a new model, simply inherit from the LLM base class:

    # code2bench/llm/my_model.py
    from code2bench.llm.base import LLM


    class MyCustomLLM(LLM):
        def chat(self, system_prompt, user_input, **kwargs):
            # Integrate your API or local inference here and store the
            # model's reply (a string) in response_text.
            response_text = ""  # placeholder until a backend is wired in
            return response_text

Execute the evaluation script for Python or Java:

    python code2bench/test_runner/benchmark_runner.py --benchmark_name Python --mode weakly

Build your own dynamic benchmark instances from fresh GitHub repositories.

- Define Sources: Add repository URLs to `code2bench/projects.yaml`.
- Set Time Window: Define `start_time` and `end_time` in the execution command to target specific commit history (for anti-contamination).

The pipeline automates: Acquisition → Scope Analysis → PBT Generation → Coverage Filtering → Instruction Generation.
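
The time window is what drives contamination resistance, so a small sketch of the filtering step may help. This is an illustration under stated assumptions, not the pipeline's actual code: the commit records (dictionaries with `sha` and `date` keys), the `filter_commits` helper, and the choice to treat both bounds as inclusive are all hypothetical.

```python
from datetime import date


def filter_commits(commits, start_time, end_time):
    """Keep commits whose date falls inside [start_time, end_time].

    Dates are ISO strings (YYYY-MM-DD). Inclusive bounds are an assumption
    made for this sketch.
    """
    start = date.fromisoformat(start_time)
    end = date.fromisoformat(end_time)
    if start > end:
        raise ValueError("start_time must not be later than end_time")
    return [c for c in commits if start <= date.fromisoformat(c["date"]) <= end]


# Hypothetical commit records; a real pipeline would read these from Git history.
commits = [
    {"sha": "a1", "date": "2024-07-31"},
    {"sha": "b2", "date": "2024-08-01"},
    {"sha": "c3", "date": "2025-05-30"},
    {"sha": "d4", "date": "2025-06-01"},
]

kept = filter_commits(commits, "2024-08-01", "2025-05-30")
print([c["sha"] for c in kept])
```

Choosing a `start_time` later than the knowledge cutoff of every model under evaluation is what keeps the mined functions out of their training data.
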

    # Example: Constructing a Python Weakly Self-Contained benchmark
    python code2bench/run.py \
        --benchmark_name Python \
        --mode weakly \
        --start_time 2024-08-01 \
        --end_time 2025-05-30 \
        --use_proxy

For Java tasks, use:

    python code2bench/run_java.py --benchmark_name Pure_Java --mode self

CODE2BENCH provides a novel Diagnostic Fingerprint visualization to understand why models fail.

| Mode | Fail Mode Peak | Insights |
|---|---|---|
| SC (Algorithm) | LogicErr | Models struggle with core synthesis logic. |
| WSC (Library) | RuntimeErr | Challenges arise from API misapplication. |
| Java Native | Perfect Surge | Static typing acts as a "performance scaffold". |
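
The failure categories in the table can be illustrated with a small classifier. The sketch below is an assumption about how such labels could be assigned, not the project's implementation: the `classify_outcome` and `fingerprint` helpers are hypothetical, and the `SyntaxErr` and `Pass` labels are named by analogy with the `LogicErr` and `RuntimeErr` labels above.

```python
from collections import Counter


def classify_outcome(source, func_name, test_cases):
    """Label one candidate solution as SyntaxErr, RuntimeErr, LogicErr, or Pass.

    test_cases is a list of (args_tuple, expected_value) pairs.
    Note: exec runs arbitrary code; a real harness should sandbox this step.
    """
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return "SyntaxErr"  # the source does not even parse

    namespace = {}
    try:
        exec(code, namespace)
        func = namespace[func_name]
        results = [func(*args) for args, _ in test_cases]
    except Exception:
        return "RuntimeErr"  # crashed while loading or running

    if all(actual == expected for actual, (_, expected) in zip(results, test_cases)):
        return "Pass"
    return "LogicErr"  # ran cleanly but produced a wrong answer


def fingerprint(labels):
    """Aggregate per-task labels into a count per failure mode."""
    return Counter(labels)


tests = [((2, 3), 5), ((0, 0), 0)]
candidates = [
    "def add(a, b):\n    return a + b\n",  # correct
    "def add(a, b)\n    return a + b\n",   # missing colon
    "def add(a, b):\n    return a + c\n",  # undefined name
    "def add(a, b):\n    return a - b\n",  # wrong operator
]
labels = [classify_outcome(src, "add", tests) for src in candidates]
print(fingerprint(labels))
```

Counting labels per task category (SC, WSC, and so on) yields the kind of per-mode distribution that the table summarizes.
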
We welcome contributions! Whether it's adding new language support (C++, Rust, Go) or improving the PBT engines, please feel free to open an Issue or a PR.