
Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction

Venue: ICLR 2026 | arXiv | Website | License: MIT | Python: 3.10+

The Next Generation Framework for Dynamic and Rigorous Code LLM Evaluation.

Features | News | Evaluation | Construction | Paper


📢 News

  • [2026/01] 🎉 Code2Bench has been accepted as a conference paper at ICLR 2026!

✨ Key Features

The evaluation of code-generating LLMs is currently limited by static, contaminated problem sources and low-rigor testing. CODE2BENCH introduces the Dual Scaling philosophy:

  1. Scaling the Source (Dynamic & Contamination-Resistant):

    • Temporal Filtering: Automatically ingests code from GitHub commits created after the knowledge cutoff of the evaluated models.
    • Principled Classification: Uses language-agnostic Scope Graph analysis to classify tasks into Self-Contained (SC) and Weakly Self-Contained (WSC).
  2. Scaling the Rigor (Deep & Diagnostic):

    • Property-Based Testing (PBT): Generates hundreds of nuanced test cases automatically per task.
    • The "Great Filter": A stringent 100% branch coverage quality gate ensuring every task is logically verifiable and non-trivial.
    • Diagnostic Fingerprints: Beyond Pass@1, we provide granular insights into failure modes (Syntax vs. Runtime vs. Logic).
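As a rough illustration of the property-based testing idea above, the sketch below checks a candidate against a ground-truth implementation on many randomly generated inputs and reports the first disagreement. It uses only the standard library rather than a dedicated PBT framework, and the README does not say which engine Code2Bench uses; the `clamp` task, both implementations, and `find_counterexample` are hypothetical names invented for this example.

```python
import random

def reference_clamp(x: int, lo: int, hi: int) -> int:
    """Ground-truth implementation of a hypothetical task."""
    return max(lo, min(x, hi))

def candidate_clamp(x: int, lo: int, hi: int) -> int:
    """Hypothetical model output with a subtle bug: the upper bound is ignored."""
    return lo if x < lo else x

def find_counterexample(candidate, reference, trials=500, seed=0):
    """Return the first (x, lo, hi) on which the two functions disagree, else None."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        lo = rng.randint(-50, 50)
        hi = rng.randint(lo, lo + 100)  # generator guarantees lo <= hi
        x = rng.randint(-200, 200)
        if candidate(x, lo, hi) != reference(x, lo, hi):
            return (x, lo, hi)
    return None

counterexample = find_counterexample(candidate_clamp, reference_clamp)
print("counterexample:", counterexample)
```

The buggy candidate passes every input with `x <= hi`, so a handful of hand-written cases could easily miss it. Randomized generation over the whole input domain is what exposes the unexercised branch.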

🚀 Quick Start: Evaluation

Evaluate your LLM on the CODE2BENCH-2509 suite in minutes.

1. Installation

conda create -n code2bench python=3.10 -y
conda activate code2bench
sudo apt-get update && sudo apt-get install graphviz graphviz-dev -y
pip install -r requirements.txt
export PYTHONPATH=`pwd`:$PYTHONPATH

2. Plug in Your Model

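To evaluate a new model, subclass the LLM base class and implement chat so that it returns the model's reply as a string:

# code2bench/llm/my_model.py
from code2bench.llm.base import LLM

class MyCustomLLM(LLM):
    def chat(self, system_prompt, user_input, **kwargs):
        # Integrate your API or local inference here.
        # Placeholder value: replace with the text produced by your model.
        response_text = ""
        return response_text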
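Before wiring in a real API, it can help to smoke-test the integration with an offline model that returns a fixed completion. The sketch below is self-contained, so it defines a stand-in `LLM` base class. This is an assumption that mirrors only the `chat(system_prompt, user_input, **kwargs)` signature shown above, and the real `code2bench.llm.base.LLM` may require more. `CannedLLM` is a hypothetical name.

```python
from abc import ABC, abstractmethod

class LLM(ABC):
    """Stand-in for code2bench.llm.base.LLM (assumed to expose a single chat method)."""

    @abstractmethod
    def chat(self, system_prompt: str, user_input: str, **kwargs) -> str:
        ...

class CannedLLM(LLM):
    """Hypothetical offline model: always returns the same completion."""

    def __init__(self, completion: str) -> None:
        self.completion = completion
        self.calls = []  # record every request for later inspection

    def chat(self, system_prompt: str, user_input: str, **kwargs) -> str:
        self.calls.append((system_prompt, user_input, kwargs))
        return self.completion

model = CannedLLM("def add(a, b):\n    return a + b\n")
reply = model.chat("You are a coding assistant.", "Write add(a, b).", temperature=0.0)
print(reply)
```

Recording the calls makes it easy to confirm which prompts and keyword arguments the runner actually passes to `chat`.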

3. Run Benchmark

Execute the evaluation script for Python or Java. The example below runs the Python benchmark in weakly self-contained mode:

python code2bench/test_runner/benchmark_runner.py --benchmark_name Python --mode weakly

🛠️ Benchmark Construction

Build your own dynamic benchmark instances from fresh GitHub repositories.

Configuration

  1. Define Sources: Add repository URLs to code2bench/projects.yaml.
  2. Set Time Window: Pass --start_time and --end_time on the command line to restrict acquisition to commits from that period (for contamination resistance).
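The effect of the time window can be sketched as a simple date filter over commit metadata. This is a minimal illustration, not the pipeline's actual code: the commit records are made up, `in_window` is a hypothetical helper, and treating both bounds as inclusive is an assumption.

```python
from datetime import date

def in_window(commit_date: date, start: date, end: date) -> bool:
    """True when commit_date lies in [start, end]; inclusive bounds are an assumption."""
    return start <= commit_date <= end

# Hypothetical commit records; real ones would come from the repository history.
commits = [
    {"sha": "a1", "date": date(2024, 7, 31)},  # before the window
    {"sha": "b2", "date": date(2024, 8, 1)},   # first day of the window
    {"sha": "c3", "date": date(2025, 3, 14)},  # inside the window
    {"sha": "d4", "date": date(2025, 6, 2)},   # after the window
]

start = date.fromisoformat("2024-08-01")
end = date.fromisoformat("2025-05-30")
kept = [c["sha"] for c in commits if in_window(c["date"], start, end)]
print(kept)  # ['b2', 'c3']
```

Choosing a start date later than the knowledge cutoff of every evaluated model is what makes the resulting tasks contamination-resistant.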

Full Pipeline Run

The pipeline automates: Acquisition → Scope Analysis → PBT Generation → Coverage Filtering → Instruction Generation.

# Example: Constructing a Python Weakly Self-Contained benchmark
python code2bench/run.py \
    --benchmark_name Python \
    --mode weakly \
    --start_time 2024-08-01 \
    --end_time 2025-05-30 \
    --use_proxy

For Java tasks, use:

python code2bench/run_java.py --benchmark_name Pure_Java --mode self

📈 Analysis & Visualization

CODE2BENCH provides a novel Diagnostic Fingerprint visualization to understand why models fail.

| Mode | Fail Mode Peak | Insights |
|---|---|---|
| SC (Algorithm) | LogicErr | Models struggle with core synthesis logic. |
| WSC (Library) | RuntimeErr | Challenges arise from API misapplication. |
| Java Native | Perfect Surge | Static typing acts as a "performance scaffold". |
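To make the failure categories concrete, here is a minimal sketch of how a single candidate might be bucketed into Syntax, Runtime, or Logic failures and how the buckets aggregate into a fingerprint. It is not the Code2Bench implementation. `classify` is a hypothetical helper, the `LogicErr` and `RuntimeErr` labels follow the table, and `SyntaxErr` and `Pass` are assumed names. It also calls `exec` directly, which is only acceptable for trusted toy inputs; real model output should run in a sandbox.

```python
from collections import Counter

def classify(source: str, func_name: str, cases) -> str:
    """Bucket one candidate solution by the first kind of failure it shows."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return "SyntaxErr"  # the code does not even parse
    namespace = {}
    try:
        exec(code, namespace)  # NOTE: unsafe for untrusted code; sandbox in practice
        func = namespace[func_name]
        for args, expected in cases:
            if func(*args) != expected:
                return "LogicErr"  # ran to completion but gave a wrong answer
    except Exception:
        return "RuntimeErr"  # raised while defining or calling the function
    return "Pass"

cases = [((1, 2), 3), ((0, 0), 0)]
candidates = [
    "def add(a, b) return a + b",          # missing colon
    "def add(a, b):\n    return a + c",   # undefined name at call time
    "def add(a, b):\n    return a - b",   # wrong operator
    "def add(a, b):\n    return a + b",   # correct
]
fingerprint = Counter(classify(src, "add", cases) for src in candidates)
print(fingerprint)
```

Counting outcomes per category, rather than reporting a single pass rate, is what lets the distribution of failures be compared across modes and models.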

🀝 Contributing

We welcome contributions! Whether it's adding new language support (C++, Rust, Go) or improving the PBT engines, please feel free to open an Issue or a PR.
