CritPt

OpenReward Environment · Hugging Face Dataset

Description

CritPt (Complex Research using Integrated Thinking - Physics Test) is a frontier physics research benchmark consisting of 70 research-level problems across 12 physics domains, created by 50+ active physics researchers from 30+ leading institutions. Each problem requires producing a Python function that returns the correct answer, which may be a numerical value, SymPy expression, or composite structure.

Capabilities

  • Research-level physics reasoning across condensed matter, quantum physics, atomic, molecular, and optical (AMO) physics, astrophysics, high-energy physics (HEP), mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics
  • Code generation for mathematical and numerical computation
  • Symbolic computation with SymPy
  • Numerical methods (ODEs, integrals, Monte Carlo simulations)

Variants

This environment provides two variants:

  • CritPt (critpt): Tool-augmented mode with sandbox code execution. Agents can use execute_code to iteratively explore and compute, then submit_answer to submit their final answer.
  • CritPtNoCode (critptnocode): Base model mode with answer submission only. Agents reason and produce answer code directly via submit_answer.

Compute Requirements

  • CritPt (sandbox variant): 4 CPUs, 8 GB RAM per sandbox. The original CritPt implementation uses a 30-minute (1800s) timeout for agent code exploration, and problems may involve numerical ODE solving, Monte Carlo simulations, or heavy symbolic computation.
  • CritPtNoCode: No additional compute (no sandbox)

License

Apache 2.0

Tasks

There are 70 test tasks spanning 12 physics domains. Each task is a research-level problem requiring an answer(...) Python function that returns the correct result.

Split: test (70 problems). No training split is provided, as the benchmark is designed for evaluation only.

Answer formats include (illustrated in the sketch after this list):

  • Floating-point values (e.g., number of e-folds)
  • Lists of floats (e.g., coefficients)
  • SymPy expressions (e.g., beta functions, fidelity expressions)
  • Tuples of mixed types (e.g., symbolic expression + multiple choice letter)
  • Sets of tuples (e.g., expectation values)
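
For concreteness, here is a minimal sketch of what submitted answer functions for these formats might look like. The function names, signatures, and return values below are hypothetical illustrations, not actual benchmark problems; real problems supply their own answer(...) signature via the code template.

import sympy as sp

def answer_float():
    # A single floating-point value (e.g., a number of e-folds)
    return 60.0

def answer_list():
    # A list of floats (e.g., expansion coefficients)
    return [1.0, -0.5, 0.25]

def answer_sympy():
    # A SymPy expression (e.g., a beta function)
    g = sp.symbols('g')
    return 3 * g**2 / (16 * sp.pi**2)

def answer_tuple():
    # A tuple of mixed types (e.g., expression + multiple-choice letter)
    x = sp.symbols('x')
    return (sp.exp(-x**2), 'B')

def answer_set():
    # A set of tuples (e.g., expectation values)
    return {('up', 0.5), ('down', -0.5)}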

Reward Structure

This is a deferred reward environment. The 70 test problems do not have public ground truth answers. The submit_answer tool returns reward=0 and stores the agent's generated code in the tool output metadata.

To obtain actual scores, collect all 70 submissions and batch-submit them to the CritPt grading server using evaluate.py:

python evaluate.py --results-dir ./results --api-key YOUR_AA_API_KEY

The grading server uses:

  • Numeric tolerance comparison for floating-point values
  • SymPy equivalence checking with algebraic simplification for symbolic expressions
  • Element-wise grading for composite answers (all components must match)
  • 30-second execution timeout per problem
  • Restricted library set: math, numpy, sympy, scipy

Grading is binary: a problem is correct only if all components match the ground truth.
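
As an illustration of these rules, a grader along these lines might look like the sketch below. This is not the grading server's actual code, and the tolerance value is an assumption:

import math
import sympy as sp

def numbers_match(submitted, truth, rel_tol=1e-6):
    # Numeric tolerance comparison; the 1e-6 tolerance is an assumed value
    return math.isclose(submitted, truth, rel_tol=rel_tol)

def expressions_match(submitted, truth):
    # SymPy equivalence check: the difference simplifies to zero
    return sp.simplify(submitted - truth) == 0

def composite_matches(submitted, truth):
    # Element-wise grading with a binary outcome: all components must match
    if len(submitted) != len(truth):
        return False
    return all(
        numbers_match(s, t) if isinstance(t, (int, float)) else expressions_match(s, t)
        for s, t in zip(submitted, truth)
    )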

Rate limit: 10 batch submissions per 24-hour window per API key.

Data

Problem data is loaded from the HuggingFace dataset. Each problem includes:

  • Full problem description with LaTeX
  • Code template with function signature and docstring
  • Physics domain tag

Ground truth answers are held privately on the grading server.
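
For reference, loading such a dataset typically looks like the sketch below. The dataset identifier and field names here are assumptions, not the confirmed schema:

from datasets import load_dataset

# "EnvCommons/CritPt" and the field names are assumptions; substitute
# the actual HuggingFace dataset id and schema.
ds = load_dataset("EnvCommons/CritPt", split="test")

for problem in ds:
    print(problem["domain"])              # physics domain tag
    print(problem["description"][:200])   # LaTeX problem statement
    print(problem["code_template"])       # answer(...) signature and docstring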

Tools

CritPt (sandbox variant)

  • execute_code(code: str): Run Python code in a sandboxed environment (1800s timeout). Available libraries: math, numpy, sympy, scipy.
  • submit_answer(code: str): Submit final answer as Python code defining the answer(...) function.

CritPtNoCode (base variant)

  • submit_answer(code: str): Submit final answer as Python code defining the answer(...) function.
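
For illustration, a submit_answer call in either variant carries a single code string defining answer(...). The problem below is hypothetical:

# Hypothetical submission: ground-state energy of a 2x2 toy Hamiltonian
submission = '''
import numpy as np

def answer():
    H = np.array([[0.0, 1.0], [1.0, 0.0]])
    return float(np.linalg.eigvalsh(H)[0])  # eigenvalues sorted ascending
'''

# Then: submit_answer(code=submission)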

Evaluation Workflow

  1. Run the environment (either variant) against all 70 tasks
  2. Collect the generated_code from each task's submit_answer metadata
  3. Save each submission as a JSON file: {"problem_id": "...", "generated_code": "..."} (a sketch follows this list)
  4. Batch-submit to the grading server: python evaluate.py --results-dir ./results --api-key KEY
  5. Review the returned accuracy and timeout metrics
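
A minimal sketch of steps 2-3, assuming the collected submit_answer metadata is available as a list of dicts; the runs structure and its field names are assumptions about how run outputs are stored:

import json
from pathlib import Path

# Assumed shape: one dict per task, holding the submit_answer metadata.
runs = [
    {"problem_id": "example_problem", "generated_code": "def answer(): ..."},
    # ... one entry per task, 70 in total
]

out_dir = Path("./results")
out_dir.mkdir(exist_ok=True)

for run in runs:
    payload = {"problem_id": run["problem_id"],
               "generated_code": run["generated_code"]}
    (out_dir / f"{run['problem_id']}.json").write_text(json.dumps(payload))

The resulting ./results directory can then be passed to evaluate.py as in step 4.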

Local Testing

An example challenge (quantum_error_correction) with ground truth is available for validating the grading pipeline locally. See test_local.py for tests using this example.

# Clone CritPt repo for example challenge data
git clone https://github.com/EnvCommons/CritPt.git /tmp/CritPt

# Run tests
pytest test_local.py -v

Safety

CritPt problems are purely computational physics research tasks. Agents interact only with mathematical problem descriptions and produce Python code. No safety-sensitive capabilities are involved.

Citations

@article{critpt2025,
  title={Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark},
  author={Zhu, Minhui and Tian, Minyang and Yang, Xiaocheng and Zhou, Tianci and Yuan, Lifan and Zhu, Penghao and Chertkov, Eli and Liu, Shengyan and Du, Yufeng and Ji, Ziming and Das, Indranil and Cao, Junyi and Yu, Jiabin and Wu, Peixue and He, Jinchen and Su, Yifan and Jiang, Yikun and Zhang, Yujie and Liu, Chang and Huang, Ze-Min and Jia, Weizhen and Wang, Yunkai and Jafarpour, Farshid and Zhao, Yong and Chen, Xinan and Shelton, Jessie and Young, Aaron W. and Bartolotta, John and Xu, Wenchao and Sun, Yue and Chu, Anjun and Colussi, Victor and Akers, Chris and Brooks, Nathan and Fu, Wenbo and Zhao, Jinchao and Qi, Marvin and Mu, Anqi and Yang, Yubo and Zang, Allen and Lyu, Yang and Mai, Peizhi and Wilson, Christopher and Guo, Xuefei and Zhou, Juntai and Inafuku, Daniel and Xue, Chi and Gao, Luyu and Yang, Ze and Hein, Ya{\"\i}r and Kahn, Yonatan and Zhou, Kevin and Luo, Di and Wilson, John Drew and Reilly, Jarrod T. and Bandak, Dmytro and Press, Ofir and Yang, Liang and Wang, Xueying and Tong, Hao and Chia, Nicolas and Huerta, Eliu and Peng, Hao},
  journal={arXiv preprint arXiv:2509.26574},
  year={2025}
}
