CritPt

OpenReward Environment · Hugging Face Dataset

Description

CritPt (Complex Research using Integrated Thinking - Physics Test) is a frontier physics research benchmark consisting of 70 research-level problems across 12 physics domains, created by 50+ active physics researchers from 30+ leading institutions. Each problem requires producing a Python function that returns the correct answer, which may be a numerical value, SymPy expression, or composite structure.

Capabilities

  • Research-level physics reasoning across condensed matter, quantum physics, atomic, molecular, and optical (AMO) physics, astrophysics, high-energy physics (HEP), mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics
  • Code generation for mathematical and numerical computation
  • Symbolic computation with SymPy
  • Numerical methods (ODEs, integrals, Monte Carlo simulations)

Variants

This environment provides two variants:

  • CritPt (critpt): Tool-augmented mode with sandbox code execution. Agents can use execute_code to iteratively explore and compute, then submit_answer to submit their final answer.
  • CritPtNoCode (critptnocode): Base model mode with answer submission only. Agents reason and produce answer code directly via submit_answer.

Compute Requirements

  • CritPt (sandbox variant): 4 CPUs, 8 GB RAM per sandbox. The original CritPt implementation uses a 30-minute (1800s) timeout for agent code exploration, and problems may involve numerical ODE solving, Monte Carlo simulations, or heavy symbolic computation.
  • CritPtNoCode: No additional compute (no sandbox)

License

Apache 2.0

Tasks

There are 70 test tasks spanning 12 physics domains. Each task is a research-level problem requiring an answer(...) Python function that returns the correct result.

Split: test (70 problems). No training split is provided, as the benchmark is designed for evaluation only.

Answer formats include (illustrated in the sketch after this list):

  • Floating-point values (e.g., number of e-folds)
  • Lists of floats (e.g., coefficients)
  • SymPy expressions (e.g., beta functions, fidelity expressions)
  • Tuples of mixed types (e.g., symbolic expression + multiple choice letter)
  • Sets of tuples (e.g., expectation values)
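
For concreteness, here is a minimal sketch of what submitted answer functions for these formats might look like. The function names, signatures, and return values below are hypothetical illustrations, not actual benchmark problems; real problems supply their own answer(...) signature via the code template.

import sympy as sp

def answer_float():
    # A single floating-point value (e.g., a number of e-folds)
    return 60.0

def answer_list():
    # A list of floats (e.g., expansion coefficients)
    return [1.0, -0.5, 0.25]

def answer_sympy():
    # A SymPy expression (e.g., a beta function)
    g = sp.symbols('g')
    return 3 * g**2 / (16 * sp.pi**2)

def answer_tuple():
    # A tuple of mixed types (e.g., expression + multiple-choice letter)
    x = sp.symbols('x')
    return (sp.exp(-x**2), 'B')

def answer_set():
    # A set of tuples (e.g., expectation values)
    return {('up', 0.5), ('down', -0.5)}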

Reward Structure

This is a deferred reward environment. The 70 test problems do not have public ground truth answers. The submit_answer tool returns reward=0 and stores the agent's generated code in the tool output metadata.

To obtain actual scores, collect all 70 submissions and batch-submit them to the CritPt grading server using evaluate.py:

python evaluate.py --results-dir ./results --api-key YOUR_AA_API_KEY

The grading server uses:

  • Numeric tolerance comparison for floating-point values
  • SymPy equivalence checking with algebraic simplification for symbolic expressions
  • Element-wise grading for composite answers (all components must match)
  • 30-second execution timeout per problem
  • Restricted library set: math, numpy, sympy, scipy

Grading is binary: a problem is correct only if all components match the ground truth.
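
As an illustration of these rules, a grader along these lines might look like the sketch below. This is not the grading server's actual code, and the tolerance value is an assumption:

import math
import sympy as sp

def numbers_match(submitted, truth, rel_tol=1e-6):
    # Numeric tolerance comparison; the 1e-6 tolerance is an assumed value
    return math.isclose(submitted, truth, rel_tol=rel_tol)

def expressions_match(submitted, truth):
    # SymPy equivalence check: the difference simplifies to zero
    return sp.simplify(submitted - truth) == 0

def composite_matches(submitted, truth):
    # Element-wise grading with a binary outcome: all components must match
    if len(submitted) != len(truth):
        return False
    return all(
        numbers_match(s, t) if isinstance(t, (int, float)) else expressions_match(s, t)
        for s, t in zip(submitted, truth)
    )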

Rate limit: 10 batch submissions per 24-hour window per API key.

Data

Problem data is loaded from the HuggingFace dataset. Each problem includes:

  • Full problem description with LaTeX
  • Code template with function signature and docstring
  • Physics domain tag

Ground truth answers are held privately on the grading server.
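
For reference, loading such a dataset typically looks like the sketch below. The dataset identifier and field names here are assumptions, not the confirmed schema:

from datasets import load_dataset

# "EnvCommons/CritPt" and the field names are assumptions; substitute
# the actual HuggingFace dataset id and schema.
ds = load_dataset("EnvCommons/CritPt", split="test")

for problem in ds:
    print(problem["domain"])              # physics domain tag
    print(problem["description"][:200])   # LaTeX problem statement
    print(problem["code_template"])       # answer(...) signature and docstring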

Tools

CritPt (sandbox variant)

  • execute_code(code: str): Run Python code in a sandboxed environment (1800s timeout). Available libraries: math, numpy, sympy, scipy.
  • submit_answer(code: str): Submit final answer as Python code defining the answer(...) function.

CritPtNoCode (base variant)

  • submit_answer(code: str): Submit final answer as Python code defining the answer(...) function.
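
For illustration, a submit_answer call in either variant carries a single code string defining answer(...). The problem below is hypothetical:

# Hypothetical submission: ground-state energy of a 2x2 toy Hamiltonian
submission = '''
import numpy as np

def answer():
    H = np.array([[0.0, 1.0], [1.0, 0.0]])
    return float(np.linalg.eigvalsh(H)[0])  # eigenvalues sorted ascending
'''

# Then: submit_answer(code=submission)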

Evaluation Workflow

  1. Run the environment (either variant) against all 70 tasks
  2. Collect the generated_code from each task's submit_answer metadata
  3. Save each submission as a JSON file: {"problem_id": "...", "generated_code": "..."} (a sketch follows this list)
  4. Batch-submit to the grading server: python evaluate.py --results-dir ./results --api-key KEY
  5. Review the returned accuracy and timeout metrics
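
A minimal sketch of steps 2-3, assuming the collected submit_answer metadata is available as a list of dicts; the runs structure and its field names are assumptions about how run outputs are stored:

import json
from pathlib import Path

# Assumed shape: one dict per task, holding the submit_answer metadata.
runs = [
    {"problem_id": "example_problem", "generated_code": "def answer(): ..."},
    # ... one entry per task, 70 in total
]

out_dir = Path("./results")
out_dir.mkdir(exist_ok=True)

for run in runs:
    payload = {"problem_id": run["problem_id"],
               "generated_code": run["generated_code"]}
    (out_dir / f"{run['problem_id']}.json").write_text(json.dumps(payload))

The resulting ./results directory can then be passed to evaluate.py as in step 4.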

Local Testing

An example challenge (quantum_error_correction) with ground truth is available for validating the grading pipeline locally. See test_local.py for tests using this example.

# Clone CritPt repo for example challenge data
git clone https://github.com/EnvCommons/CritPt.git /tmp/CritPt

# Run tests
pytest test_local.py -v

Safety

CritPt problems are purely computational physics research tasks. Agents interact only with mathematical problem descriptions and produce Python code. No safety-sensitive capabilities are involved.

Citations

@article{critpt2025,
  title={Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark},
  author={Zhu, Minhui and Tian, Minyang and Yang, Xiaocheng and Zhou, Tianci and Yuan, Lifan and Zhu, Penghao and Chertkov, Eli and Liu, Shengyan and Du, Yufeng and Ji, Ziming and Das, Indranil and Cao, Junyi and Yu, Jiabin and Wu, Peixue and He, Jinchen and Su, Yifan and Jiang, Yikun and Zhang, Yujie and Liu, Chang and Huang, Ze-Min and Jia, Weizhen and Wang, Yunkai and Jafarpour, Farshid and Zhao, Yong and Chen, Xinan and Shelton, Jessie and Young, Aaron W. and Bartolotta, John and Xu, Wenchao and Sun, Yue and Chu, Anjun and Colussi, Victor and Akers, Chris and Brooks, Nathan and Fu, Wenbo and Zhao, Jinchao and Qi, Marvin and Mu, Anqi and Yang, Yubo and Zang, Allen and Lyu, Yang and Mai, Peizhi and Wilson, Christopher and Guo, Xuefei and Zhou, Juntai and Inafuku, Daniel and Xue, Chi and Gao, Luyu and Yang, Ze and Hein, Ya{\"\i}r and Kahn, Yonatan and Zhou, Kevin and Luo, Di and Wilson, John Drew and Reilly, Jarrod T. and Bandak, Dmytro and Press, Ofir and Yang, Liang and Wang, Xueying and Tong, Hao and Chia, Nicolas and Huerta, Eliu and Peng, Hao},
  journal={arXiv preprint arXiv:2509.26574},
  year={2025}
}
