PhysicsEval is an environment for evaluating agents on physics problems. It contains 19,609 physics problems drawn from authoritative textbooks, covering mechanics, thermodynamics, electromagnetism, quantum physics, and more. An LLM grader evaluates the agent's answer against the gold target. The environment exercises:
- Solving physics problems across multiple domains (mechanics, thermodynamics, electromagnetism, quantum physics, etc.)
- Mathematical reasoning and problem-solving
- Single-turn question answering with LLM-graded correctness
PhysicsEval does not require a sandbox. It has minimal compute requirements.
There are two splits: train (17,647 tasks) and test (1,962 tasks), totaling 19,609 physics problems. Each task includes a problem statement, gold answer, category, and difficulty level. Problems span a range of physics topics from authoritative textbooks.
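For illustration only, a single task record might look like the sketch below. The field names (`problem`, `answer`, `category`, `difficulty`) are assumptions based on the description above, not necessarily the dataset's actual column names.

```python
# Hypothetical task record; field names are illustrative, not the dataset's real schema.
example_task = {
    "problem": "A 2 kg block slides down a frictionless incline of 30 degrees. "
               "Find its acceleration.",
    "answer": "a = g * sin(30 deg) = 9.8 * 0.5 = 4.9 m/s^2",
    "category": "mechanics",
    "difficulty": 3,  # 1-10 scale (see the difficulty table below)
}
```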
This is a sparse, verifiable reward environment. The agent calls the answer tool once with its solution, and the environment grades the submission with an LLM grader (gpt-5-mini). The grader compares the submitted answer against the gold answer and returns one of two verdicts (a grading sketch follows this list):
- CORRECT: Reward 1.0.
- INCORRECT: Reward 0.0.
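A minimal sketch of how this LLM-based grading could be wired up, assuming the OpenAI Python client. The grader model name (gpt-5-mini) comes from the description above, but the prompt text and the `grade` helper are our own illustration, not the environment's actual implementation.

```python
import os

from openai import OpenAI

# Requires the OPENAI_API_KEY environment variable (see the secrets note below).
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Illustrative grader prompt; the environment's real prompt may differ.
GRADER_PROMPT = """You are grading a physics answer.
Problem: {problem}
Gold answer: {gold}
Submitted answer: {submission}
Reply with exactly one word: CORRECT or INCORRECT."""

def grade(problem: str, gold: str, submission: str) -> float:
    """Return a sparse reward: 1.0 if the grader says CORRECT, else 0.0."""
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(
                problem=problem, gold=gold, submission=submission
            ),
        }],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("CORRECT") else 0.0
```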
Problems are sourced from the hosted HuggingFace dataset, converted to parquet format for efficient loading.
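A sketch of loading the two splits with the `datasets` library; the repository id below is a placeholder, not the actual hosted dataset location.

```python
from datasets import load_dataset

# Placeholder repository id; substitute the actual hosted HuggingFace dataset.
dataset = load_dataset("your-org/physicseval")

train, test = dataset["train"], dataset["test"]
print(len(train), len(test))  # expected: 17,647 train tasks and 1,962 test tasks
print(train[0])               # one task: problem statement, gold answer, category, difficulty
```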
Agents are given a single tool:
answer: Submit a final answer to the physics problem. The answer is graded by the LLM grader against the gold target. Returns whether the answer is correct. This tool can only be called once per task.
PhysicsEval is a single-turn environment. The agent receives a physics problem and submits one answer. Each task requires exactly one tool call.
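As a rough illustration, the tool could be exposed to the model as a JSON-schema function definition like the one below. The parameter name and description text are assumptions, not the environment's actual tool spec.

```python
# Hypothetical JSON-schema definition for the single `answer` tool.
ANSWER_TOOL = {
    "type": "function",
    "function": {
        "name": "answer",
        "description": (
            "Submit a final answer to the physics problem. "
            "Can only be called once per task."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string",
                    "description": "The complete worked solution and final answer.",
                },
            },
            "required": ["answer"],
        },
    },
}
```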
Problems are rated on a 1-10 difficulty scale, grouped into three tiers:
| Difficulty Tier | Scale | Train Tasks | Test Tasks |
|---|---|---|---|
| Easy | 1-4 | 3,308 (18.8%) | 365 (18.6%) |
| Medium | 5-7 | 13,492 (76.4%) | 1,488 (75.8%) |
| Hard | 8-10 | 847 (4.8%) | 109 (5.6%) |
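A small helper reproducing the tier boundaries from the table above; the function name is ours, chosen for illustration.

```python
def difficulty_tier(rating: int) -> str:
    """Map a 1-10 difficulty rating to the Easy/Medium/Hard tiers in the table."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be in 1-10, got {rating}")
    if rating <= 4:
        return "Easy"
    if rating <= 7:
        return "Medium"
    return "Hard"

assert difficulty_tier(3) == "Easy"
assert difficulty_tier(6) == "Medium"
assert difficulty_tier(9) == "Hard"
```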
Model Performance (from paper):
| Model | Easy | Medium | Hard |
|---|---|---|---|
| Phi-4-reasoning-plus (multi-agent) | 94.7 | 93.9 | 87.6 |
| o4-mini | 86.8 | 88.2 | 85.4 |
| DeepSeek-R1 | 94.1 | 83.4 | 72.7 |
| QwQ-32B | 94.6 | 81.9 | 71.0 |
| Llama 4 Maverick | 92.9 | 82.4 | 52.1 |
| Gemma 3 27B | 87.6 | 59.1 | 40.6 |
The benchmark shows clear difficulty scaling: top models achieve ~95% on easy problems but drop to ~70-88% on hard problems. The majority of tasks (76%) are medium difficulty.
PhysicsEval requires an OpenAI API key (OPENAI_API_KEY secret) for LLM-based grading of answers.
Agents in PhysicsEval are asked to solve physics problems. The environment does not present direct safety risks: agents only produce text answers and have no access to external systems or the internet beyond the single answer-submission tool.
@misc{siddique2025physicsevalinferencetimetechniquesimprove,
title={PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems},
author={Oshayer Siddique and J. M Areeb Uzair Alam and Md Jobayer Rahman Rafy and Syed Rifat Raiyan and Hasan Mahmud and Md Kamrul Hasan},
year={2025},
eprint={2508.00079},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.00079},
}