PhysicsEval is an environment for evaluating agents on physics problems. It contains 19,609 physics problems drawn from authoritative textbooks, covering mechanics, thermodynamics, electromagnetism, quantum physics, and more. An LLM grader evaluates the agent's answer against the gold target. The environment exercises:
- Solving physics problems across multiple domains (mechanics, thermodynamics, electromagnetism, quantum physics, etc.)
- Mathematical reasoning and problem-solving
- Single-turn question answering with LLM-graded correctness
PhysicsEval does not require a sandbox. It has minimal compute requirements.
There are two splits: train (17,647 tasks) and test (1,962 tasks), totaling 19,609 physics problems. Each task includes a problem statement, gold answer, category, and difficulty level. Problems span a range of physics topics from authoritative textbooks.
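For illustration only, a single task record might look like the sketch below. The field names (`problem`, `answer`, `category`, `difficulty`) are assumptions based on the description above, not necessarily the dataset's actual column names.

```python
# Hypothetical task record; field names are illustrative, not the dataset's real schema.
example_task = {
    "problem": "A 2 kg block slides down a frictionless incline of 30 degrees. "
               "Find its acceleration.",
    "answer": "a = g * sin(30 deg) = 9.8 * 0.5 = 4.9 m/s^2",
    "category": "mechanics",
    "difficulty": 3,  # 1-10 scale (see the difficulty table below)
}
```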
This is a sparse, verifiable reward environment. The agent calls the answer tool once with its solution, and the environment grades the submission with an LLM grader (gpt-5-mini). The grader compares the submitted answer against the gold answer and returns one of two verdicts (a grading sketch follows this list):
- CORRECT: Reward 1.0.
- INCORRECT: Reward 0.0.
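A minimal sketch of how this LLM-based grading could be wired up, assuming the OpenAI Python client. The grader model name (gpt-5-mini) comes from the description above, but the prompt text and the `grade` helper are our own illustration, not the environment's actual implementation.

```python
import os

from openai import OpenAI

# Requires the OPENAI_API_KEY environment variable (see the secrets note below).
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Illustrative grader prompt; the environment's real prompt may differ.
GRADER_PROMPT = """You are grading a physics answer.
Problem: {problem}
Gold answer: {gold}
Submitted answer: {submission}
Reply with exactly one word: CORRECT or INCORRECT."""

def grade(problem: str, gold: str, submission: str) -> float:
    """Return a sparse reward: 1.0 if the grader says CORRECT, else 0.0."""
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(
                problem=problem, gold=gold, submission=submission
            ),
        }],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("CORRECT") else 0.0
```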
Problems are sourced from the hosted HuggingFace dataset, converted to parquet format for efficient loading.
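A sketch of loading the two splits with the `datasets` library; the repository id below is a placeholder, not the actual hosted dataset location.

```python
from datasets import load_dataset

# Placeholder repository id; substitute the actual hosted HuggingFace dataset.
dataset = load_dataset("your-org/physicseval")

train, test = dataset["train"], dataset["test"]
print(len(train), len(test))  # expected: 17,647 train tasks and 1,962 test tasks
print(train[0])               # one task: problem statement, gold answer, category, difficulty
```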
Agents are given a single tool:
answer: Submit a final answer to the physics problem. The answer is graded by the LLM grader against the gold target. Returns whether the answer is correct. This tool can only be called once per task.
PhysicsEval is a single-turn environment. The agent receives a physics problem and submits one answer. Each task requires exactly one tool call.
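As a rough illustration, the tool could be exposed to the model as a JSON-schema function definition like the one below. The parameter name and description text are assumptions, not the environment's actual tool spec.

```python
# Hypothetical JSON-schema definition for the single `answer` tool.
ANSWER_TOOL = {
    "type": "function",
    "function": {
        "name": "answer",
        "description": (
            "Submit a final answer to the physics problem. "
            "Can only be called once per task."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string",
                    "description": "The complete worked solution and final answer.",
                },
            },
            "required": ["answer"],
        },
    },
}
```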
Problems are rated on a 1-10 difficulty scale, grouped into three tiers:
| Difficulty Tier | Scale | Train Tasks | Test Tasks |
|---|---|---|---|
| Easy | 1-4 | 3,308 (18.8%) | 365 (18.6%) |
| Medium | 5-7 | 13,492 (76.4%) | 1,488 (75.8%) |
| Hard | 8-10 | 847 (4.8%) | 109 (5.6%) |
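A small helper reproducing the tier boundaries from the table above; the function name is ours, chosen for illustration.

```python
def difficulty_tier(rating: int) -> str:
    """Map a 1-10 difficulty rating to the Easy/Medium/Hard tiers in the table."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be in 1-10, got {rating}")
    if rating <= 4:
        return "Easy"
    if rating <= 7:
        return "Medium"
    return "Hard"

assert difficulty_tier(3) == "Easy"
assert difficulty_tier(6) == "Medium"
assert difficulty_tier(9) == "Hard"
```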
Model Performance (from paper):
| Model | Easy | Medium | Hard |
|---|---|---|---|
| Phi-4-reasoning-plus (multi-agent) | 94.7 | 93.9 | 87.6 |
| o4-mini | 86.8 | 88.2 | 85.4 |
| DeepSeek-R1 | 94.1 | 83.4 | 72.7 |
| QwQ-32B | 94.6 | 81.9 | 71.0 |
| Llama 4 Maverick | 92.9 | 82.4 | 52.1 |
| Gemma 3 27B | 87.6 | 59.1 | 40.6 |
The benchmark shows clear difficulty scaling: top models achieve ~95% on easy problems but drop to ~70-88% on hard problems. The majority of tasks (76%) are medium difficulty.
PhysicsEval requires an OpenAI API key (OPENAI_API_KEY secret) for LLM-based grading of answers.
Agents in PhysicsEval are asked to solve physics problems. The environment does not present direct safety risks: agents only produce text answers and have no access to external systems or the internet beyond the single answer-submission tool.
@misc{siddique2025physicsevalinferencetimetechniquesimprove,
title={PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems},
author={Oshayer Siddique and J. M Areeb Uzair Alam and Md Jobayer Rahman Rafy and Syed Rifat Raiyan and Hasan Mahmud and Md Kamrul Hasan},
year={2025},
eprint={2508.00079},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.00079},
}