English | 简体中文
PhysLogic is a physics reasoning benchmark for evaluating both final-answer accuracy and the logicality of model reasoning processes.
This repository contains the evaluation code. The benchmark data are hosted on Hugging Face and loaded at runtime.
🎉 This work has been accepted to ICML 2026.
- Process-aware evaluation: PhysLogic evaluates not only whether a model reaches the right answer, but also whether its reasoning follows the core scientific logic behind the problem.
- Paper-derived physics problems: questions are constructed from logical derivations in physics academic papers, rather than from isolated textbook exercises.
- Logicality annotations: each example includes ordered logical nexuses and their importance weights, enabling automatic evaluation of reasoning fidelity, ordering, and progress.
- Structured benchmark split: the released benchmark covers physics subdomains, difficulty levels, and question types in a structured way.
PhysLogic evaluates physics problem solving from two perspectives: final-answer accuracy and reasoning-process logicality. The benchmark is built around the notion of scientific logicality: each model response is compared against the key logical steps needed to solve the problem as well as against the final answer.
The evaluator compares model reasoning against the logical nexuses using three metrics:
- F (Logical Fidelity): coverage of the required logical content.
- O (Causal Connection): consistency with the intended derivational order.
- P (Inferential Progress): forward movement along the reasoning path.
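The scoring itself is implemented in src/logicality_metrics.py. Purely as an illustration of the Fidelity idea (not the repository's actual implementation, which also uses the nexus importance weights and the annotated ordering), a nexus can be treated as covered when some sentence of the response is similar enough under the sentence encoder and threshold exposed via --encoder_model and --similarity_threshold:

```python
# Illustrative sketch only: count a logical nexus as "covered" when any response
# sentence matches it above a cosine-similarity threshold.
from sentence_transformers import SentenceTransformer, util

def fidelity_sketch(nexuses, response_sentences, threshold=0.3):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    nexus_emb = encoder.encode(nexuses, convert_to_tensor=True)
    resp_emb = encoder.encode(response_sentences, convert_to_tensor=True)
    sims = util.cos_sim(nexus_emb, resp_emb)        # shape: (num_nexuses, num_sentences)
    covered = sims.max(dim=1).values >= threshold   # best-matching sentence per nexus
    return covered.float().mean().item()            # fraction of nexuses covered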
Use Python 3.10+.
```bash
pip install -r requirements.txt
```

Set an API key for an OpenAI-compatible Chat Completions endpoint:

```bash
export OPENAI_API_KEY=...
```

For non-OpenAI providers, pass --base_url and optionally choose a different API-key environment variable with --api_key_env.
```bash
python src/benchmarking.py \
  --model_id gpt-4o-mini \
  --run_name gpt-4o-mini \
  --concurrency 12
```

--concurrency is the number of parallel API workers; it is not a single-request batch size. Increase it only within your provider's rate limits.
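As a rough illustration of what that means (the real client lives in src/model_client.py; call_api below is a hypothetical stand-in for one chat-completion request), a fixed pool of workers keeps at most --concurrency requests in flight at once:

```python
# Hypothetical sketch of the bounded-worker pattern behind --concurrency.
from concurrent.futures import ThreadPoolExecutor

def run_all(prompts, call_api, concurrency=12):
    # At most `concurrency` requests are in flight at any time; each request
    # is still sent individually, never merged into one batched call.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(call_api, prompts))
```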
For an OpenAI-compatible local or third-party endpoint:
```bash
python src/benchmarking.py \
  --model_id your-model \
  --base_url http://localhost:8000/v1 \
  --api_key_env OPENAI_API_KEY \
  --run_name your-model \
  --concurrency 8
```

Outputs are written to:

```
results/<run_name>/{choice,comp_n,comp_e,proof}.json
```
If you already generated model outputs, create a JSONL file with one object per line:
{"uid": "example-id", "answer_pred": "solution text with \\boxed{...}", "reasoning_pred": "optional reasoning text"}Then run:
python src/benchmarking.py \
--predictions_path predictions.jsonl \
--run_name my_predictionsreasoning_pred is optional. If it is absent, the evaluator uses API
reasoning_content when available, then <think>...</think> content when
present, and otherwise the full visible output.
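For example, if your generations are already in memory, a file in this shape can be written as follows (my_outputs and its fields are hypothetical placeholders for your own data):

```python
# Hypothetical example: convert previously generated outputs into predictions.jsonl.
import json

my_outputs = [  # placeholder for your own generations
    {"uid": "example-id", "answer": r"... \boxed{42}", "reasoning": "step-by-step text"},
]

with open("predictions.jsonl", "w", encoding="utf-8") as f:
    for item in my_outputs:
        record = {"uid": item["uid"], "answer_pred": item["answer"]}
        if item.get("reasoning"):            # reasoning_pred is optional
            record["reasoning_pred"] = item["reasoning"]
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```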
```bash
python src/result_summary.py --run_name gpt-4o-mini --save_json
```

This prints per-type and overall averages:

- Acc: macro accuracy over choice and comp_n only
- F/O/P: macro averages over all evaluated examples
- Recall/Precision: supporting values for Logical Fidelity
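For clarity, "macro" here is assumed to mean the unweighted mean of per-type averages rather than a pooled per-example mean; the per-type numbers below are hypothetical:

```python
# Illustration of macro averaging (hypothetical per-type accuracies).
import statistics

per_type_acc = {"choice": 0.72, "comp_n": 0.61}     # accuracy is reported for these types only
macro_acc = statistics.mean(per_type_acc.values())  # -> 0.665 for these example numbers
```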
For direct aggregation from a path:
```bash
python src/result_summary.py --results_dir results/gpt-4o-mini
```

Repository layout:

- src/benchmarking.py: benchmark runner and CLI orchestration
- src/model_client.py: OpenAI-compatible API client with concurrent requests
- src/answer_scoring.py: final-answer accuracy scoring for choice and comp_n
- src/logicality_metrics.py: F/O/P logicality metric implementation
- src/result_summary.py: per-type and overall result aggregation
- src/prompt/LLM_judge.md: optional LLM judge prompt for non-numeric comp_n cases
Additional options:

- --output_dir results
- --question_types choice,comp_n,comp_e,proof
- --encoder_model all-MiniLM-L6-v2
- --similarity_threshold 0.3
- --judge_model_id <model>
- --judge_concurrency 10
- --limit_per_type 2
--judge_model_id is only needed when a comp_n answer cannot be judged by
numeric extraction and requires an LLM textual judgement.