PRBench (Professional Reasoning Benchmark) is an environment for evaluating high-stakes professional reasoning in finance and legal domains. It contains 1,650 expert-authored conversations with detailed evaluation rubrics (10-30 weighted criteria per task) covering market microstructure, risk management, regulatory compliance, contract analysis, and case law application.
- Professional reasoning in finance and legal domains
- Multi-turn conversational context understanding
- Expert-level analysis with reference material synthesis
- Rubric-based quality evaluation
Agents are given a standard environment with no sandbox or file system access.
There is one split in this environment:
- test: 1,650 tasks (combined from finance, legal, finance_hard, legal_hard)
Tasks include full conversation history, reference texts, and hidden rubric criteria.
This is a single-turn environment. The agent submits a response via the submit_response tool. An LLM grader (gpt-5-mini) evaluates against 10-30 weighted rubric criteria with importance levels:
- Critically Important (8-10 points): Core requirements
- Important (4-7 points): Significant contributions
- Slightly Important (2-3 points): Nice-to-have additions
- Detrimental (-6 to -8 points): Harmful if present (penalty)
Scoring uses a clipped formula: reward = raw_points / max_positive_points, where negative totals (possible when detrimental criteria fire) are clipped to 0.0, so the reward ranges from 0.0 to 1.0.
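The scoring rule above can be sketched in a few lines. This is an illustrative reconstruction from the point ranges listed, not the official PRBench grader; the tuple representation of criteria is an assumption.

```python
# Sketch of clipped rubric scoring: sum awarded points, divide by the
# maximum achievable positive points, and clip negative totals to 0.0.

def rubric_score(criteria):
    """criteria: list of (points_awarded, max_points) per rubric item.

    Detrimental criteria carry negative points when triggered and do not
    count toward the positive maximum.
    """
    raw_points = sum(awarded for awarded, _ in criteria)
    max_positive = sum(mx for _, mx in criteria if mx > 0)
    # Clip below at zero so penalties cannot drive the reward negative.
    return max(0.0, raw_points) / max_positive

# Example: two criteria satisfied, one missed, one detrimental triggered.
score = rubric_score([(9, 9), (5, 5), (0, 3), (-6, -6)])
print(round(score, 3))  # (9 + 5 + 0 - 6) / (9 + 5 + 3) = 8/17 -> 0.471
```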
Data consists of Parquet files (finance.parquet, legal.parquet, finance_hard.parquet, legal_hard.parquet) sourced from HuggingFace ScaleAI/PRBench. Each row contains conversation history, reference texts, and rubric criteria. Data is stored on the OpenReward platform.
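To make the row structure concrete, here is an illustrative shape for a single task record. The field names and values are assumptions for illustration only; the actual column names in the Parquet files may differ.

```python
# Hypothetical shape of one PRBench task row (field names are
# illustrative, not confirmed against the actual dataset schema).
task = {
    "conversation": [
        {"role": "user", "content": "Review this indemnification clause..."},
    ],
    "reference_texts": [
        "Excerpt from the governing contract...",
    ],
    "rubric": [
        {
            "criterion": "Identifies the liability cap",
            "importance": "Critically Important",
            "points": 9,
        },
    ],
}
print(sorted(task))  # ['conversation', 'reference_texts', 'rubric']
```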
| Tool | Description |
|---|---|
| submit_response | Submit your professional response for rubric-based evaluation. Ends the episode. |
Single-turn. The agent reads the conversation context and reference materials, then submits one professional response.
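The single-turn flow can be sketched as follows. All function and object names here are hypothetical stand-ins, not the actual OpenReward API; only the submit_response tool name comes from the description above.

```python
# Hypothetical sketch of one PRBench episode: read the task context,
# produce a single response, and emit a submit_response tool call.

def run_episode(task, generate_response):
    """Build the prompt from the task, get one agent turn, submit it."""
    prompt = {
        "conversation": task["conversation"],
        "references": task["reference_texts"],
    }
    response = generate_response(prompt)  # the agent's only turn
    # Calling submit_response ends the episode; grading happens after.
    return {"tool": "submit_response", "arguments": {"response": response}}

call = run_episode(
    {
        "conversation": [{"role": "user", "content": "Assess this hedge..."}],
        "reference_texts": ["ISDA master agreement excerpt..."],
    },
    lambda prompt: "Professional analysis of the hedging position...",
)
print(call["tool"])  # submit_response
```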
PRBench evaluates expert-level professional reasoning with weighted multi-criteria rubrics in finance and legal domains.
An OpenAI API key is required for LLM-based grading; pass it via secrets={"openai_api_key": "..."}.
Agents in PRBench produce professional analysis in a standard environment. Responses are for evaluation purposes only and should not be used for actual financial or legal advice.
@article{akyurek2025prbench,
title={PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning},
author={Afra Feyza Aky{\"u}rek and Advait Gosai and Chen Bo Calvin Zhang and Vipul Gupta and Jaehwan Jeong and Anisha Gunjal and Tahseen Rabbani and Maria Mazzone and David Randolph and Mohammad Mahmoudi Meymand and Gurshaan Chattha and Paula Rodriguez and Diego Mares and Pavit Singh and Michael Liu and Subodh Chawla and Pete Cline and Lucy Ogaz and Ernesto Hernandez and Zihao Wang and Pavi Bhatter and Marcos Ayestaran and Bing Liu and Yunzhong He},
journal={arXiv preprint arXiv:2511.11562},
year={2025}
}