PRBench


Description

PRBench (Professional Reasoning Benchmark) is an environment for evaluating high-stakes professional reasoning in finance and legal domains. It contains 1,650 expert-authored conversations with detailed evaluation rubrics (10-30 weighted criteria per task) covering market microstructure, risk management, regulatory compliance, contract analysis, and case law application.

Capabilities

  • Professional reasoning in finance and legal domains
  • Multi-turn conversational context understanding
  • Expert-level analysis with reference material synthesis
  • Rubric-based quality evaluation

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

CC BY 4.0.

Tasks

There is one split in this environment:

  • test: 1,650 tasks (combined from finance, legal, finance_hard, legal_hard)

Tasks include full conversation history, reference texts, and hidden rubric criteria.
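A task row as described above can be modeled roughly like this; the class and field names are illustrative assumptions, not the actual PRBench Parquet schema (check the dataset on the Hub for the real column names):

```python
from dataclasses import dataclass

# Hedged sketch of one PRBench task row. Names are assumptions; only the
# three kinds of content (conversation, references, rubric) come from the
# text above.

@dataclass
class RubricCriterion:
    description: str   # what the grader checks for
    weight: int        # positive (rewarded) or negative (detrimental)

@dataclass
class Task:
    conversation: list[dict]       # full multi-turn conversation history
    references: list[str]          # reference texts supplied with the task
    rubric: list[RubricCriterion]  # hidden from the agent; used by the grader
```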

Reward Structure

This is a single-turn environment. The agent submits a response via the submit_response tool. An LLM grader (gpt-5-mini) evaluates against 10-30 weighted rubric criteria with importance levels:

  • Critically Important (8-10 points): Core requirements
  • Important (4-7 points): Significant contributions
  • Slightly Important (2-3 points): Nice-to-have additions
  • Detrimental (-6 to -8 points): Harmful if present (penalty)

Scoring uses a clipped score: reward = raw_points / max_positive_points, with negative totals clipped to 0.0, so the reward falls between 0.0 and 1.0.
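The scoring rule above can be sketched as follows; this is one plausible reading of the formula (weights and the helper name are illustrative, not the actual grader implementation):

```python
# Sketch of the clipped rubric-scoring formula described above.
def clipped_score(criteria):
    """criteria: list of (weight, satisfied) pairs.

    Positive weights reward satisfied criteria; negative (Detrimental)
    weights penalize them. The raw total is divided by the maximum
    achievable positive points and clipped below at 0.0.
    """
    raw_points = sum(w for w, satisfied in criteria if satisfied)
    max_positive = sum(w for w, _ in criteria if w > 0)
    return max(0.0, raw_points / max_positive)

# Example: two satisfied positive criteria, one missed, one detrimental
# criterion present: (9 + 5 - 6) / (9 + 5 + 3) = 8 / 17
reward = clipped_score([(9, True), (5, True), (3, False), (-6, True)])
```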

Data

Data consists of Parquet files (finance.parquet, legal.parquet, finance_hard.parquet, legal_hard.parquet) sourced from HuggingFace ScaleAI/PRBench. Each row contains conversation history, reference texts, and rubric criteria. Data is stored on the OpenReward platform.

Tools

submit_response — Submit your professional response for rubric-based evaluation. Ends the episode.

Time Horizon

Single-turn. The agent reads the conversation context and reference materials, then submits one professional response.

Environment Difficulty

Expert-level: tasks demand domain-specific professional reasoning in finance and law, graded against weighted multi-criteria rubrics of 10-30 criteria per task.

Other Environment Requirements

OpenAI API key required for LLM-based grading. Pass via secrets={"openai_api_key": "..."}.

Safety

Agents in PRBench produce professional analysis in a standard environment. Responses are for evaluation purposes only and should not be used for actual financial or legal advice.

Citation

@article{akyurek2025prbench,
  title={PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning},
  author={Afra Feyza Aky{\"u}rek and Advait Gosai and Chen Bo Calvin Zhang and Vipul Gupta and Jaehwan Jeong and Anisha Gunjal and Tahseen Rabbani and Maria Mazzone and David Randolph and Mohammad Mahmoudi Meymand and Gurshaan Chattha and Paula Rodriguez and Diego Mares and Pavit Singh and Michael Liu and Subodh Chawla and Pete Cline and Lucy Ogaz and Ernesto Hernandez and Zihao Wang and Pavi Bhatter and Marcos Ayestaran and Bing Liu and Yunzhong He},
  journal={arXiv preprint arXiv:2511.11562},
  year={2025}
}
