EnvCommons/GDPVal

GDPVal

Description

GDPVal is an environment for evaluating agents on real-world knowledge work tasks. It is based on the GDPval benchmark from OpenAI: agents are given workplace tasks across 44 occupations, must analyze reference materials (Excel, PDF, Word, PowerPoint), and create deliverable files. Evaluation uses weighted rubric-based scoring with an LLM grader that has tool access to inspect submitted files.

Capabilities

  • Analyzing reference materials (Excel, PDF, Word, PowerPoint documents)
  • Creating deliverable files (reports, spreadsheets, presentations)
  • Meeting detailed rubric criteria (30-56 criteria per task)
  • Multi-step knowledge work reasoning across 44 occupations

Compute Requirements

Agents in GDPVal are given a sandbox with 2 CPUs and 2 GB RAM, with access to document manipulation tools (Excel, PDF, Word, PowerPoint toolsets).

License

CC BY 4.0.

Tasks

There is one split: testv2 (220 tasks). Each task is a real-world knowledge work challenge from one of 44 occupations, which together span 9 major U.S. GDP-contributing sectors (Professional Services, Finance, Healthcare, Education, etc.).

Reward Structure

This is a multi-turn environment with weighted rubric-based scoring. The agent works in the sandbox and calls submit_deliverable when finished. An LLM grader (gpt-5-mini with tool access) evaluates each rubric criterion:

  • Reward ranges from 0.0 to 1.0 and is computed as (sum of passed criteria points) / (total possible points)
  • Tasks have 30-56 evaluation criteria with varying point values
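
The reward formula above can be illustrated with a minimal sketch. The `Criterion` class, the `rubric_reward` function, and the example criteria are hypothetical names made up for illustration. They are not the environment's actual grader code, and the sketch assumes every criterion carries a positive point value.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, for illustration only)."""
    description: str
    points: float   # weight of this criterion; assumed positive
    passed: bool    # the grader's pass/fail verdict


def rubric_reward(criteria: Sequence[Criterion]) -> float:
    """Return (sum of passed criteria points) / (total possible points)."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("rubric must have a positive total point value")
    earned = sum(c.points for c in criteria if c.passed)
    return earned / total


# Made-up example: 5 of 10 possible points are earned.
example = [
    Criterion("Deliverable is an .xlsx file", 2, True),
    Criterion("Totals reconcile with the reference data", 5, False),
    Criterion("Summary tab lists key assumptions", 3, True),
]
print(rubric_reward(example))  # 0.5
```

Because criteria carry different point values, failing one heavily weighted criterion can cost more than failing several lightly weighted ones.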

Data

Tasks are derived from the GDPval benchmark from OpenAI. Reference files and deliverable templates are stored on the OpenReward platform.

Tools

Agents are given CLI tools and document manipulation tools:

CLI Tools: bash, read, write, edit, glob, grep, ls, todo_write

Document Tools:

  • Excel: excel_read, excel_write, excel_list_sheets, etc.
  • PDF: pdf_extract_text, pdf_read, etc.
  • Word: word_read, word_write, etc.
  • PowerPoint: ppt_read, ppt_write, etc.

Submission: submit_deliverable submits the completed file for rubric-based evaluation.

Time Horizon

GDPVal is a multi-turn environment. Agents iteratively analyze reference files, create deliverables, and refine their work before submitting the final file.

Environment Difficulty

Model performance on GDPVal from the original paper (win rates against human experts):

| Model | Win Rate |
|---|---|
| Claude Opus 4.1 | 47.6% |
| GPT-5 | 39.0% |
| o3 | 35.2% |
| o4-mini | 29.1% |
| GPT-4o | 12.5% |

Frontier models are approaching but have not yet matched industry experts (averaging 14 years of experience) in deliverable quality.

Other Environment Requirements

  • OpenAI API key: Required for LLM-based rubric grading. Pass via secrets={"openai_api_key": "..."}.
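
One way to build the `secrets` mapping without hard-coding the key is sketched below. The helper name `build_secrets` is hypothetical, and reading from the `OPENAI_API_KEY` environment variable is an assumption rather than something this environment requires. How the resulting dict is passed to the environment depends on the client you use and is not shown here.

```python
import os
from typing import Mapping


def build_secrets(env: Mapping[str, str] = os.environ) -> dict:
    """Build the secrets dict from an environment mapping (hypothetical helper).

    Assumes the key is stored under OPENAI_API_KEY; adjust to your setup.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; the rubric grader needs it")
    # The key name "openai_api_key" matches the secrets example in this README.
    return {"openai_api_key": key}
```

Keeping the key in the process environment avoids committing it to scripts or notebooks.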

Safety

Agents in GDPVal complete knowledge work tasks in a sandboxed environment. The environment does not involve sensitive personal data or real business operations.

Citations

@article{patwardhan2025gdpval,
  title={GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks},
  author={Patwardhan, Tejal and Dias, Rachel and Proehl, Elizabeth and Kim, Grace and Wang, Michele and Watkins, Olivia and Fishman, Sim{\'o}n Posada and Aljubeh, Marwan and Thacker, Phoebe and Fauconnet, Laurance and Kim, Natalie S. and Chao, Patrick and Miserendino, Samuel and Chabot, Gildas and Li, David and Sharman, Michael and Barr, Alexandra and Glaese, Amelia and Tworek, Jerry},
  journal={arXiv preprint arXiv:2510.04374},
  year={2025},
  url={https://arxiv.org/abs/2510.04374}
}
