GDPVal is an environment for evaluating agents on real-world knowledge work tasks. Based on the GDPval benchmark from OpenAI, it presents workplace tasks drawn from 44 occupations; agents must analyze reference materials (Excel, PDF, Word, PowerPoint) and create deliverable files. Evaluation uses weighted rubric-based scoring with an LLM grader that has tool access to inspect submitted files.

Tasks involve:
- Analyzing reference materials (Excel, PDF, Word, PowerPoint documents)
- Creating deliverable files (reports, spreadsheets, presentations)
- Meeting detailed rubric criteria (30-56 criteria per task)
- Multi-step knowledge work reasoning across 44 occupations
Agents in GDPVal are given a sandbox with 2 CPUs and 2 GB RAM, with access to document manipulation tools (Excel, PDF, Word, PowerPoint toolsets).
There is one split: testv2 (220 tasks). Each task corresponds to a real-world knowledge work challenge across 44 occupations spanning 9 major U.S. GDP-contributing sectors (Professional Services, Finance, Healthcare, Education, etc.).
This is a multi-turn environment with weighted rubric-based scoring. The agent works in the sandbox and calls submit_deliverable when finished. An LLM grader (gpt-5-mini with tool access) evaluates each rubric criterion:
- Reward ranges from 0.0 to 1.0, computed as (sum of points for passed criteria) / (total possible points)
- Tasks have 30-56 evaluation criteria with varying point values
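The scoring rule above can be sketched in a few lines of Python; the `(points, passed)` pairs below are illustrative, not taken from a real task:

```python
# Weighted rubric scoring: reward = earned points / total possible points.
# The (point_value, passed) pairs are illustrative, not from a real task.

def rubric_reward(criteria):
    """criteria: list of (point_value, passed) pairs from the LLM grader."""
    total = sum(points for points, _ in criteria)
    earned = sum(points for points, passed in criteria if passed)
    return earned / total if total else 0.0

graded = [(2, True), (1, False), (3, True), (2, True)]
print(rubric_reward(graded))  # 7/8 = 0.875
```

Because criteria carry different point values, failing one high-weight criterion can cost more reward than failing several low-weight ones.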
Tasks are derived from the GDPval benchmark from OpenAI. Reference files and deliverable templates are stored on the OpenReward platform.
Agents are given CLI tools and document manipulation tools:

CLI Tools: `bash`, `read`, `write`, `edit`, `glob`, `grep`, `ls`, `todo_write`

Document Tools:
- Excel: `excel_read`, `excel_write`, `excel_list_sheets`, etc.
- PDF: `pdf_extract_text`, `pdf_read`, etc.
- Word: `word_read`, `word_write`, etc.
- PowerPoint: `ppt_read`, `ppt_write`, etc.

Submission: `submit_deliverable` - Submit the completed file for rubric-based evaluation.
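As an illustration, a single task might chain these tools as below; the tool names come from the list above, but the argument shapes are assumptions, since the real toolset defines its own schemas:

```python
# Hypothetical tool-call trace: inspect a reference workbook, draft the
# deliverable, then submit it for rubric grading. Argument shapes are assumed.
turn = [
    {"tool": "excel_read", "args": {"path": "reference.xlsx", "sheet": "Sheet1"}},
    {"tool": "word_write", "args": {"path": "report.docx", "content": "Findings..."}},
    {"tool": "submit_deliverable", "args": {"path": "report.docx"}},
]
print(turn[-1]["tool"])  # submit_deliverable
```

The episode ends once `submit_deliverable` is called, so agents should finish refining the deliverable before submitting.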
GDPVal is a multi-turn environment. Agents iteratively analyze reference files, create deliverables, and refine their work before submitting the final file.
Model performance on GDPval reported in the original paper (win rates against human experts):
| Model | Win Rate |
|---|---|
| Claude Opus 4.1 | 47.6% |
| GPT-5 | 39.0% |
| o3 | 35.2% |
| o4-mini | 29.1% |
| GPT-4o | 12.5% |
Frontier models are approaching but have not yet matched industry experts (averaging 14 years of experience) in deliverable quality.
- OpenAI API key: Required for LLM-based rubric grading. Pass via `secrets={"openai_api_key": "..."}`.
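A minimal setup sketch, assuming the environment is loaded through a Python client; the commented-out loader call is hypothetical, and only the `openai_api_key` secret key comes from this document:

```python
import os

# "openai_api_key" is the secret key documented above; reading it from the
# process environment avoids hard-coding the credential.
secrets = {"openai_api_key": os.environ.get("OPENAI_API_KEY", "")}

# env = load_environment("gdpval", secrets=secrets)  # hypothetical loader call
print(sorted(secrets))  # ['openai_api_key']
```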
Agents in GDPVal complete knowledge work tasks in a sandboxed environment. The environment does not involve sensitive personal data or real business operations.
@article{patwardhan2025gdpval,
title={GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks},
author={Patwardhan, Tejal and Dias, Rachel and Proehl, Elizabeth and Kim, Grace and Wang, Michele and Watkins, Olivia and Fishman, Sim{\'o}n Posada and Aljubeh, Marwan and Thacker, Phoebe and Fauconnet, Laurance and Kim, Natalie S. and Chao, Patrick and Miserendino, Samuel and Chabot, Gildas and Li, David and Sharman, Michael and Barr, Alexandra and Glaese, Amelia and Tworek, Jerry},
journal={arXiv preprint arXiv:2510.04374},
year={2025},
url={https://arxiv.org/abs/2510.04374}
}