GDPVal is an environment for evaluating agents on real-world knowledge work tasks. Based on the GDPval benchmark from OpenAI, it presents workplace tasks drawn from 44 occupations; agents must analyze reference materials (Excel, PDF, Word, PowerPoint) and create deliverable files. Evaluation uses weighted rubric-based scoring with an LLM grader that has tool access to inspect submitted files.

Tasks involve:
- Analyzing reference materials (Excel, PDF, Word, PowerPoint documents)
- Creating deliverable files (reports, spreadsheets, presentations)
- Meeting detailed rubric criteria (30-56 criteria per task)
- Multi-step knowledge work reasoning across 44 occupations
Agents in GDPVal are given a sandbox with 2 CPUs and 2 GB RAM, with access to document manipulation tools (Excel, PDF, Word, PowerPoint toolsets).
There is one split: testv2 (220 tasks). Each task corresponds to a real-world knowledge work challenge across 44 occupations spanning 9 major U.S. GDP-contributing sectors (Professional Services, Finance, Healthcare, Education, etc.).
This is a multi-turn environment with weighted rubric-based scoring. The agent works in the sandbox and calls submit_deliverable when finished. An LLM grader (gpt-5-mini with tool access) evaluates each rubric criterion:
- Reward ranges from 0.0 to 1.0, computed as (sum of points for passed criteria) / (total possible points)
- Tasks have 30-56 evaluation criteria with varying point values
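The scoring rule above can be sketched in a few lines of Python; the `(points, passed)` pairs below are illustrative, not taken from a real task:

```python
# Weighted rubric scoring: reward = earned points / total possible points.
# The (point_value, passed) pairs are illustrative, not from a real task.

def rubric_reward(criteria):
    """criteria: list of (point_value, passed) pairs from the LLM grader."""
    total = sum(points for points, _ in criteria)
    earned = sum(points for points, passed in criteria if passed)
    return earned / total if total else 0.0

graded = [(2, True), (1, False), (3, True), (2, True)]
print(rubric_reward(graded))  # 7/8 = 0.875
```

Because criteria carry different point values, failing one high-weight criterion can cost more reward than failing several low-weight ones.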
Tasks are derived from the GDPval benchmark from OpenAI. Reference files and deliverable templates are stored on the OpenReward platform.
Agents are given CLI tools and document manipulation tools:

CLI Tools: `bash`, `read`, `write`, `edit`, `glob`, `grep`, `ls`, `todo_write`

Document Tools:
- Excel: `excel_read`, `excel_write`, `excel_list_sheets`, etc.
- PDF: `pdf_extract_text`, `pdf_read`, etc.
- Word: `word_read`, `word_write`, etc.
- PowerPoint: `ppt_read`, `ppt_write`, etc.

Submission: `submit_deliverable` - Submit the completed file for rubric-based evaluation.
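As an illustration, a single task might chain these tools as below; the tool names come from the list above, but the argument shapes are assumptions, since the real toolset defines its own schemas:

```python
# Hypothetical tool-call trace: inspect a reference workbook, draft the
# deliverable, then submit it for rubric grading. Argument shapes are assumed.
turn = [
    {"tool": "excel_read", "args": {"path": "reference.xlsx", "sheet": "Sheet1"}},
    {"tool": "word_write", "args": {"path": "report.docx", "content": "Findings..."}},
    {"tool": "submit_deliverable", "args": {"path": "report.docx"}},
]
print(turn[-1]["tool"])  # submit_deliverable
```

The episode ends once `submit_deliverable` is called, so agents should finish refining the deliverable before submitting.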
GDPVal is a multi-turn environment. Agents iteratively analyze reference files, create deliverables, and refine their work before submitting the final file.
Model performance on GDPval reported in the original paper (win rates against human experts):
| Model | Win Rate |
|---|---|
| Claude Opus 4.1 | 47.6% |
| GPT-5 | 39.0% |
| o3 | 35.2% |
| o4-mini | 29.1% |
| GPT-4o | 12.5% |
Frontier models are approaching but have not yet matched industry experts (averaging 14 years of experience) in deliverable quality.
- OpenAI API key: Required for LLM-based rubric grading. Pass via `secrets={"openai_api_key": "..."}`.
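A minimal setup sketch, assuming the environment is loaded through a Python client; the commented-out loader call is hypothetical, and only the `openai_api_key` secret key comes from this document:

```python
import os

# "openai_api_key" is the secret key documented above; reading it from the
# process environment avoids hard-coding the credential.
secrets = {"openai_api_key": os.environ.get("OPENAI_API_KEY", "")}

# env = load_environment("gdpval", secrets=secrets)  # hypothetical loader call
print(sorted(secrets))  # ['openai_api_key']
```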
Agents in GDPVal complete knowledge work tasks in a sandboxed environment. The environment does not involve sensitive personal data or real business operations.
@article{patwardhan2025gdpval,
title={GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks},
author={Patwardhan, Tejal and Dias, Rachel and Proehl, Elizabeth and Kim, Grace and Wang, Michele and Watkins, Olivia and Fishman, Sim{\'o}n Posada and Aljubeh, Marwan and Thacker, Phoebe and Fauconnet, Laurance and Kim, Natalie S. and Chao, Patrick and Miserendino, Samuel and Chabot, Gildas and Li, David and Sharman, Michael and Barr, Alexandra and Glaese, Amelia and Tworek, Jerry},
journal={arXiv preprint arXiv:2510.04374},
year={2025},
url={https://arxiv.org/abs/2510.04374}
}