APEX-Agents

Description

APEX-Agents (AI Productivity Index for Agents) is an environment for evaluating AI agents on realistic workplace tasks across three professional domains: Investment Banking, Law, and Management Consulting. It contains 480 tasks built on 33 realistic workplace scenarios. Each task requires multi-turn interaction: exploring files, analyzing documents, and creating professional deliverables.

Capabilities

Multi-turn workplace task completion
Document analysis and file exploration (PDFs, spreadsheets, Word, PowerPoint)
Professional deliverable creation
Sandboxed command execution and file manipulation

Compute Requirements

Each agent is given an isolated Docker sandbox with 2 CPUs and 2 GB RAM. Task-specific filesystems with PDFs, spreadsheets, and documents are mounted read-only.
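
The resource limits above correspond to standard Docker flags. As a minimal sketch, assuming a hypothetical image name (`apex-agents-sandbox`) and hypothetical host and container paths that do not come from this repository, a launcher might assemble the `docker run` arguments like this:

```python
from typing import List


def sandbox_run_args(image: str, world_dir: str, mount_point: str = "/data") -> List[str]:
    """Build `docker run` arguments for one isolated sandbox.

    The image name, host directory, and mount point are illustrative
    assumptions; only the 2 CPU / 2 GB limits and the read-only mount
    come from the description above.
    """
    return [
        "docker", "run", "--rm",
        "--cpus", "2",                          # limit the container to 2 CPUs
        "--memory", "2g",                       # limit the container to 2 GB RAM
        "-v", f"{world_dir}:{mount_point}:ro",  # mount task files read-only
        image,
    ]


args = sandbox_run_args("apex-agents-sandbox", "/tmp/world_files/world_01")
print(" ".join(args))
```

The environment's own sandbox setup may differ; this only illustrates how the stated limits could be expressed.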

License

CC BY 4.0.

Tasks

There is one split in this environment:

test: 480 tasks (160 per domain: Investment Banking, Law, Management Consulting)

Tasks include task-specific filesystems based on 33 realistic workplace scenarios ("worlds") populated with relevant files, emails, presentations, and spreadsheets.

Reward Structure

This is a multi-turn environment with rubric-based evaluation. The agent uses CLI tools to explore files and complete the task, then submits via submit_answer (for console message tasks) or submit_files (for file-based outputs). An LLM grader (gpt-5-mini) evaluates the submission against the task's binary rubric criteria (between 1 and 10). All criteria must pass for reward=1.0; otherwise reward=0.0.
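
The all-or-nothing rule can be sketched in a few lines. The function name `compute_reward` is hypothetical and is not taken from the environment's code; it only mirrors the rule stated above.

```python
from typing import Sequence


def compute_reward(criteria_passed: Sequence[bool]) -> float:
    """Return 1.0 only if every binary rubric criterion passed, else 0.0."""
    if not criteria_passed:
        # Assumption: a task with no graded criteria earns no reward.
        # The description implies every task has at least one criterion.
        return 0.0
    return 1.0 if all(criteria_passed) else 0.0


print(compute_reward([True, True, True]))   # 1.0
print(compute_reward([True, False, True]))  # 0.0
```

Because a single failed criterion zeroes the reward, partially correct deliverables score the same as empty ones.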

Data

Data consists of JSON metadata (tasks_and_rubrics.json), world filesystems (world_files/) containing realistic workplace documents for each of the 33 scenarios, and task-specific files (task_files/) for individual tasks. It is sourced from mercor/apex-agents on HuggingFace and stored on the OpenReward platform.

Tools

Tool	Description
`submit_answer`	Submit text response for console message tasks. Ends the episode.
`submit_files`	Submit created/edited files for file-based tasks. Ends the episode.
`bash`	Execute shell commands in sandbox.
`read`	Read text file contents.
`write`	Write files.
`edit`	Edit existing files.
`grep`	Search file contents.
`glob`	Find files by pattern.
`ls`	List directory contents.
`excel_read`	Read Excel file contents.
`excel_list_sheets`	List sheets in an Excel file.
`word_read`	Read Word document contents.
`pdf_read`	Read PDF file contents.
`powerpoint_read`	Read PowerPoint file contents.
`powerpoint_list_slides`	List slides in a PowerPoint file.
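
An agent harness typically needs to decide which reading tool to call for a given file. The sketch below shows one way to do that by file extension. The tool names come from the table above, but the extension-to-tool mapping and the helper name `pick_read_tool` are assumptions for illustration.

```python
from pathlib import Path

# Assumed mapping from file extension to the reading tool listed above.
READER_BY_SUFFIX = {
    ".xlsx": "excel_read",
    ".docx": "word_read",
    ".pdf": "pdf_read",
    ".pptx": "powerpoint_read",
}


def pick_read_tool(path: str) -> str:
    """Choose a reading tool by extension, falling back to plain `read`."""
    suffix = Path(path).suffix.lower()
    return READER_BY_SUFFIX.get(suffix, "read")


print(pick_read_tool("deal/model.xlsx"))
print(pick_read_tool("notes.txt"))
```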

Time Horizon

Multi-turn. Agents explore files and execute commands before submitting final deliverables.

Environment Difficulty

Tasks are complex multi-step professional workflows that experienced professionals estimate take 1-2 hours to complete. Current leaderboard scores (Pass@1) from mercor.com/apex:

Model	Pass@1
Gemini 3.1 Pro (High)	33.5%
GPT 5.3 Codex (High)	31.7%
Opus 4.6 (High)	29.8%
GPT 5.2 Codex (High)	27.6%
Gemini 3 Flash (High)	24.0%
GPT 5.2 (High)	23.0%
GPT 5.1 Codex (High)	20.6%
GPT 5 Codex (High)	20.0%
Opus 4.5 (High)	18.4%
Gemini 3 Pro (High)	18.4%
GPT 5 (High)	18.3%
Grok 4	15.2%

Other Environment Requirements

An OpenAI API key is required for LLM-based grading. Pass it via secrets={"openai_api_key": "..."}.
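
To avoid hard-coding the key, the secrets dictionary can be built from an environment variable. This is a minimal sketch: the helper name `build_secrets` and the variable name `OPENAI_API_KEY` are assumptions, and only the `openai_api_key` dictionary key comes from the text above.

```python
import os
from typing import Dict, Mapping


def build_secrets(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the secrets dict from an environment mapping.

    OPENAI_API_KEY is an assumed variable name, chosen by convention.
    """
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before running the environment.")
    return {"openai_api_key": key}


# Example with an explicit mapping and a placeholder value instead of a real key.
secrets = build_secrets({"OPENAI_API_KEY": "sk-example"})
print(sorted(secrets))
```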

Safety

Agents in APEX-Agents operate within sandboxed environments with read-only data mounts. The environment does not present direct safety risks.

Citation

@misc{vidgen2026apexagents,
  title={APEX--Agents},
  author={Vidgen, Bertie and Mann, Austin and Fennelly, Abby and Wright Stanly, John and Rothman, Lucas and Burstein, Marco and Benchek, Julien and Ostrofsky, David and Ravichandran, Anirudh and Sur, Debnil and Venugopal, Neel and Hsia, Alannah and Robinson, Isaac and Huang, Calix and Varones, Olivia and Khan, Daniyal and Haines, Michael and Richards, Zach and Mahapatra, Chirag and Foody, Brendan and Nitski, Osvald},
  year={2026},
  howpublished={arXiv},
  url={https://arxiv.org/abs/2601.14242}
}
