MMLU-ProX


Description

MMLU-ProX is an environment for evaluating agents on multilingual multiple-choice question answering. It is based on the MMLU-ProX dataset from HuggingFace (li-lab/MMLU-ProX), which extends MMLU-Pro to 29 languages. Each task presents a question with 10 answer options (A through J) across 14+ subject categories. Grading is deterministic via exact match.

Capabilities

  • Multilingual multiple-choice question answering across 29 languages
  • Knowledge reasoning across 14+ subject categories (mathematics, science, health, business, humanities, etc.)
  • Single-turn evaluation with deterministic grading

Compute Requirements

MMLU-ProX extends Environment directly and does not require a sandbox. It has minimal compute requirements.

License

MIT.

Tasks

There are 58 splits (29 languages × 2 split types), named {language}_{split}:

  • Validation: 70 examples per language (2,030 total)
  • Test: ~11,800 examples per language (~341,011 total)
  • Total: 343,041 examples

Languages: af, ar, bn, cs, de, en, es, fr, hi, hu, id, it, ja, ko, mr, ne, pt, ru, sr, sw, te, th, uk, ur, vi, wo, yo, zh, zu.
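The split-naming scheme above can be sketched as follows; the language codes and split-type names come from this README, and the exact string format is assumed to be a simple underscore join.

```python
# Sketch: enumerate the 58 split names, assuming the
# {language}_{split} naming convention described above.
LANGUAGES = [
    "af", "ar", "bn", "cs", "de", "en", "es", "fr", "hi", "hu",
    "id", "it", "ja", "ko", "mr", "ne", "pt", "ru", "sr", "sw",
    "te", "th", "uk", "ur", "vi", "wo", "yo", "zh", "zu",
]
SPLIT_TYPES = ["validation", "test"]

splits = [f"{lang}_{split}" for lang in LANGUAGES for split in SPLIT_TYPES]
print(len(splits))  # 58 (29 languages x 2 split types)
```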

Questions span 14+ subject areas including mathematics, science, health, business, humanities, computer science, law, and more.

Reward Structure

This is a sparse, verifiable reward environment with binary scoring. The agent calls submit_answer once with a letter (A-J). The answer is compared via exact match against the correct answer:

  • Correct: Reward 1.0.
  • Incorrect: Reward 0.0.

We do not use LLM graders for this task.
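The binary exact-match rule above can be sketched as a small grading function. The function name and the whitespace/case normalization are illustrative assumptions, not the environment's actual API.

```python
# Sketch of the binary exact-match grading rule described above.
# The helper name and the strip/upper normalization are assumptions.
def grade(submitted: str, correct: str) -> float:
    """Return 1.0 on an exact letter match (A-J), 0.0 otherwise."""
    return 1.0 if submitted.strip().upper() == correct.strip().upper() else 0.0

print(grade("C", "C"))   # 1.0
print(grade("B", "C"))   # 0.0
```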

Data

Questions are sourced from the li-lab/MMLU-ProX HuggingFace dataset, consolidated into a single parquet file for efficient loading via predicate pushdown. Data files are stored on the OpenReward platform.

Tools

Agents are given a single tool:

  • submit_answer: Submit an answer letter (A through J) for the current question. Returns whether the answer is correct. This tool can only be called once per task.
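An illustrative JSON-schema shape for the tool above; the environment's actual schema is not shown in this README, so every field here is an assumption.

```python
# Hypothetical tool schema for submit_answer; field names and
# structure are assumed for illustration only.
import json

SUBMIT_ANSWER_SCHEMA = {
    "name": "submit_answer",
    "description": "Submit an answer letter (A through J) for the current question.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                # The ten valid option letters, A through J.
                "enum": [chr(ord("A") + i) for i in range(10)],
            }
        },
        "required": ["answer"],
    },
}
print(json.dumps(SUBMIT_ANSWER_SCHEMA["parameters"]["properties"]["answer"]["enum"]))
```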

Time Horizon

MMLU-ProX is a single-turn environment. The agent receives a question with 10 options and submits one answer. Each task requires exactly one tool call.

Environment Difficulty

Model performance on MMLU-ProX from the original paper (5-shot CoT):

| Model         | English | Swahili |
|---------------|---------|---------|
| QwQ-32B       | 70.7%   | 32.8%   |
| Qwen2.5-72B   | 70.3%   | 40.1%   |
| Llama3.1-405B | 68.8%   | 52.1%   |

Performance degrades significantly from high-resource to low-resource languages: in the figures above, the English-to-Swahili gap reaches nearly 38 percentage points for QwQ-32B.

Other Environment Requirements

There are no further environment requirements; MMLU-ProX works out of the box with the OpenReward endpoint without any secrets.

Safety

Agents in MMLU-ProX are asked to answer multiple-choice knowledge questions. The environment does not present direct safety risks, as agents only provide letter answers with no access to external systems, tools, or the internet.

Citation

@inproceedings{xuan2025mmluprox,
  title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
  author={Xuan, Weihao and Yang, Rui and Qi, Heli and Zeng, Qingcheng and Xiao, Yunze and Feng, Aosong and Liu, Dairui and Xing, Yun and Wang, Junjue and Gao, Fan and others},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025},
  url={https://arxiv.org/abs/2503.10497}
}
