MMMLU (Multilingual MMLU) is an environment for evaluating multilingual massive multitask language understanding. It is based on OpenAI's MMMLU dataset, which consists of professional human translations of the MMLU test set into 14 languages. Agents answer four-option multiple-choice questions (A/B/C/D) across 57 subject categories spanning STEM, the humanities, the social sciences, and professional domains.
- Multilingual knowledge reasoning across 14 languages
- Multiple-choice question answering across 57 subject categories
- Evaluation on professional human translations (not machine translations)
Agents are given a standard environment with no sandbox or file system access.
15 splits:
- 14 language-specific splits (ar_xy, bn_bd, de_de, es_la, fr_fr, hi_in, id_id, it_it, ja_jp, ko_kr, pt_br, sw_ke, yo_ng, zh_cn), each with 14,042 tasks
- 1 combined test split (196,588 tasks)
Total: 196,588 unique tasks (14,042 questions × 14 languages); the combined test split is the union of the 14 language-specific splits.
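As a quick sanity check, the sketch below loads one language split directly from the Hugging Face dataset. The config name `FR_FR` and the column names (`Question`, `A`–`D`, `Answer`, `Subject`) are assumptions taken from the dataset card, so verify them against your copy:

```python
from datasets import load_dataset

# Load one language split; config names such as "FR_FR" are assumed to
# mirror the split identifiers above (fr_fr -> FR_FR).
ds = load_dataset("openai/MMMLU", "FR_FR", split="test")
print(len(ds))  # expected: 14042 tasks per language split

# Column names are assumed from the dataset card; verify on your copy.
row = ds[0]
prompt = (
    f"{row['Question']}\n"
    f"A. {row['A']}\nB. {row['B']}\nC. {row['C']}\nD. {row['D']}"
)
gold = row["Answer"]  # one of "A", "B", "C", "D"
print(prompt, "\nGold:", gold)
```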
Single-turn evaluation. The agent submits an answer (A, B, C, or D) via the submit_answer tool. Reward is deterministic and based on exact match: 1.0 if correct, 0.0 if incorrect.
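A minimal sketch of this scoring rule (the function name and input normalization are illustrative, not the platform's implementation):

```python
def score_answer(submitted: str, gold: str) -> float:
    """Deterministic exact-match reward: 1.0 if the submitted letter
    matches the gold letter, 0.0 otherwise."""
    # Normalize case and whitespace so "b" or " B " still count as "B".
    return 1.0 if submitted.strip().upper() == gold.strip().upper() else 0.0

assert score_answer("b", "B") == 1.0
assert score_answer("C", "D") == 0.0
```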
15 Parquet files (~100 MB total), sourced from the Hugging Face dataset openai/MMMLU. Data is hosted on the OpenReward platform.
submit_answer — Submit an answer choice (A, B, C, or D).
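For orientation, here is a hypothetical JSON-schema-style definition of this tool; the field layout is illustrative, and the actual OpenReward schema may differ:

```python
# Hypothetical definition of the submit_answer tool; the real schema
# exposed by the OpenReward platform may differ.
SUBMIT_ANSWER_TOOL = {
    "name": "submit_answer",
    "description": "Submit the final answer choice for the current question.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "enum": ["A", "B", "C", "D"],
                "description": "The selected answer choice.",
            }
        },
        "required": ["answer"],
    },
}
```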
Single-turn.
OpenAI reports the following accuracies on MMMLU, averaged across the 14 languages:
| Model | Average Accuracy |
|---|---|
| o3-high | 88.8% |
| o1 | 87.7% |
| o4-mini-high | 85.2% |
| gpt-4.5-preview | 85.1% |
| gpt-4.1 | 83.7% |
| gpt-4o | 81.4% |
| gpt-4.1-mini | 78.5% |
| gpt-4o-mini | 70.5% |
There are no further environment requirements; MMMLU works out of the box with the OpenReward endpoint without any external API keys.
Agents in MMMLU answer multilingual multiple-choice questions in a standard environment. The environment does not present direct safety risks.
@article{hendrycks2021measuring,
title={Measuring Massive Multitask Language Understanding},
author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}