MMMLU (Multilingual MMLU) is an environment for evaluating multilingual massive multitask language understanding. It is based on OpenAI's MMMLU dataset, which consists of professional human translations of the MMLU test set into 14 languages. Agents answer four-option multiple-choice questions (A/B/C/D) across 57 subject categories spanning STEM, the humanities, the social sciences, and professional domains.
- Multilingual knowledge reasoning across 14 languages
- Multiple-choice question answering across 57 subject categories
- Evaluation on professional human translations (not machine translations)
Agents are given a standard environment with no sandbox or file system access.
15 splits:
- 14 language-specific splits (ar_xy, bn_bd, de_de, es_la, fr_fr, hi_in, id_id, it_it, ja_jp, ko_kr, pt_br, sw_ke, yo_ng, zh_cn), each with 14,042 tasks
- 1 combined test split (196,588 tasks)
Total: 196,588 unique tasks (14,042 questions × 14 languages); the combined test split is the union of the 14 language-specific splits.
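As a quick sanity check, the sketch below loads one language split directly from the Hugging Face dataset. The config name `FR_FR` and the column names (`Question`, `A`–`D`, `Answer`, `Subject`) are assumptions taken from the dataset card, so verify them against your copy:

```python
from datasets import load_dataset

# Load one language split; config names such as "FR_FR" are assumed to
# mirror the split identifiers above (fr_fr -> FR_FR).
ds = load_dataset("openai/MMMLU", "FR_FR", split="test")
print(len(ds))  # expected: 14042 tasks per language split

# Column names are assumed from the dataset card; verify on your copy.
row = ds[0]
prompt = (
    f"{row['Question']}\n"
    f"A. {row['A']}\nB. {row['B']}\nC. {row['C']}\nD. {row['D']}"
)
gold = row["Answer"]  # one of "A", "B", "C", "D"
print(prompt, "\nGold:", gold)
```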
Single-turn evaluation. The agent submits an answer (A, B, C, or D) via the submit_answer tool. Reward is deterministic and based on exact match: 1.0 if correct, 0.0 if incorrect.
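A minimal sketch of this scoring rule (the function name and input normalization are illustrative, not the platform's implementation):

```python
def score_answer(submitted: str, gold: str) -> float:
    """Deterministic exact-match reward: 1.0 if the submitted letter
    matches the gold letter, 0.0 otherwise."""
    # Normalize case and whitespace so "b" or " B " still count as "B".
    return 1.0 if submitted.strip().upper() == gold.strip().upper() else 0.0

assert score_answer("b", "B") == 1.0
assert score_answer("C", "D") == 0.0
```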
15 Parquet files (~100 MB total), sourced from the Hugging Face dataset openai/MMMLU. Data is hosted on the OpenReward platform.
submit_answer — Submit an answer choice (A, B, C, or D).
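For orientation, here is a hypothetical JSON-schema-style definition of this tool; the field layout is illustrative, and the actual OpenReward schema may differ:

```python
# Hypothetical definition of the submit_answer tool; the real schema
# exposed by the OpenReward platform may differ.
SUBMIT_ANSWER_TOOL = {
    "name": "submit_answer",
    "description": "Submit the final answer choice for the current question.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "enum": ["A", "B", "C", "D"],
                "description": "The selected answer choice.",
            }
        },
        "required": ["answer"],
    },
}
```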
Single-turn.
OpenAI reports the following accuracies on MMMLU, averaged across the 14 languages:
| Model | Average Accuracy |
|---|---|
| o3-high | 88.8% |
| o1 | 87.7% |
| o4-mini-high | 85.2% |
| gpt-4.5-preview | 85.1% |
| gpt-4.1 | 83.7% |
| gpt-4o | 81.4% |
| gpt-4.1-mini | 78.5% |
| gpt-4o-mini | 70.5% |
There are no further environment requirements; MMMLU works out of the box with the OpenReward endpoint without any external API keys.
Agents in MMMLU answer multilingual multiple-choice questions in a standard environment. The environment does not present direct safety risks.
@article{hendrycks2021measuring,
title={Measuring Massive Multitask Language Understanding},
author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}