MMMLU

Description

MMMLU is an environment for evaluating multilingual massive multitask language understanding. It is built on OpenAI's MMMLU dataset, a set of professional human translations of the MMLU test set into 14 languages. Agents answer four-option multiple-choice questions (A/B/C/D) across 57 subject categories spanning STEM, the humanities, social sciences, and professional domains.
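
For orientation, a task record looks roughly like the following sketch. The field names are assumed from the openai/MMMLU dataset on Hugging Face; the environment's internal schema may differ.

# Hypothetical MMMLU task record; column names assumed from openai/MMMLU.
task = {
    "Subject": "college_biology",  # one of the 57 subject categories
    "Question": "...",             # question text in the split's language
    "A": "...", "B": "...", "C": "...", "D": "...",  # the four options
    "Answer": "C",                 # gold letter, the exact-match target
}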

Capabilities

  • Multilingual knowledge reasoning across 14 languages
  • Multiple-choice question answering across 57 subject categories
  • Evaluation on professional human translations (not machine-translated text)

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT

Tasks

15 splits:

  • 14 language-specific splits (ar_xy, bn_bd, de_de, es_la, fr_fr, hi_in, id_id, it_it, ja_jp, ko_kr, pt_br, sw_ke, yo_ng, zh_cn), each with 14,042 tasks
  • 1 combined test split (196,588 tasks)

Total: 196,588 unique questions (14 × 14,042); the combined test split is the union of the 14 language-specific splits.
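
A quick sanity check on the split arithmetic, using the split names listed above:

splits = [
    "ar_xy", "bn_bd", "de_de", "es_la", "fr_fr", "hi_in", "id_id",
    "it_it", "ja_jp", "ko_kr", "pt_br", "sw_ke", "yo_ng", "zh_cn",
]
per_split = 14_042
assert len(splits) == 14
# The combined test split matches the union of the language splits.
assert len(splits) * per_split == 196_588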

Reward Structure

Single-turn evaluation. The agent submits an answer (A, B, C, or D) via the submit_answer tool. Reward is deterministic and based on exact match: 1.0 if correct, 0.0 if incorrect.
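
A minimal sketch of the scoring rule, assuming case- and whitespace-insensitive comparison (the platform may instead require a bare letter):

def score(submitted: str, gold: str) -> float:
    """Exact-match reward: 1.0 for the correct letter, 0.0 otherwise."""
    # Normalization here is an assumption, not documented platform behavior.
    return 1.0 if submitted.strip().upper() == gold.strip().upper() else 0.0

score("c", "C")  # 1.0
score("B", "C")  # 0.0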

Data

15 Parquet files (~100 MB total) sourced from the openai/MMMLU dataset on Hugging Face. Data is stored on the OpenReward platform.
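
To inspect the upstream data directly, the source dataset can be loaded with the Hugging Face datasets library. The uppercase subset name below is assumed from the openai/MMMLU repo layout; swap in the language you want:

from datasets import load_dataset

# Load one language subset of the upstream dataset (subset name assumed).
ds = load_dataset("openai/MMMLU", "ZH_CN", split="test")
print(len(ds))  # expected: 14042 rows
print(ds[0])    # one multiple-choice record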

Tools

submit_answer — Submit an answer choice (A, B, C, or D).
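
As an illustration, the tool could be described with an OpenAI-style function-calling schema like the one below. This is a sketch, not OpenReward's actual wire format:

# Illustrative tool definition; OpenReward's real schema may differ.
submit_answer_tool = {
    "name": "submit_answer",
    "description": "Submit the final answer choice for the current question.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {"type": "string", "enum": ["A", "B", "C", "D"]},
        },
        "required": ["answer"],
    },
}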

Time Horizon

Single-turn.

Environment Difficulty

OpenAI reports the following model accuracies on MMMLU, averaged across the 14 languages:

Model             Accuracy
o3-high           88.8%
o1                87.7%
o4-mini-high      85.2%
gpt-4.5-preview   85.1%
gpt-4.1           83.7%
gpt-4o            81.4%
gpt-4.1-mini      78.5%
gpt-4o-mini       70.5%

Other Environment Requirements

There are no further environment requirements; MMMLU works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in MMMLU answer multilingual multiple-choice questions in a standard environment. The environment does not present direct safety risks.

Citation

@article{hendrycks2021measuring,
  title={Measuring Massive Multitask Language Understanding},
  author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
