GlobalPIQA is an environment for evaluating multilingual physical commonsense reasoning across 116 language variants. Given an incomplete prompt and two possible completions, the agent must choose the more physically plausible solution. Tasks span diverse languages and scripts, testing both linguistic understanding and cultural knowledge.
- Multilingual physical commonsense reasoning
- Cross-cultural knowledge understanding
- Binary choice evaluation across 116 languages and scripts
Agents are given a standard environment with no sandbox or file system access.
CC BY-SA 4.0 (evaluation only, no training).
There is one split in this environment:
- test: 11,600 tasks, 100 for each of the 116 language variants, covering Indo-European, Afro-Asiatic, Sino-Tibetan, Japonic, Koreanic, Niger-Congo, Austronesian, Dravidian, and other language families.
Single-turn binary evaluation. The agent submits an answer (0 or 1) via the submit_answer tool. The submitted answer is compared via exact match against the ground truth label from the dataset. Reward is 1.0 if the answer matches the label, 0.0 otherwise. No LLM grader is used.
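A minimal sketch of the exact-match scoring described above; the function name and signature are illustrative, not the environment's actual API:

```python
def score_submission(submitted: int, label: int) -> float:
    """Deterministic exact-match reward: 1.0 for a correct binary
    choice, 0.0 otherwise. No LLM grader is involved."""
    if submitted not in (0, 1):
        raise ValueError("answer must be 0 (Solution 0) or 1 (Solution 1)")
    return 1.0 if submitted == label else 0.0
```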
globalpiqa_test.parquet (~12 MB) sourced from HuggingFace mrlbenchmarks/global-piqa-nonparallel. Contains prompts, solutions, labels, and cultural relevance scores for all 116 language variants. Stored on the OpenReward platform.
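For local inspection, the parquet file can be read directly with pandas; the column contents echoed below are assumptions based on the fields listed above, and the exact column names may differ:

```python
import pandas as pd

# Read the single test split (~12 MB).
df = pd.read_parquet("globalpiqa_test.parquet")

print(len(df))              # expected: 11600 rows (100 tasks x 116 variants)
print(df.columns.tolist())  # prompts, solutions, labels, cultural relevance scores
```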
| Tool | Description |
|---|---|
| submit_answer | Submit 0 (for Solution 0) or 1 (for Solution 1) as the answer. Deterministic evaluation. |
Single-turn. The agent reads the prompt and two solutions, then submits one choice.
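The sketch below illustrates that flow end to end. The task fields and the placeholder policy are hypothetical; a real agent would query a model and call the submit_answer tool instead:

```python
def choose_solution(prompt: str, solution0: str, solution1: str) -> int:
    """Placeholder policy: a real agent would query a model here
    and return 0 or 1."""
    return 0  # fixed guess; random chance on this benchmark is 50%

# Illustrative task in the shape described above (not a real dataset row).
task = {
    "prompt": "To keep tea warm for longer, you should",
    "solution0": "cover the cup with a lid.",
    "solution1": "put the cup in the freezer.",
    "label": 0,
}

answer = choose_solution(task["prompt"], task["solution0"], task["solution1"])
reward = 1.0 if answer == task["label"] else 0.0  # exact-match scoring
print(reward)  # 1.0
```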
The original paper evaluates frontier models on Global PIQA (Accuracy %):
| Model | Accuracy |
|---|---|
| Human | 95.1% |
| Gemini 2.5 Pro | 91.7% |
| Gemma 3 27B | 82.4% |
Performance varies significantly by language resource level, with an accuracy gap of up to 37% between high-resource and low-resource languages (random chance is 50%).
There are no further environment requirements; GlobalPIQA works out of the box with the OpenReward endpoint without any external API keys.
Agents in GlobalPIQA answer physical commonsense questions in a standard environment. The environment does not present direct safety risks.
@article{mrl-workshop-2025-global-piqa,
title={Global {PIQA}: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures},
author={Tyler A. Chang and Catherine Arnett and Abdelrahman Eldesokey and others},
journal={Preprint},
year={2025},
url={https://arxiv.org/abs/2510.24081},
}