GlobalPIQA is an environment for evaluating multilingual physical commonsense reasoning across 116 language variants. Given an incomplete prompt and two possible completions, the agent must choose the more physically plausible solution. Tasks span diverse languages and scripts, testing both linguistic understanding and cultural knowledge.
- Multilingual physical commonsense reasoning
- Cross-cultural knowledge understanding
- Binary choice evaluation across 116 languages and scripts
Agents are given a standard environment with no sandbox or file system access.
CC BY-SA 4.0 (evaluation only, no training).
There is one split in this environment:
- test: 11,600 tasks, 100 for each of the 116 language variants, covering Indo-European, Afro-Asiatic, Sino-Tibetan, Japonic, Koreanic, Niger-Congo, Austronesian, Dravidian, and other language families.
Single-turn binary evaluation. The agent submits an answer (0 or 1) via the submit_answer tool. The submitted answer is compared via exact match against the ground truth label from the dataset. Reward is 1.0 if the answer matches the label, 0.0 otherwise. No LLM grader is used.
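A minimal sketch of the exact-match scoring described above; the function name and signature are illustrative, not the environment's actual API:

```python
def score_submission(submitted: int, label: int) -> float:
    """Deterministic exact-match reward: 1.0 for a correct binary
    choice, 0.0 otherwise. No LLM grader is involved."""
    if submitted not in (0, 1):
        raise ValueError("answer must be 0 (Solution 0) or 1 (Solution 1)")
    return 1.0 if submitted == label else 0.0
```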
globalpiqa_test.parquet (~12 MB) sourced from HuggingFace mrlbenchmarks/global-piqa-nonparallel. Contains prompts, solutions, labels, and cultural relevance scores for all 116 language variants. Stored on the OpenReward platform.
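For local inspection, the parquet file can be read directly with pandas; the column contents echoed below are assumptions based on the fields listed above, and the exact column names may differ:

```python
import pandas as pd

# Read the single test split (~12 MB).
df = pd.read_parquet("globalpiqa_test.parquet")

print(len(df))              # expected: 11600 rows (100 tasks x 116 variants)
print(df.columns.tolist())  # prompts, solutions, labels, cultural relevance scores
```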
| Tool | Description |
|---|---|
| submit_answer | Submit 0 (for Solution 0) or 1 (for Solution 1) as the answer. Deterministic evaluation. |
Single-turn. The agent reads the prompt and two solutions, then submits one choice.
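The sketch below illustrates that flow end to end. The task fields and the placeholder policy are hypothetical; a real agent would query a model and call the submit_answer tool instead:

```python
def choose_solution(prompt: str, solution0: str, solution1: str) -> int:
    """Placeholder policy: a real agent would query a model here
    and return 0 or 1."""
    return 0  # fixed guess; random chance on this benchmark is 50%

# Illustrative task in the shape described above (not a real dataset row).
task = {
    "prompt": "To keep tea warm for longer, you should",
    "solution0": "cover the cup with a lid.",
    "solution1": "put the cup in the freezer.",
    "label": 0,
}

answer = choose_solution(task["prompt"], task["solution0"], task["solution1"])
reward = 1.0 if answer == task["label"] else 0.0  # exact-match scoring
print(reward)  # 1.0
```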
The original paper evaluates frontier models on Global PIQA (Accuracy %):
| Model | Accuracy |
|---|---|
| Human | 95.1% |
| Gemini 2.5 Pro | 91.7% |
| Gemma 3 27B | 82.4% |
Performance varies significantly by language resource level, with an accuracy gap of up to 37% between high-resource and low-resource languages (random chance is 50%).
There are no further environment requirements; GlobalPIQA works out of the box with the OpenReward endpoint without any external API keys.
Agents in GlobalPIQA answer physical commonsense questions in a standard environment. The environment does not present direct safety risks.
@article{mrl-workshop-2025-global-piqa,
title={Global {PIQA}: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures},
author={Tyler A. Chang and Catherine Arnett and Abdelrahman Eldesokey and others},
journal={Preprint},
year={2025},
url={https://arxiv.org/abs/2510.24081},
}