EnvCommons/GlobalPIQA

GlobalPIQA

OpenReward Environment Hugging Face Dataset

Description

GlobalPIQA is an environment for evaluating multilingual physical commonsense reasoning across 116 language variants. Given an incomplete prompt and two possible completions, the agent must choose the more physically plausible solution. Tasks span diverse languages and scripts, testing both linguistic understanding and cultural knowledge.

Capabilities

  • Multilingual physical commonsense reasoning
  • Cross-cultural knowledge understanding
  • Binary choice evaluation across 116 languages and scripts

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

CC BY-SA 4.0 (evaluation only, no training).

Tasks

There is one split in this environment:

  • test: 11,600 tasks — 100 tasks for each of the 116 language variants, covering Indo-European, Afro-Asiatic, Sino-Tibetan, Japonic, Koreanic, Niger-Congo, Austronesian, Dravidian, and other language families.

Reward Structure

Single-turn binary evaluation. The agent submits an answer (0 or 1) via the submit_answer tool. The submitted answer is compared via exact match against the ground truth label from the dataset. Reward is 1.0 if the answer matches the label, 0.0 otherwise. No LLM grader is used.
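The scoring rule above can be sketched in a few lines. This is a minimal illustration of the exact-match reward, not the OpenReward platform's actual implementation:

```python
def compute_reward(submitted: int, label: int) -> float:
    """Binary exact-match reward: 1.0 if the submitted choice equals the
    ground-truth label, 0.0 otherwise. Sketch only; the real grader lives
    on the OpenReward platform."""
    if submitted not in (0, 1):
        raise ValueError("answer must be 0 (Solution 0) or 1 (Solution 1)")
    return 1.0 if submitted == label else 0.0
```

Because the comparison is a plain equality check against the dataset label, no LLM grader or partial credit is involved.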

Data

globalpiqa_test.parquet (~12 MB), sourced from the Hugging Face dataset mrlbenchmarks/global-piqa-nonparallel. It contains prompts, solutions, labels, and cultural relevance scores for all 116 language variants, and is stored on the OpenReward platform.
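As a sketch of how a task row might be turned into the binary-choice prompt the agent sees: the column names below (language, prompt, solution0, solution1, label) are assumptions for illustration — inspect the parquet file for the actual schema before relying on them.

```python
import pandas as pd

# Toy rows standing in for globalpiqa_test.parquet; column names are
# guesses at the schema, not confirmed against the real file.
df = pd.DataFrame({
    "language": ["eng_Latn", "swh_Latn"],
    "prompt": ["To keep ice cream from melting on a hot day,", "..."],
    "solution0": ["store it in a cooler with ice packs.", "..."],
    "solution1": ["leave it out on the kitchen counter.", "..."],
    "label": [0, 0],
})

def render_task(row: pd.Series) -> str:
    """Format one dataset row as a two-choice prompt."""
    return (
        f"{row['prompt']}\n"
        f"Solution 0: {row['solution0']}\n"
        f"Solution 1: {row['solution1']}"
    )

print(render_task(df.iloc[0]))
```

In practice the real file would be loaded with pd.read_parquet("globalpiqa_test.parquet") and filtered by language variant.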

Tools

Tool: submit_answer
Description: Submit 0 (for Solution 0) or 1 (for Solution 1) as the answer. Evaluation is deterministic.
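Assuming an OpenAI-style tool-call format (an assumption — OpenReward's actual wire format may differ), a call to submit_answer might look like this:

```python
import json

# Hypothetical tool-call payload; the field names here are illustrative,
# only the tool name and its 0/1 argument come from the environment spec.
tool_call = {
    "name": "submit_answer",
    "arguments": json.dumps({"answer": 1}),  # choose Solution 1
}

parsed = json.loads(tool_call["arguments"])
```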

Time Horizon

Single-turn. The agent reads the prompt and two solutions, then submits one choice.

Environment Difficulty

The original paper evaluates frontier models on Global PIQA (Accuracy %):

Model Accuracy
Human 95.1%
Gemini 2.5 Pro 91.7%
Gemma 3 27B 82.4%

Performance varies significantly by language resource level, with accuracy gaps of up to 37 percentage points between high-resource and low-resource languages (random chance is 50%).

Other Environment Requirements

There are no further environment requirements; GlobalPIQA works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in GlobalPIQA answer physical commonsense questions in a standard environment. The environment does not present direct safety risks.

Citation

@article{mrl-workshop-2025-global-piqa,
  title={Global {PIQA}: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures},
  author={Tyler A. Chang and Catherine Arnett and Abdelrahman Eldesokey and others},
  journal={Preprint},
  year={2025},
  url={https://arxiv.org/abs/2510.24081},
}
