LongFact

⭐ OpenReward Environment Hugging Face Dataset

Description

LongFact is an environment for evaluating long-form factual accuracy, based on Google DeepMind's LongFact benchmark. Agents are given open-ended questions that require detailed factual responses. Evaluation uses the SAFE (Search-Augmented Factuality Evaluator) pipeline: each response is decomposed into atomic facts, each fact is verified via web search, and the response is scored by factual precision.

Capabilities

  • Long-form factual question answering
  • Generating detailed and accurate responses across 38 subject areas

Compute Requirements

Agents in LongFact run in a standard environment with no sandbox or file-system access.

License

MIT.

Tasks

One split: test (2,280 tasks) spanning 38 subject areas.

Reward Structure

Single-turn evaluation. The agent submits a long-form response via submit_answer. The response is decomposed into atomic facts using gpt-5-mini, and each fact is verified via web search. The reward is factual precision (supported_facts / relevant_facts), ranging from 0.0 to 1.0; irrelevant facts are excluded from the denominator.
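The reward computation above can be sketched as follows. This is an illustrative sketch only: the label names and the `factual_precision` helper are not the environment's actual API, and in the real pipeline gpt-5-mini produces the atomic facts and a web-search step assigns the labels.

```python
# Illustrative sketch of the SAFE-style reward described above.
# Labels are supplied directly here; the real pipeline derives them
# from fact decomposition plus web-search verification.

SUPPORTED = "supported"
NOT_SUPPORTED = "not_supported"
IRRELEVANT = "irrelevant"

def factual_precision(labels: list[str]) -> float:
    """Reward = supported_facts / relevant_facts, in [0.0, 1.0].

    Irrelevant facts are excluded from the denominator.
    """
    relevant = [label for label in labels if label != IRRELEVANT]
    if not relevant:  # no relevant facts: treat reward as 0.0
        return 0.0
    supported = sum(1 for label in relevant if label == SUPPORTED)
    return supported / len(relevant)

# 3 supported, 1 unsupported, 1 irrelevant -> 3 / 4 = 0.75
print(factual_precision([SUPPORTED, SUPPORTED, SUPPORTED, NOT_SUPPORTED, IRRELEVANT]))
```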

Data

longfact_data.parquet, sourced from the Hugging Face dataset claserken/longfact and stored on the OpenReward platform.

Tools

Single tool: submit_answer — submit a long-form factual response for SAFE evaluation.

Time Horizon

Single-turn.

Environment Difficulty

The original paper evaluates 13 models across four model families (Gemini, GPT, Claude, PaLM-2) using F1@K metrics. Top performers were GPT-4-Turbo, Gemini-Ultra, and PaLM-2-L-IT-RLHF. Larger models consistently achieve higher factual precision than smaller variants within the same family.

Other Environment Requirements

An OpenAI API key is required for fact decomposition and web-search-based verification. Pass it via secrets={"openai_api_key": "..."}.
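A minimal sketch of supplying the key, assuming the environment accepts a secrets mapping shaped as above. The load call is a hypothetical placeholder, since the exact OpenReward client API is not shown in this README:

```python
import os

# Read the key from an environment variable rather than hard-coding it.
# The dict shape matches the docs above; the load call is a hypothetical
# placeholder for whatever OpenReward client you use.
secrets = {"openai_api_key": os.environ.get("OPENAI_API_KEY", "")}

# env = openreward.load("EnvCommons/LongFact", secrets=secrets)  # hypothetical
```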

Safety

Agents in LongFact generate factual responses in a standard environment. The environment does not present direct safety risks.

Citation

@inproceedings{wei2024longfact,
  title={Long-form factuality in large language models},
  author={Wei, Jerry and Yang, Chengrun and Song, Xinying and Lu, Yifeng and Hu, Nathan and Huang, Jie and Tran, Dustin and Peng, Daiyi and Liu, Ruibo and Huang, Da and Du, Cosmo and Le, Quoc V.},
  booktitle={NeurIPS},
  year={2024}
}
