LongFact is an environment for evaluating long-form factual accuracy, based on Google DeepMind's LongFact benchmark. Agents are given open-ended questions that require detailed factual responses. Evaluation uses the SAFE (Search-Augmented Factuality Evaluator) pipeline: each response is decomposed into atomic facts, each fact is verified via web search, and the response is scored by factual precision.
- Long-form factual question answering
- Generating detailed and accurate responses across 38 subject areas
Agents in LongFact are given a standard environment with no sandbox or file system access.
MIT.
One split: test (2,280 tasks) spanning 38 subject areas.
Single-turn evaluation. The agent submits a long-form response via submit_answer. The response is decomposed into atomic facts using gpt-5-mini, and each fact is then verified via web search. Reward is factual precision: supported_facts / relevant_facts, ranging from 0.0 to 1.0; irrelevant facts are excluded from the denominator.
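For reference, a minimal sketch of how this precision reward could be computed from per-fact verdicts; the verdict labels and function name are illustrative assumptions, not the environment's actual schema:

```python
# Sketch of the SAFE-style precision reward described above.
# Verdict labels ("supported", "not_supported", "irrelevant") are assumed.
def precision_reward(verdicts: list[str]) -> float:
    """Reward = supported_facts / relevant_facts, in [0.0, 1.0]."""
    relevant = [v for v in verdicts if v != "irrelevant"]
    if not relevant:
        return 0.0  # assumed edge-case handling when no relevant facts remain
    supported = sum(1 for v in relevant if v == "supported")
    return supported / len(relevant)

# Example: 3 supported, 1 not supported, 1 irrelevant -> 3 / 4 = 0.75
print(precision_reward(["supported", "supported", "not_supported", "supported", "irrelevant"]))
```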
longfact_data.parquet, sourced from the HuggingFace dataset claserken/longfact and stored on the OpenReward platform.
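A minimal sketch of inspecting the task file locally with pandas, assuming the parquet has been downloaded; the column layout is not documented here, so the snippet only reports the schema:

```python
import pandas as pd

# Inspect the LongFact task file; exact columns are not documented above,
# so we only print the shape and column names.
df = pd.read_parquet("longfact_data.parquet")
print(df.shape)              # expected: 2,280 rows, one per task
print(df.columns.tolist())
```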
Single tool: submit_answer — submit a long-form factual response for SAFE evaluation.
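A sketch of what a submit_answer call might look like; the argument name ("answer") and the sample text are assumptions for illustration only:

```python
# Hypothetical tool-call payload; the argument name "answer" is assumed.
tool_call = {
    "name": "submit_answer",
    "arguments": {
        "answer": (
            "The Eiffel Tower was completed in 1889 for the Exposition "
            "Universelle in Paris and stands roughly 330 metres tall..."
        ),
    },
}
```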
Single-turn.
The original paper evaluates 13 models across four model families (Gemini, GPT, Claude, and PaLM-2) using F1@K metrics. Top performers were GPT-4-Turbo, Gemini-Ultra, and PaLM-2-L-IT-RLHF. Within a family, larger models generally achieve higher factual precision than smaller variants.
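For context, the paper's F1@K metric combines factual precision with recall against a target number of supported facts K; a minimal sketch of the formula as defined there:

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """F1@K from the LongFact paper: precision over labelled facts,
    recall capped at K supported facts, combined as a harmonic mean."""
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)

# Example: 50 supported facts, 10 not supported, K = 64
print(f1_at_k(50, 10, 64))  # ≈ 0.81
```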
An OpenAI API key is required for fact decomposition and web-search-based verification. Pass it via secrets={"openai_api_key": "..."}.
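A minimal sketch of wiring the key in from the shell environment; the loader call is hypothetical and depends on the OpenReward SDK, so it is left commented out:

```python
import os

# Read the key from the shell environment rather than hard-coding it.
secrets = {"openai_api_key": os.environ["OPENAI_API_KEY"]}

# Hypothetical loader; consult the OpenReward docs for the actual entry point.
# env = load_environment("longfact", secrets=secrets)
```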
Agents in LongFact generate factual responses in a standard environment. The environment does not present direct safety risks.
@inproceedings{wei2024longfact,
  title={Long-form factuality in large language models},
  author={Wei, Jerry and Yang, Chengrun and Song, Xinying and Lu, Yifeng and Hu, Nathan and Huang, Jie and Tran, Dustin and Peng, Daiyi and Liu, Ruibo and Huang, Da and Du, Cosmo and Le, Quoc V.},
  booktitle={NeurIPS},
  year={2024}
}