LongFact is an environment for evaluating long-form factual accuracy, based on Google DeepMind's LongFact benchmark. Agents are given open-ended questions that require detailed factual responses. Evaluation uses the SAFE (Search-Augmented Factuality Evaluator) pipeline: each response is decomposed into atomic facts, each fact is verified via web search, and the response is scored by factual precision.
- Long-form factual question answering
- Generating detailed and accurate responses across 38 subject areas
Agents in LongFact are given a standard environment with no sandbox or file system access.
MIT.
One split: test (2,280 tasks) spanning 38 subject areas.
Single-turn evaluation. The agent submits a long-form response via submit_answer. The response is decomposed into atomic facts using gpt-5-mini, and each fact is then verified via web search. Reward is factual precision: supported_facts / relevant_facts, ranging from 0.0 to 1.0; irrelevant facts are excluded from the denominator.
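For reference, a minimal sketch of how this precision reward could be computed from per-fact verdicts; the verdict labels and function name are illustrative assumptions, not the environment's actual schema:

```python
# Sketch of the SAFE-style precision reward described above.
# Verdict labels ("supported", "not_supported", "irrelevant") are assumed.
def precision_reward(verdicts: list[str]) -> float:
    """Reward = supported_facts / relevant_facts, in [0.0, 1.0]."""
    relevant = [v for v in verdicts if v != "irrelevant"]
    if not relevant:
        return 0.0  # assumed edge-case handling when no relevant facts remain
    supported = sum(1 for v in relevant if v == "supported")
    return supported / len(relevant)

# Example: 3 supported, 1 not supported, 1 irrelevant -> 3 / 4 = 0.75
print(precision_reward(["supported", "supported", "not_supported", "supported", "irrelevant"]))
```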
longfact_data.parquet, sourced from the HuggingFace dataset claserken/longfact and stored on the OpenReward platform.
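A minimal sketch of inspecting the task file locally with pandas, assuming the parquet has been downloaded; the column layout is not documented here, so the snippet only reports the schema:

```python
import pandas as pd

# Inspect the LongFact task file; exact columns are not documented above,
# so we only print the shape and column names.
df = pd.read_parquet("longfact_data.parquet")
print(df.shape)              # expected: 2,280 rows, one per task
print(df.columns.tolist())
```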
Single tool: submit_answer — submit a long-form factual response for SAFE evaluation.
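A sketch of what a submit_answer call might look like; the argument name ("answer") and the sample text are assumptions for illustration only:

```python
# Hypothetical tool-call payload; the argument name "answer" is assumed.
tool_call = {
    "name": "submit_answer",
    "arguments": {
        "answer": (
            "The Eiffel Tower was completed in 1889 for the Exposition "
            "Universelle in Paris and stands roughly 330 metres tall..."
        ),
    },
}
```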
Single-turn.
The original paper evaluates 13 models across four model families (Gemini, GPT, Claude, and PaLM-2) using F1@K metrics. Top performers were GPT-4-Turbo, Gemini-Ultra, and PaLM-2-L-IT-RLHF. Within a family, larger models generally achieve higher factual precision than smaller variants.
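For context, the paper's F1@K metric combines factual precision with recall against a target number of supported facts K; a minimal sketch of the formula as defined there:

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """F1@K from the LongFact paper: precision over labelled facts,
    recall capped at K supported facts, combined as a harmonic mean."""
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)

# Example: 50 supported facts, 10 not supported, K = 64
print(f1_at_k(50, 10, 64))  # ≈ 0.81
```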
An OpenAI API key is required for fact decomposition and web-search-based verification. Pass it via secrets={"openai_api_key": "..."}.
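A minimal sketch of wiring the key in from the shell environment; the loader call is hypothetical and depends on the OpenReward SDK, so it is left commented out:

```python
import os

# Read the key from the shell environment rather than hard-coding it.
secrets = {"openai_api_key": os.environ["OPENAI_API_KEY"]}

# Hypothetical loader; consult the OpenReward docs for the actual entry point.
# env = load_environment("longfact", secrets=secrets)
```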
Agents in LongFact generate factual responses in a standard environment. The environment does not present direct safety risks.
@inproceedings{wei2024longfact,
  title={Long-form factuality in large language models},
  author={Wei, Jerry and Yang, Chengrun and Song, Xinying and Lu, Yifeng and Hu, Nathan and Huang, Jie and Tran, Dustin and Peng, Daiyi and Liu, Ruibo and Huang, Da and Du, Cosmo and Le, Quoc V.},
  booktitle={NeurIPS},
  year={2024}
}