LitQATrain is an ORS environment for evaluating scientific literature question answering with web search capabilities, inspired by FutureHouse's LitQA2 benchmark. Like LitQA2, each question targets a verifiable fact reported in a particular scientific paper and is designed to be answerable only from that single source. LitQATrain extends this approach to 984 QA pairs across 10 broad scientific domains, with open-ended (non-multiple-choice) answers and web search tools for retrieval.
- Scientific question answering across diverse domains
- Web search and information retrieval from academic literature
- Multi-step research: searching, reading papers, and synthesizing answers
- Verifiable factual recall from published research
Agents are given a standard environment with no sandbox or file system access.
License: MIT.
There is a single split, train, with 984 tasks. Questions span 10 scientific domains:
| Domain | Count |
|---|---|
| Molecular biology / Genomics | 100 |
| Neuroscience | 100 |
| Ecology / Environmental science | 100 |
| Chemistry / Materials science | 100 |
| Physics / Astronomy | 100 |
| Computer science / AI | 100 |
| Medicine / Clinical research | 100 |
| Earth science / Geology | 100 |
| Pharmacology / Drug development | 100 |
| Engineering / Applied science | 84 |
Each task provides a question and metadata (source DOI, domain). The agent prompt contains only the question; the agent must find the answer through web search.
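For illustration, a single task and the prompt derived from it might look like the sketch below. Every value is made up, and the field names are assumptions based on the data description later in this section, not a confirmed schema.

```python
# Hypothetical task record; all values and field names are illustrative.
task = {
    "question": "Which enzyme did the study identify as rate-limiting?",
    "answer": "phosphofructokinase",  # reference answer, seen only by the grader
    "source_doi": "10.0000/example",  # metadata: DOI of the source paper
    "key_passage": "...",             # metadata: supporting passage
    "domain": "Molecular biology / Genomics",
}

# The agent prompt contains only the question.
prompt = task["question"]
```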
This is a multi-turn environment. Agents use the web_search and fetch_url tools to gather information, then submit via submit_answer. An LLM grader (gpt-5-mini) evaluates semantic equivalence between the submitted answer and the reference answer, handling synonyms, abbreviations, and equivalent scientific terminology. Reward is binary: 1.0 if correct, 0.0 otherwise.
We do not use LLM graders from a different model family for this task.
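As a rough sketch of the grading step, assuming the OpenAI Python client; the environment's actual grading prompt and response parsing are internal and may differ:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade(question: str, reference: str, submitted: str) -> float:
    """Ask gpt-5-mini whether the submitted answer matches the reference;
    return the binary reward described above. Illustrative only."""
    resp = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Reference answer: {reference}\n"
                f"Submitted answer: {submitted}\n"
                "Treat synonyms, abbreviations, and equivalent scientific "
                "terminology as a match. Reply with exactly CORRECT or INCORRECT."
            ),
        }],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("CORRECT") else 0.0
```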
Data consists of a single Parquet file (train.parquet) containing QA pairs generated from scientific papers published in trusted journals and venues (Nature, Science, PNAS, Cell, PLOS, ACS, arXiv, IEEE, AGU). Each row contains a question, answer, source DOI, key passage, and domain. Data is stored on the OpenReward platform.
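For example, assuming a local copy of the file, the data can be inspected with pandas; the column names printed below are assumptions based on the row description above:

```python
import pandas as pd

# Assumes train.parquet has been downloaded locally from the OpenReward platform.
df = pd.read_parquet("train.parquet")

print(len(df))                       # expected: 984 QA pairs
print(df.columns.tolist())           # question, answer, source DOI, key passage, domain
print(df["domain"].value_counts())   # per-domain counts as in the table above
```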
| Tool | Description |
|---|---|
| web_search | Search the web using the Tavily API. Returns up to 5 results with titles, URLs, and snippets. |
| fetch_url | Fetch full-text content from a specific URL (truncated at 8,000 characters). |
| submit_answer | Submit your final answer for LLM grading. Ends the episode. |
Note that the web_search and fetch_url tools require a Tavily API key but are optional: to use a different search provider, exclude these tools and supply your own external tools instead.
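As an illustration of what these tools wrap, a minimal call with the tavily-python client might look like this (the query string is made up; inside the environment these calls are issued for you):

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-...")

# Roughly what web_search does: up to 5 results with titles, URLs, and snippets.
results = client.search("example scientific query", max_results=5)
for r in results["results"]:
    print(r["title"], r["url"])
    print(r["content"][:200])  # snippet
```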
Multi-turn. Agents can perform multiple web searches and URL fetches before submitting a final answer.
[To be determined]
- OpenAI API key required for LLM-based grading. Pass via secrets={"openai_api_key": "..."}.
- Tavily API key required for web search and URL fetching. Pass via secrets={"tavily_api_key": "..."}.
Agents in LitQATrain answer scientific questions using web search in a standard environment. The environment does not present direct safety risks.
This environment is inspired by LitQA2 from FutureHouse's LAB-Bench. Please cite the original work, as well as this dataset:
@article{laurent2024labbench,
  title = {LAB-Bench: Measuring Capabilities of Language Models for Biology Research},
  author = {Laurent, Jon M. and Janizek, Joseph D. and Ruzo, Michael and Hinks, Michaela M. and Hammerling, Michael J. and Narayanan, Siddharth and Ponnapati, Manvitha and White, Andrew D. and Rodriques, Samuel G.},
  year = {2024},
  eprint = {2407.10362},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI}
}

@dataset{GRLitQATrain,
  author = {General Reasoning Inc. Team},
  title = {LitQATrain},
  year = {2026},
  publisher = {OpenReward},
  url = {https://openreward.ai/GeneralReasoning/litqatrain}
}