BrowseComp is an environment for evaluating web-search reasoning capabilities. Adapted from OpenAI's simple-evals implementation, it contains 1,266 encrypted research questions that require multi-hop reasoning and cannot be answered without current web information. The environment provides built-in web search and URL fetching tools powered by Tavily.
- Multi-hop web search reasoning
- Information retrieval and synthesis
- Research question answering
- Confidence calibration
Agents are given a standard environment with no sandbox or file system access.
There is one split in this environment:
- test: 1,266 encrypted research questions
Questions require multi-hop reasoning across multiple web searches. Example: "What was the name of the 1995 film starring the actress who played Victoria that married the final scorer of the World Cup 1998 winner?"
This is a sparse-reward environment with LLM-based grading:
- Agent receives a research question
- Agent uses the `web_search` and `fetch_url` tools to gather information
- Agent submits an answer with an explanation, exact_answer, and confidence
- An LLM grader (gpt-5-mini) evaluates semantic equivalence
- Binary reward: 1.0 if correct, 0.0 if incorrect
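The grading step above can be sketched as a tiny function. This is an illustrative sketch only: the function name `binary_reward` and the verdict strings are assumptions, not the environment's actual code.

```python
def binary_reward(grader_verdict: str) -> float:
    """Map the LLM grader's verdict to the environment's sparse binary reward.

    The grader (gpt-5-mini) judges semantic equivalence, so e.g. "Paris" and
    "Paris, France" can both be marked correct. Verdict strings here are
    hypothetical placeholders.
    """
    return 1.0 if grader_verdict.strip().lower() == "correct" else 0.0
```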
Data is sourced from OpenAI's BrowseComp benchmark. Data is stored on the OpenReward platform.
| Tool | Description |
|---|---|
| `web_search` | Search the web using Tavily (returns titles, URLs, snippets) |
| `fetch_url` | Fetch full content from a URL (truncated to 8,000 characters) |
| `submit_answer` | Submit an answer with an explanation, exact_answer, and confidence |
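A typical episode chains these tools together. The sketch below shows one plausible call sequence; the argument names follow the table above, but the exact schemas are assumptions rather than the environment's published spec, and the film title is deliberately left as a placeholder.

```python
# Illustrative multi-hop tool-call sequence for the example question about
# the 1998 World Cup final scorer. Argument schemas are assumed, not official.
calls = [
    {"tool": "web_search",
     "args": {"query": "1998 World Cup final goal scorers"}},
    {"tool": "fetch_url",
     "args": {"url": "https://en.wikipedia.org/wiki/1998_FIFA_World_Cup_Final"}},
    {"tool": "web_search",
     "args": {"query": "actress who played Victoria married footballer filmography 1995"}},
    {"tool": "submit_answer",
     "args": {
         "explanation": "Identified the final's scorer, then his spouse's 1995 film.",
         "exact_answer": "<film title>",   # placeholder, not a real answer
         "confidence": 80,                 # calibration is part of the benchmark
     }},
]
```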
Note that the `fetch_url` and `web_search` tools require a Tavily API key, but they are optional: to use a different search provider, exclude these tools and supply external tools instead.
Multi-turn. Agents can perform multiple web searches before submitting a final answer.
| Model | Accuracy |
|---|---|
| Gemini 3.1 Pro (search, Python, browse) | 85.9% |
| Claude Opus 4.6 | 84.0% |
| Kimi K2.5 (agent swarm) | 78.4% |
| MiniMax M2.5 | 76.3% |
| GLM-5 (with ctx management) | 75.9% |
This benchmark requires persistent multi-hop web navigation to find hard-to-find, entangled information.
- OpenAI API key required for LLM-based grading
- Tavily API key required for web search
Pass these via `secrets={"openai_api_key": "...", "tavily_api_key": "..."}`.
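A minimal sketch of assembling the secrets dict, assuming the keys are stored in environment variables (the commented-out `load_environment` call is a hypothetical entry point; consult the platform docs for the real one):

```python
import os

# Read API keys from environment variables rather than hardcoding them.
secrets = {
    "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),  # LLM-based grading
    "tavily_api_key": os.environ.get("TAVILY_API_KEY", ""),  # web_search / fetch_url
}

# env = load_environment("browsecomp", secrets=secrets)  # hypothetical entry point
```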
Agents in BrowseComp perform web searches and answer research questions. The environment does not present direct safety risks. Note: Do not share decrypted questions publicly per OpenAI's request.
@article{wei2025browsecomp,
title={BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents},
author={Wei, Jason and Sun, Zhiqing and Papay, Spencer and McKinney, Scott and Han, Jeffrey and Fulford, Isa and Chung, Hyung Won and Passos, Alex Tachard and Fedus, William and Glaese, Amelia},
journal={arXiv preprint arXiv:2504.12516},
year={2025},
url={https://arxiv.org/abs/2504.12516}
}