Evaluation harness for Apodex-1.0 on public deep-research benchmarks.
AgentHarness is the open-source evaluation harness used to reproduce the public benchmark results for Apodex-1.0 in a standard ReAct setup. Apodex-1.0 is a verification-centric model for deep research developed by the Apodex team. This repository focuses on the public, standard ReAct evaluation setup reported in the paper.
Results for the open-source Apodex-1.0 variants on the four-benchmark deep-research suite:
| Model | BrowseComp | BrowseComp-ZH | HLE-Text | DeepSearchQA |
|---|---|---|---|---|
| Apodex-1.0-mini | 71.5 | 80.6 | 46.8 | 82.2 |
| Apodex-1.0-4B-SFT | 48.8 | 63.5 | 32.9 | 69.9 |
| Apodex-1.0-2B-SFT | 27.9 | 35.0 | 18.2 | 49.9 |
| Apodex-1.0-0.8B-SFT | 13.9 | 10.7 | 11.2 | 25.8 |
```bash
# Install dependencies
uv sync --python 3.12
```

```bash
# Serve the model with SGLang
python3 -m sglang.launch_server \
  --model-path apodex/Apodex-1.0-35B-A3B \
  --tp 8 \
  --host 0.0.0.0 \
  --port 1234 \
  --context-length 262144 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3
```

For smaller variants and other serving options, see the Hugging Face model card.
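Before pointing the harness at the server, it can help to confirm that the endpoint responds. The following is a minimal sketch, not part of AgentHarness: `models_url` and `list_models` are hypothetical helper names, and it assumes the server exposes an OpenAI-style model listing under a `/v1` prefix on the port used above.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model ids reported by the endpoint.

    Assumes an OpenAI-style response of the form {"data": [{"id": ...}, ...]}.
    """
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


# Example call (assumed base URL; the port matches --port 1234 above):
# list_models("http://localhost:1234/v1")
```

If the call raises a connection error, the server is probably still loading weights or is listening on a different host or port.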
```bash
# Create a local .env from the template
cp .env.example .env
```

Fill in the required keys in `.env`:
- `OPENAI_BASE_URL`, `OPENAI_API_KEY`, and `OPENAI_MODEL` point at the agent model (the SGLang endpoint started above or any OpenAI-compatible service).
- `SERPER_API_KEY`, `JINA_API_KEY`, and `E2B_API_KEY` enable web search, web fetch, and the code sandbox, respectively.
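A quick pre-flight check can catch a missing key before a long run starts. This is a minimal sketch rather than harness code: `missing_keys` is a hypothetical helper, and treating all six keys as mandatory is an assumption based on the description above. It inspects a mapping (by default the process environment), so it does not parse `.env` itself.

```python
import os

# Key names taken from the .env description above.
REQUIRED_KEYS = (
    "OPENAI_BASE_URL", "OPENAI_API_KEY", "OPENAI_MODEL",
    "SERPER_API_KEY", "JINA_API_KEY", "E2B_API_KEY",
)


def missing_keys(env=None):
    """Return the required keys that are unset or blank in the given mapping."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not str(env.get(key, "")).strip()]


# Illustrative values only; a partially filled configuration.
example = {"OPENAI_BASE_URL": "http://localhost:1234/v1", "OPENAI_API_KEY": "EMPTY"}
print(missing_keys(example))
# ['OPENAI_MODEL', 'SERPER_API_KEY', 'JINA_API_KEY', 'E2B_API_KEY']
```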
```bash
# Download and extract the benchmark datasets, then delete the archive
wget https://huggingface.co/datasets/apodex/Deep-Research-Benchmarks/resolve/main/deep_research_benchmarks_260607.zip
unzip -P 'apodex*()_2026' deep_research_benchmarks_260607.zip
rm deep_research_benchmarks_260607.zip
```

The single quotes around the password are required because it contains shell metacharacters (`*`, `(`, `)`).
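The quoting requirement can be illustrated with Python's standard `shlex` module, which applies the same single-quote rule when it renders an argument for a POSIX shell. This is only a demonstration of the quoting behavior, not a step in the setup.

```python
import shlex

password = "apodex*()_2026"

# shlex.quote wraps any string containing shell metacharacters in single quotes.
print(shlex.quote(password))
# 'apodex*()_2026'

# Passing arguments as a list (e.g. to subprocess.run) sidesteps shell parsing;
# shlex.join shows the equivalent, safely quoted command line.
cmd = ["unzip", "-P", password, "deep_research_benchmarks_260607.zip"]
print(shlex.join(cmd))
# unzip -P 'apodex*()_2026' deep_research_benchmarks_260607.zip
```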
HLE is not included because its license forbids redistributing the answers. To run `hle_text`, accept the license on `cais/hle` and place the JSONL at `benchmarks/datasets/HLE-text/standardized_data.jsonl`.
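After copying the file into place, a small sanity check can confirm that it is well-formed JSONL. The sketch below is an assumption-laden helper (`count_jsonl_records` is a hypothetical name, not a harness function). It only verifies that every non-empty line parses as JSON and does not check field names, since the expected schema is not described here.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file; raise ValueError on the first invalid line."""
    count = 0
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc.msg})") from exc
            count += 1
    return count


# Example call, using the path given above:
# count_jsonl_records("benchmarks/datasets/HLE-text/standardized_data.jsonl")
```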
```bash
# Smoke test: a single question, no parallelism
uv run python -m benchmarks.runner.run_subprocess \
  --benchmark browsecomp \
  --pipeline react_base \
  --profile default \
  --limit 1 \
  --concurrency 1 \
  --out ./tmp/smoke
```

```bash
# Full run: 5 runs at concurrency 30
uv run python -m benchmarks.runner.run_subprocess \
  --benchmark browsecomp \
  --pipeline react_base \
  --profile default \
  --runs 5 \
  --concurrency 30 \
  --out ./bc-runs
```

```bash
# Check the progress of the runs in ./bc-runs
uv run python -m benchmarks.runner.check_progress ./bc-runs
```

Each question runs in its own subprocess, which makes runs easier to reproduce and debug:
- isolated execution per question
- no asyncio saturation
- individual hangs can be SIGKILL'd
- failed samples can be rerun independently
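The general pattern behind these properties can be sketched with the standard library. This is an illustration of subprocess-per-task isolation under stated assumptions, not the implementation in `benchmarks.runner.run_subprocess`: `run_isolated` and `run_all` are hypothetical names, and each "question" here is just a Python snippet.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_isolated(code: str, timeout: float) -> dict:
    """Run one snippet in a fresh interpreter and classify the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child (SIGKILL on POSIX) before re-raising.
        return {"status": "timeout", "stdout": ""}
    status = "ok" if proc.returncode == 0 else "failed"
    return {"status": status, "stdout": proc.stdout.strip()}


def run_all(snippets: dict, timeout: float = 3.0, concurrency: int = 4) -> dict:
    """Run every snippet in its own subprocess, a few at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {
            qid: pool.submit(run_isolated, code, timeout)
            for qid, code in snippets.items()
        }
        return {qid: fut.result() for qid, fut in futures.items()}


def needs_rerun(results: dict) -> list:
    """Return the ids whose run did not finish cleanly."""
    return [qid for qid, res in results.items() if res["status"] != "ok"]
```

Because each task lives in its own process, a hang or crash affects only that task's entry in the results, and the ids returned by `needs_rerun` can be passed back to `run_all` on their own.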
Supported benchmarks: BrowseComp, BrowseComp-ZH, xbench-DeepResearch, Humanity's Last Exam (text-only), SuperChem, FrontierScience-Research, FrontierScience-Olympiad, DeepSearchQA, WideSearch.
See `benchmarks/README.md` for dataset layout, judge configuration, and how to add a new benchmark.
```bibtex
@techreport{apodex2026,
  title  = {Apodex-1.0: A Verification-Centric Agent Team for Discoverative Intelligence},
  author = {Apodex Team},
  year   = {2026}
}
```

Apache 2.0. See LICENSE.
