Skip to content

ApodexAI/AgentHarness

Repository files navigation

Apodex-1.0

Research Homepage API
Hugging Face GitHub Discord License

📰Tech Blog | 📄Tech Report


AgentHarness

Evaluation harness for Apodex-1.0 on public deep-research benchmarks.

AgentHarness is the open-source evaluation harness used to reproduce the public benchmark results for Apodex-1.0 in a standard ReAct setup. Apodex-1.0 is a verification-centric model for deep research developed by the Apodex team. This repository focuses on the public, standard ReAct evaluation setup reported in the paper.

Apodex-1.0 results across deep-research benchmarks


📊 Performance

Open-source Apodex-1.0 variants on the four-benchmark deep-research suite:

Model BrowseComp BrowseComp-ZH HLE-Text DeepSearchQA
Apodex-1.0-mini 71.5 80.6 46.8 82.2
Apodex-1.0-4B-SFT 48.8 63.5 32.9 69.9
Apodex-1.0-2B-SFT 27.9 35.0 18.2 49.9
Apodex-1.0-0.8B-SFT 13.9 10.7 11.2 25.8

⚡ Quick Start on the harness

1. Install dependencies

uv sync --python 3.12

2. Serve the model (SGLang)

python3 -m sglang.launch_server \
  --model-path apodex/Apodex-1.0-35B-A3B \
  --tp 8 \
  --host 0.0.0.0 \
  --port 1234 \
  --context-length 262144 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \

For smaller variants, other serving options, see the Hugging Face model card.

3. Configure environment variables

cp .env.example .env

Fill in the required keys in .envOPENAI_BASE_URL / OPENAI_API_KEY / OPENAI_MODEL point at the agent model (your SGLang endpoint from step 2 or any OpenAI-compatible service); SERPER_API_KEY / JINA_API_KEY / E2B_API_KEY enable web search, web fetch, and the code sandbox respectively.

4. Download the benchmark datasets

wget https://huggingface.co/datasets/apodex/Deep-Research-Benchmarks/resolve/main/deep_research_benchmarks_260607.zip
unzip -P 'apodex*()_2026' deep_research_benchmarks_260607.zip
rm deep_research_benchmarks_260607.zip

The single quotes around the password are required — it contains shell-meta characters (*, (, )).

HLE is not included. Its license forbids redistributing the answers. To run hle_text, accept the license on cais/hle and place the JSONL at benchmarks/datasets/HLE-text/standardized_data.jsonl.

5. Run a smoke test

uv run python -m benchmarks.runner.run_subprocess \
  --benchmark browsecomp \
  --pipeline react_base \
  --profile default \
  --limit 1 \
  --concurrency 1 \
  --out ./tmp/smoke

6. Run a full benchmark

uv run python -m benchmarks.runner.run_subprocess \
  --benchmark browsecomp \
  --pipeline react_base \
  --profile default \
  --runs 5 \
  --concurrency 30 \
  --out ./bc-runs

7. Check progress and aggregate accuracy

uv run python -m benchmarks.runner.check_progress ./bc-runs

Each question runs in its own subprocess, which makes runs easier to reproduce and debug:

  • isolated execution per question
  • no asyncio saturation
  • individual hangs can be SIGKILL'd
  • failed samples can be rerun independently

✅ Supported Benchmarks

BrowseComp, BrowseComp-ZH, xbench-DeepResearch, Humanity's Last Exam (text-only), SuperChem, FrontierScience-Research, FrontierScience-Olympiad, DeepSearchQA, WideSearch

See benchmarks/README.md for dataset layout, judge configuration, and how to add a new benchmark.


⭐ Star History

Star History Chart

📚 Citation

@techreport{apodex2026,
  title  = {Apodex-1.0: A Verification-Centric Agent Team for Discoverative Intelligence},
  author = {Apodex Team},
  year   = {2026}
}

📄 License

Apache 2.0 — see LICENSE.

About

Evaluation harness for Apodex-1.0 on public deep-research benchmarks.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors