AgentHarness

📰Tech Blog | 📄Tech Report

Evaluation harness for Apodex-1.0 on public deep-research benchmarks.

AgentHarness is the open-source evaluation harness used to reproduce the public benchmark results for Apodex-1.0 in a standard ReAct setup. Apodex-1.0 is a verification-centric model for deep research developed by the Apodex team. This repository focuses on the public, standard ReAct evaluation setup reported in the paper.

📊 Performance

Open-source Apodex-1.0 variants on the four-benchmark deep-research suite:

| Model | BrowseComp | BrowseComp-ZH | HLE-Text | DeepSearchQA |
| --- | --- | --- | --- | --- |
| Apodex-1.0-mini | 71.5 | 80.6 | 46.8 | 82.2 |
| Apodex-1.0-4B-SFT | 48.8 | 63.5 | 32.9 | 69.9 |
| Apodex-1.0-2B-SFT | 27.9 | 35.0 | 18.2 | 49.9 |
| Apodex-1.0-0.8B-SFT | 13.9 | 10.7 | 11.2 | 25.8 |

⚡ Quick Start on the harness

1. Install dependencies

uv sync --python 3.12

2. Serve the model (SGLang)

python3 -m sglang.launch_server \
  --model-path apodex/Apodex-1.0-35B-A3B \
  --tp 8 \
  --host 0.0.0.0 \
  --port 1234 \
  --context-length 262144 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3

For smaller variants and other serving options, see the Hugging Face model card.

3. Configure environment variables

cp .env.example .env

Fill in the required keys in .env. OPENAI_BASE_URL, OPENAI_API_KEY, and OPENAI_MODEL point at the agent model (your SGLang endpoint from step 2 or any OpenAI-compatible service). SERPER_API_KEY, JINA_API_KEY, and E2B_API_KEY enable web search, web fetch, and the code sandbox, respectively.
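Before launching a run, it can help to confirm that these keys are actually set. The following is a minimal sketch and not part of the harness: the `missing_keys` helper is hypothetical, and it assumes that all six variables are mandatory and already exported into the process environment (the harness may load .env in its own way).

```python
import os
from typing import Mapping

# Keys named in step 3; treating all six as mandatory is an assumption.
REQUIRED_KEYS = (
    "OPENAI_BASE_URL",
    "OPENAI_API_KEY",
    "OPENAI_MODEL",
    "SERPER_API_KEY",
    "JINA_API_KEY",
    "E2B_API_KEY",
)


def missing_keys(env: Mapping[str, str]) -> list[str]:
    """Return the required keys that are absent or blank in env."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All required keys are set.")
```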

4. Download the benchmark datasets

wget https://huggingface.co/datasets/apodex/Deep-Research-Benchmarks/resolve/main/deep_research_benchmarks_260607.zip
unzip -P 'apodex*()_2026' deep_research_benchmarks_260607.zip
rm deep_research_benchmarks_260607.zip

The single quotes around the password are required because it contains shell metacharacters (*, (, )).
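To see why the quoting matters, the standard-library `shlex` module can show how a POSIX shell would tokenize the command. This is only an illustration of the quoting rule and is not needed to use the harness.

```python
import shlex

PASSWORD = "apodex*()_2026"  # archive password from the unzip command in step 4

# shlex.quote wraps any string containing shell metacharacters in single quotes.
quoted = shlex.quote(PASSWORD)
print(quoted)

# Splitting the quoted command with POSIX rules keeps the password as one intact argument.
args = shlex.split(f"unzip -P {quoted} deep_research_benchmarks_260607.zip")
print(args)
```

Without the quotes, an interactive shell would try to interpret the parentheses and the glob character before `unzip` ever sees the password.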

HLE is not included. Its license forbids redistributing the answers. To run hle_text, accept the license on cais/hle and place the JSONL at benchmarks/datasets/HLE-text/standardized_data.jsonl.
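After placing the file, a quick sanity check can confirm that it is where the harness expects it and that every line parses as JSON. This is a sketch under stated assumptions: `count_jsonl_records` is a hypothetical helper, and it checks JSON validity only, since the fields the harness expects are not described here.

```python
import json
from pathlib import Path

# Location given above for the HLE text-only split.
HLE_PATH = Path("benchmarks/datasets/HLE-text/standardized_data.jsonl")


def count_jsonl_records(path: Path) -> int:
    """Count non-empty lines; raise ValueError on the first line that is not valid JSON."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
            count += 1
    return count


if __name__ == "__main__":
    if HLE_PATH.is_file():
        print(f"{HLE_PATH}: {count_jsonl_records(HLE_PATH)} records")
    else:
        print(f"{HLE_PATH} not found; hle_text cannot run yet")
```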

5. Run a smoke test

uv run python -m benchmarks.runner.run_subprocess \
  --benchmark browsecomp \
  --pipeline react_base \
  --profile default \
  --limit 1 \
  --concurrency 1 \
  --out ./tmp/smoke

6. Run a full benchmark

uv run python -m benchmarks.runner.run_subprocess \
  --benchmark browsecomp \
  --pipeline react_base \
  --profile default \
  --runs 5 \
  --concurrency 30 \
  --out ./bc-runs
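To queue several benchmarks with the same settings, the invocation can be assembled programmatically. The sketch below uses only the flags shown in steps 5 and 6; the `build_runner_cmd` helper and the `./runs/<name>` output layout are hypothetical, and the only benchmark identifiers used are the two that appear in this README (`browsecomp` and `hle_text`).

```python
import shlex


def build_runner_cmd(benchmark, out_dir, *, runs=None, limit=None, concurrency=1):
    """Assemble the run_subprocess invocation from the flags shown in steps 5 and 6."""
    cmd = [
        "uv", "run", "python", "-m", "benchmarks.runner.run_subprocess",
        "--benchmark", benchmark,
        "--pipeline", "react_base",
        "--profile", "default",
    ]
    if runs is not None:
        cmd += ["--runs", str(runs)]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    cmd += ["--concurrency", str(concurrency), "--out", str(out_dir)]
    return cmd


# Print the commands instead of running them; pass a list to subprocess.run to execute it.
for name in ("browsecomp", "hle_text"):
    print(shlex.join(build_runner_cmd(name, f"./runs/{name}", runs=5, concurrency=30)))
```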

7. Check progress and aggregate accuracy

uv run python -m benchmarks.runner.check_progress ./bc-runs
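How `check_progress` combines the runs is not described here. One plausible aggregation, shown purely as an illustration, is to compute accuracy per run and report the mean and spread across runs. The `aggregate_accuracy` function and its input format are assumptions, and the sample outcomes are placeholders rather than benchmark results.

```python
from statistics import mean, pstdev


def aggregate_accuracy(runs):
    """runs: one list of per-question booleans (True = judged correct) per run."""
    per_run = [100.0 * sum(outcomes) / len(outcomes) for outcomes in runs]
    return mean(per_run), pstdev(per_run)


# Placeholder outcomes for illustration only, not real benchmark results.
example_runs = [
    [True, False, True, True],
    [True, True, False, False],
]
avg, spread = aggregate_accuracy(example_runs)
print(f"mean accuracy {avg:.1f}% (std {spread:.1f}) over {len(example_runs)} runs")
```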

Each question runs in its own subprocess, which makes runs easier to reproduce and debug:

- isolated execution per question
- no asyncio saturation
- individual hangs can be SIGKILL'd
- failed samples can be rerun independently
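The general pattern behind these properties can be sketched with the standard `subprocess` module. This is not the harness's implementation: `run_isolated` and `run_all` are hypothetical helpers, the commands are trivial stand-ins for real agent runs, and the loop is sequential for brevity where a real runner would add concurrency.

```python
import subprocess
import sys


def run_isolated(cmd, timeout_s):
    """Run one command in its own process; return (status, stdout)."""
    try:
        # On timeout, subprocess.run kills the child before raising TimeoutExpired.
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout", ""
    return ("ok" if proc.returncode == 0 else "failed"), proc.stdout


def run_all(commands, timeout_s):
    """Run each question's command separately and collect the ids that need a rerun."""
    results, to_rerun = {}, []
    for question_id, cmd in commands.items():
        status, _ = run_isolated(cmd, timeout_s)
        results[question_id] = status
        if status != "ok":
            to_rerun.append(question_id)
    return results, to_rerun


# Stand-in commands: Python one-liners instead of real agent runs.
demo = {
    "q1": [sys.executable, "-c", "print('answer')"],
    "q2": [sys.executable, "-c", "import time; time.sleep(30)"],
    "q3": [sys.executable, "-c", "raise SystemExit(3)"],
}
results, to_rerun = run_all(demo, timeout_s=5.0)
print(results, to_rerun)
```

Because each question lives in its own process, a hang or crash in one sample cannot stall the others, and the list of failed ids is enough to rerun only those samples.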

✅ Supported Benchmarks

BrowseComp, BrowseComp-ZH, xbench-DeepResearch, Humanity's Last Exam (text-only), SuperChem, FrontierScience-Research, FrontierScience-Olympiad, DeepSearchQA, WideSearch

See benchmarks/README.md for dataset layout, judge configuration, and how to add a new benchmark.

📚 Citation

@techreport{apodex2026,
  title  = {Apodex-1.0: A Verification-Centric Agent Team for Discoverative Intelligence},
  author = {Apodex Team},
  year   = {2026}
}

📄 License

Apache 2.0 — see LICENSE.
