GAIA L1 eval — atomic vs Hermes (2026-06-11)

Pre-release

@Ooooze Ooooze released this 11 Jun 11:08
· 26 commits to main since this release

Benchmark artifacts for the GAIA validation set, Level 1 (53 tasks).

Model: qwen-3.6-35b-a3b (local llama-server).

The asset gaia-l1-eval.tar.gz contains:

  • GAIA-L1-EXPERIMENT.md — reproducible experiment write-up.
  • reports/atomic-agent-L1/ — atomic-agent run (matrix.csv/jsonl, environment.json, per-task NDJSON traces).
  • reports/hermes-L1/ — Hermes run (matrix.csv/jsonl, environment.json).
  • logs/atomic-l1.log — atomic-agent run log.
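
To check a downloaded copy against the list above, the archive members can be enumerated without unpacking. This is a minimal sketch using Python's standard tarfile module. It assumes the asset has been saved as gaia-l1-eval.tar.gz in the current directory; the helper name list_archive_members is illustrative and not part of the release.

```python
import tarfile
from pathlib import Path


def list_archive_members(archive_path: str) -> list[str]:
    """Return the sorted member paths stored in a gzip-compressed tarball."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())


if __name__ == "__main__":
    # Assumed location of the downloaded asset; adjust as needed.
    asset = Path("gaia-l1-eval.tar.gz")
    if asset.exists():
        for name in list_archive_members(str(asset)):
            print(name)
    else:
        print(f"{asset} not found; download it from the release page first")
```

Only member names are read here, so nothing is written to disk until you choose to extract.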

Headline: atomic-agent reaches 69.8% accuracy vs 58.5% for Hermes on the same 53 tasks.
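
The release does not state raw solved-task counts, but they can be inferred from the percentages. The sketch below assumes accuracy is computed as correct / total over the 53 tasks and reported to one decimal place; the function name counts_matching is illustrative. Under that assumption the only consistent counts are 37 and 31, a gap of 6 tasks. The authoritative per-task results are in the matrix.csv/jsonl files.

```python
def counts_matching(reported_pct: float, total: int) -> list[int]:
    """Return every correct-count k whose accuracy rounds to reported_pct.

    Assumes accuracy = 100 * k / total, reported to one decimal place.
    """
    return [
        k for k in range(total + 1)
        if abs(100 * k / total - reported_pct) < 0.05
    ]


TOTAL_TASKS = 53  # GAIA validation Level 1, as stated in the release

atomic = counts_matching(69.8, TOTAL_TASKS)  # → [37]
hermes = counts_matching(58.5, TOTAL_TASKS)  # → [31]

print("atomic-agent:", atomic, "Hermes:", hermes)
print("gap in tasks:", atomic[0] - hermes[0])  # → 6
```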