GAIA L1 eval — atomic vs Hermes (2026-06-11)
Pre-release
GAIA validation Level 1 (53 tasks) benchmark artifacts.
Model: qwen-3.6-35b-a3b (local llama-server).
Asset gaia-l1-eval.tar.gz contains:
- GAIA-L1-EXPERIMENT.md: reproducible experiment write-up.
- reports/atomic-agent-L1/: atomic-agent run (matrix.csv/jsonl, environment.json, per-task NDJSON traces).
- reports/hermes-L1/: Hermes run (matrix.csv/jsonl, environment.json).
- logs/atomic-l1.log: atomic-agent run log.
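To check a downloaded copy of the asset against the listing above, something like the following sketch may help. It uses only Python's standard `tarfile` module; the helper name `list_top_level` is hypothetical, and the exact internal layout of the archive (for example, whether entries sit under a wrapper directory) is an assumption to verify rather than something stated here.

```python
import tarfile
from pathlib import PurePosixPath


def list_top_level(archive_path):
    """Return the sorted top-level entry names of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
    top = set()
    for name in names:
        # Drop "." components so "./reports/x" and "reports/x" are treated alike.
        parts = [p for p in PurePosixPath(name).parts if p not in (".", "/")]
        if parts:
            top.add(parts[0])
    return sorted(top)


if __name__ == "__main__":
    # Assumes the asset was downloaded into the current directory.
    print(list_top_level("gaia-l1-eval.tar.gz"))
```

If the archive matches the listing, the output should include the write-up file plus the `reports` and `logs` directories, possibly nested under one wrapper directory.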
Headline: atomic-agent 69.8% vs Hermes 58.5% accuracy.
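For readers who want the headline in task counts, the percentages can be converted back using the 53-task total. This is a small arithmetic sketch, not part of the published artifacts: the correct-task counts are inferred from the rounded percentages, and the function names are illustrative. The authoritative per-task results are in the matrix files.

```python
TOTAL_TASKS = 53  # GAIA validation Level 1, as stated in these notes


def implied_correct(accuracy_pct, total=TOTAL_TASKS):
    """Nearest whole number of correct tasks implied by a rounded percentage."""
    return round(accuracy_pct / 100.0 * total)


def accuracy_pct(correct, total=TOTAL_TASKS):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


atomic_correct = implied_correct(69.8)  # inferred, not reported directly
hermes_correct = implied_correct(58.5)  # inferred, not reported directly

if __name__ == "__main__":
    print(atomic_correct, accuracy_pct(atomic_correct))
    print(hermes_correct, accuracy_pct(hermes_correct))
    print("task gap:", atomic_correct - hermes_correct)
```

Under these assumptions the headline corresponds to 37 versus 31 correct tasks, a gap of 6 tasks out of 53.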