GAIA L1 eval — atomic vs Hermes (2026-06-11)

Pre-release

Ooooze released this 11 Jun 11:08

gaia-l1-eval-2026-06-11

a2a8d97

Benchmark artifacts for the GAIA validation set, Level 1 (53 tasks).

Model: qwen-3.6-35b-a3b (local llama-server).

Asset gaia-l1-eval.tar.gz contains:

GAIA-L1-EXPERIMENT.md — reproducible experiment write-up.
reports/atomic-agent-L1/ — atomic-agent run (matrix.csv/jsonl, environment.json, per-task NDJSON traces).
reports/hermes-L1/ — Hermes run (matrix.csv/jsonl, environment.json).
logs/atomic-l1.log — atomic-agent run log.
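
To check the asset against the listing above before digging into the reports, a minimal Python sketch along these lines could help. The helper name `list_files` is hypothetical, and the sketch assumes only that the asset is a gzip-compressed tarball; whether the paths sit under an extra top-level directory inside the archive is not stated here.

```python
import tarfile


def list_files(archive_path):
    """Return the sorted paths of all regular files in a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as tar:
        # Drop a leading "./" so names compare cleanly with the listing.
        return sorted(
            member.name.removeprefix("./")
            for member in tar.getmembers()
            if member.isfile()
        )


# Example (assumes the asset was downloaded to the working directory):
#   for name in list_files("gaia-l1-eval.tar.gz"):
#       print(name)
```

The sketch only reads the archive index and extracts nothing, so it is safe to run before deciding where to unpack. `str.removeprefix` needs Python 3.9 or later.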

Headline: atomic-agent reached 69.8% accuracy vs. 58.5% for Hermes on the 53 tasks.
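
The headline percentages can be mapped back to task counts with a little arithmetic. The sketch below assumes that accuracy means correct answers divided by all 53 tasks, rounded to one decimal place; that is an assumption, and the matrix files in the asset remain the authoritative source. The function name `consistent_counts` is hypothetical.

```python
TOTAL_TASKS = 53  # GAIA validation Level 1, as stated in this release


def consistent_counts(pct, total=TOTAL_TASKS):
    """Return every correct-answer count k where 100*k/total rounds to pct."""
    return [k for k in range(total + 1) if round(100 * k / total, 1) == pct]


atomic = consistent_counts(69.8)  # [37]
hermes = consistent_counts(58.5)  # [31]
print(atomic, hermes)
```

Under that assumption, the figures correspond to 37/53 for atomic-agent and 31/53 for Hermes, a gap of 6 tasks.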
