fix(orchestrator): hard-enforce ZOTrainingCallback contract #59
Merged
The model-builder could silently bypass ZOTrainingCallback. The Phase 4
gate only verified result.md, and consumers (cli watch-training, wrapper
auto-split-pane) hardcoded the wrong path (logs/training/) so even
compliant agents couldn't drive the dashboard. Surfaced when the first
--low-token MNIST bench completed at 98.83% test accuracy with
zo watch-training showing "Waiting for training to start..." the entire
run — the agent (Sonnet) wrote vanilla PyTorch with a manual JSON dump
to experiments/exp-001/results.json, ignoring two contradictory
contract instructions.
Three layers of fix:
1. New resolve_active_experiment_dir() helper in src/zo/experiments.py
— single source of truth (most-recent RUNNING > most-recent COMPLETE
> None). Skips FAILED and ABORTED.
2. Consumers refactored: cli.py:watch_training and
wrapper.py:_maybe_open_training_pane now consume the helper instead
of hardcoding logs/training/. Wrapper also passes --repo to the
spawned subprocess so cwd detection isn't required.
3. orchestrator._finalize_experiments now requires metrics.jsonl AND
training_status.json AND result.md per running experiment. Missing
artifacts surface in the gate rationale ("ZOTrainingCallback not
used") so the next iteration's Lead prompt makes the cause
unambiguous to the Model Builder.
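
The resolution order from item 1 (most-recent RUNNING, then most-recent COMPLETE, then None, never FAILED or ABORTED) can be sketched as below. This is only an illustration: the `ExperimentRecord` type, its `updated_at` recency key, and the function name `pick_active_experiment_dir` are hypothetical stand-ins, since the PR text does not show how `resolve_active_experiment_dir()` reads experiment state.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ExperimentRecord:
    """Hypothetical record; the real registry schema is not shown in this PR."""
    exp_dir: Path
    status: str        # "RUNNING", "COMPLETE", "FAILED", or "ABORTED"
    updated_at: float  # assumed recency key, e.g. a Unix timestamp


def pick_active_experiment_dir(records: list[ExperimentRecord]) -> Optional[Path]:
    """Most-recent RUNNING, else most-recent COMPLETE, else None."""
    for wanted in ("RUNNING", "COMPLETE"):
        candidates = [r for r in records if r.status == wanted]
        if candidates:
            return max(candidates, key=lambda r: r.updated_at).exp_dir
    # FAILED and ABORTED experiments are never selected.
    return None


records = [
    ExperimentRecord(Path(".zo/experiments/exp-001"), "COMPLETE", 100.0),
    ExperimentRecord(Path(".zo/experiments/exp-002"), "FAILED", 300.0),
    ExperimentRecord(Path(".zo/experiments/exp-003"), "RUNNING", 200.0),
]
active = pick_active_experiment_dir(records)
print(active)
```

Here the FAILED experiment is the newest but is skipped, and the RUNNING one wins over the older COMPLETE one. Returning None rather than a default path is what lets consumers show a "no active experiment" state instead of polling a stale directory.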
Plus contract reconciliation: model-builder.md previously had two
contradictory paths (Phase 4 said .zo/experiments/<exp_id>/, "Training
Metrics Protocol REQUIRED" said log_dir="logs/training"). Stale
section removed; specs/agents.md Section 3 updated to match. Validation
Checklist now includes existence checks for the callback artifacts.
Bonus: 4 hooks in .claude/settings.json made cwd-tolerant — sub-agents
running from delivery repo cwd no longer error on missing scripts.
Tests: +19 covering the resolver (8), Phase 4 gate enforcement (4),
watch-training path resolution (3), wrapper auto-split (4). Existing
test_experiment_flow.py and test_auto_iteration.py helpers updated to
write the new required artifacts. Test count 706 → 725 + 7 skipped.
ruff src/zo/ clean. validate-docs 10/10.
End-to-end demo: planted a stub training_status.json in
.zo/experiments/exp-001/, ran zo watch-training — full dashboard
renders (progress bar, metrics, sparkline, checkpoints) and exits
cleanly when training completes.
Captured in PRIORS.md as PR-035: aspirational agent contracts get
ignored under sub-optimal models — hard gate enforcement is mandatory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deploying zero-operators with Cloudflare Pages
| Latest commit: | c824c15 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://b0d0669b.zero-operators.pages.dev |
| Branch Preview URL: | https://claude-recursing-einstein-73.zero-operators.pages.dev |
This was referenced Apr 27, 2026
SamPlvs added a commit that referenced this pull request Apr 30, 2026
fix(orchestrator): hard-enforce ZOTrainingCallback contract
SamPlvs added a commit that referenced this pull request Apr 30, 2026
First completed --low-token MNIST bench (2026-04-27) measured $7.75 end-to-end ($4.48 lead Sonnet + $3.27 sub-agents Sonnet, captured via npx ccusage --instances) vs. ~$11 default-mode reference = ~30% reduction, not the 70-80% projected before measurement.

Structural reason: in default mode, sub-agents are already on Sonnet via their .md frontmatter. Only the Lead Orchestrator is Opus. So --low-token only affects the lead's ~30-40% cost share. At ~5x cheaper per token on Sonnet, that's a ceiling of ~25-30% savings, exactly what we measured. The earlier 70-80% projection extrapolated the per-token rate (5x) to the whole run; the whole run was already mostly Sonnet.

Cascade across 5 surfaces:
- docs/reference/cost-benchmark.mdx: Note callout updated; "Estimated results (pre-measurement)" → "Measured results (2026-04-27)" with cost breakdown table; new "Why the savings ceiling is structural" section explaining the role-by-role token-share math; new "What --low-token actually moves" knob-by-knob attribution table; new "What would push savings higher" roadmap (Haiku for code-reviewer/test-engineer/oracle-qa, Phase-1 trim, prompt caching via SDK refactor); "Findings from the partial run" rewritten to "Findings from the first measured run" capturing override-mechanism confirmation + the watch-training contract violation discovery (PR #59); final tracking table now has the 2026-04-27 row.
- docs/concepts/low-token-mode.mdx: "Estimated savings" → "Measured savings" with honest 30% number + plan-shape caveats; "Worked example: MNIST" updated with measured cost breakdown; FAQ entry on sub-agents-on-Sonnet rewritten: they were always Sonnet, the override is defence-in-depth not a fix.
- docs/reference/low-token-preset.mdx: added Note callout at top with measured $7.75 / 30% headline + link to cost-benchmark.
- README.md: --low-token paragraph rewritten with measured number + structural-ceiling caveat + link to cost-benchmark for higher-savings roadmap.
- memory/zo-platform/STATE.md: session 025 hand-off rewritten with final $7.75 number; old 70-80% claim in the long status block marked "now refuted, measured ~30%"; benchmark line rewritten.
- memory/zo-platform/DECISION_LOG.md: new entry at 2026-04-27T16:30:00Z documenting the cascade and the rationale (PR-005 enforcement > aspiration applied to marketing claims).

Quality gates: validate-docs 10/10, pytest 725/725 + 7 skipped, ruff src/zo/ clean. No code changes: pure documentation + memory cascade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
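
The structural-ceiling argument in this commit message can be checked with a few lines of arithmetic. The sketch below uses only the quoted figures (30-40% lead cost share, ~5x cheaper per token, $4.48 + $3.27 measured, ~$11 reference). The function name `savings_ceiling` is hypothetical, and the model assumes the lead uses the same token volume in both modes and that sub-agent cost is unchanged.

```python
def savings_ceiling(lead_cost_share: float, price_ratio: float) -> float:
    """Fraction of total run cost saved when only the lead's tokens get cheaper."""
    return lead_cost_share * (1.0 - 1.0 / price_ratio)


# Ceiling for the quoted 30-40% lead cost share at ~5x cheaper per token.
low = savings_ceiling(0.30, 5.0)
high = savings_ceiling(0.40, 5.0)

# Measured reduction from the quoted figures: $7.75 vs. the ~$11 reference.
measured_cost = 4.48 + 3.27
measured = 1.0 - measured_cost / 11.0

print(f"ceiling: {low:.0%} to {high:.0%}, measured: {measured:.1%}")
```

Under these assumptions the ceiling works out to 24-32%, in line with the ~25-30% quoted above, and the measured reduction is about 29.5%. Applying the 5x rate to the whole run would give 1 - 1/5 = 80%, which is the shape of the earlier 70-80% projection.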
Summary
- New `resolve_active_experiment_dir()` helper in `src/zo/experiments.py`: single source of truth for "where is the active Phase 4 experiment writing?". Both `cli.py:watch_training` and `wrapper.py:_maybe_open_training_pane()` consume it instead of hardcoding `logs/training/`.
- Phase 4 gate now blocks when `metrics.jsonl` and `training_status.json` are missing alongside `result.md`. Missing artifacts surface in the gate rationale with the literal hint `"ZOTrainingCallback not used"`.
- `model-builder.md` reconciled: removed the contradictory `log_dir="logs/training"` example. `specs/agents.md` Section 3 updated to match.
- 4 hooks in `.claude/settings.json` made cwd-tolerant so sub-agents running from delivery-repo cwd don't error on missing scripts.

Why
First `--low-token` MNIST bench completed at 98.83% test accuracy with `zo watch-training` rendering "Waiting for training to start…" the entire run. Investigation surfaced three stacked contract violations:

1. `model-builder.md` had two contradictory paths. The Phase 4 section pointed at `.zo/experiments/<exp_id>/` via `for_experiment()`, but the prominent "Training Metrics Protocol REQUIRED" code example used `log_dir="logs/training"`. Sonnet (low-token mode) ignored both and wrote vanilla PyTorch with a manual JSON dump to `experiments/exp-001/results.json`, a third path the agent invented.
2. `wrapper.py:530` and `cli.py:2638` hardcoded `<delivery>/logs/training/training_status.json`, so the dashboard would still have been blank even with a compliant agent.
3. `_finalize_experiments` only checked `result.md` existence, not the actual `ZOTrainingCallback` artifacts. The phase passed despite the contract being bypassed.

The PR-005 ("enforcement > aspiration") principle applies. PRIORS.md PR-035 captures the generalised lesson: aspirational agent contracts get ignored under sub-optimal models, so hard gate enforcement is mandatory.
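
To make the gate gap concrete (a check on `result.md` alone versus all three artifacts), here is a minimal sketch of an existence check that names the missing files in its rationale. The three artifact names and the hint string come from this PR. The function name `check_training_artifacts` and its return shape are assumptions for illustration, not the actual `_finalize_experiments` code.

```python
import tempfile
from pathlib import Path

# Artifact names are taken from the PR text; everything else is illustrative.
REQUIRED_ARTIFACTS = ("metrics.jsonl", "training_status.json", "result.md")


def check_training_artifacts(exp_dir: Path) -> tuple[bool, str]:
    """Return (passed, rationale) for one experiment directory."""
    missing = [name for name in REQUIRED_ARTIFACTS if not (exp_dir / name).is_file()]
    if not missing:
        return True, "all training artifacts present"
    # Only the two callback-written files trigger the hint.
    callback_files = {"metrics.jsonl", "training_status.json"}
    hint = " (ZOTrainingCallback not used)" if callback_files & set(missing) else ""
    return False, f"missing: {', '.join(missing)}{hint}"


with tempfile.TemporaryDirectory() as tmp:
    exp_dir = Path(tmp)
    # A run that only wrote result.md, as in the MNIST bench, is blocked.
    (exp_dir / "result.md").write_text("# result\n")
    blocked = check_training_artifacts(exp_dir)
    # Once the callback artifacts exist, the same directory passes.
    (exp_dir / "metrics.jsonl").write_text("")
    (exp_dir / "training_status.json").write_text("{}")
    passed = check_training_artifacts(exp_dir)

print(blocked)
print(passed)
```

Listing every missing file in one rationale, rather than failing on the first, matches the `rationale-lists-all-missing` test case named in the test plan.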
Test plan
- `TestResolveActiveExperimentDir` (8): running-preferred-over-complete, fallback-to-complete, skips-failed-and-aborted, etc.
- `TestPhase4GateRequiresTrainingArtifacts` (4): missing-metrics-blocks, missing-status-blocks, all-three-pass, rationale-lists-all-missing
- `TestWatchTrainingPathResolution` (3): resolves-running-exp, falls-back-to-experiments-root, does-not-use-legacy-path
- `TestMaybeOpenTrainingPane` (4): skips-when-no-zo-dir, skips-when-no-active-exp, opens-pane-when-status-json-exists, ignores-legacy-logs-training-path
- Existing `test_experiment_flow.py` + `test_auto_iteration.py` helpers updated to write the new required artifacts.
- `ruff src/zo/` clean.
- `./scripts/validate-docs.sh` 10/10 (1 pre-existing test-count badge warning, unrelated).
- End-to-end demo: planted a stub `training_status.json` in `.zo/experiments/exp-001/`, ran `zo watch-training`. Full dashboard renders (progress bar, metrics table, sparkline, checkpoints, completion status) and exits cleanly. Compare to the "Waiting…" panel that motivated this PR.

Out of scope (separate follow-up)
- `--low-token` bench measured ~30% savings ($7.75 vs ~$11 default), not the 70-80% claimed in v1.0.2 docs. `cost-benchmark.mdx`, `low-token-mode.mdx`, `low-token-preset.mdx`, README, STATE need an honesty pass with the measured number (separate PR).
- `STATE.md` (separate code path, same family as this fix).

🤖 Generated with Claude Code