fix(orchestrator): hard-enforce ZOTrainingCallback contract #59

Merged
SamPlvs merged 1 commit into main from claude/recursing-einstein-73777c on Apr 27, 2026


@SamPlvs commented Apr 27, 2026

Summary

  • New resolve_active_experiment_dir() helper in src/zo/experiments.py — single source of truth for "where is the active Phase 4 experiment writing?". Both cli.py:watch_training and wrapper.py:_maybe_open_training_pane() consume it instead of hardcoding logs/training/.
  • Phase 4 gate now hard-fails when metrics.jsonl or training_status.json is missing alongside result.md. Missing artifacts surface in the gate rationale with the literal hint "ZOTrainingCallback not used".
  • model-builder.md reconciled: removed the contradictory log_dir="logs/training" example. specs/agents.md Section 3 updated to match.
  • 4 hooks in .claude/settings.json made cwd-tolerant so sub-agents running from delivery-repo cwd don't error on missing scripts.
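
A minimal sketch of the artifact check described in the second bullet, for readers who want the shape of the rule without opening the diff. The three file names and the hint string come from this PR; the function name `check_phase4_artifacts`, its signature, and the rationale wording are hypothetical and are not the actual `_finalize_experiments` code.

```python
from pathlib import Path

# File names are taken from the PR text; everything else here is illustrative.
REQUIRED_ARTIFACTS = ("result.md", "metrics.jsonl", "training_status.json")
CALLBACK_ARTIFACTS = {"metrics.jsonl", "training_status.json"}


def check_phase4_artifacts(exp_dir: Path) -> tuple[bool, str]:
    """Return (passed, rationale) for one experiment directory (hypothetical helper)."""
    # Collect every required file that is absent, so the rationale lists all of them.
    missing = [name for name in REQUIRED_ARTIFACTS if not (exp_dir / name).is_file()]
    if not missing:
        return True, "all required artifacts present"
    rationale = f"missing in {exp_dir.name}: {', '.join(missing)}"
    # The callback writes these two files, so their absence implies it was bypassed.
    if CALLBACK_ARTIFACTS.intersection(missing):
        rationale += " (ZOTrainingCallback not used)"
    return False, rationale
```

Under these assumptions, a directory holding only result.md fails with a rationale that names both missing files, which is the behaviour the rationale-lists-all-missing test name suggests.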

Why

First --low-token MNIST bench completed at 98.83% test accuracy with zo watch-training rendering "Waiting for training to start…" the entire run. Investigation surfaced three stacked contract violations:

  1. model-builder.md had two contradictory paths. Phase 4 section pointed at .zo/experiments/<exp_id>/ via for_experiment(), but the prominent "Training Metrics Protocol REQUIRED" code example used log_dir="logs/training". Sonnet (low-token mode) ignored both and wrote vanilla PyTorch with a manual JSON dump to experiments/exp-001/results.json — a third path the agent invented.
  2. Consumers looked at the wrong path. Even if the agent had complied with Phase 4, both wrapper.py:530 and cli.py:2638 hardcoded <delivery>/logs/training/training_status.json, so the dashboard would still have been blank.
  3. The gate was aspirational. _finalize_experiments only checked result.md existence, not the actual ZOTrainingCallback artifacts. The phase passed despite the contract being bypassed.

PR-005 ("enforcement > aspiration") principle applies. PRIORS.md PR-035 captures the generalised lesson: aspirational agent contracts get ignored under sub-optimal models — hard gate enforcement is mandatory.

Test plan

  • +19 new tests across 4 classes:
    • TestResolveActiveExperimentDir (8) — running-preferred-over-complete, fallback-to-complete, skips-failed-and-aborted, etc.
    • TestPhase4GateRequiresTrainingArtifacts (4) — missing-metrics-blocks, missing-status-blocks, all-three-pass, rationale-lists-all-missing
    • TestWatchTrainingPathResolution (3) — resolves-running-exp, falls-back-to-experiments-root, does-not-use-legacy-path
    • TestMaybeOpenTrainingPane (4) — skips-when-no-zo-dir, skips-when-no-active-exp, opens-pane-when-status-json-exists, ignores-legacy-logs-training-path
  • Existing test_experiment_flow.py + test_auto_iteration.py helpers updated to write the new required artifacts.
  • Test count 706 → 725 + 7 skipped.
  • ruff src/zo/ clean.
  • ./scripts/validate-docs.sh 10/10 (1 pre-existing test-count badge warning, unrelated).
  • End-to-end demo verified: planted a stub training_status.json in .zo/experiments/exp-001/, ran zo watch-training — full dashboard renders (progress bar, metrics table, sparkline, checkpoints, completion status) and exits cleanly. Compare to the "Waiting…" panel that motivated this PR.

Out of scope (separate follow-up)

  • Cost-benchmark cascade. First --low-token bench measured ~30% savings ($7.75 vs ~$11 default), not the 70-80% claimed in v1.0.2 docs. cost-benchmark.mdx, low-token-mode.mdx, low-token-preset.mdx, README, STATE need an honesty pass with the measured number — separate PR.
  • STATE.md staleness in delivery repo — orchestrator doesn't auto-write phase transitions to delivery STATE.md (separate code path, same family as this fix).

🤖 Generated with Claude Code

The model-builder could silently bypass ZOTrainingCallback. The Phase 4
gate only verified result.md, and consumers (cli watch-training, wrapper
auto-split-pane) hardcoded the wrong path (logs/training/) so even
compliant agents couldn't drive the dashboard. Surfaced when the first
--low-token MNIST bench completed at 98.83% test accuracy with
zo watch-training showing "Waiting for training to start..." the entire
run — the agent (Sonnet) wrote vanilla PyTorch with a manual JSON dump
to experiments/exp-001/results.json, ignoring two contradictory
contract instructions.

Three layers of fix:

1. New resolve_active_experiment_dir() helper in src/zo/experiments.py
   — single source of truth (most-recent RUNNING > most-recent COMPLETE
   > None). Skips FAILED and ABORTED.

2. Consumers refactored: cli.py:watch_training and
   wrapper.py:_maybe_open_training_pane now consume the helper instead
   of hardcoding logs/training/. Wrapper also passes --repo to the
   spawned subprocess so cwd detection isn't required.

3. orchestrator._finalize_experiments now requires metrics.jsonl AND
   training_status.json AND result.md per running experiment. Missing
   artifacts surface in the gate rationale ("ZOTrainingCallback not
   used") so the next iteration's Lead prompt makes the cause
   unambiguous to the Model Builder.
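
For illustration, the resolution order in item 1 (most-recent RUNNING, then most-recent COMPLETE, then None, skipping FAILED and ABORTED) can be sketched as below. This is not the code in src/zo/experiments.py: the `Status` enum, the `Experiment` record, the `updated_at` field used to define "most recent", and the function signature are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from pathlib import Path
from typing import Optional


class Status(Enum):  # hypothetical; names mirror the statuses in the commit message
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"
    ABORTED = "aborted"


@dataclass
class Experiment:  # hypothetical record, not the project's real type
    exp_id: str
    status: Status
    updated_at: datetime  # assumed basis for "most recent"


def resolve_active_experiment_dir_sketch(
    experiments_root: Path, experiments: list[Experiment]
) -> Optional[Path]:
    """Prefer the newest RUNNING experiment, then the newest COMPLETE one."""
    for wanted in (Status.RUNNING, Status.COMPLETE):
        candidates = [e for e in experiments if e.status is wanted]
        if candidates:
            newest = max(candidates, key=lambda e: e.updated_at)
            return experiments_root / newest.exp_id
    # FAILED and ABORTED experiments are never selected.
    return None
```

Because the status priority is checked before the timestamp, an older RUNNING experiment still wins over a newer COMPLETE one, which matches the running-preferred-over-complete test name.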

Plus contract reconciliation: model-builder.md previously had two
contradictory paths (Phase 4 said .zo/experiments/<exp_id>/, "Training
Metrics Protocol REQUIRED" said log_dir="logs/training"). Stale
section removed; specs/agents.md Section 3 updated to match. Validation
Checklist now includes existence checks for the callback artifacts.

Bonus: 4 hooks in .claude/settings.json made cwd-tolerant — sub-agents
running from delivery repo cwd no longer error on missing scripts.

Tests: +19 covering the resolver (8), Phase 4 gate enforcement (4),
watch-training path resolution (3), wrapper auto-split (4). Existing
test_experiment_flow.py and test_auto_iteration.py helpers updated to
write the new required artifacts. Test count 706 → 725 + 7 skipped.
ruff src/zo/ clean. validate-docs 10/10.

End-to-end demo: planted a stub training_status.json in
.zo/experiments/exp-001/, ran zo watch-training — full dashboard
renders (progress bar, metrics, sparkline, checkpoints) and exits
cleanly when training completes.

Captured in PRIORS.md as PR-035: aspirational agent contracts get
ignored under sub-optimal models — hard gate enforcement is mandatory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cloudflare-workers-and-pages

Deploying zero-operators with Cloudflare Pages

Latest commit: c824c15
Status: ✅  Deploy successful!
Preview URL: https://b0d0669b.zero-operators.pages.dev
Branch Preview URL: https://claude-recursing-einstein-73.zero-operators.pages.dev


@SamPlvs SamPlvs merged commit f290311 into main Apr 27, 2026
2 checks passed
@SamPlvs SamPlvs deleted the claude/recursing-einstein-73777c branch April 27, 2026 15:19
SamPlvs added a commit that referenced this pull request Apr 30, 2026
fix(orchestrator): hard-enforce ZOTrainingCallback contract
SamPlvs added a commit that referenced this pull request Apr 30, 2026
First completed --low-token MNIST bench (2026-04-27) measured $7.75
end-to-end ($4.48 lead Sonnet + $3.27 sub-agents Sonnet, captured via
npx ccusage --instances) vs. ~$11 default-mode reference = ~30%
reduction, not the 70-80% projected before measurement.

Structural reason: in default mode, sub-agents are already on Sonnet
via their .md frontmatter. Only the Lead Orchestrator is Opus. So
--low-token only affects the lead's ~30-40% cost share. At ~5x
cheaper per token on Sonnet, that's a ceiling of ~25-30% savings —
exactly what we measured. The earlier 70-80% projection extrapolated
the per-token rate (5x) to the whole run; the whole run was already
mostly Sonnet.
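
The ceiling arithmetic above can be checked with a few lines. This is a simplified model, assuming the lead's token volume is unchanged and only its per-token price drops; the function name `savings_ceiling` is hypothetical, while the 30-40% share, the ~5x ratio, and the $7.75 and ~$11 figures come from this commit message.

```python
def savings_ceiling(lead_cost_share: float, price_ratio: float) -> float:
    """Fraction of total run cost saved when only the lead's share gets cheaper."""
    # The lead's share shrinks to share / price_ratio; sub-agent cost is untouched.
    return lead_cost_share * (1.0 - 1.0 / price_ratio)


low = savings_ceiling(0.30, 5.0)    # lead at 30% of cost
high = savings_ceiling(0.40, 5.0)   # lead at 40% of cost
measured = 1.0 - 7.75 / 11.0        # measured run vs. default-mode reference

print(f"ceiling: {low:.0%} to {high:.0%}, measured: {measured:.1%}")
# prints "ceiling: 24% to 32%, measured: 29.5%"
```

The 24% to 32% band brackets the ~25-30% ceiling quoted above, and the measured 29.5% falls inside it.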

Cascade across 5 surfaces, plus a decision-log entry:

- docs/reference/cost-benchmark.mdx: Note callout updated; "Estimated
  results (pre-measurement)" → "Measured results (2026-04-27)" with
  cost breakdown table; new "Why the savings ceiling is structural"
  section explaining the role-by-role token-share math; new "What
  --low-token actually moves" knob-by-knob attribution table; new
  "What would push savings higher" roadmap (Haiku for code-reviewer/
  test-engineer/oracle-qa, Phase-1 trim, prompt caching via SDK
  refactor); "Findings from the partial run" rewritten to "Findings
  from the first measured run" capturing override-mechanism
  confirmation + the watch-training contract violation discovery
  (PR #59); final tracking table now has the 2026-04-27 row.

- docs/concepts/low-token-mode.mdx: "Estimated savings" → "Measured
  savings" with honest 30% number + plan-shape caveats; "Worked
  example: MNIST" updated with measured cost breakdown; FAQ entry on
  sub-agents-on-Sonnet rewritten — they were always Sonnet, the
  override is defence-in-depth not a fix.

- docs/reference/low-token-preset.mdx: added Note callout at top with
  measured $7.75 / 30% headline + link to cost-benchmark.

- README.md: --low-token paragraph rewritten with measured number +
  structural-ceiling caveat + link to cost-benchmark for higher-savings
  roadmap.

- memory/zo-platform/STATE.md: session 025 hand-off rewritten with
  final $7.75 number; old 70-80% claim in the long status block
  marked "now refuted — measured ~30%"; benchmark line rewritten.

- memory/zo-platform/DECISION_LOG.md: new entry at 2026-04-27T16:30:00Z
  documenting the cascade and the rationale (PR-005 enforcement >
  aspiration applied to marketing claims).

Quality gates: validate-docs 10/10, pytest 725/725 + 7 skipped, ruff
src/zo/ clean. No code changes — pure documentation + memory cascade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>