fix(orchestrator): hard-enforce ZOTrainingCallback contract by SamPlvs · Pull Request #59 · SamPlvs/zero-operators

SamPlvs · 2026-04-27T15:13:57Z

Summary

- New resolve_active_experiment_dir() helper in src/zo/experiments.py: single source of truth for "where is the active Phase 4 experiment writing?". Both cli.py:watch_training and wrapper.py:_maybe_open_training_pane() consume it instead of hardcoding logs/training/.
- Phase 4 gate now hard-fails when metrics.jsonl and training_status.json are missing alongside result.md. Missing artifacts surface in the gate rationale with the literal hint "ZOTrainingCallback not used".
- model-builder.md reconciled: removed the contradictory log_dir="logs/training" example. specs/agents.md Section 3 updated to match.
- 4 hooks in .claude/settings.json made cwd-tolerant so sub-agents running from delivery-repo cwd don't error on missing scripts.
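
To make the gate rule in the second item concrete, here is a minimal sketch of an artifact check. It is an illustration only, not the code in orchestrator._finalize_experiments: the function name check_phase4_artifacts and its return shape are hypothetical, and only the three file names and the hint string come from this PR.

```python
from pathlib import Path

# File names and the hint string are taken from the PR text.
# Everything else here is a hypothetical illustration.
REQUIRED_ARTIFACTS = ("result.md", "metrics.jsonl", "training_status.json")
CALLBACK_ARTIFACTS = {"metrics.jsonl", "training_status.json"}


def check_phase4_artifacts(exp_dir: Path) -> tuple[bool, str]:
    """Return (passed, rationale) for one experiment directory."""
    missing = [name for name in REQUIRED_ARTIFACTS if not (exp_dir / name).is_file()]
    if not missing:
        return True, "all required artifacts present"
    # List every missing file so the next iteration sees the full picture.
    rationale = f"missing in {exp_dir.name}: {', '.join(missing)}"
    # The callback writes these two files, so their absence implies it was bypassed.
    if CALLBACK_ARTIFACTS.intersection(missing):
        rationale += " (ZOTrainingCallback not used)"
    return False, rationale
```

Under these assumptions, a directory that contains only result.md fails the check, and the rationale names both missing callback files along with the hint.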

Why

First --low-token MNIST bench completed at 98.83% test accuracy with zo watch-training rendering "Waiting for training to start…" the entire run. Investigation surfaced three stacked contract violations:

1. model-builder.md had two contradictory paths. Phase 4 section pointed at .zo/experiments/<exp_id>/ via for_experiment(), but the prominent "Training Metrics Protocol REQUIRED" code example used log_dir="logs/training". Sonnet (low-token mode) ignored both and wrote vanilla PyTorch with a manual JSON dump to experiments/exp-001/results.json, a third path the agent invented.
2. Consumers looked at the wrong path. Even if the agent had complied with Phase 4, both wrapper.py:530 and cli.py:2638 hardcoded <delivery>/logs/training/training_status.json, so the dashboard would still have been blank.
3. The gate was aspirational. _finalize_experiments only checked result.md existence, not the actual ZOTrainingCallback artifacts. The phase passed despite the contract being bypassed.

PR-005 ("enforcement > aspiration") principle applies. PRIORS.md PR-035 captures the generalised lesson: aspirational agent contracts get ignored under sub-optimal models — hard gate enforcement is mandatory.

Test plan

+19 new tests across 4 classes:
- TestResolveActiveExperimentDir (8) — running-preferred-over-complete, fallback-to-complete, skips-failed-and-aborted, etc.
- TestPhase4GateRequiresTrainingArtifacts (4) — missing-metrics-blocks, missing-status-blocks, all-three-pass, rationale-lists-all-missing
- TestWatchTrainingPathResolution (3) — resolves-running-exp, falls-back-to-experiments-root, does-not-use-legacy-path
- TestMaybeOpenTrainingPane (4) — skips-when-no-zo-dir, skips-when-no-active-exp, opens-pane-when-status-json-exists, ignores-legacy-logs-training-path
Existing test_experiment_flow.py + test_auto_iteration.py helpers updated to write the new required artifacts.
Test count 706 → 725 + 7 skipped.
ruff src/zo/ clean.
./scripts/validate-docs.sh 10/10 (1 pre-existing test-count badge warning, unrelated).
End-to-end demo verified: planted a stub training_status.json in .zo/experiments/exp-001/, ran zo watch-training — full dashboard renders (progress bar, metrics table, sparkline, checkpoints, completion status) and exits cleanly. Compare to the "Waiting…" panel that motivated this PR.

Out of scope (separate follow-up)

Cost-benchmark cascade. First --low-token bench measured ~30% savings ($7.75 vs ~$11 default), not the 70-80% claimed in v1.0.2 docs. cost-benchmark.mdx, low-token-mode.mdx, low-token-preset.mdx, README, STATE need an honesty pass with the measured number — separate PR.
STATE.md staleness in delivery repo — orchestrator doesn't auto-write phase transitions to delivery STATE.md (separate code path, same family as this fix).

🤖 Generated with Claude Code

The model-builder could silently bypass ZOTrainingCallback. The Phase 4 gate only verified result.md, and consumers (cli watch-training, wrapper auto-split-pane) hardcoded the wrong path (logs/training/), so even compliant agents couldn't drive the dashboard.

Surfaced when the first --low-token MNIST bench completed at 98.83% test accuracy with zo watch-training showing "Waiting for training to start..." the entire run: the agent (Sonnet) wrote vanilla PyTorch with a manual JSON dump to experiments/exp-001/results.json, ignoring two contradictory contract instructions.

Three layers of fix:

1. New resolve_active_experiment_dir() helper in src/zo/experiments.py: single source of truth (most-recent RUNNING > most-recent COMPLETE > None). Skips FAILED and ABORTED.
2. Consumers refactored: cli.py:watch_training and wrapper.py:_maybe_open_training_pane now consume the helper instead of hardcoding logs/training/. Wrapper also passes --repo to the spawned subprocess so cwd detection isn't required.
3. orchestrator._finalize_experiments now requires metrics.jsonl AND training_status.json AND result.md per running experiment. Missing artifacts surface in the gate rationale ("ZOTrainingCallback not used") so the next iteration's Lead prompt makes the cause unambiguous to the Model Builder.

Plus contract reconciliation: model-builder.md previously had two contradictory paths (Phase 4 said .zo/experiments/<exp_id>/, "Training Metrics Protocol REQUIRED" said log_dir="logs/training"). Stale section removed; specs/agents.md Section 3 updated to match. Validation Checklist now includes existence checks for the callback artifacts.

Bonus: 4 hooks in .claude/settings.json made cwd-tolerant, so sub-agents running from delivery repo cwd no longer error on missing scripts.

Tests: +19 covering the resolver (8), Phase 4 gate enforcement (4), watch-training path resolution (3), wrapper auto-split (4). Existing test_experiment_flow.py and test_auto_iteration.py helpers updated to write the new required artifacts. Test count 706 → 725 + 7 skipped. ruff src/zo/ clean. validate-docs 10/10.

End-to-end demo: planted a stub training_status.json in .zo/experiments/exp-001/, ran zo watch-training. Full dashboard renders (progress bar, metrics, sparkline, checkpoints) and exits cleanly when training completes.

Captured in PRIORS.md as PR-035: aspirational agent contracts get ignored under sub-optimal models, so hard gate enforcement is mandatory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
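
The resolution order stated in the commit message (most-recent RUNNING, then most-recent COMPLETE, then None, skipping FAILED and ABORTED) can be sketched as below. This is a hedged illustration, not the helper in src/zo/experiments.py: the Status enum, the Experiment record, its updated_at field, and the function signature are all assumptions, since the PR does not show how experiment state is stored or read.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from pathlib import Path
from typing import Optional


class Status(Enum):  # hypothetical; status names follow the commit message
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"
    ABORTED = "aborted"


@dataclass(frozen=True)
class Experiment:  # hypothetical record, for illustration only
    exp_id: str
    status: Status
    updated_at: datetime


def resolve_active_experiment_dir_sketch(
    root: Path, experiments: list[Experiment]
) -> Optional[Path]:
    """Pick the newest RUNNING experiment, else the newest COMPLETE one."""
    # FAILED and ABORTED never match either pass, so they are skipped.
    for wanted in (Status.RUNNING, Status.COMPLETE):
        candidates = [e for e in experiments if e.status is wanted]
        if candidates:
            newest = max(candidates, key=lambda e: e.updated_at)
            return root / newest.exp_id
    return None
```

A caller such as the dashboard would treat a None result as "no active experiment" and keep waiting, rather than falling back to a legacy path.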

cloudflare-workers-and-pages · 2026-04-27T15:15:05Z

Deploying zero-operators with Cloudflare Pages

Latest commit:	`c824c15`
Status:	✅ Deploy successful!
Preview URL:	https://b0d0669b.zero-operators.pages.dev
Branch Preview URL:	https://claude-recursing-einstein-73.zero-operators.pages.dev

fix(orchestrator): hard-enforce ZOTrainingCallback contract

First completed --low-token MNIST bench (2026-04-27) measured $7.75 end-to-end ($4.48 lead Sonnet + $3.27 sub-agents Sonnet, captured via npx ccusage --instances) vs. ~$11 default-mode reference = ~30% reduction, not the 70-80% projected before measurement.

Structural reason: in default mode, sub-agents are already on Sonnet via their .md frontmatter. Only the Lead Orchestrator is Opus. So --low-token only affects the lead's ~30-40% cost share. At ~5x cheaper per token on Sonnet, that's a ceiling of ~25-30% savings, exactly what we measured. The earlier 70-80% projection extrapolated the per-token rate (5x) to the whole run; the whole run was already mostly Sonnet.

Cascade across 5 surfaces:

- docs/reference/cost-benchmark.mdx: Note callout updated; "Estimated results (pre-measurement)" → "Measured results (2026-04-27)" with cost breakdown table; new "Why the savings ceiling is structural" section explaining the role-by-role token-share math; new "What --low-token actually moves" knob-by-knob attribution table; new "What would push savings higher" roadmap (Haiku for code-reviewer/test-engineer/oracle-qa, Phase-1 trim, prompt caching via SDK refactor); "Findings from the partial run" rewritten to "Findings from the first measured run" capturing override-mechanism confirmation + the watch-training contract violation discovery (PR #59); final tracking table now has the 2026-04-27 row.
- docs/concepts/low-token-mode.mdx: "Estimated savings" → "Measured savings" with honest 30% number + plan-shape caveats; "Worked example: MNIST" updated with measured cost breakdown; FAQ entry on sub-agents-on-Sonnet rewritten: they were always Sonnet, the override is defence-in-depth not a fix.
- docs/reference/low-token-preset.mdx: added Note callout at top with measured $7.75 / 30% headline + link to cost-benchmark.
- README.md: --low-token paragraph rewritten with measured number + structural-ceiling caveat + link to cost-benchmark for higher-savings roadmap.
- memory/zo-platform/STATE.md: session 025 hand-off rewritten with final $7.75 number; old 70-80% claim in the long status block marked "now refuted, measured ~30%"; benchmark line rewritten.
- memory/zo-platform/DECISION_LOG.md: new entry at 2026-04-27T16:30:00Z documenting the cascade and the rationale (PR-005 enforcement > aspiration applied to marketing claims).

Quality gates: validate-docs 10/10, pytest 725/725 + 7 skipped, ruff src/zo/ clean. No code changes: pure documentation + memory cascade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
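
The structural-ceiling argument above reduces to a short calculation, sketched here using only the figures quoted in the commit message. The function name savings_ceiling is hypothetical, and the model assumes the sub-agent share of cost is unchanged by --low-token.

```python
def savings_ceiling(lead_share: float, price_ratio: float) -> float:
    """Fraction of total run cost saved when only the lead's share gets cheaper.

    lead_share: lead orchestrator's fraction of default-mode cost.
    price_ratio: how many times cheaper the replacement model is per token.
    """
    return lead_share * (1.0 - 1.0 / price_ratio)


# Figures quoted in the commit message: ~30-40% lead share, ~5x cheaper.
low = savings_ceiling(0.30, 5.0)    # roughly 0.24
high = savings_ceiling(0.40, 5.0)   # roughly 0.32

# Measured run: $4.48 lead + $3.27 sub-agents vs. ~$11 default reference.
measured_total = 4.48 + 3.27        # 7.75
measured_savings = 1.0 - measured_total / 11.0

print(f"ceiling: {low:.0%} to {high:.0%}, measured: {measured_savings:.1%}")
```

Under this simple model the ceiling comes out at roughly 24% to 32%, which brackets the measured reduction of about 30% and is in line with the "~25-30%" quoted above.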

SamPlvs merged commit f290311 into main Apr 27, 2026
2 checks passed

SamPlvs deleted the claude/recursing-einstein-73777c branch April 27, 2026 15:19

This was referenced Apr 27, 2026

docs(low-token): replace 70-80% projection with measured ~30% reduction #60

Merged

feat(low-token): two-tier model routing + Phase 1/5 agent trims #61

Merged

chore(readme): remove MNIST/CIFAR benchmark-dataset framing #63

Merged

SamPlvs added a commit that referenced this pull request Apr 30, 2026

Merge pull request #59 from SamPlvs/claude/recursing-einstein-73777c

8e3a4bc

fix(orchestrator): hard-enforce ZOTrainingCallback contract

SamPlvs merged 1 commit into main from claude/recursing-einstein-73777c
