feat: training intelligence — checker agent, checkpoint cadence, experiment checklist (Batch B+C) by SamPlvs · Pull Request #95 · SamPlvs/zero-operators

SamPlvs · 2026-05-30T08:31:31Z

Batch B+C — Training intelligence

First of the process-hardening batches: turn training-time rules that previously had to be re-supplied every run into enforced platform behaviour (PR-005 applied to the autonomous run — aspirational agent prose → code paths that always fire).

What's in it

training-checker agent (Sonnet, phase-in): a per-model-run live monitor that the Lead spawns as training-{modelname}-checker. It tails the active experiment's metrics.jsonl/training_status.json, alerts Model Builder + Lead on NaN/Inf, divergence, gradient blow-up, overfit, dead LR, or stall so that a broken run can be killed early, and writes a mechanistic diagnosis.md + feeds next.md. Enforced via an always-fires Phase-4 instruction in Orchestrator._prompt_experiment_context rather than the plan's active-agent roster (which _agents_for_phase filters), and backed by the agent file + AGENT_PHASE_MAP.
Checkpoint cadence + disaster recovery — importable should_checkpoint(epoch, total_epochs, every=10, is_best=) (replaces the contract's previously-undefined pseudocode) + a "Checkpointing and Disaster Recovery (REQUIRED)" section in model-builder.md: DL every-10-epochs + best + last with fully-resumable state (optimizer/scheduler/AMP-scaler/epoch/RNG); ML per-fold + best + persisted HPO study state.
Research-scout general-AI track — Phase-4-iteration survey of the broad ML literature (time-series/sequence modelling, optimization, regularization), method-first, pairing with the checker on each failure mode (additive to its domain problem-class survey).
Auto-maintained experiment checklist — render_checklist/write_checklist regenerate .zo/experiments/CHECKLIST.md on every registry mutation: exp → hypothesis → metric → Δ vs parent → tier → top shortfall, + "Next planned".
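
For readers skimming the checkpoint-cadence item above, here is a minimal sketch of what a should_checkpoint helper with that signature could look like. It is not the code from this PR. It assumes 1-indexed epochs and a default of False for is_best (the description leaves that default unstated), and it reads "every-10-epochs + best + last" as three independent triggers.

```python
def should_checkpoint(epoch: int, total_epochs: int, every: int = 10,
                      is_best: bool = False) -> bool:
    """Return True when a checkpoint should be written after `epoch`.

    Assumptions (not confirmed by the PR text): epochs are 1-indexed and
    `is_best` defaults to False.
    """
    if every <= 0:
        raise ValueError("every must be a positive integer")
    if is_best:
        # Always keep the best model seen so far.
        return True
    if epoch >= total_epochs:
        # Always keep the final epoch so the run can be resumed or evaluated.
        return True
    # Periodic cadence: every `every` epochs.
    return epoch % every == 0


# Epochs that trigger a save in a 25-epoch run with no "best" events.
saved = [e for e in range(1, 26) if should_checkpoint(e, 25)]
print(saved)  # [10, 20, 25]
```

Keeping the decision in a small pure function makes the cadence testable without a training loop; what goes into the checkpoint (optimizer, scheduler, AMP scaler, epoch, RNG state) stays the caller's responsibility.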

Cascade + verification

Agent roster 20 → 21 (setup.sh, README, lead-orchestrator, specs/agents.md, plans, PRD).
+20 tests → 780 passed / 7 skipped on Python 3.11 and 3.12.
ruff src/ clean; validate-docs 0 failures (2 warnings: client-blocklist skip + the known grep-vs-pytest test-badge parameterization gap; README badge updated 743 → 780).
PR-040 added (enforce-not-aspirate; cross-refs PR-005/009/035).

Batches A (per-project self-evolution), D (optimization audit + software-engineer agent), E (idle-agent shutdown + swarm reinforcement) queued next.

🤖 Generated with Claude Code

…klist

Batch B+C of the process-hardening work. Enforces training-time behaviour the user kept re-instructing every run (PR-005 applied to the autonomous run):

- New training-checker agent (Sonnet, phase-in): per-model-run live monitor spawned as training-{modelname}-checker; alerts on NaN/divergence/overfit/stall, writes diagnosis.md + next.md. Enforced via the always-fires Phase-4 instruction in _prompt_experiment_context + AGENT_PHASE_MAP, independent of the plan's active-agent list.
- should_checkpoint() helper + model-builder "Checkpointing and Disaster Recovery (REQUIRED)" section: DL every-10-epochs + best + last (fully resumable); ML per-fold + best + persisted HPO state.
- research-scout general-AI-research track for Phase-4 iteration (time-series/sequence modelling, method-first), pairs with checker.
- Auto-maintained .zo/experiments/CHECKLIST.md via render_checklist/write_checklist, baked into every registry mutation.

Agent roster 20 -> 21 with full doc cascade. +20 tests (760 -> 780 on Python 3.11 & 3.12). ruff src/ clean, validate-docs 0 failures. PR-040 captures the enforce-not-aspirate lesson.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
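
As an illustration of the kind of check the training-checker performs on metrics.jsonl, the sketch below scans JSONL records and flags three of the failure modes named in the commit message: NaN/Inf, divergence, and stall. It is a hedged example, not the agent's actual logic. The record fields (epoch, train_loss), the function name check_metrics, and both thresholds are hypothetical, and overfit, gradient blow-up, and dead-LR detection are left out.

```python
import json
import math


def check_metrics(lines, stall_window=3, divergence_factor=3.0):
    """Scan JSONL metric lines and return a list of (epoch, alert) tuples.

    Hypothetical record shape: {"epoch": int, "train_loss": float}.
    Thresholds are illustrative defaults, not values taken from the PR.
    """
    alerts = []
    best = math.inf
    since_improved = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)  # Python's json accepts NaN/Infinity literals
        epoch, loss = rec["epoch"], rec["train_loss"]
        if not math.isfinite(loss):
            alerts.append((epoch, "nan_or_inf"))
            break  # a non-finite loss is terminal, so stop scanning
        if best < math.inf and loss > divergence_factor * best:
            alerts.append((epoch, "divergence"))
        if loss < best:
            best = loss
            since_improved = 0
        else:
            since_improved += 1
            if since_improved == stall_window:
                alerts.append((epoch, "stall"))
    return alerts


sample = [
    '{"epoch": 1, "train_loss": 1.0}',
    '{"epoch": 2, "train_loss": 0.5}',
    '{"epoch": 3, "train_loss": 2.0}',
    '{"epoch": 4, "train_loss": NaN}',
]
print(check_metrics(sample))  # [(3, 'divergence'), (4, 'nan_or_inf')]
```

In a live monitor the same function could be fed newly appended lines on each poll; the alerts would then drive the notification to Model Builder + Lead and the diagnosis.md write-up.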

cloudflare-workers-and-pages · 2026-05-30T08:31:31Z

Deploying zero-operators with Cloudflare Pages

Latest commit:	`b162b38`
Status:	✅ Deploy successful!
Preview URL:	https://bdce04af.zero-operators.pages.dev
Branch Preview URL:	https://claude-training-intelligence.zero-operators.pages.dev


SamPlvs merged commit f0a6abd into main May 30, 2026
5 checks passed

SamPlvs deleted the claude/training-intelligence branch May 30, 2026 08:46

SamPlvs merged 1 commit into main from claude/training-intelligence
