Releases: SollanSystems/loop-engineer

loop-engineer v0.6.0 — metrics real

@SollanSystems SollanSystems released this 04 Jul 01:16
dc9d9cc

"Metrics real" — false-completion-rate and repair-productivity graduate from claims to derivations, red-teamed before release.

loop metrics <loop-dir> derives FCR (false-completion rate) and RP (repair productivity) from a loop's real on-disk evidence (RUNLOG claims × verify bundles, held-out verdicts, repair records, receipts), never from agent narration. FCR is computed two independent ways, and any disagreement is surfaced rather than resolved; unmatched success claims fail closed. Every number ships with a provenance block naming its input files.
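
As a rough illustration of the "two independent methods, disagreement surfaced" idea, the sketch below computes a false-completion rate from two in-memory inputs. The `Claim` shape, the verdict mapping, and every function name here are assumptions made for illustration; they are not the schemas or API behind `loop metrics`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """A hypothetical success claim parsed from a RUNLOG (shape is assumed)."""
    task_id: str
    bundle_green: Optional[bool]  # None means no verify bundle matched the claim


def fcr_from_bundles(claims: list[Claim]) -> Optional[float]:
    """Method A: a claim counts as false unless its matching bundle is green.

    An unmatched claim (bundle_green is None) counts as false: fail closed.
    """
    if not claims:
        return None  # assumption: with zero claims the rate is undefined, not 0.0
    false_claims = sum(1 for c in claims if c.bundle_green is not True)
    return false_claims / len(claims)


def fcr_from_verdicts(claims: list[Claim], verdicts: dict[str, bool]) -> Optional[float]:
    """Method B: judge each claim by a held-out verdict keyed by task id."""
    if not claims:
        return None
    false_claims = sum(1 for c in claims if verdicts.get(c.task_id) is not True)
    return false_claims / len(claims)


def fcr_report(claims: list[Claim], verdicts: dict[str, bool]) -> dict:
    """Report both numbers and flag disagreement instead of picking a winner."""
    a = fcr_from_bundles(claims)
    b = fcr_from_verdicts(claims, verdicts)
    return {"method_a": a, "method_b": b, "agree": a == b}
```

Keeping both values in the report, rather than averaging or preferring one, is what lets a caller treat disagreement as a signal in its own right.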

loop metrics --baseline publishes docs/metrics-baseline.json and refuses to publish for any run that is not genuinely gate-backed: a structurally valid held-out verdict artifact is mandatory, rejected or unanchored repair records block it, disagreeing FCR methods block it, and a vacuous zero-claim run cannot baseline.
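
The four refusal conditions can be pictured as a single check that collects blocking reasons. This is a minimal sketch under stated assumptions: the `run` summary dict and all of its keys are hypothetical, and the real command's inputs and messages may differ.

```python
def baseline_blockers(run: dict) -> list[str]:
    """Return the reasons a run may not be published as a baseline.

    An empty list means nothing blocks publication. `run` is a hypothetical
    summary dict; every key below is an assumption made for this sketch.
    """
    blockers = []
    if not run.get("heldout_verdict_valid", False):
        blockers.append("missing or structurally invalid held-out verdict")
    if run.get("rejected_repair_records", 0) > 0:
        blockers.append("rejected or unanchored repair records")
    if run.get("fcr_method_a") != run.get("fcr_method_b"):
        blockers.append("FCR methods disagree")
    if run.get("claim_count", 0) == 0:
        blockers.append("vacuous run: zero claims")
    return blockers
```

Collecting every reason, instead of stopping at the first, makes a refusal easier to act on.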

Published baseline (gate-backed examples/coverage-repair): FCR 0.0 · RP 1.0 — reproducible with python3 -m loop metrics examples/coverage-repair; a test binds the README literals to the committed JSON.

Honesty invariants:

  • productive is recomputed from each record's own evidence, never trusted; repair records must anchor to a same-task red→green verify-bundle pair.
  • Claim semantics are outcome-class aware: completion-class claims (task_passed/succeeded/terminal) require every attached bundle green, no exceptions; progress-class (advanced) tolerates a red intermediate only if the same task goes green in a strictly later iteration.
  • Canonical record schemas (loop-engineer/repair@1, loop-engineer/rollout@1) end the two-shapes-called-"the repair record" ambiguity.
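
The outcome-class rule in the second invariant can be sketched as follows. The `Bundle` shape and the function name are assumptions for illustration, as is the choice to reject claims with no attached bundle or an unknown outcome; only the class names and the red-then-green rule come from the notes above.

```python
from dataclasses import dataclass

COMPLETION_CLASS = {"task_passed", "succeeded", "terminal"}
PROGRESS_CLASS = {"advanced"}


@dataclass(frozen=True)
class Bundle:
    """A hypothetical verify-bundle summary (field names are assumed)."""
    task_id: str
    iteration: int
    green: bool


def claim_is_backed(outcome: str, attached: list[Bundle],
                    all_bundles: list[Bundle]) -> bool:
    """Decide whether a claim is supported by its verify bundles."""
    if not attached:
        return False  # assumption: a claim with no bundle fails closed
    if outcome in COMPLETION_CLASS:
        # Completion-class: every attached bundle must be green.
        return all(b.green for b in attached)
    if outcome in PROGRESS_CLASS:
        # Progress-class: a red bundle is tolerated only if the same task
        # goes green in a strictly later iteration.
        for b in attached:
            if b.green:
                continue
            recovered = any(
                o.green and o.task_id == b.task_id and o.iteration > b.iteration
                for o in all_bundles
            )
            if not recovered:
                return False
        return True
    return False  # assumption: unknown outcome classes are not trusted
```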

Red-teamed before merge: two adversarial review rounds confirmed 17 issues in the metrics implementation — including a --baseline that would have published a clean FCR over a run its own gate flagged — all fixed and pinned as regression tests. Documented residual: a committed verdict is evidence, not proof; tamper detection belongs to the anti-cheat layer.

Also: loop console script (pip install -e .), doctor reports validated record schemas, 217-test suite (+67).

Full details in CHANGELOG.md.

loop-engineer v0.5.0 — credibility enforcement + first screen

@SollanSystems SollanSystems released this 03 Jul 18:48
58da373

The two pre-launch milestones of the v1.0 roadmap, landed together.

Enforce the wedge — false-completion defense is now enforced by validators, not asserted by docs:

  • A Succeeded terminal no longer validates with false_completion: true or an empty/false criteria_met; the inspector grades false-completion defense on invocation evidence, never a self-asserted flag (#8).
  • The held-out gate returns NotReady on an empty visible set; the anti-cheat scanner detects edits that neuter its own gate-decision functions and reports gate tampering with a distinct exit code (#8).
  • examples/coverage-repair runs end-to-end through the real held-out gate — its false_completion: false is backed by a committed gate verdict, not a hand-set flag (#9).
  • The repo's own live contract passes its own gate: python3 -m loop doctor .loop → ok (#8).
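
The first rule above, that a Succeeded terminal must not carry a false-completion flag or empty criteria, might look roughly like this as a validator. It is a sketch only: the function name and the placement of `terminal_state` as a top-level key are assumptions, while `false_completion` and `criteria_met` are the field names used in these notes.

```python
def succeeded_terminal_errors(terminal: dict) -> list[str]:
    """Return validation errors for a terminal record that claims success.

    Records in any other terminal state are not checked by this sketch.
    """
    if terminal.get("terminal_state") != "Succeeded":
        return []
    errors = []
    if terminal.get("false_completion") is True:
        errors.append("Succeeded terminal declares false_completion: true")
    if not terminal.get("criteria_met"):
        # Covers a missing key, an empty list, and an explicit false.
        errors.append("Succeeded terminal has empty or false criteria_met")
    return errors
```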

First screen — a stranger gets it in 10 seconds, scores a loop in 30:

  • README rebuilt: tagline, concrete failure modes, zero-install first command, stack diagram, comparison table (#7, #12).
  • Weak→strong demo filmed live on the real tools: docs/demo.gif — a self-asserted DIY loop scores 0/weak, the gate-backed example 90/strong (#13).
  • New loop scaffold command + JSON Schemas for the contract artifacts; promised templates shipped (#8, #11).
  • CLI polish: --help/--version, distinct errors, explicit exit codes (#10).

Plus: the v1.0 master roadmap and four strategic design specs are now committed under docs/superpowers/ (#15).

Version jumps 0.3.4 → 0.5.0 to match the roadmap's milestone numbering; there is no 0.4.x tag.

Full details in CHANGELOG.md.

v0.3.4 — dogfood-driven hardening + portable core

@SollanSystems SollanSystems released this 03 Jul 16:58
3a7c152

Dogfood-driven hardening: ran loop-inspector + loop-runtime-monitor against 9 real
on-disk loops (foreign and in-house). The tools had been built and tested only against
this suite's own well-formed loops, so first contact with foreign/edge-case inputs exposed
six defects — all fixed here under TDD, each pinned by a regression test.

Fixed

  • (P1) inspect_loop no longer crashes on a malformed manifest.yaml. read_manifest
    (loop/contract.py) ran yaml.safe_load without a guard — the one read path missing the
    json.JSONDecodeError guard every JSON read already had — so a malformed manifest in an
    untrusted/foreign loop dir killed the inspector with a traceback instead of returning a
    report. It now fails safe to {}, fixing the crash for inspect_loop, validate_contract,
    and doctor_report at once.
  • inspect_loop now scores SPEC.md / WORKFLOW.md / TASKS.json dual-location (.loop/
    ∪ workspace root), like manifest/state already resolved. Previously SPEC/WORKFLOW were
    hard-coded to the workspace root, so a loop whose contract lives under .loop/ (including
    loop-engineer's own repo) was falsely scored as having "no success criteria" / "no
    independent verification." Scores on substance, not on where the file sits.
  • inspect_loop recognizes a single-file loop-contract.md as a contract-owned source
    for success criteria, approval gates, plan-then-execute, and terminal-state coverage — a
    committed minimal-contract loop that names all 7 terminal states is no longer scored 0/7.
  • runtime_monitor is terminal-state-aware. It now reads terminal_state / state == "terminal" and reports recommendation: "done" (surfacing the terminal state) instead of
    advising continue on a loop that has already finished.
  • runtime_monitor no longer reports an unparseable RUNLOG as healthy. A non-empty
    RUNLOG that yields zero parseable iteration records now returns status: "degraded" /
    recommendation: "replan" (with evidence) instead of the benign ok/continue/[] that
    was byte-identical to a healthy loop — making the silent inertness of stall/repair-churn
    detection on prose RUNLOGs visible.
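
The two runtime_monitor fixes can be pictured with a small decision function. This is a sketch under stated assumptions: the iteration-record pattern, the function name, and the exact report keys are hypothetical; only the `degraded` / `replan` / `done` / `ok` / `continue` values come from the notes above.

```python
import re
from typing import Optional

# Assumed record shape for illustration: lines such as "iteration 3: ...".
_ITERATION_RE = re.compile(r"^iteration\s+(\d+)\b", re.IGNORECASE | re.MULTILINE)


def monitor_report(runlog_text: str, terminal_state: Optional[str] = None) -> dict:
    """Classify a loop from its RUNLOG text and optional terminal state."""
    if terminal_state is not None:
        # A finished loop should not be told to continue.
        return {"recommendation": "done", "terminal_state": terminal_state}
    iterations = _ITERATION_RE.findall(runlog_text)
    if runlog_text.strip() and not iterations:
        # Non-empty but unparseable: say so instead of looking healthy.
        return {
            "status": "degraded",
            "recommendation": "replan",
            "evidence": ["RUNLOG is non-empty but has no parseable iteration records"],
        }
    return {"status": "ok", "recommendation": "continue", "evidence": []}
```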

Changed

  • Removed the unreferenced broad-substring corpus scoring path from scripts/inspect_loop.py
    (_gather_corpus, _walk_bounded, _evaluate_checks, _terminal_states_covered) — dead
    code since the keyword-stuffing fix replaced it with the typed-contract path. Corrected
    loop-inspector/SKILL.md and reference/patterns.md §4 to describe the actual named,
    typed, dual-located contract file set the inspector reads, rather than a "reads any foreign
    harness shape semantically" claim the implementation never honored.

Added

  • pyproject.toml — the portable core is now installable with pip install -e .
    (optional pip install -e ".[yaml]" for faster manifest parsing), so
    python3 -m loop doctor|inspect <workspace> runs from any directory rather than only the
    repo root. The core stays pure-stdlib; PyYAML remains an optional extra. A new
    test_docs_version check pins the pyproject.toml version to .claude-plugin/plugin.json.
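
A version-pin check of this kind could be sketched as below. The helper names are hypothetical and the regex is a deliberate simplification (it takes the first `version = "..."` line rather than parsing TOML), chosen so the sketch stays pure-stdlib on every Python version; it is not the project's actual test.

```python
import json
import re

# Simplified: matches the first line of the form  version = "x.y.z".
_VERSION_RE = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)


def pyproject_version(pyproject_text: str) -> str:
    """Extract the version string from pyproject.toml text."""
    match = _VERSION_RE.search(pyproject_text)
    if match is None:
        raise ValueError("no version field found in pyproject text")
    return match.group(1)


def versions_match(pyproject_text: str, plugin_json_text: str) -> bool:
    """True when both files declare the same version."""
    plugin_version = json.loads(plugin_json_text)["version"]
    return pyproject_version(pyproject_text) == plugin_version
```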

Documentation

  • README: the Portable validator / inspector section documents the editable install for
    running outside the repo root; the 30-second inspect demo now shows the full
    target / present / gaps report; the doctor block notes the omitted paths object;
    validate / verify are documented as doctor aliases; terminal_state.json is noted as
    resolving in either .loop/ or the workspace root.
  • examples/coverage-repair records receipts at the canonical .loop/receipts/*.jsonl (was the
    stale pre-decoupling .gsd/audit/receipts/ path, inconsistent with the example's own .loop/
    layout).
  • loop-runtime-monitor/SKILL.md frames its position generically ("vs a loop-driving operator")
    instead of naming a private plugin agent.

Erratum (2026-06-30): the Documentation note above overstated examples/coverage-repair — the frozen example ships contract artifacts, not a receipts trail. Corrected in the CHANGELOG Errata section; as of the M2 launch slice the example is fully runnable with a committed real holdout-gate verdict.

v0.3.3 — citation fixes

@SollanSystems SollanSystems released this 03 Jul 16:58
88d12dd

Changed

  • Citation accuracy: corrected three over-reaching attributions to real sources
    (no citations removed, no IDs changed). The "A/B trigger policy / cost-benefit
    knob" and "cuts wasted edits" are reframed as this suite's own design choices
    rather than PreFlect (arXiv 2602.07187) findings — PreFlect reflects on every
    plan unconditionally and reports no edit-efficiency metric. The "repo-native
    run-ledger over a vendor eval UI" is attributed to this suite as its answer to
    the open challenge posed by Code as Agent Harness (arXiv 2605.18747), not as
    that paper's claim.

Fixed

  • Standalone scripts now resolve the loop package when run by path. The
    documented invocations python3 scripts/runtime_monitor.py <loop> and
    python3 scripts/inspect_loop.py <loop> put scripts/ on sys.path (not the
    repo root), so the sibling loop package was unimportable and the scripts
    silently used their degraded fallbacks — runtime_monitor reported
    missing RUNLOG.md on the canonical .loop/RUNLOG.md layout, and
    inspect_loop could not read plan_then_execute from .loop/manifest.yaml.
    Both scripts now self-bootstrap the repo root onto sys.path before importing
    loop.*, matching python -m loop behaviour. The bug was invisible to CI
    because python -m pytest already places the repo root on sys.path; added
    by-path subprocess regression tests that reproduce the real standalone call.
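
The self-bootstrap described above amounts to putting the parent of scripts/ on the import path before importing the package. A minimal sketch, assuming the script sits one directory below the repo root; the helper name and the injectable `path_list` parameter are illustrative, not the scripts' actual code.

```python
import sys
from pathlib import Path
from typing import Optional


def bootstrap_repo_root(script_file: str,
                        path_list: Optional[list] = None) -> str:
    """Put the repo root (the parent of scripts/) first on the import path.

    `path_list` defaults to sys.path; passing a list makes the helper testable.
    """
    target = sys.path if path_list is None else path_list
    repo_root = str(Path(script_file).resolve().parent.parent)
    if repo_root not in target:
        target.insert(0, repo_root)
    return repo_root


# In a standalone script this would run before any `import loop...` line:
# bootstrap_repo_root(__file__)
```

Running the script by path puts only its own directory on `sys.path`, which is why a sibling package next to scripts/ is otherwise invisible.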

v0.3.2 — public cut + BYO decoupling

@SollanSystems SollanSystems released this 03 Jul 16:58
6d8bbbd

Loop Contract Core plus a public open-source readiness pass: every skill now runs
on the bundled portable core with no private tooling, and the repo ships CI and
standard community files.

Added

  • Loop Contract Core. The portable loop/ package with
    python3 -m loop doctor|validate|verify|inspect, shared workspace/.loop
    path resolution, and JSON schemas for manifest@1, state@1, tasks@1, and
    terminal@1.
  • Generic receipt schema (schemas/receipt.schema.json, receipt@1) — an
    engine-neutral dispatch/cost record at .loop/receipts/*.jsonl so the flywheel,
    evals, and runtime-monitor compute routing + cost metrics without any private
    telemetry.
  • byo-default structural check (the 13th self-eval check) — fails if any
    skill depends on an unbundled tool without also naming the bundled default path.
  • Continuous integration (.github/workflows/ci.yml) — runs the frontmatter,
    self-eval, pytest, compile, JSON-validity, and quickstart-smoke gates on Python
    3.10 / 3.11 / 3.12.
  • Community files — CONTRIBUTING, CODE_OF_CONDUCT, SECURITY, and issue/PR
    templates.
  • Discoverability metadata in plugin.json (homepage, repository, keywords).

Changed

  • Bring-your-own-verifier decoupling. Skills and reference docs now default to
    the bundled gate (scripts/verify-fast → verify-full, python3 -m loop verify)
    and .loop/receipts/*.jsonl. /verify-slice, /verify-milestone, .gsd/
    receipts, model_routing.py / workflow_routing.py, Harmony, and Hermes are now
    documented as optional integrations / example realizations, never requirements.
  • Install is now claude plugin marketplace add SollanSystems/loop-engineer;
    the marketplace is renamed from loop-engineer-local to loop-engineer.
  • .claude-plugin/plugin.json version 0.3.1 → 0.3.2.

Fixed

  • scripts/inspect_loop.py now scores contract-owned artifacts instead of broad
    README/prose keyword matches; plan_then_execute: false no longer receives
    credit by substring.
  • scripts/runtime_monitor.py now resolves canonical root RUNLOG.md, returns
    structured reports for partial loop state, and avoids cross-task repair-churn
    false positives.
  • scripts/benchmark_harness.py rejects duplicate task ids before computing A/B
    metrics.
  • scripts/anticheat_scan.py flags semantic self-weakening of safety ranking or
    downgrade mapping as FailedSafety.
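
Rejecting duplicate task ids before computing metrics can be as small as the sketch below. The `task_id` key, the function name, and the choice of `ValueError` are assumptions for illustration, not the harness's actual record shape or error type.

```python
from collections import Counter


def reject_duplicate_task_ids(records: list[dict]) -> None:
    """Raise ValueError if any task id appears more than once."""
    counts = Counter(record["task_id"] for record in records)
    duplicates = sorted(task for task, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate task ids: {', '.join(duplicates)}")
```

Failing before the comparison runs avoids an A/B delta in which one task is silently counted twice.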

v0.3.1 — inspector + runtime monitor

@SollanSystems SollanSystems released this 03 Jul 16:58
1793a47

Adversarial-fix milestone. The v0.3.0 release closed two false-POSITIVE classes
in the anti-cheat scanner; a GPT-5.5/xhigh codex challenge over the v0.3.0 diff
then found the blind side — evasion paths the scanner failed to flag, plus
boundary-validation gaps in three harness scripts. This patch closes them.

Fixed

Anti-cheat scanner false-negatives (P1.1–P1.5): scripts/anticheat_scan.py

  • Scoped self-exclusion (P1.1). A scanner self-edit that empties or shrinks
    DEFAULT_GATE_PATHS / _ADDED_LINE_SIGNATURES is now graded critical
    (FailedSafety); additive and comment-only self-edits stay clean. Removed
    entries are compared semantically, so a reorder or reformat does not flag.
  • Delete + rename evidence (P1.2). parse_changed_files now also captures
    gate files that are deleted (+++ /dev/null) or renamed
    (rename from/rename to); both of Codex's exact exploit diffs now return
    clean:false.
  • verify-* gate coverage (P1.3). Gate-path matching now covers
    verify-fast / verify-full / verify-safety; tampering one to bypass it is
    flagged.
  • Broader tautology detection (P1.4). Identical-operand assertions (a literal
    or an identifier compared against itself) and always-true unittest calls now
    downgrade to FailedUnverifiable; honest asserts with distinct operands stay clean.
  • Path-shaped hidden-answer names (P1.5). Trajectory reads of held-out /
    hold_out / answer-key / golden / expected-output paths are flagged, while a
    plain assert result == expected stays clean.
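
The identical-operand case from P1.4 can be illustrated with Python's `ast` module. This is a sketch of the general technique, not the scanner's implementation: the function name is hypothetical, and it covers only a single `==` or `is` comparison between two names or two literals.

```python
import ast


def is_tautological_assert(source: str) -> bool:
    """True if the source contains an assert comparing an operand with itself.

    Only identifiers and literals are considered, so `assert f() == f()`
    is not flagged: two calls may legitimately return different values.
    """
    try:
        tree = ast.parse(source.strip())
    except SyntaxError:
        return False  # unparseable input is simply not flagged by this sketch
    for node in ast.walk(tree):
        if not isinstance(node, ast.Assert):
            continue
        test = node.test
        if not (isinstance(test, ast.Compare) and len(test.ops) == 1):
            continue
        left, right = test.left, test.comparators[0]
        if not isinstance(test.ops[0], (ast.Eq, ast.Is)):
            continue
        if not isinstance(left, (ast.Name, ast.Constant)):
            continue
        # ast.dump omits line/column attributes by default, so two
        # structurally identical operands produce equal strings.
        if ast.dump(left) == ast.dump(right):
            return True
    return False
```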

Boundary validation (P1.6, P2.1–P2.4)

  • scripts/benchmark_harness.py: compare() raises on a mismatched A/B
    task-set instead of reporting a silent delta; non-bool claimed_done /
    verification_passed and out-of-range repair / criteria counts are rejected.
  • scripts/runtime_monitor.py — robust score parsing for 1e-3, negatives, and
    malformed input (no crash); tests pin the exact intervention per scenario.
  • scripts/inspect_loop.py — bounded shallow walk with a per-file read cap
    replaces the unbounded full-tree traversal.
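
Crash-free score parsing of the kind described for runtime_monitor might be sketched like this. The function name and the decision to return `None` for malformed or non-finite input are assumptions; the notes above say only that such input no longer crashes the monitor.

```python
import math
from typing import Optional


def parse_score(raw: object) -> Optional[float]:
    """Parse a score such as "0.5", "1e-3", or "-2"; return None if unusable."""
    try:
        value = float(str(raw).strip())
    except (TypeError, ValueError):
        return None
    # Assumption: NaN and infinities are treated as malformed.
    return value if math.isfinite(value) else None
```

Returning a sentinel instead of raising leaves it to the caller to decide how a missing score affects the recommendation.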

Changed (P2.5)

  • README.md — present-tense install note corrected to "all 9 skills".
  • .claude-plugin/plugin.json — version 0.3.0 → 0.3.1.

Credits

  • The false-negative and boundary findings came from the GPT-5.5/xhigh
    codex adversarial review over the v0.3.0 release diff.