v0.41.0 — Run it anywhere, and trust what it ships

A week ago, ooo auto learned to finish the job on its own. This release makes
that autonomy something you can actually rely on: it runs on one more runtime,
it refuses to start building until the goal is unambiguous, and the verdict that
decides "is this actually done?" can no longer be gamed.

The headline

Autonomy is only worth as much as the trust behind it. v0.40.0 closed the loop —
goal in, product out. v0.41.0 spends its week hardening the two ends of that
loop and widening the floor it runs on.

- Run it anywhere. Pi joins Claude, Codex, Gemini, OpenCode, Goose, and
  Copilot as a first-class runtime. Ouroboros stays the workflow engine; the
  runtime is a swappable kernel. Installing it got more reliable, and every
  default model pin now lives in one place.
- Trust what it ships, at the input. The Socratic interview no longer thinks
  alone. At every ambiguity milestone it convenes a panel (a researcher, a
  contrarian, a simplifier) to surface hidden assumptions before the question
  reaches you. And ooo auto will not start building until the Seed is genuinely
  low-ambiguity and passes QA.
- Trust what it ships, at the output. The verifier's verdict is now typed,
  audited, and routed by an explicit admission policy. A test that really ran
  but reported the wrong evidence form is no longer smeared as "fabrication,"
  and a faked clean run still doesn't pass.

🖥️ Run it anywhere — the Agent OS gets a new kernel

Pi is now a first-class Ouroboros runtime. Ouroboros owns the workflow engine,
Seed decomposition, checkpointing, evaluation handoff, and ooo skill dispatch;
for each runtime task it shells out to pi --mode json and normalizes Pi's JSONL
events into Ouroboros AgentMessage values. As the new runtime guide puts it:
"Pi is an Ouroboros runtime" means the runtime is selectable — not that Pi is
imported into Ouroboros. That is the whole Agent OS thesis in one sentence.

- PiLLMAdapter for --llm-backend pi; pi / pi_cli registered as LLM- and
  interview-driver-capable in the backend registry and provider factory (#1326)
- Pi backend-aware default-model normalization: default --llm-backend pi uses
  Pi's own backend default instead of forwarding an Anthropic model name (#1326)
- Align the Pi runtime with documented JSON mode (#1321)
- Report malformed Pi runtime events as a typed ProviderError instead of
  failing opaquely (#1325)
- Wire the Pi runtime setup surface: ouroboros setup --runtime pi installs the
  managed Pi bridge (5c674c1)
- Opt-in native Pi CLI smoke test for end-to-end confidence (#1329)
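
To make the normalization step concrete, here is a minimal sketch of turning JSONL output into message values and surfacing a malformed line as a typed error. It is an illustration only: the AgentMessage fields, the "type" and "text" keys, and this ProviderError class are assumptions made for the example, not Pi's documented event schema or the shipped Ouroboros types.

```python
import json
from dataclasses import dataclass


class ProviderError(Exception):
    """Stand-in for a typed provider error; the real class may differ."""


@dataclass(frozen=True)
class AgentMessage:
    """Hypothetical, simplified message shape used only in this sketch."""
    kind: str
    content: str


def normalize_events(jsonl_text: str) -> list[AgentMessage]:
    """Convert JSONL text into AgentMessage values, failing loudly on bad lines."""
    messages = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between events
        try:
            event = json.loads(line)
        except json.JSONDecodeError as exc:
            # Name the offending line instead of failing opaquely.
            raise ProviderError(
                f"malformed event on line {line_no}: {exc.msg}"
            ) from exc
        if not isinstance(event, dict) or "type" not in event:
            raise ProviderError(f"event on line {line_no} has no 'type' field")
        messages.append(
            AgentMessage(kind=str(event["type"]), content=str(event.get("text", "")))
        )
    return messages
```

The point of the typed error is that a caller can catch one exception class and report which event broke, rather than chasing a stray JSON traceback.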

Installing and updating got more trustworthy. The week's two same-day releases
surfaced real install-path risk; this closes it.

- Run setup with the freshly installed ouroboros binary, not a stale one
  left on PATH (#1345)
- Installer UX improvements; pipx/pip install paths now preserve existing PATH
  precedence (#1343)

One source of truth for model pins. The same default model strings were
hand-copied across three layers, so Opus had silently been frozen at 4.6 since
February. They now live in a single _model_defaults.py.

- Centralize every default Claude model pin into one source of truth
  (_model_defaults.py) and pin exact snapshots rather than the "default"
  sentinel, so evaluation/consensus grading stays reproducible. Net move: the
  Opus reasoning tier (interview, seed, ontology, evaluation, consensus
  advocate) moves to 4.8; the Sonnet judgment tier (qa_model) is pinned at 4.6
  as claude-sonnet-4-6, replacing the retiring dated claude-sonnet-4-20250514
  (#1324, #1323)
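
As a rough picture of what a single source of truth for model pins can look like, the sketch below keeps every default in one module and has callers look pins up by role. The role keys, the default_model_for helper, and the Opus placeholder string are hypothetical; only claude-sonnet-4-6 is named in these notes, and the exact Opus 4.8 snapshot identifier is deliberately left as a placeholder.

```python
# Illustrative layout only; this is not the shipped _model_defaults.py.
from types import MappingProxyType

# Placeholder: the exact Opus 4.8 snapshot identifier is not given in these notes.
OPUS_REASONING_MODEL = "<opus-4.8-snapshot-id>"
# Named in the changelog as the new qa_model default.
SONNET_JUDGMENT_MODEL = "claude-sonnet-4-6"

# Read-only mapping so no other layer can quietly re-pin a role at runtime.
_ROLE_DEFAULTS = MappingProxyType({
    "interview": OPUS_REASONING_MODEL,
    "seed": OPUS_REASONING_MODEL,
    "ontology": OPUS_REASONING_MODEL,
    "evaluation": OPUS_REASONING_MODEL,
    "consensus_advocate": OPUS_REASONING_MODEL,
    "qa": SONNET_JUDGMENT_MODEL,
})


def default_model_for(role: str) -> str:
    """Return the pinned default for a role, or fail if the role is unknown."""
    try:
        return _ROLE_DEFAULTS[role]
    except KeyError:
        raise ValueError(f"no default model pinned for role {role!r}") from None
```

With this shape, bumping a tier is a one-line change, and a typo in a role name raises immediately instead of silently falling back to an old string.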

Roadmap, in the open. A point-in-time AgentOS issue-sequencing graph
(Track A / B / C) is now published so you can see which merged PRs resolved which
roadmap tracks. #961 remains the canonical roadmap SSOT (#1293).

Housekeeping. Prune unused optional packages (#1301); pin typer before the
vendored click to stabilize resolution (#1300).

🧠 Trust what it ships — at the input: the interview stops thinking alone

Ouroboros has always opened with a single questioner. Now that questioner has a
panel. Milestone lateral review is promoted from a non-blocking advisory to a
required lightweight subagent pass at exactly the moments hidden assumptions
start to bite.

- When an interview crosses an ambiguity milestone (initial → progress,
  progress → refined, refined → ready), the main session dispatches
  ouroboros_lateral_think with researcher, contrarian, and simplifier
  personas (adding architect when the answer changes system shape or ownership)
  before answering or asking the returned question (9d229c4)
- This is the supported "deep research style" interview experience: multiple
  perspectives visibly help, while the final prompt stays easy to answer.
  Results are folded into 2–3 concrete options or one recommended draft, not
  dumped as a report
- Lateral review also fires whenever the main session would otherwise compress
  a user's free text into a decision, or when the question is about tradeoffs,
  priorities, non-goals, risk, success criteria, or rollout
- run_lateral_review is now a declared interview capability, with per-runtime
  capability/instruction artifacts wired in (9d229c4)
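
The milestone trigger described above can be sketched in a few lines. This covers only the milestone-crossing case, not the free-text or tradeoff triggers, and the function names and the "any forward move counts as a crossing" rule are assumptions made for illustration; the milestone and persona names come from the notes.

```python
# Milestone ladder and persona names as listed in the release notes.
MILESTONES = ("initial", "progress", "refined", "ready")
BASE_PERSONAS = ("researcher", "contrarian", "simplifier")


def crossed_milestone(previous: str, current: str) -> bool:
    """True when the interview moved forward along the milestone ladder.

    Assumption: any forward move counts, including a jump over a step.
    """
    return MILESTONES.index(current) > MILESTONES.index(previous)


def review_personas(
    previous: str, current: str, changes_system_shape: bool = False
) -> tuple[str, ...]:
    """Pick the personas to dispatch, or an empty tuple if no review is due."""
    if not crossed_milestone(previous, current):
        return ()
    personas = BASE_PERSONAS
    if changes_system_shape:
        # The architect joins only when system shape or ownership changes.
        personas += ("architect",)
    return personas
```

For example, moving from progress to refined with changes_system_shape=True would select all four personas, while staying at progress selects none.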

ooo auto won't build something underspecified. The interview no longer
closes on ledger completeness alone.

- Gate auto runs on backend-confirmed low ambiguity (≤ 0.20) plus a pre-run
  Seed QA pass for both the MCP and CLI entrypoints; QA findings feed back into
  bounded Seed-repair attempts before blocking, so failures are actionable and
  resumable (#1302)
- Normalize natural worktree-policy names (e.g. create_isolated_worktree →
  always) and fail fast when complete_product=true is paired with a too-short
  timeout, instead of burning the budget in the interview and blocking late
  (#1305)
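
The gate logic above (an ambiguity threshold, then Seed QA with a bounded number of repair attempts) can be sketched as follows. Only the 0.20 threshold comes from the notes; GateResult, gate_auto_run, the callback signatures, and the default of two repair attempts are hypothetical choices for this example.

```python
from dataclasses import dataclass
from typing import Callable

AMBIGUITY_THRESHOLD = 0.20  # from the notes: low ambiguity means <= 0.20


@dataclass
class GateResult:
    allowed: bool
    reason: str
    repairs_used: int = 0


def gate_auto_run(
    ambiguity: float,
    seed: str,
    run_qa: Callable[[str], list[str]],
    repair: Callable[[str, list[str]], str],
    max_repairs: int = 2,  # assumed bound; the real limit is not stated
) -> GateResult:
    """Allow a run only if ambiguity is low and the Seed passes QA."""
    if ambiguity > AMBIGUITY_THRESHOLD:
        return GateResult(
            False, f"ambiguity {ambiguity:.2f} exceeds {AMBIGUITY_THRESHOLD:.2f}"
        )
    findings = run_qa(seed)
    repairs = 0
    # Feed QA findings back into repair, but never loop without bound.
    while findings and repairs < max_repairs:
        seed = repair(seed, findings)
        repairs += 1
        findings = run_qa(seed)
    if findings:
        # Block with the concrete findings so the failure is actionable.
        return GateResult(False, "Seed QA failing: " + "; ".join(findings), repairs)
    return GateResult(True, "ok", repairs)
```

Returning the remaining findings in the blocking reason is what makes a failed gate something a user can act on and resume from, rather than a bare refusal.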

🛡️ Trust what it ships — at the output: a verdict you can't game

The more autonomous the loop, the more its "done" has to mean done. This release
makes the verifier's decision typed, auditable, and policy-routed (RFC #814,
Verdict Envelope v1).
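
As a small illustration of what "typed and policy-routed" can mean in code, the sketch below models a verdict that carries its own admission decision and routes on that field alone. The decision names are the ones listed in this section; the dataclass layout, the failure_class field placement, and should_rerun_same_leaf are assumptions for the example, not the Verdict Envelope v1 schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Admission(Enum):
    """Decision names as listed in this release's notes."""
    ACCEPT = "ACCEPT"
    RETRY = "RETRY"
    REDISPATCH = "REDISPATCH"
    ESCALATE_MODEL = "ESCALATE_MODEL"
    ESCALATE_HUMAN = "ESCALATE_HUMAN"
    BLOCK = "BLOCK"


@dataclass(frozen=True)
class VerifierVerdict:
    """Hypothetical field layout for illustration only."""
    status: str
    evidence_refs: tuple[str, ...]
    retry_admission: Admission
    failure_class: Optional[str] = None


def should_rerun_same_leaf(verdict: VerifierVerdict) -> bool:
    """Route on the explicit admission, never on the failure class alone."""
    return verdict.retry_admission is Admission.RETRY
```

Under this shape, a verdict with failure class FABRICATION_SUSPECTED and admission REDISPATCH is not re-run on the same leaf, because the routing reads the admission the verifier chose instead of guessing policy from the failure class.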

- Promote TraceGuard verdict admission into VerifierVerdict: H1 verifier
  output now carries a typed status, evidence refs, and a retry_admission, and
  ACCEPT / RETRY / REDISPATCH / ESCALATE_MODEL / ESCALATE_HUMAN / BLOCK
  decisions are persisted on atomic typed-evidence events (#1330)
  - Benchmark fixtures: accepted → ACCEPT, missing evidence → EVIDENCE_MISSING / RETRY, semantic miss → SCOPE_CREEP / REDISPATCH, repeated fabrication → FABRICATION_SUSPECTED / ESCALATE_MODEL
- Prefer the verifier's retry-admission policy (H7): re-run the same leaf only
  when retry_admission=RETRY; honor intentional divergence between
  failure_class and retry_admission (e.g. FABRICATION_SUSPECTED +
  REDISPATCH) instead of inferring policy from the failure class alone (#1331)
- Classify masked test evidence fairly (#1292): a transcript that clearly ran
  the test command but masked its status behind an output filter (… | tail) is
  now EVIDENCE_FORM_MISMATCH (retryable, with actionable feedback such as
  adding set -o pipefail) rather than FABRICATION_SUSPECTED. The #1208 guard
  holds: unprotected output-filter pipelines still don't prove a clean
  commands_run claim. The verifier's evidence boundary is now codified in docs
  so core stays language- and runner-agnostic
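
Why a trailing output filter hides a test's status is easy to demonstrate. The sketch below is not verifier code; it only shows the underlying shell behaviour, with false standing in for a failing test command, and it assumes bash is available on PATH.

```python
import subprocess


def exit_status(script: str) -> int:
    """Run a script under bash and return its exit status."""
    return subprocess.run(["bash", "-c", script], capture_output=True).returncode


# Without pipefail, the pipeline reports the status of its last command (tail),
# so the failure of the stand-in test command is masked.
masked = exit_status("false | tail -n 5")

# With pipefail, a failing command anywhere in the pipeline makes it fail.
unmasked = exit_status("set -o pipefail; false | tail -n 5")

print(masked, unmasked)  # prints "0 1"
```

A zero exit status from the first form says nothing about whether the test passed, which is why such a transcript cannot back a clean-run claim on its own.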

What's Changed

Runtimes & Agent OS

feat(providers): add Pi LLM adapter (#1326)
fix(pi): align runtime with documented JSON mode (#1321)
fix(pi): report malformed runtime events (#1325)
fix(setup): wire Pi runtime setup surface (5c674c1)
test(orchestrator): add opt-in Pi CLI smoke test (#1329)
fix(installer): prefer freshly installed ouroboros for setup (#1345)
feat(installer): improve install script UX (#1343)
refactor(config): centralize Claude model pins into a single source of truth (align to 4.8) (#1324)
fix(config): replace retiring qa_model default with claude-sonnet-4-6 (#1323)
chore(deps): prune unused optional packages (#1301)
fix(deps): pin typer before vendored click (#1300)
fix(opencode): cover Windows cleanup review blockers (#1320)
fix(goose): keep LLM completion calls profile-free (#1303)
fix(run): guard home dir in _detect_project_root_from_seed_path (#1313)

Interview (the philosophy layer)

feat(interview): dispatch lateral review at milestones (9d229c4)
fix(auto): gate runs on low-ambiguity seed QA (#1302)
Harden ooo auto policy aliases and timeout preflight (#1305)

Verifier & harness integrity

feat(harness): promote TraceGuard verdict admission (#1330, refs #814)
fix(h7): prefer verifier retry admission policy (#1331)
fix(orchestrator): classify masked test evidence forms (#1292, refs #1234)

Docs

docs(providers): document Pi provider surfaces (#1327)
docs(runtime): fix shipped backend wording (#1332)
docs(agentos): add issue sequencing graph snapshot (#1293)
Verdict Envelope v1 RFC, verifier-evidence-policy, runtime-capability-matrix,
Pi runtime guide, and contributing/key-patterns updates

What's Changed

fix(orchestrator): classify masked test evidence forms by @Q00 in #1292
docs(agentos): add issue sequencing graph snapshot by @Q00 in #1293
fix(deps): pin typer before vendored click by @Q00 in #1300
chore(deps): prune unused optional packages by @Q00 in #1301
fix(goose): keep LLM completion calls profile-free by @mdc2122 in #1303
fix(run): guard home dir in _detect_project_root_from_seed_path by @kenlin8827 in #1313
fix(opencode): cover Windows cleanup review blockers by @shaun0927 in #1320
fix(pi): align runtime with documented JSON mode by @shaun0927 in #1321
fix(auto): gate runs on low-ambiguity seed QA by @Q00 in #1302
Harden ooo auto policy aliases and timeout preflight by @shaun0927 in #1305
fix(config): replace retiring qa_model default with claude-sonnet-4-6 by @shaun0927 in #1323
fix(pi): report malformed runtime events by @Q00 in #1325
refactor(config): centralize Claude model pins into a single source of truth (align to 4.8) by @shaun0927 in #1324
feat(providers): add Pi LLM adapter by @Q00 in #1326
feat(harness): promote TraceGuard verdict admission by @Q00 in #1330
fix(h7): prefer verifier retry admission policy by @Q00 in #1331
docs(providers): document Pi provider surfaces by @Q00 in #1327
test(orchestrator): add opt-in Pi CLI smoke test by @Q00 in #1329
docs(runtime): fix shipped backend wording by @Q00 in #1332
fix(setup): wire Pi runtime setup surface by @Q00 in #1333
feat(interview): dispatch lateral review at milestones by @Q00 in #1334
feat(installer): improve install script UX by @Q00 in #1343
fix(installer): prefer freshly installed ouroboros for setup by @Q00 in #1345

New Contributors

@kenlin8827 made their first contribution in #1313

Full Changelog: v0.40.1...v0.41.0
