Skip to content

feat: Sprint 7 — Safety Model (Phase 1)#213

Merged
mickdarling merged 18 commits intodevelopfrom
feature/sprint-7-safety-model
Feb 26, 2026
Merged

feat: Sprint 7 — Safety Model (Phase 1)#213
mickdarling merged 18 commits intodevelopfrom
feature/sprint-7-safety-model

Conversation

@mickdarling
Copy link
Copy Markdown
Contributor

Summary

Sprint 7 delivers the complete Safety Model foundation for MCP-AQL:

Files Changed

File Changes
docs/versions/v1.0.0-draft.md Sections 8.7, 8.8 (8.6 already merged)
docs/security/execution-safety-loop.md NEW — safety dongle spec (616 lines)
docs/security/gatekeeper.md Section 1.4 cross-ref + Section 7 autonomy integration
docs/adapter/danger-levels.md Section 8 Danger Zone enforcement during execution
CHANGELOG.md Entries for #205, #206, #207

Key Concepts

  • Execution Safety Loop: An opt-in monitoring pattern where the LLM reports ALL intended actions (not just MCP-AQL operations) via nextActionHint and receives AutonomyDirective go/no-go responses
  • Safety Dongle: A standalone MCP-AQL server whose sole purpose is safety enforcement — no domain functionality, just monitoring
  • AutonomyDirective: The response contract with continue, factors, stopped, reason, stepsRemaining, notifications
  • Out-of-Band Verification: Danger Zone hard stops using verification codes displayed through channels inaccessible to the AI

Test plan

  • npx markdownlint-cli2 "docs/**/*.md" — 0 errors
  • npx cspell "docs/**/*.md" "CHANGELOG.md" — 0 issues
  • node scripts/validate-schema-examples.mjs — 27/27 passed
  • Cross-reference consistency between normative and informative docs
  • Section numbering continuity in v1.0.0-draft.md
  • All new document links resolve correctly

🤖 Generated with Claude Code

mickdarling and others added 3 commits February 25, 2026 17:01
…rn (#205)

New informative document defining:
- The safety dongle deployment model — a standalone MCP-AQL server
  that functions as a universal safety layer for all tool calls
- The execution safety loop protocol (plan → report → evaluate → act)
- Minimal operation surface: 7 operations across 3 endpoints
- Permission architecture: endpoint classification determines friction
- Integration with Gatekeeper (pattern-based policy evaluation)
- Integration with Danger Zone (out-of-band verification)
- Opt-out mechanisms: enforcing, monitoring, logging, disabled modes

Also adds cross-reference from gatekeeper.md (new Section 1.4).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…206)

New normative Section 8.7 in v1.0.0-draft.md defining:
- AutonomyDirective response contract (continue, factors, stopped, notifications)
- 5-stage evaluation pipeline (step limit → outcome → patterns → safety tier → risk tolerance)
- Agent notification system (permission_pending, autonomy_pause, danger_zone)
- Configurable elements (step limits, risk tolerance, approval patterns)
- Minimum viable implementation (step limit is only MUST stage)

Also adds Section 7 (Autonomy Evaluator Integration) to gatekeeper.md
describing how Gatekeeper blocks feed into the notification system.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ement (#207)

New normative Section 8.8 in v1.0.0-draft.md defines the challenge-response
protocol for highest-risk actions during execution. Verification codes with
128-bit entropy are displayed through AI-inaccessible channels (OS dialogs,
hardware tokens, SMS), creating cryptographic proof of human approval.

Also adds Section 8 to danger-levels.md documenting how danger levels map to
safety tiers during execution and how forbidden operations trigger out-of-band
verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 4m 15s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Read changed files
  • Analyze content and check cross-references
  • Post review feedback

This is a substantial, well-architected specification PR. The safety model design is sound and the writing is clear. Below are findings across spec consistency, security, and completeness. All items are keyed to specific file locations.


Strengths

  • Clean informative/normative split. execution-safety-loop.md correctly declares itself informative and defers to the normative spec in case of conflict. Cross-references are consistent throughout.
  • CREATE placement rationale (execution-safety-loop.md:288-293) is the strongest design reasoning in the PR. The explanation of why record_execution_step must be on CREATE (not EXECUTE) to avoid collapsing the Gatekeeper permission architecture is excellent and should help implementers.
  • Section numbering is continuous. 8.6 (already merged via feat: define execution safety loop as opt-in enforcement boundary (#208) #212), 8.7, 8.8 — all correct.
  • All document links resolve. Every *.md cross-reference in the new files points to a file that exists.
  • Cooperative enforcement is acknowledged. Section 8.6.5 and the implementation note in the normative spec are honest about the LLM-compliance requirement. This is the right approach.

Issues

1. Risk score boundary inconsistency between danger-levels.md and v1.0.0-draft.md (Medium)

danger-levels.md:529-535 maps destructive (2) to risk scores 41–65 with safety tier confirm. But the normative spec at v1.0.0-draft.md:1140-1145 defines the confirm tier as 31–60 and the verify tier as 61–85.

This means destructive operations assigned scores 61–65 would actually trigger verify (mandatory pause + approval), not confirm (simple approval). The table in danger-levels.md silently misrepresents the enforcement behavior for that upper segment.

Fix: Either cap the destructive range at 60 in danger-levels.md, or update the Enforcement column to say confirm or verify to reflect the overlap. Fix this →

2. deny policy category in informative doc has no normative basis (Medium)

execution-safety-loop.md:335-360 defines three policy categories: deny, requiresApproval, autoApprove, with the resolution note deny > requiresApproval > autoApprove > default. The semantic distinction here is important — deny appears to hard-block with no confirmation path, while requiresApproval pauses for human review.

However, v1.0.0-draft.md:1118-1134 (Stage 3 — Action Pattern Matching) only defines requiresApproval and autoApprove. The deny category has no normative equivalent, so implementations have no authoritative spec to follow for this behavior.

Fix: Either add a deny / hard-block pattern category to Section 8.7.2 Stage 3 of the normative spec (with continue: false and no confirm_operation path), or remove deny from the informative doc's policy config example and clarify that requiresApproval covers both cases. Fix this →

3. confirm_operation AUTO_APPROVE creates a self-approval gap (Medium)

execution-safety-loop.md:310-314 states that confirm_operation has an AUTO_APPROVE override at the client level because it's on the EXECUTE endpoint. Since the LLM receives permission_pending notifications (Section 5.3, execution-safety-loop.md:366-381) that include the operation name, the LLM could autonomously call confirm_operation to approve its own Gatekeeper-blocked operations — without any human involvement.

gatekeeper.md:273-278 describes the intended flow ("the human approves via confirm_operation"), but neither the normative spec nor the informative doc specifies any server-side control that prevents the LLM from calling it. The AUTO_APPROVE client-side semantics remove the only friction point.

Fix: Add a requirement that confirm_operation implementations MUST validate the caller's identity or use a token scheme that cannot be self-issued by the LLM. Alternatively, add a security note explaining the assumed trust model (e.g., system prompt separation ensures the LLM is instructed not to self-confirm). Fix this →

4. v1.0.0-draft.md header date not updated (Minor)

v1.0.0-draft.md:5 reads **Last Updated:** 2026-02-18, but Sections 8.7 and 8.8 (237 lines of normative content) are being added in this PR. The date should be updated to reflect the modification. Fix this →

5. nextStepRisk field omitted from informative doc's AutonomyDirective table (Minor)

v1.0.0-draft.md:1090 includes nextStepRisk?: SafetyTier in the AutonomyDirective TypeScript interface. execution-safety-loop.md:244-251 (Section 3.4) provides a reference table of AutonomyDirective fields but omits nextStepRisk. Implementers reading only the informative doc will miss this field. Fix this →

6. Rate limit behavior after window exhaustion is undefined (Minor)

v1.0.0-draft.md:1291-1295 specifies a rate limit of 10 failed verification attempts per 60-second window. But the spec does not say what happens after the window expires — does the counter reset to zero? Is there progressive lockout for repeat rate-limit violations? Without this, implementations will diverge and some may inadvertently allow brute-force across window resets.


Unchecked test plan items

Three items from the PR checklist are unchecked. Based on manual verification:

  • Cross-reference consistency — found the issues above (risk score ranges, deny category gap). Otherwise cross-references between files are consistent.
  • Section numbering continuity — verified: 8.1–8.5 pre-existing, 8.6 from spec: Define execute loop as mandatory enforcement boundary for agent actions #208, 8.7+8.8 added here. No gaps.
  • New document links resolve — verified: all .md links in execution-safety-loop.md, gatekeeper.md, and danger-levels.md resolve to existing files.

Summary

The safety model architecture is solid and the three normative sections (8.6, 8.7, 8.8) are well-structured. The two medium-priority issues (#1 risk score boundary, #3 confirm_operation self-approval gap) should be addressed before merging. Issues #2 (deny category) and #4#6 are minor polish items.

- Fix risk score boundary mismatch: cap destructive at 60, dangerous
  starts at 61 (aligns with normative confirm=31-60, verify=61-85)
- Add `deny` pattern category to normative Stage 3 with `stopped: true`
  semantics (hard block, no confirmation path)
- Add self-approval prevention requirement: entity confirming an
  operation MUST NOT be the same agent that triggered the block
- Add `nextStepRisk` field to informative AutonomyDirective table
- Clarify rate limit window reset behavior with progressive lockout
- Update v1.0.0-draft.md Last Updated date to 2026-02-25

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 2m 48s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Read changed files
  • Check prior review findings (polish commit f8220d0 addressed all 6 prior issues)
  • Analyze content and check cross-references
  • Post review feedback

This is a well-structured Safety Model PR. The polish commit (f8220d0) cleanly resolved all six issues from the previous review. Three new findings remain — two from the polish commit itself.


Prior Review: All 6 Issues Resolved ✓

The polish commit addressed:

  • Risk score boundary (destructive capped at 60, dangerous starts at 61) ✓
  • deny pattern category added to normative Stage 3 ✓
  • Self-approval prevention requirement added to both normative and informative docs ✓
  • nextStepRisk added to AutonomyDirective table in execution-safety-loop.md
  • Rate limit window reset behavior clarified with progressive lockout option ✓
  • Last Updated date updated to 2026-02-25 ✓

New Findings

1. deny patterns omitted from the Configurable Elements table (Medium)

v1.0.0-draft.md:1122-1136 — The polish commit added deny as a first-class pattern category to Stage 3, placing it first in the resolution order (deny > requiresApproval > autoApprove). However, v1.0.0-draft.md:1202-1207 (Section 8.7.4 Configurable Elements table) still only lists requiresApproval and autoApprove. There is no row for deny.

This matters for conformance: Section 8.7.4 is the authoritative list of what adapters SHOULD make configurable. An implementer following only that table will implement requiresApproval and autoApprove but have no normative basis for the deny hard-block pattern list.

Fix: Add a deny row to the Section 8.7.4 table. Fix this →

2. Stage 2 "previous step outcome" has no corresponding API field (Medium)

v1.0.0-draft.md:1113-1116 — Stage 2 of the evaluation pipeline says:

"If the previous step's outcome was a failure, the evaluator SHOULD return continue: false."

But record_execution_step only defines nextActionHint as a parameter (see v1.0.0-draft.md:1018-1022 and execution-safety-loop.md:201-235). There is no outcome, previousStepOutcome, or stepResult field defined anywhere in the spec for the agent to report whether the previous step succeeded or failed.

Stage 2 is unimplementable as written: the server cannot evaluate whether the previous step failed without the agent reporting it. Implementations will either invent their own field names or skip Stage 2 entirely, diverging from each other.

Fix: Either define an outcome field (or equivalent) on record_execution_step — e.g., previousStepOutcome: "success" | "failure" | "skipped" — or revise Stage 2 to explain how the server is expected to obtain this information (e.g., inferred from step limit or explicitly omitted when unknown). Fix this →

3. Imprecise anchor in self-approval security note (Minor)

execution-safety-loop.md:317 — The security note added by the polish commit says:

"See Section 8.7.3 for the normative requirement."

The fragment #87-autonomy-evaluation resolves to the Section 8.7 header, not Section 8.7.3 where the Self-Approval Prevention paragraph actually lives (v1.0.0-draft.md:1196). The correct anchor is #873-agent-notification-system. Fix this →


Strengths

The three sections (8.6, 8.7, 8.8) are well-structured and internally consistent. Highlights:

  • Normative/informative split is clean. execution-safety-loop.md correctly defers to the normative spec and all cross-references resolve.
  • CREATE placement rationale (execution-safety-loop.md:288-294) is precise and well-reasoned — the "confirm to confirm" loop problem is clearly explained.
  • Self-approval prevention (new in polish commit) is correctly placed in both the normative (8.7.3) and informative docs, and the requirement is specific enough to implement.
  • Cooperative enforcement is honest. Section 8.6.5 and the normative implementation note (line 1044) accurately characterize the trust model rather than overstating guarantees.
  • Stage-based minimum viable implementation (Section 8.7.5) is a practical addition — calling Step Limit the only MUST stage is realistic for early adopters.

Summary

Finding #1 (deny missing from configurable elements) and #2 (Stage 2 missing API field) are both medium priority and should be addressed before merging. Finding #3 is a one-line link fix. The overall architecture is sound and the spec is significantly stronger after the polish commit.

- Add `deny` to Section 8.7.4 Configurable Elements table
- Define `record_execution_step` parameter table with `stepDescription`,
  `outcome`, `nextActionHint`, and `findings` fields
- Update Stage 2 to reference the `outcome` field explicitly, with
  skip behavior when omitted
- Fix imprecise anchor in execution-safety-loop.md security note
  (#87-autonomy-evaluation → #873-agent-notification-system)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 4m 29s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Read changed files and verify prior review findings
  • Analyze content and check cross-references
  • Post review feedback

This is a solid safety model PR. The latest polish commit (efa4b6e) cleanly resolved all three findings from the second review. Two new findings remain, both introduced by the outcome field addition in the polish commit.


Prior Review: All 3 Issues Resolved ✓

  • deny missing from Section 8.7.4 Configurable Elements table → added ✓
  • Stage 2 missing API field for outcomeoutcome parameter table added to record_execution_step; Stage 2 language updated to reference it directly ✓
  • Imprecise anchor #87-autonomy-evaluation → corrected to #873-agent-notification-system

New Findings

1. verify_challenge not reserved or endpoint-assigned in normative spec (Medium)

v1.0.0-draft.md:937-948 — Section 8.5.4 (Reserved Operations) lists five reserved operations, each with an explicit CRUDE endpoint:

Operation Endpoint
introspect READ
execute_agent EXECUTE
record_execution_step CREATE
complete_execution EXECUTE
abort_execution EXECUTE

Section 8.8 defines verify_challenge semantically (Lines 1273, 1289) but neither reserves the operation name in Section 8.5.4 nor assigns a CRUDE endpoint. The informative doc (execution-safety-loop.md:274) places it on CREATE (CONFIRM_SESSION), but this assignment has no normative backing.

The CRUDE endpoint assignment is conformance-relevant: it determines the default Gatekeeper permission tier. An implementer following only the normative spec will find no specification for verify_challenge's endpoint placement.

Fix: Add verify_challenge to the Section 8.5.4 reserved operations table with its CRUDE endpoint (CREATE, matching the informative doc's assignment), and note the Section 8.8 reference. Fix this →

2. Informative doc's record_execution_step section covers only nextActionHint (Minor)

execution-safety-loop.md:199-235 — Section 3.3 is titled "The nextActionHint Field" and documents only that parameter. The loop diagram at Line 155 also shows { nextActionHint } as the only payload.

The normative spec now defines a 5-parameter table for record_execution_step (element_name, stepDescription, outcome, nextActionHint, findings), two of which are SHOULD-level. An implementer consulting the informative doc for integration guidance will not discover outcome (which drives Stage 2 of the evaluation pipeline) without consulting the normative spec separately. Since Stage 2 is the fix for the prior round's "Stage 2 unimplementable" finding, discoverability matters here.

Fix: Either add a brief cross-reference in Section 3.3 pointing to the parameter table in Section 8.6.3, or rename the section to "The record_execution_step Parameters" and document all SHOULD-level fields. Fix this →


Strengths

  • Three-review arc is complete. The spec has improved meaningfully across all three passes. The original 6 issues, then 3, then these 2 — net trajectory is strong.
  • outcome parameter integration is well done. Adding it to the parameter table, updating Stage 2 to reference it directly, and noting "if outcome is omitted, this stage is skipped" (Line 1126) makes Stage 2 fully implementable with no ambiguity.
  • Rate limiting now complete. Section 8.8.5 specifies window reset behavior and introduces progressive lockout as an option — addresses the prior gap without over-prescribing.
  • Self-approval prevention is correctly placed. The normative requirement in Section 8.7.3 and the informative note in execution-safety-loop.md:315-318 are consistent and specific enough to implement.
  • deny row in Section 8.7.4 is correct and consistent with the pattern example in Stage 3.

Summary

Finding #1 (verify_challenge missing from reserved operations table) is medium priority — it leaves the normative CRUDE endpoint for a protocol-level operation unspecified. Finding #2 is a minor discoverability gap in the informative doc. Both are small, targeted fixes.

- Fix root-relative links in versioning.md (/CHANGELOG.md and
  /CONTRIBUTING.md → relative paths) that caused link-check CI failure
- Add verify_challenge and confirm_operation to Section 8.5.4 reserved
  operations table with CRUDE endpoint assignments
- Expand informative doc Section 3.3 from nextActionHint-only to full
  record_execution_step parameter overview with outcome field and
  cross-reference to normative parameter table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 3m 53s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit edb2201 addressed all 2 prior issues)
  • Analyze content and check cross-references
  • Post review feedback

This is the fourth review pass on this PR. The polish commit (edb2201) cleanly resolved both findings from the third review. Two new findings remain.


Prior Review: Both Issues Resolved ✓

  • verify_challenge (and confirm_operation) added to Section 8.5.4 reserved operations table with CRUDE endpoint assignments ✓
  • Section 3.3 expanded from nextActionHint-only to full record_execution_step parameter overview with outcome field, cross-reference to normative table, and Stage 2 role explanation ✓
  • versioning.md root-relative links fixed (link-check CI unblocked) ✓

New Findings

1. Stage 3 deny block and Section 8.8 Danger Zone protocol are not cross-referenced (Medium)

v1.0.0-draft.md:1135 — Stage 3 says:

deny patterns → continue: false, stopped: true (hard block, no confirmation path)

v1.0.0-draft.md:1247-1248 — Section 8.8.1 says out-of-band verification is triggered when:

The nextActionHint matches a hard-block pattern configured in the adapter

These two clauses describe the same mechanism from opposite ends — Stage 3 sets stopped: true for deny matches, and Section 8.8.1 says hard-block pattern matches trigger the full Danger Zone protocol (challenge generation, out-of-band display, persistent agent-level block). But neither references the other, leaving implementers to independently connect the two.

The consequences of this gap:

  1. An implementer reading Stage 3 alone will return stopped: true without knowing to also generate a verification challenge (Section 8.8.2) and display it through an AI-inaccessible channel.
  2. The parenthetical "(hard block, no confirmation path)" is technically inaccurate — verify_challenge IS the unblocking path. The intended meaning appears to be "no confirm_operation path," but this is ambiguous.
  3. Section 8.8.1's second bullet has no pointer back to Stage 3, so an implementer reading Section 8.8 won't immediately find where in the pipeline deny blocks originate.

Fix: At Stage 3, replace "(hard block, no confirmation path)" with "(triggers out-of-band verification protocol — see Section 8.8)". In Section 8.8.1, add a parenthetical to the second bullet: "matches a hard-block (deny) pattern configured in the adapter (Section 8.7.2, Stage 3)". Fix this →

2. Loop diagram payload unchanged after outcome field was added (Minor)

execution-safety-loop.md:155 — The ASCII protocol diagram still shows:

│  │  Report  │  record_execution_step│
│  │  intent  │  { nextActionHint }   │

Section 3.3 was correctly expanded to document outcome as a SHOULD-level field that drives Stage 2, but the diagram was not updated. Readers who scan diagrams before reading prose will not discover outcome from the diagram alone.

Fix: Update the diagram line to { nextActionHint, outcome? } to reflect that both fields form the standard payload. Fix this →


Strengths

The four-review arc has produced a significantly stronger spec:

  • All 9 issues across three prior reviews resolved. The polish commits addressed each finding precisely without over-engineering.
  • Reserved operations table is now complete. All 7 protocol operations (introspect, execute_agent, record_execution_step, complete_execution, abort_execution, confirm_operation, verify_challenge) have normative CRUDE endpoint assignments.
  • Stage 2 is now fully implementable. The outcome parameter with its three valid values and the "if omitted, skip" clause gives implementers everything they need.
  • deny is a first-class configurable element. The Section 8.7.4 table row for deny is correct, and Stage 3's resolution order is unambiguous.
  • Rate limiting is complete. Section 8.8.5 specifies window reset behavior with progressive lockout as an option — no implementation divergence possible.

Summary

Finding #1 (Stage 3 / Section 8.8 cross-reference gap) is medium priority — without it, an implementer reading Stage 3 alone will not know to trigger the full out-of-band verification protocol for deny pattern matches. Finding #2 is a minor diagram cosmetic. Both are small, targeted fixes.

- Stage 3 deny now references Section 8.8 out-of-band verification
- Section 8.8.1 hard-block trigger now references Stage 3 deny patterns
- Loop diagram updated to show { nextActionHint, outcome? } payload

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 5m 26s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit e6bf157 addressed both 4th-review issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the fifth review pass on this PR. The polish commit (e6bf157) cleanly resolved both findings from the fourth review. Two new findings remain.


Prior Review: Both Issues Resolved ✓

  • Stage 3 deny → Section 8.8 cross-reference: replaced "(hard block, no confirmation path)" with "(triggers out-of-band verification — Section 8.8)" ✓
  • Section 8.8.1 trigger condition → Stage 3 cross-reference: added "(Section 8.7.2, Stage 3)" ✓
  • Loop diagram updated from { nextActionHint } to { nextActionHint, outcome? } ✓ (see new finding spec: Document introspection system #2 below)

New Findings

1. verify safety tier's "verification challenge" protocol is undefined (Medium)

v1.0.0-draft.md:1159 — Stage 4's safety tier table says:

| 61-85 | `verify` | Pause + create verification challenge |

The danger_zone row (86-100) explicitly references Section 8.8. The verify row does not. And Section 8.8.1's trigger conditions only enumerate two cases — danger_zone safety tier and deny pattern matches — neither of which is the verify tier. The verify_challenge reserved operation (Section 8.5.4) is also scoped exclusively to "Danger Zone unblocking (Section 8.8)."

The result: the verify tier says "create verification challenge" but provides no implementation guidance:

  1. No challenge creation or display protocol is defined for this tier
  2. No operation is specified for the human to submit the response (it's not verify_challenge, and confirm_operation covers Gatekeeper blocks, not Autonomy Evaluator pauses)
  3. The informative doc's Section 6.2 (execution-safety-loop.md:418-428) explains only the danger_zone hard stop — not what happens at verify tier

Implementers will either skip verify enforcement entirely, reuse the Section 8.8 protocol (which may or may not be intended), or invent their own mechanism — diverging from each other with no normative ground truth.

Fix: Either add a subsection to Section 8.7 (or 8.8) defining the verify tier challenge protocol — including how the challenge is delivered and what operation the human uses to respond — or revise the Stage 4 table row to clarify that verify tier uses the same verify_challenge mechanism as Section 8.8 (with a cross-reference), if that is the intended behavior. Fix this →

2. Loop diagram box misaligned by 1 character after outcome split (Minor)

execution-safety-loop.md:155-156 — The update to the loop diagram split the original single-line payload { nextActionHint } │ into two lines. The outer box border in all other lines is consistently at column 39. After the split, both new lines have their trailing at column 38:

│  │  intent  │  { nextActionHint,   │   ← column 38 (off by 1)
│  │          │    outcome? }        │   ← column 38 (off by 1)

Fix: Add one trailing space before the on each of the two lines to restore alignment with the outer box. Fix this →


Strengths

The five-review arc has produced a significantly stronger spec:

  • 10 of 11 issues across four prior passes resolved — the polish commits have been precise and targeted without over-engineering adjacent sections.
  • Stage 3 / Section 8.8 cross-references are now bidirectional. An implementer reading either section will be directed to the other.
  • verify_challenge is now a reserved operation with normative CRUDE endpoint assignment. Section 8.5.4's reserved operations table is complete for all 7 protocol operations.
  • The outcome field is fully implementable. Parameter table, Stage 2 update, and "if omitted, skip" clause give implementers everything they need.
  • Rate limiting is complete. Progressive lockout semantics are defined. No implementation divergence path remains.
  • Self-approval prevention is normative and specific. Server-side controls are enumerated. The informative doc cross-references correctly.

Summary

Finding #1 (verify tier challenge protocol undefined) is medium priority — the phrase "create verification challenge" in Stage 4 is unimplementable as written, and implementers will diverge without a normative reference. Finding #2 is a cosmetic one-character alignment fix introduced by the 4th-pass diagram update.

- Clarify that verify and danger_zone tiers both use the Section 8.8
  out-of-band verification protocol; verify pauses (continue: false),
  danger_zone hard-blocks (stopped: true)
- Update Section 8.8.1 trigger conditions to include verify tier
- Broaden verify_challenge reserved operation description to cover
  both tiers
- Fix 1-character alignment in ASCII loop diagram

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 3m 47s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit bec923b addressed both 5th-review issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the sixth review pass on this PR. The polish commit (bec923b) cleanly resolved both findings from the fifth review. Three new findings remain, all introduced by adding verify tier to Section 8.8's trigger conditions.


Prior Review: Both Issues Resolved ✓

  • verify tier challenge protocol undefined → Stage 4 table updated to reference verify_challenge and Section 8.8, explanatory paragraph added distinguishing pause vs. hard-block semantics, Section 8.8.1 trigger conditions expanded to include verify
  • Loop diagram box misaligned by 1 character → trailing spaces added to both new lines, restoring outer box border to column 39 ✓

New Findings

1. Section 8.8 intro paragraph contradicts Section 8.8.1 trigger conditions (Medium)

v1.0.0-draft.md:1243 — The section header paragraph reads:

"When the Autonomy Evaluator assigns a danger_zone safety tier (Section 8.7.2, Stage 4), the system MUST escalate beyond a simple pause."

But v1.0.0-draft.md:1249 (two lines later) now says:

"Out-of-band verification is triggered when: The safety tier evaluation assigns verify (risk score 61-85) or danger_zone (risk score 86-100)"

The intro and its own subsection directly contradict each other. An implementer reading the intro will conclude Section 8.8 is only relevant to danger_zone, then be surprised by Section 8.8.1.

Fix: Replace "assigns a danger_zone safety tier" in line 1243 with "assigns a verify or danger_zone safety tier", and update the sentence to reflect that verify is a pause (not a hard stop), e.g.: "When the Autonomy Evaluator assigns a verify or danger_zone safety tier, the system MUST require out-of-band human verification before the agent may proceed." Fix this →

2. Section 8.8.3 "Blocking Semantics" doesn't cover verify tier pause behavior (Minor)

v1.0.0-draft.md:1286-1295 — Section 8.8.3 opens "When an agent is blocked by a Danger Zone event:" and exclusively covers stopped: true hard-block semantics. But Section 8.8 now covers two distinct enforcement flows: the verify pause (continue: false, not permanently blocked) and the danger_zone/deny block (stopped: true).

The verify case has no server-side semantics defined: if a verify-paused agent calls record_execution_step again while a challenge is pending, what does the server return? Section 8.8.1 says the agent "is paused but not permanently blocked," but there is no spec text defining what happens during that pause window. Implementations will diverge — some will return a fresh evaluation (potentially continue: true if context shifted), others will return a static continue: false with the pending challenge ID, others may escalate to a full block.

Fix: Add a sentence to Section 8.8.1 (or to Section 8.8.3 under a new heading) defining pause-window behavior: e.g., "While a verify challenge is pending, subsequent record_execution_step calls for the same agent SHOULD return continue: false with the pending challengeId in the reason field, until the challenge is completed or expires." Fix this →

3. Informative doc Section 6.2 not updated for verify tier (Minor)

execution-safety-loop.md:418-428 — Section 6.2 is titled "Hard Stops and Out-of-Band Verification" and describes a 5-step flow that exclusively applies to danger_zone hard stops (step 1: stopped: true, step 4: "agent is blocked"). After verify was added as a Section 8.8 trigger, Section 6.2 now presents an incomplete view of when out-of-band verification is invoked — it has no mention of the softer verify tier flow.

Additionally, Section 6.1's table (line 415) still says verify tier behavior as "Pause + create verification challenge" without naming verify_challenge or citing Section 8.8 — less specific than the normative Stage 4 table, which now reads "via verify_challenge (Section 8.8)".

Fix: Rename Section 6.2 to "Hard Stops, Pauses, and Out-of-Band Verification" and add a brief second paragraph after the 5-step list describing the verify tier flow (steps 1-4 simplified: continue: false, challenge generated, displayed out-of-band, agent remains pausable until verification). Also update the Section 6.1 table's verify row to reference verify_challenge. Fix this →


Strengths

The six-review arc has produced a significantly stronger spec:

  • 11 of 13 issues across five prior rounds resolved — the polish commits have been precise and targeted without over-engineering adjacent sections.
  • verify tier is now normatively connected to Section 8.8. The Stage 4 table, the explanatory paragraph (line 1162), and the Section 8.8.1 trigger conditions are all updated and consistent with each other. The remaining inconsistency is the intro paragraph only.
  • Loop diagram alignment is correct. Both lines of the split payload are now at column 39 — verified by comparing char counts in the diff.
  • All prior medium issues resolved. Risk score ranges, deny category, self-approval prevention, outcome field, verify_challenge as a reserved operation, Stage 3 / Section 8.8 cross-references — all correctly specified.

Summary

Finding #1 (Section 8.8 intro text contradicting 8.8.1) is medium priority — the intro is the first text an implementer reads, and it misstates the scope of the section. Findings #2 and #3 are minor — server pause-window behavior for verify tier, and the informative doc not reflecting the verify trigger addition. All three are narrow, targeted fixes.

- Fix Section 8.8 intro to include verify tier (was danger_zone only)
- Define verify pause-window behavior: subsequent record_execution_step
  calls return continue:false with pending challenge ID; verify pauses
  do not persist across restarts or block new executions
- Update informative doc Section 6.1 table to reference verify_challenge
- Rewrite Section 6.2 to cover both verify pause and danger_zone hard
  stop flows side by side

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 5m 45s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit cf26c1b addressed all 3 prior issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the seventh review pass on this PR. The polish commit (cf26c1b) cleanly resolved all three findings from the sixth review. Two new findings remain, both introduced by the verify tier addition to Section 8.8's scope.


Prior Review: All 3 Issues Resolved ✓

  • Section 8.8 intro contradicting Section 8.8.1 → intro now reads "assigns a verify or danger_zone safety tier" with a sentence distinguishing pause vs. hard-block semantics ✓
  • Section 8.8.3 missing verify tier pause-window behavior → paragraph added at line 1297 defining that subsequent record_execution_step calls return continue: false with pending challenge ID in reason field until challenge completes or expires ✓
  • Section 6.2 and Section 6.1 table not updated for verify tier → Section 6.2 now covers both hard-stop and pause flows, Section 6.1 table references verify_challenge

New Findings

1. Initial verify tier response has no defined transport for the challenge ID (Medium)

v1.0.0-draft.md:1273 — Section 8.8.2 step 2 states:

"The AI agent receives only the challenge ID and a human-readable prompt — never the code itself"

This is a requirement. The agent MUST receive the challenge ID. But no field in the initial AutonomyDirective or autonomy_pause notification is specified to carry it for verify tier:

  • AgentNotification.metadata.verificationId (line 1190) is explicitly commented // For danger_zone blocks — not verify
  • Section 8.7.3 line 1207 confirms: danger_zone notification "includes the verificationId for the affected agent" — autonomy_pause has no equivalent
  • Section 8.8.3 (line 1297) says subsequent calls SHOULD return continue: false with the challenge ID in reason — but "subsequent" implies this does not apply to the initial triggering response

The result: an implementer has no normative field to put the challenge ID in for the first AutonomyDirective of a verify tier pause. Different implementations will use reason, a custom challengeId field, or an autonomy_pause metadata field — all undocumented by the spec.

Fix: Add a bullet to Section 8.8.2 step 2 (or step 1) specifying that the initial AutonomyDirective for a verify trigger MUST include the challenge ID in the reason field. Alternatively, update Section 8.7.3 to note that autonomy_pause notifications generated by verify tier challenges SHOULD include verificationId in their metadata (analogous to danger_zone notifications). Fix this →

2. confirm_operation is scoped to Gatekeeper in reserved operations table but invoked for Autonomy Evaluator confirm tier in Stage 4 (Medium)

v1.0.0-draft.md:1158 — Stage 4's confirm tier row reads:

"31-60 | confirm | Pause for human review via confirm_operation"

This implies humans call confirm_operation to unblock an Autonomy Evaluator confirm tier pause. But:

  • Reserved operations table (line 948): confirm_operation purpose is "Confirm a Gatekeeper-blocked operation (Section 8.7.3)" — scoped to Gatekeeper
  • Section 8.7.3 notification sources (line 1205): permission_pending is "Cleared when confirm_operation is called" — Gatekeeper notifications only
  • autonomy_pause clearing mechanism (line 1206): "Generated when the Autonomy Evaluator returns continue: false. Provides the reason and contributing factors." — no clearing operation defined

An implementer building confirm_operation from the reserved operations table and Section 8.7.3 will implement it as a Gatekeeper-only operation. They will have no normative basis for also using it to unblock Autonomy Evaluator confirm tier pauses. The Stage 4 table's "via confirm_operation" reference then becomes unimplementable — autonomy_pause notifications have no defined clearing mechanism.

Fix: Either (a) update the confirm_operation entry in Section 8.5.4 to read "Confirm a Gatekeeper-blocked operation or Autonomy Evaluator confirm tier pause" and update Section 8.7.3 to add that autonomy_pause notifications are also cleared by confirm_operation; or (b) remove "via confirm_operation" from the Stage 4 confirm row and clarify the unblock mechanism for Autonomy Evaluator pauses separately. Fix this →


Strengths

The seven-review arc has produced a substantially stronger spec:

  • 12 of 14 issues across six prior rounds resolved. Each polish commit addressed findings precisely without over-engineering.
  • Section 8.8's verify / danger_zone split is now clear and consistent. The intro, trigger conditions (8.8.1), challenge protocol (8.8.2), blocking semantics (8.8.3), and the informative doc (Section 6.2) all describe the two tiers accurately and without contradiction.
  • Stage 3 and Section 8.8 are bidirectionally cross-referenced. An implementer reading either section finds the other.
  • All 7 reserved operations have normative CRUDE endpoint assignments. The reserved operations table (8.5.4) is complete.
  • Stage 2 outcome field is fully implementable. Parameter table, "if omitted, skip" clause, and Stage 2 prose are consistent.
  • autonomy_pause clearing for danger_zone notifications correctly uses verify_challenge. Section 8.7.3 line 1207 and 8.8.3 are aligned.

Summary

Finding #1 (challenge ID transport for initial verify tier response) and Finding #2 (confirm_operation dual-scope gap) are both medium priority. Together they mean: when a verify tier pause is triggered, neither the transport of the challenge ID to the agent nor the mechanism for the human to clear the unrelated confirm tier pause is fully specified by the normative text. Both are narrow, targeted fixes.

- verificationId metadata field now covers both verify and danger_zone
- autonomy_pause notifications for verify tier SHOULD include
  verificationId; for confirm tier, cleared by confirm_operation
- Initial AutonomyDirective MUST include challenge ID via notification
  (autonomy_pause for verify, danger_zone for danger_zone tier)
- Broaden confirm_operation reserved ops entry to include Autonomy
  Evaluator confirm tier pauses alongside Gatekeeper blocks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 3m 23s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit 6291c26 addressed both 7th-review issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the eighth review pass on this PR. The polish commit (6291c26) cleanly resolved both findings from the seventh review. Two new findings remain, both introduced by the challenge ID transport addition in the latest commit.


Prior Review: Both Issues Resolved ✓

  • Challenge ID transport for initial verify tier response → bullet added to Section 8.8.2 step 2 specifying MUST transport via autonomy_pause notification with metadata.verificationId; verificationId comment in AgentNotification interface updated; Section 8.7.3 autonomy_pause description updated ✓
  • confirm_operation dual-scope gap → Section 8.5.4 reserved operations table updated to read "Confirm a Gatekeeper-blocked operation or Autonomy Evaluator confirm tier pause"; Section 8.7.3 autonomy_pause description updated with confirm_operation clearing semantics for confirm tier ✓

New Findings

1. SHOULD vs MUST normative conflict for verificationId in verify tier notifications (Medium)

The polish commit added a transport requirement in two locations, but used different normative strengths:

  • v1.0.0-draft.md:1206 (Section 8.7.3): "For verify tier pauses, SHOULD include verificationId in metadata"
  • v1.0.0-draft.md:1274 (Section 8.8.2, step 2): "The initial AutonomyDirective MUST include the challenge ID via a notification: autonomy_pause with metadata.verificationId for verify tier"

These directly conflict. An implementation that follows the SHOULD (i.e., skips verificationId sometimes) would violate the MUST in 8.8.2. Since the agent cannot call verify_challenge without the challenge ID, MUST is clearly the correct strength.

Fix: Update v1.0.0-draft.md:1206 to replace "SHOULD include verificationId in metadata" with "MUST include verificationId in metadata" for verify tier pauses, to align with the MUST in 8.8.2. Fix this →

2. Informative doc notification actions table not updated for confirm tier clearing (Minor)

execution-safety-loop.md:397-399 — The notification actions table reads:

Type Trigger Action Required
permission_pending Gatekeeper blocked an operation Call confirm_operation to approve
autonomy_pause Autonomy Evaluator returned continue: false Human review needed
danger_zone Danger Zone hard block triggered Out-of-band verification required (Section 6)

The normative fix in Section 8.7.3 now specifies that confirm tier autonomy_pause notifications are cleared by confirm_operation, matching the pattern established for permission_pending. But the informative doc's table still says "Human review needed" for autonomy_pause — it doesn't distinguish between confirm tier (where confirm_operation unblocks) and verify tier (where verify_challenge is used). An integrator reading only the informative doc's table will not discover that confirm_operation is the correct clearing call for confirm tier pauses.

Fix: Update the autonomy_pause row's "Action Required" column to read: "For confirm tier: call confirm_operation; for verify tier: out-of-band verify_challenge (Section 6)". Fix this →


Strengths

The eight-review arc has produced a substantially stronger spec:

  • 14 of 16 issues across seven prior rounds resolved. Each polish commit addressed findings precisely without over-engineering.
  • Challenge ID transport is now fully specified for both tiers. autonomy_pause with metadata.verificationId for verify; danger_zone notification with metadata.verificationId for danger_zone. The agent-side contract is now complete.
  • confirm_operation dual-scope is resolved. The reserved operations table and Section 8.7.3 are aligned — confirm_operation is documented as clearing both Gatekeeper permission_pending blocks and Autonomy Evaluator confirm tier pauses.
  • All 7 reserved operations have normative CRUDE endpoint assignments and purposes. The Section 8.5.4 table is complete.
  • Section 8.8 intro, trigger conditions, blocking semantics, and informative Section 6.2 are internally consistent for both verify and danger_zone tiers.

Summary

Finding #1 (SHOULD vs MUST conflict for verificationId in 8.7.3 vs 8.8.2) is medium priority — it gives implementations a normative out to omit the challenge ID despite it being required by 8.8.2, undermining the transport guarantee the prior round's fix was meant to establish. Finding #2 is a one-line table update in the informative doc. Both are small, targeted fixes.

- Fix SHOULD/MUST conflict: autonomy_pause for verify tier MUST
  include verificationId (aligns with Section 8.8.2 MUST requirement)
- Update informative doc notification table to distinguish confirm
  tier (confirm_operation) from verify tier (verify_challenge) actions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 3m 13s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit 366075f addressed both 8th-review issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the ninth review pass on this PR. The polish commit (366075f) cleanly resolved both findings from the eighth review. Two new findings remain, both in the informative doc — legacy descriptions that predate the scope expansions made during the 7th-round polish commit.


Prior Review: Both Issues Resolved ✓

  • SHOULD vs MUST conflict for verificationId (8.7.3 vs 8.8.2) → v1.0.0-draft.md:1206 updated from SHOULD to MUST ✓
  • Informative doc notification actions table outdated → autonomy_pause row now reads "For confirm tier: call confirm_operation; for verify tier: out-of-band verify_challenge (Section 6)" ✓

New Findings

1. Section 4.1 operations table has outdated purpose descriptions for two operations (Medium)

execution-safety-loop.md:279-280 — The Section 4.1 "Required Operations" table describes:

Operation Purpose shown
confirm_operation Approve pending Gatekeeper blocks
verify_challenge Submit verification code for Danger Zone unblocking

Both of these are now outdated. The 7th-round polish commit expanded the scope of both operations:

  • confirm_operation now covers both Gatekeeper blocks and Autonomy Evaluator confirm tier pauses. The normative spec at v1.0.0-draft.md:948 reads: "Confirm a Gatekeeper-blocked operation or Autonomy Evaluator confirm tier pause (Section 8.7.3)".
  • verify_challenge now unblocks both verify tier pauses and Danger Zone hard stops (both are danger_zone and verify tier, via Section 8.8). The normative spec at v1.0.0-draft.md:949 reads: "Submit out-of-band verification code (Section 8.8)".

An implementer reading Section 4.1 will conclude confirm_operation only matters for Gatekeeper blocks, and will have no basis for understanding it's also the unblocking mechanism for confirm tier Autonomy Evaluator pauses.

Fix: Update the Section 4.1 table's Purpose column:

  • confirm_operation: "Confirm a Gatekeeper-blocked operation or Autonomy Evaluator confirm tier pause"
  • verify_challenge: "Submit out-of-band verification code for verify tier pauses or Danger Zone unblocking" Fix this →

2. Three "Danger Zone only" scope references not updated for verify tier (Minor)

The 7th-round polish commit expanded verify_challenge to cover both verify tier and danger_zone tier. Three locations in execution-safety-loop.md still use "Danger Zone only" language:

  • Line 596 (Section 9.2): "SHOULD implement verify_challenge for Danger Zone unblocking" — should include verify tier
  • Line 610 (Section 9.3): "MAY support out-of-band verification for Danger Zone events" — should include verify tier
  • Line 619 (Related Specifications): Link description "Danger Zone hard stop protocol" for Section 8.8 — Section 8.8 now also covers verify tier pauses

Fix: Update line 596 to "for Danger Zone and verify tier unblocking", line 610 to "for verify tier pauses and Danger Zone events", and line 619's link description to "Out-of-band verification protocol (verify tier and Danger Zone)". Fix this →


Strengths

The nine-review arc continues to converge:

  • 16 of 18 issues across eight prior rounds resolved. Each polish commit addressed findings precisely without over-engineering.
  • Notification table is now complete and accurate. autonomy_pause now correctly distinguishes confirm tier (call confirm_operation) from verify tier (out-of-band verify_challenge).
  • MUST/SHOULD strength is consistent. verificationId is now MUST in both 8.7.3 and 8.8.2 — no implementation can lawfully omit the challenge ID.
  • Section 6.2 covers both enforcement tiers. The hard-stop and pause flows are documented separately with accurate semantics.
  • Self-approval prevention, rate limiting, reserved operations table, Stage 2 outcome field, Stage 3/8.8 cross-references — all previously resolved and still correct.

Summary

Finding #1 (Section 4.1 operation table descriptions) is medium priority — it's the canonical quick-reference for what a safety dongle implements, and both confirm_operation and verify_challenge have narrower descriptions than their actual normative scope. Finding #2 is a minor three-line update to compliance requirements and a link description. Both are small, targeted fixes.

Systematic update of all locations in execution-safety-loop.md that
narrowly scoped confirm_operation to "Gatekeeper only" or
verify_challenge to "Danger Zone only" after normative scope was
broadened in prior commits:

- Section 4.1 operation table: both operation purposes updated
- Security note: self-approval warning now covers both block types
- Section 9.2 SHOULD requirements: both operations broadened
- Section 9.3 MAY requirements: OOB verification includes verify tier
- Related Specifications: Section 8.8 link description updated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 3m 55s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit 6d7f317 addressed both 9th-review issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the tenth review pass on this PR. The polish commit (6d7f317) cleanly resolved both findings from the ninth review. Two new findings remain.


Prior Review: Both Issues Resolved ✓

  • Section 4.1 operations table outdated descriptions → confirm_operation now reads "Confirm Gatekeeper blocks or Autonomy Evaluator confirm tier pauses"; verify_challenge now reads "Submit out-of-band verification code (verify tier and Danger Zone)" ✓
  • Three "Danger Zone only" scope references → Line 596 (SHOULD requirement), line 610 (MAY requirement), and line 619 (Related Specifications link description) all updated to include verify tier alongside Danger Zone ✓
  • Security note in Section 4.3 → updated to cover both Gatekeeper blocks and confirm tier pauses ✓

New Findings

1. Section 8.8.3 "Danger Zone event" heading doesn't explicitly cover deny pattern match blocks (Medium)

v1.0.0-draft.md:1289 — Section 8.8.3 opens:

"When an agent is blocked by a Danger Zone event:"

The bullet list that follows imposes restart-persistence and rejection-of-execution-operations requirements on the blocked agent. But deny pattern matches from Stage 3 (v1.0.0-draft.md:1135) also produce stopped: true and cross-reference Section 8.8 — an implementer building deny pattern handling will arrive at Section 8.8.3 and encounter "blocked by a Danger Zone event."

"Danger Zone event" is not a defined term in the spec. An implementer may reasonably interpret it as meaning only the danger_zone safety tier (risk score 86-100), not deny pattern matches — and conclude that deny-blocked agents do not need restart-persistent blocking. The informative doc (execution-safety-loop.md:422) is explicit: "deny pattern match" produces a hard stop with persistence, identical to danger_zone tier. But the normative spec's Section 8.8.3 header is the authoritative statement of blocking semantics, and it doesn't include deny pattern matches in its scope definition.

v1.0.0-draft.md:1251 helps — Section 8.8.1 says "For danger_zone and deny pattern triggers, the AutonomyDirective MUST return stopped: true" — but the connection from stopped: true → 8.8.3 blocking semantics still passes through the ambiguous "Danger Zone event" heading.

Fix: Replace "blocked by a Danger Zone event" in v1.0.0-draft.md:1289 with "blocked by a hard-block trigger (a danger_zone safety tier assignment or a deny pattern match)" — or more concisely, "blocked by any event that sets stopped: true". Fix this →

2. autonomy_pause clearing condition for verify tier not defined in Section 8.7.3 (Minor)

v1.0.0-draft.md:1205-1206 — The notification sources description defines clearing conditions for two of three notification-generating scenarios but omits the third:

  • permission_pending: "Cleared when confirm_operation is called for the blocked operation" ✓
  • confirm tier autonomy_pause: "cleared when confirm_operation is called for the paused action" ✓
  • verify tier autonomy_pause: "MUST include verificationId in metadata" — no clearing condition defined

The autonomy_pause notification system is pull-based (Section 8.7.3 last paragraph): notifications accumulate between record_execution_step calls. Without a defined clearing condition for verify tier notifications, implementations will diverge on when to stop including a verify tier autonomy_pause in subsequent responses. Some may clear it immediately on verify_challenge success; others may continue surfacing it until the execution completes; others may keep it until expiration.

Fix: Append "For verify tier pauses, cleared when verify_challenge succeeds for the associated challenge ID" to the autonomy_pause notification description at v1.0.0-draft.md:1206. Fix this →


Strengths

The ten-review arc has produced a substantially stronger spec:

  • 18 of 20 issues across nine prior rounds resolved. Each polish commit addressed findings precisely without over-engineering.
  • All 9th-round scope expansions are complete. The Section 4.1 operations table, Section 9.2/9.3 implementation requirements, security note, and Related Specifications link now all correctly reflect the dual scope of confirm_operation and verify_challenge.
  • verify tier is fully integrated. Stage 4 table, Section 8.8.1 triggers, 8.8.3 pause-window behavior, Section 8.8 intro, informative Section 6.1/6.2, and notification table are all consistent.
  • Normative strength is consistent. verificationId is MUST in both 8.7.3 and 8.8.2. The SHOULD/MUST conflict from round 8 is fully resolved.
  • Challenge ID transport is specified for both tiers. autonomy_pause with metadata.verificationId for verify; danger_zone notification with metadata.verificationId for danger_zone.
  • deny patterns are a first-class configurable element. Section 8.7.4 row is present, Stage 3 resolution order is unambiguous, and informative doc Section 5.2 policy example includes all three pattern types.

Summary

Finding #1 (deny pattern matches and Section 8.8.3 blocking semantics) is medium priority — the normative text's "Danger Zone event" heading may cause implementers to skip restart-persistence for deny-blocked agents, which directly undermines the hard-block guarantee. Finding #2 is a minor one-sentence addition to complete the symmetry of the notification clearing conditions. Both are small, targeted fixes.

Systematic audit-driven fixes to close all remaining specification gaps:

Notification lifecycle completeness (Section 8.7.3):
- permission_pending: add failed confirmation behavior (stays pending)
- autonomy_pause (verify tier): add clearing condition (verify_challenge
  success or challenge expiration)
- danger_zone: add clearing condition (verify_challenge or admin
  override), specify broadcast includes affected agent, scope includes
  deny pattern matches alongside danger_zone tier

Blocking semantics (Section 8.8.3):
- Replace "Danger Zone event" with explicit scope: any event that sets
  stopped:true (danger_zone tier, deny pattern match, or equivalent)
- Add challenge expiration behavior for verify tier: pause lifts,
  re-evaluation occurs, new challenge generated if same action reported

Normative consistency sweep:
- Remove narrow "Danger Zone" from stopped:true comment and decision
  tree (now references Section 8.8 generically)
- Update deny configurable element description to reference Section 8.8
- danger-levels.md Section 8.2: parallel documentation for both verify
  tier (pause) and danger_zone tier (hard block) enforcement flows
- Informative doc danger_zone notification row: include deny patterns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 4m 20s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit 3814312 addressed both 10th-round issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the eleventh review pass on this PR. The polish commit (3814312) cleanly resolved both findings from the tenth review. Two new findings remain, both introduced by the deny pattern expansion in Section 8.8.1.


Prior Review: Both Issues Resolved ✓

  • Section 8.8.3 "Danger Zone event" heading ambiguity → replaced with "hard-blocked by any event that sets stopped: true (a danger_zone safety tier assignment, a deny pattern match, or equivalent adapter-specific trigger)" ✓
  • autonomy_pause clearing condition for verify tier → added "For verify tier pauses, MUST include verificationId in metadata; cleared when verify_challenge succeeds for the associated challenge ID or the challenge expires" ✓
  • verify challenge expiration behavior added → subsequent calls SHOULD re-evaluate; same action triggers a new challenge ✓

New Findings

1. deny pattern match challenge ID transport only has SHOULD coverage, not MUST (Medium)

Section 8.8.1 defines three out-of-band verification triggers. Section 8.8.2 step 2 (v1.0.0-draft.md:1274) specifies MUST-level challenge ID transport for two of them, but the third is missing:

Trigger Section 8.8.1 normative strength Section 8.8.2 step 2 coverage
verify tier MUST via autonomy_pause with metadata.verificationId
danger_zone tier SHOULD include notification MUST via danger_zone with metadata.verificationId
deny pattern match SHOULD include notification (same line as danger_zone) Not mentioned

The relevant text: v1.0.0-draft.md:1252 — "SHOULD include a danger_zone notification broadcast to all executing agents" applies equally to both danger_zone and deny triggers. But Section 8.8.2 step 2 mentions only "danger_zone tier" by name — deny is not extended to the MUST transport guarantee.

The consequence: an implementation that lawfully skips the SHOULD for a deny-blocked agent provides stopped: true but no verificationId. The blocked agent has no challenge ID to call verify_challenge with. The Section 8.8.2 MUST explicitly closes this gap for danger_zone tier but leaves it open for deny pattern matches — creating a recovery-path gap for the most restricted class of hard blocks.

Fix: Update v1.0.0-draft.md:1274 to extend the MUST to deny pattern matches: replace "or danger_zone with metadata.verificationId for danger_zone tier" with "or danger_zone with metadata.verificationId for danger_zone tier or deny pattern matches". Alternatively, update the SHOULD in line 1252 to MUST for both danger_zone and deny triggers (since neither can be meaningfully unblocked without the notification). Fix this →

2. danger_zone notification description uses definitive language for verificationId but notification itself is SHOULD (Minor)

v1.0.0-draft.md:1207 (Section 8.7.3) reads:

"danger_zone: Broadcast to ALL executing agents (including the affected agent) when any agent triggers a hard block (danger_zone tier or deny pattern match). Includes the verificationId for the affected agent. Cleared when the blocked agent is unblocked via successful verify_challenge or admin override."

The notification is SHOULD-level (Section 8.8.1, line 1252). The description says the notification "Includes the verificationId" — a definitive statement without a normative qualifier. Since the notification may lawfully be omitted (SHOULD), the conditional "when sent" is missing. The intended meaning is clear from context, but the mismatch between "SHOULD send" and "Includes" (always-true) introduces a semantic gap: an implementer who omits the notification reads the description as though they've also omitted a required verificationId delivery.

Fix: Replace "Includes the verificationId for the affected agent" with "When sent, MUST include the verificationId for the affected agent." This tightens the conditional and ensures that any sent notification always delivers the challenge ID. Fix this →


Strengths

The eleven-review arc has converged to a substantially complete spec:

  • 20 of 22 issues across ten prior rounds resolved. The polish commits have addressed every finding precisely without over-engineering.
  • Section 8.8.3 blocking semantics are now explicit and complete. The heading covers all stopped: true triggers; the verify pause expiration behavior (lift, re-evaluate, regenerate) is fully specified.
  • All three notification types have complete lifecycle descriptions. permission_pending (cleared by confirm_operation, failed attempts don't clear), autonomy_pause (dual clearing paths for confirm vs. verify tier), danger_zone (broadcast scope, verificationId included, cleared by verify_challenge or admin override).
  • deny pattern matches are now first-class throughout the spec. Section 8.8.1 trigger conditions, Section 8.8.3 heading, Section 8.7.3 danger_zone notification description, and the informative doc notification table all correctly include deny pattern matches alongside danger_zone tier.
  • danger-levels.md Section 8.2 now covers both enforcement tiers in parallel. The verify (pause) and danger_zone (hard block) flows are documented symmetrically with accurate step-by-step descriptions.

Summary

Finding #1 (deny challenge ID transport weaker than other triggers) is medium priority — it creates a path where a deny-blocked agent is permanently stuck with stopped: true and no verificationId to call verify_challenge, violating the recovery-path guarantee that Section 8.8 establishes. Finding #2 is a one-sentence wording fix to add a normative qualifier to the danger_zone notification's verificationId inclusion.

The danger_zone notification (for both danger_zone tier and deny
pattern triggers) is upgraded from SHOULD to MUST because stopped:true
without a verificationId creates an unrecoverable dead end — the agent
is permanently blocked with no challenge ID to call verify_challenge.

- Section 8.8.1: SHOULD → MUST for danger_zone notification
- Section 8.8.2: extend MUST challenge ID transport to deny matches
- Section 8.7.3: add MUST qualifier to verificationId inclusion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 3m 12s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit f4b1535 addressed both 11th-round issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the twelfth review pass on this PR. The polish commit (f4b1535) cleanly resolved both findings from the eleventh review. Two new findings remain — one pre-existing gap exposed by the notification MUST upgrade, one introduced by it.


Prior Review: Both Issues Resolved ✓

  • deny pattern match challenge ID transport only SHOULD → MUST: line 1252 now reads "MUST include a danger_zone notification broadcast" for both danger_zone and deny triggers; line 1274 extends the MUST challenge ID transport to deny pattern matches explicitly ✓
  • danger_zone notification description using definitive language without normative qualifier → since the notification itself is now MUST, the verificationId inclusion at line 1207 ("MUST include the verificationId") is unconditionally correct — the "When sent" qualifier is unnecessary when there's no longer a SHOULD on the notification ✓

New Findings

1. Informative doc compliance summary lists hard-block state persistence as MAY — normative spec says MUST (Medium)

execution-safety-loop.md:611 — Section 9.3 MAY Requirements includes:

"MAY persist blocked agent state across server restarts"

But v1.0.0-draft.md:1293 (Section 8.8.3) is unambiguous:

"The blocked state MUST persist across server restarts (file-based or database storage)"

These directly conflict. The normative MUST was justified by the hard-block security guarantee — a blocked agent that becomes unblocked on restart defeats the purpose of the stopped: true mechanism. But the informative doc's Section 9.3 compliance summary presents persistence as an optional enhancement.

An implementer using Section 9 as a compliance checklist will categorize persistence as optional, build accordingly, pass internal reviews, and then have a server restart silently unblock every hard-blocked agent. The normative MUST is actionable (even prescribes the mechanism: file-based or database storage) but isn't represented in the informative doc's compliance table.

Fix: Move "persist blocked agent state across server restarts" from Section 9.3 (MAY) to Section 9.1 (MUST), updating the wording to match the normative spec's intent: "MUST persist hard-blocked agent state across server restarts when implementing stopped: true semantics". Fix this →

2. danger_zone notification MUST not reflected in Section 9.1 MUST compliance list (Minor)

execution-safety-loop.md:598 — Section 9.2 SHOULD Requirements lists:

"SHOULD include notifications in AutonomyDirective responses"

The latest polish commit upgraded the danger_zone notification to MUST for all hard-block triggers (v1.0.0-draft.md:1252). The informative doc's Section 9.1 MUST list (execution-safety-loop.md:583-589) does not include any bullet about including danger_zone notifications when returning stopped: true. An implementer treating Section 9.2 SHOULD for notifications as the authoritative strength will implement notifications as optional even in hard-block scenarios where the normative spec now requires them.

The Section 9.1 bullet at line 586 ("MUST respect the continue and stopped directive semantics") is the closest proxy, but it doesn't surface the notification obligation, and the danger_zone notification is part of the protocol contract when blocking (agents in other executions need to know about the hard block).

Fix: Add a bullet to Section 9.1: "MUST include a danger_zone notification broadcast when returning stopped: true (Section 8.8.1)" — and remove the general "SHOULD include notifications" from Section 9.2 if the MUST already covers the most critical case. Fix this →


Strengths

The twelve-review arc has converged to a substantially complete spec:

  • 22 of 24 issues across eleven prior rounds resolved. Each polish commit addressed findings precisely without over-engineering.
  • Both MUST upgrades in the latest commit are internally consistent. Line 1252 ("MUST include notification") and line 1274 ("MUST include verificationId") are aligned — the MUST chain from trigger → notification → challenge ID is now unbroken for all three hard-block trigger types.
  • Notification lifecycle descriptions are symmetric. All three notification types (permission_pending, autonomy_pause, danger_zone) have clearing conditions, normative strength, and metadata field requirements specified in Section 8.7.3.
  • The deny tier is now fully first-class. It has: a configurable patterns entry (8.7.4), Stage 3 resolution order, Stage 3 → Section 8.8 cross-reference, inclusion in 8.8.1 trigger conditions, MUST notification coverage (8.8.1:1252), MUST challenge ID transport (8.8.2:1274), MUST blocking semantics (8.8.3:1289), and the informative doc notification table (execution-safety-loop.md:399).
  • Challenge ID transport is MUST-level and complete for all three triggers. verify tier: autonomy_pause + metadata.verificationId; danger_zone/deny: danger_zone notification + metadata.verificationId.

Summary

Finding #1 (persistence MUST vs. MAY in compliance summary) is medium priority — an implementer treating Section 9 as a compliance checklist will build a system where server restarts silently unblock hard-blocked agents, defeating the security guarantee in Section 8.8.3. Finding #2 is a minor compliance table gap introduced by the notification MUST upgrade. Both are single-bullet changes.

- Move blocked state persistence from Section 9.3 MAY to Section 9.1
  MUST (normative spec Section 8.8.3 requires it)
- Add danger_zone notification broadcast to Section 9.1 MUST (normative
  spec Section 8.8.1 now requires it for all stopped:true events)
- Narrow Section 9.2 SHOULD for notifications to non-hard-block events
  since hard-block notifications are covered by the new MUST

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 3m 16s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit cfe6858 addressed both 12th-round issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the thirteenth review pass on this PR. The polish commit (cfe6858) cleanly resolved both findings from the twelfth review. Two new findings remain — both are a direct consequence of the Section 9.1 MUST upgrade pattern applied in the latest commit.


Prior Review: Both Issues Resolved ✓

  • Persistence MUST vs. MAY in compliance summary → "MAY persist blocked agent state across server restarts" removed from Section 9.3; "MUST persist hard-blocked agent state across server restarts when implementing stopped: true semantics" added to Section 9.1 ✓
  • danger_zone notification MUST not in Section 9.1 → "MUST include a danger_zone notification broadcast to all executing agents when returning stopped: true" added to Section 9.1; Section 9.2 SHOULD updated to "for non-hard-block events" ✓

New Findings

1. Section 9.1 MUST list missing verify tier autonomy_pause notification obligation (Medium)

execution-safety-loop.md:590 — The latest commit correctly added:

"MUST include a danger_zone notification broadcast to all executing agents when returning stopped: true"

But v1.0.0-draft.md:1274 imposes a parallel MUST for verify tier:

"The initial AutonomyDirective MUST include the challenge ID via a notification: autonomy_pause with metadata.verificationId for verify tier"

This MUST is not represented in Section 9.1. The resulting Section 9.2 language ("SHOULD include notifications in AutonomyDirective responses for non-hard-block events") now actively misrepresents the normative strength for verify tier — verify is not a hard block, so it falls under "non-hard-block events," downgrading the normative MUST to SHOULD in the compliance checklist.

An implementer building to Section 9.1 will know they MUST include danger_zone notifications for hard blocks but will not know that verify tier challenges also require a MUST-level autonomy_pause notification with verificationId. The agent has no challenge ID without it.

Fix: Add a bullet to Section 9.1: "MUST include an autonomy_pause notification with metadata.verificationId when issuing a verify tier challenge (Section 8.8.2)". This is structurally parallel to the danger_zone MUST bullet added in the latest commit. Fix this →

2. Section 8.7.1's general "OPTIONAL" statement contradicts section-specific MUST requirements (Minor)

v1.0.0-draft.md:1109 reads:

"Implementations MUST include continue and factors in every AutonomyDirective. All other fields are OPTIONAL."

This sentence uses the RFC 2119 normative term "OPTIONAL", which has a specific meaning: implementations MAY omit these fields without violating the spec. But Sections 8.8.1 and 8.8.2 now impose MUST-level requirements on notifications for hard-block and verify tier events:

  • v1.0.0-draft.md:1252: "the AutonomyDirective MUST return stopped: true and MUST include a danger_zone notification"
  • v1.0.0-draft.md:1274: "The initial AutonomyDirective MUST include the challenge ID via a notification"

An implementer who reads line 1109 first and implements notifications as optional throughout, then reads Section 8.8, may not realize the later MUST overrides the general OPTIONAL statement, since RFC 2119 usage of "OPTIONAL" usually means exactly that with no exceptions unless explicitly qualified.

Fix: Append "except as required by Sections 8.8.1 and 8.8.2" to line 1109: "All other fields are OPTIONAL except as required by Sections 8.8.1 and 8.8.2." Fix this →


Strengths

The thirteen-review arc continues to converge:

  • 24 of 26 issues across twelve prior rounds resolved. Each polish commit addressed findings precisely without over-engineering.
  • The Section 9.1 / 9.2 / 9.3 restructuring is correct and clean. The two MUST bullets added in round 12 are well-placed and correctly worded. The scope qualifier "for non-hard-block events" on the Section 9.2 SHOULD is accurate — it just needs the parallel verify tier MUST to complete the picture.
  • danger_zone notification chain is complete. MUST-level from normative spec (8.8.1:1252) through to informative compliance checklist (Section 9.1:590). Only the verify tier parallel is missing.
  • Blocking semantics section (8.8.3) is now explicit for all cases. Hard blocks (stopped: true), verify pauses (pending challenge ID, expiration, re-evaluation), and restart persistence are all correctly specified.
  • All 7 reserved operations have normative CRUD endpoint assignments. The Section 8.5.4 table is complete.
  • Section 8.8 intro, trigger conditions, blocking semantics, and informative Sections 6.1/6.2 are internally consistent for verify, danger_zone, and deny triggers.

Summary

Finding #1 (verify tier autonomy_pause MUST missing from Section 9.1) is medium priority — it mirrors the round-12 finding for danger_zone but for the verify tier, and the "for non-hard-block events" phrasing in Section 9.2 actively downgrades it from the normative MUST. Finding #2 is a one-clause clarification to the general OPTIONAL statement in 8.7.1 that now directly contradicts the MUST requirements in 8.8.1/8.8.2.

…MUST

- Section 8.7.1: qualify "All other fields are OPTIONAL" with cross-ref
  to Sections 8.8.1/8.8.2 where notifications field is mandatory
- Section 9.1: add verify tier autonomy_pause notification with
  metadata.verificationId as a MUST requirement
- Section 9.2: qualify SHOULD for non-hard-block notifications to exclude
  verify tier (already covered as MUST in Section 9.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 25, 2026

Claude finished @mickdarling's task in 4m 3s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit 782b181 addressed both 13th-round issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the fourteenth review pass on this PR. The polish commit (782b181) cleanly resolved both findings from the thirteenth review. Two new findings remain — both are pre-existing compliance table gaps surfaced by the Section 9 restructuring across the last several rounds.


Prior Review: Both Issues Resolved ✓

  • Section 9.1 MUST list missing verify tier autonomy_pause notification obligation → "MUST include an autonomy_pause notification with metadata.verificationId when issuing a verify tier challenge (Section 8.8.2)" added to Section 9.1 ✓
  • Section 8.7.1 "OPTIONAL" contradicts MUST requirements in 8.8.1/8.8.2 → Line 1109 now reads "All other fields are OPTIONAL except where specific sections impose MUST requirements (see Sections 8.8.1 and 8.8.2 for mandatory notifications in verify tier, danger_zone tier, and deny pattern scenarios)" ✓
  • Section 9.2 SHOULD updated to "for non-hard-block events other than verify tier" to prevent double-classifying the verify tier MUST ✓

New Findings

1. Section 9.3 says "MAY support step limits" — normative spec says Step Limit is MUST (Medium)

execution-safety-loop.md:611:

"MAY support step limits (maxAutonomousSteps)"

But the normative spec imposes MUST at two points:

  • v1.0.0-draft.md:1121 (Stage 1): "Implementations MUST support configuring the step limit"
  • v1.0.0-draft.md:1233 (Section 8.7.5): Step Limit row shows MUST, with the rationale "Prevents runaway autonomous execution"

Section 8.7.5 is explicit that Step Limit is the only MUST-level stage — and the only stage required for a conformant minimum viable implementation. The Section 9 restructuring over the last several rounds correctly moved persistence and notification requirements to MUST, but this MAY/MUST conflict for step limits was never addressed.

An implementer using Section 9 as a compliance checklist will conclude step limits are optional (MAY), directly violating the normative MUST. A system that skips Step Limit has no guard against unbounded autonomous execution — the only protection that Section 8.7.5 requires unconditionally.

Fix: Move "support step limits (maxAutonomousSteps)" from Section 9.3 to Section 9.1, with wording matching Section 8.7.5: "MUST implement the Step Limit stage with a configurable maxAutonomousSteps threshold ([Section 8.7.5](../versions/v1.0.0-draft.md#875-minimum-viable-implementation))". Fix this →

2. Section 9.2 verify tier carve-out is broader than the underlying MUST it replaces (Minor)

execution-safety-loop.md:601:

"SHOULD include notifications in AutonomyDirective responses for non-hard-block events other than verify tier (which is a MUST — see Section 9.1)"

The carve-out excludes verify tier from the SHOULD entirely. But the Section 9.1 MUST is specifically about autonomy_pause with metadata.verificationId — not all notifications in verify tier responses. Other notification types that may accompany a verify tier response (e.g., status, error) are now excluded from both the MUST (Section 9.1 is specific to autonomy_pause) and the SHOULD (Section 9.2 excludes verify tier altogether).

An implementer reading Section 9.2 literally could skip status and error notifications in verify tier responses and cite the carve-out as justification.

Fix: Narrow the parenthetical to the specific obligation: "other than verify tier challenges, for which the autonomy_pause notification with metadata.verificationId is a MUST (see Section 9.1)." This preserves SHOULD coverage for other notification types in verify tier responses while clearly designating the autonomy_pause + verificationId as MUST. Fix this →


Strengths

The fourteen-review arc has converged to a substantially complete spec:

  • 26 of 28 issues across thirteen prior rounds resolved. Each polish commit addressed findings precisely without over-engineering.
  • Section 9.1 now reflects all three normative notification MUSTs. danger_zone broadcast (from round 12), autonomy_pause + verificationId for verify tier (from round 13), and persistence for stopped: true — all correctly placed.
  • The Section 8.7.1 OPTIONAL exception is accurate. The new qualifier correctly identifies Sections 8.8.1 and 8.8.2 as the source of mandatory notifications requirements. The stopped field's mandatory semantics are separately covered by the existing interface comment and Section 8.8.1.
  • Section 9.2 and 9.3 restructuring is substantially correct. The only gap is the step limits MAY/MUST conflict (Finding spec: Document CRUDE pattern implementation #1) and the over-broad verify tier exclusion (Finding spec: Document introspection system #2).
  • All 7 reserved operations have normative CRUDE endpoint assignments. Section 8.5.4 table is complete and matches the informative doc's Section 4.1 table.
  • The deny tier, verify tier, and danger_zone tier are fully specified end-to-end. Trigger conditions, challenge protocol, blocking semantics, notification lifecycle, clearing conditions, compliance obligations, and cross-references are all consistent.

Summary

Finding #1 (step limits classified as MAY in compliance summary but MUST in normative spec) is medium priority — it's a direct compliance signal conflict on the one unconditional enforcement requirement in Section 8.7.5. Finding #2 is minor — the verify tier exclusion in Section 9.2 is slightly over-broad, leaving other notification types in verify tier responses without SHOULD coverage.

…215)

- Move step limits (maxAutonomousSteps) from Section 9.3 MAY to Section
  9.1 MUST, matching normative Section 8.7.5 requirement
- Narrow Section 9.2 verify tier exclusion to specifically reference the
  autonomy_pause notification with metadata.verificationId, preserving
  SHOULD coverage for other notification types in verify tier responses
- Remove duplicate MAY bullet for risk tolerance thresholds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 26, 2026

Claude finished @mickdarling's task in 3m 53s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit c86088f addressed both 14th-round issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the fifteenth review pass on this PR. The polish commit (c86088f) cleanly resolved both findings from the fourteenth review. Two new findings remain, both in the informative doc's Section 9 compliance table.


Prior Review: Both Issues Resolved ✓

  • Section 9.3 "MAY support step limits" conflicting with normative MUST → removed from Section 9.3; added to Section 9.1 as "MUST implement the Step Limit stage with a configurable maxAutonomousSteps threshold (Section 8.7.5)" ✓
  • Section 9.2 verify tier carve-out was over-broad → narrowed from "other than verify tier (which is a MUST)" to "other than verify tier challenges, for which the autonomy_pause notification with metadata.verificationId is a MUST (see Section 9.1)" ✓

New Findings

1. Section 9.3 "MAY support out-of-band verification" conflicts with Section 9.2 "SHOULD implement verify_challenge" (Medium)

execution-safety-loop.md:613 — Section 9.3 MAY Requirements lists:

"MAY support out-of-band verification for verify tier pauses and Danger Zone events"

But execution-safety-loop.md:600 — Section 9.2 SHOULD Requirements already says:

"SHOULD implement verify_challenge for verify tier pauses and Danger Zone unblocking"

verify_challenge is the protocol-level operation for out-of-band verification — these two bullets describe the same capability at contradictory normative levels. An implementer scanning both lists will see SHOULD and MAY for the same feature and may default to MAY, discarding the SHOULD as a duplicate.

The inconsistency is structural: the Section 9.1 MUST requirements (lines 590–591) require including verificationId in notifications for both verify tier challenges and danger_zone hard blocks. Generating a verificationId presupposes that a verification challenge exists — which means the challenge generation protocol is already effectively mandatory. A MAY for the overall out-of-band verification mechanism is inconsistent with MUSTs that are predicated on that mechanism running.

Fix: Remove line 613 from Section 9.3 — it is superseded by the SHOULD in Section 9.2 and made inconsistent by the MUST requirements in Section 9.1. If the original intent was to mark the display channel (OS dialog, SMS, hardware token) as optional (as opposed to the verify_challenge operation), rewrite line 613 to make that distinction explicit: "MAY use OS dialogs, hardware tokens, SMS/email, or other display channels for presenting verification codes to operators". Fix this →

2. Section 8.7.5 "Previous Outcome: SHOULD" has no corresponding entry in Section 9.2 (Minor)

execution-safety-loop.md:595–604 — Section 9.2 SHOULD Requirements maps to the Section 8.7.5 Minimum Viable Implementation table, but not completely. Cross-referencing the two:

Section 8.7.5 Stage Requirement Section 9 entry
Step Limit MUST Section 9.1: "MUST implement the Step Limit stage" ✓
Previous Outcome SHOULD Absent from Section 9.2
Pattern Matching SHOULD Section 9.2: "SHOULD support configurable policy patterns" ✓
Safety Tier MAY Section 9.3: "MAY implement pattern-based automatic danger level classification" ✓
Risk Tolerance MAY Section 9.3: "MAY support configurable risk tolerance thresholds" ✓

Stage 2 — evaluating the outcome field to return continue: false on failures — is a SHOULD in v1.0.0-draft.md:1234. The normative spec at line 1125 says: "If the outcome field... reports "failure", the evaluator SHOULD return continue: false". An implementer checking Section 9.2 as a complete SHOULD checklist will not know this stage exists.

Fix: Add to Section 9.2: "SHOULD evaluate the \outcome` field in `record_execution_step` calls and return `continue: false` on reported failures (Stage 2, Section 8.7.5)"`. Fix this →


Strengths

The fifteen-review arc has produced a substantially complete spec:

  • 28 of 30 issues across fourteen prior rounds resolved. Each polish commit addressed findings precisely without over-engineering.
  • Section 9.1 MUST list is now complete for all protocol obligations. Step Limit, persistence, danger_zone notification broadcast, autonomy_pause + verificationId — all correctly placed and worded.
  • Section 9.2 verify tier carve-out is now precise. The parenthetical correctly identifies that only the autonomy_pause + verificationId combination is MUST — other notification types in verify tier responses retain SHOULD coverage.
  • All three notification types have symmetric lifecycle descriptions. permission_pending, autonomy_pause (both confirm and verify sub-cases), and danger_zone all have trigger conditions, clearing conditions, and normative strength defined.
  • The deny tier is fully first-class throughout. Section 8.8.1 trigger conditions, 8.8.3 heading, 8.7.3 notification description, Section 9.1 notification MUST, informative notification table — all consistent.
  • Section 8.8.3 blocking semantics cover all cases. Hard blocks (stopped: true), verify pauses (pending challenge, expiration, re-evaluation), and restart persistence are all specified.

Summary

Finding #1 (Section 9.3 MAY vs Section 9.2 SHOULD for out-of-band verification) is medium priority — the redundant MAY gives implementers an escape hatch below the SHOULD strength, and is inconsistent with the MUSTs that presuppose challenge generation. Finding #2 is minor — the Section 9.2 SHOULD list is missing a Stage 2 entry that corresponds directly to a SHOULD row in the Section 8.7.5 minimum viable implementation table.

- Replace Section 9.3 "MAY support out-of-band verification" (conflicts
  with Section 9.2 SHOULD for verify_challenge) with MAY for display
  channel choice (OS dialogs, hardware tokens, SMS/email)
- Add Section 9.2 bullet for Stage 2 Previous Outcome evaluation,
  completing the Section 8.7.5 coverage in the compliance checklist

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickdarling
Copy link
Copy Markdown
Contributor Author

@claude Please review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Feb 26, 2026

Claude finished @mickdarling's task in 6m 9s —— View job


PR Review: feat: Sprint 7 — Safety Model (Phase 1)

  • Gather context and read repo instructions
  • Check prior review findings (polish commit f546c1a addressed both 15th-round issues)
  • Read changed files and analyze content
  • Check cross-references
  • Post review feedback

This is the sixteenth review pass on this PR. The polish commit (f546c1a) cleanly resolved both findings from the fifteenth review. Two new findings remain — both in the informative doc's Section 9 compliance checklist.


Prior Review: Both Issues Resolved ✓

  • Section 9.3 MAY/SHOULD conflict for out-of-band verification → replaced "MAY support out-of-band verification for verify tier pauses and Danger Zone events" with "MAY use OS dialogs, hardware tokens, SMS/email, or other display channels for presenting verification codes to operators" — correctly scopes the MAY to the display channel choice, not the verification mechanism ✓
  • Section 9.2 missing Stage 2 (Previous Outcome) SHOULD bullet → "SHOULD evaluate the outcome field in record_execution_step calls and return continue: false on reported failures (Stage 2, Section 8.7.5)" added ✓

New Findings

1. Section 8.8.5 rate limiting requirements entirely absent from Section 9 compliance checklist (Medium)

docs/security/execution-safety-loop.md:595–614 — Section 9 now covers all MUST and most SHOULD requirements from Sections 8.6–8.8. But v1.0.0-draft.md:1314-1320 (Section 8.8.5) contains four SHOULD requirements with no corresponding entries anywhere in Section 9:

Section 8.8.5 requirement Normative strength Section 9 entry
Rate-limit failed verify_challenge attempts SHOULD (line 1314) Absent
No more than 10 failed attempts / 60s window SHOULD (line 1316) Absent
Rate limit state SHOULD persist across restarts SHOULD (line 1319) Absent
Failed attempts SHOULD trigger security audit events SHOULD (line 1320) Absent

An implementer who builds verify_challenge against the Section 9.2 checklist has no guidance that rate limiting is expected. Without it, the brute-force window against a 128-bit entropy code is bounded only by network throughput — the code's entropy provides no protection if attempts are unlimited. Section 9.1 already has two SHOULD-to-MUST upgrades for notification obligations introduced in earlier rounds; the pattern of capturing critical security SHOULDs in Section 9.2 is established and just needs applying here.

Fix: Add to Section 9.2 SHOULD requirements:

- SHOULD rate-limit failed `verify_challenge` attempts (≤10 per 60s window per agent) and persist rate limit state across server restarts ([Section 8.8.5](../versions/v1.0.0-draft.md#885-rate-limiting))
- SHOULD emit security audit events for failed verification attempts ([Section 8.8.5](../versions/v1.0.0-draft.md#885-rate-limiting))

Fix this →

2. Section 9.2 Stage 2 bullet cross-references 8.7.5 (requirement table) but not 8.7.2 (behavioral definition) (Minor)

docs/security/execution-safety-loop.md:601 — The new Stage 2 bullet reads:

"SHOULD evaluate the outcome field in record_execution_step calls and return continue: false on reported failures (Stage 2, Section 8.7.5)"

Section 8.7.5 is the minimum viable implementation table — it shows the requirement level (SHOULD) but not the behavioral spec. The actual Stage 2 definition, including the "if outcome is omitted, this stage is skipped" clause and the rationale for pausing on failure, is in v1.0.0-draft.md:1123-1129 (Section 8.7.2). An implementer who clicks the link reaches the summary table and still needs to search for the behavior.

Section 3.3 of the informative doc does already cross-reference Section 8.7.2 for Stage 2 (execution-safety-loop.md:205). But in the compliance checklist context (Section 9), where each bullet is a self-contained requirement, an implementer following the checklist would benefit from a direct link to the behavioral spec.

Fix: Update the link to reference Section 8.7.2 rather than 8.7.5: replace ([Stage 2, Section 8.7.5](../versions/v1.0.0-draft.md#875-minimum-viable-implementation)) with ([Section 8.7.2, Stage 2](../versions/v1.0.0-draft.md#872-evaluation-pipeline)). The requirement level (SHOULD) is captured in the bullet text itself and doesn't need the 8.7.5 link. Fix this →


Strengths

The sixteen-review arc has converged to a near-complete spec:

  • 30 of 32 issues across fifteen prior rounds resolved. Each polish commit addressed findings precisely without over-engineering.
  • Section 9 compliance checklist is substantially complete. All MUST requirements for the execution safety loop are in Section 9.1 (notification obligations, persistence, step limit, endpoint placement). Section 9.2 now covers Stage 2 (new this round), Stage 3 pattern matching, verify_challenge, confirm_operation, and general logging.
  • The MAY/SHOULD/MUST layering in Section 9 is clean. The round-14 step limits fix, round-12/13 notification MUST additions, and round-15 MAY display channel rewrite have produced a well-structured compliance summary.
  • Both Section 8.8.5 state persistence requirements are correctly split. Hard-blocked state: MUST in Section 9.1 (server restart can't bypass a hard block). Rate limit state: SHOULD in Section 8.8.5 (missing from Section 9.2 — this round's Finding spec: Document CRUDE pattern implementation #1).
  • All 7 reserved operations are correctly specified end-to-end. Normative endpoint assignments (8.5.4), informative quick-reference table (Section 4.1), and compliance checklist (Section 9.2) are aligned.

Summary

Finding #1 (Section 8.8.5 rate limiting absent from Section 9.2) is medium priority — it's a security-relevant gap where an implementer who builds verify_challenge against the Section 9 checklist has no guidance that rate limiting is expected. Finding #2 is a one-word link fix for improved traceability in the compliance checklist. Both are small, targeted changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant