Rectify: Adjudicated Failure False Positive — PART A ONLY#415
Merged
Trecek merged 7 commits intoskill-and-tool-access-control-tiered-visibility-cross-tier-s/306from Mar 16, 2026
Merged
Conversation
… logging test - Add make_session fixture and import ContentState/_evaluate_content_state - Add TestContentStateEnum: enum variants + parametrized _evaluate_content_state tests - Add TestDeadEndGuardContentState: contract violation terminal, drain-race regression - Update test_logs_dead_end_guard to use empty result (drain-race path) - Add test_logs_dead_end_guard_terminal_not_promoted companion test - Fix pre-existing line-too-long in server/_state.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… positives The dead-end guard in _compute_outcome previously promoted ALL channel-confirmed failures to RETRIABLE, conflating drain-race artifacts (transient) with pattern- contract violations (terminal). This caused infinite retry cycles when a session completed normally but the model did not produce expected output patterns. - Add ContentState StrEnum (COMPLETE/ABSENT/CONTRACT_VIOLATION/SESSION_ERROR) - Add _evaluate_content_state() to classify content failures by category - Refactor _compute_success CHANNEL_B bypass to delegate to _evaluate_content_state - Fix dead-end guard: only promote ContentState.ABSENT to RETRIABLE; CONTRACT_VIOLATION and SESSION_ERROR remain FAILED (terminal — retrying will not produce different output) - Export ContentState from execution/__init__.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…etection Two bugs introduced by the ContentState implementation: 1. _compute_success CHANNEL_B bypass used _evaluate_content_state, which checks is_error and completion_marker. CHANNEL_B confirmation overrides both — the bypass must only check expected_output_patterns (pattern contracts). Reverts to _check_expected_patterns for the bypass decision. 2. _evaluate_content_state returned COMPLETE for no-marker/no-patterns cases regardless of result emptiness. Empty results must return ABSENT so the dead-end guard can promote CHANNEL_A drain-race artifacts to RETRIABLE even when no content requirements are configured. All 12 regression failures fixed; 4393 tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Trecek
commented
Mar 16, 2026
Collaborator
Author
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit PR Review — Verdict: changes_requested
| result="Some output without the marker", | ||
| assistant_messages=[], | ||
| ) | ||
| outcome, retry_reason = _compute_outcome( |
Collaborator
Author
There was a problem hiding this comment.
[warning] tests: session=session first argument to _compute_outcome is under-indented relative to other keyword arguments (column 8 vs column 12). ruff will flag this and auto-fix but abort the commit, requiring re-stage and retry.
…redundant comment - completion_marker check at L414 now uses the already-stripped `result` variable (consistent with the empty-content check at L403 and pattern check at L418) — fixes inconsistent strip() usage flagged in review - remove comment block (L444-L448) that restated the absence of skipped checks rather than clarifying non-obvious behaviour; PRECONDITION comment retained as it documents a caller contract Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…headless.py Aligns with the strip() usage in session.py at L418 and L453, ensuring pattern matching is not sensitive to leading/trailing whitespace regardless of call site. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…with value assertions Truthiness assertions (assert ContentState.COMPLETE) are always True for non-empty StrEnum values and cannot detect misspellings or duplicate values. Replace with explicit value equality checks (assert ContentState.COMPLETE == "complete") to catch regressions in enum string values. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
391bf06
into
skill-and-tool-access-control-tiered-visibility-cross-tier-s/306
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The dead-end guard in
_compute_outcome(session.py) conflates two fundamentally different failure categories under a single "channel confirmed + failure = retriable" assumption:Drain-race artifacts (transient) —
session.resultis empty or missing the completion marker because stdout was not fully drained before the guard ran. These are genuinely retriable: a resume will find the content.Pattern-contract violations (terminal) —
session.resultis non-empty and contains the completion marker, but the skill'sexpected_output_patternsare absent. The session completed normally; the model simply did not produce the required output block. Retrying will never produce different content.The guard promotes both categories to
RETRIABLE + RESUME, causing infinite retry cycles until the budget guard terminates the run. This PR introducesContentState, a typed enum that makes the four content evaluation outcomes first-class concepts (ABSENT,CONTRACT_VIOLATION,COMPLETE,SESSION_ERROR). The dead-end guard now checksContentStaterather than justbool, restricting drain-race promotion toContentState.ABSENTonly. This makes the bug class structurally impossible.Architecture Impact
Process Flow Diagram
%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%% flowchart TB classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; START([session exits<br/>channel confirmed]) subgraph Recovery ["Recovery Phase (● headless.py)"] direction TB R1["_recover_from_separate_marker<br/>━━━━━━━━━━<br/>Rebuild result if marker<br/>in standalone assistant msg"] R2["● _recover_block_from_assistant_messages<br/>━━━━━━━━━━<br/>Rebuild result if patterns found<br/>in assistant_messages<br/>(now: CHANNEL_A + CHANNEL_B)"] end subgraph Evaluation ["★ Content Evaluation (● session.py)"] direction TB ECS["★ _evaluate_content_state<br/>━━━━━━━━━━<br/>is_error? → SESSION_ERROR<br/>empty result? → ABSENT<br/>marker absent? → ABSENT<br/>patterns fail? → CONTRACT_VIOLATION<br/>else → COMPLETE"] end subgraph Outcome ["● _compute_outcome (session.py)"] direction TB CS["● _compute_success<br/>━━━━━━━━━━<br/>CHANNEL_B: delegates to<br/>_evaluate_content_state<br/>CHANNEL_A: _check_session_content"] CR["_compute_retry<br/>━━━━━━━━━━<br/>API signals / termination /<br/>channel confirmation dispatch"] CONTRA{"contradiction guard<br/>━━━━━━━━━━<br/>success=True AND<br/>needs_retry=True?"} DEG{"● dead-end guard<br/>━━━━━━━━━━<br/>not success AND<br/>not needs_retry AND<br/>channel confirmed?"} DEGS{"★ ContentState check<br/>━━━━━━━━━━<br/>ABSENT?"} end subgraph Resolution ["Outcome Resolution"] direction TB PROMOTE["promote to RETRIABLE<br/>━━━━━━━━━━<br/>retry_reason = RESUME<br/>(drain-race rescue only)"] TERMINAL["remain FAILED<br/>━━━━━━━━━━<br/>CONTRACT_VIOLATION or<br/>SESSION_ERROR → terminal"] NS["_normalize_subtype<br/>━━━━━━━━━━<br/>adjudicated_failure /<br/>empty_result / missing_marker"] end RESULT(["SkillResult<br/>success / needs_retry / subtype"]) START --> R1 --> R2 --> ECS ECS --> CS ECS --> CR CS --> CONTRA CR --> CONTRA CONTRA -->|"yes: demote success"| DEG CONTRA -->|"no"| DEG DEG -->|"yes"| DEGS DEG -->|"no"| NS DEGS -->|"ABSENT"| PROMOTE DEGS -->|"CONTRACT_VIOLATION<br/>or SESSION_ERROR"| TERMINAL PROMOTE --> NS TERMINAL --> NS NS --> RESULT class START,RESULT terminal; class R1,R2 handler; class ECS,DEGS newComponent; class CS,DEG detector; class CR,CONTRA phase; class PROMOTE stateNode; class TERMINAL,NS output;Color Legend:
State Lifecycle Diagram
%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%% flowchart TB classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000; classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; subgraph Fields ["SKILLRESULT FIELD LIFECYCLE"] direction LR INIT["INIT_ONLY<br/>━━━━━━━━━━<br/>session_id<br/>start_ts<br/>completion_marker<br/>NEVER modify after set"] APPEND["APPEND_ONLY<br/>━━━━━━━━━━<br/>assistant_messages<br/>only grows during recovery"] MUTABLE["MUTABLE<br/>━━━━━━━━━━<br/>● needs_retry<br/>● success<br/>● retry_reason<br/>● subtype<br/>computed then may be promoted"] DERIVED["DERIVED<br/>━━━━━━━━━━<br/>outcome<br/>mapped from success+needs_retry<br/>(SUCCEEDED / RETRIABLE / FAILED)"] end subgraph Recovery ["● RECOVERY MUTATIONS (headless.py)"] direction TB R1["_recover_from_separate_marker<br/>━━━━━━━━━━<br/>may overwrite session.result<br/>from assistant_messages"] R2["● _recover_block_from_assistant_messages<br/>━━━━━━━━━━<br/>appends to session.result<br/>now: CHANNEL_A + CHANNEL_B"] end subgraph ContentGate ["★ CONTENTSTATE VALIDATION GATE (session.py)"] direction TB ECS["★ _evaluate_content_state<br/>━━━━━━━━━━<br/>reads: session.result, is_error,<br/>completion_marker, patterns<br/>writes: ContentState discriminant"] CS_ENUM["★ ContentState enum<br/>━━━━━━━━━━<br/>COMPLETE — all checks pass<br/>ABSENT — empty/marker missing (retriable)<br/>CONTRACT_VIOLATION — result+marker+no patterns (terminal)<br/>SESSION_ERROR — is_error=True (terminal)"] end subgraph DeadEndGuard ["● DEAD-END GUARD CONTRACT (session.py)"] direction TB PRE["● _compute_outcome entry<br/>━━━━━━━━━━<br/>success=False AND<br/>needs_retry=False AND<br/>channel_confirmed?"] GATE{"★ ContentState gate<br/>━━━━━━━━━━<br/>ContentState.ABSENT?"} PROMOTE["PROMOTE<br/>━━━━━━━━━━<br/>needs_retry = True<br/>retry_reason = RESUME<br/>(drain-race rescue)"] BLOCK["BLOCK promotion<br/>━━━━━━━━━━<br/>needs_retry stays False<br/>outcome = FAILED<br/>(contract violation guard)"] end subgraph SkillResult ["SKILLRESULT OUTPUT CONTRACT"] direction TB SR["SkillResult<br/>━━━━━━━━━━<br/>success / needs_retry / subtype<br/>guaranteed: CONTRACT_VIOLATION<br/>never becomes RETRIABLE"] end INIT --> Recovery APPEND --> Recovery MUTABLE --> Recovery R1 --> R2 R2 --> ECS ECS --> CS_ENUM CS_ENUM --> PRE PRE --> GATE GATE -->|"ABSENT"| PROMOTE GATE -->|"CONTRACT_VIOLATION<br/>or SESSION_ERROR"| BLOCK PROMOTE --> MUTABLE BLOCK --> MUTABLE MUTABLE --> DERIVED DERIVED --> SR class INIT detector; class APPEND handler; class MUTABLE phase; class DERIVED gap; class R1,R2 handler; class ECS,CS_ENUM,GATE newComponent; class PRE detector; class PROMOTE stateNode; class BLOCK output; class SR terminal;Color Legend:
Implementation Plan
Plan file:
temp/rectify/rectify_adjudicated-failure-false-positive_2026-03-16_000001_part_a.mdToken Usage Summary
No token data available for this pipeline run.
🤖 Generated with Claude Code via AutoSkillit