Rectify: Adjudicated Failure False Positive — PART A ONLY by Trecek · Pull Request #415 · TalonT-Org/AutoSkillit

Trecek · 2026-03-16T20:34:11Z

Summary

The dead-end guard in _compute_outcome (session.py) conflates two fundamentally different failure categories under a single "channel confirmed + failure = retriable" assumption:

Drain-race artifacts (transient) — session.result is empty or missing the completion marker because stdout was not fully drained before the guard ran. These are genuinely retriable: a resume will find the content.
Pattern-contract violations (terminal) — session.result is non-empty and contains the completion marker, but the skill's expected_output_patterns are absent. The session completed normally; the model simply did not produce the required output block. Retrying will never produce different content.

The guard promotes both categories to RETRIABLE + RESUME, causing infinite retry cycles until the budget guard terminates the run. This PR introduces ContentState, a typed enum that makes the four content evaluation outcomes first-class concepts (ABSENT, CONTRACT_VIOLATION, COMPLETE, SESSION_ERROR). The dead-end guard now checks ContentState rather than just bool, restricting drain-race promotion to ContentState.ABSENT only. This makes the bug class structurally impossible.

Architecture Impact

Process Flow Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%%
flowchart TB
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    START([session exits<br/>channel confirmed])

    subgraph Recovery ["Recovery Phase (● headless.py)"]
        direction TB
        R1["_recover_from_separate_marker<br/>━━━━━━━━━━<br/>Rebuild result if marker<br/>in standalone assistant msg"]
        R2["● _recover_block_from_assistant_messages<br/>━━━━━━━━━━<br/>Rebuild result if patterns found<br/>in assistant_messages<br/>(now: CHANNEL_A + CHANNEL_B)"]
    end

    subgraph Evaluation ["★ Content Evaluation (● session.py)"]
        direction TB
        ECS["★ _evaluate_content_state<br/>━━━━━━━━━━<br/>is_error? → SESSION_ERROR<br/>empty result? → ABSENT<br/>marker absent? → ABSENT<br/>patterns fail? → CONTRACT_VIOLATION<br/>else → COMPLETE"]
    end

    subgraph Outcome ["● _compute_outcome (session.py)"]
        direction TB
        CS["● _compute_success<br/>━━━━━━━━━━<br/>CHANNEL_B: delegates to<br/>_evaluate_content_state<br/>CHANNEL_A: _check_session_content"]
        CR["_compute_retry<br/>━━━━━━━━━━<br/>API signals / termination /<br/>channel confirmation dispatch"]
        CONTRA{"contradiction guard<br/>━━━━━━━━━━<br/>success=True AND<br/>needs_retry=True?"}
        DEG{"● dead-end guard<br/>━━━━━━━━━━<br/>not success AND<br/>not needs_retry AND<br/>channel confirmed?"}
        DEGS{"★ ContentState check<br/>━━━━━━━━━━<br/>ABSENT?"}
    end

    subgraph Resolution ["Outcome Resolution"]
        direction TB
        PROMOTE["promote to RETRIABLE<br/>━━━━━━━━━━<br/>retry_reason = RESUME<br/>(drain-race rescue only)"]
        TERMINAL["remain FAILED<br/>━━━━━━━━━━<br/>CONTRACT_VIOLATION or<br/>SESSION_ERROR → terminal"]
        NS["_normalize_subtype<br/>━━━━━━━━━━<br/>adjudicated_failure /<br/>empty_result / missing_marker"]
    end

    RESULT(["SkillResult<br/>success / needs_retry / subtype"])

    START --> R1 --> R2 --> ECS
    ECS --> CS
    ECS --> CR
    CS --> CONTRA
    CR --> CONTRA
    CONTRA -->|"yes: demote success"| DEG
    CONTRA -->|"no"| DEG
    DEG -->|"yes"| DEGS
    DEG -->|"no"| NS
    DEGS -->|"ABSENT"| PROMOTE
    DEGS -->|"CONTRACT_VIOLATION<br/>or SESSION_ERROR"| TERMINAL
    PROMOTE --> NS
    TERMINAL --> NS
    NS --> RESULT

    class START,RESULT terminal;
    class R1,R2 handler;
    class ECS,DEGS newComponent;
    class CS,DEG detector;
    class CR,CONTRA phase;
    class PROMOTE stateNode;
    class TERMINAL,NS output;

Color Legend:

Color	Category	Description
Dark Blue	Terminal	Session entry and SkillResult exit
Orange	Handler	Recovery phase functions (● modified)
Green	New Component	★ _evaluate_content_state and ★ ContentState gate
Red	Detector	● _compute_success (modified), ● dead-end guard (modified)
Purple	Phase	_compute_retry, contradiction guard
Teal	State	RETRIABLE promotion (drain-race rescue path)
Dark Teal	Output	Terminal outcome and _normalize_subtype

State Lifecycle Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    subgraph Fields ["SKILLRESULT FIELD LIFECYCLE"]
        direction LR
        INIT["INIT_ONLY<br/>━━━━━━━━━━<br/>session_id<br/>start_ts<br/>completion_marker<br/>NEVER modify after set"]
        APPEND["APPEND_ONLY<br/>━━━━━━━━━━<br/>assistant_messages<br/>only grows during recovery"]
        MUTABLE["MUTABLE<br/>━━━━━━━━━━<br/>● needs_retry<br/>● success<br/>● retry_reason<br/>● subtype<br/>computed then may be promoted"]
        DERIVED["DERIVED<br/>━━━━━━━━━━<br/>outcome<br/>mapped from success+needs_retry<br/>(SUCCEEDED / RETRIABLE / FAILED)"]
    end

    subgraph Recovery ["● RECOVERY MUTATIONS (headless.py)"]
        direction TB
        R1["_recover_from_separate_marker<br/>━━━━━━━━━━<br/>may overwrite session.result<br/>from assistant_messages"]
        R2["● _recover_block_from_assistant_messages<br/>━━━━━━━━━━<br/>appends to session.result<br/>now: CHANNEL_A + CHANNEL_B"]
    end

    subgraph ContentGate ["★ CONTENTSTATE VALIDATION GATE (session.py)"]
        direction TB
        ECS["★ _evaluate_content_state<br/>━━━━━━━━━━<br/>reads: session.result, is_error,<br/>completion_marker, patterns<br/>writes: ContentState discriminant"]
        CS_ENUM["★ ContentState enum<br/>━━━━━━━━━━<br/>COMPLETE — all checks pass<br/>ABSENT — empty/marker missing (retriable)<br/>CONTRACT_VIOLATION — result+marker+no patterns (terminal)<br/>SESSION_ERROR — is_error=True (terminal)"]
    end

    subgraph DeadEndGuard ["● DEAD-END GUARD CONTRACT (session.py)"]
        direction TB
        PRE["● _compute_outcome entry<br/>━━━━━━━━━━<br/>success=False AND<br/>needs_retry=False AND<br/>channel_confirmed?"]
        GATE{"★ ContentState gate<br/>━━━━━━━━━━<br/>ContentState.ABSENT?"}
        PROMOTE["PROMOTE<br/>━━━━━━━━━━<br/>needs_retry = True<br/>retry_reason = RESUME<br/>(drain-race rescue)"]
        BLOCK["BLOCK promotion<br/>━━━━━━━━━━<br/>needs_retry stays False<br/>outcome = FAILED<br/>(contract violation guard)"]
    end

    subgraph SkillResult ["SKILLRESULT OUTPUT CONTRACT"]
        direction TB
        SR["SkillResult<br/>━━━━━━━━━━<br/>success / needs_retry / subtype<br/>guaranteed: CONTRACT_VIOLATION<br/>never becomes RETRIABLE"]
    end

    INIT --> Recovery
    APPEND --> Recovery
    MUTABLE --> Recovery
    R1 --> R2
    R2 --> ECS
    ECS --> CS_ENUM
    CS_ENUM --> PRE
    PRE --> GATE
    GATE -->|"ABSENT"| PROMOTE
    GATE -->|"CONTRACT_VIOLATION<br/>or SESSION_ERROR"| BLOCK
    PROMOTE --> MUTABLE
    BLOCK --> MUTABLE
    MUTABLE --> DERIVED
    DERIVED --> SR

    class INIT detector;
    class APPEND handler;
    class MUTABLE phase;
    class DERIVED gap;
    class R1,R2 handler;
    class ECS,CS_ENUM,GATE newComponent;
    class PRE detector;
    class PROMOTE stateNode;
    class BLOCK output;
    class SR terminal;

Color Legend:

Color	Category	Description
Red	INIT_ONLY / Guard Entry	Fields that must never change; dead-end guard check
Orange	APPEND_ONLY / Handlers	assistant_messages growth; recovery functions (● modified)
Purple	MUTABLE	Fields computed and potentially promoted by guards
Yellow/Amber	DERIVED	Outcome derived from field combination
Green	New Component	★ ContentState enum, ★ _evaluate_content_state, ★ gate
Teal	Promotion	needs_retry promoted for drain-race (ABSENT only)
Dark Teal	Block	Promotion blocked for terminal failures
Dark Blue	Terminal	SkillResult output contract

Implementation Plan

Plan file: temp/rectify/rectify_adjudicated-failure-false-positive_2026-03-16_000001_part_a.md

Token Usage Summary

No token data available for this pipeline run.

🤖 Generated with Claude Code via AutoSkillit

… logging test - Add make_session fixture and import ContentState/_evaluate_content_state - Add TestContentStateEnum: enum variants + parametrized _evaluate_content_state tests - Add TestDeadEndGuardContentState: contract violation terminal, drain-race regression - Update test_logs_dead_end_guard to use empty result (drain-race path) - Add test_logs_dead_end_guard_terminal_not_promoted companion test - Fix pre-existing line-too-long in server/_state.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… positives The dead-end guard in _compute_outcome previously promoted ALL channel-confirmed failures to RETRIABLE, conflating drain-race artifacts (transient) with pattern- contract violations (terminal). This caused infinite retry cycles when a session completed normally but the model did not produce expected output patterns. - Add ContentState StrEnum (COMPLETE/ABSENT/CONTRACT_VIOLATION/SESSION_ERROR) - Add _evaluate_content_state() to classify content failures by category - Refactor _compute_success CHANNEL_B bypass to delegate to _evaluate_content_state - Fix dead-end guard: only promote ContentState.ABSENT to RETRIABLE; CONTRACT_VIOLATION and SESSION_ERROR remain FAILED (terminal — retrying will not produce different output) - Export ContentState from execution/__init__.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…etection Two bugs introduced by the ContentState implementation: 1. _compute_success CHANNEL_B bypass used _evaluate_content_state, which checks is_error and completion_marker. CHANNEL_B confirmation overrides both — the bypass must only check expected_output_patterns (pattern contracts). Reverts to _check_expected_patterns for the bypass decision. 2. _evaluate_content_state returned COMPLETE for no-marker/no-patterns cases regardless of result emptiness. Empty results must return ABSENT so the dead-end guard can promote CHANNEL_A drain-race artifacts to RETRIABLE even when no content requirements are configured. All 12 regression failures fixed; 4393 tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…NNEL_A

Trecek

AutoSkillit PR Review — Verdict: changes_requested

src/autoskillit/execution/session.py

src/autoskillit/execution/headless.py

tests/execution/test_session_adjudication.py

Trecek · 2026-03-16T20:40:57Z

tests/execution/test_session_adjudication.py

+            result="Some output without the marker",
+            assistant_messages=[],
+        )
+        outcome, retry_reason = _compute_outcome(


[warning] tests: session=session first argument to _compute_outcome is under-indented relative to other keyword arguments (column 8 vs column 12). ruff will flag this and auto-fix but abort the commit, requiring re-stage and retry.

…redundant comment - completion_marker check at L414 now uses the already-stripped `result` variable (consistent with the empty-content check at L403 and pattern check at L418) — fixes inconsistent strip() usage flagged in review - remove comment block (L444-L448) that restated the absence of skipped checks rather than clarifying non-obvious behaviour; PRECONDITION comment retained as it documents a caller contract Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…headless.py Aligns with the strip() usage in session.py at L418 and L453, ensuring pattern matching is not sensitive to leading/trailing whitespace regardless of call site. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…with value assertions Truthiness assertions (assert ContentState.COMPLETE) are always True for non-empty StrEnum values and cannot detect misspellings or duplicate values. Replace with explicit value equality checks (assert ContentState.COMPLETE == "complete") to catch regressions in enum string values. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Trecek and others added 4 commits March 16, 2026 12:40

fix(headless): extend pattern recovery from assistant_messages to CHA…

def308e

…NNEL_A

Trecek commented Mar 16, 2026

View reviewed changes

Trecek and others added 3 commits March 16, 2026 13:48

Trecek merged commit 391bf06 into skill-and-tool-access-control-tiered-visibility-cross-tier-s/306 Mar 16, 2026

Trecek deleted the remediation/20260316 branch March 16, 2026 21:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rectify: Adjudicated Failure False Positive — PART A ONLY#415

Rectify: Adjudicated Failure False Positive — PART A ONLY#415
Trecek merged 7 commits intoskill-and-tool-access-control-tiered-visibility-cross-tier-s/306from
remediation/20260316

Trecek commented Mar 16, 2026

Uh oh!

Trecek left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Trecek Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Trecek commented Mar 16, 2026

Summary

Architecture Impact

Process Flow Diagram

State Lifecycle Diagram

Implementation Plan

Token Usage Summary

Uh oh!

Trecek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Trecek Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant