Skip to content

Rectify: Adjudicated Failure False Positive — PART A ONLY#415

Merged
Trecek merged 7 commits intoskill-and-tool-access-control-tiered-visibility-cross-tier-s/306from
remediation/20260316
Mar 16, 2026
Merged

Rectify: Adjudicated Failure False Positive — PART A ONLY#415
Trecek merged 7 commits intoskill-and-tool-access-control-tiered-visibility-cross-tier-s/306from
remediation/20260316

Conversation

@Trecek
Copy link
Collaborator

@Trecek Trecek commented Mar 16, 2026

Summary

The dead-end guard in _compute_outcome (session.py) conflates two fundamentally different failure categories under a single "channel confirmed + failure = retriable" assumption:

  1. Drain-race artifacts (transient) — session.result is empty or missing the completion marker because stdout was not fully drained before the guard ran. These are genuinely retriable: a resume will find the content.

  2. Pattern-contract violations (terminal) — session.result is non-empty and contains the completion marker, but the skill's expected_output_patterns are absent. The session completed normally; the model simply did not produce the required output block. Retrying will never produce different content.

The guard promotes both categories to RETRIABLE + RESUME, causing infinite retry cycles until the budget guard terminates the run. This PR introduces ContentState, a typed enum that makes the four content evaluation outcomes first-class concepts (ABSENT, CONTRACT_VIOLATION, COMPLETE, SESSION_ERROR). The dead-end guard now checks ContentState rather than just bool, restricting drain-race promotion to ContentState.ABSENT only. This makes the bug class structurally impossible.

Architecture Impact

Process Flow Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%%
flowchart TB
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    START([session exits<br/>channel confirmed])

    subgraph Recovery ["Recovery Phase (● headless.py)"]
        direction TB
        R1["_recover_from_separate_marker<br/>━━━━━━━━━━<br/>Rebuild result if marker<br/>in standalone assistant msg"]
        R2["● _recover_block_from_assistant_messages<br/>━━━━━━━━━━<br/>Rebuild result if patterns found<br/>in assistant_messages<br/>(now: CHANNEL_A + CHANNEL_B)"]
    end

    subgraph Evaluation ["★ Content Evaluation (● session.py)"]
        direction TB
        ECS["★ _evaluate_content_state<br/>━━━━━━━━━━<br/>is_error? → SESSION_ERROR<br/>empty result? → ABSENT<br/>marker absent? → ABSENT<br/>patterns fail? → CONTRACT_VIOLATION<br/>else → COMPLETE"]
    end

    subgraph Outcome ["● _compute_outcome (session.py)"]
        direction TB
        CS["● _compute_success<br/>━━━━━━━━━━<br/>CHANNEL_B: delegates to<br/>_evaluate_content_state<br/>CHANNEL_A: _check_session_content"]
        CR["_compute_retry<br/>━━━━━━━━━━<br/>API signals / termination /<br/>channel confirmation dispatch"]
        CONTRA{"contradiction guard<br/>━━━━━━━━━━<br/>success=True AND<br/>needs_retry=True?"}
        DEG{"● dead-end guard<br/>━━━━━━━━━━<br/>not success AND<br/>not needs_retry AND<br/>channel confirmed?"}
        DEGS{"★ ContentState check<br/>━━━━━━━━━━<br/>ABSENT?"}
    end

    subgraph Resolution ["Outcome Resolution"]
        direction TB
        PROMOTE["promote to RETRIABLE<br/>━━━━━━━━━━<br/>retry_reason = RESUME<br/>(drain-race rescue only)"]
        TERMINAL["remain FAILED<br/>━━━━━━━━━━<br/>CONTRACT_VIOLATION or<br/>SESSION_ERROR → terminal"]
        NS["_normalize_subtype<br/>━━━━━━━━━━<br/>adjudicated_failure /<br/>empty_result / missing_marker"]
    end

    RESULT(["SkillResult<br/>success / needs_retry / subtype"])

    START --> R1 --> R2 --> ECS
    ECS --> CS
    ECS --> CR
    CS --> CONTRA
    CR --> CONTRA
    CONTRA -->|"yes: demote success"| DEG
    CONTRA -->|"no"| DEG
    DEG -->|"yes"| DEGS
    DEG -->|"no"| NS
    DEGS -->|"ABSENT"| PROMOTE
    DEGS -->|"CONTRACT_VIOLATION<br/>or SESSION_ERROR"| TERMINAL
    PROMOTE --> NS
    TERMINAL --> NS
    NS --> RESULT

    class START,RESULT terminal;
    class R1,R2 handler;
    class ECS,DEGS newComponent;
    class CS,DEG detector;
    class CR,CONTRA phase;
    class PROMOTE stateNode;
    class TERMINAL,NS output;
Loading

Color Legend:

Color Category Description
Dark Blue Terminal Session entry and SkillResult exit
Orange Handler Recovery phase functions (● modified)
Green New Component ★ _evaluate_content_state and ★ ContentState gate
Red Detector ● _compute_success (modified), ● dead-end guard (modified)
Purple Phase _compute_retry, contradiction guard
Teal State RETRIABLE promotion (drain-race rescue path)
Dark Teal Output Terminal outcome and _normalize_subtype

State Lifecycle Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    subgraph Fields ["SKILLRESULT FIELD LIFECYCLE"]
        direction LR
        INIT["INIT_ONLY<br/>━━━━━━━━━━<br/>session_id<br/>start_ts<br/>completion_marker<br/>NEVER modify after set"]
        APPEND["APPEND_ONLY<br/>━━━━━━━━━━<br/>assistant_messages<br/>only grows during recovery"]
        MUTABLE["MUTABLE<br/>━━━━━━━━━━<br/>● needs_retry<br/>● success<br/>● retry_reason<br/>● subtype<br/>computed then may be promoted"]
        DERIVED["DERIVED<br/>━━━━━━━━━━<br/>outcome<br/>mapped from success+needs_retry<br/>(SUCCEEDED / RETRIABLE / FAILED)"]
    end

    subgraph Recovery ["● RECOVERY MUTATIONS (headless.py)"]
        direction TB
        R1["_recover_from_separate_marker<br/>━━━━━━━━━━<br/>may overwrite session.result<br/>from assistant_messages"]
        R2["● _recover_block_from_assistant_messages<br/>━━━━━━━━━━<br/>appends to session.result<br/>now: CHANNEL_A + CHANNEL_B"]
    end

    subgraph ContentGate ["★ CONTENTSTATE VALIDATION GATE (session.py)"]
        direction TB
        ECS["★ _evaluate_content_state<br/>━━━━━━━━━━<br/>reads: session.result, is_error,<br/>completion_marker, patterns<br/>writes: ContentState discriminant"]
        CS_ENUM["★ ContentState enum<br/>━━━━━━━━━━<br/>COMPLETE — all checks pass<br/>ABSENT — empty/marker missing (retriable)<br/>CONTRACT_VIOLATION — result+marker+no patterns (terminal)<br/>SESSION_ERROR — is_error=True (terminal)"]
    end

    subgraph DeadEndGuard ["● DEAD-END GUARD CONTRACT (session.py)"]
        direction TB
        PRE["● _compute_outcome entry<br/>━━━━━━━━━━<br/>success=False AND<br/>needs_retry=False AND<br/>channel_confirmed?"]
        GATE{"★ ContentState gate<br/>━━━━━━━━━━<br/>ContentState.ABSENT?"}
        PROMOTE["PROMOTE<br/>━━━━━━━━━━<br/>needs_retry = True<br/>retry_reason = RESUME<br/>(drain-race rescue)"]
        BLOCK["BLOCK promotion<br/>━━━━━━━━━━<br/>needs_retry stays False<br/>outcome = FAILED<br/>(contract violation guard)"]
    end

    subgraph SkillResult ["SKILLRESULT OUTPUT CONTRACT"]
        direction TB
        SR["SkillResult<br/>━━━━━━━━━━<br/>success / needs_retry / subtype<br/>guaranteed: CONTRACT_VIOLATION<br/>never becomes RETRIABLE"]
    end

    INIT --> Recovery
    APPEND --> Recovery
    MUTABLE --> Recovery
    R1 --> R2
    R2 --> ECS
    ECS --> CS_ENUM
    CS_ENUM --> PRE
    PRE --> GATE
    GATE -->|"ABSENT"| PROMOTE
    GATE -->|"CONTRACT_VIOLATION<br/>or SESSION_ERROR"| BLOCK
    PROMOTE --> MUTABLE
    BLOCK --> MUTABLE
    MUTABLE --> DERIVED
    DERIVED --> SR

    class INIT detector;
    class APPEND handler;
    class MUTABLE phase;
    class DERIVED gap;
    class R1,R2 handler;
    class ECS,CS_ENUM,GATE newComponent;
    class PRE detector;
    class PROMOTE stateNode;
    class BLOCK output;
    class SR terminal;
Loading

Color Legend:

Color Category Description
Red INIT_ONLY / Guard Entry Fields that must never change; dead-end guard check
Orange APPEND_ONLY / Handlers assistant_messages growth; recovery functions (● modified)
Purple MUTABLE Fields computed and potentially promoted by guards
Yellow/Amber DERIVED Outcome derived from field combination
Green New Component ★ ContentState enum, ★ _evaluate_content_state, ★ gate
Teal Promotion needs_retry promoted for drain-race (ABSENT only)
Dark Teal Block Promotion blocked for terminal failures
Dark Blue Terminal SkillResult output contract

Implementation Plan

Plan file: temp/rectify/rectify_adjudicated-failure-false-positive_2026-03-16_000001_part_a.md

Token Usage Summary

No token data available for this pipeline run.

🤖 Generated with Claude Code via AutoSkillit

Trecek and others added 4 commits March 16, 2026 12:40
… logging test

- Add make_session fixture and import ContentState/_evaluate_content_state
- Add TestContentStateEnum: enum variants + parametrized _evaluate_content_state tests
- Add TestDeadEndGuardContentState: contract violation terminal, drain-race regression
- Update test_logs_dead_end_guard to use empty result (drain-race path)
- Add test_logs_dead_end_guard_terminal_not_promoted companion test
- Fix pre-existing line-too-long in server/_state.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… positives

The dead-end guard in _compute_outcome previously promoted ALL channel-confirmed
failures to RETRIABLE, conflating drain-race artifacts (transient) with pattern-
contract violations (terminal). This caused infinite retry cycles when a session
completed normally but the model did not produce expected output patterns.

- Add ContentState StrEnum (COMPLETE/ABSENT/CONTRACT_VIOLATION/SESSION_ERROR)
- Add _evaluate_content_state() to classify content failures by category
- Refactor _compute_success CHANNEL_B bypass to delegate to _evaluate_content_state
- Fix dead-end guard: only promote ContentState.ABSENT to RETRIABLE; CONTRACT_VIOLATION
  and SESSION_ERROR remain FAILED (terminal — retrying will not produce different output)
- Export ContentState from execution/__init__.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…etection

Two bugs introduced by the ContentState implementation:

1. _compute_success CHANNEL_B bypass used _evaluate_content_state, which
   checks is_error and completion_marker. CHANNEL_B confirmation overrides
   both — the bypass must only check expected_output_patterns (pattern
   contracts). Reverts to _check_expected_patterns for the bypass decision.

2. _evaluate_content_state returned COMPLETE for no-marker/no-patterns cases
   regardless of result emptiness. Empty results must return ABSENT so the
   dead-end guard can promote CHANNEL_A drain-race artifacts to RETRIABLE even
   when no content requirements are configured.

All 12 regression failures fixed; 4393 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Collaborator Author

@Trecek Trecek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AutoSkillit PR Review — Verdict: changes_requested

result="Some output without the marker",
assistant_messages=[],
)
outcome, retry_reason = _compute_outcome(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning] tests: session=session first argument to _compute_outcome is under-indented relative to other keyword arguments (column 8 vs column 12). ruff will flag this and auto-fix but abort the commit, requiring re-stage and retry.

Trecek and others added 3 commits March 16, 2026 13:48
…redundant comment

- completion_marker check at L414 now uses the already-stripped `result`
  variable (consistent with the empty-content check at L403 and pattern
  check at L418) — fixes inconsistent strip() usage flagged in review
- remove comment block (L444-L448) that restated the absence of skipped
  checks rather than clarifying non-obvious behaviour; PRECONDITION
  comment retained as it documents a caller contract

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…headless.py

Aligns with the strip() usage in session.py at L418 and L453, ensuring
pattern matching is not sensitive to leading/trailing whitespace
regardless of call site.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…with value assertions

Truthiness assertions (assert ContentState.COMPLETE) are always True for
non-empty StrEnum values and cannot detect misspellings or duplicate values.
Replace with explicit value equality checks (assert ContentState.COMPLETE == "complete")
to catch regressions in enum string values.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Trecek Trecek merged commit 391bf06 into skill-and-tool-access-control-tiered-visibility-cross-tier-s/306 Mar 16, 2026
@Trecek Trecek deleted the remediation/20260316 branch March 16, 2026 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant