Implementation Plan: Fleet Campaign Retry Blocked by Terminal FAILURE State#1968
Conversation
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit PR Review — Verdict: changes_requested
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit review: blocking issues found — changes required.
2 critical findings and 6 warnings were detected. See inline comments for details.
Critical issues (must fix):
tools_execution.py:680[arch] Unconditional FAILURE reset in server layer (IL-3) — conflates dispatch naming with retry intenttools_execution.py:681[bugs] TOCTOU betweenreset_failed_dispatchandhas_failed_dispatch+ discarded bool return value
Warnings:
fleet/state.py:204,649[cohesion] 10-field dispatch-clearing sequence duplicated verbatim in two functions — extract_clear_dispatch_for_retry(d)fleet/state.py:645[bugs] Missingbreakafter reset — all FAILURE dispatches are bulk-reset (REQUIRES DECISION: intended?)fleet/state.py:667[bugs]_write_stateinresume_campaign_from_statemay lackfcntl.LOCK_EXcoveragetest_retry_failed_dispatch.py:201[tests] Noop test has no FAILURE dispatch — passes triviallytest_tools_dispatch_halt.py:213[tests] Weak negative assertion on error code
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit PR Review — Verdict: changes_requested
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit review: 1 blocking finding and 2 warnings detected. See inline comments — changes required before merge.
| state_path: Path, | ||
| continue_on_failure: bool, | ||
| *, | ||
| reset_on_retry: bool = False, |
There was a problem hiding this comment.
[info] cohesion: Two reset entry points exist with different semantics: reset_failed_dispatch (targets a single named dispatch) vs reset_on_retry=True here in resume_campaign_from_state (resets all FAILURE dispatches in one pass). The asymmetry is undocumented — add a cross-reference note on each so callers know which to use for targeted single-dispatch retry vs. bulk-reset-on-resume.
There was a problem hiding this comment.
Valid observation — flagged for design decision. The asymmetry between reset_failed_dispatch (single targeted dispatch) and reset_on_retry=True in resume_campaign_from_state (all FAILURE dispatches in one pass) is intentional for different caller contexts (server tool layer vs. campaign resume algorithm) but undocumented.
| resume_campaign_from_state, | ||
| write_initial_state, | ||
| ) | ||
| from autoskillit.fleet.state import _validate_transition |
There was a problem hiding this comment.
[info] tests: _validate_transition is imported directly from the private autoskillit.fleet.state submodule rather than the public autoskillit.fleet re-export surface. If intentionally testing internals, a comment explaining the necessity would help; if it should be public, add it to autoskillit.fleet.__init__.
There was a problem hiding this comment.
Acknowledged — minor suggestion noted. _validate_transition is imported from the private module intentionally to test the internal transition guard directly. Making it public would widen the surface beyond its purpose; a comment explaining intent could be added.
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit review: warning-only findings detected. See inline comments — no blocking changes required.
b0813a5 to
531b230
Compare
FAILURE status was terminal with no outgoing transitions, blocking user retry via --resume. Both the halt guard in dispatch_food_truck and Phase 2 of resume_campaign_from_state unconditionally rejected any campaign with a FAILURE record. Changes: - Add FAILURE→PENDING transition to _ALLOWED_TRANSITIONS - Add reset_failed_dispatch() function (thread-safe, clears metadata) - Add reset_on_retry param to resume_campaign_from_state - dispatch_food_truck halt guard resets matching dispatch_name - CLI --resume calls resume_campaign_from_state with reset_on_retry=True Fixes the retry blocked by terminal failure state issue. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…duplication The 10-field dispatch-clearing sequence was verbatim duplicated in reset_failed_dispatch and resume_campaign_from_state. A new field added to DispatchRecord requiring reset would need to be updated in two places with no static guard to catch the second site.
test_reset_on_retry_with_continue_on_failure_true_is_noop: add FAILURE dispatch d2 so the test actually exercises the noop behavior — previously it passed trivially with no FAILURE dispatch present. test_dispatch_resets_and_proceeds_when_retrying_failed_dispatch: replace weak negative assertion (error != X) with positive success assertion so any other error is correctly caught.
assert result.get("success") is True fails because InMemoryHeadlessExecutor
returns dispatch_status=failure. The meaningful assertion is that dispatch_id
is present in the envelope — fleet halt errors never include dispatch_id,
only execution envelopes do.
FAILURE now allows a PENDING retry transition, so grouping it under 'Terminal states: no further transitions permitted' was inaccurate. Move FAILURE before the terminal group with its own comment.
…ropagate as OSError
…ering verified, exact block equality
…tracts.py Two docstrings introduced by develop/#1970 exceeded the 99-char ruff limit. Shortened to pass pre-commit without changing assertion logic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
531b230 to
88d916a
Compare
… State (#1968) ## Summary When a fleet campaign dispatch fails, the `FAILURE` status is terminal — `_ALLOWED_TRANSITIONS[FAILURE]` is `frozenset()` with no outgoing transitions. This blocks explicit user retry (`--resume`) because both the `has_failed_dispatch()` halt guard in `dispatch_food_truck` and Phase 2 of `resume_campaign_from_state` unconditionally reject campaigns with any FAILURE record. The fix adds a `FAILURE → PENDING` transition, a `reset_failed_dispatch()` function, and modifies the two halt check sites to distinguish between **automatic continuation** (should still halt) and **explicit user retry** (should reset the failed dispatch and re-execute). ## Requirements - REQ-RETRY-001: A failed dispatch MUST be retryable without manual state file edits - REQ-RETRY-002: The halt-on-failure guard MUST still prevent automatic continuation to subsequent dispatches after an unacknowledged failure - REQ-RETRY-003: Retry of a failed dispatch MUST reset its state and re-execute it from scratch (not resume) - REQ-RETRY-004: The retry mechanism MUST be safe under concurrent access (respect existing `_resume_lock` + `fcntl.LOCK_EX` pattern) Closes #1695 ## Implementation Plan Plan file: `/home/talon/projects/autoskillit-runs/impl-20260505-180604-269785/.autoskillit/temp/make-plan/fleet_campaign_retry_blocked_by_terminal_failure_state_plan_2026-05-05_181500.md` 🤖 Generated with [Claude Code](https://claude.com/claude-code) via AutoSkillit <!-- autoskillit:pipeline-signature steps=prepare_pr,run_arch_lenses,compose_pr,annotate_pr_diff,review_pr --> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
When a fleet campaign dispatch fails, the
FAILUREstatus is terminal —_ALLOWED_TRANSITIONS[FAILURE]isfrozenset()with no outgoing transitions. This blocks explicit user retry (--resume) because both thehas_failed_dispatch()halt guard indispatch_food_truckand Phase 2 ofresume_campaign_from_stateunconditionally reject campaigns with any FAILURE record.The fix adds a
FAILURE → PENDINGtransition, areset_failed_dispatch()function, and modifies the two halt check sites to distinguish between automatic continuation (should still halt) and explicit user retry (should reset the failed dispatch and re-execute).Requirements
_resume_lock+fcntl.LOCK_EXpattern)Closes #1695
Implementation Plan
Plan file:
/home/talon/projects/autoskillit-runs/impl-20260505-180604-269785/.autoskillit/temp/make-plan/fleet_campaign_retry_blocked_by_terminal_failure_state_plan_2026-05-05_181500.md🤖 Generated with Claude Code via AutoSkillit
Token Usage Summary
Token Efficiency