Implementation Plan: Fleet Campaign Retry Blocked by Terminal FAILURE State#1968

Merged
Trecek merged 11 commits into develop from fleet-campaign-retry-blocked-by-terminal-failure-state/1695 on May 6, 2026

@Trecek Trecek commented May 6, 2026

Summary

When a fleet campaign dispatch fails, the FAILURE status is terminal — _ALLOWED_TRANSITIONS[FAILURE] is frozenset() with no outgoing transitions. This blocks explicit user retry (--resume) because both the has_failed_dispatch() halt guard in dispatch_food_truck and Phase 2 of resume_campaign_from_state unconditionally reject campaigns with any FAILURE record.

The fix adds a FAILURE → PENDING transition, a reset_failed_dispatch() function, and modifies the two halt check sites to distinguish between automatic continuation (should still halt) and explicit user retry (should reset the failed dispatch and re-execute).

Requirements

  • REQ-RETRY-001: A failed dispatch MUST be retryable without manual state file edits
  • REQ-RETRY-002: The halt-on-failure guard MUST still prevent automatic continuation to subsequent dispatches after an unacknowledged failure
  • REQ-RETRY-003: Retry of a failed dispatch MUST reset its state and re-execute it from scratch (not resume)
  • REQ-RETRY-004: The retry mechanism MUST be safe under concurrent access (respect existing _resume_lock + fcntl.LOCK_EX pattern)
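REQ-RETRY-004 refers to an existing `_resume_lock` + `fcntl.LOCK_EX` pattern. The PR does not show that code, so the following is only a minimal sketch of what such a pattern commonly looks like: an in-process `threading.Lock` combined with an exclusive advisory file lock around a read-modify-write of a JSON state file. The function name `update_state_locked`, the JSON layout, and the `mutate` callback are hypothetical; `fcntl` is available on Unix-like systems only.

```python
import fcntl
import json
import threading
from pathlib import Path

_resume_lock = threading.Lock()  # in-process guard; name taken from the PR text


def update_state_locked(state_path: Path, mutate) -> dict:
    """Read, mutate, and rewrite a JSON state file while holding both locks."""
    with _resume_lock:
        with open(state_path, "r+", encoding="utf-8") as fh:
            # Cross-process guard: blocks until no other process holds the lock.
            fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
            try:
                state = json.load(fh)
                mutate(state)
                fh.seek(0)
                fh.truncate()
                json.dump(state, fh)
                fh.flush()
            finally:
                fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
    return state
```

Keeping the read and the write inside the same locked region is what makes a reset safe against a concurrent resume touching the same file.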

Closes #1695

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/impl-20260505-180604-269785/.autoskillit/temp/make-plan/fleet_campaign_retry_blocked_by_terminal_failure_state_plan_2026-05-05_181500.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

| Step | count | uncached | output | cache_read | peak_ctx | turns | cache_write | time |
|---|---|---|---|---|---|---|---|---|
| plan | 1 | 48 | 20.0k | 642.0k | 81.3k | 92 | 73.4k | 12m 23s |
| verify | 1 | 41 | 13.6k | 1.0M | 59.1k | 91 | 47.9k | 6m 59s |
| implement | 1 | 881.3k | 14.3k | 959.3k | 62.0k | 115 | 52.6k | 6m 7s |
| prepare_pr | 1 | 60 | 5.1k | 190.4k | 36.6k | 18 | 25.0k | 1m 38s |
| compose_pr | 1 | 59 | 2.0k | 158.5k | 26.1k | 15 | 13.1k | 52s |
| review_pr | 3 | 514 | 128.0k | 4.1M | 115.1k | 190 | 254.6k | 31m 23s |
| resolve_review | 3 | 1.0k | 79.3k | 7.9M | 107.5k | 317 | 243.1k | 32m 24s |
| ci_conflict_fix | 3 | 387 | 22.0k | 1.7M | 59.9k | 104 | 92.9k | 6m 32s |
| diagnose_ci | 1 | 92 | 2.4k | 335.6k | 40.5k | 22 | 28.0k | 60s |
| resolve_ci | 1 | 110 | 3.2k | 459.7k | 44.2k | 32 | 31.2k | 4m 1s |
| Total | | 883.6k | 290.0k | 17.5M | 115.1k | | 861.9k | 1h 43m |

Token Efficiency

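| Step | LoC Changed | cache_read/LoC | cache_write/LoC | output/LoC |
|---|---|---|---|---|
| plan | 0 | | | |
| verify | 0 | | | |
| implement | 386 | 2485.3 | 136.4 | 37.0 |
| prepare_pr | 0 | | | |
| compose_pr | 0 | | | |
| review_pr | 0 | | | |
| resolve_review | 80 | 98697.9 | 3038.3 | 991.7 |
| ci_conflict_fix | 2865 | 598.5 | 32.4 | 7.7 |
| diagnose_ci | 0 | | | |
| resolve_ci | 1 | 459673.0 | 31163.0 | 3233.0 |
| Total | 3332 | 5238.1 | 258.7 | 87.0 |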

@Trecek Trecek left a comment

AutoSkillit PR Review — Verdict: changes_requested

Comment thread src/autoskillit/server/tools/tools_execution.py
Comment thread src/autoskillit/server/tools/tools_execution.py
Comment thread src/autoskillit/fleet/state.py
Comment thread src/autoskillit/fleet/state.py
Comment thread src/autoskillit/fleet/state.py Outdated
Comment thread src/autoskillit/fleet/state.py
Comment thread tests/fleet/test_retry_failed_dispatch.py
Comment thread tests/server/test_tools_dispatch_halt.py Outdated

@Trecek Trecek left a comment

AutoSkillit review: blocking issues found — changes required.

2 critical findings and 6 warnings were detected. See inline comments for details.

Critical issues (must fix):

  • tools_execution.py:680 [arch] Unconditional FAILURE reset in server layer (IL-3) — conflates dispatch naming with retry intent
  • tools_execution.py:681 [bugs] TOCTOU between reset_failed_dispatch and has_failed_dispatch + discarded bool return value
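The TOCTOU finding above concerns a reset and a follow-up failure check that run as two separate steps, with the reset's boolean result thrown away. One way to close that gap, sketched here under assumptions, is to perform both inside a single critical section and return both answers to the caller. The in-memory list of dicts, the lock `_state_lock`, and the function `reset_then_check` are hypothetical stand-ins, not the project's actual API.

```python
import threading

_state_lock = threading.Lock()  # stand-in for the real locking scheme


def reset_then_check(dispatches: list, name: str) -> tuple:
    """Reset the named FAILURE dispatch and report remaining failures atomically.

    Returns (was_reset, still_has_failure). Both values are computed under one
    lock, so no other thread can change the records between reset and check.
    """
    with _state_lock:
        was_reset = False
        for d in dispatches:
            if d["name"] == name and d["status"] == "failure":
                d["status"] = "pending"
                was_reset = True
                break
        still_has_failure = any(d["status"] == "failure" for d in dispatches)
    return was_reset, still_has_failure
```

Returning `was_reset` instead of discarding it lets the caller tell "retry of a failed dispatch" apart from "dispatch with that name was never in FAILURE".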

Warnings:

  • fleet/state.py:204,649 [cohesion] 10-field dispatch-clearing sequence duplicated verbatim in two functions — extract _clear_dispatch_for_retry(d)
  • fleet/state.py:645 [bugs] Missing break after reset — all FAILURE dispatches are bulk-reset (REQUIRES DECISION: intended?)
  • fleet/state.py:667 [bugs] _write_state in resume_campaign_from_state may lack fcntl.LOCK_EX coverage
  • test_retry_failed_dispatch.py:201 [tests] Noop test has no FAILURE dispatch — passes trivially
  • test_tools_dispatch_halt.py:213 [tests] Weak negative assertion on error code
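The first two warnings can be pictured together: a shared helper owns the field-clearing sequence, and the loop that calls it decides explicitly whether to stop after one match. The sketch below assumes a much smaller record than the real `DispatchRecord` (the review mentions ten cleared fields; the four here are invented for illustration), and `reset_failed` is a hypothetical wrapper, not a function from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DispatchRecord:
    # Hypothetical subset of fields; the real record clears ten.
    name: str
    status: str = "pending"
    dispatch_id: Optional[str] = None
    error: Optional[str] = None
    ended_at: Optional[float] = None


def _clear_dispatch_for_retry(d: DispatchRecord) -> None:
    """Single place that knows which fields a retry must wipe."""
    d.status = "pending"
    d.dispatch_id = None
    d.error = None
    d.ended_at = None


def reset_failed(records, name=None) -> int:
    """Reset FAILURE records; if name is given, stop after the first match."""
    count = 0
    for d in records:
        if d.status != "failure":
            continue
        if name is not None and d.name != name:
            continue
        _clear_dispatch_for_retry(d)
        count += 1
        if name is not None:
            break  # targeted retry: leave other failures untouched
    return count
```

With the helper in place, adding a field to the record means updating one function, and the presence or absence of `break` becomes a visible design choice rather than an accident.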

@Trecek Trecek left a comment

AutoSkillit PR Review — Verdict: changes_requested

Comment thread tests/fleet/test_retry_failed_dispatch.py
Comment thread tests/fleet/test_retry_failed_dispatch.py
Comment thread src/autoskillit/fleet/state.py

@Trecek Trecek left a comment

AutoSkillit review: 1 blocking finding and 2 warnings detected. See inline comments — changes required before merge.

Comment thread src/autoskillit/fleet/state.py
Comment thread src/autoskillit/fleet/state.py
state_path: Path,
continue_on_failure: bool,
*,
reset_on_retry: bool = False,

[info] cohesion: Two reset entry points exist with different semantics: reset_failed_dispatch (targets a single named dispatch) vs reset_on_retry=True here in resume_campaign_from_state (resets all FAILURE dispatches in one pass). The asymmetry is undocumented — add a cross-reference note on each so callers know which to use for targeted single-dispatch retry vs. bulk-reset-on-resume.

Valid observation — flagged for design decision. The asymmetry between reset_failed_dispatch (single targeted dispatch) and reset_on_retry=True in resume_campaign_from_state (all FAILURE dispatches in one pass) is intentional for different caller contexts (server tool layer vs. campaign resume algorithm) but undocumented.
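The cross-reference the reviewer asks for could be as small as a "See also" line in each docstring. The sketch below shows that idea on simplified stand-ins that operate on a plain `{name: status}` dict; the real functions take a state path and other parameters, so the signatures, the `resume_campaign` name, and the halt behaviour modelled here are assumptions for illustration only.

```python
def reset_failed_dispatch(dispatches: dict, name: str) -> bool:
    """Reset one named FAILURE dispatch to PENDING.

    See also: resume_campaign(..., reset_on_retry=True), which resets every
    FAILURE dispatch in one pass. Use this function for targeted retry.
    """
    if dispatches.get(name) != "failure":
        return False
    dispatches[name] = "pending"
    return True


def resume_campaign(dispatches: dict, *, reset_on_retry: bool = False) -> list:
    """Return the names of dispatches still to run, in order.

    With reset_on_retry=True every FAILURE dispatch is reset first (bulk
    semantics). See also: reset_failed_dispatch for a single dispatch.
    """
    if reset_on_retry:
        for name, status in dispatches.items():
            if status == "failure":
                dispatches[name] = "pending"
    # Without an explicit retry, an unacknowledged failure still halts.
    if any(s == "failure" for s in dispatches.values()):
        raise RuntimeError("campaign halted: unacknowledged failure")
    return [n for n, s in dispatches.items() if s == "pending"]
```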

Comment thread src/autoskillit/fleet/state.py
resume_campaign_from_state,
write_initial_state,
)
from autoskillit.fleet.state import _validate_transition

[info] tests: _validate_transition is imported directly from the private autoskillit.fleet.state submodule rather than the public autoskillit.fleet re-export surface. If intentionally testing internals, a comment explaining the necessity would help; if it should be public, add it to autoskillit.fleet.__init__.

Acknowledged — minor suggestion noted. _validate_transition is imported from the private module intentionally to test the internal transition guard directly. Making it public would widen the surface beyond its purpose; a comment explaining intent could be added.

Comment thread tests/fleet/test_retry_failed_dispatch.py
Comment thread tests/fleet/test_retry_failed_dispatch.py
Comment thread tests/fleet/test_retry_failed_dispatch.py Outdated
Comment thread tests/server/test_tools_dispatch_halt.py

@Trecek Trecek left a comment

AutoSkillit review: warning-only findings detected. See inline comments — no blocking changes required.

@Trecek Trecek force-pushed the fleet-campaign-retry-blocked-by-terminal-failure-state/1695 branch 2 times, most recently from b0813a5 to 531b230 on May 6, 2026 03:28
Trecek and others added 10 commits May 5, 2026 20:36
FAILURE status was terminal with no outgoing transitions, blocking user
retry via --resume. Both the halt guard in dispatch_food_truck and
Phase 2 of resume_campaign_from_state unconditionally rejected any
campaign with a FAILURE record.

Changes:
- Add FAILURE→PENDING transition to _ALLOWED_TRANSITIONS
- Add reset_failed_dispatch() function (thread-safe, clears metadata)
- Add reset_on_retry param to resume_campaign_from_state
- dispatch_food_truck halt guard resets matching dispatch_name
- CLI --resume calls resume_campaign_from_state with reset_on_retry=True

Fixes the retry blocked by terminal failure state issue.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…duplication

The 10-field dispatch-clearing sequence was verbatim duplicated in
reset_failed_dispatch and resume_campaign_from_state. A new field added
to DispatchRecord requiring reset would need to be updated in two places
with no static guard to catch the second site.
test_reset_on_retry_with_continue_on_failure_true_is_noop: add FAILURE
dispatch d2 so the test actually exercises the noop behavior — previously
it passed trivially with no FAILURE dispatch present.

test_dispatch_resets_and_proceeds_when_retrying_failed_dispatch: replace
weak negative assertion (error != X) with positive success assertion so
any other error is correctly caught.
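To see why a negative assertion of the form `error != X` is weak, consider a result that fails for an unrelated reason. The snippet below is only an illustration: the error codes and the result dict layout are hypothetical and are not taken from the test suite.

```python
HALT_ERROR = "fleet_campaign_halted"  # hypothetical error code


def weak_check(result: dict) -> bool:
    # Passes for any result that is not the halt error, including other errors.
    return result.get("error") != HALT_ERROR


def strong_check(result: dict) -> bool:
    # Passes only when the positive signal is actually present.
    return result.get("success") is True


# A result that failed for a reason unrelated to the halt guard.
unrelated_failure = {"success": False, "error": "executor_crashed"}
```

Here `weak_check(unrelated_failure)` is true even though the call failed, while `strong_check(unrelated_failure)` is false, which is the gap the commit message describes.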

assert result.get("success") is True fails because InMemoryHeadlessExecutor
returns dispatch_status=failure. The meaningful assertion is that dispatch_id
is present in the envelope: fleet halt errors never include dispatch_id,
only execution envelopes do.

FAILURE now allows a PENDING retry transition, so grouping it under
'Terminal states: no further transitions permitted' was inaccurate.
Move FAILURE before the terminal group with its own comment.

…tracts.py

Two docstrings introduced by develop/#1970 exceeded the 99-char ruff limit.
Shortened to pass pre-commit without changing assertion logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Trecek Trecek force-pushed the fleet-campaign-retry-blocked-by-terminal-failure-state/1695 branch from 531b230 to 88d916a on May 6, 2026 03:37
@Trecek Trecek added this pull request to the merge queue May 6, 2026
Merged via the queue into develop with commit 2ed6083 May 6, 2026
2 checks passed
@Trecek Trecek deleted the fleet-campaign-retry-blocked-by-terminal-failure-state/1695 branch May 6, 2026 04:04
Trecek added a commit that referenced this pull request May 8, 2026
… State (#1968)

## Summary

When a fleet campaign dispatch fails, the `FAILURE` status is terminal —
`_ALLOWED_TRANSITIONS[FAILURE]` is `frozenset()` with no outgoing
transitions. This blocks explicit user retry (`--resume`) because both
the `has_failed_dispatch()` halt guard in `dispatch_food_truck` and
Phase 2 of `resume_campaign_from_state` unconditionally reject campaigns
with any FAILURE record.

The fix adds a `FAILURE → PENDING` transition, a
`reset_failed_dispatch()` function, and modifies the two halt check
sites to distinguish between **automatic continuation** (should still
halt) and **explicit user retry** (should reset the failed dispatch and
re-execute).

## Requirements

- REQ-RETRY-001: A failed dispatch MUST be retryable without manual
state file edits
- REQ-RETRY-002: The halt-on-failure guard MUST still prevent automatic
continuation to subsequent dispatches after an unacknowledged failure
- REQ-RETRY-003: Retry of a failed dispatch MUST reset its state and
re-execute it from scratch (not resume)
- REQ-RETRY-004: The retry mechanism MUST be safe under concurrent
access (respect existing `_resume_lock` + `fcntl.LOCK_EX` pattern)

Closes #1695

## Implementation Plan

Plan file:
`/home/talon/projects/autoskillit-runs/impl-20260505-180604-269785/.autoskillit/temp/make-plan/fleet_campaign_retry_blocked_by_terminal_failure_state_plan_2026-05-05_181500.md`

🤖 Generated with [Claude Code](https://claude.com/claude-code) via
AutoSkillit
<!-- autoskillit:pipeline-signature
steps=prepare_pr,run_arch_lenses,compose_pr,annotate_pr_diff,review_pr
-->

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>