Activity-Aware Fleet Dispatch Timeout + Quota Sleep Resumable Exit #2211
Merged: Trecek merged 11 commits on May 8, 2026
Add _watch_child_activity coroutine to run_managed_async that extends the wall-clock CancelScope.deadline when child processes are CPU-active or an API connection is ESTABLISHED. This prevents the unconditional 3600s wall-clock timeout from killing fleet dispatch sessions that are doing useful work (review cycles, CI watches, merge queue waits).

Changes:
- Add enable_deadline_extension, max_extension_seconds, idle_output_timeout fields to FleetConfig with defaults (True, 7200, 1800)
- Update defaults.yaml and settings.py loader for new FleetConfig fields
- Add enable_deadline_extension and max_extension_seconds to SubprocessRunner protocol and DefaultSubprocessRunner.__call__
- Implement _watch_child_activity in _process_race.py using existing _has_active_child_processes and _has_active_api_connection probes
- Wire deadline extension watcher into run_managed_async task group (started after idle watcher, before tracing)
- Thread new kwargs through _execute_claude_headless and dispatch_food_truck
- Add 3-tier idle output timeout priority (caller > fleet > run_skill)
- Update RecordingSubprocessRunner to propagate new kwargs
- Add test_process_deadline_extension.py with 6 tests for watcher behavior
- Add FleetConfig field tests to test_fleet_config.py
- Add protocol compliance and param-set tests to test_protocol_satisfaction.py

Fix: use original_deadline + 2*poll_interval for the desired deadline rather than anyio.current_time(), which doesn't align with the CancelScope clock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The AUTOSKILLIT_IDLE_OUTPUT_TIMEOUT env var was injected into merged_extras after build_food_truck_cmd had already consumed it, so the env var never reached the subprocess environment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test was setting run_skill.idle_output_timeout but fleet config's idle_output_timeout (default 1800) now takes priority over run_skill in dispatch_food_truck. Updated to set fleet.idle_output_timeout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
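The priority order named in this PR (caller > fleet > run_skill) can be sketched as a first-non-None lookup. This is a minimal illustration, not the project's code: the function name `resolve_idle_output_timeout` and the use of `None` to mean "not set" are assumptions.

```python
from typing import Optional


def resolve_idle_output_timeout(
    caller: Optional[float],
    fleet: Optional[float],
    run_skill: Optional[float],
) -> Optional[float]:
    """Return the first configured timeout in priority order.

    Hypothetical helper: None is assumed to mean "tier not configured".
    """
    for candidate in (caller, fleet, run_skill):
        if candidate is not None:
            return candidate
    return None  # no tier configured a timeout


# With the fleet default (1800) present, a run_skill value is shadowed,
# which is why the test had to set the fleet-level value instead.
print(resolve_idle_output_timeout(None, 1800.0, 600.0))
```

Under this reading, a test that only sets the run_skill tier never sees its value used once the fleet tier has a default.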
When the quota guard detects that the required sleep exceeds the session's remaining wall-clock budget, emit a clean sentinel exit directive instead of sleep-and-retry. This prevents fleet dispatch sessions from being killed by timeout mechanisms.

Changes:
- quota_guard.py: add QUOTA_BUDGET_EXCEEDED_TRIGGER constant and session-deadline budget check before deny message
- fleet/_api.py: inject AUTOSKILLIT_SESSION_DEADLINE env var in _run_dispatch env_extras (started_at + timeout_sec)
- hooks/__init__.py: re-export QUOTA_BUDGET_EXCEEDED_TRIGGER
- fleet/_prompts.py: add budget-exceeded routing instruction in QUOTA DENIAL ROUTING section
- fleet/state_types.py: add FLEET_QUOTA_EXHAUSTED to _INFRASTRUCTURE_FAILURE_REASONS
- cli/_prompts_campaign.py: update QUOTA RETRY trigger to match both quota_exhausted and fleet_quota_exhausted reasons
- skills/sous-chef/SKILL.md: document budget-exceeded denial protocol
- tests: quota guard budget tests (3), outcome classifier tests (2), fleet prompt routing test (1), contract test (1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
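A rough sketch of the budget check described above, under stated assumptions: the env var is taken to hold epoch seconds as a decimal string, the sentinel's text is a placeholder (the real value is not shown on this page), and `quota_directive` is a hypothetical name.

```python
import time
from typing import Mapping, Optional

# Placeholder text; the actual constant's value is not visible in this PR page.
QUOTA_BUDGET_EXCEEDED_TRIGGER = "QUOTA_BUDGET_EXCEEDED"


def quota_directive(
    required_sleep_sec: float,
    env: Mapping[str, str],
    now: Optional[float] = None,
) -> str:
    """Return "sleep" or the sentinel, depending on the remaining budget."""
    now = time.time() if now is None else now
    raw = env.get("AUTOSKILLIT_SESSION_DEADLINE")
    if raw is None:
        return "sleep"  # no deadline known: keep the sleep-and-retry path
    remaining = float(raw) - now  # assumes the deadline is epoch seconds
    if required_sleep_sec > remaining:
        # Sleeping would outlast the session, so exit cleanly instead.
        return QUOTA_BUDGET_EXCEEDED_TRIGGER
    return "sleep"


env = {"AUTOSKILLIT_SESSION_DEADLINE": "1000.0"}
print(quota_directive(600.0, env, now=500.0))  # 600s needed, 500s left
```

Passing `now` explicitly keeps the check deterministic in tests; production code would presumably read the clock itself.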
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_watch_child_activity The variable is captured on first non-None scope observation, not at scope creation time. The new name accurately describes the semantics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d 3600.0 The session deadline fallback was hardcoded as 3600.0 instead of reading from FleetConfig.default_timeout_sec, creating a second source of truth. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_timeout to float SubprocessRunner protocol and run_managed_async both declare these as float. Aligns the config dataclass types with the downstream protocol for consistency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ved deadline The desired extension deadline was computed from _first_observed_deadline which is captured once and never updates. This caused the extension to fire at most once. Using anyio.current_time() ensures continuous extension while child processes remain active, capped at max_extension_seconds. Tests updated to use short initial timeouts (0.1s) so the watcher actually needs to extend the deadline, and the cap assertion tolerance removed since min(desired, cap) enforces the cap exactly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
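The capped, clock-based computation this commit describes might look like the sketch below. Several details are assumptions rather than facts from the PR: the `2 * poll_interval` look-ahead is borrowed from the earlier commit message, the cap is taken to be the original deadline plus `max_extension_seconds`, and the function never shortens the current deadline.

```python
def extended_deadline(
    now: float,
    current_deadline: float,
    original_deadline: float,
    poll_interval: float,
    max_extension_seconds: float,
) -> float:
    """Return the deadline to apply after one poll that saw activity."""
    cap = original_deadline + max_extension_seconds  # assumed cap semantics
    desired = now + 2 * poll_interval  # recomputed from the clock each poll
    # min() enforces the cap exactly; max() keeps the deadline from shrinking.
    return max(current_deadline, min(desired, cap))


# Five active polls, 5s apart, against an original deadline of t=100
# and a 20s extension cap.
deadline = 100.0
for now in (95.0, 100.0, 105.0, 110.0, 115.0):
    deadline = extended_deadline(now, deadline, 100.0, 5.0, 20.0)
print(deadline)  # 120.0
```

Because `desired` is derived from `now`, each active poll can move the deadline again; a value derived from a deadline captured once would yield the same target on every poll, so the extension would fire at most once.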
…Config validate() checked max_extension_seconds > 0 but not idle_output_timeout. A negative value would pass validation and propagate incorrectly downstream. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
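A minimal sketch of the validation gap being closed, assuming a dataclass with the defaults listed earlier in this PR. The class name `FleetConfigSketch`, the `ValueError` exception type, and the choice to allow an idle timeout of exactly zero are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class FleetConfigSketch:
    """Hypothetical stand-in carrying only the three fields added in this PR."""

    enable_deadline_extension: bool = True
    max_extension_seconds: float = 7200.0
    idle_output_timeout: float = 1800.0

    def validate(self) -> None:
        if self.max_extension_seconds <= 0:
            raise ValueError("max_extension_seconds must be > 0")
        # The check that was missing: a negative value used to pass through.
        if self.idle_output_timeout < 0:
            raise ValueError("idle_output_timeout must not be negative")


FleetConfigSketch().validate()  # defaults are valid, so nothing is raised
```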
…ine computation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8e6cbde to 60acbc1
Trecek added a commit that referenced this pull request on May 8, 2026
…2211)

## Summary

Adds two coordinated improvements to fleet dispatch reliability: (1) an activity-aware deadline extension watcher that extends wall-clock timeouts when child processes are actively running, preventing useful work (review cycles, CI watches, merge queue waits) from being killed by unconditional timeouts; and (2) a budget-aware quota guard that detects when required sleep exceeds the remaining session budget and emits a clean sentinel exit instead of sleeping into a timeout, with quota exhaustion classified as a retryable infrastructure failure rather than a halting logic failure.

## Requirements

n/a

## Conflict Resolution Decisions

The following files had merge conflicts that were automatically resolved.

n/a

## Architecture Impact

n/a

## Implementation Plan

Plan files:
- `.autoskillit/temp/make-plan/activity_aware_fleet_dispatch_timeout_quota_sleep_resumable_exit_plan_2026-05-07_170500_part_a.md`
- `.autoskillit/temp/make-plan/activity_aware_fleet_dispatch_timeout_quota_sleep_resumable_exit_plan_2026-05-07_170500_part_b.md`

🤖 Generated with [Claude Code](https://claude.com/claude-code) via AutoSkillit

<!-- autoskillit:pipeline-signature steps=prepare_pr,run_arch_lenses,compose_pr,annotate_pr_diff,review_pr -->

## Token Usage Summary

| Step | Model | count | uncached | output | cache_read | peak_ctx | turns | cache_write | time |
|------|-------|-------|----------|--------|------------|----------|-------|-------------|------|
| plan | claude-opus-4-6 | 1 | 101 | 37.5k | 2.0M | 110.6k | 175 | 110.9k | 16m 48s |
| verify | claude-opus-4-6 | 2 | 2.8k | 37.2k | 2.8M | 77.7k | 191 | 117.3k | 18m 11s |
| implement* | MiniMax-M2.7 | 2 | 232.8k | 73.8k | 14.7M | 21.0k | 415 | 0 | 35m 22s |
| fix | claude-opus-4-6 | 2 | 100 | 11.0k | 1.6M | 65.4k | 68 | 90.8k | 21m 38s |
| prepare_pr* | MiniMax-M2.7 | 1 | 51.5k | 4.9k | 302.9k | 0 | 25 | 0 | 2m 22s |
| compose_pr* | MiniMax-M2.7 | 1 | 48.8k | 1.8k | 311.3k | 0 | 21 | 0 | 1m 2s |
| **Total** | | | 336.0k | 166.3k | 21.7M | 110.6k | | 318.9k | 1h 35m |

\* *Step used a non-Anthropic provider; caching behavior may differ.*

## Token Efficiency

| Step | LoC Changed | cache_read/LoC | cache_write/LoC | output/LoC |
|------|-------------|----------------|-----------------|------------|
| plan | 0 | — | — | — |
| verify | 0 | — | — | — |
| implement | 638 | 22962.5 | 0.0 | 115.7 |
| fix | 26 | 61745.6 | 3492.2 | 422.5 |
| prepare_pr | 0 | — | — | — |
| compose_pr | 0 | — | — | — |
| **Total** | **664** | 32683.1 | 480.3 | 250.4 |

## Model Usage Breakdown

| Model | steps | uncached | output | cache_read | cache_write | time |
|-------|-------|----------|--------|------------|-------------|------|
| claude-opus-4-6 | 3 | 3.0k | 85.7k | 6.4M | 318.9k | 56m 39s |
| MiniMax-M2.7 | 3 | 333.0k | 80.6k | 15.3M | 0 | 38m 47s |

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>