Activity-Aware Fleet Dispatch Timeout + Quota Sleep Resumable Exit #2211

Merged
Trecek merged 11 commits into develop from activity-aware-fleet-dispatch-timeout-quota-sleep-resumable/2201
May 8, 2026

@Trecek Trecek commented May 8, 2026

Summary

Adds two coordinated improvements to fleet dispatch reliability. (1) An activity-aware deadline extension watcher extends the wall-clock timeout while child processes are CPU-active or an API connection is established, so that useful work (review cycles, CI watches, merge queue waits) is not killed by an unconditional timeout. (2) A budget-aware quota guard detects when the required sleep exceeds the remaining session budget and emits a clean sentinel exit instead of sleeping into a timeout. Quota exhaustion is classified as a retryable infrastructure failure rather than a halting logic failure.
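
To make the retryable-versus-halting distinction concrete, here is a minimal sketch of how a failure reason might be classified. The names `FailureKind`, `classify_failure`, and `should_retry` are hypothetical, and the reason string `"fleet_quota_exhausted"` is assumed to be the value behind `FLEET_QUOTA_EXHAUSTED`; the real `_INFRASTRUCTURE_FAILURE_REASONS` set very likely holds more entries.

```python
from enum import Enum


class FailureKind(Enum):
    INFRASTRUCTURE = "infrastructure"  # environment problem: safe to retry
    LOGIC = "logic"                    # problem in the work itself: halt


# Assumption: only the reason added by this PR is listed here; the real
# set in fleet/state_types.py is expected to contain other reasons too.
INFRASTRUCTURE_FAILURE_REASONS = frozenset({"fleet_quota_exhausted"})


def classify_failure(reason: str) -> FailureKind:
    """Map a failure reason string onto a retry policy bucket."""
    if reason in INFRASTRUCTURE_FAILURE_REASONS:
        return FailureKind.INFRASTRUCTURE
    return FailureKind.LOGIC


def should_retry(reason: str) -> bool:
    """Only infrastructure failures are retried; logic failures halt."""
    return classify_failure(reason) is FailureKind.INFRASTRUCTURE
```

With this shape, a dispatch that exits because of quota exhaustion can be re-queued later, while an unrecognized reason still halts.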

Requirements

n/a

Conflict Resolution Decisions

The following files had merge conflicts that were automatically resolved.

n/a

Architecture Impact

n/a

Implementation Plan

Plan files:

  • .autoskillit/temp/make-plan/activity_aware_fleet_dispatch_timeout_quota_sleep_resumable_exit_plan_2026-05-07_170500_part_a.md
  • .autoskillit/temp/make-plan/activity_aware_fleet_dispatch_timeout_quota_sleep_resumable_exit_plan_2026-05-07_170500_part_b.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

| Step | Model | count | uncached | output | cache_read | peak_ctx | turns | cache_write | time |
|------|-------|-------|----------|--------|------------|----------|-------|-------------|------|
| plan | claude-opus-4-6 | 1 | 101 | 37.5k | 2.0M | 110.6k | 175 | 110.9k | 16m 48s |
| verify | claude-opus-4-6 | 2 | 2.8k | 37.2k | 2.8M | 77.7k | 191 | 117.3k | 18m 11s |
| implement* | MiniMax-M2.7 | 2 | 232.8k | 73.8k | 14.7M | 21.0k | 415 | 0 | 35m 22s |
| prepare_pr* | MiniMax-M2.7 | 1 | 51.5k | 4.9k | 302.9k | 0 | 25 | 0 | 2m 22s |
| compose_pr* | MiniMax-M2.7 | 1 | 48.8k | 1.8k | 311.3k | 0 | 21 | 0 | 1m 2s |
| review_pr | claude-opus-4-6 | 3 | 106 | 129.4k | 2.1M | 108.8k | 107 | 262.0k | 34m 8s |
| resolve_review | claude-opus-4-6 | 3 | 2.4k | 62.2k | 4.9M | 100.8k | 201 | 209.0k | 34m 20s |
| ci_conflict_fix | claude-opus-4-6 | 1 | 38 | 5.8k | 929.5k | 67.9k | 40 | 55.0k | 2m 28s |
| Total | | | 338.5k | 352.7k | 28.1M | 110.6k | | 754.2k | 2h 24m |

* Step used a non-Anthropic provider; caching behavior may differ.

Token Efficiency

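| Step | LoC Changed | cache_read/LoC | cache_write/LoC | output/LoC |
|------|-------------|----------------|-----------------|------------|
| plan | 0 | | | |
| verify | 0 | | | |
| implement | 638 | 22962.5 | 0.0 | 115.7 |
| prepare_pr | 0 | | | |
| compose_pr | 0 | | | |
| review_pr | 0 | | | |
| resolve_review | 49 | 100528.5 | 4266.0 | 1268.4 |
| ci_conflict_fix | 1411 | 658.8 | 39.0 | 4.1 |
| Total | 2098 | 13372.3 | 359.5 | 168.1 |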

Model Usage Breakdown

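| Model | steps | uncached | output | cache_read | cache_write | time |
|-------|-------|----------|--------|------------|-------------|------|
| claude-opus-4-6 | 3 | 3.0k | 85.7k | 6.4M | 318.9k | 56m 39s |
| MiniMax-M2.7 | 3 | 333.0k | 80.6k | 15.3M | 0 | 38m 47s |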

Trecek and others added 11 commits May 7, 2026 20:09
Add _watch_child_activity coroutine to run_managed_async that extends
the wall-clock CancelScope.deadline when child processes are CPU-active
or an API connection is ESTABLISHED. This prevents the unconditional
3600s wall-clock timeout from killing fleet dispatch sessions that are
doing useful work (review cycles, CI watches, merge queue waits).

Changes:
- Add enable_deadline_extension, max_extension_seconds, idle_output_timeout
  fields to FleetConfig with defaults (True, 7200, 1800)
- Update defaults.yaml and settings.py loader for new FleetConfig fields
- Add enable_deadline_extension and max_extension_seconds to SubprocessRunner
  protocol and DefaultSubprocessRunner.__call__
- Implement _watch_child_activity in _process_race.py using existing
  _has_active_child_processes and _has_active_api_connection probes
- Wire deadline extension watcher into run_managed_async task group
  (started after idle watcher, before tracing)
- Thread new kwargs through _execute_claude_headless and dispatch_food_truck
- Add 3-tier idle output timeout priority (caller > fleet > run_skill)
- Update RecordingSubprocessRunner to propagate new kwargs
- Add test_process_deadline_extension.py with 6 tests for watcher behavior
- Add FleetConfig field tests to test_fleet_config.py
- Add protocol compliance and param-set tests to test_protocol_satisfaction.py

Fix: use original_deadline + 2*poll_interval for desired deadline rather
than anyio.current_time() which doesn't align with CancelScope clock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The AUTOSKILLIT_IDLE_OUTPUT_TIMEOUT env var was injected into
merged_extras after build_food_truck_cmd had already consumed it,
so the env var never reached the subprocess environment.
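
The ordering problem described here can be reproduced in a few lines. This is only an illustrative sketch: `build_cmd` is a hypothetical stand-in for `build_food_truck_cmd` (assumed to snapshot the extras it is given), and the "fixed" variant simply assumes the injection was moved before the command is built.

```python
def build_cmd(env_extras: dict) -> dict:
    """Hypothetical stand-in: snapshots the extras into the subprocess env."""
    return dict(env_extras)


def dispatch_env_buggy(idle_output_timeout: float) -> dict:
    merged_extras: dict = {}
    env = build_cmd(merged_extras)  # extras are consumed here
    # Too late: the snapshot above no longer sees this key.
    merged_extras["AUTOSKILLIT_IDLE_OUTPUT_TIMEOUT"] = str(idle_output_timeout)
    return env


def dispatch_env_fixed(idle_output_timeout: float) -> dict:
    # Inject first, then build, so the variable reaches the subprocess env.
    merged_extras = {"AUTOSKILLIT_IDLE_OUTPUT_TIMEOUT": str(idle_output_timeout)}
    return build_cmd(merged_extras)
```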

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The test was setting run_skill.idle_output_timeout but fleet config's
idle_output_timeout (default 1800) now takes priority over run_skill
in dispatch_food_truck. Updated to set fleet.idle_output_timeout.
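
The 3-tier priority (caller > fleet > run_skill) that made this test change necessary can be sketched as a first-non-None lookup. The function name and the use of `None` to mean "not set" are assumptions for illustration, not the actual code in `dispatch_food_truck`.

```python
from typing import Optional


def resolve_idle_output_timeout(
    caller: Optional[float],
    fleet: Optional[float],
    run_skill: Optional[float],
) -> Optional[float]:
    """Return the highest-priority timeout that is set (caller > fleet > run_skill)."""
    for candidate in (caller, fleet, run_skill):
        if candidate is not None:
            return candidate
    return None  # nothing configured at any tier
```

Because the fleet default is 1800, a value set only on `run_skill` is shadowed unless the fleet tier is unset, which is why the test had to set `fleet.idle_output_timeout` instead.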

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When quota guard detects that required sleep exceeds the session's
remaining wall-clock budget, emit a clean sentinel exit directive
instead of sleep-and-retry. This prevents fleet dispatch sessions
from being killed by timeout mechanisms.

Changes:
- quota_guard.py: add QUOTA_BUDGET_EXCEEDED_TRIGGER constant and
  session-deadline budget check before deny message
- fleet/_api.py: inject AUTOSKILLIT_SESSION_DEADLINE env var in
  _run_dispatch env_extras (started_at + timeout_sec)
- hooks/__init__.py: re-export QUOTA_BUDGET_EXCEEDED_TRIGGER
- fleet/_prompts.py: add budget-exceeded routing instruction in
  QUOTA DENIAL ROUTING section
- fleet/state_types.py: add FLEET_QUOTA_EXHAUSTED to
  _INFRASTRUCTURE_FAILURE_REASONS
- cli/_prompts_campaign.py: update QUOTA RETRY trigger to match
  both quota_exhausted and fleet_quota_exhausted reasons
- skills/sous-chef/SKILL.md: document budget-exceeded denial protocol
- tests: quota guard budget tests (3), outcome classifier tests (2),
  fleet prompt routing test (1), contract test (1)
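
A rough sketch of the budget check described above, under stated assumptions: `AUTOSKILLIT_SESSION_DEADLINE` is taken to be an epoch timestamp in seconds, the trigger string is a placeholder (the real value of `QUOTA_BUDGET_EXCEEDED_TRIGGER` is not shown in this PR text), and `quota_action` is a hypothetical name.

```python
import os
import time
from typing import Mapping, Optional

# Placeholder value; the real constant lives in quota_guard.py.
QUOTA_BUDGET_EXCEEDED_TRIGGER = "QUOTA_BUDGET_EXCEEDED"


def quota_action(
    required_sleep_sec: float,
    *,
    now: Optional[float] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the sentinel trigger if sleeping would overrun the session, else 'sleep'."""
    env = os.environ if env is None else env
    now = time.time() if now is None else now
    raw = env.get("AUTOSKILLIT_SESSION_DEADLINE")
    if raw is None:
        return "sleep"  # no deadline known: keep the sleep-and-retry behavior
    try:
        deadline = float(raw)  # assumed format: epoch seconds
    except ValueError:
        return "sleep"  # unparseable deadline: fall back to the old behavior
    remaining = deadline - now
    if required_sleep_sec > remaining:
        return QUOTA_BUDGET_EXCEEDED_TRIGGER  # exit cleanly instead of sleeping
    return "sleep"
```

Falling back to `"sleep"` when the variable is missing keeps sessions launched without a deadline on the previous sleep-and-retry path.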

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…_watch_child_activity

The variable is captured on first non-None scope observation, not at scope
creation time. The new name accurately describes the semantics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…d 3600.0

The session deadline fallback was hardcoded as 3600.0 instead of reading
from FleetConfig.default_timeout_sec, creating a second source of truth.
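
As a small illustration of the single-source-of-truth fix, the deadline could be derived as below. `session_deadline_env` is a hypothetical helper, and serializing the deadline as a stringified epoch float is an assumption; the PR text only states that the value is `started_at + timeout_sec`.

```python
from typing import Optional


def session_deadline_env(
    started_at: float,
    timeout_sec: Optional[float],
    default_timeout_sec: float,
) -> dict:
    """Build the env extra carrying the session deadline.

    default_timeout_sec is expected to come from FleetConfig.default_timeout_sec
    rather than a hardcoded literal.
    """
    effective = default_timeout_sec if timeout_sec is None else timeout_sec
    return {"AUTOSKILLIT_SESSION_DEADLINE": str(started_at + effective)}
```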

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…_timeout to float

SubprocessRunner protocol and run_managed_async both declare these as float.
Aligns the config dataclass types with the downstream protocol for consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ved deadline

The desired extension deadline was computed from _first_observed_deadline
which is captured once and never updates. This caused the extension to fire
at most once. Using anyio.current_time() ensures continuous extension while
child processes remain active, capped at max_extension_seconds.

Tests updated to use short initial timeouts (0.1s) so the watcher actually
needs to extend the deadline, and the cap assertion tolerance removed since
min(desired, cap) enforces the cap exactly.
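
The per-poll deadline computation can be sketched as a pure function, which also shows why anchoring on the current time allows repeated extension. Everything here is an assumption layered on the commit text: the name `next_deadline`, the headroom of `2 * poll_interval` (carried over from the earlier commit), and the cap being measured from the first observed deadline. In the real watcher `now` would come from `anyio.current_time()` and the result would be assigned to `CancelScope.deadline`.

```python
def next_deadline(
    now: float,
    current_deadline: float,
    first_observed_deadline: float,
    *,
    active: bool,
    poll_interval: float,
    max_extension_seconds: float,
) -> float:
    """Return the deadline to set after one poll of child activity."""
    if not active:
        return current_deadline  # idle: let the existing deadline stand
    # Assumption: the cap is relative to the first deadline the watcher saw.
    cap = first_observed_deadline + max_extension_seconds
    # Anchoring on `now` lets each active poll push the deadline forward again.
    desired = now + 2 * poll_interval
    # Never move the deadline earlier; extensions stop at the cap.
    return max(current_deadline, min(desired, cap))
```

Had `desired` been computed from `first_observed_deadline` instead of `now`, every poll would return the same value, which matches the "fires at most once" behavior this commit fixes.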

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…Config

validate() checked max_extension_seconds > 0 but not idle_output_timeout.
A negative value would pass validation and propagate incorrectly downstream.
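
A minimal sketch of the config fields and the tightened validation, using the defaults and float types stated in the earlier commits. The class name `FleetConfigSketch` is deliberately not the real one, other `FleetConfig` fields are omitted, and rejecting only negative `idle_output_timeout` values (rather than zero as well) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FleetConfigSketch:
    """Illustrative subset of FleetConfig; not the real dataclass."""

    enable_deadline_extension: bool = True
    max_extension_seconds: float = 7200.0
    idle_output_timeout: float = 1800.0

    def validate(self) -> None:
        if self.max_extension_seconds <= 0:
            raise ValueError("max_extension_seconds must be > 0")
        # Assumption: zero is tolerated, only negative values are rejected.
        if self.idle_output_timeout < 0:
            raise ValueError("idle_output_timeout must not be negative")
```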

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ine computation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Trecek Trecek force-pushed the activity-aware-fleet-dispatch-timeout-quota-sleep-resumable/2201 branch from 8e6cbde to 60acbc1 May 8, 2026 03:10
@Trecek Trecek added this pull request to the merge queue May 8, 2026
Merged via the queue into develop with commit 17c9f22 May 8, 2026
2 checks passed
@Trecek Trecek deleted the activity-aware-fleet-dispatch-timeout-quota-sleep-resumable/2201 branch May 8, 2026 03:22
Trecek added a commit that referenced this pull request May 8, 2026
…2211)

## Summary

Adds two coordinated improvements to fleet dispatch reliability: (1) an
activity-aware deadline extension watcher that extends wall-clock
timeouts when child processes are actively running, preventing useful
work (review cycles, CI watches, merge queue waits) from being killed by
unconditional timeouts; and (2) a budget-aware quota guard that detects
when required sleep exceeds the remaining session budget and emits a
clean sentinel exit instead of sleeping into a timeout, with quota
exhaustion classified as a retryable infrastructure failure rather than
a halting logic failure.

## Requirements

n/a

## Conflict Resolution Decisions

The following files had merge conflicts that were automatically
resolved.

n/a

## Architecture Impact

n/a

## Implementation Plan

Plan files:
- `.autoskillit/temp/make-plan/activity_aware_fleet_dispatch_timeout_quota_sleep_resumable_exit_plan_2026-05-07_170500_part_a.md`
- `.autoskillit/temp/make-plan/activity_aware_fleet_dispatch_timeout_quota_sleep_resumable_exit_plan_2026-05-07_170500_part_b.md`

🤖 Generated with [Claude Code](https://claude.com/claude-code) via
AutoSkillit
<!-- autoskillit:pipeline-signature
steps=prepare_pr,run_arch_lenses,compose_pr,annotate_pr_diff,review_pr
-->

## Token Usage Summary

| Step | Model | count | uncached | output | cache_read | peak_ctx | turns | cache_write | time |
|------|-------|-------|----------|--------|------------|----------|-------|-------------|------|
| plan | claude-opus-4-6 | 1 | 101 | 37.5k | 2.0M | 110.6k | 175 | 110.9k | 16m 48s |
| verify | claude-opus-4-6 | 2 | 2.8k | 37.2k | 2.8M | 77.7k | 191 | 117.3k | 18m 11s |
| implement* | MiniMax-M2.7 | 2 | 232.8k | 73.8k | 14.7M | 21.0k | 415 | 0 | 35m 22s |
| fix | claude-opus-4-6 | 2 | 100 | 11.0k | 1.6M | 65.4k | 68 | 90.8k | 21m 38s |
| prepare_pr* | MiniMax-M2.7 | 1 | 51.5k | 4.9k | 302.9k | 0 | 25 | 0 | 2m 22s |
| compose_pr* | MiniMax-M2.7 | 1 | 48.8k | 1.8k | 311.3k | 0 | 21 | 0 | 1m 2s |
| **Total** | | | 336.0k | 166.3k | 21.7M | 110.6k | | 318.9k | 1h 35m |

\* *Step used a non-Anthropic provider; caching behavior may differ.*

## Token Efficiency

| Step | LoC Changed | cache_read/LoC | cache_write/LoC | output/LoC |
|------|-------------|----------------|-----------------|------------|
| plan | 0 | — | — | — |
| verify | 0 | — | — | — |
| implement | 638 | 22962.5 | 0.0 | 115.7 |
| fix | 26 | 61745.6 | 3492.2 | 422.5 |
| prepare_pr | 0 | — | — | — |
| compose_pr | 0 | — | — | — |
| **Total** | **664** | 32683.1 | 480.3 | 250.4 |

## Model Usage Breakdown

| Model | steps | uncached | output | cache_read | cache_write | time |
|-------|-------|----------|--------|------------|-------------|------|
| claude-opus-4-6 | 3 | 3.0k | 85.7k | 6.4M | 318.9k | 56m 39s |
| MiniMax-M2.7 | 3 | 333.0k | 80.6k | 15.3M | 0 | 38m 47s |

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>