Summary
When a subscription token (CLAUDE_CODE_OAUTH_TOKEN) expires, some agent executions fail fast (~3-4 min with clear error), but others hang for the full execution timeout (up to 3600s) before being recovered by the watchdog. This wastes execution slots for an hour, blocks other scheduled tasks, and creates cascading failures across the fleet.
Component
Backend / Task Execution Service / Agent Runtime
Priority
P1
Error
Fast failure (correct behavior):
Task execution failed (exit code 1): Subscription token may be expired or revoked. Generate a new one with 'claude setup-token'.
Slow failure (bug — same root cause, different symptom):
Task execution timed out after 3600 seconds
Or recovered by watchdog:
Execution completed on agent but status not reported — recovered by watchdog
Location
- File: src/backend/services/task_execution_service.py
- Function: Task execution dispatch and response handling
Root Cause
When Claude Code encounters an expired OAuth token, it sometimes exits cleanly with exit code 1 (fast failure), but other times it appears to hang — possibly retrying the token, waiting for user input, or crashing in a way that doesn't produce a clean exit. The backend has no mechanism to detect an auth failure early; it just waits for the full execution timeout.
The inconsistency suggests Claude Code's behavior on expired tokens is non-deterministic — sometimes it fails fast, sometimes it enters a retry/hang state.
Reproduction Steps
- Register a subscription with a token that will expire soon
- Assign the subscription to multiple agents
- Wait for the token to expire
- Trigger scheduled executions on those agents
- Observe: some fail in ~3 minutes, others hang for the full timeout (e.g., 3600s)
- During the hang period, the execution slot is occupied, blocking other tasks
Suggested Fix
Three-part fix:
1. Early auth validation before dispatching execution:
# Before calling /api/task on the agent, do a lightweight auth check
# e.g., verify token validity with a quick API call or check token expiry timestamp
2. Detect auth failure pattern in execution output and abort early:
# When streaming execution output, if "Subscription token may be expired"
# appears in stderr, terminate the execution immediately rather than
# waiting for the full timeout
3. Consider storing token expiry metadata when registering subscriptions, so the platform can proactively warn or fail before dispatching.
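A minimal sketch of fixes 1 and 2, assuming hypothetical helper names (`token_expired`, `should_abort_early`) and that expiry metadata is stored as a Unix timestamp at subscription registration; the real service interfaces may differ:

```python
import time

# The exact error string quoted in this report; used to detect auth
# failures in streamed execution output.
AUTH_FAILURE_PATTERN = "Subscription token may be expired"


def token_expired(expires_at, now=None):
    """Fix 1: pre-dispatch check. If expiry metadata was recorded when the
    subscription was registered, refuse to dispatch instead of occupying an
    execution slot for up to 3600s."""
    if expires_at is None:  # no metadata recorded -> cannot pre-check
        return False
    return (now if now is not None else time.time()) >= expires_at


def should_abort_early(stderr_line):
    """Fix 2: streaming check. While reading execution output, terminate the
    execution as soon as the known auth-failure message appears, rather than
    waiting for the full execution timeout."""
    return AUTH_FAILURE_PATTERN in stderr_line
```

The dispatch path would call `token_expired()` before hitting `/api/task` on the agent, and the output-streaming loop would call `should_abort_early()` on each stderr line, killing the process and freeing the slot on a match.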
Environment
- Trinity version: 2798ca9
- Docker version: 29.3.0
- OS: Ubuntu (GCP VM)
Related
src/backend/services/task_execution_service.py — execution dispatch
src/backend/services/cleanup_service.py — watchdog recovery
src/backend/services/slot_service.py — slot TTL management