Skip to content

bug: expired subscription token causes hour-long zombie executions instead of fast failure #285

@vybe

Description

@vybe

Summary

When a subscription token (CLAUDE_CODE_OAUTH_TOKEN) expires, some agent executions fail fast (~3-4 min with clear error), but others hang for the full execution timeout (up to 3600s) before being recovered by the watchdog. This wastes execution slots for an hour, blocks other scheduled tasks, and creates cascading failures across the fleet.

Component

Backend / Task Execution Service / Agent Runtime

Priority

P1

Error

Fast failure (correct behavior):

Task execution failed (exit code 1): Subscription token may be expired or revoked. Generate a new one with 'claude setup-token'.

Slow failure (bug — same root cause, different symptom):

Task execution timed out after 3600 seconds

Or recovered by watchdog:

Execution completed on agent but status not reported — recovered by watchdog

Location

  • File: src/backend/services/task_execution_service.py
  • Function: Task execution dispatch and response handling

Root Cause

When Claude Code encounters an expired OAuth token, it sometimes exits cleanly with exit code 1 (fast failure), but other times it appears to hang — possibly retrying the token, waiting for user input, or crashing in a way that doesn't produce a clean exit. The backend has no mechanism to detect an auth failure early; it just waits for the full execution timeout.

The inconsistency suggests Claude Code's behavior on expired tokens is non-deterministic — sometimes it fails fast, sometimes it enters a retry/hang state.

Reproduction Steps

  1. Register a subscription with a token that will expire soon
  2. Assign the subscription to multiple agents
  3. Wait for the token to expire
  4. Trigger scheduled executions on those agents
  5. Observe: some fail in ~3 minutes, others hang for the full timeout (e.g., 3600s)
  6. During the hang period, the execution slot is occupied, blocking other tasks

Suggested Fix

Two-part fix:

1. Early auth validation before dispatching execution:

# Before calling /api/task on the agent, do a lightweight auth check
# e.g., verify token validity with a quick API call or check token expiry timestamp

2. Detect auth failure pattern in execution output and abort early:

# When streaming execution output, if "Subscription token may be expired" 
# appears in stderr, terminate the execution immediately rather than 
# waiting for the full timeout

3. Consider storing token expiry metadata when registering subscriptions, so the platform can proactively warn or fail before dispatching.

Environment

  • Trinity version: 2798ca9
  • Docker version: 29.3.0
  • OS: Ubuntu (GCP VM)

Related

  • src/backend/services/task_execution_service.py — execution dispatch
  • src/backend/services/cleanup_service.py — watchdog recovery
  • src/backend/services/slot_service.py — slot TTL management

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions