Skip to content

bug: Execution shows Failed then Success after cleanup (race condition) #378

@vybe

Description

@vybe

Summary

Scheduled task executions briefly show "Failed" status with error "Stale Execution Slot TTL Expired for Agent [name] Cleaned by Cleanup Service", but the task actually succeeds and status updates to "Success" ~14 seconds later.

Context

Reported by Alex Sazonenka on Paradigm Live instance (2026-04-17). A generation task at 15:44:13 showed Failed immediately, but after page refresh 14 seconds later it was Success with full output and logs available.

This creates confusion and would cause false-positive failure notifications if we implement alert channels.

Root Cause Hypothesis

Race condition between:

  1. Cleanup service's TTL watchdog marking execution as stale/failed
  2. Actual execution completing successfully

The cleanup service is winning the race and marking the slot as expired before the execution can report completion.

Acceptance Criteria

  • Cleanup service checks for actual execution completion before marking failed
  • Add grace period or completion signal handshake
  • Execution status is accurate on first render (no phantom failures)
  • Add test case for concurrent cleanup + completion scenario

Technical Notes

Related files:

  • src/backend/services/cleanup_service.py - TTL watchdog
  • src/backend/services/slot_service.py - Slot management
  • src/backend/services/task_execution_service.py - Execution lifecycle

May need to check execution's actual process state before cleanup marks it failed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions