Skip to content

fix: reap stale workflow executions and use updated_at for staleness#262

Merged
santoshkumarradha merged 3 commits intomainfrom
fix/stale-execution-reaper
Mar 13, 2026
Merged

fix: reap stale workflow executions and use updated_at for staleness#262
santoshkumarradha merged 3 commits intomainfrom
fix/stale-execution-reaper

Conversation

@AbirAbbas
Copy link
Contributor

@AbirAbbas AbirAbbas commented Mar 13, 2026

Summary

  • Switches staleness detection from started_at to updated_at so only executions with no recent activity are reaped — legitimately long-running executions that are still making progress are no longer incorrectly timed out
  • Adds MarkStaleWorkflowExecutions to handle the workflow_executions table, where orphaned child executions get permanently stuck in running state when their parent fails without cascading cancellation
  • Wires both into the existing ExecutionCleanupService background loop (no new config needed — reuses stale_execution_timeout)

Root cause

When the orchestrator dispatches child executions (intake, anatomy, review phases) and the parent or a sibling fails, the control plane doesn't cascade that failure to in-flight children. The parent gets marked failed but children are orphaned in running state forever. The existing MarkStaleExecutions only operated on the executions table — workflow_executions had no reaping at all.

What changed

File Change
storage/storage.go Added MarkStaleWorkflowExecutions to StorageProvider interface
storage/execution_records.go Changed MarkStaleExecutions to use updated_at; added MarkStaleWorkflowExecutions impl; added COALESCE(updated_at, created_at, started_at) defensive fallback
handlers/execution_cleanup.go Calls MarkStaleWorkflowExecutions alongside existing stale marking
*_test.go Updated mocks, added tests for workflow stale marking + NULL updated_at scenarios

Review follow-ups (5ea01e5)

From engineering review:

  • COALESCE defensive fallback: Both MarkStaleExecutions and MarkStaleWorkflowExecutions now use COALESCE(updated_at, created_at, started_at) so rows where updated_at was never set still get reaped instead of silently skipped
  • Invariant documentation: Added docstring comments documenting the invariant that updated_at must be bumped on every meaningful activity for staleness detection to work correctly
  • NULL updated_at tests: Added TestMarkStaleExecutions_ReapsWhenUpdatedAtIsNULL and TestMarkStaleWorkflowExecutions_ReapsWhenUpdatedAtIsNULL

Test plan

  • All existing cleanup tests pass (11/11 handler, 8/8 storage)
  • New test: TestExecutionCleanupService_PerformCleanup_MarksStaleWorkflowExecutions
  • New test: TestExecutionCleanupService_PerformCleanup_ContinuesWhenMarkStaleWorkflowFails
  • New test: TestMarkStaleExecutions_ReapsWhenUpdatedAtIsNULL
  • New test: TestMarkStaleWorkflowExecutions_ReapsWhenUpdatedAtIsNULL
  • Deploy to staging and verify stuck executions from today get reaped

🤖 Generated with Claude Code

…detection

The existing MarkStaleExecutions only covered the executions table and
used started_at to detect staleness, which missed orphaned workflow
executions entirely and could incorrectly timeout legitimately long-running
executions. This change:

- Switches staleness detection from started_at to updated_at so only
  executions with no recent activity are reaped
- Adds MarkStaleWorkflowExecutions to handle the workflow_executions
  table where orphaned child executions get permanently stuck in
  running state when their parent fails
- Wires both into the existing ExecutionCleanupService background loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AbirAbbas AbirAbbas requested a review from a team as a code owner March 13, 2026 01:08
AbirAbbas and others added 2 commits March 12, 2026 21:23
Tests run against a real database (no mocks) covering:
- Stuck executions reaped while active ones are preserved
- Long-running executions with recent activity NOT incorrectly reaped
- Orphaned workflow children reaped when parent already failed
- Waiting-state executions reaped after inactivity
- Batch limit respected across multiple reaper passes
- End-to-end scenario: parent fails, children stuck in both tables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use COALESCE(updated_at, created_at, started_at) in both
  MarkStaleExecutions and MarkStaleWorkflowExecutions to handle
  rows where updated_at was never set
- Add invariant comment documenting that updated_at must be bumped
  on every meaningful activity for staleness detection to work
- Add tests for NULL updated_at scenario on both execution types
@santoshkumarradha santoshkumarradha added this pull request to the merge queue Mar 13, 2026
Merged via the queue into main with commit 56410a6 Mar 13, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants