fix: reap stale workflow executions and use updated_at for staleness #262
Merged
santoshkumarradha merged 3 commits into main on Mar 13, 2026
…detection

The existing MarkStaleExecutions only covered the executions table and used started_at to detect staleness, which missed orphaned workflow executions entirely and could incorrectly time out legitimately long-running executions.

This change:
- Switches staleness detection from started_at to updated_at so only executions with no recent activity are reaped
- Adds MarkStaleWorkflowExecutions to handle the workflow_executions table where orphaned child executions get permanently stuck in running state when their parent fails
- Wires both into the existing ExecutionCleanupService background loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Tests run against a real database (no mocks) covering:
- Stuck executions reaped while active ones are preserved
- Long-running executions with recent activity NOT incorrectly reaped
- Orphaned workflow children reaped when parent already failed
- Waiting-state executions reaped after inactivity
- Batch limit respected across multiple reaper passes
- End-to-end scenario: parent fails, children stuck in both tables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Use COALESCE(updated_at, created_at, started_at) in both MarkStaleExecutions and MarkStaleWorkflowExecutions to handle rows where updated_at was never set
- Add invariant comment documenting that updated_at must be bumped on every meaningful activity for staleness detection to work
- Add tests for NULL updated_at scenario on both execution types
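
The staleness rule described in these commits can be sketched in plain Go. This is an in-memory illustration only, not the PR's SQL: the `record` type, the fixed `now` and `timeout` values, the `ago` helper, and the choice of `running` and `waiting` as the in-flight statuses are all assumptions made for the example. `lastActivity` mimics `COALESCE(updated_at, created_at, started_at)` by taking the first non-NULL timestamp in that order.

```go
package main

import (
	"fmt"
	"time"
)

// record is a hypothetical, trimmed-down execution row.
// Nil pointers stand in for SQL NULL.
type record struct {
	Status    string
	StartedAt *time.Time
	CreatedAt *time.Time
	UpdatedAt *time.Time
}

// Fixed clock and timeout keep the example deterministic (both are assumptions).
var (
	now     = time.Date(2026, 3, 13, 12, 0, 0, 0, time.UTC)
	timeout = 30 * time.Minute
)

// ago returns a pointer to the instant the given number of minutes before now.
func ago(minutes int) *time.Time {
	t := now.Add(-time.Duration(minutes) * time.Minute)
	return &t
}

// lastActivity mirrors COALESCE(updated_at, created_at, started_at):
// the first non-NULL timestamp in that order wins.
func lastActivity(r record) (time.Time, bool) {
	for _, ts := range []*time.Time{r.UpdatedAt, r.CreatedAt, r.StartedAt} {
		if ts != nil {
			return *ts, true
		}
	}
	return time.Time{}, false
}

// isStale reports whether an in-flight record has been inactive longer than timeout.
func isStale(r record) bool {
	if r.Status != "running" && r.Status != "waiting" {
		return false
	}
	last, ok := lastActivity(r)
	return ok && now.Sub(last) > timeout
}

func main() {
	// Started four hours ago but touched two minutes ago: still making progress.
	longRunning := record{Status: "running", StartedAt: ago(240), CreatedAt: ago(240), UpdatedAt: ago(2)}
	// Started four hours ago, last touched ninety minutes ago: stuck.
	stuck := record{Status: "running", StartedAt: ago(240), CreatedAt: ago(240), UpdatedAt: ago(90)}
	// updated_at never set: falls back to created_at.
	neverBumped := record{Status: "running", StartedAt: ago(240), CreatedAt: ago(240)}

	fmt.Println(isStale(longRunning), isStale(stuck), isStale(neverBumped))
	// prints: false true true
}
```

A check based on `started_at` alone would have reaped `longRunning` as well, which is the false positive the switch to `updated_at` is meant to avoid.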
santoshkumarradha approved these changes on Mar 13, 2026
Summary
- Switches staleness detection from `started_at` to `updated_at` so only executions with no recent activity are reaped; legitimately long-running executions that are still making progress are no longer incorrectly timed out
- Adds `MarkStaleWorkflowExecutions` to handle the `workflow_executions` table, where orphaned child executions get permanently stuck in `running` state when their parent fails without cascading cancellation
- Wires both into the existing `ExecutionCleanupService` background loop (no new config needed; reuses `stale_execution_timeout`)

Root cause
When the orchestrator dispatches child executions (intake, anatomy, review phases) and the parent or a sibling fails, the control plane doesn't cascade that failure to in-flight children. The parent gets marked `failed` but children are orphaned in `running` state forever. The existing `MarkStaleExecutions` only operated on the `executions` table; `workflow_executions` had no reaping at all.

What changed
| File | Change |
| --- | --- |
| `storage/storage.go` | Added `MarkStaleWorkflowExecutions` to `StorageProvider` interface |
| `storage/execution_records.go` | Switched `MarkStaleExecutions` to use `updated_at`; added `MarkStaleWorkflowExecutions` impl; added `COALESCE(updated_at, created_at, started_at)` defensive fallback |
| `handlers/execution_cleanup.go` | Calls `MarkStaleWorkflowExecutions` alongside existing stale marking |
| `*_test.go` | Tests, including NULL `updated_at` scenarios |

Review follow-ups (5ea01e5)
From engineering review:
- `MarkStaleExecutions` and `MarkStaleWorkflowExecutions` now use `COALESCE(updated_at, created_at, started_at)` so rows where `updated_at` was never set still get reaped instead of being silently skipped
- Added an invariant comment documenting that `updated_at` must be bumped on every meaningful activity for staleness detection to work correctly
- NULL `updated_at` tests: added `TestMarkStaleExecutions_ReapsWhenUpdatedAtIsNULL` and `TestMarkStaleWorkflowExecutions_ReapsWhenUpdatedAtIsNULL`

Test plan
- `TestExecutionCleanupService_PerformCleanup_MarksStaleWorkflowExecutions`
- `TestExecutionCleanupService_PerformCleanup_ContinuesWhenMarkStaleWorkflowFails`
- `TestMarkStaleExecutions_ReapsWhenUpdatedAtIsNULL`
- `TestMarkStaleWorkflowExecutions_ReapsWhenUpdatedAtIsNULL`

🤖 Generated with Claude Code
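
The two `PerformCleanup` tests in the test plan cover a cleanup pass that calls both reapers and keeps going when one of them fails. A minimal sketch of that shape follows, under stated assumptions: the PR names the two storage methods, but the `staleReaper` interface, the parameter and return types, the call order, `fakeStore`, and `runOnce` are hypothetical and exist only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// staleReaper is a hypothetical, narrowed view of the storage layer. The PR
// names both methods; the parameter and return types here are assumptions.
type staleReaper interface {
	MarkStaleExecutions(ctx context.Context, staleAfter time.Duration, limit int) (int, error)
	MarkStaleWorkflowExecutions(ctx context.Context, staleAfter time.Duration, limit int) (int, error)
}

// performCleanup runs both reapers in a single pass. An error from one reaper
// is recorded and returned, but it does not stop the other from running.
func performCleanup(ctx context.Context, s staleReaper, staleAfter time.Duration, limit int) (int, error) {
	var errs []error
	total := 0

	if n, err := s.MarkStaleExecutions(ctx, staleAfter, limit); err != nil {
		errs = append(errs, fmt.Errorf("mark stale executions: %w", err))
	} else {
		total += n
	}

	if n, err := s.MarkStaleWorkflowExecutions(ctx, staleAfter, limit); err != nil {
		errs = append(errs, fmt.Errorf("mark stale workflow executions: %w", err))
	} else {
		total += n
	}

	// errors.Join returns nil when no errors were collected.
	return total, errors.Join(errs...)
}

var errUnavailable = errors.New("storage unavailable")

// fakeStore is an in-memory stand-in used only for this illustration.
type fakeStore struct {
	execReaped, wfReaped int
	execFails, wfFails   bool
}

func (f fakeStore) MarkStaleExecutions(context.Context, time.Duration, int) (int, error) {
	if f.execFails {
		return 0, errUnavailable
	}
	return f.execReaped, nil
}

func (f fakeStore) MarkStaleWorkflowExecutions(context.Context, time.Duration, int) (int, error) {
	if f.wfFails {
		return 0, errUnavailable
	}
	return f.wfReaped, nil
}

// runOnce performs one pass with illustrative settings (30m timeout, batch of 100).
func runOnce(s staleReaper) (int, error) {
	return performCleanup(context.Background(), s, 30*time.Minute, 100)
}

func main() {
	n, err := runOnce(fakeStore{execReaped: 2, wfReaped: 3})
	fmt.Println("reaped:", n, "err:", err)

	// The workflow reaper fails, but the executions reaper's result is kept.
	n, err = runOnce(fakeStore{execReaped: 2, wfReaped: 3, wfFails: true})
	fmt.Println("reaped:", n, "err:", err)
}
```

Collecting errors instead of returning on the first one means a failure in either table's reaper cannot leave the other table unreaped for the whole pass.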