fix: reject terminal-state regression on execution status writes #528
Merged
Conversation
…xecutions/events

`applyEventToExecution` was unconditionally writing whatever status arrived in the fire-and-forget workflow event, with no guard against regressing from a terminal state (failed/succeeded/cancelled/timeout) back to a non-terminal one (running/queued/pending). The Python SDK fires these events from many paths — `notify_call_start`, `notify_call_complete`, `notify_call_error`, plus retries — and they are not strictly ordered. A late "running" event for the same `execution_id` could land after a "failed" event and stomp the row's status back, so callers polling `/api/v1/executions/:id` would never see the terminal status. In production this stranded github-buddy's `app.call` for the full 7200s wall-clock timeout despite pr-af.review having reported "failed" 6 minutes in.

Once an execution has reached a terminal state, treat the row as immutable for status / result / error / completion fields. The endpoint still returns 200 so the SDK's fire-and-forget call site doesn't trip its own retry, but the row no longer regresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…status

Companion to the /workflow/executions/events guard: the /api/v1/executions/:id/status callback handler had a partial transition guard (only for the 'waiting' state) but allowed terminal→non-terminal writes. A late or replayed status callback could regress a finished execution back to 'running' and strand any caller polling the GET endpoint for completion.

Reject any non-terminal write against an already-terminal record with 500 + a descriptive error so the SDK's `_post_execution_status` retry loop sees a hard failure (rather than silently re-stomping). Same-terminal writes are still accepted to keep the SDK's at-least-once delivery idempotent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
📊 Coverage gate
✅ Gate passed — no surface regressed past the allowed threshold and the aggregate stayed above the floor.
Contributor
📐 Patch coverage gate — threshold: 80% on lines this PR touches
✅ Patch gate passed — every surface whose lines were touched by this PR has patch coverage at or above the threshold.
Summary
Two execution-status write paths on the control plane could silently regress an already-finished execution from a terminal state (`failed`/`succeeded`/`cancelled`/`timeout`) back to a non-terminal one (`running`/`queued`/`pending`). When that happens, callers polling `/api/v1/executions/:id` for completion see the row flip back to "running" and never observe the terminal status — their `app.call` hangs until its own wall-clock timeout fires, and the workflow appears stuck in the UI for hours after it actually finished.

This PR adds the missing guard on both write paths:
- `POST /api/v1/workflow/executions/events` — `applyEventToExecution` was unconditionally writing whatever status arrived. The Python SDK fires these events fire-and-forget from `notify_call_start`/`notify_call_complete`/`notify_call_error`, and the calls aren't strictly ordered (a late "running" can land after "failed" for the same `execution_id`, especially when retries are in play or when an outer reasoner errors while inner reasoners are still emitting). Now: once a row is terminal, the handler still returns 200 (so the SDK's fire-and-forget retry doesn't trip) but skips status / result / error / completion mutations entirely.
- `POST /api/v1/executions/:id/status` — the existing transition guard only covered the `waiting` state. Now any non-terminal write against an already-terminal record returns 500 with a descriptive error, so the SDK's `_post_execution_status` retry loop sees a hard failure rather than silently stomping the row. Same-terminal writes are still accepted to keep at-least-once status delivery idempotent.

Production incident this fixes
A real PR review on the Railway test deployment hung in the UI for 12+ hours despite the underlying `pr-af.review` reasoner having raised `BudgetExhaustedError` ~6 minutes in. The pr-af SDK successfully POSTed `{"status": "failed"}` to `/api/v1/executions/exec_…lyjqfh97/status` at 03:09:07 UTC (control plane returned 200), but a subsequent unguarded write reverted the row to `running`. github-buddy's `app.call` poll loop kept seeing `status: running`, eventually timed out at its own `default_execution_timeout=7200s`, and the workflow row stayed in `running` state in the UI — observed live by querying `GET /api/v1/executions/exec_…lyjqfh97` 10 hours later: `status: running`. After this PR, that regression is impossible regardless of which writer raced.

Test plan
- `go build ./...` — clean.
- `go test ./internal/handlers/... -count=1` — all existing tests still pass (107s + the smaller suites).
- `TestUpdateExecutionStatusHandler_TerminalRegression` — late `running` write against a `failed` row → 500, row unchanged.
- `TestUpdateExecutionStatusHandler_TerminalIdempotent` — `failed` → `failed` redelivery → 200.
- `TestWorkflowExecutionEventHandler_TerminalRegression` — late `running` event against a `failed` row → 200 (fire-and-forget) but row + `CompletedAt` unchanged.

Notes for reviewers
- `/executions/:id/status` returns 500 on regression because the existing handler returns 500 for the analogous waiting-state guard; matching the local convention. Happy to switch to 409 if you'd prefer.
- The `/workflow/executions/events` handler keeps its 200 response on regression because the SDK call site is fire-and-forget — surfacing 4xx/5xx there would put backpressure on the agent process and trigger needless retry storms. The early return is silent on purpose; observability is via the unchanged DB row and the existing log-execution stream.

🤖 Generated with Claude Code