
fix: reject terminal-state regression on execution status writes #528

Merged
AbirAbbas merged 2 commits into main from fix/terminal-status-regression-guard
May 5, 2026

Conversation

@AbirAbbas
Contributor

Summary

Two execution-status write paths on the control plane could silently regress an already-finished execution from a terminal state (failed/succeeded/cancelled/timeout) back to a non-terminal one (running/queued/pending). When that happens, callers polling /api/v1/executions/:id for completion see the row flip back to "running" and never observe the terminal status — their app.call hangs until its own wall-clock timeout fires, and the workflow appears stuck in the UI for hours after it actually finished.

This PR adds the missing guard on both write paths:

  1. POST /api/v1/workflow/executions/events — applyEventToExecution was unconditionally writing whatever status arrived. The Python SDK fires these events fire-and-forget from notify_call_start / notify_call_complete / notify_call_error, and the calls aren't strictly ordered (a late "running" can land after "failed" for the same execution_id, especially when retries are in play or when an outer reasoner errors while inner reasoners are still emitting). Now: once a row is terminal, the handler still returns 200 (so the SDK's fire-and-forget retry doesn't trip) but skips status / result / error / completion mutations entirely.
  2. POST /api/v1/executions/:id/status — the existing transition guard only covered the waiting state. Now any non-terminal write against an already-terminal record returns 500 with a descriptive error, so the SDK's _post_execution_status retry loop sees a hard failure rather than silently stomping the row. Same-terminal writes are still accepted to keep at-least-once status delivery idempotent (sketched just below).
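For concreteness, here is a minimal sketch of the /status-path check. This is a hypothetical shape, not the actual handler code: the `isTerminal` / `guardTerminalRegression` names and the error wording are assumptions, but the decision table matches the description above.

```go
package handlers

import (
	"fmt"
	"net/http"
)

// Hypothetical terminal-state set, matching the statuses named above.
var terminalStatuses = map[string]bool{
	"failed":    true,
	"succeeded": true,
	"cancelled": true,
	"timeout":   true,
}

func isTerminal(status string) bool { return terminalStatuses[status] }

// guardTerminalRegression models the new check: a non-terminal write
// against a terminal row is rejected with 500, so the SDK's
// _post_execution_status retry loop sees a hard failure; terminal
// writes (including failed -> failed redeliveries) fall through,
// keeping at-least-once delivery idempotent.
func guardTerminalRegression(w http.ResponseWriter, execID, current, incoming string) (ok bool) {
	if isTerminal(current) && !isTerminal(incoming) {
		http.Error(w, fmt.Sprintf(
			"execution %s is already terminal (%s); refusing regression to %q",
			execID, current, incoming), http.StatusInternalServerError)
		return false
	}
	return true
}
```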

Production incident this fixes

A real PR review on the Railway test deployment hung in the UI for 12+ hours despite the underlying pr-af.review reasoner having raised BudgetExhaustedError ~6 minutes in. The pr-af SDK successfully POSTed {"status": "failed"} to /api/v1/executions/exec_…lyjqfh97/status at 03:09:07 UTC (control plane returned 200), but a subsequent unguarded write reverted the row to running. github-buddy's app.call poll loop kept seeing status: running, eventually timed out at its own default_execution_timeout=7200s, and the workflow row stayed in running state in the UI — observed live by querying GET /api/v1/executions/exec_…lyjqfh97 10 hours later: status: running. After this PR, that regression is impossible regardless of which writer raced.

Test plan

  • go build ./... — clean.
  • go test ./internal/handlers/... -count=1 — all existing tests still pass (107s + the smaller suites).
  • New tests pin the guard behavior (a condensed sketch follows this list):
    • TestUpdateExecutionStatusHandler_TerminalRegression — late running write against a failed row → 500, row unchanged.
    • TestUpdateExecutionStatusHandler_TerminalIdempotent — failed → failed redelivery → 200.
    • TestWorkflowExecutionEventHandler_TerminalRegression — late running event against a failed row → 200 (fire-and-forget) but row + CompletedAt unchanged.
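A condensed sketch of what the first two cases assert, reusing the hypothetical guardTerminalRegression helper from the sketch above (the real tests exercise the full handler and the DB row; exec_123 is a made-up ID):

```go
package handlers

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestGuardTerminalRegression(t *testing.T) {
	// Late "running" write against a "failed" row: rejected with 500.
	rec := httptest.NewRecorder()
	if guardTerminalRegression(rec, "exec_123", "failed", "running") {
		t.Fatal("expected terminal -> non-terminal write to be rejected")
	}
	if rec.Code != http.StatusInternalServerError {
		t.Fatalf("want 500, got %d", rec.Code)
	}

	// failed -> failed redelivery: accepted, keeping at-least-once
	// status delivery idempotent.
	if !guardTerminalRegression(httptest.NewRecorder(), "exec_123", "failed", "failed") {
		t.Fatal("expected same-terminal redelivery to be accepted")
	}
}
```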

Notes for reviewers

  • The two commits are split by write path so each guard is reviewable on its own.
  • I deliberately return 500 (not 409) on the /executions/:id/status regression because the existing handler already returns 500 for the analogous waiting-state guard, so this matches the local convention. Happy to switch to 409 if you'd prefer.
  • The /workflow/executions/events handler keeps its 200 response on regression because the SDK call site is fire-and-forget — surfacing 4xx/5xx there would put backpressure on the agent process and trigger needless retry storms. The early-return is silent on purpose; observability is via the unchanged DB row and the existing log-execution stream. A sketch of this early-return follows.
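As a companion sketch, the shape of the events-path guard under the same assumptions (Execution and StatusEvent are hypothetical stand-ins for the real row and event types; isTerminal is from the sketch above):

```go
package handlers

import "time"

// Hypothetical stand-ins for the real execution row and event payload.
type Execution struct {
	ID          string
	Status      string
	Result      string
	Error       string
	CompletedAt *time.Time
}

type StatusEvent struct {
	Status string
	Result string
	Error  string
}

// applyEventToExecutionGuarded models the events-path fix: once a row
// is terminal, skip every mutation but return quietly, so the handler
// still answers 200 and the SDK's fire-and-forget call site never
// retries.
func applyEventToExecutionGuarded(exec *Execution, ev StatusEvent) (mutated bool) {
	if isTerminal(exec.Status) {
		// Deliberately silent: the unchanged DB row and the existing
		// log-execution stream are the observability surface.
		return false
	}
	exec.Status = ev.Status
	exec.Result = ev.Result
	exec.Error = ev.Error
	if isTerminal(ev.Status) {
		now := time.Now()
		exec.CompletedAt = &now
	}
	return true
}
```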

🤖 Generated with Claude Code

AbirAbbas and others added 2 commits May 5, 2026 09:25
…xecutions/events

`applyEventToExecution` was unconditionally writing whatever status
arrived in the fire-and-forget workflow event, with no guard against
regressing from a terminal state (failed/succeeded/cancelled/timeout)
back to a non-terminal one (running/queued/pending).

The Python SDK fires these events from many paths — notify_call_start,
notify_call_complete, notify_call_error, plus retries — and they are
not strictly ordered. A late "running" event for the same execution_id
could land after a "failed" event and stomp the row's status back, so
callers polling /api/v1/executions/:id would never see the terminal
status. In production this stranded github-buddy's `app.call` for the
full 7200s wall-clock timeout despite pr-af.review having reported
"failed" 6 minutes in.

Once an execution has reached a terminal state, treat the row as
immutable for status / result / error / completion fields. The endpoint
still returns 200 so the SDK's fire-and-forget call site doesn't trip
its own retry, but the row no longer regresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…status

Companion to the /workflow/executions/events guard: the
/api/v1/executions/:id/status callback handler had a partial transition
guard (only for the 'waiting' state) but allowed terminal→non-terminal
writes. A late or replayed status callback could regress a finished
execution back to 'running' and strand any caller polling the GET
endpoint for completion.

Reject any non-terminal write against an already-terminal record with
500 + a descriptive error so the SDK's `_post_execution_status` retry
loop sees a hard failure (rather than silently re-stomping). Same-
terminal writes are still accepted to keep the SDK's at-least-once
delivery idempotent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AbirAbbas AbirAbbas requested a review from a team as a code owner May 5, 2026 13:26
@github-actions
Contributor

github-actions Bot commented May 5, 2026

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

| Surface | Current | Baseline | Δ |
| --- | --- | --- | --- |
| control-plane | 87.40% | 87.30% | ↑ +0.10 pp 🟡 |
| sdk-go | 91.80% | 90.70% | ↑ +1.10 pp 🟢 |
| sdk-python | 93.66% | 93.63% | ↑ +0.03 pp 🟢 |
| sdk-typescript | 92.63% | 92.56% | ↑ +0.07 pp 🟢 |
| web-ui | 89.69% | 90.01% | ↓ -0.32 pp 🟡 |
| aggregate | 88.85% | 89.01% | ↓ -0.16 pp 🟡 |

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.

@github-actions
Contributor

github-actions Bot commented May 5, 2026

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

| Surface | Touched lines | Patch coverage | Status |
| --- | --- | --- | --- |
| control-plane | 26 | 96.00% | ✅ |
| sdk-go | 0 | ➖ | no changes |
| sdk-python | 0 | ➖ | no changes |
| sdk-typescript | 0 | ➖ | no changes |
| web-ui | 0 | ➖ | no changes |

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

@AbirAbbas AbirAbbas merged commit 290d3df into main May 5, 2026
28 checks passed
@AbirAbbas AbirAbbas deleted the fix/terminal-status-regression-guard branch May 5, 2026 13:34