
fix: reject terminal-state regression on execution status writes #528

Merged
AbirAbbas merged 2 commits into main from fix/terminal-status-regression-guard
May 5, 2026

Conversation

@AbirAbbas
Contributor

Summary

Two execution-status write paths on the control plane could silently regress an already-finished execution from a terminal state (failed/succeeded/cancelled/timeout) back to a non-terminal one (running/queued/pending). When that happens, callers polling /api/v1/executions/:id for completion see the row flip back to "running" and never observe the terminal status — their app.call hangs until its own wall-clock timeout fires, and the workflow appears stuck in the UI for hours after it actually finished.

This PR adds the missing guard on both write paths:

  1. POST /api/v1/workflow/executions/events — applyEventToExecution was unconditionally writing whatever status arrived. The Python SDK fires these events fire-and-forget from notify_call_start / notify_call_complete / notify_call_error, and the calls aren't strictly ordered (a late "running" can land after "failed" for the same execution_id, especially when retries are in play or when an outer reasoner errors while inner reasoners are still emitting). Now: once a row is terminal, the handler still returns 200 (so the SDK's fire-and-forget retry doesn't trip) but skips status / result / error / completion mutations entirely.
  2. POST /api/v1/executions/:id/status — the existing transition guard only covered the waiting state. Now any non-terminal write against an already-terminal record returns 500 with a descriptive error, so the SDK's _post_execution_status retry loop sees a hard failure rather than silently stomping the row. Same-terminal writes are still accepted to keep at-least-once status delivery idempotent (sketched just below).
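For concreteness, here is a minimal sketch of the /status-path check. This is a hypothetical shape, not the actual handler code: the `isTerminal` / `guardTerminalRegression` names and the error wording are assumptions, but the decision table matches the description above.

```go
package handlers

import (
	"fmt"
	"net/http"
)

// Hypothetical terminal-state set, matching the statuses named above.
var terminalStatuses = map[string]bool{
	"failed":    true,
	"succeeded": true,
	"cancelled": true,
	"timeout":   true,
}

func isTerminal(status string) bool { return terminalStatuses[status] }

// guardTerminalRegression models the new check: a non-terminal write
// against a terminal row is rejected with 500, so the SDK's
// _post_execution_status retry loop sees a hard failure; terminal
// writes (including failed -> failed redeliveries) fall through,
// keeping at-least-once delivery idempotent.
func guardTerminalRegression(w http.ResponseWriter, execID, current, incoming string) (ok bool) {
	if isTerminal(current) && !isTerminal(incoming) {
		http.Error(w, fmt.Sprintf(
			"execution %s is already terminal (%s); refusing regression to %q",
			execID, current, incoming), http.StatusInternalServerError)
		return false
	}
	return true
}
```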

Production incident this fixes

A real PR review on the Railway test deployment hung in the UI for 12+ hours despite the underlying pr-af.review reasoner having raised BudgetExhaustedError ~6 minutes in. The pr-af SDK successfully POSTed {"status": "failed"} to /api/v1/executions/exec_…lyjqfh97/status at 03:09:07 UTC (control plane returned 200), but a subsequent unguarded write reverted the row to running. github-buddy's app.call poll loop kept seeing status: running, eventually timed out at its own default_execution_timeout=7200s, and the workflow row stayed in running state in the UI — observed live by querying GET /api/v1/executions/exec_…lyjqfh97 10 hours later: status: running. After this PR, that regression is impossible regardless of which writer raced.

Test plan

  • go build ./... — clean.
  • go test ./internal/handlers/... -count=1 — all existing tests still pass (107s + the smaller suites).
  • New tests pin the guard behavior (a condensed sketch follows this list):
    • TestUpdateExecutionStatusHandler_TerminalRegression — late running write against a failed row → 500, row unchanged.
    • TestUpdateExecutionStatusHandler_TerminalIdempotent — failed → failed redelivery → 200.
    • TestWorkflowExecutionEventHandler_TerminalRegression — late running event against a failed row → 200 (fire-and-forget) but row + CompletedAt unchanged.
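A condensed sketch of what the first two cases assert, reusing the hypothetical guardTerminalRegression helper from the sketch above (the real tests exercise the full handler and the DB row; exec_123 is a made-up ID):

```go
package handlers

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestGuardTerminalRegression(t *testing.T) {
	// Late "running" write against a "failed" row: rejected with 500.
	rec := httptest.NewRecorder()
	if guardTerminalRegression(rec, "exec_123", "failed", "running") {
		t.Fatal("expected terminal -> non-terminal write to be rejected")
	}
	if rec.Code != http.StatusInternalServerError {
		t.Fatalf("want 500, got %d", rec.Code)
	}

	// failed -> failed redelivery: accepted, keeping at-least-once
	// status delivery idempotent.
	if !guardTerminalRegression(httptest.NewRecorder(), "exec_123", "failed", "failed") {
		t.Fatal("expected same-terminal redelivery to be accepted")
	}
}
```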

Notes for reviewers

  • The two commits are split by write path so each guard is reviewable on its own.
  • I deliberately return 500 (not 409) on the /executions/:id/status regression because the existing handler already returns 500 for the analogous waiting-state guard, so this matches the local convention. Happy to switch to 409 if you'd prefer.
  • The /workflow/executions/events handler keeps its 200 response on regression because the SDK call site is fire-and-forget — surfacing 4xx/5xx there would put backpressure on the agent process and trigger needless retry storms. The early-return is silent on purpose; observability is via the unchanged DB row and the existing log-execution stream. A sketch of this early-return follows.
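As a companion sketch, the shape of the events-path guard under the same assumptions (Execution and StatusEvent are hypothetical stand-ins for the real row and event types; isTerminal is from the sketch above):

```go
package handlers

import "time"

// Hypothetical stand-ins for the real execution row and event payload.
type Execution struct {
	ID          string
	Status      string
	Result      string
	Error       string
	CompletedAt *time.Time
}

type StatusEvent struct {
	Status string
	Result string
	Error  string
}

// applyEventToExecutionGuarded models the events-path fix: once a row
// is terminal, skip every mutation but return quietly, so the handler
// still answers 200 and the SDK's fire-and-forget call site never
// retries.
func applyEventToExecutionGuarded(exec *Execution, ev StatusEvent) (mutated bool) {
	if isTerminal(exec.Status) {
		// Deliberately silent: the unchanged DB row and the existing
		// log-execution stream are the observability surface.
		return false
	}
	exec.Status = ev.Status
	exec.Result = ev.Result
	exec.Error = ev.Error
	if isTerminal(ev.Status) {
		now := time.Now()
		exec.CompletedAt = &now
	}
	return true
}
```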

🤖 Generated with Claude Code

AbirAbbas and others added 2 commits May 5, 2026 09:25
…xecutions/events

`applyEventToExecution` was unconditionally writing whatever status
arrived in the fire-and-forget workflow event, with no guard against
regressing from a terminal state (failed/succeeded/cancelled/timeout)
back to a non-terminal one (running/queued/pending).

The Python SDK fires these events from many paths — notify_call_start,
notify_call_complete, notify_call_error, plus retries — and they are
not strictly ordered. A late "running" event for the same execution_id
could land after a "failed" event and stomp the row's status back, so
callers polling /api/v1/executions/:id would never see the terminal
status. In production this stranded github-buddy's `app.call` for the
full 7200s wall-clock timeout despite pr-af.review having reported
"failed" 6 minutes in.

Once an execution has reached a terminal state, treat the row as
immutable for status / result / error / completion fields. The endpoint
still returns 200 so the SDK's fire-and-forget call site doesn't trip
its own retry, but the row no longer regresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…status

Companion to the /workflow/executions/events guard: the
/api/v1/executions/:id/status callback handler had a partial transition
guard (only for the 'waiting' state) but allowed terminal→non-terminal
writes. A late or replayed status callback could regress a finished
execution back to 'running' and strand any caller polling the GET
endpoint for completion.

Reject any non-terminal write against an already-terminal record with
500 + a descriptive error so the SDK's `_post_execution_status` retry
loop sees a hard failure (rather than silently re-stomping). Same-
terminal writes are still accepted to keep the SDK's at-least-once
delivery idempotent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AbirAbbas AbirAbbas requested a review from a team as a code owner May 5, 2026 13:26
@github-actions
Contributor

github-actions Bot commented May 5, 2026

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

| Surface | Current | Baseline | Δ |
| --- | --- | --- | --- |
| control-plane | 87.40% | 87.30% | ↑ +0.10 pp 🟡 |
| sdk-go | 91.80% | 90.70% | ↑ +1.10 pp 🟢 |
| sdk-python | 93.66% | 93.63% | ↑ +0.03 pp 🟢 |
| sdk-typescript | 92.63% | 92.56% | ↑ +0.07 pp 🟢 |
| web-ui | 89.69% | 90.01% | ↓ -0.32 pp 🟡 |
| aggregate | 88.85% | 89.01% | ↓ -0.16 pp 🟡 |

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.

@github-actions
Contributor

github-actions Bot commented May 5, 2026

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

| Surface | Touched lines | Patch coverage | Status |
| --- | --- | --- | --- |
| control-plane | 26 | 96.00% | ✅ |
| sdk-go | 0 | ➖ | no changes |
| sdk-python | 0 | ➖ | no changes |
| sdk-typescript | 0 | ➖ | no changes |
| web-ui | 0 | ➖ | no changes |

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

@AbirAbbas AbirAbbas merged commit 290d3df into main May 5, 2026
28 checks passed
@AbirAbbas AbirAbbas deleted the fix/terminal-status-regression-guard branch May 5, 2026 13:34