Skip to content

fix(workflow-executor): reject duplicate triggers of in-flight runs (PRD-468)#1634

Merged
Scra3 merged 1 commit into
feat/prd-214-server-step-mapperfrom
feature/prd-468-executor-runs-the-same-workflow-run-twice-on-concurrent
Jun 9, 2026
Merged

fix(workflow-executor): reject duplicate triggers of in-flight runs (PRD-468)#1634
Scra3 merged 1 commit into
feat/prd-214-server-step-mapperfrom
feature/prd-468-executor-runs-the-same-workflow-run-twice-on-concurrent

Conversation

@Scra3

@Scra3 Scra3 commented Jun 8, 2026

Copy link
Copy Markdown
Member

What

A duplicate trigger for a run already executing on this instance was silently accepted (200) after a wasted orchestrator round-trip — the inFlightRuns check sat after getAvailableRun and only returned.

triggerPoll now checks inFlightRuns up front (assertRunNotInFlight), before the orchestrator round-trip, and throws the boundary RunAlreadyInFlightErrorhandleTrigger maps it to 400 so the front knows the trigger was rejected, not silently accepted.

Important: this is an optimization, not a concurrency guard

inFlightRuns is per-instance and a deployment can run several executors, so an executor-side check can never be the duplicate-execution guard. Genuine concurrency dedup is the orchestrator's job: it atomically claims a pending run (UPDATE … WHERE runState='pending' … FOR UPDATE SKIP LOCKED), so concurrent triggers — on this executor or another — get nothing back.

The up-front check is purely a best-effort local short-circuit to skip a useless orchestrator round-trip when this instance already knows it's running the run.

Changes

  • assertRunNotInFlight(runId) early in triggerPoll (+ explanatory comment on the intent)
  • new boundary RunAlreadyInFlightError, mapped to 400 in handleTrigger
  • tests: the concurrent-trigger test asserts a single execution because the orchestrator denies the second dispatch (the dedup is orchestrator-driven by design)

fixes PRD-468

🤖 Generated with Claude Code

Note

Reject duplicate triggers of in-flight workflow runs with a 400 error

  • Runner.triggerPoll previously silently skipped duplicate triggers; it now throws RunAlreadyInFlightError via a new assertRunNotInFlight helper when the run ID is already being processed.
  • The POST /runs/:runId/trigger endpoint in executor-http-server.ts catches RunAlreadyInFlightError and returns a 400 response with the error message.
  • Behavioral Change: callers that previously received a silent no-op on duplicate trigger requests will now receive a 400 error response.

Macroscope summarized 1e0f685.

@linear-code

linear-code Bot commented Jun 8, 2026

Copy link
Copy Markdown

PRD-468

@qltysh

qltysh Bot commented Jun 8, 2026

Copy link
Copy Markdown

2 new issues

Tool Category Rule Count
qlty Structure Function with many returns (count = 5): handleTrigger 1
qlty Structure Function with high complexity (count = 11): handleTrigger 1

Base automatically changed from fix/prd-442-activity-log-target to feat/prd-214-server-step-mapper June 8, 2026 13:22
@qltysh

qltysh Bot commented Jun 8, 2026

Copy link
Copy Markdown

Qlty


Coverage Impact

Unable to calculate total coverage change because base branch coverage was not found.

Modified Files with Diff Coverage (3)

RatingFile% DiffUncovered Line #s
New Coverage rating: A
packages/workflow-executor/src/runner.ts100.0%
New Coverage rating: A
packages/workflow-executor/src/http/executor-http-server.ts100.0%
New Coverage rating: A
packages/workflow-executor/src/errors.ts100.0%
Total100.0%
🚦 See full report on Qlty Cloud »

🛟 Help
  • Diff Coverage: Coverage for added or modified lines of code (excludes deleted files). Learn more.

  • Total Coverage: Coverage for the whole repository, calculated as the sum of all File Coverage. Learn more.

  • File Coverage: Covered Lines divided by Covered Lines plus Missed Lines. (Excludes non-executable lines including blank lines and comments.)

    • Indirect Changes: Changes to File Coverage for files that were not modified in this PR. Learn more.

@Scra3 Scra3 force-pushed the feature/prd-468-executor-runs-the-same-workflow-run-twice-on-concurrent branch 4 times, most recently from fe27c82 to 431cb08 Compare June 8, 2026 13:41
@matthv

matthv commented Jun 8, 2026

Copy link
Copy Markdown
Member

Nice work — the boundary error + the assertRunNotInFlight helper + the 400 mapping are clean, and the early short-circuit (no orchestrator round-trip on a re-trigger) is a real improvement. One thing I'd like to settle before merging though, around the concurrency guarantee.

Description ↔ code mismatch

The description says the in-flight check is done twice (early + late) and that "the late check is the correctness guard … removing it in favor of the early one alone would reintroduce double execution". But the diff actually removes the late check (if (inFlightRuns.has(step.runId)) return) and keeps only the early assertRunNotInFlight(runId). Macroscope's summary also describes a single check. So either the description is stale, or the late check was dropped unintentionally.

Why this matters — getAvailableRun is only atomic for pending runs

I checked the orchestrator side. GET /api/workflow-orchestrator/available-run/:id only claims atomically when the run is pending:

if (runState === Pending)  → claimPendingRunById()   // atomic, see below
if (runState === Started)  → no claim, just hydrate + return the run

claimPendingRunById is solid for the pending case:

UPDATE "workflowRuns" SET "lockedAt"=NOW(), "runState"='loading'
WHERE id IN ( SELECT wr.id ... WHERE wr."runState"='pending' AND wr.id=:runId ...
              FOR UPDATE SKIP LOCKED )
RETURNING *

→ two concurrent calls: one claims, the other gets null. 👍

But for a started run there's no claim — two concurrent getAvailableRun both return the step. So:

  • pending run → orchestrator dedups → early-check-only is enough ✅
  • started run + truly simultaneous triggers → both pass the early check (before either registers in inFlightRuns) and the orchestrator returns the step to both → double execution. This is exactly the case the removed late check covered.

The test gives false confidence here: getAvailableRun.mockResolvedValueOnce(step).mockResolvedValueOnce(null) reproduces the pending-claim behavior. A started run would resolve step both times, so the current test wouldn't catch the regression.

Suggested resolution (either):

  1. Restore the late check (cheap, closes the started race), and reconcile the description; or
  2. Prove the trigger path never hits a started run concurrently (always goes through pending) — then early-check-only is correct, but please fix the description and add a test that exercises the started case (getAvailableRun returning the step twice).

Minor (non-blocking): 400 vs 409 — "already being processed" reads more like a 409 Conflict, though I see the ticket explicitly asked for 400, so up to you.

…PRD-468)

A duplicate trigger for a run already executing on this instance was silently
accepted (200) after a wasted orchestrator round-trip — the inFlightRuns check
sat after getAvailableRun and only `return`ed.

triggerPoll now checks inFlightRuns up front via assertRunNotInFlight, before the
orchestrator round-trip, throwing the new boundary RunAlreadyInFlightError
(mapped to 400 by handleTrigger, so the front knows the trigger was rejected, not
silently accepted).

This is a best-effort local optimization, NOT a concurrency guard: inFlightRuns is
per-instance and a deployment can run several executors, so genuine
duplicate-execution prevention is the orchestrator's job — it atomically claims a
pending run, and concurrent triggers (on this or another executor) get nothing back.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Scra3 Scra3 force-pushed the feature/prd-468-executor-runs-the-same-workflow-run-twice-on-concurrent branch from 431cb08 to 1e0f685 Compare June 8, 2026 17:46
@Scra3

Scra3 commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

Thanks for the deep read — you're right that the commit message was stale (it described a late check that the diff removed), I've fixed the description + commit message.

On the substance: I'm intentionally not restoring the late check. inFlightRuns is per-instance and we run multiple executors, so an executor-side check can't be the concurrency guard for the started-run race you describe — it would only close it same-instance while giving false confidence about the cross-executor case. Genuine duplicate-execution prevention has to live in the orchestrator's atomic claim, which is the real guarantee.

So the up-front assertRunNotInFlight is scoped as what it is: a best-effort local short-circuit to avoid a useless orchestrator round-trip when this instance already knows it's running the run. I added a comment to that effect, and the concurrent test now reads as "one execution because the orchestrator denies the second dispatch" (dedup is orchestrator-driven).

Re 400 vs 409: agreed 409 reads better, keeping 400 per the ticket.

@Scra3 Scra3 merged commit 5969656 into feat/prd-214-server-step-mapper Jun 9, 2026
30 checks passed
@Scra3 Scra3 deleted the feature/prd-468-executor-runs-the-same-workflow-run-twice-on-concurrent branch June 9, 2026 08:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants