fix(scheduler): make concurrency cap atomic with claim_execution #104
Merged
pratyush618 merged 1 commit into master, May 2, 2026
Conversation
Two distinct bugs in the per-task and per-queue concurrency gates:

1. The cap check ran *before* `claim_execution`, so two scheduler instances could read a count below the cap and both proceed past the gate before either recorded the new running job. Postgres (`SELECT FOR UPDATE SKIP LOCKED`) and Redis are fully concurrent and would hit this; SQLite was protected only by transaction serialization.
2. `dequeue_from` already transitions status to `Running` before the gate runs, so the running-count includes the just-dequeued job. The `>=` comparison meant `max_concurrent = N` actually allowed only N-1 jobs; `max_concurrent = 1` allowed zero.

Move the cap check to *after* `claim_execution` succeeds, change `>=` to `>`, and add a rollback path that clears the claim row and retries the job when the post-claim gate rejects it. Split `try_dispatch` into named helpers so each step of the dispatch flow is self-documenting. Three regression tests cover the off-by-one and the rollback path.
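The fixed claim-then-check ordering can be sketched as a small single-process model. This is illustrative only: the real helpers live in taskito-core's scheduler, and the `Gate` type and method names below are hypothetical stand-ins.

```rust
use std::collections::HashSet;

/// Toy model of the fixed dispatch gate (names are illustrative,
/// not taskito's actual API).
struct Gate {
    max_concurrent: usize,
    running: HashSet<u64>, // job ids whose claim rows exist
}

impl Gate {
    fn new(max_concurrent: usize) -> Self {
        Self { max_concurrent, running: HashSet::new() }
    }

    fn running_count(&self) -> usize {
        self.running.len()
    }

    /// `claim_execution` analogue: record the claim first. Returns
    /// false if the job was already claimed.
    fn claim(&mut self, job_id: u64) -> bool {
        self.running.insert(job_id)
    }

    /// Post-claim cap check. The just-claimed job is already counted
    /// in `running`, so the comparison must be strict `>`: with `>=`,
    /// max_concurrent = N would admit only N-1 jobs.
    fn try_dispatch(&mut self, job_id: u64) -> bool {
        if !self.claim(job_id) {
            return false; // already claimed
        }
        if self.running.len() > self.max_concurrent {
            // Rollback path: clear the claim row so the job can be
            // retried, instead of leaving it stuck until the claim
            // is reaped on timeout.
            self.running.remove(&job_id);
            return false;
        }
        true
    }
}
```

Because the claim is recorded before the count is compared, a second dispatcher always sees the first dispatcher's claim reflected in the count.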
Summary
The pre-release audit (P0-1, see `.claude/reports/codebase-prerelease-2026-05-02.md`) flagged a TOCTOU race on the per-task / per-queue concurrency gates. Investigating the surrounding code surfaced a second, distinct bug hiding in the same lines: an off-by-one that made `max_concurrent = N` allow only N-1 jobs (and `max_concurrent = 1` allow none). Both bugs are fixed in this PR.
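To make the off-by-one concrete, here is the comparison in isolation (a standalone sketch with illustrative names, not the actual gate code). The running count is taken after the current job has already been moved to Running, so it includes the job itself:

```rust
// `running` is the count AFTER the current job transitioned to Running,
// so it includes the job being dispatched.
fn gate_allows_old(running: usize, max_concurrent: usize) -> bool {
    // Buggy pre-fix check: `>=` counts the current job against itself,
    // so max_concurrent = N admits only N-1 jobs.
    !(running >= max_concurrent)
}

fn gate_allows_fixed(running: usize, max_concurrent: usize) -> bool {
    // Fixed check: strict `>` admits exactly max_concurrent jobs.
    !(running > max_concurrent)
}
```

With `max_concurrent = 1` and the current job as the only running one (`running = 1`), the old check rejects it outright, which is why a cap of 1 allowed zero jobs.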
What was wrong
- Race: the cap check ran before `claim_execution`. Two scheduler instances could each read a `running` count below the cap and both proceed past the gate before either recorded the new running job. SQLite was masked by transaction serialization; Postgres (`SELECT FOR UPDATE SKIP LOCKED`) and Redis were fully exposed.
- Off-by-one: `dequeue_from` atomically transitions status `Pending → Running` before the cap check runs, so `count_running_by_task` / `stats_by_queue.running` include the just-dequeued job. The `>=` comparison therefore counted the current job against itself.

What changed
- `try_dispatch` is split into named helpers: `active_queues`, `check_pre_claim_gates`, `claim_for_dispatch`, `check_post_claim_concurrency`, `rollback_claim_and_retry`. Each step of the dispatch flow is now self-documenting and individually testable.
- The cap check now runs after `claim_execution` succeeds, comparing with strict `>`. Two concurrent schedulers can no longer both pass the gate when capacity is full: at most one will see the count at-cap and proceed; the other observes count > cap and rolls back.
- On post-claim rejection, `complete_execution` clears the claim row before `retry` resets status. Without this, the next dispatch attempt would hit `claim_execution → Ok(false)` ("already claimed") and the job would be stuck until the claim was reaped on timeout.

Tests
Three new regression tests in `crates/taskito-core/src/scheduler/mod.rs`:

- `test_try_dispatch_per_task_concurrency_allows_exactly_max`: enqueue 3 jobs with `max_concurrent = 2`, verify exactly 2 are dispatched and the third is back in Pending with no stale claim row.
- `test_try_dispatch_per_task_max_one_dispatches_one`: direct regression for the off-by-one.
- `test_try_dispatch_per_queue_concurrency_allows_exactly_max`: same shape on the queue-level cap.

Test plan
- `cargo test --workspace`: all 78 tests pass + 3 new
- `cargo clippy --workspace --all-targets -- -D warnings`: clean
- `cargo check --workspace --features postgres`: clean
- `cargo check --workspace --features redis`: clean
- `uv run python -m pytest tests/python/`: 485 passed, 9 skipped
- `uv run ruff check py_src/ tests/`: clean
- `uv run mypy py_src/taskito/`: clean
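For intuition on why the claim-first ordering closes the race under real concurrency, here is a standalone two-thread model. It uses a plain `Mutex<usize>` counter as a stand-in for the storage backend's claim rows; none of these names are taskito's actual API.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Claim first (increment under the lock), then check the cap with
// strict `>` and roll back on rejection. Because the claim is recorded
// before the check reads the count, two racing schedulers can never
// both observe a below-cap count and both proceed.
fn try_dispatch(running: &Mutex<usize>, max_concurrent: usize) -> bool {
    let mut n = running.lock().unwrap();
    *n += 1; // claim_execution analogue: record the claim first
    if *n > max_concurrent {
        *n -= 1; // rollback path: clear the claim so the job can retry
        return false;
    }
    true
}

// Spawn `workers` concurrent dispatch attempts and count how many pass.
fn race(max_concurrent: usize, workers: usize) -> usize {
    let running = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let r = Arc::clone(&running);
            thread::spawn(move || try_dispatch(&r, max_concurrent))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&ok| ok)
        .count()
}
```

However the threads interleave, exactly `max_concurrent` attempts succeed (or all of them, when there are fewer workers than the cap), which is the invariant the new regression tests assert against the real scheduler.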