fix(scheduler): make concurrency cap atomic with claim_execution #104

Merged
pratyush618 merged 1 commit into master from fix/scheduler-concurrency-atomicity on May 2, 2026

Conversation

@pratyush618
Collaborator

Summary

The pre-release audit (P0-1, see .claude/reports/codebase-prerelease-2026-05-02.md) flagged a TOCTOU (time-of-check to time-of-use) race on the per-task / per-queue concurrency gates. Investigating the surrounding code surfaced a second, distinct bug hiding in the same lines: an off-by-one that made max_concurrent = N admit only N-1 jobs (and max_concurrent = 1 admit none).

Both bugs are fixed in this PR.

What was wrong

  1. Race: the cap check ran before claim_execution. Two scheduler instances could each read a running count below the cap and both proceed past the gate before either recorded the new running job. SQLite masked the race through transaction serialization; Postgres (SELECT FOR UPDATE SKIP LOCKED) and Redis were fully exposed.

  2. Off-by-one: dequeue_from atomically transitions status Pending → Running before the cap check runs, so count_running_by_task / stats_by_queue.running already include the just-dequeued job. The >= comparison therefore counted the current job against itself; the sketch after this list makes the arithmetic concrete.
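
To make the off-by-one concrete, here is a minimal, self-contained sketch of the gate arithmetic. The function names are hypothetical stand-ins, not the taskito-core API:

```rust
/// Pre-fix gate: rejects when running >= cap. Because `running` already
/// includes the just-dequeued job, cap = N admits only N - 1 jobs.
fn old_gate(running_including_self: usize, cap: usize) -> bool {
    !(running_including_self >= cap)
}

/// Post-fix gate: strict `>` tolerates the self-count, admitting exactly N.
fn new_gate(running_including_self: usize, cap: usize) -> bool {
    !(running_including_self > cap)
}

fn main() {
    // The first job under cap = 1 is already counted as Running,
    // so the gate sees a count of 1.
    assert!(!old_gate(1, 1)); // bug: the only permitted job is rejected
    assert!(new_gate(1, 1));  // fix: it is admitted
    // A second concurrent job sees a count of 2 and is rejected either way.
    assert!(!new_gate(2, 1));
    println!("cap = 1: old gate admits 0 jobs, new gate admits exactly 1");
}
```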

What changed

  • try_dispatch is split into named helpers — active_queues, check_pre_claim_gates, claim_for_dispatch, check_post_claim_concurrency, rollback_claim_and_retry. Each step of the dispatch flow is now self-documenting and individually testable.
  • The concurrency cap check moved to after claim_execution succeeds, comparing with strict >. Two concurrent schedulers can no longer both pass the gate when capacity is full — at most one sees the count at-cap and proceeds; the other observes count > cap and rolls back. A toy model of this ordering follows the list.
  • New rollback path: when the post-claim gate rejects a job, complete_execution clears the claim row before retry resets the status. Without this, the next dispatch attempt would hit claim_execution → Ok(false) ("already claimed") and the job would be stuck until the claim was reaped on timeout.
  • Pre-claim gates (rate limiter, circuit breaker) keep their existing position — they don't need atomic-with-claim semantics, and consuming a rate-limit token or probing the breaker post-claim would waste them on rolled-back claims.
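
The new ordering can be modeled with a plain atomic counter standing in for the claim row. This is a toy model, not taskito's storage layer: the point is that the claim itself is the atomic step, so the gate reads a count that already includes its own claim and rolls back on overshoot.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Toy dispatch: "claim" by atomically incrementing, then gate with
/// strict `>` on a count that includes our own claim, rolling the
/// claim back if we overshot the cap.
fn try_dispatch(running: &AtomicUsize, cap: usize) -> bool {
    let after_claim = running.fetch_add(1, Ordering::SeqCst) + 1; // claim
    if after_claim > cap {
        running.fetch_sub(1, Ordering::SeqCst); // rollback: clear the claim
        return false;
    }
    true
}

fn main() {
    let cap = 2;
    let running = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let r = Arc::clone(&running);
            thread::spawn(move || try_dispatch(&r, cap))
        })
        .collect();
    let admitted = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&ok| ok)
        .count();
    // Because the claim is atomic, no interleaving of the four threads
    // can admit more than `cap` jobs; with no completions in between,
    // exactly `cap` are admitted.
    assert_eq!(admitted, cap);
    println!("admitted {admitted} of 4 under cap = {cap}");
}
```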

Tests

Three new regression tests in crates/taskito-core/src/scheduler/mod.rs:

  • test_try_dispatch_per_task_concurrency_allows_exactly_max — enqueue 3 jobs with max_concurrent = 2, verify exactly 2 dispatched and the third is back in Pending with no stale claim row.
  • test_try_dispatch_per_task_max_one_dispatches_one — direct regression for the off-by-one; its shape is sketched after this list.
  • test_try_dispatch_per_queue_concurrency_allows_exactly_max — same shape on the queue-level cap.
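
For reference, the off-by-one regression has roughly this shape when expressed against the toy admission model above; the real tests drive the actual store through try_dispatch and additionally assert on job status and claim rows:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Same toy admission model as above (hypothetical, not the scheduler's
/// real API): atomic claim, strict `>` gate, rollback on rejection.
fn admit(running: &AtomicUsize, cap: usize) -> bool {
    let after = running.fetch_add(1, Ordering::SeqCst) + 1;
    if after > cap {
        running.fetch_sub(1, Ordering::SeqCst);
        return false;
    }
    true
}

#[test]
fn per_task_max_one_dispatches_one() {
    let running = AtomicUsize::new(0);
    // The old `>=` gate rejected even this first job under cap = 1.
    assert!(admit(&running, 1));
    // At cap: the second job is rejected and its claim rolled back.
    assert!(!admit(&running, 1));
    // Rollback leaves no stale claim behind; only the admitted job counts.
    assert_eq!(running.load(Ordering::SeqCst), 1);
}
```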

Test plan

  • cargo test --workspace — all 78 existing tests pass, plus the 3 new ones
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • cargo check --workspace --features postgres clean
  • cargo check --workspace --features redis clean
  • uv run python -m pytest tests/python/ — 485 passed, 9 skipped
  • uv run ruff check py_src/ tests/ clean
  • uv run mypy py_src/taskito/ clean
  • CI green

Two distinct bugs in the per-task and per-queue concurrency gates:

1. The cap check ran *before* `claim_execution`, so two scheduler instances
   could read a count below the cap and both proceed past the gate before
   either recorded the new running job. Postgres (`SELECT FOR UPDATE SKIP
   LOCKED`) and Redis are fully concurrent and would hit this; SQLite was
   protected only by transaction serialization.

2. `dequeue_from` already transitions status to `Running` before the gate
   runs, so the running-count includes the just-dequeued job. The `>=`
   comparison meant `max_concurrent = N` actually allowed only N-1 jobs;
   `max_concurrent = 1` allowed zero.

Move the cap check to *after* `claim_execution` succeeds, change `>=` to
`>`, and add a rollback path that clears the claim row + retries the job
when the post-claim gate rejects it. Split `try_dispatch` into named
helpers so each step of the dispatch flow is self-documenting.

Three regression tests cover the off-by-one and the rollback path.
pratyush618 merged commit 88e18c8 into master on May 2, 2026
19 checks passed
pratyush618 deleted the fix/scheduler-concurrency-atomicity branch on May 2, 2026 at 18:17