fix(scheduler): make concurrency cap atomic with claim_execution #104

Merged
pratyush618 merged 1 commit into master from fix/scheduler-concurrency-atomicity on May 2, 2026

Conversation

@pratyush618
Collaborator

Summary

The pre-release audit (P0-1, see .claude/reports/codebase-prerelease-2026-05-02.md) flagged a TOCTOU (time-of-check to time-of-use) race on the per-task / per-queue concurrency gates. Investigating the surrounding code surfaced a second, distinct bug hiding in the same lines: an off-by-one that made max_concurrent = N admit only N-1 jobs (and max_concurrent = 1 admit none).

Both bugs are fixed in this PR.

What was wrong

  1. Race: the cap check ran before claim_execution. Two scheduler instances could each read a running count below the cap and both proceed past the gate before either recorded the new running job. SQLite masked the race through transaction serialization; Postgres (SELECT FOR UPDATE SKIP LOCKED) and Redis were fully exposed.

  2. Off-by-one: dequeue_from atomically transitions status Pending → Running before the cap check runs, so count_running_by_task / stats_by_queue.running already include the just-dequeued job. The >= comparison therefore counted the current job against itself; the sketch after this list makes the arithmetic concrete.
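
To make the off-by-one concrete, here is a minimal, self-contained sketch of the gate arithmetic. The function names are hypothetical stand-ins, not the taskito-core API:

```rust
/// Pre-fix gate: rejects when running >= cap. Because `running` already
/// includes the just-dequeued job, cap = N admits only N - 1 jobs.
fn old_gate(running_including_self: usize, cap: usize) -> bool {
    !(running_including_self >= cap)
}

/// Post-fix gate: strict `>` tolerates the self-count, admitting exactly N.
fn new_gate(running_including_self: usize, cap: usize) -> bool {
    !(running_including_self > cap)
}

fn main() {
    // The first job under cap = 1 is already counted as Running,
    // so the gate sees a count of 1.
    assert!(!old_gate(1, 1)); // bug: the only permitted job is rejected
    assert!(new_gate(1, 1));  // fix: it is admitted
    // A second concurrent job sees a count of 2 and is rejected either way.
    assert!(!new_gate(2, 1));
    println!("cap = 1: old gate admits 0 jobs, new gate admits exactly 1");
}
```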

What changed

  • try_dispatch is split into named helpers — active_queues, check_pre_claim_gates, claim_for_dispatch, check_post_claim_concurrency, rollback_claim_and_retry. Each step of the dispatch flow is now self-documenting and individually testable.
  • The concurrency cap check moved to after claim_execution succeeds, comparing with strict >. Two concurrent schedulers can no longer both pass the gate when capacity is full — at most one sees the count at-cap and proceeds; the other observes count > cap and rolls back. A toy model of this ordering follows the list.
  • New rollback path: when the post-claim gate rejects a job, complete_execution clears the claim row before retry resets the status. Without this, the next dispatch attempt would hit claim_execution → Ok(false) ("already claimed") and the job would be stuck until the claim was reaped on timeout.
  • Pre-claim gates (rate limiter, circuit breaker) keep their existing position — they don't need atomic-with-claim semantics, and consuming a rate-limit token or probing the breaker post-claim would waste them on rolled-back claims.
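
The new ordering can be modeled with a plain atomic counter standing in for the claim row. This is a toy model, not taskito's storage layer: the point is that the claim itself is the atomic step, so the gate reads a count that already includes its own claim and rolls back on overshoot.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Toy dispatch: "claim" by atomically incrementing, then gate with
/// strict `>` on a count that includes our own claim, rolling the
/// claim back if we overshot the cap.
fn try_dispatch(running: &AtomicUsize, cap: usize) -> bool {
    let after_claim = running.fetch_add(1, Ordering::SeqCst) + 1; // claim
    if after_claim > cap {
        running.fetch_sub(1, Ordering::SeqCst); // rollback: clear the claim
        return false;
    }
    true
}

fn main() {
    let cap = 2;
    let running = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let r = Arc::clone(&running);
            thread::spawn(move || try_dispatch(&r, cap))
        })
        .collect();
    let admitted = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&ok| ok)
        .count();
    // Because the claim is atomic, no interleaving of the four threads
    // can admit more than `cap` jobs; with no completions in between,
    // exactly `cap` are admitted.
    assert_eq!(admitted, cap);
    println!("admitted {admitted} of 4 under cap = {cap}");
}
```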

Tests

Three new regression tests in crates/taskito-core/src/scheduler/mod.rs:

  • test_try_dispatch_per_task_concurrency_allows_exactly_max — enqueue 3 jobs with max_concurrent = 2, verify exactly 2 dispatched and the third is back in Pending with no stale claim row.
  • test_try_dispatch_per_task_max_one_dispatches_one — direct regression for the off-by-one; its shape is sketched after this list.
  • test_try_dispatch_per_queue_concurrency_allows_exactly_max — same shape on the queue-level cap.
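
For reference, the off-by-one regression has roughly this shape when expressed against the toy admission model above; the real tests drive the actual store through try_dispatch and additionally assert on job status and claim rows:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Same toy admission model as above (hypothetical, not the scheduler's
/// real API): atomic claim, strict `>` gate, rollback on rejection.
fn admit(running: &AtomicUsize, cap: usize) -> bool {
    let after = running.fetch_add(1, Ordering::SeqCst) + 1;
    if after > cap {
        running.fetch_sub(1, Ordering::SeqCst);
        return false;
    }
    true
}

#[test]
fn per_task_max_one_dispatches_one() {
    let running = AtomicUsize::new(0);
    // The old `>=` gate rejected even this first job under cap = 1.
    assert!(admit(&running, 1));
    // At cap: the second job is rejected and its claim rolled back.
    assert!(!admit(&running, 1));
    // Rollback leaves no stale claim behind; only the admitted job counts.
    assert_eq!(running.load(Ordering::SeqCst), 1);
}
```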

Test plan

  • cargo test --workspace — all 78 existing tests pass, plus the 3 new ones
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • cargo check --workspace --features postgres clean
  • cargo check --workspace --features redis clean
  • uv run python -m pytest tests/python/ — 485 passed, 9 skipped
  • uv run ruff check py_src/ tests/ clean
  • uv run mypy py_src/taskito/ clean
  • CI green

Two distinct bugs in the per-task and per-queue concurrency gates:

1. The cap check ran *before* `claim_execution`, so two scheduler instances
   could read a count below the cap and both proceed past the gate before
   either recorded the new running job. Postgres (`SELECT FOR UPDATE SKIP
   LOCKED`) and Redis are fully concurrent and would hit this; SQLite was
   protected only by transaction serialization.

2. `dequeue_from` already transitions status to `Running` before the gate
   runs, so the running-count includes the just-dequeued job. The `>=`
   comparison meant `max_concurrent = N` actually allowed only N-1 jobs;
   `max_concurrent = 1` allowed zero.

Move the cap check to *after* `claim_execution` succeeds, change `>=` to
`>`, and add a rollback path that clears the claim row + retries the job
when the post-claim gate rejects it. Split `try_dispatch` into named
helpers so each step of the dispatch flow is self-documenting.

Three regression tests cover the off-by-one and the rollback path.
pratyush618 merged commit 88e18c8 into master on May 2, 2026
19 checks passed
pratyush618 deleted the fix/scheduler-concurrency-atomicity branch on May 2, 2026 at 18:17