fix(scheduler): reschedule job when worker channel is full or closed#105

Merged
pratyush618 merged 1 commit into master from fix/scheduler-channel-full-rollback on May 2, 2026

Conversation

@pratyush618
Collaborator

Summary

P0-2 from the pre-release audit. try_dispatch was silently dropping jobs when try_send to the worker pool channel failed. Because the job was already in Running state with an execution claim, the stale-job reaper would eventually time it out and report a timeout failure to middleware and metrics — wrong outcome for a job that never ran.

What changed

crates/taskito-core/src/scheduler/poller.rs:

  • Distinguish TrySendError::Full (worker pool is behind — backpressure) from TrySendError::Closed (worker pool shutting down) with separate log lines, so operators can tell the two states apart.
  • Route both failure modes through the existing rollback_claim_and_retry helper — clears the execution-claim row and resets status to Pending with a 100ms delay (CHANNEL_BACKPRESSURE_RETRY_DELAY_MS).
  • Return Ok(false) from try_dispatch on dispatch failure so adaptive polling backs off, instead of returning Ok(true) as if the dispatch had succeeded.
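The branching above can be sketched roughly as follows, using std's `sync_channel` (whose shutdown variant is `TrySendError::Disconnected` rather than the async channel's `Closed`). `Job`, `JobStatus`, and the `rollback_claim_and_retry` body are simplified stand-ins, not the real scheduler types:

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

const CHANNEL_BACKPRESSURE_RETRY_DELAY_MS: u64 = 100;

#[derive(Debug, PartialEq)]
enum JobStatus {
    Pending,
    Running,
}

struct Job {
    id: u32,
    status: JobStatus,
}

// Stand-in for the real helper: clears the execution claim and resets
// the job to Pending with a short retry delay.
fn rollback_claim_and_retry(job: &mut Job, _delay_ms: u64) {
    job.status = JobStatus::Pending;
}

// Ok(true) only when the job was actually handed to the worker pool;
// Ok(false) on either failure mode, so adaptive polling backs off.
fn try_dispatch(job: &mut Job, tx: &SyncSender<u32>) -> Result<bool, String> {
    job.status = JobStatus::Running; // claim is written before the send
    match tx.try_send(job.id) {
        Ok(()) => Ok(true),
        Err(TrySendError::Full(_)) => {
            // Worker pool is behind: backpressure, not an error in the job.
            eprintln!("worker pool behind (backpressure), rescheduling job {}", job.id);
            rollback_claim_and_retry(job, CHANNEL_BACKPRESSURE_RETRY_DELAY_MS);
            Ok(false)
        }
        Err(TrySendError::Disconnected(_)) => {
            // Worker pool is shutting down: same rollback, different log line.
            eprintln!("worker pool shut down, rescheduling job {}", job.id);
            rollback_claim_and_retry(job, CHANNEL_BACKPRESSURE_RETRY_DELAY_MS);
            Ok(false)
        }
    }
}
```

The key point is that both arms converge on the same rollback path and the same `Ok(false)` return; only the log line differs.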

The retry delay is short on purpose: 100ms is enough for the worker pool to drain a slot under steady-state backpressure, and the stale-job reaper window stays as the eventual safety net if rollback itself fails.

Tests

Two new regression tests in crates/taskito-core/src/scheduler/mod.rs:

  • test_try_dispatch_reschedules_on_closed_channel — drops the receiver before tick, verifies job returns to Pending with no stale claim row.
  • test_try_dispatch_reschedules_on_full_channel — pre-fills a capacity-1 channel with a sentinel job, verifies the same recovery on TrySendError::Full.
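The two test setups can be reproduced in miniature with std's `sync_channel` (where the shutdown variant is `Disconnected` instead of the async channel's `Closed`); both helpers below are illustrative sketches, not the actual test code:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Closed-channel path: dropping the receiver makes the next try_send
// fail immediately, simulating worker-pool shutdown.
fn send_after_receiver_dropped() -> Result<(), TrySendError<u32>> {
    let (tx, rx) = sync_channel::<u32>(1);
    drop(rx);
    tx.try_send(7)
}

// Full-channel path: a sentinel value occupies the only slot of a
// capacity-1 channel, so the next try_send sees backpressure.
fn send_into_full_channel() -> Result<(), TrySendError<u32>> {
    let (tx, _rx) = sync_channel::<u32>(1);
    tx.try_send(0).expect("sentinel fits the empty channel");
    tx.try_send(7)
}
```

Either failure is what the regression tests feed into the tick, after which they assert the job is back in `Pending` with no stale claim row.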

Test plan

  • cargo test --workspace — all 80 tests pass + 2 new
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • cargo check --workspace --features postgres clean
  • cargo check --workspace --features redis clean
  • uv run python -m pytest tests/python/ — 485 passed, 9 skipped
  • uv run ruff check py_src/ tests/ clean
  • uv run mypy py_src/taskito/ clean
  • CI green

`try_dispatch` was logging the channel-send failure and returning
`Ok(true)` as if the job had been dispatched. The job had already been
moved to `Running` and its execution-claim row written, so it sat in
that state until the stale-job reaper timed it out — at which point it
was reported to middleware and metrics as a *timeout failure*, the wrong
outcome for a job that never executed.

Distinguish the two failure modes (channel full vs closed) with separate
warnings so operators can tell backpressure from shutdown, and route both
through `rollback_claim_and_retry` to clear the claim and reset status to
`Pending` with a 100ms delay. The next tick will dispatch normally once
the worker pool drains or restarts.

Two regression tests cover the closed-channel and full-channel paths.
pratyush618 merged commit 415467b into master May 2, 2026
19 checks passed
pratyush618 deleted the fix/scheduler-channel-full-rollback branch May 2, 2026 06:47
