fix(scheduler): reschedule job when worker channel is full or closed#105

Merged
pratyush618 merged 1 commit into master from fix/scheduler-channel-full-rollback on May 2, 2026

Conversation

@pratyush618
Collaborator

Summary

P0-2 from the pre-release audit. try_dispatch was silently dropping jobs when try_send to the worker pool channel failed. Because the job was already in Running state with an execution claim, the stale-job reaper would eventually time it out and report a timeout failure to middleware and metrics — wrong outcome for a job that never ran.

What changed

crates/taskito-core/src/scheduler/poller.rs:

  • Distinguish TrySendError::Full (worker pool is behind — backpressure) from TrySendError::Closed (worker pool shutting down) with separate log lines, so operators can tell the two states apart.
  • Route both failure modes through the existing rollback_claim_and_retry helper — clears the execution-claim row and resets status to Pending with a 100ms delay (CHANNEL_BACKPRESSURE_RETRY_DELAY_MS).
  • Return Ok(false) from try_dispatch on dispatch failure so adaptive polling backs off, instead of returning Ok(true) as if the dispatch had succeeded.
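The branching above can be sketched roughly as follows, using std's `sync_channel` (whose shutdown variant is `TrySendError::Disconnected` rather than the async channel's `Closed`). `Job`, `JobStatus`, and the `rollback_claim_and_retry` body are simplified stand-ins, not the real scheduler types:

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

const CHANNEL_BACKPRESSURE_RETRY_DELAY_MS: u64 = 100;

#[derive(Debug, PartialEq)]
enum JobStatus {
    Pending,
    Running,
}

struct Job {
    id: u32,
    status: JobStatus,
}

// Stand-in for the real helper: clears the execution claim and resets
// the job to Pending with a short retry delay.
fn rollback_claim_and_retry(job: &mut Job, _delay_ms: u64) {
    job.status = JobStatus::Pending;
}

// Ok(true) only when the job was actually handed to the worker pool;
// Ok(false) on either failure mode, so adaptive polling backs off.
fn try_dispatch(job: &mut Job, tx: &SyncSender<u32>) -> Result<bool, String> {
    job.status = JobStatus::Running; // claim is written before the send
    match tx.try_send(job.id) {
        Ok(()) => Ok(true),
        Err(TrySendError::Full(_)) => {
            // Worker pool is behind: backpressure, not an error in the job.
            eprintln!("worker pool behind (backpressure), rescheduling job {}", job.id);
            rollback_claim_and_retry(job, CHANNEL_BACKPRESSURE_RETRY_DELAY_MS);
            Ok(false)
        }
        Err(TrySendError::Disconnected(_)) => {
            // Worker pool is shutting down: same rollback, different log line.
            eprintln!("worker pool shut down, rescheduling job {}", job.id);
            rollback_claim_and_retry(job, CHANNEL_BACKPRESSURE_RETRY_DELAY_MS);
            Ok(false)
        }
    }
}
```

The key point is that both arms converge on the same rollback path and the same `Ok(false)` return; only the log line differs.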

The retry delay is short on purpose: 100ms is enough for the worker pool to drain a slot under steady-state backpressure, and the stale-job reaper window stays as the eventual safety net if rollback itself fails.

Tests

Two new regression tests in crates/taskito-core/src/scheduler/mod.rs:

  • test_try_dispatch_reschedules_on_closed_channel — drops the receiver before tick, verifies job returns to Pending with no stale claim row.
  • test_try_dispatch_reschedules_on_full_channel — pre-fills a capacity-1 channel with a sentinel job, verifies the same recovery on TrySendError::Full.
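The two test setups can be reproduced in miniature with std's `sync_channel` (where the shutdown variant is `Disconnected` instead of the async channel's `Closed`); both helpers below are illustrative sketches, not the actual test code:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Closed-channel path: dropping the receiver makes the next try_send
// fail immediately, simulating worker-pool shutdown.
fn send_after_receiver_dropped() -> Result<(), TrySendError<u32>> {
    let (tx, rx) = sync_channel::<u32>(1);
    drop(rx);
    tx.try_send(7)
}

// Full-channel path: a sentinel value occupies the only slot of a
// capacity-1 channel, so the next try_send sees backpressure.
fn send_into_full_channel() -> Result<(), TrySendError<u32>> {
    let (tx, _rx) = sync_channel::<u32>(1);
    tx.try_send(0).expect("sentinel fits the empty channel");
    tx.try_send(7)
}
```

Either failure is what the regression tests feed into the tick, after which they assert the job is back in `Pending` with no stale claim row.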

Test plan

  • cargo test --workspace — all 80 tests pass + 2 new
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • cargo check --workspace --features postgres clean
  • cargo check --workspace --features redis clean
  • uv run python -m pytest tests/python/ — 485 passed, 9 skipped
  • uv run ruff check py_src/ tests/ clean
  • uv run mypy py_src/taskito/ clean
  • CI green

`try_dispatch` was logging the channel-send failure and returning
`Ok(true)` as if the job had been dispatched. The job had already been
moved to `Running` and its execution-claim row written, so it sat in
that state until the stale-job reaper timed it out — at which point it
was reported to middleware and metrics as a *timeout failure*, the wrong
outcome for a job that never executed.

Distinguish the two failure modes (channel full vs closed) with separate
warnings so operators can tell backpressure from shutdown, and route both
through `rollback_claim_and_retry` to clear the claim and reset status to
`Pending` with a 100ms delay. The next tick will dispatch normally once
the worker pool drains or restarts.

Two regression tests cover the closed-channel and full-channel paths.
pratyush618 merged commit 415467b into master May 2, 2026
19 checks passed
pratyush618 deleted the fix/scheduler-channel-full-rollback branch May 2, 2026 06:47
