Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a TOCTOU (time-of-check-time-of-use) race condition in InMemoryQueue.AbandonAsync that was causing RunUntilEmptyAsync to exit prematurely when items were being abandoned for retry. A new _pendingRetryCount counter bridges the gap between when an item is removed from _dequeued (Working count drops) and when it is re-enqueued for retry (Queued count rises), preventing RunUntilEmptyAsync's continuation check (Queued + Working > 0) from seeing a spurious zero.
Changes:
- Add
_pendingRetryCountfield, incremented at the start ofAbandonAsync(after the guard check) and decremented at each exit path. - Include
_pendingRetryCountin theQueuedstat returned byGetMetricsQueueStats, soRunUntilEmptyAsyncsees in-transit items. - Reset
_pendingRetryCountinDeleteQueueImplAsync; changevar unawaited = Run.DelayedAsync(...)to the canonical_ = Run.DelayedAsync(...)discarded pattern.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
a244cee to
3be8fc0
Compare
…t prematurely During AbandonAsync, there is a window between removing an entry from _dequeued and re-enqueueing it for retry where the item exists in neither collection. If RunUntilEmptyAsync checks queue stats during this gap, it sees Queued=0 + Working=0 and terminates the job loop while retryable items are still in flight. Add a _pendingRetryCount that bridges the gap: incremented before TryRemove, decremented after the item lands in its destination (re-queued, deadlettered, or scheduled for delayed retry). The count is included in the Queued stat so the continuation callback sees items in transit. For delayed retries (RetryDelay > 0), the counter is decremented immediately after scheduling since the item is intentionally parked and RunUntilEmptyAsync should not spin-wait for it. Fixes flaky CanRunQueueJobWithLockFailAsync test. Made-with: Cursor
In the deadletter path, the item moves out of Queued/Working entirely, so decrementing _pendingRetryCount before enqueuing to the deadletter queue avoids a transient overcount where the item appears in both the Queued stat (via _pendingRetryCount) and the Deadletter stat (via _deadletterQueue.Count). The synchronous retry path intentionally keeps the current ordering (Retry then Decrement) because decrementing first would re-open the race window this PR fixes: between the decrement and _queue.Enqueue inside Retry(), both _pendingRetryCount and _queue.Count would be zero, allowing RunUntilEmptyAsync to exit prematurely. Made-with: Cursor
3be8fc0 to
5c4f7ac
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 3 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary
InMemoryQueue.AbandonAsyncthat causesRunUntilEmptyAsyncto exit prematurely, making tests likeCanRunQueueJobWithLockFailAsyncflaky.Root Cause
During
AbandonAsync, there is a window between removing an entry from_dequeued(Working count drops) and re-enqueueing it for retry (Queued count rises) where the item exists in neither collection. If theRunUntilEmptyAsynccontinuation callback checksQueued + Workingduring this gap, it sees0 + 0and terminates the job loop while retryable items are still in flight.Fix
Add a
_pendingRetryCountfield that bridges the gap:TryRemovefrom_dequeuedQueuedstat soRunUntilEmptyAsyncsees items in transitFor delayed retries (
RetryDelay > 0), the counter is decremented immediately after scheduling theRun.DelayedAsynctask, since the item is intentionally parked and the job loop should not spin-wait for it.Test plan
CanRunQueueJobWithLockFailAsyncpasses 20/20 consecutive runs (previously flaky)CanAbandonQueueEntryOnceAsyncpasses (verifiesWorking == 0after abandon)CanRunBadWorkItempasses (verifies delayed retry path doesn't inflate stats)