Fix race condition in InMemoryQueue.AbandonAsync causing flaky RunUntilEmptyAsync #470

Merged
niemyjski merged 2 commits into main from fix/inmemory-queue-abandon-race on Mar 4, 2026

@niemyjski
Member

Summary

  • Fix a TOCTOU race condition in InMemoryQueue.AbandonAsync that causes RunUntilEmptyAsync to exit prematurely, making tests like CanRunQueueJobWithLockFailAsync flaky.

Root Cause

During AbandonAsync, there is a window between removing an entry from _dequeued (Working count drops) and re-enqueueing it for retry (Queued count rises) where the item exists in neither collection. If the RunUntilEmptyAsync continuation callback checks Queued + Working during this gap, it sees 0 + 0 and terminates the job loop while retryable items are still in flight.

Timeline:
  _dequeued.TryRemove()  → Working=0, Queued=0  ← RACE WINDOW
  OnAbandonedAsync()
  _queue.Enqueue()       → Working=0, Queued=1
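
The window in the timeline above can be reproduced deterministically with a small model. This is only a sketch under stated assumptions: `NaiveQueueModel`, `RaceWindowDemo`, and the `betweenSteps` callback are hypothetical names, not part of the actual InMemoryQueue code. The callback stands in for `OnAbandonedAsync` and for any stats reader that happens to run between the two steps.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical, simplified model; NOT the real InMemoryQueue implementation.
public sealed class NaiveQueueModel
{
    private readonly ConcurrentQueue<int> _queue = new ConcurrentQueue<int>();
    private readonly ConcurrentDictionary<int, byte> _dequeued = new ConcurrentDictionary<int, byte>();

    public int Queued => _queue.Count;
    public int Working => _dequeued.Count;

    public void Enqueue(int id) => _queue.Enqueue(id);

    public bool TryDequeue(out int id)
    {
        if (!_queue.TryDequeue(out id))
            return false;
        _dequeued[id] = 0; // the item is now counted as Working
        return true;
    }

    // betweenSteps is an illustrative hook that runs inside the race window.
    public void Abandon(int id, Action betweenSteps)
    {
        if (!_dequeued.TryRemove(id, out _))
            return;
        betweenSteps();     // the item is in neither collection here
        _queue.Enqueue(id); // the item is counted as Queued again
    }
}

public static class RaceWindowDemo
{
    // Returns the Queued + Working total observed inside the window.
    public static int ObserveTotalDuringAbandon()
    {
        var queue = new NaiveQueueModel();
        queue.Enqueue(1);
        queue.TryDequeue(out var id);

        int observed = -1;
        queue.Abandon(id, () => observed = queue.Queued + queue.Working);
        return observed; // 0 in this model, although one item is still in flight
    }
}
```

In this model a reader that samples the stats inside the window sees a total of 0, which is the same condition the description identifies as ending the job loop early.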

Fix

Add a _pendingRetryCount field that bridges the gap:

  • Incremented before TryRemove from _dequeued
  • Decremented after the item reaches its destination (re-queued synchronously, moved to deadletter, or handed off to delayed retry)
  • Included in the Queued stat so RunUntilEmptyAsync sees items in transit

For delayed retries (RetryDelay > 0), the counter is decremented immediately after scheduling the Run.DelayedAsync task, since the item is intentionally parked and the job loop should not spin-wait for it.
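
A minimal sketch of the counter idea, assuming a simplified model rather than the real class: `CountedQueueModel`, `PendingCounterDemo`, and the `betweenSteps` hook are hypothetical, and the `try`/`finally` is an illustrative shortcut (the PR decrements at each exit path explicitly). Only the synchronous re-queue path is modelled; deadletter and delayed retry are left out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical model of the fix; only the field name _pendingRetryCount comes from the PR.
public sealed class CountedQueueModel
{
    private readonly ConcurrentQueue<int> _queue = new ConcurrentQueue<int>();
    private readonly ConcurrentDictionary<int, byte> _dequeued = new ConcurrentDictionary<int, byte>();
    private int _pendingRetryCount;

    // Items in transit are reported as queued.
    public int Queued => _queue.Count + Volatile.Read(ref _pendingRetryCount);
    public int Working => _dequeued.Count;

    public void Enqueue(int id) => _queue.Enqueue(id);

    public bool TryDequeue(out int id)
    {
        if (!_queue.TryDequeue(out id))
            return false;
        _dequeued[id] = 0;
        return true;
    }

    public void Abandon(int id, Action betweenSteps)
    {
        Interlocked.Increment(ref _pendingRetryCount); // before TryRemove
        try
        {
            if (!_dequeued.TryRemove(id, out _))
                return;
            betweenSteps();     // illustrative hook inside the former race window
            _queue.Enqueue(id); // synchronous re-queue
        }
        finally
        {
            Interlocked.Decrement(ref _pendingRetryCount); // after the item has landed
        }
    }
}

public static class PendingCounterDemo
{
    // Returns the Queued + Working totals seen during and after the abandon.
    public static (int During, int After) Run()
    {
        var queue = new CountedQueueModel();
        queue.Enqueue(1);
        queue.TryDequeue(out var id);

        int during = -1;
        queue.Abandon(id, () => during = queue.Queued + queue.Working);
        return (during, queue.Queued + queue.Working);
    }
}
```

In this sketch the total stays at 1 both during and after the abandon. Between `_queue.Enqueue` and the decrement the item is briefly counted twice, which errs on the side of keeping the loop alive rather than ending it.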

Test plan

  • CanRunQueueJobWithLockFailAsync passes 20/20 consecutive runs (previously flaky)
  • CanAbandonQueueEntryOnceAsync passes (verifies Working == 0 after abandon)
  • CanRunBadWorkItem passes (verifies delayed retry path doesn't inflate stats)
  • Full test suite: 1790 passed, 0 failed, 13 skipped

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a TOCTOU (time-of-check-time-of-use) race condition in InMemoryQueue.AbandonAsync that was causing RunUntilEmptyAsync to exit prematurely when items were being abandoned for retry. A new _pendingRetryCount counter bridges the gap between when an item is removed from _dequeued (Working count drops) and when it is re-enqueued for retry (Queued count rises), preventing RunUntilEmptyAsync's continuation check (Queued + Working > 0) from seeing a spurious zero.

Changes:

  • Add _pendingRetryCount field, incremented at the start of AbandonAsync (after the guard check) and decremented at each exit path.
  • Include _pendingRetryCount in the Queued stat returned by GetMetricsQueueStats, so RunUntilEmptyAsync sees in-transit items.
  • Reset _pendingRetryCount in DeleteQueueImplAsync, and change var unawaited = Run.DelayedAsync(...) to the idiomatic discard form _ = Run.DelayedAsync(...).

@niemyjski force-pushed the fix/inmemory-queue-abandon-race branch from a244cee to 3be8fc0 on March 4, 2026 18:46
@niemyjski requested a review from Copilot on March 4, 2026 18:46
…t prematurely

During AbandonAsync, there is a window between removing an entry from
_dequeued and re-enqueueing it for retry where the item exists in
neither collection. If RunUntilEmptyAsync checks queue stats during
this gap, it sees Queued=0 + Working=0 and terminates the job loop
while retryable items are still in flight.

Add a _pendingRetryCount that bridges the gap: incremented before
TryRemove, decremented after the item lands in its destination
(re-queued, deadlettered, or scheduled for delayed retry). The count
is included in the Queued stat so the continuation callback sees
items in transit.

For delayed retries (RetryDelay > 0), the counter is decremented
immediately after scheduling since the item is intentionally parked
and RunUntilEmptyAsync should not spin-wait for it.

Fixes flaky CanRunQueueJobWithLockFailAsync test.

Made-with: Cursor

In the deadletter path, the item moves out of Queued/Working entirely,
so decrementing _pendingRetryCount before enqueuing to the deadletter
queue avoids a transient overcount where the item appears in both the
Queued stat (via _pendingRetryCount) and the Deadletter stat (via
_deadletterQueue.Count).

The synchronous retry path intentionally keeps the current ordering
(Retry then Decrement) because decrementing first would re-open the
race window this PR fixes: between the decrement and _queue.Enqueue
inside Retry(), both _pendingRetryCount and _queue.Count would be
zero, allowing RunUntilEmptyAsync to exit prematurely.

Made-with: Cursor
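
The two orderings described in this commit message can be traced with plain counters. This is a single-threaded sketch, not the real code: `DecrementOrderingTrace` and `Counters` are hypothetical, and their fields merely stand in for `_queue.Count`, `_deadletterQueue.Count`, and `_pendingRetryCount`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical trace of the stats after each step; the counters are illustrative stand-ins.
public static class DecrementOrderingTrace
{
    public sealed class Counters
    {
        public int QueueCount;
        public int DeadletterCount;
        public int PendingRetryCount;
        public int QueuedStat => QueueCount + PendingRetryCount;
    }

    // Deadletter path: decrement first, then move the item to the deadletter queue.
    public static List<string> Deadletter()
    {
        var c = new Counters { PendingRetryCount = 1 };
        var trace = new List<string>();
        c.PendingRetryCount--;
        trace.Add(Snapshot(c)); // Queued=0 Deadletter=0
        c.DeadletterCount++;
        trace.Add(Snapshot(c)); // Queued=0 Deadletter=1
        return trace;
    }

    // Synchronous retry path: re-enqueue first, then decrement.
    public static List<string> Retry()
    {
        var c = new Counters { PendingRetryCount = 1 };
        var trace = new List<string>();
        c.QueueCount++;
        trace.Add(Snapshot(c)); // Queued=2 Deadletter=0
        c.PendingRetryCount--;
        trace.Add(Snapshot(c)); // Queued=1 Deadletter=0
        return trace;
    }

    private static string Snapshot(Counters c) =>
        $"Queued={c.QueuedStat} Deadletter={c.DeadletterCount}";
}
```

In the deadletter trace the item is never visible in both stats at once, and in the retry trace the Queued stat never reaches zero, which matches the reasoning given for keeping the two paths ordered differently.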
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 3 changed files in this pull request and generated no new comments.

@niemyjski merged commit bb7146e into main on Mar 4, 2026 (4 checks passed)
@niemyjski deleted the fix/inmemory-queue-abandon-race branch on March 4, 2026 18:52