Fix race condition in InMemoryQueue.AbandonAsync causing flaky RunUntilEmptyAsync by niemyjski · Pull Request #470 · FoundatioFx/Foundatio

niemyjski · 2026-03-04T03:06:21Z

Summary

Fix a TOCTOU race condition in InMemoryQueue.AbandonAsync that causes RunUntilEmptyAsync to exit prematurely, making tests like CanRunQueueJobWithLockFailAsync flaky.

Root Cause

During AbandonAsync, there is a window between removing an entry from _dequeued (Working count drops) and re-enqueueing it for retry (Queued count rises) where the item exists in neither collection. If the RunUntilEmptyAsync continuation callback checks Queued + Working during this gap, it sees 0 + 0 and terminates the job loop while retryable items are still in flight.

Timeline:
  _dequeued.TryRemove()  → Working=0, Queued=0  ← RACE WINDOW
  OnAbandonedAsync()
  _queue.Enqueue()       → Working=0, Queued=1
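
The timeline above can be reproduced deterministically with a small model. This is a minimal sketch, not the Foundatio implementation: `NaiveQueue`, `RaceDemo`, and the `onGap` callback are hypothetical names introduced only for illustration. The callback stands in for the `RunUntilEmptyAsync` continuation check landing inside the window.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical, simplified stand-in for the queue; NOT the Foundatio code.
public sealed class NaiveQueue
{
    private readonly ConcurrentQueue<string> _queue = new();
    private readonly ConcurrentDictionary<string, string> _dequeued = new();

    public int Queued => _queue.Count;
    public int Working => _dequeued.Count;

    public void Enqueue(string id) => _queue.Enqueue(id);

    public string Dequeue()
    {
        if (!_queue.TryDequeue(out var id))
            throw new InvalidOperationException("Queue is empty.");
        _dequeued[id] = id; // item is now counted as Working
        return id;
    }

    // onGap (an assumption of this sketch) runs while the item is in
    // neither collection, simulating an observer hitting the race window.
    public void Abandon(string id, Action onGap)
    {
        _dequeued.TryRemove(id, out _); // Working drops
        onGap();                        // observer reads stats here
        _queue.Enqueue(id);             // Queued rises
    }
}

public static class RaceDemo
{
    // Returns Queued + Working as seen inside the gap and after the abandon.
    public static (int SeenInGap, int SeenAfter) Run()
    {
        var q = new NaiveQueue();
        q.Enqueue("a");
        var id = q.Dequeue();

        int seenInGap = -1;
        q.Abandon(id, () => seenInGap = q.Queued + q.Working);

        return (seenInGap, q.Queued + q.Working);
    }
}
```

In this model `RaceDemo.Run()` observes `Queued + Working == 0` inside the gap even though the item is back in the queue once `Abandon` returns, which is the condition that lets the job loop exit early.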

Fix

Add a _pendingRetryCount field that bridges the gap:

Incremented before TryRemove from _dequeued
Decremented after the item reaches its destination (re-queued synchronously, moved to deadletter, or handed off to delayed retry)
Included in the Queued stat so RunUntilEmptyAsync sees items in transit

For delayed retries (RetryDelay > 0), the counter is decremented immediately after scheduling the Run.DelayedAsync task, since the item is intentionally parked and the job loop should not spin-wait for it.
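
The bridging counter can be sketched on a simplified model as follows. Only the `_pendingRetryCount` name and its role in the Queued stat come from this PR; `BridgedQueue`, `BridgedDemo`, the `onGap` callback, and the use of `try`/`finally` to cover every exit path are assumptions of this sketch, and only the synchronous retry path is modeled.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical, simplified model of the fix; NOT the Foundatio code.
public sealed class BridgedQueue
{
    private readonly ConcurrentQueue<string> _queue = new();
    private readonly ConcurrentDictionary<string, string> _dequeued = new();
    private int _pendingRetryCount;

    // Items in transit are reported as Queued so an observer never sees zero.
    public int Queued => _queue.Count + Volatile.Read(ref _pendingRetryCount);
    public int Working => _dequeued.Count;

    public void Enqueue(string id) => _queue.Enqueue(id);

    public string Dequeue()
    {
        if (!_queue.TryDequeue(out var id))
            throw new InvalidOperationException("Queue is empty.");
        _dequeued[id] = id;
        return id;
    }

    // onGap (an assumption of this sketch) simulates an observer reading
    // stats between the removal and the re-enqueue.
    public void Abandon(string id, Action onGap)
    {
        Interlocked.Increment(ref _pendingRetryCount); // before TryRemove
        try
        {
            if (!_dequeued.TryRemove(id, out _))
                return;          // unknown entry: nothing to retry
            onGap();             // observer reads stats here
            _queue.Enqueue(id);  // synchronous retry
        }
        finally
        {
            // Decrement only after the item has reached its destination.
            Interlocked.Decrement(ref _pendingRetryCount);
        }
    }
}

public static class BridgedDemo
{
    public static (int SeenInGap, int SeenAfter) Run()
    {
        var q = new BridgedQueue();
        q.Enqueue("a");
        var id = q.Dequeue();

        int seenInGap = -1;
        q.Abandon(id, () => seenInGap = q.Queued + q.Working);

        return (seenInGap, q.Queued + q.Working);
    }
}
```

With the counter in place, the observer in this model sees `Queued + Working == 1` both inside the gap and afterwards. The cost is a brief overcount between `Enqueue` and the decrement, when the item is counted in both `_queue.Count` and `_pendingRetryCount`.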

Test plan

CanRunQueueJobWithLockFailAsync passes 20/20 consecutive runs (previously flaky)
CanAbandonQueueEntryOnceAsync passes (verifies Working == 0 after abandon)
CanRunBadWorkItem passes (verifies delayed retry path doesn't inflate stats)
Full test suite: 1790 passed, 0 failed, 13 skipped

Copilot

Pull request overview

This PR fixes a TOCTOU (time-of-check to time-of-use) race condition in InMemoryQueue.AbandonAsync that was causing RunUntilEmptyAsync to exit prematurely when items were being abandoned for retry. A new _pendingRetryCount counter bridges the gap between when an item is removed from _dequeued (Working count drops) and when it is re-enqueued for retry (Queued count rises), preventing RunUntilEmptyAsync's continuation check (Queued + Working > 0) from seeing a spurious zero.

Changes:

Add _pendingRetryCount field, incremented at the start of AbandonAsync (after the guard check) and decremented at each exit path.
Include _pendingRetryCount in the Queued stat returned by GetMetricsQueueStats, so RunUntilEmptyAsync sees in-transit items.
Reset _pendingRetryCount in DeleteQueueImplAsync; change var unawaited = Run.DelayedAsync(...) to the canonical _ = Run.DelayedAsync(...) discard pattern.

src/Foundatio/Queues/InMemoryQueue.cs

…t prematurely

During AbandonAsync, there is a window between removing an entry from _dequeued and re-enqueueing it for retry where the item exists in neither collection. If RunUntilEmptyAsync checks queue stats during this gap, it sees Queued=0 + Working=0 and terminates the job loop while retryable items are still in flight.

Add a _pendingRetryCount that bridges the gap: incremented before TryRemove, decremented after the item lands in its destination (re-queued, deadlettered, or scheduled for delayed retry). The count is included in the Queued stat so the continuation callback sees items in transit.

For delayed retries (RetryDelay > 0), the counter is decremented immediately after scheduling since the item is intentionally parked and RunUntilEmptyAsync should not spin-wait for it.

Fixes flaky CanRunQueueJobWithLockFailAsync test.

Made-with: Cursor

In the deadletter path, the item moves out of Queued/Working entirely, so decrementing _pendingRetryCount before enqueuing to the deadletter queue avoids a transient overcount where the item appears in both the Queued stat (via _pendingRetryCount) and the Deadletter stat (via _deadletterQueue.Count).

The synchronous retry path intentionally keeps the current ordering (Retry then Decrement) because decrementing first would re-open the race window this PR fixes: between the decrement and _queue.Enqueue inside Retry(), both _pendingRetryCount and _queue.Count would be zero, allowing RunUntilEmptyAsync to exit prematurely.

Made-with: Cursor
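
The ordering argument in the commit message above can be checked with a single-threaded trace of the stats after each step. This is only an illustrative model under stated assumptions: `OrderingModel`, its plain integer counters, and the `Trace` list are hypothetical and stand in for the real collections; only `_pendingRetryCount` and the two orderings come from the commit message.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical trace model of the two orderings; NOT the Foundatio code.
public sealed class OrderingModel
{
    // Plain counters stand in for _queue.Count, _dequeued.Count, and
    // _deadletterQueue.Count. The model starts with one item in Working.
    private int _queueCount;
    private int _workingCount = 1;
    private int _deadletterCount;
    private int _pendingRetryCount;

    public List<(int Queued, int Working, int Deadletter)> Trace { get; } = new();

    // Records the stats an observer would read at this instant.
    private void Snapshot() =>
        Trace.Add((_queueCount + _pendingRetryCount, _workingCount, _deadletterCount));

    // Deadletter path: decrement BEFORE enqueuing to deadletter.
    public void AbandonToDeadletter()
    {
        _pendingRetryCount++;
        _workingCount--;      Snapshot(); // (1, 0, 0)
        _pendingRetryCount--; Snapshot(); // (0, 0, 0)
        _deadletterCount++;   Snapshot(); // (0, 0, 1)
    }

    // Synchronous retry path: re-enqueue first, THEN decrement.
    public void AbandonToRetry()
    {
        _pendingRetryCount++;
        _workingCount--;      Snapshot(); // (1, 0, 0)
        _queueCount++;        Snapshot(); // (2, 0, 0) brief overcount
        _pendingRetryCount--; Snapshot(); // (1, 0, 0)
    }
}
```

In the deadletter trace the item is never reported as both Queued and Deadletter at once. In the retry trace `Queued + Working` never reaches zero, at the cost of one snapshot where the item is counted twice.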

Copilot

Pull request overview

Copilot reviewed 1 out of 3 changed files in this pull request and generated no new comments.

niemyjski requested a review from Copilot March 4, 2026 03:10

niemyjski self-assigned this Mar 4, 2026

niemyjski added the bug label Mar 4, 2026

Copilot started reviewing on behalf of niemyjski March 4, 2026 03:10

Copilot AI reviewed Mar 4, 2026

niemyjski force-pushed the fix/inmemory-queue-abandon-race branch from a244cee to 3be8fc0 March 4, 2026 18:46

niemyjski requested a review from Copilot March 4, 2026 18:46

niemyjski added 2 commits March 4, 2026 12:47

Copilot started reviewing on behalf of niemyjski March 4, 2026 18:47

niemyjski force-pushed the fix/inmemory-queue-abandon-race branch from 3be8fc0 to 5c4f7ac March 4, 2026 18:48

Copilot AI reviewed Mar 4, 2026

niemyjski merged commit bb7146e into main Mar 4, 2026
4 checks passed

niemyjski deleted the fix/inmemory-queue-abandon-race branch March 4, 2026 18:52

niemyjski mentioned this pull request Mar 4, 2026

[DO NOT MERGE] FastCloner as DeepCloner replacement #469

Draft

6 tasks
