Fix race conditions in endpoint Stop and expose lifecycle stopper for advanced scenarios #7747
Merged
andreasohlund
approved these changes
May 11, 2026
DavidBoike
approved these changes
May 11, 2026
Why this fix is needed
While adding support for the ServiceControl disconnected endpoint count acceptance test, we introduced a Func<CancellationToken, Task> into the acceptance testing framework that allows tests to stop an endpoint mid-lifecycle. This revealed two race conditions in RunningEndpointInstance.Stop and an existing bug that causes ObjectDisposedException on stoppingTokenSource.

Bug: ObjectDisposedException when Stop is called with an already-cancelled token
When Stop was called with a pre-cancelled token (e.g. a test timeout), stoppingTokenSource.CancelAsync() ran before the semaphore, and WaitAsync(cancellationToken) threw immediately. The outer finally then disposed stoppingTokenSource even though shutdown never actually happened and status remained Running. A subsequent call to Stop would reach stoppingTokenSource.CancelAsync() on the already-disposed CTS and throw ObjectDisposedException. This is the root cause of failures in the NServiceBus.Metrics.ServiceControl acceptance tests.

Race 1: Log slot scope disposed on the wrong thread
BeginSlotScope was established before entering the semaphore. When two callers (the hosted service and the acceptance testing framework) invoked Stop concurrently, both acquired the same AsyncLocal slot scope. The second caller's scope would overwrite currentSlot, and when the first caller's scope was disposed, it restored the previous (null) value, causing the second caller to lose its logging context.

Race 2: CancellationToken registration disposed on the wrong thread
tokenRegistration was created before the semaphore and disposed in an outer finally. If both callers entered Stop, the registration could fire or be disposed on the wrong thread's AsyncLocal context.

Additional issue: unserialized cancellation signal
stoppingTokenSource.CancelAsync() was called before acquiring the semaphore, meaning both concurrent callers could observe the stopping signal simultaneously and attempt to enter the shutdown path.

What changed
RunningEndpointInstance.Stop
- Moved BeginSlotScope, tokenRegistration, CancelAsync, and the inner shutdown logic inside the semaphore-protected region, after status = Stopping, so only the owning thread has scope and registration
- Uses await using for tokenRegistration so it disposes before the inner finally tears down the service provider
- Moved stoppingTokenSource.Dispose() into the inner finally block next to status = Stopped, so it is only disposed when shutdown actually completes
- If WaitAsync throws due to cancellation, neither CancelAsync nor Dispose is called on stoppingTokenSource, leaving it intact for a subsequent call
- Cleanup order: UnregisterSlot, serviceProviderLease.DisposeAsync(), stoppingTokenSource.Dispose()

BaseEndpointLifecycle
- Semaphore usage now covers Start and Stop to prevent races between the hosted service's StopAsync and DisposeAsync
- Renamed createSemaphore to lifeCycleSemaphore to reflect it now guards all lifecycle transitions

EndpointBehavior (acceptance testing)
- Registers a Func<CancellationToken, Task> singleton with key "Stopper" that resolves IEndpointLifecycle and calls Stop on it. This is a hidden backdoor intentionally not exposed in Core. It exists solely for advanced acceptance testing scenarios where a test needs to stop an endpoint and then continue asserting, which the standard hosted service lifecycle doesn't support. The StartableEndpointInstance.Start method dogfoods this stopper function instead of calling lifecycle methods directly.

Simplified usage
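As background for the usage notes, the keyed stopper can be pictured with plain Microsoft.Extensions.DependencyInjection keyed services (version 8 or later is assumed to be referenced). This is a minimal sketch, not the code from this PR: the IEndpointLifecycle interface below is a stand-in declared only for illustration, FakeLifecycle and AddStopper are hypothetical names, and the KeyedServiceKey composition used for resolving from outside the endpoint is not modeled.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;

// Stand-in for the real lifecycle abstraction; declared here only for illustration.
public interface IEndpointLifecycle
{
    Task Stop(CancellationToken cancellationToken = default);
}

// Hypothetical lifecycle that only counts how often Stop was invoked.
public sealed class FakeLifecycle : IEndpointLifecycle
{
    public int StopCalls;

    public Task Stop(CancellationToken cancellationToken = default)
    {
        Interlocked.Increment(ref StopCalls);
        return Task.CompletedTask;
    }
}

public static class StopperRegistration
{
    public const string StopperKey = "Stopper";

    // Registers a keyed delegate that resolves the lifecycle lazily and calls Stop on it.
    public static IServiceCollection AddStopper(this IServiceCollection services) =>
        services.AddKeyedSingleton<Func<CancellationToken, Task>>(
            StopperKey,
            (provider, _) => ct => provider.GetRequiredService<IEndpointLifecycle>().Stop(ct));
}
```

Under these assumptions, a caller would obtain the delegate with provider.GetRequiredKeyedService<Func<CancellationToken, Task>>(StopperRegistration.StopperKey) and await it. Because the delegate resolves the lifecycle only when invoked, the registration does not force the lifecycle to exist at container build time.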
With the stopper now available as a keyed service, the ServiceControl test pattern can be simplified. Instead of using ToCreateInstance with manually wired callbacks just to capture the stop function, the test can use WithServiceResolve after start to acquire the stopper via KeyedServiceKey, removing the need for ToCreateInstance entirely. This follows the same pattern as When_resolving_endpoint_specific_keyed_service_globally. The stopper is then used in the Done callback.

This approach has several advantages over the ToCreateInstance pattern:
- Stopping goes through IEndpointLifecycle.Stop, which properly serializes shutdown
- The KeyedServiceKey composition follows the established keyed DI conventions
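
What "properly serializes shutdown" means can be illustrated with a small sketch of the ordering described under What changed. It is not the actual RunningEndpointInstance code: SketchEndpoint, EndpointStatus, and ShutdownRuns are hypothetical names, the real shutdown work is replaced by a placeholder, and only BCL types (SemaphoreSlim, CancellationTokenSource, with CancelAsync requiring .NET 8 or later) are used.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum EndpointStatus { Running, Stopping, Stopped }

// Hypothetical stand-in for the endpoint instance; only the ordering matters here.
public sealed class SketchEndpoint
{
    readonly SemaphoreSlim lifeCycleSemaphore = new(1, 1);
    readonly CancellationTokenSource stoppingTokenSource = new();

    public EndpointStatus Status { get; private set; } = EndpointStatus.Running;
    public int ShutdownRuns { get; private set; }

    public async Task Stop(CancellationToken cancellationToken = default)
    {
        // If this throws (for example, a pre-cancelled token), nothing below has run:
        // the token source is neither cancelled nor disposed, so a later call still works.
        await lifeCycleSemaphore.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            if (Status != EndpointStatus.Running)
            {
                return; // another caller already performed the shutdown
            }

            Status = EndpointStatus.Stopping;
            try
            {
                // The stopping signal is raised only by the caller that owns the shutdown.
                await stoppingTokenSource.CancelAsync().ConfigureAwait(false);
                ShutdownRuns++;
                await Task.Yield(); // placeholder for the real shutdown work
            }
            finally
            {
                // Disposal happens only once the shutdown path was actually entered.
                stoppingTokenSource.Dispose();
                Status = EndpointStatus.Stopped;
            }
        }
        finally
        {
            lifeCycleSemaphore.Release();
        }
    }
}
```

In this sketch, a call with an already-cancelled token fails at WaitAsync and leaves Status at Running, and two concurrent callers result in a single shutdown run because the second caller sees a non-Running status once it enters the protected region.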