Skip to content

Fix race conditions in endpoint Stop and expose lifecycle stopper for advanced scenarios#7747

Merged
danielmarbach merged 4 commits into
masterfrom
improve-lifecycle
May 11, 2026
Merged

Fix race conditions in endpoint Stop and expose lifecycle stopper for advanced scenarios#7747
danielmarbach merged 4 commits into
masterfrom
improve-lifecycle

Conversation

@danielmarbach
Copy link
Copy Markdown
Contributor

@danielmarbach danielmarbach commented May 10, 2026

Why this fix is needed

While adding support for the ServiceControl disconnected endpoint count acceptance test, we introduced a Func<CancellationToken, Task> into the acceptance testing framework that allows tests to stop an endpoint mid-lifecycle. This revealed two race conditions in RunningEndpointInstance.Stop and an existing bug that causes ObjectDisposedException on stoppingTokenSource.

Bug: ObjectDisposedException when Stop is called with an already-cancelled token

When Stop is called with a pre-cancelled token (e.g. test timeout), stoppingTokenSource.CancelAsync() ran before the semaphore, and WaitAsync(cancellationToken) threw immediately. The outer finally then disposed stoppingTokenSource even though shutdown never actually happened and status remained Running. A subsequent call to Stop would reach stoppingTokenSource.CancelAsync() on the already-disposed CTS and throw ObjectDisposedException. This is the root cause of failures in NServiceBus.Metrics.ServiceControl acceptance tests.

Race 1: Log slot scope disposed on the wrong thread

BeginSlotScope was established before entering the semaphore. When two callers (the hosted service and the acceptance testing framework) invoked Stop concurrently, both acquired the same AsyncLocal slot scope. The second caller's scope would overwrite currentSlot, and when the first caller's scope disposed, it restored the previous (null) value, causing the second caller to lose its logging context.

Race 2: CancellationToken registration disposed on the wrong thread

tokenRegistration was created before the semaphore and disposed in an outer finally. If both callers entered Stop, the registration could fire or be disposed on the wrong thread's AsyncLocal context.

Additional issue: unserialized cancellation signal

stoppingTokenSource.CancelAsync() was called before acquiring the semaphore, meaning both concurrent callers could observe the stopping signal simultaneously and attempt to enter the shutdown path.

What changed

RunningEndpointInstance.Stop

  • Moved BeginSlotScope, tokenRegistration, CancelAsync, and the inner shutdown logic inside the semaphore-protected region, after status = Stopping, so only the owning thread has scope and registration
  • Used await using for tokenRegistration so it disposes before the inner finally tears down the service provider
  • Moved stoppingTokenSource.Dispose() into the inner finally block next to status = Stopped, so it is only disposed when shutdown actually completes
  • If WaitAsync throws due to cancellation, neither CancelAsync nor Dispose are called on stoppingTokenSource, leaving it intact for a subsequent call
  • Disposal order is now: token registration, slot scope, UnregisterSlot, serviceProviderLease.DisposeAsync(), stoppingTokenSource.Dispose()

BaseEndpointLifecycle

  • Added semaphore-protected double-checked locking to Start and Stop to prevent races between the hosted service's StopAsync and DisposeAsync
  • Renamed createSemaphore to lifeCycleSemaphore to reflect it now guards all lifecycle transitions

EndpointBehavior (acceptance testing)

  • Added a keyed Func<CancellationToken, Task> singleton with key "Stopper" that resolves IEndpointLifecycle and calls Stop on it. This is a hidden backdoor intentionally not exposed in Core. It exists solely for advanced acceptance testing scenarios where a test needs to stop an endpoint and then continue asserting, which the standard hosted service lifecycle doesn't support. The StartableEndpointInstance.Start method dogfoods this stopper function instead of calling lifecycle methods directly.

Simplified usage

With the stopper now available as a keyed service, the ServiceControl test pattern can be simplified. Instead of using ToCreateInstance with manually wired callbacks just to capture the stop function:

.WithEndpoint<MonitoredEndpoint>(b => b
    .ToCreateInstance(
        (services, configuration) => EndpointWithExternallyManagedContainer.Create(configuration, services),
        async (startableEndpoint, provider, ct) =>
        {
            context.FirstInstance = await startableEndpoint.Start(provider, ct);
            return context.FirstInstance;
        }))

The test can use WithServiceResolve after start to acquire the stopper via KeyedServiceKey, removing the need for ToCreateInstance entirely. This follows the same pattern as When_resolving_endpoint_specific_keyed_service_globally:

.WithEndpoint<MonitoredEndpoint>(b => b
    .CustomConfig(c => c.EnableMetrics()
        .SendMetricDataToServiceControl(Settings.DEFAULT_INSTANCE_NAME, TimeSpan.FromMilliseconds(200), "First"))
    .WithServiceResolve((provider, context, _) =>
    {
        var endpointName = Conventions.EndpointNamingConvention(typeof(MonitoredEndpoint));
        context.Stopper = provider.GetRequiredKeyedService<Func<CancellationToken, Task>>(
            new KeyedServiceKey($"{endpointName}0", "Stopper"));
        return Task.CompletedTask;
    }, afterStart: true))

Then in the Done callback:

await context.Stopper(CancellationToken.None);
context.StoppedFirstInstance = true;

This approach has several advantages over the ToCreateInstance pattern:

  • No need to manually wire endpoint creation and start callbacks
  • The stopper goes through IEndpointLifecycle.Stop, which properly serializes shutdown
  • The KeyedServiceKey composition follows the established keyed DI conventions

@danielmarbach danielmarbach changed the title Improve lifecycle Fix race conditions in endpoint Stop and expose lifecycle stopper for advanced scenarios May 10, 2026
@danielmarbach danielmarbach marked this pull request as ready for review May 10, 2026 13:47
@danielmarbach
Copy link
Copy Markdown
Contributor Author

@danielmarbach danielmarbach merged commit 5786654 into master May 11, 2026
4 checks passed
@danielmarbach danielmarbach deleted the improve-lifecycle branch May 11, 2026 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants