Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PartitionSupervisorCore.RunAsync leaks System.Threading.CancellationTokenSource+Linked1CancellationTokenSource #4208

Closed
jahmai-ca opened this issue Dec 10, 2023 · 8 comments · Fixed by #4220
Assignees
Labels
customer-reported Issue created by a customer

Comments

@jahmai-ca
Copy link

We are continuously addressing and improving the SDK, if possible, make sure the problem persist in the latest SDK version.

Describe the bug
Instances of System.Threading.CancellationTokenSource+Linked1CancellationTokenSource are leaked by the PartitionSupervisorCore.RunAsync method via calls to CancellationTokenSource.CreateLinkedTokenSource which are stored in the processorCancellation member.
The caller, PartitionControllerCore.ProcessPartitionAsync, has a catch clause that handles FeedRangeGoneException, and calls PartitionControllerCore.HandlePartitionGoneAsync, which eventually loops back around to PartitionControllerCore.ProcessPartitionAsync again, which calls PartitionSupervisorCore.RunAsync, when in turn allocates another CancellationTokenSource, but without cleaning up the previous instance.

To Reproduce
I don't know how our application got into this state, but by the time we captured a dump, there were 6.7 million instances of
System.Threading.CancellationTokenSource+Linked1CancellationTokenSource allocated along the call stack mentioned above.
I suspect the leak was a side effect of some kind of infinite loop from FeedRangeGoneException being thrown over and over again, but I have no idea why this happened or how to reproduce it.

Expected behavior
The SDK doesn't leak memory.

Actual behavior
The SDK leaks System.Threading.CancellationTokenSource+Linked1CancellationTokenSource instances in the extreme.

Environment summary
SDK Version: 3.32 (I know this isn't the latest but there is no change to memory management that I can see in the most recent code).
Windows Server 2022
net 7.0.304

Additional context
image

@ealsur
Copy link
Member

ealsur commented Dec 11, 2023

@jahmai-ca Thanks for the report. I'm trying to follow the path you described but I'm having a hard time.

PartitionSupervisorCore has a single CreateLinkedTokenSource which populates the this.processorCancellation and that variable never gets overwritten, it is readonly. It gets disposed when the PartitionSupervisorCore gets disposed.

PartitionControllerCore will catch the FeedRangeGoneException and what it does is handle the split, and release the lease. At this moment, the PartitionSupervisorCore instance is no longer used. The only thing I can think of is that the GC is not disposing it due to some environment condition? Maybe calling Dispose explicitly there would fix it.

PartitionControllerCore does not create any LinkedCancellationToken so the only one that could be in play I believe is the one in the Supervisor.

One clarification: It is not a loop. Once the split is handled, the lease is released. The current or another ChangeFeedProcessor instance can Acquire it and start processing.

Getting continuous FeedRangeGoneException however means something is off, this is meant to signal that a partition has split and normally the resolution is in 1 step. Did you by any chance delete and recreate the monitored container with the processor running?

@jahmai-ca
Copy link
Author

@ealsur

Not sure we're looking at the same code here...

this.processorCancellation and that variable never gets overwritten, it is readonly

It's not readonly:


It can't be because it's being assigned outside the constructor by RunAsync:
public override async Task RunAsync(CancellationToken shutdownToken)
{
await this.observer.OpenAsync(this.lease.CurrentLeaseToken).ConfigureAwait(false);
this.processorCancellation = CancellationTokenSource.CreateLinkedTokenSource(shutdownToken);

PartitionControllerCore will catch the FeedRangeGoneException and what it does is handle the split, and release the lease.

It also throws it up to the caller:

catch (FeedRangeGoneException)
{
closeReason = ChangeFeedObserverCloseReason.LeaseGone;
throw;
}

The caller (PartitionControllerCore.ProcessPartitionAsync) catches it and calls HandlePartitionGoneAsync:

private async Task ProcessPartitionAsync(PartitionSupervisor partitionSupervisor, DocumentServiceLease lease)
{
try
{
await partitionSupervisor.RunAsync(this.shutdownCts.Token).ConfigureAwait(false);
}
catch (FeedRangeGoneException ex)
{
await this.HandlePartitionGoneAsync(lease, ex.LastContinuation).ConfigureAwait(false);
}

HandlePartitionGoneAsync calls AddOrUpdateLeaseAsync:

private async Task HandlePartitionGoneAsync(DocumentServiceLease lease, string lastContinuationToken)
{
try
{
lease.ContinuationToken = lastContinuationToken;
(IEnumerable<DocumentServiceLease> addedLeases, bool shouldDeleteGoneLease) = await this.synchronizer.HandlePartitionGoneAsync(lease).ConfigureAwait(false);
Task[] addLeaseTasks = addedLeases.Select(l =>
{
l.Properties = lease.Properties;
return this.AddOrUpdateLeaseAsync(l);
}).ToArray();

AddOrUpdateLeaseAsync calls ProcessPartitionAsync.

ProcessPartitionAsync calls PartitionSupervisorCode.RunAsync and we're back at square one.

I might be reading it wrong, but it seems to me that there is definitely a code loop there. Not sure what precise conditions are required for it to go around and around though.

@ealsur
Copy link
Member

ealsur commented Dec 13, 2023

@jahmai-ca You are right, the readonly is only on the renewerCancellation.

The loop whoever what it does is create new PartitionSupervisors right? Let's assume there is 1 lease in the beginning.

  1. LoadLeasesAsync calls AddOrRemoveLeaseAsync once, it creates a Supervisor for that lease.
  2. Calls ProcessPartitionAsync which calls RunAsync on the Supervisor.
  3. RunAsync on the Supervisor creates a linked CT.
  4. At some point, there is a Partition Gone.
  5. Supervisor stops with LeaseGone, RunAsync exits.
  6. Controller calls HandlePartitionGoneAsync.
  7. HandlePartitionGoneAsync calls Synchronizer.HandlePartitionGoneAsync which returns 2 new leases.
  8. AddOrRemoveLeaseAsync is called on the 2 new leases, 2 new Supervisors are created.
  9. Each call calls ProcessPartitionAsync, which calls RunAsync on each Supervisor.
  10. The original HandlePartitoinGoneAsync call from step 6 finishes
  11. The original ProcessPartitionAsync (step 2) calls RemoveLeaseAsync and exits. At this point the Supervisor instance is no longer referenced and will be Garbage Collected, when it does, it will call Dispose and remove the processorCancellation and renewerCancellation.

So yes there is a loop, the loop however all it does is create new Supervisors, the old Supervisor is not used anymore and would eventually be Disposed. RunAsync is not called on the same Supervisor instance again.

This is what I mean that I don't see a leak. The Controller instance remains the same always, it is the Controller who spawns Supervisors and lets them run independently.

The only scenario where the CTs might leak is because the GC is not collecting/disposing the unreferenced Supervisor that are not used anymore?

@jahmai-ca
Copy link
Author

@ealsur

Thanks for going through that detail, appreciated.

I'm still sifting through the memory dump, and it seems like all the retention is at the top level of PartitionControllerCore, specifically the shutdownCts member.

image
(The Next reference chain continues on like this for a very long time for this instance).

Counter example, this looks like a normal healthy instance:
image

Do you think its possible that the shutdownCts is holding strong references to the linked tokens even after the PartitionSupervisorCore instance is disposed / recreated?

Is there anything I should look at in this dump to help figure out what the cause is?

@ealsur
Copy link
Member

ealsur commented Dec 14, 2023

Do you think its possible that the shutdownCts is holding strong references to the linked tokens even after the PartitionSupervisorCore instance is disposed / recreated?

My guess is only if the GC is not Disposing the PartitionSupervisor instances after they go out of scope. CFP has been around for 6+ years and this is the same code in the V2 version: https://github.com/Azure/azure-documentdb-changefeedprocessor-dotnet/blob/master/src/DocumentDB.ChangeFeedProcessor/PartitionManagement/PartitionController.cs

We have never had any reports of this behavior, even on containers with thousands of partitions (hence thousands of supervisors). That's why this is kind of weird. Like I said, we are not explicitly calling Dispose (which I guess is something we could do) but the normal behavior is that Dispose on the PartitionSupervisor is called by GC. Unless in your case, GC is not running?

@jahmai-ca
Copy link
Author

jahmai-ca commented Dec 14, 2023

@ealsur

GC is definitely running. If it wasn't our memory dump would contain many other allocations that our app does.

The GC isn't actually responsible for calling Dispose. It only calls Finalize (which may in turn call Dispose), but I don't see any finalizers defined for the types in question?

We've been using CosmosDB SDK for years too. On reflection this may have happened a few times but it is exceedingly rare. Normally we'd just restart the affected services and the problem would go away for a long time. On this occasion I took the time to capture a memory dump.

@ealsur
Copy link
Member

ealsur commented Dec 15, 2023

What we can do is explicitly call Dispose on the Supervisor, that should at least take care of disposing the CTs no longer in use.

@ealsur
Copy link
Member

ealsur commented Dec 15, 2023

See linked PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
customer-reported Issue created by a customer
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants