PartitionSupervisorCore.RunAsync leaks System.Threading.CancellationTokenSource+Linked1CancellationTokenSource #4208

jahmai-ca · 2023-12-10T23:28:10Z

We are continuously addressing and improving the SDK, if possible, make sure the problem persist in the latest SDK version.

Describe the bug
Instances of System.Threading.CancellationTokenSource+Linked1CancellationTokenSource are leaked by the PartitionSupervisorCore.RunAsync method via calls to CancellationTokenSource.CreateLinkedTokenSource which are stored in the processorCancellation member.
The caller, PartitionControllerCore.ProcessPartitionAsync, has a catch clause that handles FeedRangeGoneException, and calls PartitionControllerCore.HandlePartitionGoneAsync, which eventually loops back around to PartitionControllerCore.ProcessPartitionAsync again, which calls PartitionSupervisorCore.RunAsync, when in turn allocates another CancellationTokenSource, but without cleaning up the previous instance.

To Reproduce
I don't know how our application got into this state, but by the time we captured a dump, there were 6.7 million instances of
System.Threading.CancellationTokenSource+Linked1CancellationTokenSource allocated along the call stack mentioned above.
I suspect the leak was a side effect of some kind of infinite loop from FeedRangeGoneException being thrown over and over again, but I have no idea why this happened or how to reproduce it.

Expected behavior
The SDK doesn't leak memory.

Actual behavior
The SDK leaks System.Threading.CancellationTokenSource+Linked1CancellationTokenSource instances in the extreme.

Environment summary
SDK Version: 3.32 (I know this isn't the latest but there is no change to memory management that I can see in the most recent code).
Windows Server 2022
net 7.0.304

Additional context

The text was updated successfully, but these errors were encountered:

ealsur · 2023-12-11T15:41:30Z

@jahmai-ca Thanks for the report. I'm trying to follow the path you described but I'm having a hard time.

PartitionSupervisorCore has a single CreateLinkedTokenSource which populates the this.processorCancellation and that variable never gets overwritten, it is readonly. It gets disposed when the PartitionSupervisorCore gets disposed.

PartitionControllerCore will catch the FeedRangeGoneException and what it does is handle the split, and release the lease. At this moment, the PartitionSupervisorCore instance is no longer used. The only thing I can think of is that the GC is not disposing it due to some environment condition? Maybe calling Dispose explicitly there would fix it.

PartitionControllerCore does not create any LinkedCancellationToken so the only one that could be in play I believe is the one in the Supervisor.

One clarification: It is not a loop. Once the split is handled, the lease is released. The current or another ChangeFeedProcessor instance can Acquire it and start processing.

Getting continuous FeedRangeGoneException however means something is off, this is meant to signal that a partition has split and normally the resolution is in 1 step. Did you by any chance delete and recreate the monitored container with the processor running?

jahmai-ca · 2023-12-12T23:20:06Z

@ealsur

Not sure we're looking at the same code here...

this.processorCancellation and that variable never gets overwritten, it is readonly

It's not readonly:

azure-cosmos-dotnet-v3/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedManagement/PartitionSupervisorCore.cs

Line 22 in 07aa28e

private CancellationTokenSource processorCancellation;

It can't be because it's being assigned outside the constructor by RunAsync:

azure-cosmos-dotnet-v3/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedManagement/PartitionSupervisorCore.cs

Lines 32 to 36 in 07aa28e

    
           public override async Task RunAsync(CancellationToken shutdownToken) 
        
           { 
        
               await this.observer.OpenAsync(this.lease.CurrentLeaseToken).ConfigureAwait(false); 
        
               this.processorCancellation = CancellationTokenSource.CreateLinkedTokenSource(shutdownToken);

PartitionControllerCore will catch the FeedRangeGoneException and what it does is handle the split, and release the lease.

It also throws it up to the caller:

azure-cosmos-dotnet-v3/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedManagement/PartitionSupervisorCore.cs

Lines 57 to 61 in 07aa28e

    
           catch (FeedRangeGoneException) 
        
           { 
        
               closeReason = ChangeFeedObserverCloseReason.LeaseGone; 
        
               throw; 
        
           }

The caller (PartitionControllerCore.ProcessPartitionAsync) catches it and calls HandlePartitionGoneAsync:

azure-cosmos-dotnet-v3/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedManagement/PartitionControllerCore.cs

Lines 149 to 158 in 07aa28e

    
           private async Task ProcessPartitionAsync(PartitionSupervisor partitionSupervisor, DocumentServiceLease lease) 
        
           { 
        
               try 
        
               { 
        
                   await partitionSupervisor.RunAsync(this.shutdownCts.Token).ConfigureAwait(false); 
        
               } 
        
               catch (FeedRangeGoneException ex) 
        
               { 
        
                   await this.HandlePartitionGoneAsync(lease, ex.LastContinuation).ConfigureAwait(false); 
        
               }

HandlePartitionGoneAsync calls AddOrUpdateLeaseAsync:

azure-cosmos-dotnet-v3/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedManagement/PartitionControllerCore.cs

Lines 172 to 182 in 07aa28e

    
                   private async Task HandlePartitionGoneAsync(DocumentServiceLease lease, string lastContinuationToken) 
        
                   { 
        
                       try 
        
                       { 
        
                           lease.ContinuationToken = lastContinuationToken; 
        
                           (IEnumerable<DocumentServiceLease> addedLeases, bool shouldDeleteGoneLease) = await this.synchronizer.HandlePartitionGoneAsync(lease).ConfigureAwait(false); 
        
                           Task[] addLeaseTasks = addedLeases.Select(l => 
        
                               { 
        
                                   l.Properties = lease.Properties; 
        
                                   return this.AddOrUpdateLeaseAsync(l); 
        
                               }).ToArray();

AddOrUpdateLeaseAsync calls ProcessPartitionAsync.

azure-cosmos-dotnet-v3/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedManagement/PartitionControllerCore.cs

Line 93 in 07aa28e

this.ProcessPartitionAsync(supervisor, lease).LogException();

ProcessPartitionAsync calls PartitionSupervisorCode.RunAsync and we're back at square one.

I might be reading it wrong, but it seems to me that there is definitely a code loop there. Not sure what precise conditions are required for it to go around and around though.

ealsur · 2023-12-13T15:52:40Z

@jahmai-ca You are right, the readonly is only on the renewerCancellation.

The loop whoever what it does is create new PartitionSupervisors right? Let's assume there is 1 lease in the beginning.

LoadLeasesAsync calls AddOrRemoveLeaseAsync once, it creates a Supervisor for that lease.
Calls ProcessPartitionAsync which calls RunAsync on the Supervisor.
RunAsync on the Supervisor creates a linked CT.
At some point, there is a Partition Gone.
Supervisor stops with LeaseGone, RunAsync exits.
Controller calls HandlePartitionGoneAsync.
HandlePartitionGoneAsync calls Synchronizer.HandlePartitionGoneAsync which returns 2 new leases.
AddOrRemoveLeaseAsync is called on the 2 new leases, 2 new Supervisors are created.
Each call calls ProcessPartitionAsync, which calls RunAsync on each Supervisor.
The original HandlePartitoinGoneAsync call from step 6 finishes
The original ProcessPartitionAsync (step 2) calls RemoveLeaseAsync and exits. At this point the Supervisor instance is no longer referenced and will be Garbage Collected, when it does, it will call Dispose and remove the processorCancellation and renewerCancellation.

So yes there is a loop, the loop however all it does is create new Supervisors, the old Supervisor is not used anymore and would eventually be Disposed. RunAsync is not called on the same Supervisor instance again.

This is what I mean that I don't see a leak. The Controller instance remains the same always, it is the Controller who spawns Supervisors and lets them run independently.

The only scenario where the CTs might leak is because the GC is not collecting/disposing the unreferenced Supervisor that are not used anymore?

jahmai-ca · 2023-12-13T22:13:53Z

@ealsur

Thanks for going through that detail, appreciated.

I'm still sifting through the memory dump, and it seems like all the retention is at the top level of PartitionControllerCore, specifically the shutdownCts member.

(The Next reference chain continues on like this for a very long time for this instance).

Counter example, this looks like a normal healthy instance:

Do you think its possible that the shutdownCts is holding strong references to the linked tokens even after the PartitionSupervisorCore instance is disposed / recreated?

Is there anything I should look at in this dump to help figure out what the cause is?

ealsur · 2023-12-14T15:38:27Z

Do you think its possible that the shutdownCts is holding strong references to the linked tokens even after the PartitionSupervisorCore instance is disposed / recreated?

My guess is only if the GC is not Disposing the PartitionSupervisor instances after they go out of scope. CFP has been around for 6+ years and this is the same code in the V2 version: https://github.com/Azure/azure-documentdb-changefeedprocessor-dotnet/blob/master/src/DocumentDB.ChangeFeedProcessor/PartitionManagement/PartitionController.cs

We have never had any reports of this behavior, even on containers with thousands of partitions (hence thousands of supervisors). That's why this is kind of weird. Like I said, we are not explicitly calling Dispose (which I guess is something we could do) but the normal behavior is that Dispose on the PartitionSupervisor is called by GC. Unless in your case, GC is not running?

jahmai-ca · 2023-12-14T23:38:56Z

@ealsur

GC is definitely running. If it wasn't our memory dump would contain many other allocations that our app does.

The GC isn't actually responsible for calling Dispose. It only calls Finalize (which may in turn call Dispose), but I don't see any finalizers defined for the types in question?

We've been using CosmosDB SDK for years too. On reflection this may have happened a few times but it is exceedingly rare. Normally we'd just restart the affected services and the problem would go away for a long time. On this occasion I took the time to capture a memory dump.

ealsur · 2023-12-15T15:56:16Z

What we can do is explicitly call Dispose on the Supervisor, that should at least take care of disposing the CTs no longer in use.

ealsur · 2023-12-15T16:22:41Z

See linked PR

jahmai-ca added the needs-investigation label Dec 10, 2023

microsoft-github-policy-service bot added the customer-reported Issue created by a customer label Dec 10, 2023

ealsur added the needs-more-information label Dec 11, 2023

ealsur removed needs-investigation needs-more-information labels Dec 15, 2023

ealsur self-assigned this Dec 15, 2023

ealsur mentioned this issue Dec 15, 2023

Change Feed Processor: Fixes disposal of unused CancellationTokenSource #4220

Merged

microsoft-github-policy-service bot closed this as completed in #4220 Dec 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PartitionSupervisorCore.RunAsync leaks System.Threading.CancellationTokenSource+Linked1CancellationTokenSource #4208

PartitionSupervisorCore.RunAsync leaks System.Threading.CancellationTokenSource+Linked1CancellationTokenSource #4208

jahmai-ca commented Dec 10, 2023

ealsur commented Dec 11, 2023

jahmai-ca commented Dec 12, 2023

ealsur commented Dec 13, 2023 •

edited

jahmai-ca commented Dec 13, 2023

ealsur commented Dec 14, 2023

jahmai-ca commented Dec 14, 2023 •

edited

ealsur commented Dec 15, 2023

ealsur commented Dec 15, 2023

PartitionSupervisorCore.RunAsync leaks System.Threading.CancellationTokenSource+Linked1CancellationTokenSource #4208

PartitionSupervisorCore.RunAsync leaks System.Threading.CancellationTokenSource+Linked1CancellationTokenSource #4208

Comments

jahmai-ca commented Dec 10, 2023

ealsur commented Dec 11, 2023

jahmai-ca commented Dec 12, 2023

ealsur commented Dec 13, 2023 • edited

jahmai-ca commented Dec 13, 2023

ealsur commented Dec 14, 2023

jahmai-ca commented Dec 14, 2023 • edited

ealsur commented Dec 15, 2023

ealsur commented Dec 15, 2023

ealsur commented Dec 13, 2023 •

edited

jahmai-ca commented Dec 14, 2023 •

edited