
OutOfMemory exceptions and function execution stops at 2GB memory #1003

Closed
lukasvosyka opened this issue Oct 30, 2019 · 11 comments

Comments

@lukasvosyka

lukasvosyka commented Oct 30, 2019

I experienced an OutOfMemoryException while working with the fantastic Durable Functions.

In the end I was able to reproduce the issue (not with an OutOfMemoryException; processing simply stops, and it always stops at around 2 GB of memory consumption) with essentially the approach I am trying to implement. The OutOfMemoryException comes into play when using the MongoDB driver (I omitted that part in the repro), and it happens at the same memory consumption level. Maybe I am doing something wrong, or Durable Functions is not the right tool for the job I have in mind.

I am attaching a repro for my issue; I see there are other issues out there that might be related:

#340 #687 #412

I am running locally on runtime 2.0.12763.0

Nothing really fancy going on here. What I am trying to solve is a kind of data import across many entities, each having 500k items to process. So what I want to achieve is (abstractly):

orchestrator
{
   entities = getEntities()
   foreach ( entity in entities ) {
       ids = getEntityIds( entity )
       processEntity( entity, ids )
   }
}

processEntity ( entity, ids )
{
    foreach ( batch in ids.Batch( 10000 ) )
    {
         processBatch( entity, batch )
    }
}

processBatch ( entity, batch ) 
{
    await callDb( entity, batch )
} 

I would really like to solve this with Durable Functions, because otherwise I would have to go the Service Bus / Event Hubs route, and I think Durable Functions can substitute for that, which is cool :)

DurableFunctionOutOfMemory.zip

@cgillum I hope this helps, or perhaps you can give me a hint about what I am doing wrong?

@ghost ghost added the Needs: Triage 🔍 label Oct 30, 2019
@ConnorMcMahon
Contributor

Make sure to set your application to use a 64-bit process. If my memory serves correctly, 32-bit processes are limited to 2 GB of memory per application.
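
As a quick sanity check, here is a minimal sketch (the CheckBitness function name is hypothetical) that logs the worker's bitness using the standard Environment.Is64BitProcess API, so you can confirm which mode the host is actually running in:

[FunctionName("CheckBitness")]
public string CheckBitness([ActivityTrigger] string input, ILogger log)
{
    // Environment.Is64BitProcess reports whether the current process is 64-bit.
    var bitness = Environment.Is64BitProcess ? "64-bit" : "32-bit";
    log.LogInformation("Functions worker is running as a {Bitness} process", bitness);
    return bitness;
}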

@ConnorMcMahon ConnorMcMahon added Needs: Author Feedback Waiting for the author of the issue to respond to a question and removed Needs: Triage 🔍 labels Oct 30, 2019
@lukasvosyka
Author

Hi @ConnorMcMahon ,

thanks for the link. I haven't tried running the Azure Function in Azure directly yet; instead I am trying to run it locally in VS2017 with the debugger. I tried switching to x64, but quickly realized that it does not seem to work properly (Could not load file or assembly 'DurableFunctionOutOfMemory, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null').

As far as I can see, there is a pending issue about this: Azure/azure-functions-vs-build-sdk#131

Now, what can I do? And is this high memory usage expected with Durable Functions?

@ghost ghost added Needs: Attention 👋 and removed Needs: Author Feedback Waiting for the author of the issue to respond to a question labels Oct 31, 2019
@ConnorMcMahon
Contributor

Hmm, that issue is unfortunate.

High memory usage can be an issue depending on your usage patterns. I don't know how large your data is, or how many of your activities/orchestrations run concurrently, so it is difficult to say whether it is expected for your scenario.

In general, we cannot get around the 2 GB limit for 32-bit processes. There are some settings you can tweak to mitigate the memory usage of the extension, but it sounds more like your workload is what is driving the memory usage.

Until that issue is resolved, you may need to test locally with smaller workloads, and enable 64-bit mode when running in Azure.
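
One way to keep local test runs small is to make the workload size configurable; a sketch against the repro's SomeIdGenerator, where the "IdCountOverride" setting is hypothetical and not part of the original repro:

public class SomeIdGenerator
{
    // Reads a hypothetical "IdCountOverride" app setting; falls back to the full
    // 500k ids when it is not set (e.g. when running in Azure).
    private static int GetConfiguredIdCount()
    {
        var raw = Environment.GetEnvironmentVariable("IdCountOverride");
        return int.TryParse(raw, out var n) && n > 0 ? n : 500000;
    }

    public Task<List<string>> GetIds(string entity)
    {
        var idCount = GetConfiguredIdCount();
        return Task.FromResult(
            Enumerable.Range(1, idCount)
                      .Select(x => $"{entity}_{x.ToString("00000000000000")}")
                      .ToList());
    }
}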

@lukasvosyka
Author

lukasvosyka commented Oct 31, 2019

Hmm, maybe you can have a look at the repro I posted; it is essentially a 100-LOC repro and does nothing special. As I said, I am wondering whether Azure Durable Functions is suitable for this kind of work.

I believe that no garbage collection is happening during Durable Functions execution, hence the rise in memory consumption. But maybe I am wrong and this is by design? I could not find any helpful information about this kind of scenario in the context of Durable Functions.

The docs always cover super simple things like printing out city names, but I am wondering about real-world scenarios. This would be one :-)

@ConnorMcMahon
Contributor

@lukasvosyka

I apologize for the delay here.

Durable Functions in general does a lot of prefetching and caching in memory to improve performance. If you are working with larger chunks of data, it is very possible to run out of memory, because we may be storing many of those chunks in memory even when we are not actively executing the method itself.

Thankfully, we do have some knobs that heavily impact memory usage, at the potential cost of throughput and/or latency. Many of them are documented here.

I would take a look at the following:

  • controlQueueBatchSize
  • controlQueueBufferThreshold
  • controlQueueVisibilityTimeout
  • workItemQueueVisibilityTimeout
  • partitionCount (This only helps if you are scaling out to more workers than your current partition count)
  • maxConcurrentActivityFunctions
  • maxConcurrentOrchestratorFunctions

I realize that this is a lot of knobs to play with, but each of them has some impact on memory usage. I would probably start with the maxConcurrent*Functions settings, as those will likely have the largest impact (a host.json sketch follows below). Note that lowering them will also substantially impact your throughput (and likely latency). That being said, if you are running on a Consumption/Premium plan, we should scale you out sufficiently if latency starts increasing too much.
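
For illustration only (the values below are illustrative, not recommendations), lowering the concurrency throttles and queue buffering in host.json might look roughly like this:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 2,
      "maxConcurrentOrchestratorFunctions": 2,
      "controlQueueBatchSize": 8,
      "controlQueueBufferThreshold": 32,
      "partitionCount": 4
    }
  }
}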

@ConnorMcMahon ConnorMcMahon added the Needs: Author Feedback Waiting for the author of the issue to respond to a question label Feb 19, 2020
@lukasvosyka
Author

Hi @ConnorMcMahon ,

I think (though I haven't tested it yet) that the repro would run just fine when deployed to Azure. But when I run it locally (for example to debug issues), it doesn't, and it stops working after it hits 2 GB of memory usage. None of the above-mentioned knobs helped. So again the question: is Durable Functions suitable/meant/intended for this kind of work? It seems to me that the given use case is not a good fit for DF.

@ghost ghost removed the Needs: Author Feedback Waiting for the author of the issue to respond to a question label Feb 23, 2020
@anthonychu
Member

@lukasvosyka How many entities does getEntities return and approximately how big is each? Same thing for getEntityIds.

@lukasvosyka
Author

@anthonychu there is a repro attached to my initial post that exactly reproduces the problem.

To answer your question:

  • getEntities returns a list of 7 strings, each 7 characters long.
  • getEntityIds returns a list of 500,000 strings, each 22 characters long.
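
For rough scale: 500,000 ids × 22 characters ≈ 11 million characters, i.e. roughly 22 MB as UTF-16 .NET strings before any per-object overhead; and since those lists are serialized into activity inputs/outputs and the orchestration history, each entity's id list can plausibly cost tens of megabytes several times over.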

@lukasvosyka
Author

Maybe it is easier to reference things if I post the repro here inline.

public class RebuildProductsFunction
{
	private readonly ILogger _logger;
	private readonly SomeService _service;
	private readonly SomeIdGenerator _idGen;
	private readonly List<string> _someEntities = new List<string>
	{
		"entity1",
		"entity2",
		"entity3",
		"entity4",
		"entity5",
		"entity6",
		"entity7"
	};

	public RebuildProductsFunction(
		ILogger<RebuildProductsFunction> logger,
		SomeService service,
		SomeIdGenerator idGen)
	{
		_logger = logger;
		_service = service;
		_idGen = idGen;
	}


	[FunctionName(nameof(DurableHttpEntry))]
	public async Task<HttpResponseMessage> DurableHttpEntry(
		[HttpTrigger(AuthorizationLevel.Function, "post", Route = "entry")] HttpRequestMessage req,
		[OrchestrationClient] DurableOrchestrationClient client)
	{
		var instanceId = await client.StartNewAsync(nameof(ExecuteEntities), null);
		_logger.LogDebug("Starting DurableHttpEntry");
		return await client.WaitForCompletionOrCreateCheckStatusResponseAsync(req, instanceId);
	}

	[FunctionName(nameof(ExecuteEntities))]
	public async Task<Dictionary<string, long>> ExecuteEntities(
		[OrchestrationTrigger] DurableOrchestrationContext context)
	{
		var writtenEntries = new Dictionary<string, long>();
		long count = 0;

		// get all entities that need to be processed. this can be anything, like a store or similar..
		var entities = await context.CallActivityAsync<List<string>>(nameof(ActivityGetEntities), string.Empty);
		var idx = 0;
		var total = entities.Count;
		foreach (string entity in entities)
		{
			_logger.LogDebug("Starting ExecuteEntities for {entity}", entity);
			var result = await context.CallSubOrchestratorAsync<long>(nameof(ExecuteEntity), entity);
			idx++;
			writtenEntries[entity] = result;
			context.SetCustomStatus((idx * 1.0m / total).ToString("P"));
			count += result;
		}

		return writtenEntries;
	}

	[FunctionName(nameof(ExecuteEntity))]
	public async Task<long> ExecuteEntity(
		[OrchestrationTrigger] DurableOrchestrationContext context)
	{
		var model = context.GetInput<string>();
		// each entity has 500k items, and they have to be processed...
		var items = await context.CallActivityAsync<List<string>>(nameof(ActivityGetEntityIds), model);
		long count = 0;
		int idx = 0;
		// processing happens in batches, because all 500k items would take too long for a function lifetime..
		foreach (var batch in items.Batch(10000))
		{
			_logger.LogDebug("Starting ExecuteEntity for {entity} index: {idx}", model, idx);
			
			var result = await context.CallSubOrchestratorAsync<long>(nameof(ExecuteOnEntity), new ExecuteEntityModel
			{
				Entity = model,
				Ids = batch.ToList(),
				Index = idx
			});
			idx++;
			count += result;
		}
		return count;
	}

	[FunctionName(nameof(ExecuteOnEntity))]
	public async Task<long> ExecuteOnEntity(
		[OrchestrationTrigger] DurableOrchestrationContext context)
	{
		var model = context.GetInput<ExecuteEntityModel>();
		_logger.LogDebug("Starting ExecuteOnEntity for {entity} with item count: {count}", model.Entity, model.Ids.Count);
		var result = await context.CallActivityAsync<long>(nameof(ActivityExecuteOnEntity), model);
		return result;
	}

	[FunctionName(nameof(ActivityExecuteOnEntity))]
	public async Task<long> ActivityExecuteOnEntity([ActivityTrigger] ExecuteEntityModel model)
	{
		_logger.LogDebug("Activity for {entity}, Count: {count}, Index: {idx}", model.Entity, model.Ids.Count, model.Index);
		var affectedIds = await _service.Process(model.Entity, model.Ids);
		return affectedIds;
	}

	[FunctionName(nameof(ActivityGetEntities))]
	public Task<List<string>> ActivityGetEntities([ActivityTrigger] string model)
	{
		return Task.FromResult(_someEntities);
	}


	[FunctionName(nameof(ActivityGetEntityIds))]
	public async Task<List<string>> ActivityGetEntityIds([ActivityTrigger] string model)
	{
		return await _idGen.GetIds(model);
	}
}

public class SomeIdGenerator
{
	public Task<List<string>> GetIds(string entity)
	{
		return Task.FromResult(Enumerable.Range(1, 500000).Select(x => $"{entity}_{x.ToString("00000000000000")}").ToList());
	}
}

public class SomeService
{
	public async Task<long> Process(string entity, IEnumerable<string> ids)
	{
		await Task.Delay(500);
		return ids.Count();
	}
}


public class ExecuteEntityModel
{
	public string Entity { get; set; }
	public List<string> Ids { get; set; }
	public int Index { get; set; }
}

@lukasvosyka
Author

In case the question comes up as to why I was using sub-orchestrations :-) I was just experimenting.
The same behavior as described above is also observable when using one orchestrator triggering just activities, as in this simplified repro:

public class RebuildProductsFunction
{
   private readonly ILogger _logger;
   private readonly SomeService _service;
   private readonly SomeIdGenerator _idGen;
   private readonly List<string> _someEntities = new List<string>
   {
      "entity1",
      "entity2",
      "entity3",
      "entity4",
      "entity5",
      "entity6",
      "entity7"
   };

   public RebuildProductsFunction(
      ILogger<RebuildProductsFunction> logger,
      SomeService service,
      SomeIdGenerator idGen)
   {
      _logger = logger;
      _service = service;
      _idGen = idGen;
   }


   [FunctionName(nameof(DurableHttpEntry))]
   public async Task<HttpResponseMessage> DurableHttpEntry(
      [HttpTrigger(AuthorizationLevel.Function, "post", Route = "entry")] HttpRequestMessage req,
      [OrchestrationClient] DurableOrchestrationClient client)
   {
      var instanceId = await client.StartNewAsync(nameof(ExecuteEntities), null);
      _logger.LogDebug("Starting DurableHttpEntry");
      return await client.WaitForCompletionOrCreateCheckStatusResponseAsync(req, instanceId);
   }

   [FunctionName(nameof(ExecuteEntities))]
   public async Task<Dictionary<string, long>> ExecuteEntities(
      [OrchestrationTrigger] DurableOrchestrationContext context)
   {
      var writtenEntries = new Dictionary<string, long>();
      long count = 0;

      // get all entities that need to be processed. this can be anything, like a store or similar..
      var entities = await context.CallActivityAsync<List<string>>(nameof(ActivityGetEntities), string.Empty);
      var idx = 0;
      var total = entities.Count;
      foreach (string entity in entities)
      {
         _logger.LogDebug("Starting ExecuteEntities for {entity}", entity);

         var items = await context.CallActivityAsync<List<string>>(nameof(ActivityGetEntityIds), entity);
         
         // processing happens in batches, because all 500k items would take too long for a function lifetime..
         foreach (var batch in items.Batch(10000))
         {
            _logger.LogDebug("Starting ExecuteEntity for {entity} index: {idx}", entity, idx);

            var result = await context.CallActivityAsync<long>(nameof(ActivityExecuteOnEntity), new ExecuteEntityModel
            {
               Entity = entity,
               Ids = batch.ToList(),
               Index = idx
            });
            idx++;
            count += result;
         }

         // cumulative count up to and including this entity (so the returned dictionary is not empty)
         writtenEntries[entity] = count;
      }

      return writtenEntries;
   }

   [FunctionName(nameof(ActivityExecuteOnEntity))]
   public async Task<long> ActivityExecuteOnEntity([ActivityTrigger] ExecuteEntityModel model)
   {
      _logger.LogDebug("Activity for {entity}, Count: {count}, Index: {idx}", model.Entity, model.Ids.Count, model.Index);
      var affectedIds = await _service.Process(model.Entity, model.Ids);
      return affectedIds;
   }

   [FunctionName(nameof(ActivityGetEntities))]
   public Task<List<string>> ActivityGetEntities([ActivityTrigger] string model)
   {
      return Task.FromResult(_someEntities);
   }


   [FunctionName(nameof(ActivityGetEntityIds))]
   public async Task<List<string>> ActivityGetEntityIds([ActivityTrigger] string model)
   {
      return await _idGen.GetIds(model);
   }
}

public class SomeIdGenerator
{
   public Task<List<string>> GetIds(string entity)
   {
      return Task.FromResult(Enumerable.Range(1, 500000).Select(x => $"{entity}_{x.ToString("00000000000000")}").ToList());
   }
}

public class SomeService
{
   public async Task<long> Process(string entity, IEnumerable<string> ids)
   {
      await Task.Delay(500);
      return ids.Count();
   }
}


public class ExecuteEntityModel
{
   public string Entity { get; set; }
   public List<string> Ids { get; set; }
   public int Index { get; set; }
}

@lukasvosyka
Author

Update

So after some further tests, there was indeed a knob that made it work, at least with the last approach (one orchestrator calling activities). Adding the following to host.json

 "extensions": {
    "durableTask": {
        "extendedSessionsEnabled": true,
        "extendedSessionIdleTimeoutInSeconds": 30
    }
}

as mentioned by @ConnorMcMahon (his pointer to the docs here made me aware of this (new?) setting, although those properties are not described in the table that follows in the documentation) prevented the function from stopping. Memory usage seemed to stay at about 200 MB, while CPU usage remained low.

There were also some "funny" moments when I tried the very same thing with the initial repro, i.e. starting sub-orchestrations. Although it did not seem to die either once extended sessions were enabled, it used way more memory and way more CPU than the one-orchestrator-calling-just-activities approach. But OK, I understand that sub-orchestrations are meant for something different, and replayability is a big design concept (as said, I was just playing around).

I am closing this now, since it seems to work with extended sessions enabled. A little bit of documentation work is needed anyway, I guess.

Thanks to everyone involved.

Side note
All of the aforementioned observations were made with the Diagnostic Tools in Visual Studio 2017 15.9.18 while debugging the app. The funny part was that the Diagnostic Tools didn't show any progress when using the sub-orchestration approach; I literally had to click into the console window to pause the process so that the Diagnostic Tools would continue recording. Once I allowed the process to continue, the Diagnostic Tools stopped again, but that is probably another issue :)
