
The system lacked sufficient buffer space or because a queue was full #1127

Open
zmarty opened this issue Feb 13, 2019 · 19 comments

zmarty commented Feb 13, 2019

Summary: All of the Azure Functions under a shared App Service plan suddenly started failing in production with error: "An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full"

@tohling - opening a bug similar to #1112. We are a team inside Microsoft facing issues in production, so any timely help on this would be much appreciated. We already have an open support ticket; for reference, its number is 119021321000198.

Azure Functions version: v2 (.NET core)
App Service plan: FastPath-prod-east-us
Affected Function Apps: All of them, but if you want a specific example, use FastPathIceHockey-prod-east-us, function: NHL-Game-GameEndNotification

We already follow the best practice of using static clients when making outbound connections, as described here: https://github.com/Azure/azure-functions-host/wiki/Managing-Connections

Here are some examples of how we define the clients.

In each function, we define static clients like so:

private static BlobStorage blobStorageClient;
private static CosmosDbClient cosmosDbClient;
private static HttpClient httpClient;

Then, inside the static constructor of each function, we create the instances:

blobStorageClient = new BlobStorage(config.BlobStorageConnectionString);
cosmosDbClient = new CosmosDbClient(endpointUrl: config.CosmosDbEndpointUrl, primaryKey: config.CosmosDbPrimaryKey);

var clientHandler = new HttpClientHandler();
httpClient = new HttpClient(clientHandler);
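
Put together, the pattern for a single function looks roughly like the sketch below. This is a simplified illustration: BlobStorage and CosmosDbClient are our own wrapper classes, and the class name and the Config.Load() helper are placeholders, not our actual code.

public static class GameEndNotification
{
    // One static instance of each client, reused across all invocations of the function
    private static BlobStorage blobStorageClient;
    private static CosmosDbClient cosmosDbClient;
    private static HttpClient httpClient;

    // The static constructor runs once per app domain, not once per invocation
    static GameEndNotification()
    {
        var config = Config.Load(); // placeholder for however configuration is loaded

        blobStorageClient = new BlobStorage(config.BlobStorageConnectionString);
        cosmosDbClient = new CosmosDbClient(
            endpointUrl: config.CosmosDbEndpointUrl,
            primaryKey: config.CosmosDbPrimaryKey);

        var clientHandler = new HttpClientHandler();
        httpClient = new HttpClient(clientHandler);
    }
}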

Is there anything else we can do to mitigate this? Having a bug suddenly take down all of our Function Apps sounds like a P0 to me.

You have our permission to poke around all the logs etc. You can find the exception in Application Insights. Please feel free to bring in more folks to look into this. I will also separately send you an e-mail with our entire source code.

Thanks!


zmarty commented Feb 13, 2019

Just a shot in the dark here, but would changing the code to play with the values below make any difference in this situation? We have used the lines below in an unrelated project to speed up writes to Azure Storage.

Note we currently do not use the code below, this is just a question.

// https://stackoverflow.com/questions/12750302/how-to-achive-more-10-inserts-per-second-with-azure-storage-tables/12750535#12750535
// Disable Nagle's algorithm so small requests are sent immediately instead of being buffered
ServicePointManager.UseNagleAlgorithm = false;
// Skip the "Expect: 100-continue" handshake to save a round trip per request
ServicePointManager.Expect100Continue = false;


zmarty commented Feb 14, 2019

Steps we have taken so far:

  • Reviewed the entire code base to make sure we use static clients. We found that we use 4 different kinds of clients (HTTP, Azure Storage, CosmosDB, EventGrid). For 3 of them we already used static clients per function.
  • We moved to using one singleton per client, shared across all of the functions (not per function!). So we now basically have only 4 instances - one for each type of client, per app domain (see the sketch below).
  • We ensured the last of the 4 clients also used a singleton
  • Redeployed application
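
For reference, the shared-singleton shape we ended up with looks roughly like this. It is a simplified sketch: SharedClients, Config.Load(), and the wrapper types are placeholder names, not our exact code.

// One shared instance of each client type for the whole app domain.
// Every function references SharedClients.* instead of creating its own clients.
public static class SharedClients
{
    private static readonly Config config = Config.Load(); // placeholder config loader

    public static readonly HttpClient Http = new HttpClient();

    public static readonly BlobStorage Blob =
        new BlobStorage(config.BlobStorageConnectionString);

    public static readonly CosmosDbClient CosmosDb =
        new CosmosDbClient(
            endpointUrl: config.CosmosDbEndpointUrl,
            primaryKey: config.CosmosDbPrimaryKey);

    // The Event Grid client follows the same pattern.
}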

So far it's working, will continue monitoring.


zmarty commented Feb 14, 2019

Useful articles on this subject:
https://docs.microsoft.com/en-us/azure/architecture/antipatterns/improper-instantiation/
https://docs.microsoft.com/en-us/azure/azure-functions/manage-connections


zmarty commented Feb 14, 2019

How to monitor the number of connections made by each App Service in your App Service plan, using the Azure portal:

Monitor > Metrics > Select a resource > Select subscription, resource group, then under Resource type make sure you only select App Services > Select your App service > Then in the graph under Metric select Connections

@ColbyTresness

@alexkarcher-msft can you look into this?

ColbyTresness added this to the Active Questions milestone on Feb 21, 2019

zmarty commented Feb 21, 2019

Update: it has been about 7 days since we took the steps described in the "Steps we have taken so far" post above. Since making those changes, we have not seen any spikes.

I am, however, still worried, because we did not find any explanation for the fact that, before we made the changes, one of the instance servers had its CPU pegged at 100% constantly while the others were not doing much. It sounds like bad instance management to me.


zmarty commented Feb 27, 2019

Yet another update: whenever we manually queue more items to refresh using Azure Functions, we consistently still hit this port limit. The machines are otherwise fine; CPU and RAM usage are low.

It seems we are hitting some artificially small limit on the number of outgoing connections. How can we adjust this limit?

Potentially useful reference for Azure Functions team: https://support.microsoft.com/en-us/help/196271/when-you-try-to-connect-from-tcp-ports-greater-than-5000-you-receive-t

@alexkarcher-msft @tohling


zmarty commented Feb 27, 2019

The graph below shows the maximum number of connections. Each line is one of our apps; all apps are in the same App Service plan. As you can see, it all goes sideways when we hit 4K max connections. I assume this is a hard limit that could be increased with the directions in the link I provided above?

[Graph: max connections per app]


zmarty commented Feb 27, 2019

More interesting findings: in the graph below, each line is the max number of connections for a server instance. All the data in the graph is for a single app in the App Service plan.

The way we queue a bunch of manual work is to have an Azure Function generate, let's say, 1K Event Grid messages. Then we have another function in the same App that receives and processes all those incoming Event Grid messages.
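
For context, the receiving function is just a standard Event Grid trigger, roughly like the sketch below (simplified; the class, function, and event handling shown here are illustrative, not our actual code):

using Microsoft.Azure.EventGrid.Models;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.EventGrid;
using Microsoft.Extensions.Logging;

public static class RefreshItem
{
    // Event Grid pushes each event to the Function App as a separate HTTP call.
    [FunctionName("NHL-Game-RefreshItem")]
    public static void Run([EventGridTrigger] EventGridEvent eventGridEvent, ILogger log)
    {
        log.LogInformation("Processing event {Id}", eventGridEvent.Id);
        // ... refresh the item here, using the shared static clients ...
    }
}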

The graph seems to suggest that when we receive a bunch of events from Event Grid, for some reason only a single machine receives and processes most of the event traffic. Is this expected? If the traffic were balanced across all of the machines as expected, this problem would probably not be so apparent.

To summarize, my current suggestions are:

  • Increase number of possible connections from 4-5K to 65K
  • Figure out why (I assume incoming) Event Grid traffic is not balanced across all of our machines. This one was quite surprising.

[Graph: max connections per server instance]


alexkarcher-msft commented Feb 27, 2019

These limits all line up with the published connection limits of the platform.

The maximum connection limits are the following:

  • 1,920 connections per B1/S1/P1 instance
  • 3,968 connections per B2/S2/P2 instance
  • 8,064 connections per B3/S3/P3 instance

This limit cannot be raised in App Service as the system is multi-tenant and relies on a series of load balancers shared by many customers.

It looks like the load balancing behavior of Event Grid is the more suspect part. I'm going to poke around and try to find out if there is anything you can change, but from a cursory look through our host.json reference and the Event Grid docs, it looks like there is very little you can do to control where messages go, since Event Grid pushes messages to the Function Apps. The polling message services like Service Bus and Event Hubs both have batch size controls that ensure messages are load balanced rather than all given to one instance.


zmarty commented Feb 27, 2019

Thanks @alexkarcher-msft.

  1. If I understand correctly, even though we picked an App Service plan and therefore use "dedicated" VMs, we are still limited by load balancers that sit outside our App Service plan and feed traffic to our machines? If so, that is problematic, since even when we have a large number of incoming connections our machines are pretty idle. So we cannot really use the capacity we pay for, due to an external limitation of the overall system. I will follow up with you internally on this.

  2. That link you shared talks about outgoing connections. Are the limits set on the sum of incoming+outgoing, or just on outgoing?

  3. Yes, please check on the Event Grid load balancing. It seems to me that if the incoming connections were correctly balanced across all of our machines, we would probably not hit this 4K limit.


zmarty commented Feb 28, 2019

It seems the Event Grid load balancing affinity is, for some reason, done per function name. If I have two functions that receive Event Grid events, all the calls for one of the functions have affinity to one machine, and all of the calls for the second function have affinity to another machine in our pool. This is obviously still not good, but maybe it helps in debugging the issue.


zmarty commented Mar 8, 2019

Regarding the Event Grid load balancing issue, Azure support has asked me to try disabling ARR Affinity inside each of our App Services. This feature tries to direct traffic from the same "user" to the same machines, which is exactly what we DON'T want.

To do that, click on the App Service, then click Application Settings, switch ARR Affinity to Off, and click Save.

vludax commented Apr 11, 2019

Thanks for documenting it all @zmarty!

We've run into the same error, although we are using a storage queue trigger instead of Event Grid.

mcupito commented Apr 16, 2019

We are also experiencing this. It's difficult to pinpoint the exact root cause without being able to view outbound connections in the Portal (I think you used to be able to, but I can't figure out how anymore), but we are mostly seeing it via our Service Bus topic triggers.

ggondim commented Mar 6, 2020

> These limits all line up with the published connection limits of the platform.
>
> The maximum connection limits are the following:
>
>   • 1,920 connections per B1/S1/P1 instance
>   • 3,968 connections per B2/S2/P2 instance
>   • 8,064 connections per B3/S3/P3 instance
>
> This limit cannot be raised in App Service as the system is multi-tenant and relies on a series of load balancers shared by many customers.
>
> It looks like the load balancing behavior of eventgrid is more suspect. I'm going to poke around and try to find out if there is anything you can change, but a cursory look through our host.json reference and the eventgrid docs, it looks like there is very little you can do to control where messages go, as eventgrid is pushing messages to the Function Apps. The polling message services like Service Bus and Event Hub both have batch size controls that will ensure that messages are load balanced and not all given to one instance.

We've been experiencing the same error, but with another feature: Azure Functions Proxies.

I'm sure this error is related to the connection limits. In our case, Functions Proxies pointing to static SPAs cause many HTTP requests, especially when published over the Internet (spiders and crawlers make even more requests). When the proxies are active, they open a lot of TCP connections to forward HTTP responses, so we hit these connection limits every time.

We tested upgrading the plan to E3 (Elastic Premium) and the issue was gone.

gabrieljoelc commented Nov 24, 2020

@zmarty, are any of your functions using function bindings to your resources (e.g. a Service Bus trigger or a Cosmos input binding)? We are having similar issues (though we're getting socket exceptions, like #1112 mentions) and are already using static clients for our non-binding references. We also looked through the Cosmos input binding source code and it looks to be using static clients as well. We're at a loss as to what could be using up all the ports or connections.

@valdisiljuconoks

We had a similar error case, but it turned out that an instance field (a connection to Service Bus) had slipped through our hands... More info: https://blog.tech-fellow.net/2022/01/28/how-to-easily-exhaust-snat-sockets-in-your-application/
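
In other words, something like this (an illustrative sketch, not the exact code from the post; QueueClient and the "ServiceBusConnection" setting are stand-ins for whatever Service Bus connection is involved):

using System;
using Microsoft.Azure.ServiceBus;

public class Publisher
{
    // Anti-pattern: an instance field means every new Publisher instance
    // (for example, one per invocation) opens its own Service Bus connection.
    private readonly QueueClient perInstanceClient =
        new QueueClient(Environment.GetEnvironmentVariable("ServiceBusConnection"), "my-queue");

    // Fix: a single static client, shared by all instances and invocations.
    private static readonly QueueClient sharedClient =
        new QueueClient(Environment.GetEnvironmentVariable("ServiceBusConnection"), "my-queue");
}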

v-bafa commented Jun 8, 2023

It looks like all the repros in this discussion are from .NET Functions. I would like to check whether anyone has hit this issue in a PowerShell Function.
In our function app, we have 15 functions running every 5 or 10 minutes. Each function reads Key Vault and invokes a remote script over Hybrid Connections.
The connections eventually hit the maximum limit (2K in our case) and exceptions are thrown. After restarting the function app everything is back to normal, but this happens again after around 5 days.
