The system lacked sufficient buffer space or because a queue was full #1127
Just a shot in the dark here, but would changing the code to play with the values below make any difference in this situation? We previously used the lines below in an unrelated project to speed up writes to Azure Storage. Note that we do not currently use this code; this is just a question.

```csharp
// https://stackoverflow.com/questions/12750302/how-to-achive-more-10-inserts-per-second-with-azure-storage-tables/12750535#12750535
ServicePointManager.UseNagleAlgorithm = false;
ServicePointManager.Expect100Continue = false;
```
Steps we have taken so far:
So far it's working, will continue monitoring.
How to monitor the number of connections made by each App Service in your App Service plan, using the Azure portal: Monitor > Metrics > Select a resource > select your subscription and resource group, then under Resource type select only App Services > select your App Service > then in the graph, under Metric, select Connections.
@alexkarcher-msft can you look into this?
Update: It has been about 7 days since we took the steps in the post above ("Steps we have taken so far"). Since making those changes we have not seen any spikes. I am still worried, however, because we never found an explanation for why one of the instance servers had its CPU pegged at 100% constantly before the changes, while the others were not doing much. It sounds like bad instance management to me.
Yet another update: Whenever we manually queue refreshing more items using Azure Functions, we consistently still hit this port limit. The machines are otherwise fine; CPU usage and RAM are low. It seems we are hitting some artificial and small limit on the number of outgoing connections. How can we adjust this limit? Potentially useful reference for the Azure Functions team: https://support.microsoft.com/en-us/help/196271/when-you-try-to-connect-from-tcp-ports-greater-than-5000-you-receive-t
The graph below shows the maximum number of connections. Each line is one of our apps, all apps are in the same App Service Plan. As you can see, it all goes sideways when we hit 4K max connections. I assume this is a hard limit that you could increase with the directions in the link I provided above? |
More interesting findings. In the graph below, each line is the max number of connections for a server instance. All the data in the graph is for a single app in the App Service plan. The way we queue a bunch of manual work is to have an Azure Function generate, let's say, 1K Event Grid messages. Then another function in the same app receives and processes all those incoming Event Grid messages. The graph seems to suggest that when we receive a bunch of events from Event Grid, for some reason a single machine receives and processes most of the event traffic. Is this expected? If the traffic were balanced across all of the machines as expected, then maybe this problem would not be so apparent.

To summarize, my current suggestions are:
These limits all line up with the published connection limits of the platform. The maximum connection limits are the following:
This limit cannot be raised in App Service, as the system is multi-tenant and relies on a series of load balancers shared by many customers. The load balancing behavior of Event Grid looks more suspect. I'm going to poke around and try to find out if there is anything you can change, but from a cursory look through our host.json reference and the Event Grid docs, it looks like there is very little you can do to control where messages go, since Event Grid pushes messages to the Function Apps. The polling message services like Service Bus and Event Hubs both have batch size controls that will ensure messages are load balanced and not all given to one instance.
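For reference, a minimal host.json sketch of the batch-size controls mentioned above for the polling triggers. This assumes the Functions v2 host.json schema; the values shown are illustrative, not recommendations, so check the host.json reference for your runtime version:

```json
{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "messageHandlerOptions": {
        "maxConcurrentCalls": 16
      }
    },
    "eventHubs": {
      "eventProcessorOptions": {
        "maxBatchSize": 64
      }
    }
  }
}
```

Lowering these caps how much work any single instance pulls at once, which is what keeps the polling triggers from dumping everything onto one machine.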
Thanks @alexkarcher-msft.
It seems the Event Grid load balancing affinity is for some reason done per function name. If I have two functions that receive Event Grid events, all of the calls for one function have affinity to one machine, and all of the calls for the second function have affinity to another machine in our pool. This is obviously still not good, but maybe it helps in debugging the issue.
Regarding the Event Grid load balancing issue, Azure support has asked me to try disabling ARR Affinity inside each of our App Services. This feature tries to direct traffic from the same "user" to the same machines, which is exactly what we DON'T want. To do that, click on the App Service, then click on Application Settings, then switch ARR Affinity to Off and click Save.
Thanks for documenting it all @zmarty! We've run into the same error, although we are using storage queue trigger instead of Event Grid. |
We are also experiencing this. It's difficult to pinpoint the exact root cause without being able to view outbound connections in the Portal (I think you used to be able to but I can't figure out how anymore), but mostly seeing it via our Service Bus Topic Triggers. |
We've been experiencing the same error, but with another resource: Azure Functions Proxies. I'm sure this error is related to connection limits. In our case, Functions Proxies pointing to static SPAs cause many HTTP requests, especially when published over the Internet (spiders and crawlers make even more requests). When proxies are active, they make a lot of TCP connections for forwarding HTTP responses, so we reach these connection limits every time. We tested upgrading the plan to E3 (Elastic Premium) and the issue was gone.
@zmarty are any of your functions using function bindings to your resources (e.g. Service Bus trigger, Cosmos input binding)? We are having similar issues (though we're getting socket exceptions, like #1112 mentions) and are already using static clients for our non-binding references. We also looked through the Cosmos input binding source code and it appears to use static clients as well. We're at a loss as to what could be using up all the ports or connections.
We had a similar error case. But it turned out some instance field (a connection to Service Bus) had slipped through our hands... More info: https://blog.tech-fellow.net/2022/01/28/how-to-easily-exhaust-snat-sockets-in-your-application/
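To illustrate the kind of leak described above, a minimal C# sketch (the class and method names are hypothetical, not from this thread): a client created per invocation opens a fresh outbound socket each time and leaves the old one lingering in TIME_WAIT, while a single static client reuses its connection pool.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class LeakyFunction
{
    // Anti-pattern: a new HttpClient per call. Disposing it does not
    // immediately free the socket, so under load the sandbox's outbound
    // connection quota is exhausted and you get the buffer-space error.
    public static async Task<string> GetLeakyAsync(string url)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(url);
        }
    }
}

public static class PooledFunction
{
    // Recommended: one static client shared across all invocations, per
    // https://github.com/Azure/azure-functions-host/wiki/Managing-Connections
    private static readonly HttpClient Client = new HttpClient();

    public static Task<string> GetAsync(string url) => Client.GetStringAsync(url);
}
```

The same applies to Service Bus or Storage clients: any client held in a per-invocation or per-instance field, rather than a static one, can exhaust SNAT ports the same way.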
It looks like everyone reproducing this in the discussion is using .NET Functions. I'd like to check: has anyone hit this issue in a PowerShell Function?
Summary: All of the Azure Functions under a shared App Service plan suddenly started failing in production with error: "An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full"
@tohling - opening a bug similar to #1112. We are a team inside Microsoft facing issues in production, so any timely help on this would be most appreciated. We already have an open support ticket; for reference: 119021321000198
Azure Functions version: v2 (.NET core)
App Service plan: FastPath-prod-east-us
Affected Function Apps: All of them, but if you want a specific example, use FastPathIceHockey-prod-east-us, function: NHL-Game-GameEndNotification
We already follow the best practices of using static clients when making outbound connections, as described here: https://github.com/Azure/azure-functions-host/wiki/Managing-Connections
Here are some examples of how we define the clients:
In each function we define static clients like so:
Then inside static constructors for each function we make instances:
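The original snippets did not survive here, but a minimal sketch of the pattern just described (static client fields instantiated once in a static constructor). The class name, client types, and setting names below are illustrative assumptions, not our actual code:

```csharp
using System;
using System.Net.Http;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public static class GameEndNotification
{
    // Declared static so every invocation in this process
    // reuses the same underlying connections.
    private static readonly HttpClient HttpClient;
    private static readonly CloudTableClient TableClient;

    // The static constructor runs once per process, so each
    // client is instantiated exactly one time.
    static GameEndNotification()
    {
        HttpClient = new HttpClient();
        TableClient = CloudStorageAccount
            .Parse(Environment.GetEnvironmentVariable("AzureWebJobsStorage"))
            .CreateCloudTableClient();
    }
}
```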
Is there anything else we can do to mitigate this? Having a bug suddenly take down all of our Function Apps sounds like a P0 to me.
You have our permission to poke around all the logs etc. You can find the exception in Application Insights. Please feel free to bring in more folks to look into this. I will also separately send you an e-mail with our entire source code.
Thanks!