Intensive functions can bring down infrastructure #899
@brettsam will investigate this.
The underlying issue with this and #822 is that we do not control the code being run in the function. Because the function and the host run in the same process, the function can monopolize CPU time and cause important host operations to fail. We've analyzed quite a few cases and have been able to fix the issue by having the customer improve their function code, but the root cause is hard to see when all you get is sporadic host errors.

The approach we'll move towards in the near future is having a "canary" timer running in the host. If that timer starts firing late, we know there's a problem somewhere, and we'll log an explicit message that can guide the user towards a solution. Right now we have no great place to log it -- the message would get buried somewhere in the host logs. We hope the Application Insights work will give us a good place to surface warnings like this.
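To illustrate the idea, here is a minimal sketch of such a canary timer. This is not the actual host implementation; the class name, interval, and tolerance values are all hypothetical. The principle is simply that a timer scheduled on the thread pool will fire late when something in the same process is starving the CPU:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical sketch: fire a timer on a fixed interval and measure how
// late each callback actually runs. Large lateness suggests the process
// is starved (e.g. a function is monopolizing the CPU).
class CanaryTimer : IDisposable
{
    private readonly Timer _timer;
    private readonly Stopwatch _watch = Stopwatch.StartNew();
    private readonly TimeSpan _interval = TimeSpan.FromSeconds(1);
    private readonly TimeSpan _tolerance = TimeSpan.FromSeconds(5);
    private long _expectedTicks;

    public CanaryTimer()
    {
        _expectedTicks = _interval.Ticks;
        _timer = new Timer(OnTick, null, _interval, _interval);
    }

    private void OnTick(object state)
    {
        // How far behind schedule is this callback?
        TimeSpan lateness = _watch.Elapsed - TimeSpan.FromTicks(_expectedTicks);
        _expectedTicks += _interval.Ticks;

        if (lateness > _tolerance)
        {
            // In a real host this would surface as an explicit warning
            // through the logging pipeline rather than the console.
            Console.WriteLine(
                $"Canary fired {lateness.TotalSeconds:F1}s late; " +
                "a function may be monopolizing the CPU.");
        }
    }

    public void Dispose() => _timer.Dispose();
}
```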
We have a similar issue with the Azure WebJobs SDK: sometimes the lock is not released, and because of that the function goes into a "Never finished" status. Only restarting the web job resolves the problem. Is there any update on the fix, or any recommendations to work around this?
We need to give customers a clear pattern for doing intensive (I/O or CPU) functions. If the failure is an intermittent network glitch, then a retry will solve it and that's great. But if the user function is really starving the host (which we're seeing happen, at least in my case), retry will lead to infinite looping and denial of service.
The canonical example is:
public async Task Drown([QueueTrigger] Payload x)
{
    // Read 1 million rows from Azure Tables
}
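One mitigating pattern (a sketch of the idea, not an official recommendation; the queue name, `_table`, `query`, and `Process` are hypothetical and assumed to be in scope) is to stream the work in bounded segments and yield between them, so the host's heartbeat callbacks get a chance to run:

```csharp
public async Task DrainSafely([QueueTrigger("work")] Payload x)
{
    TableContinuationToken token = null;
    do
    {
        // Paged read via the Azure Storage SDK: bounded memory and a
        // bounded amount of synchronous CPU work per iteration.
        var segment = await _table.ExecuteQuerySegmentedAsync(query, token);
        token = segment.ContinuationToken;
        Process(segment.Results);

        // Yield the thread so host timers/heartbeats can be scheduled
        // before the next batch begins.
        await Task.Yield();
    } while (token != null);
}
```

The key design choice is that no single step holds the CPU for the entire run, so the host's "keep alive" operations described below are not starved out.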
This is a general problem because many kinds of functions need some sort of "keep-alive" that makes network calls while the function runs. These heartbeats are what tell other workers that this node is still alive (as opposed to orphaned).
For example, QueueTrigger needs to keep renewing the queue message's visibility timeout. Service Bus needs long polling. [Singleton] needs to renew its lease. Event Hubs' EventProcessorHost needs to own a lease.
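As one concrete example of such a heartbeat, the queue-message case can be sketched as a background loop that keeps extending the message's invisibility while the function runs. `CloudQueue.UpdateMessageAsync` is the real Azure Storage SDK call; the renewal intervals and the assumption that `queue` and `message` are in scope are mine:

```csharp
// Sketch: renew a queue message's visibility timeout so other workers
// don't see it reappear while a long-running function still owns it.
async Task KeepMessageAliveAsync(CloudQueue queue, CloudQueueMessage message,
                                 CancellationToken token)
{
    try
    {
        while (!token.IsCancellationRequested)
        {
            // Extend invisibility for another 5 minutes.
            await queue.UpdateMessageAsync(
                message,
                TimeSpan.FromMinutes(5),
                MessageUpdateFields.Visibility);

            // Renew well before the timeout expires.
            await Task.Delay(TimeSpan.FromMinutes(4), token);
        }
    }
    catch (OperationCanceledException)
    {
        // Function finished; stop renewing.
    }
}
```

If the function monopolizes the CPU, this loop never gets scheduled, the visibility timeout lapses, and another worker picks up the same message, which is exactly the failure mode described above.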