
Intensive functions can bring down infrastructure #899

Open
MikeStall opened this issue Nov 4, 2016 · 3 comments

@MikeStall (Contributor)

We need to give customers a clear pattern for writing intensive (I/O- or CPU-bound) functions. If it's an intermittent network glitch, then a retry will solve it and that's great. But if the user function is genuinely starving the host (which we're seeing happen, at least in my case), retries will lead to infinite looping and denial of service.

The canonical example is:

public async Task Drown([QueueTrigger("bigjobs")] Payload x)  // queue name is a placeholder
{
    // Read 1 million rows from Azure Tables
}

This is a general problem because many kinds of functions need some sort of "keep-alive" making network calls while the function runs. These heartbeats are what tell other workers that this node is still alive (as opposed to orphaned).
For example, QueueTrigger needs to keep renewing the queue message's visibility timeout, Service Bus needs long polling, [Singleton] needs to hold its lease, and the Event Hubs EventProcessorHost needs to own a lease.
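
To make this concrete, here is a rough sketch of a heartbeat-friendlier version of the function above. The queue name, table name, and Payload type are placeholders I'm assuming for illustration, not anything prescribed by the SDK; it relies on the WebJobs Table binding and the Azure Storage segmented-query API. Reading the table one segment at a time keeps each await short, so the host's background renewals get a chance to run:

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.WindowsAzure.Storage.Table;

public class Payload { }   // placeholder message type

public class Functions
{
    public static async Task DrainInBatches(
        [QueueTrigger("bigjobs")] Payload x,      // "bigjobs" is a placeholder queue name
        [Table("BigTable")] CloudTable table,     // "BigTable" is a placeholder table name
        CancellationToken token)                  // lets the host stop the function cleanly
    {
        var query = new TableQuery<DynamicTableEntity>();
        TableContinuationToken continuation = null;
        do
        {
            token.ThrowIfCancellationRequested();

            // Each segment returns at most 1,000 rows; the await hands the
            // thread back to the pool so host heartbeats (visibility-timeout
            // renewal, lease renewal) are not starved.
            var segment = await table.ExecuteQuerySegmentedAsync(query, continuation);
            continuation = segment.ContinuationToken;

            foreach (var row in segment.Results)
            {
                // Process one row; keep per-row work short and non-blocking.
            }
        } while (continuation != null);
    }
}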

@lindydonna (Contributor)

@brettsam will investigate this.

@brettsam (Member) commented Apr 4, 2017

The underlying issue with this and #822 is that we do not control the code being run in the function. Because the function and the host run in the same process, the function can monopolize CPU time, causing important host operations to fail. We've analyzed quite a few cases and have been able to fix the issue by having the customer improve their function code, but the root cause is hard to see when all you get is sporadic host errors.

The approach we'll move toward in the near future is a "canary" timer running in the host. If that timer starts firing late, we know there's a problem somewhere, and we'll log an explicit message that can guide the user toward a solution. Right now we have no great place to log it -- the message would get buried somewhere in the host logs. We hope the Application Insights work will give us a good place to surface warnings like this.
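
To illustrate the idea (a rough sketch, not the actual host code): a System.Threading.Timer tick is dispatched on a thread-pool thread, so if function code is starving the CPU or the pool, the tick arrives late, and the lateness itself is the signal:

using System;
using System.Diagnostics;
using System.Threading;

public sealed class CanaryTimer : IDisposable
{
    private static readonly TimeSpan Interval = TimeSpan.FromSeconds(1);
    private static readonly TimeSpan Tolerance = TimeSpan.FromSeconds(5);

    private readonly Timer _timer;
    private readonly Stopwatch _sinceLastTick = Stopwatch.StartNew();

    public CanaryTimer()
    {
        // Timer callbacks run on thread-pool threads, so a starved pool
        // delays the tick -- exactly the condition we want to detect.
        _timer = new Timer(OnTick, null, Interval, Interval);
    }

    private void OnTick(object state)
    {
        TimeSpan elapsed = _sinceLastTick.Elapsed;
        _sinceLastTick.Restart();

        if (elapsed > Interval + Tolerance)
        {
            // In the real host this would go to a user-visible log
            // (e.g. Application Insights) rather than the console.
            Console.Error.WriteLine(
                $"Canary fired {(elapsed - Interval).TotalSeconds:F1}s late; " +
                "function code may be monopolizing the CPU or thread pool.");
        }
    }

    public void Dispose() => _timer.Dispose();
}

The interval and tolerance values here are made up; the point is only that late ticks correlate with the same starvation that delays lease renewals and other host heartbeats.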

@paulbatum paulbatum modified the milestones: April 2017, May 2017 May 2, 2017
@paulbatum paulbatum modified the milestones: May 2017, June 2017 Jun 20, 2017
@brettsam brettsam modified the milestones: Next, June 2017 Jun 29, 2017
@pratap284

We have a similar issue with the Azure WebJobs SDK: sometimes the lock is not released, and because of that the function goes into a "Never finished" status. The problem can only be resolved by restarting the web job. Is there any update on a fix, or any recommendation for solving this problem?
