
Chance of getting 409 while renewing the primary host lock #1864

Open
brettsam opened this issue Sep 5, 2017 · 51 comments

Comments

@brettsam
Member

brettsam commented Sep 5, 2017

While investigating another issue I stumbled upon a lot of these 409s coming from renewing the primary host lock. I did some investigation and I believe that, because we do a 'fire and forget' with the host disposal, it's possible that a new host comes up and acquires the lock just before the old host releases it. That means that the next attempt to renew fails, as the lock has been released.

The error looks like: Singleton lock renewal failed for blob 'locks/AppName/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2017-09-02T18:38:31.526Z (111006 milliseconds ago) with a duration of 57508 milliseconds. The lease period was 15000 milliseconds.

I think it's happening with something like below.
Note 1 -- this is all on a single instance and references to hosts are about the hosts being cycled via the ScriptHostManager.
Note 2 -- All of the host locks use the current machine id as the proposed lease id, which allows multiple hosts on the same machine to modify the lease.

Host1 -> acquire lock
Host1 -> renew lock
something triggers host recycle and creates a new host while disposing of the old host
Host2 -> acquire lock
Host1 -> release lock
Host2 -> renew lock -> throw

It's actually possible for several other combinations to happen (I've seen a race where Host1 renews right as it releases, also causing a throw).

Ultimately, these errors seem benign as they'll eventually recover and all will be well, but it causes a lot of errors in the logs (I see 100k+ 409 logs that involve the host lock over the last week).
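
For anyone who wants to poke at the lease behavior outside the host, here's a rough sketch using the current Azure.Storage.Blobs SDK (this is not the runtime's Singleton code; the connection string and blob path are placeholders). Because both "hosts" propose the same lease id, the second acquire succeeds against the already-leased blob, and the final renew throws once the first host releases:

using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

class LeaseRaceSketch
{
    static async Task Main()
    {
        var container = new BlobContainerClient("<storage-connection-string>", "azure-webjobs-hosts");
        await container.CreateIfNotExistsAsync();

        var blob = container.GetBlobClient("locks/AppName/host");
        if (!(await blob.ExistsAsync()).Value)
        {
            await blob.UploadAsync(BinaryData.FromString(string.Empty));
        }

        // Both "hosts" propose the same lease id, mirroring the shared machine id.
        string sharedLeaseId = Guid.NewGuid().ToString();
        var host1Lease = blob.GetBlobLeaseClient(sharedLeaseId);
        var host2Lease = blob.GetBlobLeaseClient(sharedLeaseId);

        await host1Lease.AcquireAsync(TimeSpan.FromSeconds(15)); // Host1 -> acquire lock
        await host1Lease.RenewAsync();                           // Host1 -> renew lock

        // Host recycle: the new host starts while the old one is still being disposed.
        await host2Lease.AcquireAsync(TimeSpan.FromSeconds(15)); // Host2 -> acquire lock (same proposed id, succeeds)
        await host1Lease.ReleaseAsync();                         // Host1 -> release lock

        await host2Lease.RenewAsync(); // Host2 -> renew lock -> throws 409 (LeaseIdMismatchWithLeaseOperation in our logs)
    }
}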

@christopheranderson christopheranderson added this to the Triaged milestone Sep 20, 2017
@paulbatum paulbatum assigned alrod and unassigned alrod Oct 4, 2017
@paulbatum paulbatum modified the milestones: Triaged, Next Oct 4, 2017
@vovikdrg

For now, a workaround is to add a settings.job file with

{ "is_singleton": true }

It will not scale, but at least it will not fail.

@paulbatum paulbatum modified the milestones: Next, Sprint 11 Nov 17, 2017
alrod added a commit to alrod/azure-webjobs-sdk-script that referenced this issue Nov 22, 2017
@mathewc
Member

mathewc commented Dec 11, 2017

@brettsam You said "acquires the lock just before the old host releases it" - how can that happen? The lease can only be held by one. Perhaps what you're saying is something like:

  • h1 has lock and is renewing it regularly
  • host restart is triggered
  • h1 is orphaned (on background thread)
  • h1 stopasync is called, and while that is progressing h2 is starting up
  • h2 attempts to acquire lock, before h1 has released it (before disposal happens)

However, PrimaryHostCoordinator.AcquireOrRenewLeaseAsync seems to already handle this case correctly - if it doesn't have the lease and fails to acquire it, no error is thrown.
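
For context, the pattern I mean looks roughly like this (illustrative only, not the actual PrimaryHostCoordinator source; it uses the current blob SDK and made-up names):

using System;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs.Specialized;

static class PrimaryLeaseSketch
{
    // Returns true if this host currently holds (or just acquired) the primary lease.
    public static async Task<bool> AcquireOrRenewLeaseAsync(BlobLeaseClient lease, bool hasLease)
    {
        try
        {
            if (hasLease)
            {
                await lease.RenewAsync(); // a failure here is what shows up as the 409 error in the logs
            }
            else
            {
                await lease.AcquireAsync(TimeSpan.FromSeconds(15));
            }
            return true;
        }
        catch (RequestFailedException ex) when (!hasLease && ex.Status == 409)
        {
            // Couldn't acquire because another host holds the lease:
            // we're simply not primary, and no error is thrown or logged.
            return false;
        }
    }
}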

@alrod
Member

alrod commented Dec 12, 2017

@mathewc, the lease can be held by multiple hosts inside a single instance because we use the same leaseId (instanceId) to acquire the lease. It's the flow @brettsam mentioned:
Host1 -> acquire lock
Host1 -> renew lock
something triggers host recycle and creates a new host while disposing of the old host
Host2 -> acquire lock
Host1 -> release lock
Host2 -> renew lock -> throw

@mathewc
Member

mathewc commented Dec 12, 2017

I see, thanks for clarifying. If the problem is that the renewal fails because we already have the lock then can't we handle that condition by ignoring that specific error?
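
Roughly like this, I imagine -- a sketch only, assuming the storage error code surfaces as the LeaseIdMismatchWithLeaseOperation string we see in the logs:

using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs.Specialized;

static class RenewalSketch
{
    public static async Task RenewIgnoringKnownMismatchAsync(BlobLeaseClient lease)
    {
        try
        {
            await lease.RenewAsync();
        }
        catch (RequestFailedException ex) when (
            ex.Status == 409 && ex.ErrorCode == "LeaseIdMismatchWithLeaseOperation")
        {
            // The lease id no longer matches because another host on the same instance
            // released/re-acquired the lease; treat this as losing primary status
            // instead of logging it as an error.
        }
    }
}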

@fabiocav
Member

Work has been merged in 1.x. Need to make sure this isn't an issue in 2.0.

@mathewc
Member

mathewc commented Jan 23, 2018

Related to #2326

@christopheranderson christopheranderson added this to the Triaged milestone Feb 15, 2018
@paulbatum paulbatum modified the milestones: Triaged, Backlog Apr 6, 2018
@pseabury

@brettsam

Any lease renewals that end with /host are likely nothing to worry about. Instances can swap between "primary" hosts without affecting the running functions. Let me know if you see anything that may be leading to failures and I can investigate.

2020-04-15 09:35:27,550 Host.Singleton RD00155DE250FA ERROR Singleton lock renewal failed for blob 'fr911prodadapterprocessorfunc/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2020-04-15T09:34:52.747Z (34714 milliseconds ago) with a duration of 140 milliseconds. The lease period was 15000 milliseconds.

Overnight I saw this error a number of times, but it caused enough of a delay (5-10 mins!) in consumption of Service Bus messages that it triggered several application-level errors for us, as processing these messages in a timely fashion is critical. Currently these are v2 functions with Service Bus triggers deployed to an App Service that is scaled out to 2-N instances.

Paul

@pseabury

Disregard my previous comment -- after further investigation, these errors were just coincidental with another network issue outside of Azure.

@ThomasArdal

I'm also seeing this error. This is the most recent instance:

Singleton lock renewal failed for blob 'myfunctionhere/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2020-04-26T09:50:57.017Z (18690 milliseconds ago) with a duration of 4 milliseconds. The lease period was 15000 milliseconds

I'm running on Functions v2 and all of the functions inside this app are HTTP-triggered and short-lived.

I agree with the other people in this thread. If this isn't an issue, it shouldn't be logged as an error. Errors are something that we need to look at and not simply ignore (IMO).

@ThomasArdal

Idea! Maybe use structured logging properties for the 2020-04-26T09:50:57.017Z (18690 milliseconds ago) part of the message. This would make it a lot easier for people to ignore errors of this type 👍
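
For example, something along these lines (a sketch of the idea, not the host's actual logging call), so the variable parts become filterable properties:

using System;
using Microsoft.Extensions.Logging;

static class SingletonLogSketch
{
    // Sketch: keep the message template constant and pass the variable parts as
    // structured properties so they can be filtered (or ignored) in queries.
    public static void LogRenewalFailure(
        ILogger logger, string blobName, int statusCode, string errorCode,
        DateTimeOffset lastRenewal, long millisecondsAgo, long durationMs, long leasePeriodMs)
    {
        logger.LogError(
            "Singleton lock renewal failed for blob '{BlobName}' with error code {StatusCode}: {ErrorCode}. " +
            "The last successful renewal completed at {LastRenewal:o} ({MillisecondsAgo} milliseconds ago) " +
            "with a duration of {DurationMs} milliseconds. The lease period was {LeasePeriodMs} milliseconds.",
            blobName, statusCode, errorCode, lastRenewal, millisecondsAgo, durationMs, leasePeriodMs);
    }
}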

@wpitallo

I have a web app running Azure Functions and I am getting this every time I restart the web app or swap slots.

2020-09-15T10:01:12Z [Error] Singleton lock renewal failed for blob '.../host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2020-09-15T10:00:47.566Z (24432 milliseconds ago) with a duration of 3 milliseconds. The lease period was 15000 milliseconds.

@fabiocav
Member

@wpitallo any other side effect? This may happen if the primary host changes and the lease is stolen by another instance.

@wpitallo

Not that I am aware of at this stage.

@shaw-system1

shaw-system1 commented Jan 5, 2021

Same issue here with an Azure Functions (~3) timer-triggered function:

Singleton lock renewal failed for blob '.../host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 0001-01-01T00:00:00Z (-2147483648 milliseconds ago) with a duration of 0 milliseconds. The lease period was 15000 milliseconds.

The last successful lock time on this is weird. It doesn't seem to cause problems; it just fills App Insights with severity 3 errors.

@ketanbembi

Hi All

I am facing the same issue in my Azure Function app.

It seems the real problem is that in our non-prod environments a single blob storage account is shared by multiple function apps. When one of the apps performs the renew-lease operation, the shared storage causes the "Singleton lock renewal failed for blob" error, which in turn restarts the job host, so the function stops abruptly without any function error.

I base this on my investigation: our production environment does not share a single storage account across multiple function apps, and we have never seen the "Singleton lock renewal failed for blob" error there.

So, as I understand it, the solution to this error is to avoid sharing a single blob storage account across multiple Azure Function apps.

TIA

@ThomasArdal

I can confirm that this happens in environments with individual storage accounts for each function app as well. I always create a new storage account for every new function app I create. Sharing storage accounts introduces other problems too, which is why I would never recommend sharing storage accounts between function apps 👍

@beaubgosse

I'll be following this, since we also see this problem with a consumption-plan Azure Function App.

@hiattp

hiattp commented Jan 4, 2022

We receive this error every time our durable function executes (we are using dedicated storage for this function and it is running in an isolated application service environment). Unfortunately this appears to halt the function execution after 80 seconds. So we have a durable function that will run for ~80 seconds, then we get the "Singleton lock renewal failed" message in the log traces and no further logging from our application code until it retries the durable function activity 30 minutes later. It cycles like this indefinitely, running for ~80 seconds every 30 minutes but never completing.

It doesn't seem to be directly related to our application logic because within that 80 second period the application logic can get further along (e.g. if we bump the thread count). And if we run the durable activity logic directly on a local instance of the docker container it will run to completion. So it seems to be time bound, and prevents our application code from executing despite no errors in the application logic itself.

We are running on version ~4, but downgrading to ~3 didn't seem to help.

@kaluznyt

kaluznyt commented Jan 21, 2022

I started seeing this after updating the function apps (version ~4 / .NET 6) and updating all of the dependencies. There are only timer triggers (multiple) and Service Bus triggers (multiple), and there is one storage account. On .NET Core 3.1 and some older library versions I hadn't seen this logging. Perhaps it's just noise; I haven't yet spotted whether it affects function execution in any way.

Request [xxxxxxxxxxx] PUT https://name.blob.core.windows.net/azure-webjobs-hosts/locks/fname/host?comp=lease
x-ms-lease-action:acquire
x-ms-lease-duration:15
x-ms-proposed-lease-id:xxxxxxxxxxxx
x-ms-version:2020-08-04
Accept:application/xml

Error response [xxxxxxxxxxxx] 409 There is already a lease present. (00.0s)
Server:Windows-Azure-Blob/1.0,Microsoft-HTTPAPI/2.0
x-ms-version:2020-08-04
x-ms-error-code:LeaseAlreadyPresent

These show up with severity "Warning" in Application Insights.

@shaeney

shaeney commented Jan 24, 2022

I also started seeing this after updating the function apps (version ~4 / .NET 6) and updating all of the dependencies.

A single Service Bus trigger. 1 function, 1 storage account.

Didn't see this before the update to Functions v4 and .NET 6.

@Bpflugrad

This was affecting my Azure Functions v4 apps on .NET 6 that use deployment slots.

What I ended up discovering was that one of my swap slots was using the same storage account connection string as my production slot. Once I changed this, the lease problem stopped appearing in Application Insights.

@paul-datatech911

@Bpflugrad - that's good info, thanks!

@kacperwojtyniak

I am also seeing this error, although I am using a single Storage Account for a Function App. The issue appears when scaling out to multiple instances. I have just one BlobTrigger function and two TimerTrigger functions. Is there a way to avoid this error or at least make it not log as an exception in App Insights?

@framewerks

I also started seeing this after updating the function apps (version ~4 / .NET 6) and updating all of the dependencies.

A single Service Bus trigger. 1 function, 1 storage account.

Didn't see this before the update to Functions v4 and .NET 6.

Same for me; we discovered it after our Log Analytics costs increased (!). Like @Bpflugrad, I found I had slots sharing the same storage account, but even after fixing this the issue remains.

@jedmonsanto

Just giving my two cents: creating in Home Storage caused the same problem, but when creating in a local path it works fine. (Screenshot attached in the original comment.)

@chralph

chralph commented Dec 7, 2022

Adding that we also get this issue with any trigger type at random intervals. We tried to scale up and out, but this did not provide a solution. It is happening on multiple function apps with no correlation. We were advised that this is just noise and to ignore it, but it still shows as an error in Application Insights, which is not ideal.

@LeszekKalibrate

It happens in function apps (version ~3 / .NET Core 3.1) too.

@magnusjak

Also seeing this error (409 with 'LeaseIdMismatchWithLeaseOperation') on one of our function apps (runtime version ~4).
It doesn't happen on every run though, just once in a while. Still, it's annoying to have this pop up in our exceptions dashboard.

Is there no easy way to silence these errors if they don't really matter?
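
One thing I might try (assuming it's acceptable to do so; it will hide genuine lease problems as well): the message is logged under the Host.Singleton category, as in the log line pseabury posted above, so a host.json filter along these lines should keep it out of Application Insights:

{
  "version": "2.0",
  "logging": {
    "logLevel": {
      "Host.Singleton": "None"
    }
  }
}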

@gopala000

I see this event and then another thread starts processing the message. Any ideas how we can avoid the issue? Unless you code for idempotency, this will be a problem. For example, people will get invitation emails twice if the code doesn't exclude resending emails that were already sent in the last minute.

More details here. https://stackoverflow.com/questions/77070373/azure-functions-service-bus-trigger-settings-maxautolockrenewalduration-or-maxau
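
For what it's worth, the kind of idempotency guard I mean is roughly this (names are made up for illustration; a real app would use a shared store such as a table or cache keyed by message id, not an in-memory dictionary):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Sketch of an idempotency guard: record which message ids have already been handled
// and skip duplicates when a message is delivered or processed twice.
static class InvitationSender
{
    private static readonly ConcurrentDictionary<string, DateTimeOffset> _alreadySent = new();

    public static async Task HandleAsync(string messageId, Func<Task> sendInvitationEmail)
    {
        // Only the first handler for a given message id actually sends the email.
        if (!_alreadySent.TryAdd(messageId, DateTimeOffset.UtcNow))
        {
            return; // duplicate delivery -- the invitation was already sent (or is being sent)
        }

        await sendInvitationEmail();
    }
}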
