Cold start of Azure Function with large npm module trees is slow in Consumption plan #298
Comments
Currently, there is a short idle timeout on Function apps, so if they don't get requests for 5 minutes the next request causes a cold start, which takes longer. We are looking into increasing this timeout, which will make the situation much less common.
I've seen delays of up to 2 minutes, or even timeouts, with an Azure Function app and Node.js on a dynamic hosting plan. This is terrible! Running Functions runtime version 0.5.
Since upgrading to 0.5, delays have increased from 15 seconds to as much as 50 seconds on my side.
I'm currently mitigating the cold-start issue with an additional pre-warming Azure Function that periodically calls the other function. It is invoked every 4 minutes through a timer trigger, along the lines of the sketch below.
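The commenter's actual snippet isn't reproduced in this thread; a minimal sketch of such a pre-warmer, assuming a standard timer binding and a placeholder target URL, might look like this:

```json
{
  "bindings": [
    {
      "name": "keepWarmTimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 */4 * * * *"
    }
  ]
}
```

```js
// index.js: ping an HTTP-triggered function so the app never sits idle long
// enough to be unloaded. The URL is a placeholder, not the commenter's.
const https = require('https');

module.exports = function (context, keepWarmTimer) {
    https.get('https://myapp.azurewebsites.net/api/myfunction', (res) => {
        context.log('Keep-warm ping returned HTTP ' + res.statusCode);
        context.done();
    }).on('error', (err) => {
        context.log('Keep-warm ping failed: ' + err.message);
        context.done();
    });
};
```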
So, my Classic plan app is running Functions v0.4, and it seems that the debugging messages may work better there: I get errors showing up that I don't get in the dynamic app (the dynamic app just hangs for long periods). Could faults in one function affect the performance of the whole app? Sometimes I even get "BAD GATEWAY".
This is strange. To make sure that this is indeed related to 0.5 and not some other factor, can you try going back to 0.4 (instructions here)?
For me personally, it is not only about the cold start/idle issue; as I wrote in #529, I think it would be good to have some transparency about the scenarios that are targeted by Azure Functions. If I understand things correctly, the startup cost is not only incurred on a cold start of a single container, but also affects new instances when the dynamic plan wants to scale out. This means that if I have an HTTP-triggered API function and one instance of that function is already running (let's say an execution takes 100 ms), a new request could wait until the current execution finishes and then be served, leading to a maximum execution time of 200 ms (which is kind of fine). But instead, the dynamic plan could decide to scale out, and now we have the cold-start penalty again. This means that instead of a maximum execution time of 200 ms, this request will go above and beyond 5 seconds. Could you explain how you are handling these cases?
@peterjuras let's keep this issue focused on the potential cold-start regression that is causing cold starts of over 1 minute for some users. We can use a separate thread for a more general discussion of how the system works. For users running into these extreme cold starts, there are various questions whose answers would help the investigation.
So I did a long-running test again where I ping every 3 minutes; the response times are all under 1,000 ms unless one of the scheduled jobs fails to fire. Then, after the six-minute lapse, the response time goes to a whopping 150,000-280,000 ms. I will try reverting to 0.4 and see what happens.
@vance so you mean you are measuring HTTP functions, but the app happens to also have a scheduled function? What's the function's schedule? Where does the 'six minute lapse' come from?
I mean I have a timer-triggered function making an HTTP request to my other function. It is set to every 3 minutes. However, sometimes it does not fire, and six minutes elapse. The next time it runs, the execution time is extreme. Sometimes it's so slow that it creates a ripple effect, going from 280,000 ms to 140,000 ms to 30,000 ms or so before it finally returns to under 1,000 ms for the next 10-20 calls.
That's very odd behavior. If you can share your Function App name (either directly or indirectly), and the UTC time when you saw this, we can hopefully analyse the situation.
Sure, here's a dummy app in the same region: fscape-dynamic3. Any function endpoint shows the extreme time in the logs, I think. The screenshot is dated 05 Sep 2016 19:12:35 GMT, so it should be around that time.
So I reverted to v0.4, with similar results. Also, I don't know if it's related, but sometimes the "pinged" services even time out, or even issue a CORS error, while other requests work. I tried just having a local script run a query every 10 minutes; see the result in the second image. No server changes happened at this time.
I'm seeing similar symptoms. Although this might be an unrelated issue, in my case most of the time seems to be spent loading the azure-storage library. Any ideas whether this is normal, or how it can be fixed?

```js
context.log('requiring azure');
const azure = require('azure-storage');
context.log('required azure');
```
Actually, this might explain a lot of the delay. I think it may be multiplied if you hit a bunch of functions simultaneously when coming out of sleep; maybe it pegs the I/O for the VM. I do know that this dependency seems rather large... I have no idea how this virtualization is engineered, but I wonder if there's some way to snapshot these static resources as globals in a shared memory location that's immutable. Just a thought; I don't know if that's even possible.
Yeah, keeping it in memory might not be possible in some scenarios (at least not within the host process, which might restart, etc.). I think the main question, though, is why even fresh initialization takes so long. azure-storage indeed has tons of dependencies (I was shocked when I first saw it; way too many for a library that is effectively just a simple REST API wrapper). But when I run the same code locally (in a new node process), it executes instantly without any issues, which makes me think it could be an Azure runtime issue...
I wonder if it's the network or the storage used. Maybe when these instances come back from idle, the data must be loaded over the network, since they are not bound to any particular hardware (or VM)? I did not see these long startup times when hitting a Classic service plan app (with Always On turned off).
Good point @vance. Deleting an "average" node_modules folder on a dynamic plan takes well over 5 minutes in many cases. IOPS seems to be very low...
Is there any way to have the container manager, or whatever the app pool manager is, cache the resources on the container disk, so they at least come back to life without going over the network? ...at least for the first instance.
Okay, I'm having some issues with the Webpack technique and I'm wondering what other people are doing. I've mentioned this before, but I created a build script (roughly sketched after this comment) that goes through each function's index.js and extracts every require. These then go into a single, central JavaScript file, like @1N50MN14 mentioned earlier. The final file is webpacked and uglified, each index.js is set to include this file, and the references to each library are updated to point at the appropriate require. The problem I'm running into is that the final JavaScript file, while only 700 KB, is large enough that if I add it to a VERY simple test function, the cold start takes longer than 5 minutes. Since I'm on a Consumption plan, you can't set the timeout above 5 minutes. For example, something like the following takes over 5 minutes to run:
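The original snippet was not preserved; a hypothetical reconstruction of the kind of trivial test function described (the bundle path is assumed) would be:

```js
// A do-nothing HTTP function; the only unusual thing about it is the single
// require of the webpacked dependency bundle (path is hypothetical)
var deps = require('../lib/deps.bundle.js');

module.exports = function (context, req) {
    context.res = { body: 'hello' };
    context.done();
};
```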
Now, in the above example I don't even need the require, so it can be taken out, and if I drop the require it of course runs really quickly. I'm sure that if the function ever got cold-started, the warm function would have no issue with the 5 minutes. However, with the 5-minute timeout, it just dies before it can ever get warm. What I don't understand is why, since I don't think my dependency list is terribly long.
As a result, it seems like there should be some way to get this to work; in fact, if anyone is getting anything to run, I feel like I should be able to get my code to run.
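For illustration, a rough sketch of the kind of require-extracting build script described above (not the actual script; the folder layout and file names are assumptions):

```js
// Scan each function folder's index.js, collect the npm modules it requires,
// and emit one central deps file that can then be webpacked/uglified.
const fs = require('fs');
const path = require('path');

const root = __dirname;
// Matches require('module-name') but skips relative requires like './util'
const requireRe = /require\(\s*['"]([^'".][^'"]*)['"]\s*\)/g;

const modules = new Set();
for (const dir of fs.readdirSync(root)) {
    const indexPath = path.join(root, dir, 'index.js');
    if (!fs.existsSync(indexPath)) continue;
    const src = fs.readFileSync(indexPath, 'utf8');
    for (const match of src.matchAll(requireRe)) modules.add(match[1]);
}

const entries = [...modules].map(
    (name) => `  ${JSON.stringify(name)}: require(${JSON.stringify(name)}),`
);
fs.writeFileSync(
    path.join(root, 'deps.src.js'),
    `module.exports = {\n${entries.join('\n')}\n};\n`
);
console.log(`Collected ${modules.size} modules into deps.src.js`);
```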
@securityvoid 700 KB of uglified code sounds like a lot; are you sure you're doing it right? My entry file (index.src.js, as used in the build script below) looks like this:

```js
global.deps = {
    Promise: require('bluebird'),
    jwt: require('jsonwebtoken'),
    uuid: require('uuid')
}
```

Then I have a shell script (that would be a .bat on Windows, I guess?) that I use to bundle and minify, which looks like this:

```sh
rm index.bundle.min.js
rm index.bundle.js
npm install
browserify --node index.src.js > index.bundle.js
uglify -s index.bundle.js -o index.bundle.min.js
rm -rf ./node_modules
```

And in my function code I use the deps like so:

```js
require('../SDK/index.bundle.min')
var Promise = global.deps.Promise;
var uuid = global.deps.uuid;

module.exports = function(context)
{
    //code
}
```

Hope some of this helps ;)
@1N50MN14 I'll go back and double-check all the dependencies to be sure, but I think I have it all set up right. I can tell you that I'm gitignoring node_modules and that the code runs fine locally, so it should be right. Since I'm doing this as a deploy script, my code actually has the "normal" requires before the script runs and makes all the updates. Only the optimized code ever makes it into wwwroot (no node_modules).
@1N50MN14 I went back and printed out the JSON stats from Webpack to see everything that was being packed up. It looks like the ~700 KB is correct; a good 3/4 of it is associated with the jsonwebtoken package. That package does do a lot, so it sort of makes sense. @davidebbo Are there plans for how to handle this? I can effectively undo all the webpacking, but then I'm still back to the slow response times.
With my technique, I have around a 10-second cold start (which is equivalent to the average C# cold start, from my tests). Some of my files are 3 MB, which is huge, but I don't run uglify and it doesn't seem to matter. (It seems the cold-start issue with Azure Functions has to do with latency, not bandwidth: the number of file dependencies is the problem, and large files don't slow it down much.) Some observations:
Folder structure:
Process:
@ricklove Thanks for sharing. If you have 3 MB files, the 700 KB definitely isn't why my stuff is so slow. I'm going to try without uglifying and see if there is a difference. BTW, are you using the standard Consumption plan?
@securityvoid Yes, I'm using the Consumption plan. For your case, I suspect there is something in your deployments folder with many files that is getting loaded during cold start. Also, I just extended my cold-start timing tests so I can directly compare a JavaScript function with no dependencies against a 3 MB function that uses azure-storage. My test runs every 12 minutes, alternating which function it calls (from an external function app). So it's a good test from the perspective of a client: it measures the actual turnaround time from initial call to client receipt of the response (the monitor page of the function itself only shows ~2 seconds, while this test shows ~10 seconds). I'll give a report tomorrow.
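A client-side probe along these lines is easy to reproduce; a minimal sketch (the endpoint URLs are placeholders, not the commenter's):

```js
// Measure end-to-end response time of function endpoints from outside Azure,
// alternating between two endpoints on a fixed interval
const https = require('https');

const urls = [
    'https://myapp.azurewebsites.net/api/no-deps',
    'https://myapp.azurewebsites.net/api/with-azure-storage'
];
let i = 0;

function probe() {
    const url = urls[i++ % urls.length];
    const start = Date.now();
    https.get(url, (res) => {
        res.resume(); // drain the body so the socket is released
        res.on('end', () =>
            console.log(`${url} -> ${res.statusCode} in ${Date.now() - start} ms`));
    }).on('error', (err) => console.log(`${url} failed: ${err.message}`));
}

probe();
setInterval(probe, 12 * 60 * 1000); // every 12 minutes
```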
@ricklove I would think that if it were something generic in my deployments folder, the function would behave the same whether I had that require or not. The version with the require times out every time, while the version without it runs every time.
It doesn't make any sense for me to have issues with a 700 KB file when you have 3 MB files, but it seems like the file itself is somehow the issue. I suppose the alternative is that the file is just the last little bit that pushes it over the 5-minute mark. In any case, removing uglify did not really help.
@ricklove Argh! I think I just noticed my own typo. Notice that the file path in the require contains "repository"; that is the non-webpacked version of the file. The webpacked version is in wwwroot. Since the file existed, it didn't error out, and in fact it would work... given enough time. I'm redeploying now to confirm that this solves things.
@davidebbo @1N50MN14 I've made the Browserify technique work without the use of global. The pieces are a shared/library module, the function code file that consumes it, and the build/Browserify commands, roughly as sketched below.
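The original three snippets did not survive in this thread; a sketch of what a global-free Browserify setup along these lines could look like, with assumed file names:

```js
// SharedLib/index.src.js: export dependencies directly via module.exports
// instead of stashing them on `global`
module.exports = {
    Promise: require('bluebird'),
    uuid: require('uuid')
};
```

```sh
# --standalone wraps the bundle in UMD, so the bundle itself has a usable
# module.exports when required from a function
browserify --node --standalone deps SharedLib/index.src.js -o SharedLib/index.bundle.js
```

```js
// MyFunction/index.js: consume the bundle like any other module
var deps = require('../SharedLib/index.bundle.js');
var uuid = deps.uuid;

module.exports = function (context, req) {
    context.res = { body: { id: uuid.v4() } };
    context.done();
};
```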
@phillipharding thanks for sharing! As an aside, note that we've added support for automatically running funcpack on Kudu git deployments. To enable it, simply set the SCM_USE_FUNCPACK app setting to 1.
@phillipharding Awesome, thanks for sharing this! It seems that browserify's …
I've been running into this issue for some time now, and it's very frustrating, as I'm seeing 60-second startup times on a regular basis. I'll give azure-functions-pack a shot at some point, and I think the azure-functions-pack tool is a decent workaround for the short term, but is there anything on the roadmap to address the root cause (i.e. the underlying file system being prohibitively slow)? Specifically, I'm using the Slack Events API, and if Slack determines that a request to my API takes more than 5 seconds to complete (which happens regularly, whenever whatever orchestrates Azure Functions decides to spin down our API), the request is automatically retried over and over. This behavior, in combination with Functions' filesystem issues, means I can't really work with Slack's API without a hack that ignores "retry" messages based on a certain request header value, which isn't ideal for various reasons. Thanks
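For reference, the header-based hack mentioned is typically built on Slack's documented X-Slack-Retry-Num header; a minimal sketch of that workaround:

```js
// Acknowledge Slack retries immediately instead of re-running the handler.
// Slack marks redelivered events with the X-Slack-Retry-Num header; Azure
// Functions exposes request headers with lowercased names.
module.exports = function (context, req) {
    if (req.headers['x-slack-retry-num']) {
        // The original delivery is (or was) being handled; just ack this one
        context.res = { status: 200, body: '' };
        return context.done();
    }
    // ... normal event handling would go here ...
    context.res = { status: 200, body: '' };
    context.done();
};
```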
I'm having the exact same issue with the Shopify webhook API. It will retry if the request doesn't finish in 5 seconds. I usually get about 3-4 requests before one returns in time for it to stop trying.
Microsoft needs to fix this ASAP. Node.js runs fine on AWS Lambda. It's disappointing when you see articles like this: seven reasons why so many Node.js devs are excited about Microsoft Azure.
Forget the webpack flag; a huge number of modules won't compress properly with it. So basically there's still no solution, and it's a pretty useless product if you want serverless Node.
@tony-gutierrez agree - this is something I would have expected Azure Functions to launch with, not something that still isn't fixed a full year after the original issue was filed. 🤔
We started with Functions trying to use the Consumption plan; it wasn't even close to working. Then we ran the App Service plan for a while, and at least it worked. Then we realized we basically had none of the benefits of a serverless host, plus all the horrid things function apps incur (no access to console.log, super slow storage, not easy to develop locally, etc.). We did a quick conversion to a normal Node app and deployed it to a VM and to App Service. The perf increase is pretty major on both.
We used Webpack to bundle the npm module dependencies into a single file to reduce the cold-start time; see this Medium article.
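A minimal webpack configuration for this kind of bundling might look like the following (the entry and output names are assumptions, not taken from the article):

```js
// webpack.config.js: bundle a function and its node_modules into one file,
// trading many small file reads for a single larger one at cold start
const path = require('path');

module.exports = {
    target: 'node',                    // bundle for Node, keep built-ins external
    entry: './MyFunction/index.src.js',
    output: {
        path: path.resolve(__dirname, 'MyFunction'),
        filename: 'index.js',
        libraryTarget: 'commonjs2'     // preserve module.exports for the Functions host
    }
};
```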
See also the newly announced Run-From-Zip feature, which also helps with cold start.
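Enabling it amounts to a single app setting; for example, via the Azure CLI (the app and resource-group names are placeholders, and the setting was later renamed WEBSITE_RUN_FROM_PACKAGE):

```sh
# Serve the site content directly from an uploaded zip instead of thousands
# of individual files on the slow shared file system
az functionapp config appsettings set \
    --name my-function-app --resource-group my-group \
    --settings WEBSITE_RUN_FROM_ZIP=1
```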
Closing this issue, as Run-From-Zip is working well enough for most people (and some others are using funcpack). That doesn't mean there can't be more optimizations, but we can always track specific work separately; this issue is too open-ended.
Hi,
I've noticed that the initial call to an Azure Function (i.e. a call made a couple of minutes after the last one) takes a significant amount of time (up to 15 seconds). Is there any way to pre-warm a function other than calling it up front? Will the startup performance be improved in the near future? I'm using it on the Dynamic plan. Thanks!
Best regards,
--Oliver.