dagger v0.11.2 possible memory leak #7258
Comments
This actually sounds really similar to what we've been experiencing with our Dagger setup in K8s. Our team has been really vocal with the Dagger people and I believe they are on this. It's a bummer this is more widespread, but it gives me some solace that it's not just my K8s setup seeing this issue.
Gonna take a look at this to see if we can repro and squeeze in a fix for next release.
Can repro. Found one cause, which is a goroutine leak upstream; fix here: moby/buildkit#4902 (cc @jedevc). Re-running (reproing by executing the full module test suite) with that fix in place does fix the goroutine leak and helps the memory usage somewhat, but RSS is still flat at ~1GB after everything has disconnected, which is still too high.
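For context (this sketch is mine, not something from the thread): a goroutine leak like this is usually confirmed by watching the goroutine count and the goroutine pprof profile after all sessions have disconnected. A minimal, generic Go sketch of that kind of check, assuming the process under suspicion can expose net/http/pprof:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
	"time"
)

func main() {
	// Serve pprof so `go tool pprof http://localhost:6060/debug/pprof/goroutine`
	// can show exactly which stacks are accumulating.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Log the goroutine count periodically; a count that keeps climbing after
	// all clients have disconnected is the signature of a goroutine leak.
	for range time.Tick(30 * time.Second) {
		log.Printf("goroutines: %d", runtime.NumGoroutine())
	}
}
```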
It does seem that at least part of the remaining RSS is just due to Linux not returning memory pages until under memory pressure: golang/go#39779 (comment)
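To make that distinction concrete (again, my sketch, not something from the thread): whether a high RSS is a real leak or just memory Linux hasn't reclaimed yet can be probed by asking the Go runtime to hand pages back and watching whether RSS falls, for example:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// printMemStats reports how much heap the Go runtime is holding versus what it
// has already released back to the OS.
func printMemStats(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%s: HeapInuse=%dMiB HeapIdle=%dMiB HeapReleased=%dMiB\n",
		label, m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20)
}

func main() {
	printMemStats("before")
	// Force a GC and ask the runtime to return as much memory as possible to
	// the OS. If RSS drops sharply after this, the high resident size was
	// reclaimable memory the kernel hadn't taken back yet, not a true leak.
	debug.FreeOSMemory()
	printMemStats("after FreeOSMemory")
}
```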
That being said, looking at the heap usage after re-running tests repeatedly does show some usage creeping up (though pretty slowly). One heap snapshot:

[heap snapshot screenshot]
I think some of it may be from Go pools (e.g. the json literal store), which is fine since that should still be freeable when needed (afaik, worth confirming). But I am a bit confused why …

Overall, I suspect that the worst of the leak is fixed by the upstream PR mentioned in the previous comment, but there does still seem to be some stuff to look into.
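For readers who want to reproduce this kind of analysis, here is a generic sketch (not how the maintainers captured their snapshot) of dumping a heap profile after each test run so that successive profiles can be diffed:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps a heap snapshot to a file. Taking one after each
// test-suite run and diffing them, e.g. with
// `go tool pprof -diff_base run1.pb.gz run2.pb.gz`, separates allocation sites
// that are genuinely growing from pooled memory that is merely being reused.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // get up-to-date allocation statistics before snapshotting
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := writeHeapProfile("heap-after-run.pb.gz"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```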
The worst parts of this look like they should ideally be solved by #7295. I'm going to take this out of the milestone (since we've now merged that PR), but leave it open at least until the upstream fix has merged.
Awesome @sipsma
Aha thanks for bumping this @kjuulh - the upstream fix for this is merged now, so closing this 🎉 If you find any more performance regressions like this, we can re-open / open a new issue, whatever works! Thanks!
What is the issue?
We've seen that since bumping to 0.11+, some of our containers are facing out-of-memory issues. Currently our dagger engines sit at around 7GB of memory, whereas before 0.11 they were at around 2GB. Memory continues to climb until Kubernetes chooses to preempt the pods.
We initially saw it because quite a few bare-bones debian:12.5-slim execs running apt-get update + apt-get install -y ca-certificates exited with code 137, which, from what I could gather, is what happens when buildkit sends a SIGKILL to the process because it runs out of memory. As 0.10 -> 0.11 was quite a big change, it is possibly quite difficult to find the regression.

apt-get is also cut off at varying points in its run: it isn't always at the start, it can be during the "Reading package lists" step, or when it is about to finish. It isn't entirely consistent. I am also quite unsure why buildkit only kills this exact process.
We get a few every hour.
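For illustration, a pipeline along these lines exercises the kind of exec being described; this is a sketch using the Dagger Go SDK, not the reporter's actual CI code:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

func run(ctx context.Context) error {
	client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		return err
	}
	defer client.Close()

	// The kind of bare-bones exec that intermittently came back with exit
	// code 137 (the process inside the engine getting OOM-killed).
	_, err = client.Container().
		From("debian:12.5-slim").
		WithExec([]string{"apt-get", "update"}).
		WithExec([]string{"apt-get", "install", "-y", "ca-certificates"}).
		Sync(ctx)
	return err
}

func main() {
	if err := run(context.Background()); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```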
Dagger version
dagger v0.11.2
Steps to reproduce
We simply let our CI system run for a few hours (probably around 100-200 builds of our golang build pipeline).
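A rough local approximation of that repro, under the assumption that repeated connect/build/disconnect cycles against one long-lived engine are what drive the growth (the buildOnce helper below is hypothetical, not from the reporter's pipeline):

```go
package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

// buildOnce connects a fresh client, runs a trivial pipeline, and disconnects,
// mimicking one CI build against a shared long-lived engine.
func buildOnce(ctx context.Context) error {
	client, err := dagger.Connect(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	_, err = client.Container().
		From("debian:12.5-slim").
		WithExec([]string{"true"}).
		Sync(ctx)
	return err
}

func main() {
	ctx := context.Background()
	// Roughly the 100-200 builds mentioned above; watch the dagger-engine
	// container's RSS (`kubectl top pod` / `docker stats`) between iterations
	// to see whether memory keeps climbing after each session ends.
	for i := 0; i < 200; i++ {
		if err := buildOnce(ctx); err != nil {
			fmt.Fprintf(os.Stderr, "build %d failed: %v\n", i, err)
			os.Exit(1)
		}
	}
}
```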
Log output